

Broken and truncated URLs in SERPs creating 404 nightmare

     
10:35 am on Oct 11, 2011 (gmt 0)

5+ Year Member



Webmaster Tools is showing hundreds of broken and truncated incoming links from SERPs and custom SERPs (i.e. google custom search) results, such as
mysite.com/suburl1/suburl2/index.ht
mysite.com/suburl1/suburl2/....
mysite.com/suburl/….
mysite.com/…

(this last shown as mysite.com/…./suburl2/index.html in google's SERP)

Googlebot is following the truncated or otherwise abbreviated-for-space lines (most frequently used tags <cite>, <div> and <span>) that appear at the end of the search result snippet and reading them as bad URLs.

The problem, first detected in August, is ongoing.

This problem is polluting the crawl errors report to such a degree that it's near impossible to sort through the rubbish.

A few folks have suggested implementing a 301 redirect. I don't think this is the right solution. Not only is it costly (for many it would cause a database lookup), it would require action by the collective community of webmasters. What's more, for large websites, a 301 can't fix mysite.com/…./suburl2/index.html, the abbreviated format used by Google!

Any ideas on how we can get google to fix this problem?
1:14 pm on Oct 11, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Are you sure that Google SERPs were the original source of these bad links (with truncated URLs)? The examples I've seen originated on social networking sites, and that is where Google found them.
2:24 pm on Oct 11, 2011 (gmt 0)

5+ Year Member



Hi aristotle,

All of our bad links with truncated or abbreviated urls are coming from SERPs, including custom search facilities such as google custom search (Yes, they are actually trying to follow their own abbreviated urls!) and the Ask equivalent.

From what I've read in other threads, the problem is analogous to the examples you've seen with social networking sites, i.e. Googlebot is following the truncated or otherwise abbreviated-for-space lines (most frequently used tags <cite>, <div> and <span>) that appear as a reference but not as a link, and reading them as bad URLs.
3:26 pm on Oct 11, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



There is a type of auto-generated spam, in which webpages are created from scraped copies of Google SERPs. Is it possible that this is what is happening?
6:35 pm on Oct 11, 2011 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



If the URL actually returns a 404 status in the HTTP header, and it's supposed to be 404 - then why is this a "nightmare"?
7:34 pm on Oct 11, 2011 (gmt 0)



tedster wrote:
If the URL actually returns a 404 status in the HTTP header, and it's supposed to be 404 - then why is this a "nightmare"?

If the information is meaningful, then it needs to be acted upon. If it's just a meaningless side-effect of Google's automated experimentation, then it shouldn't even be in Webmaster Tools reports. It lowers the signal-to-noise ratio, potentially to the point of uselessness.

Edit: I suppose that problem could be solved if WMT separated internal linking errors from external linking errors. As it is now, though, you have to wade through plenty of errors outside of your control to locate errors that are within your control.

--
Ryan
8:29 pm on Oct 11, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



If the URL actually returns a 404 status in the HTTP header, and it's supposed to be 404 - then why is this a "nightmare"?



tedster -- If Marfola is correct when he says that the links to his pages in the Google SERPs have truncated URLs, then people who try to click through to his pages from the SERPs will get 404s. That would be a nightmare.

However, I'm still not convinced that Google SERPs are the original source of these bad links. That's why I asked him about auto-generated spam pages that use Google's SERPs for their content.
10:23 pm on Oct 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I see a lot of this in my logs too. It appears that google is trying to crawl non-linked text that looks like it might be a url. That probably isn't an unreasonable assumption some of the time, but it leads to 404 pages being crawled.

I don't think Google views this as a problem worth fixing. I don't think it will cause any problems for your rankings to return 404s for these.

If it really bothers you (and it does me) then I would recommend the redirect route. I base mine on heuristics that don't require db lookups. It often means that I redirect to a 404, but I can live with that. I have redirect rules on my server that do the following:

Remove any unrecognized characters from the path and redirect. I only allow [a-zA-Z0-9\-\_\.\/] in my file names.

Redirect urls that end in .h, .ht or .htm to end with .html

Redirect urls with multiple consecutive slashes to a single slash version (eg //web//foo.html to /web/foo.html)

Redirect urls with directory navigation dots in the url to the correct location (eg /web/../foo.html to /foo.html)

Redirect away from any junk after .html (eg /foo.htmltexthere to /foo.html)

That takes care of the majority of the badly formed urls that get requested. I still get the occasional truncation. I put in a redirect for them on a case by case basis if they get enough requests.
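
For illustration, here is a minimal .htaccess sketch of rules along those lines, assuming Apache mod_rewrite. The patterns and the allowed character set are examples, not the poster's actual configuration:

RewriteEngine On

# Collapse runs of slashes (//web//foo.html to /web/foo.html).
# In per-directory context the matched path is already collapsed,
# so redirecting to it yields the single-slash URL.
RewriteCond %{THE_REQUEST} \s[^?\s]*//
RewriteRule ^(.*)$ /$1 [R=301,L]

# Truncated extensions: .h, .ht or .htm to .html
RewriteRule ^(.+)\.(h|ht|htm)$ /$1.html [R=301,L]

# Junk appended after .html (/foo.htmltexthere to /foo.html)
RewriteRule ^(.+\.html).+$ /$1 [R=301,L]

# Strip one disallowed character per request; repeated 301s
# clean the URL a character at a time.
RewriteRule ^(.*)[^a-zA-Z0-9/._-](.*)$ /$1$2 [R=301,L]

# Note: /web/../foo.html is normally resolved by Apache itself
# before per-directory rules run, so no rule is needed for it.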
11:09 pm on Oct 11, 2011 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



If Marfola is correct when he says that the links to his pages in the Google SERPs have truncated URLs, then people who try to click through to his pages from the SERPs will get 404s. That would be a nightmare.

In many, perhaps most cases, there are actually no <a href> links with the duff format. The truncated formats are seen in plain text or in anchor text, and for whatever reason Google is now keeping note of each of these and requesting them from the server. You'll never see a real visitor requesting these malformed URLs, only Googlebot.


It often means that I redirect to a 404

Never redirect to a 404. The 404 status must be returned at the originally requested URL.

I only allow [a-zA-Z0-9\-\_\.\/] in my file names.

You have way too much escaping. Use [a-zA-Z0-9/._-] or, better still, [a-z0-9/._-] with the [NC] flag.
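
A sketch of what checking first could look like in .htaccess, assuming Apache mod_rewrite (the extension fix-up is borrowed from the list above; the rule is illustrative): only issue the redirect when the corrected target exists, and otherwise fall through so the server returns 404 at the requested URL.

# Redirect index.ht to index.html only if the target file exists;
# otherwise fall through and 404 at the originally requested URL.
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^(.+)\.(h|ht|htm)$ /$1.html [R=301,L]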
11:18 pm on Oct 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Never redirect to a 404. The 404 status must be returned at the originally requested URL.


I don't see why not. It's often too much work/code to check if the url exists before issuing the redirect.

You have way too much escaping. Use [a-zA-Z0-9/._-] or, better still, [a-z0-9/._-] with the [NC] flag.


Any literal character except a-zA-Z0-9 should be escaped in regular expressions. While other characters may work fine unescaped today, they are all reserved for future use as special characters. I'd like my regex to be future-compatible.
11:36 pm on Oct 11, 2011 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Any literal character except a-zA-Z0-9 should be escaped in regular expressions.

Not so. Only a few characters need to be escaped in RegEx. You're maybe thinking of Javascript or something.

Additionally, the rules are different inside a "character group", as the classes above show.
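
A short illustration of that difference, in the same mod_rewrite syntax used above (the pass-through rule itself is hypothetical):

# Inside a character class, most metacharacters are literal,
# and a hyphen placed first or last is literal too.
# These two classes therefore match exactly the same characters:
#   [a-zA-Z0-9\-\_\.\/]   over-escaped, but harmless
#   [a-zA-Z0-9/._-]       unescaped equivalent
RewriteRule ^[a-z0-9/._-]+$ - [NC,L]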
1:08 am on Oct 12, 2011 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Any literal character except a-zA-Z0-9 should be escaped in regular expressions.

Where on earth did you hear that? In the specific context of rewrites or redirects, you then get the opposite risk: that the \ will be read as the literal backslash character, thereby breaking your whole rewrite.

Besides, you don't have . and / in your file names. The directory delimiter / is not part of the name. And . should never occur except in filename extensions. (Also in your domain name, but that's generally not a rewrite concern.) Putting them into a one-size-fits-all group is just asking for trouble.
4:47 am on Oct 12, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



I see this as a problem, too.

- I find it annoying, as it shows up in WMT under web crawl errors even though these are not classic errors like a bad link that someone has put up somewhere.
- The pages that host the links actually link correctly, but Google picks them up incorrectly.

The pages are on the sites that build content automatically by sweeping through the results of other search engines. In the example that I just looked into the results seem to come from MSN (Bing).

What happens is that there's a title that links correctly, there's some text, and then there's a display URL which is text only. That display URL gets truncated if it's too long, and Google picks it up as a URL. That's a mistake on Google's side.

Just a few days ago I manually entered 65 of those into the .htaccess of one of my sites, just to see them gone, as I would like to have clear space in that section and be able to notice a real crawl problem if one ever arises. Otherwise, I cannot read through all of them every time.
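
For reference, case-by-case entries like that can be written with mod_alias RedirectMatch, anchored so they don't prefix-match longer URLs (the paths here are hypothetical, not the poster's actual entries):

# One-off 301s for specific truncated URLs
RedirectMatch 301 ^/suburl1/suburl2/index\.ht$ /suburl1/suburl2/index.html
RedirectMatch 301 ^/suburl1/subu$ /suburl1/suburl2/index.html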
7:44 am on Oct 12, 2011 (gmt 0)

5+ Year Member



There is a type of auto-generated spam, in which webpages are created from scraped copies of Google SERPs. Is it possible that this is what is happening?

Most definitely.

I don't think it will cause any problems for your rankings to return 404s for these.

I don't think so either, which is the main reason I don't want to create a 301 for a bad URL which should return a 404.

You'll never see a real visitor requesting these malformed URLs, only Googlebot.

Couldn't agree more.

If the URL actually returns a 404 status in the HTTP header, and it's supposed to be 404 - then why is this a "nightmare"?

Because there are hundreds of these in my crawl errors report, wading through this junk to find 'real' crawl errors now takes significantly longer. Multiplied across the webmaster community, that's a lot of time wasted on a 'small' problem for google, and as rlange points out, because the noise level is so high it reduces the value of the report.

If the bad links from social networking sites are created in the same way, i.e. truncated urls seen in plain or anchor text, there's an even greater need for google to fix the problem or for the html5 community to come up with an appropriate tag for referenced urls.
2:34 pm on Oct 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I can see that Google might view this as a way of ensuring that sites don't manipulate results by referring to a page without giving it the benefit of a link. If a page is useful enough to be referred to in text, then it ought to have a hyperlink to it. The problem is that in Ask, for example, they do have a hyperlink, but the designers have also included a truncated visible text rendition of the URL with no anchor.

I've spent some time over the last couple of days adding these as 301 redirects in my .htaccess files. This may be a coincidence, but one site that lost its sitelinks in the SERPs a couple of weeks ago regained them this morning. If it isn't a coincidence, perhaps it is something to do with trust.

Cheers

Sid
11:32 am on Oct 25, 2011 (gmt 0)

5+ Year Member



Webmaster Tools has again dumped more than a hundred of the same broken and truncated URLs into the crawl errors report.

These incoming links are from auto-generated spam, webpages created from scraped copies of Google SERPs.

Is it really too much to ask google to refrain from reporting truncated or otherwise abbreviated urls seen in plain text (there are no a href links in any of these)?

I prefer not to implement 301 redirects for the following reasons:
I don't want credit for backlinks from auto-generated spam.
The urls are bad and hence should return a 404.
12:09 pm on Oct 25, 2011 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member



These incoming links are from auto-generated spam, webpages created from scraped copies of Google SERPs.


It's a real pain in the arse! I don't like seeing them in GWT and we do what we can to redirect those that may have some value. Frackin scrapers suck! Any URI that is truncated visually is going to get scraped exactly as you see it. I just started seeing this in the past few months. It wasn't happening before.
12:10 pm on Oct 25, 2011 (gmt 0)

10+ Year Member Top Contributors Of The Month



I've also been seeing these links for literally thousands of non-existent pages. This morning I noticed hundreds more, but not just from your *typical* scraper. There are plenty of them from google.com/m/search!

As a side note: if you click through one of those links from mobile search results, all of your page content is displayed with the ads stripped. Did anybody authorize their entire content to be copied and served by Google?
5:57 pm on Oct 25, 2011 (gmt 0)



Is it ok to return 400 for these instead of 404?
10:05 pm on Oct 25, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I wonder how far that rot has set in.

I had a very rare visit from the google chrome instant browser today, about half a dozen hits from the same IP. On each successive hit the browser tried a bit more of a query string - goo, good+ba, etc. - getting a 404 rejection each time. There was no referer at all.
10:23 pm on Oct 25, 2011 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Is it ok to return 400 for these instead of 404?

I have no experience on that front, but it sounds like a good idea technically. It sends the clearest message, IMO.
11:13 pm on Oct 25, 2011 (gmt 0)



Thanks tedster, I am going to stick with it; that's what I was thinking as well.
12:25 pm on Oct 26, 2011 (gmt 0)

5+ Year Member



Good idea Joshmc. I've made the change; urls with malformed syntax now return a 400, not a 404. I'm curious to learn how this will impact both googlebot and GWT.

One question, should 400 errors return a custom error page or standard error page?
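
A minimal sketch of one way to return 400 for malformed urls in .htaccess, assuming Apache mod_rewrite (the character whitelist is an assumption, not the poster's actual rule):

# Return 400 Bad Request for paths containing characters that
# never appear in a valid URL on this site.
RewriteRule [^a-zA-Z0-9/._-] - [R=400,L]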
5:40 pm on Oct 26, 2011 (gmt 0)

5+ Year Member



I am seeing hundreds of truncated urls in webmaster tools as well. They all link back to quite poorly done scraper sites: reference.com, qybrd.com, ask.reference.com, et al. In the last few days Google is starting to show the truncated links "Linked From" as "unavailable". Perhaps Google is learning that the truncated urls are actually wrong.

I sure hope so, as we don't need a hit from Google in the serps due to these jerks with the scraper sites.

Cheers,
Bill
9:29 pm on Oct 26, 2011 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



One question, should 400 errors return a custom error page or standard error page?

Custom error pages are for humans. So you only need a 400 page if the phony links are making it as far as the SERPs, and people are really clicking on them.

Resist the temptation to have the page say "You got here because g### can't tell an URL from a hole in the ground" ;) Easiest approach is to send them to the same page that 404s get. I've never had a human get a 400, but I do it with 410s.
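
In Apache terms that can be a pair of ErrorDocument directives pointing at the same page (the path is hypothetical):

# Serve the same custom page for 400s that 404s already get
ErrorDocument 404 /errors/not-found.html
ErrorDocument 400 /errors/not-found.html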
11:49 am on Oct 28, 2011 (gmt 0)

5+ Year Member



Thanks lucy24.

In the last few days Google is starting to show the truncated links "Linked From" as "unavailable"

Our report is also now showing 'unavailable' first thing in the morning. Unfortunately, it later updates with a url.
 
