Google SEO News and Discussion Forum
Broken and truncated URLs in SERPS creating 404 nightmare
Marfola
 10:35 am on Oct 11, 2011 (gmt 0)

Webmaster Tools is showing hundreds of broken and truncated incoming links from SERPs and custom SERPs (i.e. Google Custom Search) results, such as
mysite.com/suburl1/suburl2/index.ht
mysite.com/suburl1/suburl2/....
mysite.com/suburl/…
mysite.com/…
(this last shown as mysite.com/…./suburl2/index.html in Google's SERP)

Googlebot is following the truncated or otherwise abbreviated-for-space lines (most frequently in <cite>, <div> and <span> tags) that appear at the end of the search result snippet and reading them as bad URLs.

The problem, first detected in August, is ongoing.

This problem is polluting the crawl errors report to such a degree that it's near impossible to sort through the rubbish.

A few folks have suggested implementing a 301 redirect. I don't think this is the right solution. Not only is it costly (for many it would cause a database lookup), it would require action by the collective community of webmasters. What's more, for large websites, a 301 can't fix mysite.com/…./suburl2/index.html, the abbreviated format used by Google!

Any ideas on how we can get google to fix this problem?

 

aristotle
 1:14 pm on Oct 11, 2011 (gmt 0)

Are you sure that Google SERPs were the original source of these bad links (with truncated URLs)? The examples I've seen originated on social networking sites, and that is where Google found them.

Marfola
 2:24 pm on Oct 11, 2011 (gmt 0)

Hi aristotle,

All of our bad links with truncated or abbreviated URLs are coming from SERPs, including custom search facilities such as Google Custom Search (yes, they are actually trying to follow their own abbreviated URLs!) and the Ask equivalent.

From what I've read in other threads the problem is analogous to the examples you've seen with social networking sites, i.e. Googlebot is following the truncated or otherwise abbreviated-for-space lines (most frequently in <cite>, <div> and <span> tags) that appear as a reference but not as a link and reading them as bad URLs.

aristotle
 3:26 pm on Oct 11, 2011 (gmt 0)

There is a type of auto-generated spam, in which webpages are created from scraped copies of Google SERPs. Is it possible that this is what is happening?

tedster
 6:35 pm on Oct 11, 2011 (gmt 0)

If the URL actually returns a 404 status in the HTTP header, and it's supposed to be 404 - then why is this a "nightmare"?

rlange
 7:34 pm on Oct 11, 2011 (gmt 0)

tedster wrote:
If the URL actually returns a 404 status in the HTTP header, and it's supposed to be 404 - then why is this a "nightmare"?

If the information is meaningful, then it needs to be acted upon. If it's just a meaningless side-effect of Google's automated experimentation, then it shouldn't even be in Webmaster Tools reports. It lowers the signal-to-noise ratio, potentially to the point of uselessness.

Edit: I suppose that problem could be solved if WMT separated internal linking errors from external linking errors. As it is now, though, you have to wade through plenty of errors outside of your control to locate errors that are within your control.

--
Ryan

aristotle
 8:29 pm on Oct 11, 2011 (gmt 0)

If the URL actually returns a 404 status in the HTTP header, and it's supposed to be 404 - then why is this a "nightmare"?



tedster -- If Marfola is correct when he says that the links to his pages in the Google SERPs have truncated URLs, then people who try to click through to his pages from the SERPs will get 404s. That would be a nightmare.

However, I'm still not convinced that Google SERPs are the original source of these bad links. That's why I asked him about auto-generated spam pages that use Google's SERPs for their content.

deadsea
 10:23 pm on Oct 11, 2011 (gmt 0)

I see a lot of this in my logs too. It appears that google is trying to crawl non-linked text that looks like it might be a url. That probably isn't an unreasonable assumption some of the time, but it leads to 404 pages being crawled.

I don't think Google views this as a problem worth fixing. I don't think it will cause any problems for your rankings to return 404s for these.

If it really bothers you (and it does me) then I would recommend the redirect route. I base mine on heuristics that don't require db lookups. It often means that I redirect to a 404, but I can live with that. I have redirect rules on my server that do the following:

Remove any unrecognized characters from the path and redirect. I only allow [a-zA-Z0-9\-\_\.\/] in my file names.

Redirect URLs that end in .h, .ht or .htm to end with .html

Redirect URLs with multiple consecutive slashes to a single-slash version (e.g. //web//foo.html to /web/foo.html)

Redirect URLs with directory-navigation dots in them to the correct location (e.g. /web/../foo.html to /foo.html)

Redirect away any junk after .html (e.g. /foo.htmltexthere to /foo.html)

That takes care of the majority of the badly formed URLs that get requested (a sketch of these rules follows below). I still get the occasional truncation. I put in a redirect for them on a case-by-case basis if they get enough requests.
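For anyone who wants to try the same approach, here is a minimal .htaccess sketch of three of those rules, assuming Apache with mod_rewrite and a site that serves only .html files. The character-stripping and dot-navigation rules need more machinery (an iterated rule or a script), so they are left out; treat this as a starting point, not a drop-in:

RewriteEngine On

# Collapse runs of slashes, e.g. //web//foo.html to /web/foo.html.
# In per-directory (.htaccess) context the pattern input normally
# arrives with the slashes already merged, so THE_REQUEST is checked
# to detect the original malformed request.
RewriteCond %{THE_REQUEST} \s[^?\s]*//
RewriteRule ^(.*)$ /$1 [R=301,L]

# Complete truncated extensions: .h, .ht or .htm becomes .html
RewriteRule ^(.+)\.(h|ht|htm)$ /$1.html [R=301,L]

# Strip junk after .html, e.g. /foo.htmltexthere to /foo.html
RewriteRule ^(.+\.html)[^/]+$ /$1 [R=301,L]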


g1smd
 11:09 pm on Oct 11, 2011 (gmt 0)

If Marfola is correct when he says that the links to his pages in the Google SERPs have truncated URLs, then people who try to click through to his pages from the SERPs will get 404s. That would be a nightmare.

In many, perhaps most cases, there are actually no a href links with the duff format. The truncated formats are seen in plain text or in anchor text and for whatever reason Google is now keeping note of each of these and requesting them from the server. You'll never see a real visitor requesting these malformed URLs, only Googlebot.


It often means that I redirect to a 404

Never redirect to a 404. The 404 status must be returned at the originally requested URL.

I only allow [a-zA-Z0-9\-\_\.\/] in my file names.

You have way too much escaping, use [a-zA-Z0-9/.-_] or, better still use [a-z0-9/.-_] with the [NC] flag.
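As an illustration, a minimal sketch that answers a known-bad truncated request with a 404 at the requested URL instead of redirecting it anywhere (the R flag with a non-3xx status needs a reasonably recent Apache; the example path is hypothetical):

# Serve the 404 in place; no redirect is issued.
RewriteRule ^suburl1/suburl2/index\.ht$ - [R=404,L]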

deadsea
 11:18 pm on Oct 11, 2011 (gmt 0)

Never redirect to a 404. The 404 status must be returned at the originally requested URL.


I don't see why not. It's often too much work/code to check if the URL exists before issuing the redirect.

You have way too much escaping, use [a-zA-Z0-9/.-_] or, better still use [a-z0-9/.-_] with the [NC] flag.


Any literal character except a-zA-Z0-9 should be escaped in regular expressions. While other characters may work fine unescaped today, they are all reserved for future use as special characters. I'd like my regex to be future-compatible.

g1smd
 11:36 pm on Oct 11, 2011 (gmt 0)

Any literal character except a-zA-Z0-9 should be escaped in regular expressions.

Not so. Only a few characters need to be escaped in RegEx. You're maybe thinking of Javascript or something.

Additionally, the rules are different inside a "character group".
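A commented sketch of the difference (the rules themselves do nothing; they are only there to show the patterns):

# Outside a class, metacharacters such as . ? * + ( ) [ ] must be
# escaped to be literal:
RewriteRule ^index\.html$ - [L]

# Inside a class, . and / are already literal; only ] \ ^ and a
# mid-position - need care. A mid-position hyphen denotes a range,
# so in [a-z0-9/.-_] the ".-_" part covers far more than those
# three characters; put the hyphen last to keep it literal:
RewriteRule ^[a-z0-9/._-]+$ - [L]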

lucy24
 1:08 am on Oct 12, 2011 (gmt 0)

Any literal character except a-zA-Z0-9 should be escaped in regular expressions.

Where on earth did you hear that? In the specific context of rewrites or redirects, you then get the opposite risk: that the \ will be read as the literal backslash character, thereby breaking your whole rewrite.

Besides, you don't have . and / in your file names. The directory delimiter / is not part of the name. And . should never occur except in filename extensions. (Also in your domain name, but that's generally not a rewrite concern.) Putting them into a one-size-fits-all group is just asking for trouble.

smallcompany
 4:47 am on Oct 12, 2011 (gmt 0)

I see this as a problem, too.

- I find it annoying as it's showing in WMT under web crawl errors, while these are not classic errors like when a bad link has been put up somewhere.
- The pages that host the links actually link correctly, but Google picks them up incorrectly.

The pages are on sites that build content automatically by sweeping through the results of other search engines. In the example that I just looked into, the results seem to come from MSN (Bing).

What happens is that there's a title that links correctly, there's some text, and then there's a display URL which is text only. That display URL gets truncated if it's too long, and Google picks it up as a URL. That's a mistake on Google's side.

Just a few days ago I manually entered 65 of those into the .htaccess of one of my sites just to see them gone, as I would like to have clear space in that section and be able to notice a real crawl problem if it ever arises. Otherwise, I cannot read through all of them every time.
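For anyone doing the same, anchored RedirectMatch entries are the safest form for one-offs, since plain Redirect does prefix matching and would also catch the valid URL (and append the leftover characters to the target). The paths here are hypothetical:

# One line per bad URL seen in the crawl errors report.
RedirectMatch 301 ^/suburl1/suburl2/index\.ht$ /suburl1/suburl2/index.html
RedirectMatch 301 ^/suburl1/suburl2/page\.htm$ /suburl1/suburl2/page.html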

Marfola
 7:44 am on Oct 12, 2011 (gmt 0)

There is a type of auto-generated spam, in which webpages are created from scraped copies of Google SERPs. Is it possible that this is what is happening?

Most definitely.

I don't think it will cause any problems for your rankings to return 404s for these.

I don't think so either, which is the main reason I don't want to create a 301 for a URL which should return a 404.

You'll never see a real visitor requesting these malformed URLs, only Googlebot.

Couldn't agree more.

If the URL actually returns a 404 status in the HTTP header, and it's supposed to be 404 - then why is this a "nightmare"?

Because there are hundreds of these in my crawl errors report, wading through this junk to find 'real' crawl errors now takes significantly longer. Multiplied across the webmaster community, that's a lot of time wasted on a 'small' problem for Google, and, as rlange points out, because the noise level is so high it reduces the value of the report.

If the bad links from social networking sites are created in the same way, i.e. truncated URLs seen in plain or anchor text, there's an even greater need for Google to fix the problem, or for the HTML5 community to come up with an appropriate tag for referenced URLs.

Hissingsid
 2:34 pm on Oct 12, 2011 (gmt 0)

I can see that this might be seen at Google as a way of ensuring that sites don't try to manipulate results by not giving a page the benefit of a link. If the page is useful enough to be referred to in text, then it ought to have a hyperlink to it. The problem is where, in Ask for example, they do have a hyperlink, but the designers have also included a truncated visible-text rendition of the URL with no anchor.

I've spent time over the last couple of days adding these as 301 redirects in my .htaccess files. This may be a coincidence, but one site that lost its sitelinks in SERPs a couple of weeks ago has regained them this morning. If it isn't a coincidence, perhaps it is something to do with trust.

Cheers

Sid

Marfola
 11:32 am on Oct 25, 2011 (gmt 0)

Webmaster Tools has again dumped more than a hundred of the same broken and truncated URLs into the crawl errors report.

These incoming links are from auto-generated spam, webpages created from scraped copies of Google SERPs.

Is it really too much to ask Google to refrain from reporting truncated or otherwise abbreviated URLs seen in plain text (there are no a href links in any of these)?

I prefer not to implement 301 redirects for the following reasons:
I don't want credit for back links from auto-generated spam.
The URLs are bad and hence should return a 404.

pageoneresults
 12:09 pm on Oct 25, 2011 (gmt 0)

These incoming links are from auto-generated spam, webpages created from scraped copies of Google SERPs.


It's a real pain in the arse! I don't like seeing them in GWT and we do what we can to redirect those that may have some value. Frackin scrapers suck! Any URI that is truncated visually is going to get scraped exactly as you see it. I just started seeing this in the past few months. It wasn't happening before.

Andem
 12:10 pm on Oct 25, 2011 (gmt 0)

I've also been seeing these links for literally thousands of non-existent pages. This morning I noticed hundreds more, but not just from your *typical* scraper. There are plenty of them from google.com/m/search!

As a side note: if you click through one of those links from mobile search results, all of your page content is displayed with the ads stripped. Did anybody authorize their entire content to be copied and served by Google?

Joshmc
 5:57 pm on Oct 25, 2011 (gmt 0)

Is it OK to return a 400 for these instead of a 404?

dstiles
 10:05 pm on Oct 25, 2011 (gmt 0)

I wonder how far the rot has set in.

I had a very rare visit from the Google Chrome instant browser today, about half a dozen hits from the same IP. On each successive hit the browser tried a bit more of a query string - goo, good+ba, etc. - getting a 404 rejection each time. There was no referer at all.

tedster
 10:23 pm on Oct 25, 2011 (gmt 0)

Is it ok to return 400 for these instead of 404?

I have no experience on that front, but it sounds like a good idea technically. It sends the clearest message, IMO.

Joshmc
 11:13 pm on Oct 25, 2011 (gmt 0)

Thanks tedster. I am going to stick with it; that's what I was thinking as well.

Marfola
 12:25 pm on Oct 26, 2011 (gmt 0)

Good idea Joshmc. I've made the change; URLs with malformed syntax now return a 400 instead of a 404. I'm curious to learn how this will impact both Googlebot and GWT.

One question: should 400 errors return a custom error page or a standard error page?
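For anyone curious what that change looks like, a sketch of one way to do it with mod_rewrite, assuming a reasonably recent Apache (where the R flag accepts non-3xx codes) and using the allowed character set discussed earlier as a hypothetical whitelist; extend the class for any legitimate characters, such as %-escapes, that your URLs use:

# Answer paths containing characters outside the allowed set with
# 400 Bad Request instead of 404.
RewriteCond %{REQUEST_URI} [^a-zA-Z0-9/._-]
RewriteRule ^ - [R=400,L]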

Bill_H
 5:40 pm on Oct 26, 2011 (gmt 0)

I am seeing hundreds of truncated URLs in Webmaster Tools as well. They all link back to quite poorly done scraper sites: reference.com, qybrd.com, ask.reference.com, et al. In the last few days Google is starting to show the truncated links "Linked From" as "unavailable". Perhaps Google is learning that the truncated URLs are actually wrong.

I sure hope so, as we don't need a hit from Google in the SERPs due to these jerks with the scraper sites.

Cheers,
Bill

lucy24
 9:29 pm on Oct 26, 2011 (gmt 0)

One question, should 400 errors return a custom error page or standard error page?

Custom error pages are for humans. So you only need a 400 page if the phony links are making it as far as the SERPs, and people are really clicking on them.

Resist the temptation to have the page say "You got here because g### can't tell an URL from a hole in the ground" ;) Easiest approach is to send them to the same page that 404s get. I've never had a human get a 400, but I do it with 410s.
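Concretely, that can be a matter of pointing the 400 (and 410) handlers at whatever page already serves the 404s; /errors/404.html is a hypothetical path:

# The original status code is still returned; only the body is shared.
# Use a local path: a full http:// URL would turn the error into a
# 302 redirect.
ErrorDocument 400 /errors/404.html
ErrorDocument 404 /errors/404.html
ErrorDocument 410 /errors/404.html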

Marfola
 11:49 am on Oct 28, 2011 (gmt 0)

Thanks lucy24.

In the last few days Google is starting to show the truncated links "Linked From" as "unavailable"

Our report is also now showing 'unavailable' first thing in the morning. Unfortunately, it later updates with a URL.
