aristotle

msg:4373218 | 1:14 pm on Oct 11, 2011 (gmt 0) |
Are you sure that Google SERPs were the original source of these bad links (with truncated URLs)? The examples I've seen originated on social networking sites, and that is where Google found them.
|
Marfola

msg:4373245 | 2:24 pm on Oct 11, 2011 (gmt 0) |
Hi aristotle, all of our bad links with truncated or abbreviated URLs are coming from SERPs, including custom search facilities such as Google Custom Search (yes, they are actually trying to follow their own abbreviated URLs!) and the Ask equivalent. From what I've read in other threads, the problem is analogous to the examples you've seen on social networking sites, i.e. Googlebot is following the truncated or otherwise abbreviated-for-space strings (most frequently inside <cite>, <div> and <span> tags) that appear as a reference but not as a link, and reading them as bad URLs.
|
aristotle

msg:4373272 | 3:26 pm on Oct 11, 2011 (gmt 0) |
There is a type of auto-generated spam, in which webpages are created from scraped copies of Google SERPs. Is it possible that this is what is happening?
|
tedster

msg:4373348 | 6:35 pm on Oct 11, 2011 (gmt 0) |
If the URL actually returns a 404 status in the HTTP header, and it's supposed to be 404 - then why is this a "nightmare"?
|
rlange

msg:4373356 | 7:34 pm on Oct 11, 2011 (gmt 0) |
tedster wrote: If the URL actually returns a 404 status in the HTTP header, and it's supposed to be 404 - then why is this a "nightmare"? |
| If the information is meaningful, then it needs to be acted upon. If it's just a meaningless side-effect of Google's automated experimentation, then it shouldn't even be in Webmaster Tools reports. It lowers the signal-to-noise ratio, potentially to the point of uselessness. Edit: I suppose that problem could be solved if WMT separated internal linking errors from external linking errors. As it is now, though, you have to wade through plenty of errors outside of your control to locate errors that are within your control. -- Ryan
|
aristotle

msg:4373378 | 8:29 pm on Oct 11, 2011 (gmt 0) |
If the URL actually returns a 404 status in the HTTP header, and it's supposed to be 404 - then why is this a "nightmare"? |
| tedster -- If Marfola is correct when he says that the links to his pages in the Google SERPs have truncated URLs, then people who try to click through to his pages from the SERPs will get 404s. That would be a nightmare. However, I'm still not convinced that Google SERPs are the original source of these bad links. That's why I asked him about auto-generated spam pages that use Google's SERPs for their content.
|
deadsea

msg:4373422 | 10:23 pm on Oct 11, 2011 (gmt 0) |
I see a lot of this in my logs too. It appears that Google is trying to crawl non-linked text that looks like it might be a URL. That probably isn't an unreasonable assumption some of the time, but it leads to 404 pages being crawled. I don't think Google views this as a problem worth fixing, and I don't think returning 404s for these will cause any problems for your rankings.
If it really bothers you (and it does me), then I would recommend the redirect route. I base mine on heuristics that don't require db lookups. It often means that I redirect to a 404, but I can live with that. I have redirect rules on my server that do the following:
- Remove any unrecognized characters from the path and redirect. I only allow [a-zA-Z0-9\-\_\.\/] in my file names.
- Redirect URLs that end in .h, .ht or .htm to end with .html
- Redirect URLs with multiple consecutive slashes to a single-slash version (e.g. //web//foo.html to /web/foo.html)
- Redirect URLs with directory navigation dots in the URL to the correct location (e.g. /web/../foo.html to /foo.html)
- Redirect away from any junk after .html (e.g. /foo.htmltexthere to /foo.html)
That takes care of the majority of the badly formed URLs that get requested. I still get the occasional truncation. I put in a redirect for those on a case-by-case basis if they get enough requests.
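The rules listed above can be sketched outside the server config to see how they interact. This is only an illustration in Python, not the actual server rules: the names `normalize_path` and `redirect_target` are hypothetical, and the order in which the rules are applied is an assumption.

```python
import posixpath
import re
from typing import Optional

# Anything outside the allowed set; the hyphen is placed last so it is literal.
DISALLOWED = re.compile(r"[^a-zA-Z0-9._/-]")


def normalize_path(path: str) -> str:
    """Apply the heuristic clean-up rules to a URL path (no database lookups)."""
    # Remove unrecognized characters.
    cleaned = DISALLOWED.sub("", path)
    # Collapse runs of slashes: //web//foo.html -> /web/foo.html
    cleaned = re.sub(r"/{2,}", "/", cleaned)
    # Resolve directory navigation dots: /web/../foo.html -> /foo.html
    keep_trailing_slash = cleaned.endswith("/")
    cleaned = posixpath.normpath(cleaned) if cleaned else "/"
    if keep_trailing_slash and cleaned != "/":
        cleaned += "/"
    # Drop any junk after .html: /foo.htmltexthere -> /foo.html
    cleaned = re.sub(r"\.html.+$", ".html", cleaned)
    # Complete a cut-off extension: .h, .ht or .htm -> .html
    cleaned = re.sub(r"\.(?:h|ht|htm)$", ".html", cleaned)
    return cleaned


def redirect_target(path: str) -> Optional[str]:
    """Return the path to redirect to, or None if the path is already clean."""
    cleaned = normalize_path(path)
    return None if cleaned == path else cleaned
```

For example, `redirect_target("/foo.htmltexthere")` gives `/foo.html`, while `redirect_target("/foo.html")` gives `None`. The sketch never checks whether the cleaned path exists, so the redirect can still land on a 404.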
|
g1smd

msg:4373434 | 11:04 pm on Oct 11, 2011 (gmt 0) |
del [edited by: g1smd at 11:10 pm (utc) on Oct 11, 2011]
|
g1smd

msg:4373436 | 11:09 pm on Oct 11, 2011 (gmt 0) |
| If Marfola is correct when he says that the links to his pages in the Google SERPs have truncated URLs, then people who try to click through to his pages from the SERPs will get 404s. That would be a nightmare. |
| In many, perhaps most cases, there are actually no a href links with the duff format. The truncated formats are seen in plain text or in anchor text and for whatever reason Google is now keeping note of each of these and requesting them from the server. You'll never see a real visitor requesting these malformed URLs, only Googlebot. | It often means that I redirect to a 404 |
| Never redirect to a 404. The 404 status must be returned at the originally requested URL. | I only allow [a-zA-Z0-9\-\_\.\/] in my file names. |
| You have way too much escaping, use [a-zA-Z0-9/._-] or, better still, use [a-z0-9/._-] with the [NC] flag. (The hyphen goes last so that it is read as a literal rather than as a range.)
|
deadsea

msg:4373438 | 11:18 pm on Oct 11, 2011 (gmt 0) |
| Never redirect to a 404. The 404 status must be returned at the originally requested URL. |
| I don't see why not. It's often too much work/code to check if the URL exists before issuing the redirect. | You have way too much escaping, use [a-zA-Z0-9/._-] or, better still, use [a-z0-9/._-] with the [NC] flag. |
| Any literal character except a-zA-Z0-9 should be escaped in regular expressions. While other characters may work fine unescaped today, they are all reserved for future use as special characters. I'd like my regex to be future compatible.
|
g1smd

msg:4373445 | 11:36 pm on Oct 11, 2011 (gmt 0) |
| Any literal character except a-zA-Z0-9 should be escaped in regular expressions. |
| Not so. Only a few characters need to be escaped in RegEx. You're maybe thinking of Javascript or something. Additionally, the rules are different for a "character group", as shown here.
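One character-group detail is easy to check by hand: a dot is literal inside the brackets, but an unescaped hyphen between two characters defines a range. A minimal sketch, using Python's `re` module as a stand-in for the server's regex engine (an assumption; the helper `matches` is hypothetical):

```python
import re

# Hyphen between "." and "_": read as the range 0x2E ("." ) to 0x5F ("_").
range_class = re.compile(r"[a-z0-9/.-_]")
# Hyphen placed last: read as a literal hyphen, no escaping needed.
literal_class = re.compile(r"[a-z0-9/._-]")
# Hyphen escaped: also a literal hyphen, wherever it sits.
escaped_class = re.compile(r"[a-z0-9/.\-_]")


def matches(pattern: "re.Pattern[str]", char: str) -> bool:
    """True if the single character is accepted by the character group."""
    return pattern.fullmatch(char) is not None


for name, pattern in [("range", range_class),
                      ("literal", literal_class),
                      ("escaped", escaped_class)]:
    print(name, {ch: matches(pattern, ch) for ch in "-@X."})
```

With the hyphen in the middle, the group accepts characters such as `@` and uppercase letters but rejects `-` itself. Placing the hyphen last or escaping it gives the intended set.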
|
lucy24

msg:4373467 | 1:08 am on Oct 12, 2011 (gmt 0) |
| Any literal character except a-zA-Z0-9 should be escaped in regular expressions. |
| Where on earth did you hear that? In the specific context of rewrites or redirects, you then get the opposite risk: that the \ will be read as the literal backslash character, thereby breaking your whole rewrite. Besides, you don't have . and / in your file names. The directory delimiter / is not part of the name. And . should never occur except in filename extensions. (Also in your domain name, but that's generally not a rewrite concern.) Putting them into a one-size-fits-all group is just asking for trouble.
|
smallcompany

msg:4373497 | 4:47 am on Oct 12, 2011 (gmt 0) |
I see this as a problem, too. - I find it annoying as it's showing in WMT under web crawl errors, while those are not classic errors like when a bad link has been put up somewhere. - The pages that host the links actually link correctly, but Google picks them up incorrectly. The pages are on sites that build content automatically by sweeping through the results of other search engines. In the example that I just looked into, the results seem to come from MSN (Bing). What happens is that there's a title that links correctly, there's some text, and then there's a display URL which is text only. That display URL gets truncated if it's too long, and Google picks it up as a URL. That's a mistake on Google's side. Just a few days ago I manually entered 65 of those into the .htaccess of one of my sites just to see them gone, as I would like to have a clear space in that section and be able to notice a real crawl problem if it ever arises. Otherwise, I cannot read through all of them every time.
|
Marfola

msg:4373551 | 7:44 am on Oct 12, 2011 (gmt 0) |
| There is a type of auto-generated spam, in which webpages are created from scraped copies of Google SERPs. Is it possible that this is what is happening? |
| Most definitely. | I don't think it will cause any problems for your rankings to return 404s for these. |
| I don’t think so either which is the main reason I don’t want to create a 301 for a 404 which should return a 404. | You'll never see a real visitor requesting these malformed URLs, only Googlebot. |
| Couldn’t agree more. | If the URL actually returns a 404 status in the HTTP header, and it's supposed to be 404 - then why is this a "nightmare"? |
| Because there are hundreds of these in my crawl errors report, wading through this junk to find ‘real’ crawl errors now takes significantly longer. Multiplied across the webmaster community, that’s a lot of time wasted on a ‘small’ problem for Google and, as rlange points out, the noise level is so high that it reduces the value of the report. If the bad links from social networking sites are created in the same way, i.e. truncated URLs seen in plain or anchor text, there’s an even greater need for Google to fix the problem or for the HTML5 community to come up with an appropriate tag for referenced URLs.
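Until the report separates the two kinds of error, one way to cut the noise is to split an exported list of reported URLs into likely truncations and everything else. A rough sketch, assuming you have the reported paths and a list of your real paths; the function names and the sample paths are hypothetical:

```python
from typing import Iterable, List, Tuple


def is_truncation(reported: str, known_paths: Iterable[str]) -> bool:
    """True if the reported path is a proper prefix of a real path."""
    # Strip a trailing ellipsis, typed as dots or as the single character.
    stem = reported.rstrip(".\u2026")
    if len(stem) < 2:  # "/" alone is a prefix of everything, so ignore it
        return False
    return any(path != stem and path.startswith(stem) for path in known_paths)


def split_crawl_errors(reported: Iterable[str],
                       known_paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split reported 404s into (likely truncations, errors worth reading)."""
    known = set(known_paths)
    truncated: List[str] = []
    other: List[str] = []
    for path in reported:
        (truncated if is_truncation(path, known) else other).append(path)
    return truncated, other


known = ["/widgets/blue-widget-deluxe.html", "/about.html"]
reported = ["/widgets/blue-wid...", "/widgets/blue-widget-del",
            "/old-page.html", "/abou"]
truncated, other = split_crawl_errors(reported, known)
print(truncated)
print(other)
```

Only the second list needs a human look. A prefix test will misfile a genuinely broken link that happens to be the start of a real path, so treat the first list as probable noise rather than certain noise.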
|
Hissingsid

msg:4373637 | 2:34 pm on Oct 12, 2011 (gmt 0) |
I can see that this might be seen at Google as a way of ensuring that sites don't try to manipulate results by not giving a page the benefit of a link. If the page is useful enough to be referred to in text, then it ought to have a hyperlink to it. The problem is where, in Ask for example, they do have a hyperlink but the designers have also included a truncated visible text rendition of the URL with no anchor. I've spent some time over the last couple of days adding these as 301 redirects in my .htaccess files. This may be a coincidence, but one site that lost its sitelinks in the SERPs a couple of weeks ago has regained them this morning. If it isn't a coincidence, perhaps it is something to do with trust. Cheers Sid
|
Marfola

msg:4379091 | 11:32 am on Oct 25, 2011 (gmt 0) |
Webmaster Tools has again dumped more than a hundred of the same broken and truncated URLs into the crawl errors report. These incoming links are from auto-generated spam: webpages created from scraped copies of Google SERPs. Is it really too much to ask Google to refrain from reporting truncated or otherwise abbreviated URLs seen in plain text (there are no a href links in any of these)? I prefer not to implement 301 redirects for two reasons: I don’t want credit for backlinks from auto-generated spam, and the URLs are bad and hence should return a 404.
|
pageoneresults

msg:4379111 | 12:09 pm on Oct 25, 2011 (gmt 0) |
| These incoming links are from auto-generated spam, webpages created from scraped copies of Google SERPs. |
| It's a real pain in the arse! I don't like seeing them in GWT and we do what we can to redirect those that may have some value. Frackin scrapers suck! Any URI that is truncated visually is going to get scraped exactly as you see it. I just started seeing this in the past few months. It wasn't happening before.
|
Andem

msg:4379112 | 12:10 pm on Oct 25, 2011 (gmt 0) |
I've also been seeing these links for literally thousands of non-existent pages. This morning I noticed hundreds more, and not just from your *typical* scraper. There are plenty of them from google.com/m/search! As a side note: if you click through one of those links from mobile search results, all of your page content is displayed with the ads stripped. Did anybody authorize their entire content to be copied and served by Google?
|
Joshmc

msg:4379253 | 5:57 pm on Oct 25, 2011 (gmt 0) |
Is it ok to return 400 for these instead of 404?
|
dstiles

msg:4379352 | 10:05 pm on Oct 25, 2011 (gmt 0) |
I wonder how far that rot has set in. I had a very rare visit from the Google Chrome Instant browser today, about half a dozen hits from the same IP. On each successive hit the browser tried a bit more of a query string - goo, good+ba etc. - getting a 404 rejection each time. There was no referer at all.
|
tedster

msg:4379368 | 10:23 pm on Oct 25, 2011 (gmt 0) |
| Is it ok to return 400 for these instead of 404? |
| I have no experience on that front, but it sounds like a good idea technically. It sends the clearest message, IMO.
|
Joshmc

msg:4379386 | 11:13 pm on Oct 25, 2011 (gmt 0) |
Thanks tedster. I am going to stick with it; that's what I was thinking as well.
|
Marfola

msg:4379544 | 12:25 pm on Oct 26, 2011 (gmt 0) |
Good idea, Joshmc. I’ve made the change; URLs with a malformed syntax now return a 400, not a 404. I’m curious to learn how this will impact both Googlebot and GWT. One question: should 400 errors return a custom error page or a standard error page?
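The split between the two status codes can be sketched as a small decision function. This is an illustration only: the thread does not say what counts as "malformed syntax", so the rule below (any character outside a conservative allowed set) and the names `VALID_PATH` and `choose_status` are assumptions.

```python
import re
from http import HTTPStatus
from typing import Set

# Assumed definition of a well-formed path: a leading slash followed only
# by letters, digits, dot, underscore, slash and hyphen.
VALID_PATH = re.compile(r"/[a-zA-Z0-9._/-]*")


def choose_status(path: str, existing_paths: Set[str]) -> HTTPStatus:
    """Pick the response status for a requested path."""
    if VALID_PATH.fullmatch(path) is None:
        return HTTPStatus.BAD_REQUEST   # 400: the request itself is malformed
    if path in existing_paths:
        return HTTPStatus.OK            # 200: a real page
    return HTTPStatus.NOT_FOUND         # 404: well-formed, but nothing there


pages = {"/foo.html"}
for requested in ["/foo.html", "/missing.html", "/foo<b>.html"]:
    print(requested, int(choose_status(requested, pages)))
```

Note that a cleanly truncated URL such as /foo.ht is still well-formed under this rule and would get a 404, so the 400 only covers requests containing characters the site never uses.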
|
Bill_H

msg:4379668 | 5:40 pm on Oct 26, 2011 (gmt 0) |
I am seeing hundreds of truncated URLs in Webmaster Tools as well. They all link back to quite poorly done scraper sites: reference.com, qybrd.com, ask.reference.com, et al. In the last few days Google has started to show the truncated links' "Linked From" as "unavailable". Perhaps Google is learning that the truncated URLs are actually wrong. I sure hope so, as we don't need a hit from Google in the SERPs due to these jerks with the scraper sites. Cheers, Bill
|
lucy24

msg:4379775 | 9:29 pm on Oct 26, 2011 (gmt 0) |
| One question, should 400 errors return a custom error page or standard error page? |
| Custom error pages are for humans. So you only need a 400 page if the phony links are making it as far as the SERPs, and people are really clicking on them. Resist the temptation to have the page say "You got here because g### can't tell an URL from a hole in the ground" ;) Easiest approach is to send them to the same page that 404s get. I've never had a human get a 400, but I do it with 410s.
|
Marfola

msg:4380631 | 11:49 am on Oct 28, 2011 (gmt 0) |
Thanks lucy24. | In the last few days Google is starting to show the truncated links "Linked From" as "unavailable" |
| Our report is also now showing 'unavailable' first thing in the morning. Unfortunately, it later updates with a URL.
|