Yes. This has been the case for at least 6 months now.
Yes, often. There are some recent threads about it. (Anyone got a good memory for Forums search terms?) It's infuriating but there's generally nothing you can do about it. You can fix your own links-- but when it isn't yours and it isn't even a link, about all you can do is swear at g### ;)
I've been seeing this a lot lately, but more so with truncations of actual URLs on my site, with ... appended to the end.
The linking pages that GWT lists as the source of these malformed links are scraper and mashup sites that use Bing's search results. These pages include an actual hyperlink to my site, as well as the truncated url...
I have also been noticing more 404 traffic coming to strange URLs - URLs that are very similar to actual URLs, but with slight changes. I haven't seen any of these show up in GWT, though.
If it is a truncation or has some trailing punctuation that is causing a 404 I generally put in a 301 redirect. If googlebot can't figure it out, then many users that copy and paste probably can't figure it out either. Another good reason to have short urls. Fewer ways to mess them up.
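For anyone who wants to automate that, here is a rough Python sketch of the idea: strip trailing punctuation, and if what is left is (or unambiguously starts) a real page, 301 to it. KNOWN_PATHS, MIN_PREFIX and redirect_target are made-up names for illustration, and the list of pages would come from your own site; treat it as a starting point, not a finished rule set.

```python
from typing import Optional

# Hypothetical list of paths that really exist on the site.
KNOWN_PATHS = {"/widgets/blue-widget.htm", "/about.htm"}

# Characters that scrapers and copy-paste tend to leave on the end of a URL.
TRAILING_JUNK = ".,;:!?)]'\"\u2026"

# Assumed cut-off: don't guess from very short fragments.
MIN_PREFIX = 5


def redirect_target(requested_path: str) -> Optional[str]:
    """Return the path to 301 to, or None if no redirect applies."""
    cleaned = requested_path.rstrip(TRAILING_JUNK)

    # A valid URL needs no redirect at all.
    if cleaned == requested_path and cleaned in KNOWN_PATHS:
        return None

    # Trailing punctuation only: the cleaned path is a real page.
    if cleaned in KNOWN_PATHS:
        return cleaned

    # Truncation: redirect only when exactly one real page starts with it.
    if len(cleaned) >= MIN_PREFIX:
        matches = [p for p in KNOWN_PATHS if p.startswith(cleaned)]
        if len(matches) == 1:
            return matches[0]

    # Anything else should just get an ordinary 404.
    return None
```

With the sample data, redirect_target("/about.htm...") gives "/about.htm", and a truncated "/widgets/blue-wid..." gives "/widgets/blue-widget.htm". Anything ambiguous falls through to None, so it still ends up as a plain 404.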
Interesting, I wasn't aware that this had been the case for a few months now. I've only just started seeing them.
These sites do seem to be scrapers; they wrap the truncated URLs of my pages in span tags with the CSS class 'msnresulturl'.
So does Google follow all URLs or just truncated ones?
I've had this issue since summer (at least). In my case I put a link URL in a YouTube video description. The layout is the URL, then a blank line, then some text. YouTube video scrapers put the video on their websites, and Googlebot concatenates the text below the URL onto the end of it... So now I put an extra space after the URL before leaving a blank line below... Hope this will help.
It appears that Googlebot will follow any link, whether it is plain text or an actual link. In our case we have "..." at the end, anchor text mixed in with the link, HTML code included, and a bad case of a trailing / after filename.htm. Sometimes these actually resolve, which then creates a load of file-not-founds because the relative paths are incorrect.
I would have thought they would have discounted these at once as they come from off-site, but no, the bot aimlessly follows them.
Although annoying, it is not a problem if such URLs are returning 404.
However, if such a partial URL returns 200 OK (and you do not have a canonical link element implemented) then you might have a duplicate content problem on your hands that can quickly escalate, so watch the Duplicate Titles/Descriptions section of WMT for it.
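For reference, the canonical link element is a single tag in the page head. A minimal Python sketch of generating it follows; canonical_link_tag is a made-up helper name and the example.com URL is a placeholder. The point is that the page emits the same canonical URL no matter which mangled address it was reached through.

```python
from html import escape


def canonical_link_tag(canonical_url: str) -> str:
    """Build the <link rel="canonical"> tag for a page's one true URL."""
    # Escape the URL so characters like & and " stay valid inside the attribute.
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'


# The tag is the same regardless of which variant URL was requested.
print(canonical_link_tag("http://www.example.com/page.htm?a=1&b=2"))
```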
I can see a bit of a problem here. If you have in your htaccess a line to redirect hits to the non-www version of your site to the www version, and then some scraper site links to a non-existent page without the www, Googlebot will get a 301 response followed by a 404. This then looks like it's a problem with your site, not the external link.
Would people say it would be fine to just send any non-www hits to the home page with the www? If, say, you knew that virtually no-one is linking to anything other than your home page with valid non-www links anyway?
A 301 redirect of a page that has never actually existed to a 404 page isn't a problem. Google will tolerate the behavior without thinking your site has problems. You should still further redirect such requests when you find them.
I'm seeing them try to link shortened links
where the dots are presumably how some forum shortens it, but the href attribute is unchanged. The link in the forum works, but if they just spider the text instead of the actual href attribute it's obviously broken.
John Mueller says at the Google Help Forum [google.com] they are using non-hyper-texted URLs to find new content and it sometimes causes 404 in WMT.
Doesn't sound good. Potential for sabotage here?
suzukik - Thanks for the link to the Google Help Forum thread... [google.com...]
To keep this discussion self-contained, it's worth quoting some of John Mueller's comments from the thread here....
My emphasis below...
|Those links appear to come from text on those pages. We've started picking up text that looks like URLs on HTML pages and seeing if they lead to new content. Sometimes those links are truncated and useless, but it's easy enough to try them and forget them if they lead nowhere, so we've started checking them to be sure. We primarily use these kinds of URLs for discovering new content. |
I realize that this can lead to a somewhat cluttered crawl errors section in Webmaster Tools, so we're looking into ways of making that a bit clearer....
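To see which strings on a scraper page a crawler might pick up this way, a rough Python sketch like the one below can help. It is only a guess at the general idea, not Google's actual method: URL_LIKE, TextUrlFinder and find_text_urls are made-up names, and the pattern is deliberately loose.

```python
import re
from html.parser import HTMLParser

# Loose, assumed pattern: anything that starts like a URL and runs to whitespace.
URL_LIKE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


class TextUrlFinder(HTMLParser):
    """Collects URL-looking strings from page text; href attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_data(self, data):
        # handle_data receives text between tags, never attribute values.
        self.candidates.extend(URL_LIKE.findall(data))


def find_text_urls(html: str):
    """Return every URL-like string found in the text of an HTML document."""
    finder = TextUrlFinder()
    finder.feed(html)
    finder.close()
    return finder.candidates
```

Feed it a scraper page's source and compare the result with the crawl errors list. A span containing "http://example.com/full-pa..." comes back exactly like that, dots included, even when the href next to it is intact.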
John also points out that 404s for URLs that don't exist are normal and should present no problems. Regarding the 301s, he says...
|At any rate, you don't need to "fix" this problem (eg with a 301 redirect), if you're sure that the URL should really not exist. |
We tag outlinks on pages in Google Analytics using _trackPageview, and they follow those even though they aren't hyperlinks, which is pretty stupid.
Doesn't really boost their case for "Please let us crawl your .js files" does it :(
So far, the worst I've got is "linked from" ... a page that itself no longer exists. (Darn! If I paid closer attention before clicking "Fixed", I would notice if they ever had two long-gone pages linking only to each other.) And google's perennial favorite, attributing nonexistent pages to a sitemap from 2008 even though it's "no data available" when you press for details.
I'm getting a lot right now because I de-blocked a bunch of pages I deleted a year ago so they're picking up 410s right and left. Happened to coincide with a change in their way of recording "errors". My goodness, that googlebot has a long memory.
Still think it would be nice if they told me the names of those 78 pages they profess to have been roboted-away from. Then at least I could be sure they were those same long-deleted pages, and not other pages that google isn't even supposed to know about. (I found three back when they were naming names. They're only linked from one place in the world-- and that place is also securely roboted-out, so how does g### find out about them? Don't answer that. It's going to be one of those search-engine technicalities that I can't wrap my brain around, isn't it.)
|A 301 redirect of a page that has never actually existed to a 404 page isn't a problem |
If the bot sees it, it may create a soft error in GWT. Do a valid redirect to a valid page, or return a 4xx right away.
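One way to get the "4xx right away" behaviour is to check whether the page exists before doing the non-www to www redirect, so a bogus non-www URL never produces a 301 followed by a 404. A minimal Python sketch of that ordering is below; CANONICAL_HOST, KNOWN_PATHS and respond are made-up names, and how you look up real pages depends on your own setup.

```python
from typing import Optional, Tuple

CANONICAL_HOST = "www.example.com"   # hypothetical canonical hostname
KNOWN_PATHS = {"/", "/about.htm"}    # hypothetical list of real pages


def respond(host: str, path: str) -> Tuple[int, Optional[str]]:
    """Return (status code, redirect location or None) for a request."""
    # Check existence first: a page that never existed gets a 404 in one hop.
    if path not in KNOWN_PATHS:
        return 404, None

    # Only real pages are worth redirecting to the canonical host.
    if host.lower() != CANONICAL_HOST:
        return 301, f"http://{CANONICAL_HOST}{path}"

    return 200, None
```

So respond("example.com", "/abou...") is a straight 404, while respond("example.com", "/about.htm") is a 301 to the www version.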
|they are using non-hyper-texted URLs to find new content and it sometimes causes 404 in WMT. |
I replaced the 404s years ago with redirects, so I don't get any errors even if the bot finds something that looks like a link. Looks like infinite 404 error space may hurt after all.