Forum Moderators: Robert Charlton & goodroi
Thanks, and please excuse my bad English, Franco
1. IE removes the extra character, but apparently still looks for the bogus domain name because I get a 404
2. Firefox keeps the extra character, and of course I get a 404
4. Opera removes the extra character and then requests the corrected domain name, which resolves.
I am not sure why Google would keep these erroneous URLs in its index, unless perhaps the server is resolving them anyway, or perhaps Googlebot's crawling code corrects the URLs in the manner that Opera does.
But where is the character coming from in the first place? Is it a glitch in Google? Competitive sabotage against the domain? A typo in an inbound link somewhere?
<Sorry, absolutely no personal urls, per Forum Charter [webmasterworld.com]>
With the strange command 1, all the pages that disappeared from the normal cache return.
Before April 2006, Google web search brought this section about 10,000 visits; now it brings zero visits, and this is the section where I use AdSense ...
This is a serious problem for me, as you can understand.
Thanks, Franco
[edited by: tedster at 6:00 am (utc) on Sep. 8, 2006]
I wrote this description of the Googlebot bug on the 17th of August but held it back, as apparently I was the only one with this problem. Here is what I noted then, and it is still happening regularly on our site:
------access log-----------
66.249.72.1 - - [17/Aug/2006:06:39:47 +0000] "GET /any-folder/any-sub-folder/any-file-name.%20... HTTP/1.1" 404 1736 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
------end access log-------
Essentially, it tries to spider the file
widgets/blue-ones/list.%20...
instead of the correct, existing file that is already indexed by Google:
widgets/blueones/list.html
This is happening on several of our websites (at least 4-6 times every day since the middle of August)
Another case is attempting to fetch (get)
GET /index.html')
instead of the correct
GET /index.html
adding ') at the end of the string...
Of course, we have never had such badly named files on our web servers, so Googlebot is working from "almost right" information about our URLs.
Please note that the wrong attempted URL is also reported in Google Sitemap, in the Diagnostic / Web Crawl / Not Found section.
The way it happens, it looks to me as if Googlebot is taking the "abbreviated listing" of our actual URLs and using it in the spidering phase, which is crazy...
You could try
grep -F '%20...' accesslog.filename
from your Unix shell (if you use Unix/Linux, of course). The -F makes grep treat the dots literally instead of as wildcards.
Thanks
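For anyone who wants more than the grep check above, here is a minimal sketch in Python that picks out only the 404 requests from a Googlebot user agent whose path ends in one of the two broken suffixes described in this thread. It assumes the Apache combined log format shown in the access log excerpt; the function name and the suffix list are my own choices for illustration, and the user-agent string can be spoofed, so treat the results as a starting point only.

```python
import re

# Apache combined log format, as in the excerpt quoted in this thread:
# host ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Literal endings reported in this thread (assumption: adjust to your own logs)
SUSPECT_SUFFIXES = ("%20...", "')")


def suspicious_googlebot_404s(lines):
    """Yield (time, path, referer) for Googlebot 404s with a broken URL ending."""
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        if m["status"] != "404" or "Googlebot" not in m["agent"]:
            continue
        if m["path"].endswith(SUSPECT_SUFFIXES):
            yield m["time"], m["path"], m["referer"]
```

To use it, open the log file and pass the file object to `suspicious_googlebot_404s`; the referer field comes back too, so you can see at a glance whether any of these requests carried a referring page.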
Nope.
If someone else (even outside my own domains) had linked to me with
my-domain.com/invalid-path
I would have detected it in the referrer logs, which we carefully analyze every day, one by one, even if it was a 404 as in this case.
Never in the last 4 years has anyone (except Googlebot) requested anything with those wrong characters in the URL...
Also, the mistake is too common and too new not to have left some trace in the access/error logs, and the only traces there are Googlebot's...
The other sources are links on other pages, both from your site(s) and other site(s).
Mistakes in links happen, and when they are picked up in Google's crawling they get stored for later crawling.
The thing about the site the starter of this post has is that clicking on the page title in the SERPs returns the actual page, that is, the one without the %c0 in it. The %c0 is not in the link that Google has for the page. The %c0 is, however, shown in the green URL display. The %c0 is also used in the link to Google's cache entry. Following the cache link results in the cache entry not being found. However, removing the %c0 from the cache link gets you to the cache copy. That cache copy is a bit messed up as well.
I use Mozilla as my browser, and when the %c0 is used in any of the links on the site I get the "www.example.com%c0 was not found" message, which IIRC comes from a DNS lookup failure.
In other words ya can't even get there using %c0 in that manner.
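To illustrate why the lookup fails, here is a rough sketch in Python. It assumes the stray escape sits directly after the domain, as in the error message quoted above; the URL, the helper names, and the simplified hostname pattern are hypothetical and only for illustration. A URL parser treats the %c0 as part of the host, and a host containing "%" is not a plain DNS name, so there is nothing for a resolver to find.

```python
import re
from urllib.parse import urlsplit

# Simplified pattern: dot-separated labels of letters, digits, and hyphens.
# This is an illustrative check, not a complete hostname validator.
PLAIN_HOSTNAME = re.compile(
    r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)(\.(?!-)[A-Za-z0-9-]{1,63}(?<!-))*$"
)


def host_of(url):
    """Return the host part of a URL exactly as the parser sees it."""
    return urlsplit(url).netloc


def is_plain_hostname(host):
    """True if the host contains only characters a normal DNS name may use."""
    return bool(PLAIN_HOSTNAME.match(host))


# Hypothetical URLs modeled on the case discussed in this thread
good = host_of("http://www.example.com/page.html")
bad = host_of("http://www.example.com%c0/page.html")
```

Here `bad` still carries the %c0 as part of the host, so a browser that does not strip it (as Opera reportedly does) ends up asking DNS for a name that cannot exist.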
Too many things to contemplate, too little time, and not enough real information.
[edited by: theBear at 2:30 pm (utc) on Sep. 8, 2006]