Sorry for replying again, but I made another discovery today:
1) site:www.mysite.com/directory_disappeared --> no pages in cache
2) site:www.mysite.com%c0/directory_disappeared --> All pages in cache
Can anyone help me?
Same problem with my site.
www.mysite%c0.com is showing a cached page in Google,
but www.mysite.com isn't.
Please help us.
Just bumping this thread, as I find it interesting.
I think that Google has serious basic issues at the moment, and I wonder why this one happened.
I was trying to research this -- to see how various servers handled the extra character after the domain extension. What I discovered is that different BROWSERS do different things:
1. IE removes the extra character, but apparently still looks for the bogus domain name because I get a 404
2. Firefox keeps the extra character, and of course I get a 404
3. Opera removes the extra character and then requests the corrected domain name, which resolves.
I am not sure why Google would keep these erroneous urls in their index, unless perhaps the server is resolving it anyway -- or perhaps googlebot's crawling code corrects the urls, in the manner that Opera does.
But where is the character coming from in the first place? Is it a glitch in Google? Competitive sabotage against the domain? A typo in an inbound link somewhere?
I have decided to post the web address of my website because I need your help.
In Google, search for:
<Sorry, absolutely no personal urls, per Forum Charter [webmasterworld.com]>
With the strange command 1, all the pages that disappeared from the normal cache return.
Before April 2006, Google web search sent this section about 10,000 visits; now it sends zero, and this is a section where I use AdSense ...
As you can understand, this is a serious problem for me.
[edited by: tedster at 6:00 am (utc) on Sep. 8, 2006]
That appears to be a Google display problem: the %c0 isn't in the actual url. It does, however, get used in the attempt to locate the cache entry, which it then won't find.
I'd classify that as a Googlebug ...
I tried to contact the Google team about this "googlebug", but without success.
What can I do now?
I assume you have already looked into this, but I still think I'll say it. Do whatever you can to make sure that the bizarre url is not resolved by your server -- through any kind of misconfiguration of the server itself or through a problem with your DNS. If googlebot gets a 200 code returned along with content when it tries to spider the bad urls, that could make fixing things a lot more complicated.
The other search engines (Yahoo, MSN, ...) have no problem finding the correct web address.
Is it possible that a server misconfiguration is active just for googlebot? If so, how can I verify it?
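One way to check for a user-agent-specific misconfiguration is to request the same url twice, once with an ordinary user agent and once with Googlebot's, and compare the status codes. The following is only a minimal sketch: the function names and the "browser" user-agent string are made up for illustration, and it only detects differences keyed on the User-Agent header, not on the requesting IP address.

```python
import urllib.error
import urllib.request

# User-agent string as it appears in the access log quoted in this thread.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# Hypothetical stand-in for an ordinary browser.
BROWSER_UA = "Mozilla/5.0"


def build_request(url, user_agent):
    """Build a GET request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this url and user agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the code is still the answer.
        return err.code


def compare_user_agents(url):
    """Fetch url once per user agent and return {label: status}."""
    return {
        "browser": fetch_status(url, BROWSER_UA),
        "googlebot": fetch_status(url, GOOGLEBOT_UA),
    }
```

Calling compare_user_agents() on one of the bad urls and getting a 404 for "browser" but a 200 for "googlebot" would point to that kind of misconfiguration. Note that urlopen follows redirects, so the status reported is the one for the final url.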
A similar error going on out there is with urls that end with one or more underscores (e.g. domain.com/dir.html__). For some reason googlebot does correct these and lists them, as it seems to be able to get a 200 returned, although if you try it yourself you will get a server error (404). I've actually added a line to my htaccess that will resolve and 301 those types of urls, although I did find two larger spammers linking to me that way (not sure if it was on purpose).
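The htaccess line itself is not shown above, so the following is not that rule, only a sketch of the same idea in Python: strip trailing underscores from the requested path and answer with a 301 to the cleaned path. The function name redirect_for is made up for illustration.

```python
import re

# One or more underscores at the very end of the path.
TRAILING_UNDERSCORES = re.compile(r"_+$")


def redirect_for(path):
    """Return (301, cleaned_path) if path ends in underscores, else None."""
    cleaned = TRAILING_UNDERSCORES.sub("", path)
    if cleaned and cleaned != path:
        return 301, cleaned
    return None


print(redirect_for("/dir.html__"))  # (301, '/dir.html')
print(redirect_for("/dir.html"))    # None
```

Returning None for a clean path means the request is simply served as usual, so only the malformed urls are redirected.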
Same problem for me.
I held back this description of the googlebot bug on the 17th of August, as apparently I was the only one with this problem. Here is what I wrote up then; it is still happening regularly on our site:
188.8.131.52 - - [17/Aug/2006:06:39:47 +0000] "GET /any-folder/any-sub-folder/any-file-name.%20... HTTP/1.1" 404 1736 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
------end access log-------
Essentially, it tries to spider the file
instead of the correct, existing one that is already indexed on Google:
This is happening in several websites of ours (at least 4-6 times every day since the middle of August)
Another case is attempting to fetch (get)
instead of the correct
adding ') at the end of the string...
Of course we have never had such badly named file(s) on our web servers; therefore this is "almost right" information that Googlebot has about our URLs.
Please note that the wrong attempted URL is also reported in Google Sitemaps, in the Diagnostic / Web Crawl / Not Found section.
The way it's done, it looks to me like Googlebot takes the "abbreviated listing" of our actual URLs to use in the spidering phase, which is crazy...
you could try running
grep -F '%20...' accesslog.filename
from your unix shell (if you use unix/linux, of course). The -F makes grep treat the dots as literal characters rather than as regex wildcards.
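If grep turns up too many unrelated hits, the same search can be narrowed to requests that came from Googlebot and returned a 404. This is a rough sketch that assumes the log uses the combined format shown in the excerpt above; the names LOG_LINE and googlebot_404s are made up for illustration.

```python
import re

# Matches the request, status and user-agent fields of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_404s(lines, marker="%20..."):
    """Yield paths that Googlebot requested, got a 404 for, and that contain marker."""
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not a combined-format line, skip it
        if (
            match.group("status") == "404"
            and "Googlebot" in match.group("agent")
            and marker in match.group("path")
        ):
            yield match.group("path")


if __name__ == "__main__":
    # accesslog.filename is the placeholder name used in the grep example.
    with open("accesslog.filename") as log:
        for path in googlebot_404s(log):
            print(path)
```

The user-agent test is a plain substring match, so it will also count anything that merely claims to be Googlebot.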
Interesting discussion, folks. I always thought it was just me having these issues.
I am seeing %C0 on one site but %5C with another. As far as I remember %5C is an actual \ backslash.
I also see the report in Sitemaps too.
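For anyone wanting to check what these escapes stand for, here is a small sketch using Python's standard library; the helper name describe is made up for illustration. It decodes an escape to raw bytes and then tries to read those bytes as UTF-8.

```python
from urllib.parse import unquote_to_bytes


def describe(escape):
    """Return (raw_bytes, utf8_text_or_None) for a percent-escaped string."""
    raw = unquote_to_bytes(escape)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = None  # the bytes are not valid UTF-8 on their own
    return raw, text


print(describe("%5C"))  # (b'\\', '\\')
print(describe("%C0"))  # (b'\xc0', None)
```

So %5C is indeed a backslash, while %C0 is the single byte 0xC0, which is not a valid character by itself in UTF-8 (in Latin-1 the same byte is "À").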
Those extra characters could be from a bad link on your site or someone else's site.
If someone else (even not internally to my domains) had linked me for
I would have detected it in the referral logs that we carefully analyze every day, one by one, even if it was a 404 like in this case.
Never in the last 4 years has anyone (but googlebot) requested anything with those wrong characters in the URL....
Also, the mistake is too common and too new not to have left some trace in the access/error log, yet the only traces are googlebot's own requests...
It could be coming from someone's browser via the Google toolbar. That would support tedster's theory and make a lot of sense.
Why a Google tool bar?
This is what Google says:
"This page lists URLs from your site that Googlebot had trouble crawling."
This is not coming from a Googlebar: this is googlebot trying to crawl that wrong URL.
Please explain.. thanks.
The bot works from a url database; one of the sources of the urls is, IIRC, the google toolbar. In other words, the toolbar phones home. I don't run the toolbar; such items are not installed on my systems.
The other sources are links on other pages, both from your site(s) and other site(s).
Mistakes in links happen, and when picked up in Google's crawling they get stored for later crawling.
The thing about the thread starter's site is that clicking on the page title in the serps returns the actual page, that is, the one without the %c0 in it. The %c0 is not in the link that Google has for the page. The %c0 is, however, shown in the green url display. The %c0 is also used in the link to Google's cache entry. Following the cache link results in the cache entry not being found. However, removing the %c0 from the cache link gets you to the cache copy. That cache copy is a bit messed up as well.
I use mozilla as my browser, and when the %c0 is used in any of the links on the site I get the "www.example.com%c0 was not found" message, which IIRC comes from a DNS lookup failure.
In other words ya can't even get there using %c0 in that manner.
Too many things to contemplate, too little time, and not enough real information.
[edited by: theBear at 2:30 pm (utc) on Sep. 8, 2006]
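The DNS lookup failure described above is consistent with the conventional hostname rule that each dot-separated label may contain only letters, digits and hyphens, so a "%" can never be part of a resolvable host name. Below is a minimal sketch of that check; the function name is made up, and it says nothing about how Google or any particular browser actually processes such a url.

```python
import re

# One DNS label: letters, digits and hyphens, not starting or ending with a hyphen.
LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(host):
    """Return True if every dot-separated label uses only letters, digits and hyphens."""
    if not host or len(host) > 253:
        return False
    labels = host.rstrip(".").split(".")
    return all(LABEL.fullmatch(label) for label in labels)


print(is_valid_hostname("www.example.com"))     # True
print(is_valid_hostname("www.example.com%c0"))  # False
```

A client that hands "www.example.com%c0" to the resolver unchanged will therefore fail, while one that strips or decodes the escape first may end up at the real host, which could be one reason different clients behave differently.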