Added %c0 in urls - problem in google or in my server?

     
2:06 pm on Aug 22, 2006 (gmt 0)

New User

10+ Year Member

joined:Aug 10, 2005
posts:5
votes: 0


Two directories of my website have been missing from Google's cache since April 2006.
Now some pages from these directories have returned to Google search, but the address is strange.
Example:
- the good address is: www.mywebsite.com/web/musica/povia.asp
- the address in Google search is: www.mywebsite.com%c0/web/musica/povia.asp
What's the problem? Is it a Google problem or a problem with my web server?
None of the other directories of my website have this problem in Google search.

Thanks, and excuse my bad English, Franco

10:07 am on Aug 23, 2006 (gmt 0)

New User

10+ Year Member

joined:Aug 10, 2005
posts:5
votes: 0


Sorry to reply to my own post, but today I made another discovery:

1) site:www.mysite.com/directory_disappeared --> no pages in cache

2) site:www.mysite.com%c0/directory_disappeared --> all pages in cache

Can anyone help me?

5:28 pm on Sept 6, 2006 (gmt 0)

New User

5+ Year Member

joined:Aug 31, 2006
posts: 2
votes: 0


Hi,

Same problem with my site.

www.mysite%c0.com is showing a cached page in Google,

but www.mysite.com doesn't.

Please help us.

Thanks.

5:59 am on Sept 7, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:Aug 14, 2004
posts:602
votes: 0



Just bumping this post; I find it interesting.

I think Google has serious basic issues at the moment. I wonder why this one happened.

6:16 am on Sept 7, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


I was trying to research this -- to see how various servers handled the extra character after the domain extension. What I discovered is that different BROWSERS do different things:

1. IE removes the extra character, but apparently still looks for the bogus domain name because I get a 404
2. Firefox keeps the extra character, and of course I get a 404
3. Opera removes the extra character and then requests the corrected domain name, which resolves.

I am not sure why Google would keep these erroneous urls in their index, unless perhaps the server is resolving it anyway -- or perhaps googlebot's crawling code corrects the urls, in the manner that Opera does.

But where is the character coming from in the first place? Is it a glitch in Google? Competitive sabotage against the domain? A typo in an inbound link somewhere?

5:59 pm on Sept 7, 2006 (gmt 0)

New User

10+ Year Member

joined:Aug 10, 2005
posts:5
votes: 0


I have decided to post the web address of my website because I need your help.
In Google, type:

<Sorry, absolutely no personal urls, per Forum Charter [webmasterworld.com]>

With the strange command 1, all the pages that disappeared from the normal cache return.
Before April 2006, the web section got me about 10,000 visits from Google search; now it gets zero visits, and I use AdSense in this section ...
This is a serious problem for me, as you can understand.
Thanks, Franco

[edited by: tedster at 6:00 am (utc) on Sep. 8, 2006]

6:06 pm on Sept 7, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


That appears to be a Google display problem. The %c0 isn't in the actual URL; it does, however, get used in the attempt to locate the cache entry, which it won't find.

I'd classify that as a Googlebug ...

5:47 am on Sept 8, 2006 (gmt 0)

New User

10+ Year Member

joined:Aug 10, 2005
posts: 5
votes: 0


I tried to contact the Google team about this "googlebug", but without success.
What can I do now?
Thanks
6:16 am on Sept 8, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


I assume you have already looked into this, but I still think I'll say it. Do whatever you can to make sure that the bizarre url is not resolved by your server -- through any kind of misconfiguration of the server itself or through a problem with your DNS. If googlebot gets a 200 code returned along with content when it tries to spider the bad urls, that could make fixing things a lot more complicated.
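
One way to run the check described above is to send a request by hand with the bogus host name in the Host header and look at the status code that comes back. The Python sketch below is only an illustration under stated assumptions: `build_request` and `probe_status` are made-up helper names, and the Googlebot user-agent string is copied from an access log line quoted later in this thread. Sending that string only imitates the user agent; it does not reproduce requests from Google's own IP addresses, and it tests the web server only, not DNS.

```python
import socket

# User-agent string as it appears in an access log quoted in this thread.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(host_header, path="/", user_agent=GOOGLEBOT_UA):
    """Build a raw HTTP/1.1 GET request with an explicit Host header."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host_header}",        # sent exactly as given, even if bogus
        f"User-Agent: {user_agent}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe_status(address, port, host_header, path="/", user_agent=GOOGLEBOT_UA):
    """Connect to address:port, send one request, return the status code."""
    with socket.create_connection((address, port), timeout=10) as sock:
        sock.sendall(build_request(host_header, path, user_agent))
        status_line = sock.makefile("rb").readline().decode("ascii", "replace")
    parts = status_line.split()
    if len(parts) < 2 or not parts[1].isdigit():
        raise ValueError(f"unexpected status line: {status_line!r}")
    return int(parts[1])
```

For example, with a placeholder server address, comparing `probe_status("203.0.113.10", 80, "www.example.com")` with `probe_status("203.0.113.10", 80, "www.example.com%c0")` would show whether the server treats the two host names differently. A 200 for the second one would suggest the server answers for the bogus name.
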
9:26 am on Sept 8, 2006 (gmt 0)

New User

10+ Year Member

joined:Aug 10, 2005
posts:5
votes: 0


The other search engines (Yahoo, MSN, ...) have no problem finding the correct web address.
Is it possible that a server misconfiguration is active only for Googlebot? If so, how can I verify it?
10:19 am on Sept 8, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 30, 2002
posts:404
votes: 0


A similar error going around out there involves URLs that end with one or more underscores (e.g. domain.com/dir.html__). For some reason Googlebot does correct these and lists them, as it seems to be able to get a 200 returned, although if you request them yourself you get a server error (404). I've actually added a line to my .htaccess that resolves and 301s those types of URLs, although I did find two larger spammers linking to me that way (not sure if it was on purpose).
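
The poster's actual .htaccess line is not shown, so the following is not that rule. It is just a small Python sketch of the same idea (strip trailing underscores and redirect to the cleaned path), with `redirect_target` as a made-up name. It assumes no real file on the site ends in an underscore and it ignores query strings.

```python
def redirect_target(path):
    """Return the path to 301 to, or None when no redirect is needed."""
    cleaned = path.rstrip("_")
    # Redirect only if something was stripped and a path is still left.
    if cleaned and cleaned != path:
        return cleaned
    return None


for requested in ("/dir.html__", "/dir.html_", "/dir.html"):
    print(requested, "->", redirect_target(requested))
```

A web application or server hook could call this on each request path and answer with a 301 whenever it returns something other than None.
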
12:13 pm on Sept 8, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Oct 29, 2005
posts:45
votes: 0


Same problem for me.

I held back this description of the Googlebot bug on the 17th of August, as apparently I was the only one to have this problem. Here is what I noted then; it is still happening regularly on our sites:

------access log-----------
66.249.72.1 - - [17/Aug/2006:06:39:47 +0000] "GET /any-folder/any-sub-folder/any-file-name.%20... HTTP/1.1" 404 1736 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
------end access log-------

Essentially, it tries to spider the file

widgets/blue-ones/list.%20...

instead of the correct, existing file that is already indexed on Google:

widgets/blueones/list.html

This is happening on several of our websites (at least 4-6 times every day since the middle of August)

Another case is attempting to fetch (get)

GET /index.html')

instead of the correct

GET /index.html

adding ') at the end of the string...

Of course we have never had such badly named files on our web servers, so this is "almost right" information that Googlebot has about our URLs.

Please note that the wrong attempted URL is also reported in Google Sitemaps in the Diagnostic / Web Crawl / Not Found section.

The way it's done, it looks to me like Googlebot takes the "abbreviated listing" of our actual URLs and uses it in the spidering phase, which is crazy...

you could try to

grep -F '%20...' accesslog.filename

from your unix shell (if you use unix/linux, of course).
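
If you would rather narrow the search to 404s requested with a Googlebot user agent, something like the Python sketch below may help. It assumes the log is in the combined format shown in the access log excerpt above; `LOG_LINE` and `googlebot_not_found` are made-up names. Keep in mind that the user-agent string can be faked, so a match does not prove the request came from Google.

```python
import re

# Combined log format, as in the excerpt above:
# ip ident user [time] "method url protocol" status size "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_not_found(lines):
    """Return (time, url) for each 404 whose user agent mentions Googlebot."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        if match.group("status") == "404" and "Googlebot" in match.group("agent"):
            hits.append((match.group("time"), match.group("url")))
    return hits
```

Usage would be along the lines of `googlebot_not_found(open("accesslog.filename"))`, then looking through the returned URLs for odd endings such as `%20...` or `')`.
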

Thanks

12:18 pm on Sept 8, 2006 (gmt 0)

New User

5+ Year Member

joined:July 20, 2006
posts:33
votes: 0


Interesting discussion, folks. I always thought it was just me having these issues.

I am seeing %C0 on one site but %5C with another. As far as I remember %5C is an actual \ backslash.
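
To see what those two escapes stand for, a few lines of Python with the standard `urllib.parse` module are enough. This is just a sketch; `describe_escape` is a made-up name. %5C is the single byte 0x5C, the backslash. %C0 is the single byte 0xC0, which is not a valid character on its own in UTF-8 (in Latin-1 it would be "À").

```python
from urllib.parse import unquote_to_bytes


def describe_escape(escape):
    """Describe the raw byte(s) behind a percent escape such as '%5C'."""
    raw = unquote_to_bytes(escape)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return f"{escape} -> byte(s) {raw.hex()} -> not valid UTF-8 on its own"
    return f"{escape} -> byte(s) {raw.hex()} -> {text!r}"


print(describe_escape("%5C"))
print(describe_escape("%C0"))
```
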

I see the report in Sitemaps too.

12:26 pm on Sept 8, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Jan 5, 2006
posts:2094
votes: 2


Those extra characters could be from a bad link on your site or someone else's site.
12:47 pm on Sept 8, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Oct 29, 2005
posts:45
votes: 0


Trinorthlighting,

Nope.

If someone else (even from outside my own domains) had linked to me at

my-domain.com/invalid-path

I would have detected it in the referral logs that we carefully analyze every day, one by one, even if it was a 404 like in this case.

Never in the last 4 years has anyone (but Googlebot) requested anything with those wrong characters in the URL....

Also, the mistake is too common and too new not to have left any trace in the access/error logs other than Googlebot's own requests...

1:01 pm on Sept 8, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Jan 5, 2006
posts:2094
votes: 2


Ahh,

It could be coming from someone's browser via the Google toolbar. That would support tedster's theory and make a lot of sense.

1:10 pm on Sept 8, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Oct 29, 2005
posts:45
votes: 0


Why a Google tool bar?

This is what Google says:

"This page lists URLs from your site that Googlebot had trouble crawling."

This is not coming from a Googlebar: this is googlebot trying to crawl that wrong URL.

Please explain.. thanks.

2:26 pm on Sept 8, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


The bot works from a URL database; one of the sources of the URLs is, IIRC, the Google toolbar. In other words, the toolbar phones home. I don't run the toolbar; such items are not installed on my systems.

The other sources are links on other pages, both from your site(s) and other site(s).

Mistakes in links happen, and when Google's crawling picks them up they get stored for later crawling.

The thing about the thread starter's site is that clicking on the page title in the SERPs returns the actual page, that is, the one without the %c0 in it. The %c0 is not in the link that Google has for the page. The %c0 is, however, shown in the green URL display. The %c0 is also used in the link to Google's cache entry. Following the cache link results in the cache entry not being found. However, removing the %c0 from the cache link gets you to the cache copy. That cache copy is a bit messed up as well.

I use Mozilla as my browser, and when the %c0 is used in any of the links on the site I get the "www.example.com%c0 was not found" message, which IIRC comes from a DNS lookup failure.

In other words ya can't even get there using %c0 in that manner.
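
For anyone who wants to clean such addresses in bulk (for example, a list of URLs copied from search results), here is a small Python sketch of the manual fix described above: dropping a percent escape that is glued to the end of the host name. `STRAY_ESCAPE` and `strip_host_escape` are made-up names, and the pattern only covers an escape sitting right before the first slash or at the end of the address.

```python
import re

# A percent escape attached to the end of the host name, directly before
# the path (or the end of the string). An optional http(s):// prefix is kept.
STRAY_ESCAPE = re.compile(r"^((?:https?://)?[^/%]+)%[0-9A-Fa-f]{2}(?=/|$)")


def strip_host_escape(url):
    """Remove a stray percent escape that follows the host name, if any."""
    return STRAY_ESCAPE.sub(r"\1", url)


print(strip_host_escape("www.mywebsite.com%c0/web/musica/povia.asp"))
```

Escapes that appear later in the path, such as a legitimate %20 in a file name, are left alone.
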

Too many things to contemplate, too little time, and not enough real information.

[edited by: theBear at 2:30 pm (utc) on Sep. 8, 2006]