[16/Aug/2004:06:21:25 +0000] "GET /base/Training/Teachers/,/base/Training/Teachers/ HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
I've never seen a request that has two URIs separated by a comma. I've Googled it and can't find anything. I even checked RFC 1945 and had no luck. Has anyone seen this before?
Thanks,
Mark.
I'm getting tons of this in my logs and it's worrying the hell out of me. We finally got deep-crawled by Googlebot on 1 August, and by 3 August we were getting over 600 new users per day from Google. So life is pretty good right now and I don't want it to stop being good. Should I worry about this comma-separated stuff, or just count my blessings and not question things too much?
m.
Humans? The only human involved in this was me.
No, not just you - humans control the bot, humans might have linked to your site incorrectly and the bot picked it up, the logging format might be wrong (in which case all entries would look like that), etc.
Either way it's not your mistake (unless you have comma-delimited links on your site), and there isn't much you can do about it really.
The following makes me believe it's not human error:
1. No one links deep within our directory.
2. We have over 4000 pages within our directory, and this morning alone we've seen 281 requests for separate, distinct pages, all with a comma in the URL.
3. All those requests came from googlebot.
4. There is not a single instance of a human browser making a similar request.
5. I've grepped the entire document tree for ",/" and various permutations, and nothing came up.
3. All those requests came from googlebot.
Then it's Googlebot's issue - the same goes for the comma-separated URL: human errors do happen, people edit source code on live boxes, etc. If you look at the URL, it isn't really two different URLs, which might have suggested they tried to request two URLs in one go. Even if they had meant that, it would have been an illegal HTTP GET request.
I think it's a red herring - just keep an eye on it and relax.
P.S. Oh, were the IP addresses for those requests among the ones known to be used by Googlebot? Anyone can write a bot with a fake user-agent (and whoever did that is more likely to have made a mistake like this than the Google folks).
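One way to check, since Google doesn't publish an IP list, is a reverse-plus-forward DNS test. A rough Perl sketch (it assumes genuine crawler IPs reverse-resolve under googlebot.com - my assumption, not something from this thread):

#!/usr/bin/perl
# Sketch: sanity-check an IP that claims to be Googlebot by doing a
# reverse DNS lookup, then a forward lookup to confirm it round-trips.
use strict;
use Socket;

my $ip = shift or die "usage: $0 <ip>\n";

# Reverse lookup: packed IP -> hostname
my $name = gethostbyaddr(inet_aton($ip), AF_INET)
    or die "$ip has no reverse DNS - treat as suspect\n";
die "$ip reverse-resolves to $name, not under googlebot.com - suspect\n"
    unless $name =~ /\.googlebot\.com$/;

# Forward lookup: anyone can fake their own reverse DNS, so make sure
# the name resolves back to the same IP
my $packed = gethostbyname($name)
    or die "no forward DNS for $name - suspect\n";
print inet_ntoa($packed) eq $ip
    ? "$ip looks like the real Googlebot ($name)\n"
    : "$ip fails the forward check - suspect\n";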
GET /somebase/Travel-Food-Service/,/somebase/Travel-Food-Service/ HTTP/1.0
If-Modified-Since: Thu, 29 Jul 2004 08:00:00 GMT
User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)
From: googlebot(at)google.com
Accept: text/html,text/plain,application/*
Host: www.somewhere.com
We're getting heavily crawled now and most requests are good old single URLs, but I'm seeing a lot of this comma-separated junk. No idea why.
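In case anyone wants to count these, a quick throwaway Perl script tallies the comma requests per user-agent (combined log format assumed; the filename below is just a placeholder):

#!/usr/bin/perl
# Sketch: tally requests whose URL contains ",/" by user-agent,
# reading a combined-format Apache access log.
use strict;
my %ua;
while (<>) {
    next unless m{"GET [^"]*,/};    # request line contains ",/"
    $ua{$1}++ if /"([^"]*)"$/;      # last quoted field is the user-agent
}
printf "%6d  %s\n", $ua{$_}, $_ for sort keys %ua;

Run it as: perl comma_tally.pl access_log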
It's probably Apache's ErrorDocument - have a read of
[httpd.apache.org...]
Note that when you specify an ErrorDocument that points to a remote URL (ie. anything with a method such as "http" in front of it), Apache will send a redirect to the client to tell it where to find the document, even if the document ends up being on the same server. This has several implications, the most important being that the client will not receive the original error status code, but instead will receive a redirect status code. This in turn can confuse web robots and other clients which try to determine if a URL is valid using the status code.
Your server is not responding to these requests with an error (i.e. 404 Not Found). It responded with 302 Found. That won't help Googlebot notice the error.
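The practical upshot: point ErrorDocument at a local path, not a full URL. Roughly (file names made up for illustration):

# Local path: Apache serves the error page itself and the client
# still receives the original 404 status
ErrorDocument 404 /errors/notfound.html

# Full URL - even one on the same server - makes Apache answer with a
# 302 redirect first, which is exactly what the Googlebot log above shows
ErrorDocument 404 http://www.somewhere.com/errors/notfound.html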
Thanks. I'm running a custom error handler under mod_perl. Because it's a hierarchical directory that is updated daily, we often have pages that aren't there anymore. We use an error handler to send the user (or Google) up one level in the directory. You're right, though - these comma-separated things should be handled with a 404. Thanks for the tip. I'll fix it.
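Something along these lines should do it under mod_perl 1.x (the package name is made up and the "up one level" logic is simplified from what we actually run):

package My::ErrorHandler;
use strict;
use Apache::Constants qw(NOT_FOUND REDIRECT);

sub handler {
    my $r = shift;
    my $uri = $r->uri;

    # Junk like /base/Foo/,/base/Foo/ gets a hard 404 so crawlers drop it
    return NOT_FOUND if $uri =~ m{,/};

    # Genuinely missing pages still get bounced up one directory level
    (my $parent = $uri) =~ s{[^/]+/?$}{};
    $r->header_out(Location => $parent);
    return REDIRECT;
}
1;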
Do you have any JavaScript on your site with a comma-separated list of pages? eg.
document.write("<A href=\"mailto:" + new Array("nobody","nowhere.com").join("@") + "\">");
... Google took the above and then requested http://www.mysite.com/nowhere.com