[16/Aug/2004:06:21:25 +0000] "GET /base/Training/Teachers/,/base/Training/Teachers/ HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
I've never seen a request that has two URIs separated by a comma. I've Googled it and can't find anything. I even checked RFC 1945 and had no luck. Has anyone seen this before?
Thanks,
Mark.
I'm getting tons of this in my logs and it's worrying the hell out of me. We finally got deep-crawled by Googlebot on 1 August, and by 3 August we were getting over 600 new users per day from Google. So life is pretty good right now and I don't want it to stop being good. Should I worry about this comma-separated stuff, or just count my blessings and not question things too much?
m.
Humans? The only human involved in this was me.
No, not just you - humans control the bot, humans might have linked to your site incorrectly and the bot picked it up, the logging format might be wrong (in which case all entries would look like that), etc.
Either way it's not your mistake (unless you have comma-delimited links on your site), and there isn't much you can do about it really.
The following makes me believe it's not human error:
1. No one links deep within our directory.
2. We have over 4000 pages within our directory, and this morning alone we've seen 281 requests for separate, distinct pages, all with a comma in the URL.
3. All those requests came from googlebot.
4. There is not a single instance of a human browser making a similar request.
5. I've grepped the entire document tree for ",/" and various permutations, and nothing came up.
3. All those requests came from googlebot.
Then it's Googlebot's issue - the same goes for the comma-separated URL: human errors do happen, people edit source code on live boxes, etc. If you look at the URL, it isn't really two different URLs, which might have suggested they tried to request two URLs in one go. Even if they had meant that, it would have been an illegal HTTP GET request.
I think it's a red herring - just keep an eye on it and relax.
P.S. Oh, were the IP addresses for those requests among the ones known to be used by Googlebot? Anyone can write a bot with a fake user-agent (and whoever did that is more likely to have made a mistake like this than the Google folks).
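One way to check, since Google doesn't publish an IP list, is a reverse-plus-forward DNS test. A rough Perl sketch (it assumes genuine crawler IPs reverse-resolve under googlebot.com - my assumption, not something from this thread):

#!/usr/bin/perl
# Sketch: sanity-check an IP that claims to be Googlebot by doing a
# reverse DNS lookup, then a forward lookup to confirm it round-trips.
use strict;
use Socket;

my $ip = shift or die "usage: $0 <ip>\n";

# Reverse lookup: packed IP -> hostname
my $name = gethostbyaddr(inet_aton($ip), AF_INET)
    or die "$ip has no reverse DNS - treat as suspect\n";
die "$ip reverse-resolves to $name, not under googlebot.com - suspect\n"
    unless $name =~ /\.googlebot\.com$/;

# Forward lookup: anyone can fake their own reverse DNS, so make sure
# the name resolves back to the same IP
my $packed = gethostbyname($name)
    or die "no forward DNS for $name - suspect\n";
print inet_ntoa($packed) eq $ip
    ? "$ip looks like the real Googlebot ($name)\n"
    : "$ip fails the forward check - suspect\n";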
GET /somebase/Travel-Food-Service/,/somebase/Travel-Food-Service/ HTTP/1.0
If-Modified-Since: Thu, 29 Jul 2004 08:00:00 GMT
User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)
From: googlebot(at)google.com
Accept: text/html,text/plain,application/*
Host: www.somewhere.com
We're getting heavily crawled now and most requests are good old single URLs, but I'm seeing a lot of this comma-separated junk. No idea why.
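In case anyone wants to count these, a quick throwaway Perl script tallies the comma requests per user-agent (combined log format assumed; the filename below is just a placeholder):

#!/usr/bin/perl
# Sketch: tally requests whose URL contains ",/" by user-agent,
# reading a combined-format Apache access log.
use strict;
my %ua;
while (<>) {
    next unless m{"GET [^"]*,/};    # request line contains ",/"
    $ua{$1}++ if /"([^"]*)"$/;      # last quoted field is the user-agent
}
printf "%6d  %s\n", $ua{$_}, $_ for sort keys %ua;

Run it as: perl comma_tally.pl access_log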
It's probably Apache's ErrorDocument - have a read of
[httpd.apache.org...]
Note that when you specify an ErrorDocument that points to a remote URL (ie. anything with a method such as "http" in front of it), Apache will send a redirect to the client to tell it where to find the document, even if the document ends up being on the same server. This has several implications, the most important being that the client will not receive the original error status code, but instead will receive a redirect status code. This in turn can confuse web robots and other clients which try to determine if a URL is valid using the status code.
Your server is not responding to these requests with an error (i.e. 404 Not Found). It responded with 302 Found. That won't help Googlebot notice the error.
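The practical upshot: point ErrorDocument at a local path, not a full URL. Roughly (file names made up for illustration):

# Local path: Apache serves the error page itself and the client
# still receives the original 404 status
ErrorDocument 404 /errors/notfound.html

# Full URL - even one on the same server - makes Apache answer with a
# 302 redirect first, which is exactly what the Googlebot log above shows
ErrorDocument 404 http://www.somewhere.com/errors/notfound.html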
Thanks. I'm running a custom error handler under mod_perl. Because it's a hierarchical directory that is updated daily, we often have pages that aren't there anymore. We use an error handler to send the user (or Google) up one level in the directory. You're right, though - these comma-separated things should be handled with a 404. Thanks for the tip. I'll fix it.
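Something along these lines should do it under mod_perl 1.x (the package name is made up and the "up one level" logic is simplified from what we actually run):

package My::ErrorHandler;
use strict;
use Apache::Constants qw(NOT_FOUND REDIRECT);

sub handler {
    my $r = shift;
    my $uri = $r->uri;

    # Junk like /base/Foo/,/base/Foo/ gets a hard 404 so crawlers drop it
    return NOT_FOUND if $uri =~ m{,/};

    # Genuinely missing pages still get bounced up one directory level
    (my $parent = $uri) =~ s{[^/]+/?$}{};
    $r->header_out(Location => $parent);
    return REDIRECT;
}
1;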
Do you have any JavaScript on your site with a comma-separated list of pages? eg.
document.write("<A href=\"mailto:" + new Array("nobody","nowhere.com").join("@") + "\">");
... Google took the above and then requested http://www.mysite.com/nowhere.com