Forum Moderators: Robert Charlton & goodroi

Obsessive-compulsive Googlebot Spidering

kennylucius

2:14 am on Jan 31, 2008 (gmt 0)

10+ Year Member



One week ago, I moved my site to another host. Today, on the old server, there were about 80 lines in the web log, all from the same G server. I expected a few DNS servers to lag behind, but what I saw in the log really caught my attention. Here’s a taste:


"GET /wiki/Planet_of_the_Apes_(film) HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/The_Truth_About_Chernobyl HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/The_Forgotten_Pollinators HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/New_Worlds,_Ancient_Texts HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/Anarchy,_State_and_Utopia HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/The_City_of_Lost_Children HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/Men's_Adventure_Magazines HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/Annal:1994_Carnegie_Medal HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/Tess_of_the_D'Urbervilles HTTP/1.1" ... Googlebot/2.1;

Notice anything weird? All of the URIs are exactly 30 characters long! I noticed immediately because all the columns lined up in my text editor.

I am somewhat curious as to why one of G’s servers (66.249.70.70) is crawling a domain that changed IPs seven days ago, but not curious enough to ask. Here's why I am posting: Why 30 characters?

FYI, G indexes over 26,000 pages from my site. G crawled 80 pages between 2am and 5pm, and all URIs were 30 chars.
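
For anyone who wants to check the same thing against their own raw log, here is a rough sketch that tallies distinct request-URI lengths. The regex assumes an Apache/NCSA-style request field; the sample lines are the ones quoted above:

```python
import re
from collections import Counter

# Sample lines taken from the log excerpt above (truncated records,
# as quoted in the post).
SAMPLE_LINES = [
    '"GET /wiki/Planet_of_the_Apes_(film) HTTP/1.1" ... Googlebot/2.1;',
    '"GET /wiki/The_Truth_About_Chernobyl HTTP/1.1" ... Googlebot/2.1;',
    '"GET /wiki/The_Forgotten_Pollinators HTTP/1.1" ... Googlebot/2.1;',
    '"GET /wiki/New_Worlds,_Ancient_Texts HTTP/1.1" ... Googlebot/2.1;',
    '"GET /wiki/Anarchy,_State_and_Utopia HTTP/1.1" ... Googlebot/2.1;',
    '"GET /wiki/The_City_of_Lost_Children HTTP/1.1" ... Googlebot/2.1;',
    '"GET /wiki/Men\'s_Adventure_Magazines HTTP/1.1" ... Googlebot/2.1;',
    '"GET /wiki/Annal:1994_Carnegie_Medal HTTP/1.1" ... Googlebot/2.1;',
    '"GET /wiki/Tess_of_the_D\'Urbervilles HTTP/1.1" ... Googlebot/2.1;',
]

# Matches the request field of a combined/common log line.
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/1\.[01]"')

def uri_length_counts(lines):
    """Tally how many requests were made at each request-URI length."""
    counts = Counter()
    for line in lines:
        m = REQUEST_RE.search(line)
        if m:
            counts[len(m.group(1))] += 1
    return counts

counts = uri_length_counts(SAMPLE_LINES)
# Every sample URI comes out at the same single length, which is the
# pattern described in the post.
```

Run it over a full day of Googlebot hits and a one-key result confirms the pattern; a normal crawl would spread across many lengths.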

tedster

5:09 am on Jan 31, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Always 30 characters long? That's one of the strangest observations ever - does the Google crawl team keep lists of URLs sorted by character length? I guess they might, but it's hard to imagine why.

It does sound like there must be a bad, not-updating DNS server somewhere, though. Otherwise how would your old server even get the request in the first place?

jomaxx

5:44 am on Jan 31, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've seen this exact thing happen a number of times, including from major search engine spiders (not sure about Google in particular). I always assumed that the list of URLs was being sorted in some way, and that identical-length URLs end up grouped together as a kind of side effect.

For example, if I take all the Googlebot requests from my raw log file and sort them on some character position way down in the record (somewhere in the middle of the browser ID field), then all the requests for URLs of the same length will end up being grouped together.

(Technically you have to take the number of characters in the filesize field into account as well, but sorting like this can certainly generate the kind of pattern we're talking about.)
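
The mechanism jomaxx describes is easy to reproduce as a toy. In the sketch below the log lines, IP address, and sort column are all invented for illustration - the point is only the side effect, not anything about Google's actual pipeline:

```python
# Toy reconstruction: sorting raw log lines on a character position
# "way down in the record" groups same-length URIs together, because a
# longer URI shifts everything after it to the right.

UA = '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'

def make_line(uri):
    # NCSA-style line with a fixed-width prefix, so the URI is the only
    # variable-width field before the sort column.
    return ('66.249.70.70 - - [31/Jan/2008:02:14:00 +0000] '
            '"GET %s HTTP/1.1" 200 5120 "-" %s' % (uri, UA))

uris = ["/wiki/A", "/wiki/Abcdef", "/wiki/Xy",
        "/wiki/Qrstuv", "/wiki/B", "/wiki/Mn"]
lines = [make_line(u) for u in uris]

# Sort on the tail of each record starting at an arbitrary column that
# falls after the URI (70 is made up - any offset past the longest URI
# works).  Lines whose URIs share a length present identical tails from
# that column onward, so they end up adjacent in the sorted output.
POS = 70
ordered = sorted(lines, key=lambda ln: ln[POS:])

ordered_lengths = [len(ln.split('"GET ')[1].split(' HTTP')[0])
                   for ln in ordered]
# ordered_lengths now shows each distinct URI length as one contiguous
# run, even though the input lengths were interleaved.
```

So a crawler (or any tool in the pipeline) that happens to sort its fetch list on some byte offset would emit same-length URLs in bursts, exactly like the log excerpt at the top of the thread.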