"GET /wiki/Planet_of_the_Apes_(film) HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/The_Truth_About_Chernobyl HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/The_Forgotten_Pollinators HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/New_Worlds,_Ancient_Texts HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/Anarchy,_State_and_Utopia HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/The_City_of_Lost_Children HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/Men's_Adventure_Magazines HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/Annal:1994_Carnegie_Medal HTTP/1.1" ... Googlebot/2.1;
"GET /wiki/Tess_of_the_D'Urbervilles HTTP/1.1" ... Googlebot/2.1; Notice anything weird? All of the URIs are exactly 30 characters long! I noticed immediately because all the columns lined up in my text editor.
I'm somewhat curious as to why one of G's servers (66.249.70.70) is still hitting the old IP of a domain that changed IPs seven days ago, but not curious enough to ask. Here's why I am posting: why 30 characters?
FYI, G indexes over 26,000 pages from my site. G crawled 80 pages between 2am and 5pm, and all URIs were 30 chars.
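The pattern is easy to confirm mechanically rather than by eyeballing columns. A minimal sketch, reusing three of the requests quoted above (truncated to the interesting fields):

```python
import re
from collections import Counter

# Three of the Googlebot requests quoted above, abbreviated.
log_lines = [
    '"GET /wiki/Planet_of_the_Apes_(film) HTTP/1.1" ... Googlebot/2.1;',
    '"GET /wiki/The_Truth_About_Chernobyl HTTP/1.1" ... Googlebot/2.1;',
    '"GET /wiki/Anarchy,_State_and_Utopia HTTP/1.1" ... Googlebot/2.1;',
]

# Pull the request URI out of each record and tally its length.
lengths = Counter(
    len(m.group(1))
    for line in log_lines
    if (m := re.search(r'"GET (\S+) HTTP', line))
)

print(lengths)  # a single bucket: every URI here is the same length
```

Run against a full day's raw log, a single dominant bucket would confirm the observation at a glance.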
It does sound like there must be a stale DNS server somewhere that isn't picking up the new records, though. Otherwise, how would your old server even receive the request in the first place?
For example, if I take all the Googlebot requests from my raw log file and sort them on some character position way down in the record (somewhere in the middle of the browser ID field), then all the requests for URLs of the same length will end up being grouped together.
(Technically the number of digits in the filesize field shifts things too, since it sits between the URL and the sort position, but sorting this way can certainly generate the kind of pattern we're talking about.)
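That sorting effect can be sketched in a few lines. The log records below are invented stand-ins in combined-log style (made-up IP, date, URIs, and byte counts, not the poster's real data), kept fixed-width everywhere except the URI so the mechanism is visible:

```python
import re

# Made-up user-agent string; everything on each line except the URI is
# fixed-width, so the URI length alone shifts the rest of the record.
UA = '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'

def record(uri, size):
    return f'66.249.70.70 - - [01/Jan/2024] "GET {uri} HTTP/1.1" 200 {size} "-" {UA}'

lines = [
    record("/page/abcd", "1234"),      # 10-char URI
    record("/page/abcdefgh", "4321"),  # 14-char URI
    record("/page/wxyz", "5678"),      # 10-char URI
    record("/page/ijklmnop", "8765"),  # 14-char URI
]

# Sort on the record's tail starting at a fixed character position deep in
# the line (here it lands in or near the user-agent field). Records whose
# URIs have the same length see the same text at that position, so they
# end up adjacent. (Byte counts are kept at four digits; varying widths
# would shift the offset too, as the parenthetical above notes.)
lines.sort(key=lambda line: line[70:])

for line in lines:
    uri = re.search(r'"GET (\S+) HTTP', line).group(1)
    print(len(uri), uri)
```

With equal-width byte counts, everything at the fixed offset is determined by the URI length alone, so equal-length URIs cluster together after the sort, which is exactly the grouping described above.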