Whose crawler is this?

         

dcrombie

12:44 pm on Apr 8, 2004 (gmt 0)



It looks like the IBM crawler, but it didn't request robots.txt and the IP address resolves to research.archive.org.
And it changed its name!

209.237.233.203 - - [08/Apr/2004:06:43:08] "GET / HTTP/1.0" 200 4278 "-" "http://almaden.ibm.com/cs/crawler/focus" 
209.237.233.203 - - [08/Apr/2004:06:43:09] "GET /dir/page1.html HTTP/1.0" 200 8681 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:06:50:18] "GET /page2.html HTTP/1.0" 200 3801 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:06:50:18] "GET /page3.html HTTP/1.0" 200 2846 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:06:50:19] "GET / HTTP/1.0" 200 4278 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:06:50:20] "GET /dir/ HTTP/1.0" 200 4422 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:06:50:20] "GET /dir/page1.html HTTP/1.0" 200 8681 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:06:50:22] "GET /page4.html HTTP/1.0" 200 15061 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:09:02:22] "GET / HTTP/1.0" 200 4278 "-" "me"
209.237.233.203 - - [08/Apr/2004:09:02:23] "GET /dir/page1.html HTTP/1.0" 200 8681 "-" "me"
209.237.233.203 - - [08/Apr/2004:09:08:16] "GET /page2.html HTTP/1.0" 200 3801 "-" "me"
209.237.233.203 - - [08/Apr/2004:09:08:16] "GET /page3.html HTTP/1.0" 200 2846 "-" "me"
209.237.233.203 - - [08/Apr/2004:09:08:17] "GET / HTTP/1.0" 200 4278 "-" "me"
209.237.233.203 - - [08/Apr/2004:09:08:17] "GET /dir/ HTTP/1.0" 200 4422 "-" "me"
209.237.233.203 - - [08/Apr/2004:09:08:18] "GET /dir/page1.html HTTP/1.0" 200 8681 "-" "me"
209.237.233.203 - - [08/Apr/2004:09:08:19] "GET /page4.html HTTP/1.0" 200 15061 "-" "me"
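
For anyone who wants to run the same check on their own logs, here is a minimal Python sketch that groups requests by IP, lists the distinct User-agent strings each IP sent, and notes whether it ever asked for robots.txt. The regex, the `summarize` helper, and the `SAMPLE` list are illustrative assumptions, not anything taken from a particular log tool; the pattern assumes combined-format lines like the ones above and tolerates a timestamp with or without a timezone offset.

```python
import re
from collections import defaultdict

# Assumed pattern for combined-format access log lines. The timestamp is
# matched loosely so it works with or without a timezone offset.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def summarize(lines):
    """Return {ip: {"agents": [...], "robots": bool}} for lines that parse."""
    summary = defaultdict(lambda: {"agents": [], "robots": False})
    for line in lines:
        m = LOG_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a recognizable log line
        entry = summary[m["ip"]]
        if m["agent"] not in entry["agents"]:
            entry["agents"].append(m["agent"])  # keep first-seen order
        if m["path"] == "/robots.txt":
            entry["robots"] = True
    return dict(summary)

# A few lines copied from this thread, used as sample input.
SAMPLE = [
    '209.237.233.203 - - [08/Apr/2004:06:43:08] "GET / HTTP/1.0" 200 4278 "-" "http://almaden.ibm.com/cs/crawler/focus"',
    '209.237.233.203 - - [08/Apr/2004:09:02:22] "GET / HTTP/1.0" 200 4278 "-" "me"',
    '209.237.235.158 - - [09/Apr/2004:07:57:16 +1000] "GET /robots.txt HTTP/1.0" 200 398 "-" "os-heritrix/0.6.0 (+http://crawler.archive.org)"',
]

if __name__ == "__main__":
    for ip, info in summarize(SAMPLE).items():
        flag = "requested robots.txt" if info["robots"] else "no robots.txt request"
        print(ip, info["agents"], flag)
```

An IP that shows more than one entry in `agents` is a candidate for the "changed its name" behaviour described here.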

Staffa

6:55 am on Apr 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



209.237.233.203

CustName: Internet Archive
Address: 1021 Mission Street
City: San Francisco
StateProv: CA
PostalCode: 94103
Country: US
RegDate: 2002-09-20
Updated: 2002-09-20

NetRange: 209.237.232.0 - 209.237.235.255

dcrombie

1:28 pm on Apr 9, 2004 (gmt 0)



Now they're back as something else - I'm tempted to block that IP range until they start taking their medication again ;)

209.237.235.158 - - [09/Apr/2004:07:57:16 +1000] "GET /robots.txt HTTP/1.0" 200 398 "-" "os-heritrix/0.6.0 (+http://crawler.archive.org)" 
209.237.235.158 - - [09/Apr/2004:07:57:19 +1000] "GET /downloads/file1.doc HTTP/1.0" 200 24576 "-" "os-heritrix/0.6.0 (+http://crawler.archive.org)"
209.237.235.158 - - [09/Apr/2004:08:20:29 +1000] "GET /sitecheck.internetseer.com HTTP/1.0" 404 1229 "-" "os-heritrix/0.6.0 (+http://crawler.archive.org)"
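
Before blocking anything, it may help to confirm which addresses actually fall inside the NetRange quoted earlier in the thread (209.237.232.0 - 209.237.235.255). The following is a rough Python sketch using the standard `ipaddress` module; the `in_range` helper and `NETWORKS` name are made up for illustration, and how the resulting CIDR block gets applied (firewall, server config) is left open.

```python
import ipaddress

# NetRange from the whois output quoted in this thread.
first = ipaddress.ip_address("209.237.232.0")
last = ipaddress.ip_address("209.237.235.255")

# Collapse the start/end pair into the smallest list of CIDR networks.
NETWORKS = list(ipaddress.summarize_address_range(first, last))

def in_range(ip):
    """True if the given address string falls inside the NetRange."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in NETWORKS)

if __name__ == "__main__":
    print([str(net) for net in NETWORKS])
    for ip in ("209.237.233.203", "209.237.235.158"):
        print(ip, in_range(ip))
```

Both addresses seen in the logs here fall inside that range, so a block on it would also shut out the Heritrix crawler that did request robots.txt.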

gojomo

6:49 pm on Apr 29, 2004 (gmt 0)



Hi, I work on the Internet Archive's new open source crawler project -- [crawler.archive.org....] Our crawler, Heritrix, is responsible for the second batch of hits (April 9th) you report. We typically crawl for our own historical web collection, or for various national libraries or archives with an interest in preserving parts of the web.

Please let us know via the contact info on our web page (or the "From:" header we send on every request) if our crawler seems to misbehave or cause any problems for your site.

The first (April 8th) batch of hits must be some other custom research crawler someone is running on an Internet Archive machine. I don't know exactly who; we work with a number of academics and research labs. The URL they give in your logs ( [almaden.ibm.com...] ) has more information and contact info, and says that they respect robots.txt, so if you see any troublesome traffic from their crawler please contact them directly.

Hope this helps,

- Gordon @ IA

jdMorgan

7:16 pm on Apr 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Gordon,

Welcome to WebmasterWorld [webmasterworld.com]!

It looks like your crawler identifies itself pretty well, and provides a link to archive.org's information page. But please help spread the word among crawler operators that the From: header is next to useless. Very, very few commercial web hosting services log this header for their customers' use, so it is not visible to 99% of all Webmasters. The approach of putting the spider's name and contact info in the User-agent header is much more Web-friendly.
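
As a rough illustration of that approach, here is a minimal Python sketch that builds a request whose User-agent carries the crawler name, version, and an info URL, in the same "name/version (+url)" shape as the Heritrix string in the logs above. The `build_request` helper, the bot name `examplebot`, and the example.com addresses are all hypothetical placeholders; the From: header is still set as an optional extra, on the assumption that it costs nothing to send even if few hosts log it.

```python
import urllib.request

def build_request(url, bot_name, bot_version, info_url, contact_email=None):
    """Build (but do not send) a request that identifies the crawler.

    All names passed in are placeholders supplied by the caller.
    """
    # Name, version, and a link to an information page, all in the
    # User-agent, since that is the header that shows up in access logs.
    user_agent = f"{bot_name}/{bot_version} (+{info_url})"
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    if contact_email:
        # Optional: rarely logged, so not a substitute for the above.
        req.add_header("From", contact_email)
    return req

if __name__ == "__main__":
    req = build_request(
        "http://example.com/",          # hypothetical target
        "examplebot", "1.0",            # hypothetical crawler name/version
        "http://example.com/bot.html",  # hypothetical info page
        contact_email="bot-admin@example.com",
    )
    print(req.header_items())
```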

Please understand that small Web sites see a disproportionately large number of malicious user-agents, and that this makes their Webmasters a little touchy about unknown User-agents, changing User-agents, and violations of robots.txt.

Again, your User-agent string looks good, but please help spread the word.

Thanks,
Jim