209.237.233.203 - - [08/Apr/2004:06:43:08] "GET / HTTP/1.0" 200 4278 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:06:43:09] "GET /dir/page1.html HTTP/1.0" 200 8681 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:06:50:18] "GET /page2.html HTTP/1.0" 200 3801 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:06:50:18] "GET /page3.html HTTP/1.0" 200 2846 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:06:50:19] "GET / HTTP/1.0" 200 4278 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:06:50:20] "GET /dir/ HTTP/1.0" 200 4422 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:06:50:20] "GET /dir/page1.html HTTP/1.0" 200 8681 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:06:50:22] "GET /page4.html HTTP/1.0" 200 15061 "-" "http://almaden.ibm.com/cs/crawler/focus"
209.237.233.203 - - [08/Apr/2004:09:02:22] "GET / HTTP/1.0" 200 4278 "-" "me"
209.237.233.203 - - [08/Apr/2004:09:02:23] "GET /dir/page1.html HTTP/1.0" 200 8681 "-" "me"
209.237.233.203 - - [08/Apr/2004:09:08:16] "GET /page2.html HTTP/1.0" 200 3801 "-" "me"
209.237.233.203 - - [08/Apr/2004:09:08:16] "GET /page3.html HTTP/1.0" 200 2846 "-" "me"
209.237.233.203 - - [08/Apr/2004:09:08:17] "GET / HTTP/1.0" 200 4278 "-" "me"
209.237.233.203 - - [08/Apr/2004:09:08:17] "GET /dir/ HTTP/1.0" 200 4422 "-" "me"
209.237.233.203 - - [08/Apr/2004:09:08:18] "GET /dir/page1.html HTTP/1.0" 200 8681 "-" "me"
209.237.233.203 - - [08/Apr/2004:09:08:19] "GET /page4.html HTTP/1.0" 200 15061 "-" "me"
209.237.235.158 - - [09/Apr/2004:07:57:16 +1000] "GET /robots.txt HTTP/1.0" 200 398 "-" "os-heritrix/0.6.0 (+http://crawler.archive.org)"
209.237.235.158 - - [09/Apr/2004:07:57:19 +1000] "GET /downloads/file1.doc HTTP/1.0" 200 24576 "-" "os-heritrix/0.6.0 (+http://crawler.archive.org)"
209.237.235.158 - - [09/Apr/2004:08:20:29 +1000] "GET /sitecheck.internetseer.com HTTP/1.0" 404 1229 "-" "os-heritrix/0.6.0 (+http://crawler.archive.org)"
Please let us know via the contact info on our web page (or the "From:" header we send on every request) if our crawler seems to misbehave or cause any problems for your site.
The first (April 8th) batch of hits must be some other custom research crawler someone is running on an Internet Archive machine. I don't know exactly who; we work with a number of academics and research labs. The URL they give in your logs ( [almaden.ibm.com...] ) has more information and contact info, and says that they respect robots.txt, so if you see any troublesome traffic from their crawler, please contact them directly.
Hope this helps,
- Gordon @ IA
Welcome to WebmasterWorld [webmasterworld.com]!
It looks like your crawler identifies itself pretty well, and provides a link to archive.org's information page. But please help spread the word among crawler operators that the From: header is next to useless: very, very few commercial web hosting services log this header for their customers' use, so it is not visible to 99% of all Webmasters. The approach of putting the spider's name and contact info in the User-agent header is much more Web-friendly.
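For crawler operators reading along, here is a minimal Python sketch of that approach, assuming the standard-library urllib is used for fetching. The bot name, info URL, and contact address are hypothetical placeholders, not anything archive.org actually sends.

```python
import urllib.request

# Hypothetical crawler name, info page, and contact address (illustration only).
USER_AGENT = "examplebot/1.0 (+http://www.example.com/bot.html)"
CONTACT = "crawler-admin@example.com"


def build_request(url):
    """Build a request whose User-Agent carries the bot's name and info URL."""
    return urllib.request.Request(
        url,
        headers={
            # The User-Agent is what typically shows up in a Webmaster's access log.
            "User-Agent": USER_AGENT,
            # Still polite to send, but rarely logged by hosting services.
            "From": CONTACT,
        },
    )


if __name__ == "__main__":
    req = build_request("http://www.example.com/")
    # urllib stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
```

The point is simply that anything a Webmaster needs in order to identify or contact you should be in the one header they can actually see.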
Please understand that small Web sites see a disproportionately large number of malicious user-agents, and that it makes their Webmasters a little touchy about unknown user-agents, changing user-agents, and violations of robots.txt.
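As an illustration of why a changing User-agent stands out, here is a rough Python sketch of how a Webmaster might spot one in a log like the one posted above. It assumes combined-format lines (IP, timestamp, request, status, size, referer, user-agent); the regex and function names are my own and would need adjusting for other log formats.

```python
import re
from collections import defaultdict

# Assumed layout: ip - - [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def agents_by_ip(lines):
    """Map each client IP to the set of User-agent strings it presented."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:  # silently skip lines that do not fit the assumed layout
            seen[match.group("ip")].add(match.group("agent"))
    return dict(seen)


def changing_agents(lines):
    """Return only the IPs that used more than one User-agent string."""
    return {
        ip: sorted(agents)
        for ip, agents in agents_by_ip(lines).items()
        if len(agents) > 1
    }


if __name__ == "__main__":
    sample = [
        '209.237.233.203 - - [08/Apr/2004:06:43:08] "GET / HTTP/1.0" 200 4278 '
        '"-" "http://almaden.ibm.com/cs/crawler/focus"',
        '209.237.233.203 - - [08/Apr/2004:09:02:22] "GET / HTTP/1.0" 200 4278 '
        '"-" "me"',
        '209.237.235.158 - - [09/Apr/2004:07:57:16 +1000] "GET /robots.txt HTTP/1.0" '
        '200 398 "-" "os-heritrix/0.6.0 (+http://crawler.archive.org)"',
    ]
    for ip, agents in changing_agents(sample).items():
        print(ip, agents)
```

Run against the lines in this thread, a check like this would flag 209.237.233.203, which switched from the almaden.ibm.com string to "me" between visits, while the heritrix address would not be flagged.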
Again, your User-agent string looks good, but please help spread the word.
Thanks,
Jim