special archiver

         

keyplyr

11:12 pm on Jun 12, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



UA: Mozilla/5.0 (compatible; special_archiver/3.3.0 +http://www.loc.gov/webarchiving/notice_to_webmasters.html)
Protocol: HTTP/1.0
Robots.txt: Yes
Host: wbgrp-crawl202.us.archive.org (Internet Archive)
IP range: 207.241.224.0 - 207.241.239.255
CIDR: 207.241.224.0/20
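
For anyone who wants to test addresses from their logs against that block, here is a minimal sketch using Python's standard `ipaddress` module. The names `IA_NETWORK` and `in_ia_range` are made up for illustration, and the only data assumed is the range quoted above.

```python
import ipaddress

# CIDR block quoted in this post.
IA_NETWORK = ipaddress.ip_network("207.241.224.0/20")


def in_ia_range(addr: str) -> bool:
    """Return True if addr parses as an IP inside 207.241.224.0/20."""
    try:
        return ipaddress.ip_address(addr) in IA_NETWORK
    except ValueError:
        # Not a valid IP address string (e.g. a malformed log field).
        return False


# The dashed range and the CIDR notation describe the same block.
assert str(IA_NETWORK[0]) == "207.241.224.0"
assert str(IA_NETWORK[-1]) == "207.241.239.255"
```

Calling `in_ia_range()` on the remote address of a request is enough to flag hits from this block; what to do with a match (deny, log, rate-limit) is up to the site owner.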

In December 2016, the Internet Archive and the Library of Congress Web Archiving Team, fearing that the incoming administration might compromise the archived index, moved or mirrored it to servers in Canada, and the new crawl host and UA were created.

thetrasher

6:34 pm on Jun 13, 2017 (gmt 0)

10+ Year Member



robots.txt rules are ignored:
"Our crawler is instructed to bypass robots.txt in order to obtain the most complete and accurate representation of websites such as yours."

keyplyr

6:42 pm on Jun 13, 2017 (gmt 0)

I've blocked all Internet Archive agents for years. I had my sites removed from their index about 12 years ago for multiple reasons.

keyplyr

1:32 am on Jun 26, 2017 (gmt 0)

Here's another Internet Archive UA from the same range, also disobeying robots.txt even where it is disallowed:

Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)
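
Since robots.txt is reportedly ignored, any blocking has to happen on the server side. Below is a rough sketch of matching the two UA strings reported in this thread. The names `IA_UA_PATTERN` and `is_ia_agent` are hypothetical, and the pattern covers only the two tokens seen here, not every agent the Internet Archive may run.

```python
import re

# Tokens taken from the two UA strings posted in this thread.
IA_UA_PATTERN = re.compile(r"special_archiver|archive\.org_bot", re.IGNORECASE)


def is_ia_agent(user_agent: str) -> bool:
    """Return True if the User-Agent header contains either reported token."""
    return bool(IA_UA_PATTERN.search(user_agent or ""))
```

A UA string is trivial to spoof or change, so a check like this is best treated as a first filter alongside an IP range check rather than a replacement for one.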