Forum Moderators: goodroi

"ROBOTS.TXT is a stupid"

So says ArchiveTeam

default password

6:35 pm on Apr 26, 2015 (gmt 0)

10+ Year Member



Saw this UA for the first time today:

ArchiveTeam ArchiveBot/20150417.01 (wpull 1.1a1) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36

ArchiveTeam hates robots.txt and says: "ROBOTS.TXT is a stupid, silly idea in the modern era. Archive Team entirely ignores it and with precisely one exception, everyone else should too. ... Archive Team interprets ROBOTS.TXT as damage and temporary madness, and works around it. Everyone should. If you don't want people to have your data, don't put it online."

Wow. Here is how they describe ArchiveBot: "You give it a URL to start at, and it grabs all content under that URL, records it in a WARC, and then uploads that WARC to ArchiveTeam servers for eventual injection into the Internet Archive (or other archive sites)."

Thanks but no thanks. Access denied.

lucy24

8:58 pm on Apr 26, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



for eventual injection into the Internet Archive

Does the Internet Archive know this? All of their own robots that I've met have been quite scrupulous about reading robots.txt on each visit.

You left out the best line. For, er, a given definition of "best" (emphasis mine):
[archiveteam.org...]
ArchiveBot understands robots.txt (please read the article) but does not match any directives. It uses it for discovering more links such as sitemaps however.

where "the article" is
[archiveteam.org...]
I'm working on the assumption that "such as" in the above means "and also".

Do they have an IP of their own or do we just block the UA?

tangor

9:24 pm on Apr 26, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Robots.txt has always been voluntary, and for many years worked quite well in that regard. However, we all know human nature (not robots) and there's an evil side to some. Just like you can have a house guest get rowdy and puke on your sofa, you have those who have no ethics re: robots.txt.

default password

4:46 am on Apr 29, 2015 (gmt 0)

10+ Year Member



I have only seen them from IP 192.99.32.115, which I assume is their own bot. They fetched / at first (and nothing else), then came back some time later and fetched all the .png and .css files linked from that site, now presumably "archived" somewhere. Why? Beyond reason.

I have not investigated the "Internet Archive" connection to this bot (and have had "User-agent: ia_archiver Disallow: /" for a long time in robots.txt).

lucy24

7:58 am on Apr 29, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



and have had "User-agent: ia_archiver Disallow: /" for a long time in robots.txt).

... which is just the difference between ia_archiver and ArchiveBot, isn't it?
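To illustrate the scoping point, here is a minimal sketch using Python's standard urllib.robotparser. It assumes the rule quoted above sits on two lines in robots.txt and uses example.com as a placeholder URL. It only shows how a compliant parser would read the rule; by its own statement, ArchiveBot does not match any directives at all.

```python
from urllib.robotparser import RobotFileParser

# The rule from the thread, written as two lines (assumed layout).
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder, not a site from the thread.
URL = "http://example.com/page.html"

ia_ok = allowed(ROBOTS_TXT, "ia_archiver", URL)
archivebot_ok = allowed(
    ROBOTS_TXT, "ArchiveTeam ArchiveBot/20150417.01 (wpull 1.1a1)", URL
)

print("ia_archiver allowed:", ia_ok)
print("ArchiveBot allowed:", archivebot_ok)
```

The group names only ia_archiver, so even a crawler that honors robots.txt under the ArchiveBot name would find no rule addressed to it.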

192.99.32.115
:: shuffling papers ::
Oh, OVH. No worries then: they'll never get anything from me. Other than robots.txt, hahahahaha.
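For anyone who wants to do the blocking in application code instead of server config, a rough sketch of the "block the UA or block the IP" choice discussed above. The function name should_deny and the two block lists are hypothetical; the only address listed is the single IP reported in this thread, not a full hosting range.

```python
import ipaddress

# Hypothetical block lists. The UA token and the IP come from this thread;
# widening the network to a whole hosting range is left to the reader.
BLOCKED_UA_TOKENS = ("archivebot",)
BLOCKED_NETWORKS = (ipaddress.ip_network("192.99.32.115/32"),)


def should_deny(user_agent: str, remote_ip: str) -> bool:
    """Return True if the request matches a blocked UA token or network."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return True
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Unparseable address: this sketch does not block on IP.
        return False
    return any(addr in net for net in BLOCKED_NETWORKS)


print(should_deny("ArchiveTeam ArchiveBot/20150417.01 (wpull 1.1a1)", "203.0.113.5"))
print(should_deny("Mozilla/5.0", "192.99.32.115"))
```

Matching on the UA is trivially defeated by a renamed bot, and matching on one IP is defeated by a move to another host, which is why checking both is common.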