"ROBOTS.TXT is a stupid"

So says Archiveteam

     
6:35 pm on Apr 26, 2015 (gmt 0)

New User

joined:Apr 23, 2015
posts:10
votes: 0


First seen today, this UA:

ArchiveTeam ArchiveBot/20150417.01 (wpull 1.1a1) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36

And Archiveteam hates robots.txt and says "ROBOTS.TXT is a stupid, silly idea in the modern era. Archive Team entirely ignores it and with precisely one exception, everyone else should too. ... Archive Team interprets ROBOTS.TXT as damage and temporary madness, and works around it. Everyone should. If you don't want people to have your data, don't put it online."

Wow. What they say ArchiveBot is: "You give it a URL to start at, and it grabs all content under that URL, records it in a WARC, and then uploads that WARC to ArchiveTeam servers for eventual injection into the Internet Archive (or other archive sites)."

Thanks but no thanks. Access denied.
8:58 pm on Apr 26, 2015 (gmt 0)

Senior Member from US 

lucy24 (WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member)

joined:Apr 9, 2011
posts:15804
votes: 845


for eventual injection into the Internet Archive

Does the Internet Archive know this? All of their own robots that I've met have been quite scrupulous about reading robots.txt on each visit.

You left out the best line. For, er, a given definition of "best" (emphasis mine):
[archiveteam.org...]
ArchiveBot understands robots.txt (please read the article) but does not match any directives. It uses it for discovering more links such as sitemaps however.

where "the article" is
[archiveteam.org...]
I'm working on the assumption that "such as" in the above means "and also".

Do they have an IP of their own or do we just block the UA?
9:24 pm on Apr 26, 2015 (gmt 0)

Senior Member from US 

tangor (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)

joined:Nov 29, 2005
posts:10281
votes: 1050


Robots.txt has always been voluntary, and for many years worked quite well in that regard. However, we all know human nature (not robots) and there's an evil side to some. Just like you can have a house guest get rowdy and puke on your sofa, you have those who have no ethics re: robots.txt.
4:46 am on Apr 29, 2015 (gmt 0)

New User

joined:Apr 23, 2015
posts:10
votes: 0


Only seen them from IP 192.99.32.115. I assume that is their own bot. They got / at first (and nothing else), then came back some time later and got all .png and .css links of that site -- now presumably "archived" somewhere. Why? Beyond reason.

Have not investigated "Internet Archive" re: this bot (and have had "User-agent: ia_archiver Disallow: /" for a long time in robots.txt).
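For illustration only, here is a rough sketch of what that ia_archiver rule covers when a robots.txt-compliant parser reads it. It uses Python's standard urllib.robotparser; the helper name `allowed` is made up for this example. It says nothing about how ArchiveBot itself behaves, since by its own documentation it does not match any directives.

```python
from urllib.robotparser import RobotFileParser

# The rule quoted in the post, written out as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def allowed(user_agent: str, url: str = "/") -> bool:
    """Return True if a compliant parser would let user_agent fetch url.

    Hypothetical helper for this sketch only.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    archivebot_ua = "ArchiveTeam ArchiveBot/20150417.01 (wpull 1.1a1)"
    print("ia_archiver allowed:", allowed("ia_archiver"))
    print("ArchiveBot allowed: ", allowed(archivebot_ua))
```

The rule names only ia_archiver and there is no "User-agent: *" group, so even a bot that did honour robots.txt under the ArchiveBot name would not be covered by it.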
7:58 am on Apr 29, 2015 (gmt 0)

Senior Member from US 

lucy24 (WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member)

joined:Apr 9, 2011
posts:15804
votes: 845


and have had "User-agent: ia_archiver Disallow: /" for a long time in robots.txt).

... which is just the difference between ia_archiver and ArchiveBot, isn't it?

192.99.32.115
:: shuffling papers ::
Oh, OVH. No worries then: they'll never get anything from me. Other than robots.txt, hahahahaha.
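A minimal sketch of the "block the UA or the IP, but still serve robots.txt" idea discussed in this thread. Everything here is an assumption for illustration: the function name `should_deny`, the substring "archivebot", and the single /32 entry for the one address reported above. A real setup would more likely live in the web server config and cover whatever ranges the site owner chooses.

```python
import ipaddress

# Assumed for this sketch: match on a lowercase substring of the UA.
BLOCKED_UA_SUBSTRINGS = ("archivebot",)

# Assumed for this sketch: only the single address reported in the thread.
BLOCKED_NETWORKS = (ipaddress.ip_network("192.99.32.115/32"),)


def should_deny(user_agent: str, remote_ip: str, path: str) -> bool:
    """Return True if the request should get "access denied".

    robots.txt is always served, matching the "other than robots.txt"
    remark above.
    """
    if path == "/robots.txt":
        return False
    ua = user_agent.lower()
    if any(fragment in ua for fragment in BLOCKED_UA_SUBSTRINGS):
        return True
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    ua = "ArchiveTeam ArchiveBot/20150417.01 (wpull 1.1a1)"
    print(should_deny(ua, "203.0.113.7", "/index.html"))
    print(should_deny("Mozilla/5.0", "192.99.32.115", "/style.css"))
    print(should_deny(ua, "192.99.32.115", "/robots.txt"))
```

Checking both the UA and the address means a renamed UA from the same address is still refused, and the same UA from a new address is too.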