Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
archive.org bot
bad bot
keyplyr




msg:4632065
 7:46 am on Dec 19, 2013 (gmt 0)



So archive.org_bot requests robots.txt, where it is disallowed (as is its partner in crime, ia_archiver), then continues to crawl through my site, including requests for CSS and image files. Oh, and I forgot to mention: my domain has been excluded (by me) from the Internet Archive since the beginning of time.
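
For anyone who wants to rule out a robots.txt mistake before blaming the bot, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content, the sample path, and the helper name is_allowed are assumptions for illustration, not the poster's actual file or setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: both Internet Archive agents are disallowed
# site-wide, every other agent is allowed everywhere.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(user_agent: str, path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ("archive.org_bot", "ia_archiver", "ExampleBot"):
        print(agent, is_allowed(agent, "/css/site.css"))
```

If both archive agents come back as not allowed and they still fetch pages, the file is not the problem.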

Webmasters: User Agent archive.org_bot is used for our wide crawl of the web. It is designed to respect robots.txt and META robots tags. Read more about Robots.txt. We try to crawl at a pace slow enough to not disrupt normal web activity. You can learn more in the Wayback Machine FAQs. If you notice archive.org_bot behaving badly, please contact us at bot.archive.org


Yeah, right... so now blocked by IP ranges.
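
Blocking by IP range comes down to a membership test against a list of networks. The sketch below shows that test with Python's ipaddress module. The ranges are reserved documentation blocks used as placeholders, NOT the Internet Archive's real ranges, and the names DENY_RANGES and is_denied are made up for the example.

```python
import ipaddress

# Placeholder ranges taken from the reserved documentation blocks.
# These are NOT real crawler ranges; substitute ranges you have
# verified yourself from logs and whois lookups.
DENY_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_denied(remote_addr: str, deny_ranges=DENY_RANGES) -> bool:
    """Return True if remote_addr falls inside any denied network."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in deny_ranges)
```

In practice the same check is usually done in the server configuration or firewall rather than in application code; the Python version is only meant to show the logic.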

 

wilderness




msg:4632247
 7:48 pm on Dec 19, 2013 (gmt 0)

keyplyr,
This thing has been a PITA for webmasters since its inception. They've never honored robots.txt, despite what they claim.
Initially, there was a belief that it could be used to document copyright and resolve plagiarism issues; however, that doesn't hold water. There also used to be a disclaimer on their website that they sold data by the terabyte, although that practice was stopped some years back.

A decade ago, they used a handful of IP ranges, which I failed to make note of when adding them to my denies.
At one point, I thought it would be a sound administrative decision to grant them access.
Unfortunately, I couldn't locate all their IPs to remove their "deny access" label from my websites and on their website. Over multiple communications, they just couldn't understand, nor did they ever provide all the IP ranges they spider from so that I could remove the denies.
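
The trouble described here, denies that can no longer be traced back to their owner, can be avoided by recording who each range belongs to when it is added. A rough Python sketch of that bookkeeping follows. The DenyEntry structure, the helper names, and the sample data are all hypothetical, and the ranges are reserved documentation blocks rather than real ones.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class DenyEntry:
    network: ipaddress.IPv4Network
    owner: str  # who the range was attributed to when it was added
    note: str   # why it was denied


def entries_for(owner: str, entries):
    """Return the deny entries recorded for a given owner."""
    return [e for e in entries if e.owner == owner]


def without_owner(owner: str, entries):
    """Return the deny list with every entry for owner removed."""
    return [e for e in entries if e.owner != owner]


# Placeholder data only; owners and ranges are invented for illustration.
DENIES = [
    DenyEntry(ipaddress.ip_network("192.0.2.0/24"), "example-archive", "ignores robots.txt"),
    DenyEntry(ipaddress.ip_network("198.51.100.0/24"), "example-archive", "ignores robots.txt"),
    DenyEntry(ipaddress.ip_network("203.0.113.0/24"), "example-host", "scraper farm"),
]
```

With a record like this, lifting every deny for one owner is a single call to without_owner instead of a hunt through old configuration files.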

In the end, I'm glad it worked out that way.

There are three major widget websites that get traffic leaps and bounds beyond that of all other widget sites. Until a couple of years ago, two of the major widget sites had all their old pages on archive.org; however, once the two majors implemented denies, all that old stuff disappeared from public access, which is what the majors intended.

dstiles




msg:4632258
 8:55 pm on Dec 19, 2013 (gmt 0)

I block the IP ranges more incidentally than specifically, since they are part of larger web farms.

I also block by UA, which catches not only the "legit" ones but Chinese fakers as well.
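
A UA block of this kind is typically a case-insensitive substring match, which is why it also catches requests that only borrow the name. A minimal Python sketch, assuming the two agent names mentioned in this thread as the tokens; the function name and the sample strings in the usage note are illustrative only.

```python
# Tokens taken from the agent names discussed in this thread.
BLOCKED_UA_TOKENS = ("archive.org_bot", "ia_archiver")


def is_blocked_user_agent(user_agent, tokens=BLOCKED_UA_TOKENS) -> bool:
    """Return True if the User-Agent header contains any blocked token."""
    ua = (user_agent or "").lower()  # tolerate a missing header
    return any(token in ua for token in tokens)
```

For example, is_blocked_user_agent("Mozilla/5.0 (compatible; archive.org_bot)") matches regardless of which IP the request came from, while an ordinary browser string does not.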

keyplyr




msg:4632260
 9:06 pm on Dec 19, 2013 (gmt 0)



Guess I forgot to mention the IP range was legit, assigned to Internet Archive. At one time their bots obeyed robots.txt, no longer it seems. I did email an abuse report to them. So far no answer.

tangor




msg:4632313
 12:54 am on Dec 20, 2013 (gmt 0)

Guess I forgot to mention the IP range was legit, assigned to Internet Archive. At one time their bots obeyed robots.txt, no longer it seems. I did email an abuse report to them. So far no answer.

Unlikely you will get one.

Currently have a long running battle with them... but goodness gracious, it is hard work against deep pockets (check out their political support).

wilderness




msg:4632363
 5:13 am on Dec 20, 2013 (gmt 0)

Considering they're part of the archival of complete books and/or complete libraries, and using the same high-priced scanning machines that Google Books is using?

It is difficult to imagine they have time to communicate and appease a meager webmaster.

keyplyr




msg:4632377
 7:20 am on Dec 20, 2013 (gmt 0)



...meager webmaster


That's me :)
