WaybackMachine and Scrapers

Do scrapers scrape archive.org?

Umbra

3:48 pm on Jan 9, 2007 (gmt 0)

(Hope this is the relevant forum to post this thread)

I'm thinking of allowing archive.org to index one of our websites again. Does anyone know of, or has anyone heard of, scrapers scraping the Internet Archive?

wilderness

7:32 pm on Jan 9, 2007 (gmt 0)

Umbra,
I had denied archive.org and ia_archiver access to my sites for what seems like forever.

About nine months ago, I decided to have them archive my sites.

The first step was editing robots.txt.
Then I lifted the blocks on some of the IP ranges that I was aware of
(apparently, however, not all of them).
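
For anyone wanting to double-check the robots.txt step before waiting months on a crawl, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents, the example.com URLs, and the is_allowed helper are all made up for illustration; they are not the rules from the sites discussed here. It only shows what a compliant parser would conclude, not what any particular bot actually does, and it says nothing about IP-level blocks.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (an assumption for illustration): ia_archiver may
# crawl the site but is told to stay out of /images/.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /images/

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this UA may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/widgets.html", "/images/logo.gif"):
        url = "http://www.example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "ia_archiver", url))
```

Pasting in the real robots.txt and a handful of real page URLs would at least confirm that the file says what was intended.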

The bot crawls from multiple IPs and uses multiple UAs.
After all this time, any search for my sites through archive.org still fails to show any of my pages and results in a BLOCKED ERROR.

So much for my hopes of having my pages archived.
In addition, when they were crawling (after resumption), the bot began crawling images that were clearly defined as out of bounds in robots.txt. Thus, I'm not too eager to expand their ranges of access.

BTW, my sites link to numerous materials on "widgets" that once provided good source material but have since been removed from active pages. Archive.org is one method of keeping the materials available.
I do not use any software to explore these links; rather, it's done manually.

Perhaps somebody could test wget or some other software on one of their results pages?
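
If somebody wants to try that without wget, a rough Python equivalent might look like the sketch below. The web.archive.org/web/<timestamp>/<url> address pattern is the usual one for archived pages; the helper names, the example.com address, the "2007" timestamp, and the wget-style User-Agent string are assumptions for illustration. Whether archive.org serves or refuses such automated requests is exactly the open question, so the sketch just reports the HTTP status it gets back.

```python
import urllib.error
import urllib.request


def wayback_url(original_url: str, timestamp: str = "2007") -> str:
    """Build the archive.org address for a page (timestamp is a placeholder)."""
    return f"http://web.archive.org/web/{timestamp}/{original_url}"


def fetch_status(url: str, user_agent: str = "Wget/1.10.2") -> int:
    """Request the URL with the given User-Agent and return the HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # A refusal (e.g. 403) is still a useful answer for this test.
        return err.code


if __name__ == "__main__":
    target = wayback_url("http://www.example.com/")
    print(target, fetch_status(target))
```

Running it with a few different User-Agent strings against one of your own archived pages would show whether the response changes for automated clients.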

Don