ia_archiver

archive.org

         

wilderness

10:04 am on Aug 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There's an interesting discussion going on over in alt.webamster concerning this bot and its "supposed" archiving.
It seems that some of the data is going to third parties on a fee basis, while the archive itself has not been updated in some time.

[google.com...]

Peeress

1:54 pm on Aug 1, 2003 (gmt 0)

10+ Year Member



Interesting.
I've got "ia_archiver" disallowed in my robots.txt
and it seems to work.
Their IP is 209.237.233.238 (www.archive.org), but I didn't realize
it's actually cgi7.archive.org [209.237.232.84].

As far as I know, archive.org seems to respect robots.txt: they look at it and leave, and I am not archived at the Wayback Machine. (Not sure about other places.)
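
For anyone who wants to check such a rule before relying on it, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are assumptions for illustration, not Peeress's actual file, and the check only shows how a well-behaved parser reads the rule, not what the real bot does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block only ia_archiver, leave other bots alone.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host, not a site from this thread.
    page = "http://www.example.com/index.html"
    for agent in ("ia_archiver", "Googlebot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, page))
```

A rule like this only keeps out bots that choose to honor robots.txt; it does not block anything at the server level.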

I don't disagree with their idea, it's just that in my case I prefer not to have a copy of my site there (or anywhere), and save bandwidth.

stevenha

2:26 pm on Aug 1, 2003 (gmt 0)

10+ Year Member



archive.org has a fresher copy of my site from Sept 14, 2002. And I don't remember seeing it there, about 6 months ago, which was the last time I checked. archive.org now shows a total of 23 different dates for my site, extending back to Oct 1999.

So maybe they grab sites and hold them internally for a while before posting them publicly as archive updates.

If others can find some even more recent updates... it could dispel the rumor that archive.org isn't behaving like it used to.

upside

6:47 am on Aug 2, 2003 (gmt 0)

10+ Year Member



Archive.org has always added pages about 6 months after they were spidered. This can be confirmed in their FAQ.

Also, Alexa is responsible for the ia_archiver bot. Alexa uses the data that the bot collects for two main purposes: to donate to archive.org and to sell commercially.

stevenha

3:51 pm on Aug 3, 2003 (gmt 0)

10+ Year Member



I know that some people block ia_archiver, but I think it's darn handy to let it store an archive.

When you have to challenge an unscrupulous webmaster over a copyright violation (for copying your content), showing them the archive on the Wayback Machine usually solves the problem quickly.

wilderness

4:34 pm on Aug 3, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



steven and upside,
I've had the Wayback Machine denied for some time.
Recently I considered the possibility of opening my sites up for archiving.

If that were the primary intent of archive.org, I would do so in a heartbeat.

I don't believe the third-party selling of data was ALWAYS in the realm of archive.org?

For me this puts archive.org in an entirely different light. IMO there is no difference between the mining and selling they do and all the others who use webmasters' resources to generate income from third parties without giving webmasters a share of the profits in return.

On a good point, archive.org might also be used in an emergency as a backup. At least to some extent.
I've even used it myself to gather data from websites which are no longer online.

The copyright verification is a good point as well.

Don

Peeress

4:43 pm on Aug 4, 2003 (gmt 0)

10+ Year Member



stevenha:
When you have to challenge an unscrupulous webmaster, regarding copyright violation (for copying your content), showing them the archive on the wayback machine, usually solves the problem quickly.

wilderness:
On a good point, archive.org might also be used in an emergency as a backup. At least to some extent.
I've even used it myself to gather data from websites which are no longer online.

Good points! I may change my mind about it archiving my site as well.

(also figured out how to do quotes in here lol)

sidyadav

4:05 am on Aug 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I had this "ia_archiver" thing on my site for quite a long time and it cost me A LOT of bandwidth, so I decided to ban it. I thought it wouldn't go away, but it looked at the robots.txt file and left :-)

claus

12:28 am on Aug 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Haven't banned it myself. I've been archived in 1998, 1999 and 2000, and no later.

I recently submitted my URL, and yesterday "robots.txt" was requested no less than 33 times by ia_archiver.

Bot IP was: 209.237.238.173

/claus
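
A count like the one claus describes can be pulled from an access log with a short script. This is a rough sketch that assumes Apache "combined" log format. The sample lines are invented for illustration (only the IP comes from the post above), and `count_bot_requests` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumes Apache "combined" log format; adjust the pattern for other formats.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def count_bot_requests(lines, agent_substring="ia_archiver", path="/robots.txt"):
    """Count requests for `path` per client IP where the user agent contains agent_substring."""
    counts = Counter()
    wanted = agent_substring.lower()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        if wanted in match.group("agent").lower() and match.group("path") == path:
            counts[match.group("ip")] += 1
    return counts


# Invented sample lines, for illustration only.
SAMPLE = [
    '209.237.238.173 - - [11/Aug/2003:10:00:01 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "ia_archiver"',
    '209.237.238.173 - - [11/Aug/2003:10:05:42 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "ia_archiver"',
    '192.0.2.10 - - [11/Aug/2003:10:06:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
]

if __name__ == "__main__":
    for ip, hits in count_bot_requests(SAMPLE).items():
        print(ip, hits)
```

To run it against a real log, pass an open file object instead of `SAMPLE`, since the function accepts any iterable of lines.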

pearl

3:08 am on Aug 12, 2003 (gmt 0)

10+ Year Member



ia is owned by Amazon now and they have new ideas. That's why you are seeing different activity nowadays.

The spider is also looking for Amazon links. They are beginning to integrate the old alexa into Amazon. Amazon affiliates are seeing the results of this now.

Google/Yahoo/Froogle watch out.

Pearl