ia_archiver: to block or not to block?

this bot hits me constantly and sends me no visitors

berli

10:29 pm on May 15, 2003 (gmt 0)

I'd like people's opinions on Alexa/ia_archiver/Wayback Machine. I would completely block ia_archiver if it were just Alexa, because I'm not getting any referrals from them while they hog my server time. On the other hand, I use WayBack, and I wouldn't mind having my stuff archived there. On the other, other hand, Wayback is pretty buggy and lots of ostensible links just 404.

jdMorgan

10:54 pm on May 15, 2003 (gmt 0)

berli,

Welcome to WebmasterWorld [webmasterworld.com]!

I'm in the same boat - Not much traffic from Alexa.

However, it will be nice to have the WayBack Machine if I ever have to prove prior publication of a page in a copyright dispute. So, for me, it's worth the bandwidth.

I do limit their spidering of my sites to "valuable" pages using robots.txt, though. This keeps the bandwidth reasonable.

Jim

athinktank

1:05 am on May 17, 2003 (gmt 0)

I tend to think of the Way Back thang as something good for the internet. I will not block the ia_archiver. I just accept that I will not get ANY traffic from Alexa, and bite the bullet. As developers of *stuff* we have to respect the medium and not just look for traffic all the time.

just my $0.02.

jdMorgan

1:26 am on May 17, 2003 (gmt 0)

athinktank,

I agree about "respecting the medium." I also believe it works both ways: one of the tests I use to determine whether to block a user-agent is to ask, "Is it useful to my visitors, to me, or to the public at large?"

Clearly, IA can be useful, so I don't block it.

There are other user-agents which are abusive and serve no purpose except to provide a product for sale to others at our expense. Those, I block.

My primary perspective is not one of sales or traffic, but of keeping the internet useful -- doing my part to reduce namespace, search engine index, and e-mail clutter, and wasteful use of bandwidth -- mine and others'.

Tilting at windmills, I suppose, but that's me...

Jim

athinktank

2:17 am on May 17, 2003 (gmt 0)

I guess I was speaking more to myself, and not anyone in particular. I am about sales and traffic and have to keep reminding myself not to be part of the "clutter".

I'm with ya, jdMorgan.

Chris_R

2:22 am on May 17, 2003 (gmt 0)

I agree with the other posts - I would absolutely not block it (unless you have some sort of weird copyright problem).

Even if they aren't sending you traffic now, it is a useful service and can provide unintended benefits for you.

For example - I used wayback to investigate a company I have been thinking about doing business with. I found something that makes it more likely that I will do business than if I hadn't found it at all. In reality - it probably wouldn't have mattered in my case, but it could.

Also, you never know who will buy who in the future and what that data will be used for. Blocking it could prevent a semi-useful source of info from being used in something bigger.

carfac

4:37 am on May 17, 2003 (gmt 0)

I block it, because I find it so obnoxious. I find that, in my case, it comes too often and stays too long. To me, it was not worth the bandwidth it was hogging. But that is me! Bet Don blocks the whole IP range! :)

dave

mack

4:52 am on May 17, 2003 (gmt 0)

Generally I block ia_archiver, but I have recently removed it from my robots.txt.

I did have problems with it in the past. But I also see it as being a useful "historical" tool in the near future.

I don't think the general public really knows that archive.org exists, and if they did I think it would be very popular with people wanting to know what Yahoo etc. looked like in the "good old days".

So I would say it is useful to web users, and I have decided to allow it in again. If, however, the bot gets out of hand, I will contradict myself and block it again.

Mack.

jdMorgan

5:03 am on May 17, 2003 (gmt 0)

I will say that I only allow it to archive a few "important" top-level pages - My robots.txt file tends to be rather detailed and specific.

Jim

mack

5:07 am on May 17, 2003 (gmt 0)

Jim,
To be safe I think I will be following your advice on this one. A lot of my site is dynamic content, and if ia_archiver got in amidst that it could get messy.

Mack.

berli

7:54 am on May 17, 2003 (gmt 0)

I do like the tool -- but, like others of you, I find the bot whacks me early and often and seems to hog a lot more bandwidth than other bots.

What's odd is that when I try to use Wayback, it only has results for every few months on old sites, and then often the links don't work (this has happened a lot lately) so it *says* it has something archived, but doesn't (at least in a practical sense). It also, sadly, never spidered quite a few smaller sites that have since disappeared.

Yet this bugger seems to visit my site every month--I don't know what's so exciting that it keeps coming back, and coming back, and coming back.

I'm wondering if I should block it for a few months, then let it back, block, let it back?

jdMorgan

1:07 pm on May 17, 2003 (gmt 0)

berli,

Remember that Alexa is doing the spidering to feed their search portal. After one year, they give the spidering data to Wayback. I think this may explain part of what you're seeing.

Broken links will occur if their spider cannot or does not download all the pages and images on your site. If you want a "clean" archive, some planning is required before implementing robots.txt and/or meta robots restrictions.

Jim
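
Editor's sketch of the planning step Jim describes: one rough way to see in advance which files a page depends on would be hidden from the spider is to run the planned rules through Python's standard `urllib.robotparser`. The rules and paths below (`/images/`, `/index.html`, `/style.css`) and the helper name `blocked_resources` are made up for illustration and do not come from anyone's site in this thread. The parser reflects Python's reading of robots.txt, which may not match how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser


def blocked_resources(robots_lines, agent, paths):
    """Return the paths that the given user-agent may NOT fetch.

    robots_lines: the planned robots.txt, as a list of lines.
    agent: user-agent token to test, e.g. "ia_archiver".
    paths: site-relative paths a page depends on (hypothetical here).
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [p for p in paths if not parser.can_fetch(agent, p)]


# Hypothetical rules: the archiver is kept out of the image directory.
planned_rules = [
    "User-agent: ia_archiver",
    "Disallow: /images/",
]

# Hypothetical files that the home page needs in order to render.
needed = ["/index.html", "/images/logo.gif", "/style.css"]

missing = blocked_resources(planned_rules, "ia_archiver", needed)
print(missing)
```

Anything this prints is something the archived copy of the page would be missing under the planned rules.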

berli

4:30 am on May 20, 2003 (gmt 0)

Thanks for the help everyone.

I'd heard some stuff about ia_archiver being a little scummy and ignoring robots.txt, but it seems to be behaving at present and is not attacking me so aggressively now. I think I'll just chill out a bit :)

Off to rewrite my robots.txt again . . .

btw, lazy of me to ask here, but is there a way to ask all bots but one to stay off certain pages? I have this "game" on my site which is all HTML (not dynamically served) and it would be nice to let Alexa/Wayback archive it, but I *don't* want, say, Google users surfing into one of those pages randomly if I can help it.

(I'm not going to worry about non-robots.txt-compliant bots and .htaccess at this point -- crossing that bridge when I come to it.)

jdMorgan

4:40 am on May 20, 2003 (gmt 0)

berli,

This would allow ia_archiver to index all of your pages, and prevent everyone else from indexing anything starting with "/game/pages":


User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /game/pages

(blank lines between records and at the end required as shown)
Jim
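
Editor's sketch for checking the two records above before deploying them, using Python's standard `urllib.robotparser`. The page path `/game/pages/level1.html`, the `Googlebot` token, and the helper name `can_fetch` are assumptions chosen for illustration. The result shows how Python's parser reads the file and is not a guarantee of how any real spider will behave.

```python
from urllib.robotparser import RobotFileParser

# The two records from the post, with the blank line between them.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /game/pages
"""


def can_fetch(agent, path):
    """Ask Python's robots.txt parser whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical page inside the game, plus an ordinary page outside it.
for agent in ("ia_archiver", "Googlebot"):
    for path in ("/game/pages/level1.html", "/index.html"):
        print(agent, path, can_fetch(agent, path))
```

Under this parser, the empty `Disallow:` leaves ia_archiver unrestricted, while every other agent falls through to the `*` record and is refused anything under `/game/pages`.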