Forum Moderators: bakedjake
I'm in the same boat - not much traffic from Alexa.
However, it will be nice to have the Wayback Machine if I ever have to prove prior publication of a page in a copyright dispute. So, for me, it's worth the bandwidth.
I do limit their spidering of my sites to "valuable" pages using robots.txt, though. This keeps the bandwidth reasonable.
Jim
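For anyone wondering what limiting the spider to "valuable" pages might look like, here is a rough sketch. The directory names are invented for illustration (Jim doesn't say which paths he blocks), and ia_archiver is the user-agent name mentioned later in this thread. Python's standard urllib.robotparser is used only as a stand-in for how a compliant crawler would probably read the rules; a real spider may interpret them differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the blocked directories are made-up examples,
# not anything taken from the thread.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /images/
Disallow: /downloads/
"""


def is_allowed(agent: str, path: str) -> bool:
    """Return True if the rules above let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/articles/intro.html", "/images/photo.jpg"):
        print(path, is_allowed("ia_archiver", path))
```

Only the bandwidth-heavy directories are closed off here, so everything else stays open to the archive. Bots not named in the file are unaffected by these rules.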
just my $0.02.
I agree about "respecting the medium." I also believe it works both ways: one of the tests I use to determine whether to block a user-agent is to ask, "Is it useful to my visitors, to me, or to the public at large?"
Clearly, IA can be useful, so I don't block it.
There are other user-agents which are abusive and serve no purpose except to provide a product for sale to others at our expense. Those, I block.
My primary perspective is not one of sales or traffic, but of keeping the internet useful -- doing my part to reduce namespace, search engine index, and e-mail clutter, and wasteful use of bandwidth -- mine and others'.
Tilting at windmills, I suppose, but that's me...
Jim
Even if they aren't sending you traffic now - it is a useful service and can provide unintended benefits for you.
For example - I used Wayback to investigate a company I have been thinking about doing business with. I found something that makes it more likely that I will do business with them than if I hadn't found it at all. In reality - it probably wouldn't have mattered in my case, but it could.
Also, you never know who will buy whom in the future and what that data will be used for. Blocking it could prevent a semi-useful source of info from being used in something bigger.
I did have problems with it in the past. But I also see it as being a useful "historical" tool in the near future.
I don't think the general public really knows that archive.org exists, and if they did I think it would be very popular with people wanting to know what Yahoo etc. looked like in the "good old days."
So I would say it is useful to web users, and I have decided to allow it in again. If, however, the bot gets out of hand I will contradict myself and block it again.
Mack.
What's odd is that when I try to use Wayback, it only has results for every few months on old sites, and then often the links don't work (this has happened a lot lately) so it *says* it has something archived, but doesn't (at least in a practical sense). It also, sadly, never spidered quite a few smaller sites that have since disappeared.
Yet this bugger seems to visit my site every month--I don't know what's so exciting that it keeps coming back, and coming back, and coming back.
I'm wondering if I should block it for a few months, then let it back, block, let it back?
Remember that Alexa is doing the spidering to feed their search portal. After one year, they give the spidering data to Wayback. I think this may explain part of what you're seeing.
Broken links will occur if their spider cannot or does not download all the pages and images on your site. If you want a "clean" archive, some planning is required before implementing robots.txt and/or meta robots restrictions.
Jim
I'd heard some stuff about ia_archiver being a little scummy and ignoring robots.txt, but it seems to be behaving at present and is not attacking me so aggressively now. I think I'll just chill out a bit :)
Off to rewrite my robots.txt again . . .
btw, lazy of me to ask here, but is there a way to ask all bots but one to stay off certain pages? I have this "game" on my site which is all HTML (not dynamically served), and it would be nice to let Alexa/Wayback archive it, but I *don't* want, say, Google users surfing into one of those pages randomly if I can help it.
(I'm not going to worry about non-robots.txt-compliant bots and .htaccess at this point -- crossing that bridge when I come to it.)
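One way the "all bots but one" question could be handled in robots.txt is to give the favoured bot its own record with an empty Disallow, and put the restriction in the catch-all record. This is only a sketch: the /game/ path is a guess at where such pages might live, ia_archiver is the user-agent name used earlier in the thread, and Python's urllib.robotparser is just a convenient way to check how a compliant parser would likely read the file. As noted above, non-compliant bots will ignore it entirely.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "/game/" is an assumed directory name.
# An empty "Disallow:" means "nothing is off limits" for that record.
GAME_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /game/
"""


def may_fetch(agent: str, path: str) -> bool:
    """Return True if the rules above let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(GAME_ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("ia_archiver", "Googlebot"):
        for path in ("/game/start.html", "/index.html"):
            print(agent, path, may_fetch(agent, path))
```

A bot that finds a record naming it is expected to follow that record and skip the catch-all, which is why ia_archiver gets into /game/ while everyone else is asked to stay out. The rest of the site remains open to all bots.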