Alexa ia_archiver

Question about Alexa spider

KeithBoynton

8:48 pm on Aug 18, 2004 (gmt 0)

I have a question that I hope someone can answer.

I believe Alexa's search results are powered by Google, yet ia_archiver crawls my site a lot. What use does this serve me as a webmaster if the crawl doesn't affect their search results?

Is the only benefit I'm going to get from the ia_archiver crawl "possibly" a few visits from people looking at archived content? I get plenty of referrals from Google, but to my knowledge have NEVER had anything from Alexa.

I'm considering banning ia_archiver as I don't see the benefit. Can anyone clarify?

wilderness

12:02 pm on Aug 19, 2004 (gmt 0)

Alexa has a "so-called" toolbar which provides stats on website rankings, based primarily on users who have the toolbar installed rather than on what actually exists on the WWW.

A third benefit of letting ia_archiver crawl your site is that your pages and data get sold to 3rd parties ;)

Here are some recent threads on Alexa
[webmasterworld.com...]
[webmasterworld.com...]

Don

Lord Majestic

12:10 pm on Aug 19, 2004 (gmt 0)

I'm considering banning ia_archiver as I don't see the benefit. Can anyone clarify?

They resell your content nowhere near as much as Google "resells" it. By being a nice guy and allowing non-abusive, well-behaving robots to crawl your site, you are acting in the true spirit of the Internet. Not knowing what can be done with the data does not mean nothing exciting can be found in it (which is why researchers need data), and such findings can benefit you indirectly in the future, à la Google.

The Web as we know it would not have been possible without the fair-use principle applying to both humans and automated processes like the one mentioned above. If people had banned Googlebot when it was a research project, we would not have had Google... or, more likely, bots would just pretend to be IE.

Please do not ban well-behaving bots only because you have no idea how exactly a small usage* of bandwidth will translate into returns later - many research projects fail, but a few do succeed and pay back handsomely.

* If bots use more than 10% of your bandwidth, then either you have a low number of visits (hence the high relative figure) or you have some big issue that is not related to bots.
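For anyone who wants to check that 10% figure against their own logs, here is a minimal sketch. It assumes the Apache/Nginx "combined" log format, and the marker list and the `bot_bandwidth_share` helper are illustrative names only, not anything taken from Alexa or Google documentation.

```python
import re

# Assumption: Apache/Nginx "combined" log format.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)
# Assumption: illustrative user-agent substrings; adjust to your own traffic.
BOT_MARKERS = ("ia_archiver", "googlebot", "slurp", "msnbot")


def bot_bandwidth_share(lines, markers=BOT_MARKERS):
    """Return the fraction of response bytes served to matching user agents."""
    bot_bytes = 0
    total_bytes = 0
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        size = match.group("size")
        sent = 0 if size == "-" else int(size)
        total_bytes += sent
        agent = match.group("agent").lower()
        if any(marker in agent for marker in markers):
            bot_bytes += sent
    return bot_bytes / total_bytes if total_bytes else 0.0


sample = [
    '192.0.2.1 - - [19/Aug/2004:12:00:00 +0000] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '192.0.2.2 - - [19/Aug/2004:12:00:05 +0000] "GET /a.html HTTP/1.0" 200 1000 "-" "ia_archiver"',
]
share = bot_bandwidth_share(sample)
print(f"bots used {share:.0%} of bandwidth")
```

On a real site you would pass an open log file instead of `sample`. Bots that pretend to be a browser will not be counted, so treat the result as a lower bound.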

wilderness

12:47 pm on Aug 19, 2004 (gmt 0)

ban well-behaving bots

[webmasterworld.com...]

Hardly! Well behaving!

Not even compliant in its identification.

You assuredly have some agenda with Google?
As previously mentioned, the open WWW does NOT function
as was originally intended or as you desire it to.
There exists a variety of platforms, of which just three are internet, intranet and extranet.

There are a multitude of harvesters travelling the net (more negative than the reasons for your rhetoric), and under your criteria ALL harvesting, regardless of use, would continue.

Wake up!

which is why researchers need data

You mean researchers working under grants, or universities with paid staff utilizing students to obtain their objectives, all the while utilizing the resources of websites and webmasters beyond what was the market originally intended by the webmaster?

The past two nights while I was sleeping, somebody from OSU felt a need to "research" more of my pages than an interested visitor would normally do. The intrusion and violation of my TOS was really unnecessary, as the majority of the pages OSU grabbed provide a link to a ZIP file which offers the entire content of that section.
On their return, they're eating 403s.
Under your criteria, it's a crime that I do not continue to allow the harvesting. Er! Research ;)
Hogwash!

Lord Majestic

12:57 pm on Aug 19, 2004 (gmt 0)

I can't see anything wrong in the thread that you linked to. Search engine bots should be expected to mimic a browser at times in order to verify that a site is not cloaking by user agent. Only robots are supposed to request robots.txt, so a bot checking whether a site cloaks might skip that request in order to look like a browser.

In my view a well-behaved bot is one whose activity is not noticeable by the website - i.e. it is gentle with its querying to avoid possibly overloading the site. A good bot should also not waste bandwidth by requesting the same page(s) too frequently (more than once a day).

Anyway, you seem to be waging what appears to be a pointless crusade against bots. What did they do to you?
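To make the "gentle querying" and "not more than once a day" criteria concrete, here is a rough sketch of the bookkeeping a crawler could do. `PoliteScheduler`, the 10-second host delay and the injectable clock are all assumptions made for illustration; no real crawler is claimed to work this way.

```python
import time


class PoliteScheduler:
    """Decide whether a crawler may fetch a URL now (hypothetical helper)."""

    def __init__(self, min_delay=10.0, refetch_after=86400.0, clock=time.monotonic):
        self.min_delay = min_delay          # seconds between hits on one host
        self.refetch_after = refetch_after  # seconds before re-requesting a URL
        self.clock = clock                  # injectable so the logic can be tested
        self.last_host_hit = {}
        self.last_url_hit = {}

    def may_fetch(self, host, url):
        now = self.clock()
        last_url = self.last_url_hit.get(url)
        if last_url is not None and now - last_url < self.refetch_after:
            return False  # same page was requested less than a day ago
        last_host = self.last_host_hit.get(host)
        if last_host is not None and now - last_host < self.min_delay:
            return False  # too soon after the previous request to this host
        self.last_host_hit[host] = now
        self.last_url_hit[url] = now
        return True


# Usage with a fake clock so no real waiting is needed.
fake_now = [0.0]
sched = PoliteScheduler(clock=lambda: fake_now[0])
first = sched.may_fetch("example.com", "http://example.com/a")
burst = sched.may_fetch("example.com", "http://example.com/b")
fake_now[0] = 20.0
later = sched.may_fetch("example.com", "http://example.com/b")
again = sched.may_fetch("example.com", "http://example.com/a")
```

The second request is refused because it follows the first too closely, and the final one is refused because page `a` was already fetched within the last day.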

wilderness

1:18 pm on Aug 19, 2004 (gmt 0)

what did they do to you?

A bot is an inanimate object and hardly capable of doing anything to me ;)
The folks who program and run the bot, on the other hand, are quite capable of raising a reaction, in spite of the fact that the operator in most instances never sees the result of that reaction.

NOT all bots are compliant or well-behaved!

In all fairness NEITHER are all bots harvesters and intruders.

Each webmaster makes a decision on what is effective for his/her user and market. What is beneficial or detrimental.

Lord Majestic

1:23 pm on Aug 19, 2004 (gmt 0)

The intrusion and violation of my TOS was really unnecessary, as the majority of the pages OSU grabbed provide a link to a ZIP file which offers the entire content of that section.

Oh dear - so you expect bot owners to manually read the TOS of every website? Or you expect bots to understand your TOS before crawling? Give me a break!

A bot is an inanimate object and hardly capable of doing anything to me

Good to hear - I was half expecting to hear a sob story about how bots slaughtered all your family while you were young, and you were raised by animals in the local forest (hence wilderness), and from then on you hunt bots whenever you find them :)

wilderness

1:33 pm on Aug 19, 2004 (gmt 0)

so you expect bot owners to manually read the TOS of every website?

Why not?
Their service providers expect webmasters to jump through hoops and perform tricks of wizardry to report abuse violations merely to be compensated with an automated response ;(

BTW, that's how most websites' TOS or UAGs are composed.
By visiting the website, you agree to comply with the TOS or UAG's as a condition of your visit. Regardless of whether you've read them or not. Just because you run a bot (good or bad) doesn't relieve you of compliance.

Lord Majestic

1:41 pm on Aug 19, 2004 (gmt 0)

Why not?

Because it is not feasible to read the TOS of 10 million websites, whereas the 0.0001% of websites that have an issue with crawling can read the TOS on the handful of sites run by crawler operators.

This is the reason why it is, and should remain, OPT OUT - hence it is the webmaster's responsibility to set up robots.txt, meta tags etc.

Considering how much traffic people (perhaps including yourself) get as the result of spiders' work, I was expecting you to be a little bit more sympathetic to bots.
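As a rough illustration of that opt-out, the sketch below feeds a robots.txt that bans only ia_archiver to Python's standard `urllib.robotparser` and checks what a compliant bot would conclude. The rules and the `example.com` URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: shut out ia_archiver, allow everyone else.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# What a compliant crawler would decide for a sample URL.
alexa_allowed = parser.can_fetch("ia_archiver", "http://example.com/page.html")
google_allowed = parser.can_fetch("Googlebot", "http://example.com/page.html")
print("ia_archiver allowed:", alexa_allowed)
print("Googlebot allowed:", google_allowed)
```

This only shows what the file asks for. Whether a given bot actually honours it is a separate question.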

Regardless of whether you've read them or not

I think courts might find it problematic to enforce a TOS that was not explicitly accepted by a human. In fact I think you will fail in the UK unless you are able to demonstrate that this "illegal" spider activity resulted in actual losses; I also think you will have to prove there was intent to do that to your site specifically. And if parts of your TOS won't stand up in court (depends on the country, of course) then those parts are as meaningless as many of those EULAs that people "agree" to by clicking.

Perhaps courts in the USA, which are known to make some, ummm, interesting "judgements", will act differently, but I don't think you have much of a chance - at best the firm in question will selectively remove results from its crawl and add your site to a blacklist (IMO should be global).

Bots are not visiting, they are crawling the site. If someone tries to set a precedent in court on that matter, then I am sure all major search engines will unite in banning sites like that from their crawl -- if you want to benefit from the features of the Internet, then play accordingly.

bull

2:11 pm on Aug 19, 2004 (gmt 0)

Come on, put these discussions to bed. It is up to every webmaster what to do with bots and their IP ranges. It is not necessary to repeat these discussions; use the site search function. You may surely tell us your opinion a dozen times, but this does not bring any new aspects. I hope we do not get the same moral discussion in every new thread when someone reports a bot.

To me, the robots.txt is the bot's TOS. Does not fetch it -> does not accept the TOS. Does fetch it and does not obey it -> does not accept the TOS. That simple. Speaking for myself, my website is intended for human users. A bot remains a bot, even when masquerading as a legit browser. It is therefore breaking the TOS by not fetching robots.txt.
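The two rules above can be sketched as a small check over already-parsed log entries. `classify_clients`, the client labels and the `/private/` prefix are hypothetical, and it only makes sense to run this on clients you already suspect of being bots, since human browsers never fetch robots.txt either.

```python
def classify_clients(requests, disallowed_prefixes):
    """Label each client according to the robots.txt-as-TOS rule.

    requests: iterable of (client, path) pairs in the order they were logged.
    disallowed_prefixes: path prefixes that robots.txt disallows.
    """
    seen = {}          # dict keeps first-seen order of clients
    fetched = set()    # clients that requested /robots.txt
    violated = set()   # clients that requested a disallowed path
    for client, path in requests:
        seen.setdefault(client, True)
        if path == "/robots.txt":
            fetched.add(client)
        elif any(path.startswith(prefix) for prefix in disallowed_prefixes):
            violated.add(client)

    verdicts = {}
    for client in seen:
        if client not in fetched:
            verdicts[client] = "never fetched robots.txt"
        elif client in violated:
            verdicts[client] = "ignored robots.txt"
        else:
            verdicts[client] = "ok"
    return verdicts


# Hypothetical sample: three suspected bots.
requests = [
    ("bot-a", "/robots.txt"),
    ("bot-a", "/index.html"),
    ("bot-b", "/robots.txt"),
    ("bot-b", "/private/x.html"),
    ("bot-c", "/index.html"),
]
verdicts = classify_clients(requests, ["/private/"])
for client, verdict in verdicts.items():
    print(client, "->", verdict)
```

Anything other than "ok" counts as not accepting the TOS under this rule. What to do about it (403, IP ban, nothing) remains each webmaster's call.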

volatilegx

4:31 pm on Aug 19, 2004 (gmt 0)

I think the original question has been answered and this thread has veered way off topic. Time for a close :)