

Alternative Search Engines Forum

    
Alexa crawling heavily?
Similar to a Google bot...
Paully




msg:465987
 9:42 pm on Oct 20, 2002 (gmt 0)

I am getting heavily hit by an Alexa crawler, looks very similar to the Google crawler.

Anyone have any info on this?

 

mack




msg:465988
 11:25 pm on Oct 20, 2002 (gmt 0)

What was the user agent?

If it was "Ia_archiver" ban it!

deejay




msg:465989
 11:30 pm on Oct 20, 2002 (gmt 0)

seriously Mack? I've had ia_archiver all over me in the last couple of days.

mack




msg:465990
 11:39 pm on Oct 20, 2002 (gmt 0)

I always ban Ia_archiver without exception. It totally disregards robots.txt and on one occasion almost brought one of my sites down. Had to ban it with .htaccess.

It's just a bad bot, offers nothing in return and swallows your bandwidth.

Paully




msg:465991
 2:53 am on Oct 21, 2002 (gmt 0)


I always ban Ia_archiver without exception. It totally disregards robots.txt and on one occasion almost brought one of my sites down. Had to ban it with .htaccess.
It's just a bad bot, offers nothing in return and swallows your bandwidth.

On the money Mack.

I will watch it closely; if it starts to slow down the site I will ban it. I hate banning if I don't have to.

deejay, so it left after a couple days, right?

Paully




msg:465992
 9:06 am on Oct 21, 2002 (gmt 0)

OK, ended up banning it. Too much bandwidth, as Mack stated.

Thanks for the great info.

deejay




msg:465993
 9:13 am on Oct 21, 2002 (gmt 0)

Without checking logs, I think I had it for three days.... took about half the site each day.

Site's not big enough for it to be a problem.... yet. :) might look at banning it anyways.

mack




msg:465994
 11:23 am on Oct 21, 2002 (gmt 0)

My main site is a small search engine...

ie_archiver landed in it one day with a search query. On my results pages there is a "similar sites" link, and if a user follows it, it carries out another search for the title of the site in the SERPs. ie_archiver did the search, then followed the "similar sites" links, then did the same on the following SERPs. It carried out over 5000 search queries as well as downloading all 3500 pages from my site.

My site was almost forced down by the strain this blighter was putting on it.

I emailed them and they asked me to send my logs. I also sent them a copy of my robots.txt file. I never heard back from them.

The logs clearly show the bot requesting robots.txt and then issuing GET /anotherpage, totally disregarding the robots.txt instructions.
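For anyone wanting to check their own logs the same way, here is a minimal Python sketch. It assumes Combined Log Format lines, and the robots.txt content, IP addresses and log lines below are made up for illustration; they are not mack's actual files. It lists every path a given bot fetched that a standards-following parser (Python's `urllib.robotparser`) would treat as disallowed.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and log lines, for illustration only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

LOG_LINES = [
    '10.0.0.1 - - [21/Oct/2002:11:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 60 "-" "ia_archiver"',
    '10.0.0.1 - - [21/Oct/2002:11:00:05 +0000] "GET /anotherpage HTTP/1.0" 200 5120 "-" "ia_archiver"',
    '10.0.0.2 - - [21/Oct/2002:11:00:09 +0000] "GET /index.html HTTP/1.0" 200 2048 "-" "Mozilla/4.0"',
]

# Assumes Combined Log Format: "request" status size "referer" "user agent".
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def disallowed_fetches(log_lines, robots_txt, agent_token):
    """Return paths fetched by agent_token that robots_txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    hits = []
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a line this sketch understands
        if agent_token.lower() not in match.group("agent").lower():
            continue  # some other visitor
        path = match.group("path")
        if path == "/robots.txt":
            continue  # fetching robots.txt itself is expected
        if not parser.can_fetch(agent_token, path):
            hits.append(path)
    return hits


if __name__ == "__main__":
    for path in disallowed_fetches(LOG_LINES, ROBOTS_TXT, "ia_archiver"):
        print("disallowed fetch:", path)
```

Keep in mind this only compares the log against the robots.txt you have now; if the file changed while the bot was crawling, older requests may look like violations when they were not.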

I was told that it was gathering information from sites that could potentially serve web data? Well they can keep their bot.

Paully




msg:465995
 5:12 pm on Oct 21, 2002 (gmt 0)

The logs clearly show the bot requesting robots.txt and then issuing GET /anotherpage, totally disregarding the robots.txt instructions.

That really was the final straw, when I saw that it wasn't obeying the robots.txt.

Bastards... :)

mack




msg:465996
 5:33 pm on Oct 21, 2002 (gmt 0)

Paully, are you implying that ie_archiver is an illegitimate child :)

You could be done for slander, saying things like that. lol

Slade




msg:465997
 5:52 pm on Oct 21, 2002 (gmt 0)

OK, just for reference folks...

ia_archiver is the bot that was commented on.

It belongs to archive.org, an internet archiving group. They let you pull up things like what CNN looked like on Sept 11, 2001.

Yes, for those of you who have copyrighted content, or just a lot of it, it could be bad. But other than that, it is a quite useful service.

Note: As with any bot, there may be multiple processes on multiple servers. If a bot on one IP pulls a new robots.txt, another bot may not get a copy of it, and continue pulling pages.

On my personal site, I just pulled the logs. 20 total requests, 11 of them for robots.txt, which I haven't put up yet.
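A tally like that can be pulled with a few lines of Python. This is only a sketch: it assumes Combined Log Format, and the sample lines and IP addresses are invented for illustration, not taken from anyone's real logs. It counts a bot's total requests, how many were for robots.txt, and how many came from each IP, which helps spot the multiple-servers situation described in the note above.

```python
from collections import Counter

# Hypothetical log lines in Combined Log Format, for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [21/Oct/2002:17:00:00 +0000] "GET /robots.txt HTTP/1.0" 404 210 "-" "ia_archiver"',
    '10.0.0.2 - - [21/Oct/2002:17:00:03 +0000] "GET /robots.txt HTTP/1.0" 404 210 "-" "ia_archiver"',
    '10.0.0.2 - - [21/Oct/2002:17:00:07 +0000] "GET /index.html HTTP/1.0" 200 2048 "-" "ia_archiver"',
    '10.0.0.9 - - [21/Oct/2002:17:01:00 +0000] "GET /index.html HTTP/1.0" 200 2048 "-" "Mozilla/4.0"',
]


def summarize_bot(log_lines, agent_token):
    """Return (total requests, robots.txt requests, per-IP Counter) for one bot."""
    total = 0
    robots_requests = 0
    per_ip = Counter()
    for line in log_lines:
        # Splitting on double quotes gives: [1] request line, [5] user agent.
        parts = line.split('"')
        if len(parts) < 6:
            continue  # not Combined Log Format
        request, agent = parts[1], parts[5]
        if agent_token.lower() not in agent.lower():
            continue
        fields = request.split()
        path = fields[1] if len(fields) >= 2 else ""
        total += 1
        per_ip[line.split()[0]] += 1  # first field is the client IP
        if path == "/robots.txt":
            robots_requests += 1
    return total, robots_requests, per_ip


if __name__ == "__main__":
    total, robots_requests, per_ip = summarize_bot(SAMPLE_LOG, "ia_archiver")
    print(f"{total} total requests, {robots_requests} for robots.txt")
    for ip, count in per_ip.most_common():
        print(f"  {ip}: {count}")
```

If the per-IP breakdown shows one address fetching robots.txt and a different one fetching pages, that would fit the multiple-processes explanation rather than a bot ignoring the file outright.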

mayor




msg:465998
 3:36 pm on Oct 22, 2002 (gmt 0)

here's what I use, compliments of WebmasterWorld robots.txt:

# bad bots get your butt out of here

User-agent: ia_archiver
Disallow: /

User-agent: ia_archiver/1.6
Disallow: /

User-agent: Alexibot
Disallow: /

But I haven't seen Alexa around lately
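Before relying on rules like these, it can be worth confirming that they parse the way you expect. Below is a small Python sketch using the standard library's `urllib.robotparser`; the `is_blocked` helper name and the test paths are just for illustration. It tells you what a compliant parser concludes from the file, not whether a particular bot actually honors it.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, minus the comment line.
RULES = """\
User-agent: ia_archiver
Disallow: /

User-agent: ia_archiver/1.6
Disallow: /

User-agent: Alexibot
Disallow: /
"""


def is_blocked(rules, agent, path="/"):
    """True if a compliant parser would refuse `path` for `agent`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return not parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("ia_archiver", "Alexibot", "Googlebot"):
        state = "blocked" if is_blocked(RULES, agent) else "allowed"
        print(f"{agent}: {state}")
```

With only these three entries and no `User-agent: *` section, any agent not named in the file is allowed everywhere.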

mack




msg:465999
 9:41 pm on Oct 22, 2002 (gmt 0)

According to the email reply I got, Alexa is also using the UA Ia_archiver.

rfgdxm1




msg:466000
 9:11 am on Oct 23, 2002 (gmt 0)

For someone running an information site, the Alexa bot can be seen as important because it is archiving.

kfander




msg:466001
 5:29 pm on Oct 28, 2002 (gmt 0)

I like the Alexa Internet archive, although I removed their toolbar long ago because it was significantly slowing my system. I can go back in Alexa's archive and see what my sites looked like years ago, along with most of the changes I've made along the way.

hbird64




msg:466002
 9:12 pm on Nov 6, 2002 (gmt 0)

And what about the crawler crawl7-public.alexa.com?

Hugo

kstprod




msg:466003
 11:33 am on Nov 9, 2002 (gmt 0)

I have a very small site, 15 pages or so, and for what it's worth, I have had no problem with Ia_archiver. Alexa has been very well behaved and politely visits me almost every day. I've noticed a weird pattern lately, though: she always comes almost precisely when Googlebot comes. But then, maybe it's just a coincidence.

In my opinion, I wouldn't ban something just because someone else did. What if that person was wrong? If so, then you have just eliminated some potential customers. :)

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved