right now altavista's crawlers are consuming several hundred gigabytes of traffic. is anyone seeing the same? although i like altavista, if they continue i'm afraid i will have to ban their robots - the traffic they cause doesn't hold up against the few referrals from this search engine (although i can't complain about the rankings ...)
here are the crawler machines (hostnames):
mmsorbet1.sv.av.com
mmsorbet11.sv.av.com
mmsorbet13.sv.av.com
mmsorbet14.sv.av.com
mmsorbet15.sv.av.com
mmsorbet16.sv.av.com
mmsorbet2.sv.av.com
mmsorbet20.sv.av.com
mmsorbet22.sv.av.com
mmsorbet24.sv.av.com
mmsorbet25.sv.av.com
mmsorbet28.sv.av.com
mmsorbet29.sv.av.com
mmsorbet3.sv.av.com
mmsorbet31.sv.av.com
mmsorbet32.sv.av.com
mmsorbet34.sv.av.com
mmsorbet35.sv.av.com
mmsorbet36.sv.av.com
mmsorbet38.sv.av.com
mmsorbet39.sv.av.com
mmsorbet4.sv.av.com
mmsorbet5.sv.av.com
mmsorbet7.sv.av.com
mmsorbet8.sv.av.com
any ideas? (or promises from altavista? :)
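To see how much each of the machines above actually pulls, the traffic can be tallied from the access log. This is a minimal sketch, assuming the server writes Common Log Format with reverse-DNS hostnames in the first field; the function name `bytes_by_host` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: Common Log Format, with hostnames (not raw IPs) in the first field.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} (?P<size>\d+|-)'
)

def bytes_by_host(lines, suffix=".sv.av.com"):
    """Sum response bytes per remote host whose name ends with `suffix`."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like CLF
        host = match.group("host").lower()
        if not host.endswith(suffix):
            continue  # not one of the crawler machines
        size = match.group("size")
        totals[host] += 0 if size == "-" else int(size)
    return totals

# Made-up sample lines, for illustration only.
sample = [
    'mmsorbet1.sv.av.com - - [09/Dec/2003:21:00:00 +0000] "GET /media/a.mpg HTTP/1.0" 200 1000',
    'mmsorbet2.sv.av.com - - [09/Dec/2003:21:00:05 +0000] "GET /media/a.mpg HTTP/1.0" 200 1000',
    'mmsorbet1.sv.av.com - - [09/Dec/2003:21:01:00 +0000] "GET /media/b.mpg HTTP/1.0" 200 500',
    'example.org - - [09/Dec/2003:21:02:00 +0000] "GET /index.html HTTP/1.0" 200 42',
]
totals = bytes_by_host(sample)
for host, total in totals.most_common():
    print(host, total)
```

On a real log, pass the open file object instead of `sample`. If the log stores IP addresses rather than hostnames, the suffix match will find nothing and the addresses would have to be resolved first.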
We have a new robots.txt forum [webmasterworld.com] where you could ask about keeping Scooter out of your large multimedia files while still allowing it to crawl your other pages.
It'll be interesting to see what the AV-bashers have to say if the recent changes at G drive significant search traffic over to AV... I'm kinda glad to see an old friend back, myself.
Jim
[edited by: jdMorgan at 9:09 pm (utc) on Dec. 9, 2003]
and the user agent is NOT scooter. the machines are listed above; the new user agent is "3.3.vscooter".
the crawler/grabber is "only" interested in multimedia files - if you have large image collections or huge video or audio files, beware!
actually there were other threads indicating bad behavior by the very same agent at:
[webmasterworld.com...]
[webmasterworld.com...]
another resource includes:
[photodude.com...]
eventually we really have to block the entire thing - not a clever idea from altavista. i cannot appreciate it if they hit the same huge files over and over (most likely they get confused by all the mirror domains we own...)
action: i have deactivated/rerouted most of my mirror domains and i have banned altavista from my multimedia folder via robots.txt.
result: the bandwidth consumed by altavista's multimedia crawlers has been greatly reduced (from several gigabytes to a few hundred megabytes)
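A robots.txt ban like the one described can be checked locally before waiting for the crawler to come back. The sketch below is only an illustration: the folder name `/multimedia/` and the robots.txt token `vscooter` are assumptions (the thread only reports the user agent string "3.3.vscooter"), and Python's `urllib.robotparser` does not necessarily match agents the same way the crawler does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the multimedia crawler out of one folder,
# leave everything else open to all robots.
ROBOTS_TXT = """\
User-agent: vscooter
Disallow: /multimedia/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The multimedia crawler is kept out of the folder but may fetch other pages.
print(parser.can_fetch("3.3.vscooter", "/multimedia/clip.mpg"))
print(parser.can_fetch("3.3.vscooter", "/index.html"))
# Other robots are not affected by the vscooter rule.
print(parser.can_fetch("SomeOtherBot", "/multimedia/clip.mpg"))
```

This only confirms that the rules say what was intended. Whether the crawler rereads the file promptly and obeys it is a separate question.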
conclusion:
1 - altavista's crawlers are NOT capable of identifying multiple instances of the same file BEFORE they download the entire thing (not sure if they do it afterwards either ...) - this could be a loophole to get multimedia content into altavista
2 - altavista's crawlers do not follow the robots.txt protocol, or it takes some time before they reread the file - another flaw in altavista's technology
3 - i wrote to them two weeks ago (corporate marketing) and i am still awaiting an answer - their customer relationship management is not very responsive ...
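On the first point, a crawler could in principle spot likely mirror copies from HEAD responses before pulling the whole file. The following is a hypothetical sketch of that idea, not a description of how altavista works: the function names, the example domains and the choice of fingerprint (file name, Content-Length, Last-Modified) are all assumptions, and the fingerprint can produce both false matches and misses.

```python
import posixpath
from urllib.parse import urlsplit

def head_signature(url, headers):
    """Cheap fingerprint from the URL's file name and HEAD response headers.

    `headers` is assumed to be a plain dict with exact-case header names.
    """
    name = posixpath.basename(urlsplit(url).path)
    return (name, headers.get("Content-Length"), headers.get("Last-Modified"))

def plan_downloads(head_results):
    """Split (url, headers) pairs into URLs to fetch and likely duplicates."""
    seen = {}
    to_fetch, skipped = [], []
    for url, headers in head_results:
        sig = head_signature(url, headers)
        if None in sig[1:]:
            to_fetch.append(url)  # missing headers: cannot judge, so fetch
        elif sig in seen:
            skipped.append((url, seen[sig]))  # same fingerprint seen earlier
        else:
            seen[sig] = url
            to_fetch.append(url)
    return to_fetch, skipped

# Made-up HEAD results for a file served from two mirror domains.
sample = [
    ("http://www.example.com/video/clip.mpg",
     {"Content-Length": "7340032", "Last-Modified": "Mon, 01 Dec 2003 10:00:00 GMT"}),
    ("http://mirror.example.net/video/clip.mpg",
     {"Content-Length": "7340032", "Last-Modified": "Mon, 01 Dec 2003 10:00:00 GMT"}),
    ("http://www.example.com/video/other.mpg",
     {"Content-Length": "1048576", "Last-Modified": "Tue, 02 Dec 2003 10:00:00 GMT"}),
    ("http://www.example.com/video/nolength.mpg", {}),
]
to_fetch, skipped = plan_downloads(sample)
print(to_fetch)
print(skipped)
```

A HEAD request costs a few hundred bytes, so even a rough check like this would be far cheaper than downloading the same large file once per mirror domain.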