Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Altavista
Your Thoughts
cmjohnston




msg:395798
 12:16 pm on May 30, 2003 (gmt 0)

What are your thoughts on the Altavista spider and search engine? I am getting SLAMMED by their spider (at least once a minute for the last 3 days). Do you think that they will help with driving more traffic to my site, or are they just another 2nd tier search engine? BTW, here is what shows in the log.

5/30/2003,5:42:48 AM, ,216.39.48.20,Scooter/3.2,mailto:crawl-support@av.com

 

wilderness




msg:395799
 1:05 pm on May 30, 2003 (gmt 0)

AV has been quite inactive for some time.
They're just getting rolling again.

There is an interesting ongoing thread over in alt.webmaster or alt.html about who is the #1 SE.
The results are a bit surprising.

balam




msg:395800
 5:28 pm on May 30, 2003 (gmt 0)

A couple of days ago I told Scooter to take a hike... I have no interest in seeing AltaVista visit me.

Many emails to AV about some of the Scooter bots disregarding robots.txt have gone unanswered, so it's simple: If you don't behave, you're not welcome.

No big loss, either... I get more quality referrals from Google Thailand than all referrals in total from all the global AV sites.

balam

Brad




msg:395801
 6:04 pm on May 30, 2003 (gmt 0)

I get about 5 - 7% referrals from AV across several sites. If you can rank well in AV it can still send some traffic but it varies with niche.

It is ironic that we spent years complaining that AV did not spider deep and now that it is doing so people are complaining about that! :)

balam




msg:395802
 6:16 pm on May 30, 2003 (gmt 0)

My big complaint is Scooter is told to stay away from all my image directories, but they happily dip in anyways...
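For reference, here is a small Python sketch of what a compliant crawler should conclude from a robots.txt that puts an image directory off-limits to Scooter. The robots.txt content, the /images/ path and the helper name may_fetch are all hypothetical (the thread never shows the actual file); only the standard-library urllib.robotparser module is real.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the actual file is not shown in the thread.
ROBOTS_TXT = """\
User-agent: Scooter
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, path: str) -> bool:
    """Return True if the parsed rules permit user_agent to fetch path."""
    return parser.can_fetch(user_agent, path)
```

This only models what a well-behaved bot should decide. It does nothing to stop a bot that ignores the file, which is the complaint here.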

balam

rbs10025




msg:395803
 12:29 am on May 31, 2003 (gmt 0)

I've got Scooter allowed in, but I've also got it lumped in with a number of agents that are not allowed to get non-HTML files. This is especially important at my site as it includes a number of very large binary datasets in numerous locations and the robots have proven too stupid to understand that downloading them is a waste of bandwidth.

RewriteCond %{HTTP_USER_AGENT} .*Ask.Jeeves.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*FAST.WebCrawl.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*ia_archiver.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*InfoSeek.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*Inktomi.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*Scooter.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*Slurp.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*Teoma.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*VoilaBot.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*Google.*
RewriteRule !.*(html|htm|txt|/)$ /www/msgs/badagent.html [F]
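For anyone copying the block above, here is a rough Python sketch of what it appears to do, assuming the separators in the RewriteRule pattern are ordinary pipe characters. The function name is_forbidden and the sample values are made up for illustration; this approximates the matching logic only, not how Apache evaluates the rules internally.

```python
import re

# Unanchored substrings taken from the RewriteCond lines above, OR'ed together.
RESTRICTED_AGENTS = re.compile(
    r"Ask.Jeeves|FAST.WebCrawl|ia_archiver|InfoSeek|Inktomi"
    r"|Scooter|Slurp|Teoma|VoilaBot|Google"
)

# The rule is negated: paths that do NOT match this pattern are refused.
ALLOWED_PATH = re.compile(r".*(html|htm|txt|/)$")


def is_forbidden(user_agent: str, path: str) -> bool:
    """Approximate whether the rule set would answer 403 for this request."""
    if not RESTRICTED_AGENTS.search(user_agent):
        return False  # agents not on the list are never restricted
    return ALLOWED_PATH.search(path) is None
```

Under this model, is_forbidden("Scooter/3.2", "/data/big.bin") is True, while the same agent asking for "/index.html" or "/foo/" gets through.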

balam




msg:395804
 5:18 am on May 31, 2003 (gmt 0)

Oooo, thanks for the code, rbs10025, and as is the habit around here,

Welcome to Webmaster World!

But, it's not enough to save Scooter... :) AV just irks me too much to let them come back. If they become any sort of player in the SE game, I suppose I'll have to rethink my position, but until that day...

So, now that I've taken this step, can anyone tell me what will happen next? My site's been completely (over)indexed by AV, but now that they're not welcome, what becomes of me in their index?

I suppose I'll be progressively dropped over the next (couple of?) months, since AV can no longer verify the existence of any of my pages, yes?

balam

carfac




msg:395805
 4:57 am on Jun 1, 2003 (gmt 0)

rbs:

Welcome to the Webmasterworld!

Sweet code- thanks for posting it. Think I shall have to borrow it! :)

As I posted in another forum, I am having probs on the AV serps. One site- a site with 14,000 pages in Google!- has one page- the index- in AV. Another VERY popular site is not even there at all, and hasn't been for 3-4 years. I do NOT know what AV's problem is.

I seem to remember having a problem with their spider going where it did not belong, but I never banned it. But I have not seen a Scooter around in a LONG time.

You know who else has a VERY poor spider- always going where he does not belong? Jeeves.

dave

guillermo5000




msg:395806
 11:07 pm on Jun 1, 2003 (gmt 0)

I couldn't agree more! Altavista...are you listening?

I wrote altavista late last year about their misbehaving picture bot. Their reply, clearly fresh out of a can, had nothing to do with my question.

I wrote them back and put s p a c e s in the words that a program might search for to automate a reply. Guess what...no reply. So their picture bot was sent away for good.

Then, in March of this year, I wrote them about their seemingly useless Basic Submit.

Again, I got a response that was all about Express Inclusion, nothing about the Basic Submit that I asked about.

I again, very politely, asked them my question, and once again I received a blurb about spamming their search engine; something that had nothing to do with my question.

This was my response:

WOW! No wonder your search engine is no longer an important part of today's SEO strategy. You people cannot even answer a question. Truthfully I do not need an answer to my question because as I said before, since you introduced the Express Inclusion program, we have gotten zero (0) well optimized and information filled sites added to AltaVista's database.

Truthfully most SEO's do not even waist time with AltaVista, and now we see why.

Good Day, mysterious unnamed canned-response person.

wkitty42




msg:395807
 5:10 am on Jun 3, 2003 (gmt 0)

guillermo5000,

i'm sure that would have come across except for the
misspelling of waste... sorry to do that but as a business
owner, i place a great amount of weight on proper
commumication capabilities... many others do, too... that
one word would have tossed you into my questionable
catagory...

no disrespect intended... i hope you understand...

guillermo5000




msg:395808
 5:22 am on Jun 3, 2003 (gmt 0)

Wow! I guess I don't have your superior commumication skills.

jdMorgan




msg:395809
 5:35 am on Jun 3, 2003 (gmt 0)

Since it seems the code posted above may "get around"...

Adding start anchors to speed up processing where possible, and removing some unneeded stuff, such as ".*" on unanchored patterns and redundant ua strings such as Inktomi/Slurp:

RewriteCond %{HTTP_USER_AGENT} Ask.Jeeves [OR]
RewriteCond %{HTTP_USER_AGENT} ^FAST-WebCrawl [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia\_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} InfoSeek [OR]
RewriteCond %{HTTP_USER_AGENT} ^Scooter [OR]
RewriteCond %{HTTP_USER_AGENT} Slurp [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teoma [OR]
RewriteCond %{HTTP_USER_AGENT} VoilaBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Googlebot
RewriteRule !\.(html|htm|txt)$ /www/msgs/badagent.html [F]

The original code as posted will disallow all of these user-agents from all subdirectories; if you've copied it, make sure that's what you want to do. Otherwise, remove the "|/" at the end of the RewriteRule as shown above.
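To illustrate two of the changes above, here is a small Python sketch, assuming Python's re.search behaves like an unanchored RewriteCond match for these simple patterns. It checks that padding a pattern with ".*" on both sides does not change which strings match, and that escaping the dot in the RewriteRule makes it insist on a real file extension. The sample user agents and paths are illustrative only.

```python
import re

PADDED = re.compile(r".*Scooter.*")  # form used in the first posting
BARE = re.compile(r"Scooter")        # same pattern without the padding

LOOSE_SUFFIX = re.compile(r".*(html|htm|txt)$")   # no literal dot required
STRICT_SUFFIX = re.compile(r"\.(html|htm|txt)$")  # dot must precede the extension


def found(pattern: re.Pattern, text: str) -> bool:
    """Return True if pattern matches anywhere in text."""
    return pattern.search(text) is not None


# Illustrative user-agent strings, not an authoritative list.
AGENTS = ["Scooter/3.2", "Scooter/3.3.vscooter", "Googlebot/2.1", "Mozilla/4.0"]

# Both forms should agree on every sample.
same_result = all(found(PADDED, ua) == found(BARE, ua) for ua in AGENTS)
```

With these patterns a path such as "/notes/readmetxt" passes the loose suffix test but not the strict one, while "/notes/readme.txt" passes both.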

IIRC, the issue with Scooter was the re-deployment of Scooter/1.0 to spider images. It did not properly obey robots.txt. I haven't seen any problems with later versions of Scooter, but of course, YMMV.

Jim

wkitty42




msg:395810
 5:53 am on Jun 3, 2003 (gmt 0)

guillermo5000,

apologies, dude... there is a difference between chatter in these and other forums and business oriented email... if what you posted was what you sent to them via email, oh well...

again, apologies... there aren't and grammar (not spelling!) checkers for these forums... heck, i can't even figure out how to click on the link so that it carries me to only the new posts instead of having to wade thru all the previous posts that i've already read and still maintain a link to the past postings...

balam




msg:395811
 8:40 pm on Jun 3, 2003 (gmt 0)

IIRC, the issue with Scooter was the re-deployment of Scooter/1.0 to spider images. It did not properly obey robots.txt. I haven't seen any problems with later versions of Scooter, but of course, YMMV.

Indeed it does...

216.39.48.114 - - [01/May/2003:07:28:17 -0800] "GET /robots.txt HTTP/1.1" 200 2347 "-" "Scooter/3.3.vscooter"
216.39.48.114 - - [01/May/2003:07:28:17 -0800] "GET /someimage.jpg HTTP/1.1" 200 41331 "http://www.mysite.com/somefile.shtml" "Scooter/3.3.vscooter"
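Entries like these can be picked out of a log automatically. Below is a minimal Python sketch, assuming the log is in the combined format shown above; the names LOG_LINE, IMAGE_PATH and image_hits are hypothetical, and the list of image extensions is a guess at what an image directory might hold.

```python
import re

# Combined log format, as in the two sample lines above.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Assumed set of image extensions; adjust to taste.
IMAGE_PATH = re.compile(r"\.(jpe?g|gif|png)$", re.IGNORECASE)

SAMPLE_LINES = [
    '216.39.48.114 - - [01/May/2003:07:28:17 -0800] "GET /robots.txt HTTP/1.1" '
    '200 2347 "-" "Scooter/3.3.vscooter"',
    '216.39.48.114 - - [01/May/2003:07:28:17 -0800] "GET /someimage.jpg HTTP/1.1" '
    '200 41331 "http://www.mysite.com/somefile.shtml" "Scooter/3.3.vscooter"',
]


def image_hits(lines, agent_substring="Scooter"):
    """Return (ip, path, agent) for image requests made by a matching agent."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m and agent_substring in m["agent"] and IMAGE_PATH.search(m["path"]):
            hits.append((m["ip"], m["path"], m["agent"]))
    return hits
```

Run against the two sample lines, image_hits(SAMPLE_LINES) reports only the /someimage.jpg request.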

But that's not the only reason I'm unhappy with Scooter. There are numerous 'burps'...

216.39.48.34 - - [05/May/2003:17:44:10 -0700] "GET /inde HTTP/1.0" 302 306 "-" "Scooter/3.3"

And then there's Scooter/3.3_SF, who has an unhealthy fascination with 4 pages of mine. Fetched on a (almost?) daily basis, three of these pages have not changed at all (including re-uploading them, so the "Last-Modified" date hasn't changed) since they were added to the site a couple of years ago. The fourth is updated about every six months... I'd love to know what warrants such attention. A page of mine that automagically updates itself every two hours is steadfastly ignored...

(Actually, the page is updated with my own, very well behaved bot...)

I don't forget Scooter/3.2, but 3.2 forgets me. Months go by before it bothers to re-index the site... That's some fresh database AV has.

Jim, do you know when Scooter/1.0 was redeployed?

Meanwhile, in other news...

Thrust!
i place a great amount of weight on proper commumication capabilities... [...] that one word would have tossed you into my questionable catagory...

Parry!
I guess I don't have your superior commumication skills.

Oooo... Stumble!
there aren't and grammar (not spelling!) checkers

Enter the dogs...
If years of Usenet taught me anything, it's that you DON'T call up people for spelling or grammatical errors - or a distinct lack of understanding of what the SHIFT key is for ;) - because it only turns a big magnifying glass on you and your posts. Plus, spelling & grammar checkers offer nothing when youse gotsta actooally speek to a client.

balam

dvduval




msg:395812
 8:47 pm on Jun 3, 2003 (gmt 0)

It's kind of funny that the most successful search engine has a rep that is active at WebmasterWorld. I would take a search engine more seriously if they were here answering questions at WebmasterWorld. I've never seen anyone from Altavista, just Google and Inktomi.

jdMorgan




msg:395813
 9:46 pm on Jun 3, 2003 (gmt 0)

dvduval... Don't forget FAST - They popped in for a while last year(?)

balam,

It looks like Vscooter might just be a renamed Scooter/1.0 - I assume you've placed all jpegs off-limits to vscooter in your robots.txt, and that that is why you consider your logs to show a violation... I'm not totally clear on this. Scooter 1.0 was active several months ago - maybe late last year, and elicited several negative posts over in the SE Spider Identification forum, IIRC. I blocked it with .htaccess myself, and will keep an eye out in case your report indicates a new name for the badly-behaved Scooter/1.0.

You might want to check the AV SERP listing for your page which is the object of "Scooter's unhealthy fascination." You will likely find it marked, "Updated in the last 24 hours" in their SERPs. AV's been working on a "Freshbot" of their own. Despite the fact that this page is, for you, the "wrong page" for frequent updating, consider the possibilities... :) I believe that frequent updating indicates that AV "likes" the page, ranking it highly using whatever method of page-ranking they currently use. I have a VERY static page that they like to update every two days... Seems to boost the click-through rate when the searchers see that it's "fresh", so I don't complain. To paraphrase Dr. Seuss in "Horton Hears a Who" - "More traffic's more traffic, no matter how small." :)

All,
In the interest of the purpose and "Subtitle" of the WebmasterWorld site as a whole, please let's drop the unfortunate grammar/spelling aspect of this thread, forgive and forget about it, and move on - it serves no useful purpose in a discussion of AltaVista.
Thanks to all in advance.

Jim

rbs10025




msg:395814
 9:11 pm on Jun 4, 2003 (gmt 0)


Since it seems the code posted above may "get around"...

Adding start anchors to speed up processing where possible, and removing some unneeded stuff, such as ".*" on unanchored patterns and redundant ua strings such as Inktomi/Slurp:

RewriteCond %{HTTP_USER_AGENT} Ask.Jeeves [OR]
RewriteCond %{HTTP_USER_AGENT} ^FAST-WebCrawl [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia\_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} InfoSeek [OR]
RewriteCond %{HTTP_USER_AGENT} ^Scooter [OR]
RewriteCond %{HTTP_USER_AGENT} Slurp [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teoma [OR]
RewriteCond %{HTTP_USER_AGENT} VoilaBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Googlebot

I have seen a couple of these robots change their UA tag in the past such that the keyword which was once at the start of the string moved to the middle. This is why I did not enforce that, e.g., "Scooter" be at the start of the string.
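A quick Python sketch of that concern, assuming re.search is a fair stand-in for how RewriteCond applies these simple patterns. The second user-agent string is hypothetical, invented to show a keyword that has moved away from the start; it is not a string Scooter is known to have sent.

```python
import re

ANCHORED = re.compile(r"^Scooter")   # must appear at the start of the string
UNANCHORED = re.compile(r"Scooter")  # may appear anywhere in the string

OLD_UA = "Scooter/3.2"
MOVED_UA = "Mozilla/4.0 (compatible; Scooter/3.3)"  # hypothetical renamed agent


def catches(pattern: re.Pattern, user_agent: str) -> bool:
    """Return True if the pattern would select this user agent."""
    return pattern.search(user_agent) is not None
```

Both patterns catch the old string, but only the unanchored one still catches the agent after the keyword moves. The anchored form trades that robustness for a cheaper match.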

Additionally, I do see separate tracks for "Inktomi Search" that are not the same as "Slurp". Possibly this is because the bot running "Inktomi Search" is on a .gov search engine and my office site is a .gov; your site may not be seeing "Inktomi Search" at all.

BTW: I've since added QuepasaCreep to the above list. I'm not sure though that InfoSeek needs to be listed as it seems like it's been ages since I last saw them.


RewriteRule !\.(html|htm|txt)$ /www/msgs/badagent.html [F]

The original code as posted will disallow all of these user-agents from all subdirectories; if you've copied it, make sure that's what you want to do. Otherwise, remove the "|/" at the end of the RewriteRule as shown above.

My original posting said


RewriteRule !.*(html|htm|txt|/)$ /www/msgs/badagent.html [F]

The "/" option in the match does *not* prevent access to subdirectories. The reason it's there is that a request for a directory index file that omits the "index.html" or "index.htm" (i.e., the request ends in a slash, like "GET /foo/") would otherwise be disallowed.
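The difference can be checked with a short Python sketch, assuming the alternation separators are pipe characters and that a suffix test with re.search stands in for the negated RewriteRule pattern. The helper name blocked and the sample paths are made up for illustration.

```python
import re

WITH_SLASH = re.compile(r".*(html|htm|txt|/)$")  # original pattern
WITHOUT_SLASH = re.compile(r"\.(html|htm|txt)$")  # revised pattern


def blocked(pattern: re.Pattern, path: str) -> bool:
    """The rule is negated, so a path is refused when the pattern fails."""
    return pattern.search(path) is None
```

Under this model "/foo/" is refused by the revised pattern but allowed by the original, "/foo/bar/page.html" is allowed by both, and "/foo/data.bin" is refused by both. In other words the "/" alternative only affects requests that end in a slash.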

jdMorgan




msg:395815
 3:37 pm on Jun 5, 2003 (gmt 0)

rbs10025,

Duly noted on Inktomi... No, I haven't seen that "Inktomi Search" user-agent.

With your comment about "/" in mind, we may also want to add a RewriteCond to handle the case of URLs of both the form "http://www.yourdomain.com" and "http://www.yourdomain.com/", especially for the case where this code is used in .htaccess in a per-directory context, where the leading "/" is not available to RewriteRule:

RewriteCond %{HTTP_USER_AGENT} Ask.Jeeves [OR]
RewriteCond %{HTTP_USER_AGENT} ^FAST-WebCrawl [OR]
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia\_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} InfoSeek [OR]
RewriteCond %{HTTP_USER_AGENT} Inktomi [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Scooter [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teoma [OR]
RewriteCond %{HTTP_USER_AGENT} VoilaBot
RewriteCond %{REQUEST_URI} !(/$|^$)
RewriteRule !\.(html|htm|txt)$ /www/msgs/badagent.html [F]
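As a sanity check on how the conditions combine, here is a rough Python model of this version. It assumes the [OR]-chained user-agent conditions form one group that is then AND'ed with the REQUEST_URI condition, that the separators are pipe characters, and that testing the full request URI gives the same answer as the per-directory path for these suffix patterns. The function name and sample requests are hypothetical.

```python
import re

# One entry per user-agent RewriteCond; any single match selects the agent.
AGENT_PATTERNS = [
    re.compile(r"Ask.Jeeves"),
    re.compile(r"^FAST-WebCrawl"),
    re.compile(r"^Googlebot"),
    re.compile(r"^ia_archiver"),
    re.compile(r"InfoSeek"),
    re.compile(r"Inktomi", re.IGNORECASE),  # models the [NC] flag
    re.compile(r"^Scooter"),
    re.compile(r"^Teoma"),
    re.compile(r"VoilaBot"),
]

DIRECTORY_REQUEST = re.compile(r"(/$|^$)")    # ends in "/" or is empty
TEXT_FILE = re.compile(r"\.(html|htm|txt)$")  # extensions the bots may fetch


def is_forbidden(user_agent: str, request_uri: str) -> bool:
    """Approximate whether this rule set would answer 403."""
    if not any(p.search(user_agent) for p in AGENT_PATTERNS):
        return False  # agent is not on the list
    if DIRECTORY_REQUEST.search(request_uri):
        return False  # directory-style requests are exempt
    return TEXT_FILE.search(request_uri) is None
```

In this model a listed agent may fetch "/" and "/docs/" but not "/data/big.bin", and "inktomi search" in any letter case counts as listed. A directory requested without its trailing slash, such as "/docs", would still be refused.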

Jim

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved