5/30/2003,5:42:48 AM, ,216.39.48.20,Scooter/3.2,mailto:crawl-support@av.com
Many emails to AV about some of the Scooter bots disregarding robots.txt have gone unanswered, so it's simple: If you don't behave, you're not welcome.
No big loss, either... I get more quality referrals from Google Thailand than all referrals in total from all the global AV sites.
balam
RewriteCond %{HTTP_USER_AGENT} .*Ask.Jeeves.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*FAST.WebCrawl.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*ia_archiver.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*InfoSeek.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*Inktomi.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*Scooter.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*Slurp.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*Teoma.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*VoilaBot.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*Google.*
RewriteRule !.*(html|htm|txt|/)$ /www/msgs/badagent.html [F]
Welcome to Webmaster World!
But, it's not enough to save Scooter... :) AV just irks me too much to let them come back. If they become any sort of player in the SE game, I suppose I'll have to rethink my position, but until that day...
So, now that I've taken this step, can anyone tell me what will happen next? My site's been completely (over)indexed by AV, but now that they're not welcome, what becomes of me in their index?
I suppose I'll be progressively dropped over the next (couple of?) months, since AV can no longer verify the existence of any of my pages, yes?
balam
Welcome to the Webmasterworld!
Sweet code - thanks for posting it. Think I shall have to borrow it! :)
As I posted in another forum, I am having probs on the AV serps. One site - a site with 14,000 pages in Google! - has one page - the index - in AV. Another VERY popular site is not even there at all, and hasn't been for 3-4 years. I do NOT know what AV's problem is.
I seem to remember having a problem with their spider going where it did not belong, but I never banned it. But I have not seen a Scooter around in a LONG time.
You know who else has a VERY poor spider- always going where he does not belong? Jeeves.
dave
I wrote altavista late last year about their misbehaving picture bot. Their reply, clearly fresh out of a can, had nothing to do with my question.
I wrote them back and put s p a c e s in the words that a program might search for to automate a reply. Guess what...no reply. So their picture bot was sent away for good.
Then, in March of this year, I wrote them about their seemingly useless Basic Submit.
Again, I got a response that was all about Express Inclusion, nothing about the Basic Submit that I asked about.
I again, very politely, asked them my question, and once again I received a blurb about spamming their search engine; something that had nothing to do with my question.
This was my response:
WOW! No wonder your search engine is no longer an important part of today's SEO strategy. You people cannot even answer a question. Truthfully I do not need an answer to my question because as I said before, since you introduced the Express Inclusion program, we have gotten zero (0) well optimized and information filled sites added to AltaVista's database.
Truthfully most SEO's do not even waist time with AltaVista, and now we see why.
Good Day, mysterious unnamed canned-response person.
i'm sure that would have come across except for the
misspelling of waste... sorry to do that but as a business
owner, i place a great amount of weight on proper
commumication capabilities... many others do, too... that
one word would have tossed you into my questionable
catagory...
no disrespect intended... i hope you understand...
Adding start anchors to speed up processing where possible, and removing some unneeded stuff, such as ".*" on unanchored patterns and redundant ua strings such as Inktomi/Slurp:
RewriteCond %{HTTP_USER_AGENT} Ask.Jeeves [OR]
RewriteCond %{HTTP_USER_AGENT} ^FAST-WebCrawl [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia\_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} InfoSeek [OR]
RewriteCond %{HTTP_USER_AGENT} ^Scooter [OR]
RewriteCond %{HTTP_USER_AGENT} Slurp [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teoma [OR]
RewriteCond %{HTTP_USER_AGENT} VoilaBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Googlebot
RewriteRule !\.(html|htm|txt)$ /www/msgs/badagent.html [F]
The original code as posted will disallow all of these user-agents from all subdirectories; if you've copied it, make sure that's what you want to do. Otherwise, remove the "|/" at the end of the RewriteRule as shown above.
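For anyone who wants to sanity-check the cleanup before touching a live .htaccess, here is a rough offline sketch in Python. It assumes Python's re module treats patterns this plain the same way mod_rewrite's regex engine does, and apart from "Scooter/3.2" the user-agent strings are illustrative, not taken from real logs. It shows that the padded ".*Scooter.*" and a bare "Scooter" accept exactly the same strings, so the ".*" only adds work.

```python
import re

# Illustrative user-agent strings; only "Scooter/3.2" appears in this thread.
user_agents = [
    "Scooter/3.2",
    "Mozilla/2.0 (compatible; Ask Jeeves/Teoma)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
]

def matches(pattern, ua):
    """Unanchored search, which is how a RewriteCond pattern without ^ or $ behaves."""
    return re.search(pattern, ua) is not None

for ua in user_agents:
    padded = matches(r".*Scooter.*", ua)   # original style
    plain = matches(r"Scooter", ua)        # cleaned-up style
    print(f"{ua!r}: padded={padded} plain={plain}")

# The dot in "Ask.Jeeves" matches any single character, including the space.
print(matches(r"Ask.Jeeves", user_agents[1]))
```

The two columns should always agree; if they ever differ for a pattern of your own, the pattern is probably anchored somewhere.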
IIRC, the issue with Scooter was the re-deployment of Scooter/1.0 to spider images. It did not properly obey robots.txt. I haven't seen any problems with later versions of Scooter, but of course, YMMV.
Jim
apologies, dude... there is a difference between chatter in these and other forums and business oriented email... if what you posted was what you sent to them via email, oh well...
again, apologies... there aren't and grammar (not spelling!) checkers for these forums... heck, i can't even figure out how to click on the link so that it carries me to only the new posts instead of having to wade thru all the previous posts that i've already read and still maintain a link to the past postings...
IIRC, the issue with Scooter was the re-deployment of Scooter/1.0 to spider images. It did not properly obey robots.txt. I haven't seen any problems with later versions of Scooter, but of course, YMMV.
Indeed it does...
216.39.48.114 - - [01/May/2003:07:28:17 -0800] "GET /robots.txt HTTP/1.1" 200 2347 "-" "Scooter/3.3.vscooter"
216.39.48.114 - - [01/May/2003:07:28:17 -0800] "GET /someimage.jpg HTTP/1.1" 200 41331 "http://www.mysite.com/somefile.shtml" "Scooter/3.3.vscooter"
But that's not the only reason I'm unhappy with Scooter. There are numerous 'burps'...
216.39.48.34 - - [05/May/2003:17:44:10 -0700] "GET /inde HTTP/1.0" 302 306 "-" "Scooter/3.3"
And then there's Scooter/3.3_SF, who has an unhealthy fascination with 4 pages of mine. Fetched on an (almost?) daily basis, three of these pages have not changed at all (including re-uploading them, so the "Last-Modified" date hasn't changed) since they were added to the site a couple of years ago. The fourth is updated about every six months... I'd love to know what warrants such attention. A page of mine that automagically updates itself every two hours is steadfastly ignored...
(Actually, the page is updated with my own, very well behaved bot...)
I don't forget Scooter/3.2, but 3.2 forgets me. Months go by before it bothers to re-index the site... That's some fresh database AV has.
Jim, do you know when Scooter/1.0 was redeployed?
Meanwhile, in other news...
Thrust!
i place a great amount of weight on proper commumication capabilities... [...] that one word would have tossed you into my questionable catagory...
Parry!
I guess I don't have your superior commumication skills.
Oooo... Stumble!
there aren't and grammar (not spelling!) checkers
Enter the dogs...
If years of Usenet taught me anything, it's that you DON'T call up people for spelling or grammatical errors - or a distinct lack of understanding of what the SHIFT key is for ;) - because it only turns a big magnifying glass on you and your posts. Plus, spelling & grammar checkers offer nothing when youse gotsta actooally speek to a client.
balam
balam,
It looks like Vscooter might just be a renamed Scooter/1.0 - I assume you've placed all jpegs off-limits to vscooter in your robots.txt, and that that is why you consider your logs to show a violation... I'm not totally clear on this. Scooter 1.0 was active several months ago - maybe late last year, and elicited several negative posts over in the SE Spider Identification forum, IIRC. I blocked it with .htaccess myself, and will keep an eye out in case your report indicates a new name for the badly-behaved Scooter/1.0.
You might want to check the AV SERP listing for your page which is the object of "Scooter's unhealthy fascination." You will likely find it marked, "Updated in the last 24 hours" in their SERPs. AV's been working on a "Freshbot" of their own. Despite the fact that this page is, for you, the "wrong page" for frequent updating, consider the possibilities... :) I believe that frequent updating indicates that AV "likes" the page, ranking it highly using whatever method of page-ranking they currently use. I have a VERY static page that they like to update every two days... Seems to boost the click-through rate when the searchers see that it's "fresh", so I don't complain. To paraphrase Dr. Seuss in "Horton Hears a Who" - "More traffic's more traffic, no matter how small." :)
All,
In the interest of the purpose and "Subtitle" of the WebmasterWorld site as a whole, please let's drop the unfortunate grammar/spelling aspect of this thread, forgive and forget about it, and move on - it serves no useful purpose in a discussion of AltaVista.
Thanks to all in advance.
Jim
Since it seems the code posted above may "get around"... Adding start anchors to speed up processing where possible, and removing some unneeded stuff, such as ".*" on unanchored patterns and redundant ua strings such as Inktomi/Slurp:
RewriteCond %{HTTP_USER_AGENT} Ask.Jeeves [OR]
RewriteCond %{HTTP_USER_AGENT} ^FAST-WebCrawl [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia\_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} InfoSeek [OR]
RewriteCond %{HTTP_USER_AGENT} ^Scooter [OR]
RewriteCond %{HTTP_USER_AGENT} Slurp [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teoma [OR]
RewriteCond %{HTTP_USER_AGENT} VoilaBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Googlebot
I have seen a couple of these robots change their UA tag in the past such that the keyword which was once at the start of the string moved to the middle. This is why I did not enforce that, e.g., "Scooter" be at the start of the string.
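A quick way to see that trade-off, sketched in Python on the assumption that its re module behaves like mod_rewrite's engine for patterns this simple. The first string is from the logs earlier in the thread; the second is a made-up example of a UA where the keyword has moved to the middle.

```python
import re

anchored = re.compile(r"^Scooter")    # faster to reject, but position-sensitive
unanchored = re.compile(r"Scooter")   # matches the keyword anywhere in the string

user_agents = [
    "Scooter/3.3.vscooter",                   # seen in the logs above
    "Mozilla/4.0 (compatible; Scooter/3.3)",  # hypothetical renamed UA
]

for ua in user_agents:
    print(f"{ua!r}: anchored={bool(anchored.search(ua))} "
          f"unanchored={bool(unanchored.search(ua))}")
```

The anchored pattern misses the second string, so a ban that relies on "^Scooter" would quietly stop working if the bot ever changed its UA that way.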
Additionally, I do see separate tracks for "Inktomi Search" that are not the same as "Slurp". Possibly this is because the bot running "Inktomi Search" is on a .gov search engine and my office site is a .gov; your site may not be seeing "Inktomi Search" at all.
BTW: I've since added QuepasaCreep to the above list. I'm not sure though that InfoSeek needs to be listed as it seems like it's been ages since I last saw them.
RewriteRule !\.(html|htm|txt)$ /www/msgs/badagent.html [F]
The original code as posted will disallow all of these user-agents from all subdirectories; if you've copied it, make sure that's what you want to do. Otherwise, remove the "|/" at the end of the RewriteRule as shown above.
My original posting said
RewriteRule !.*(html|htm|txt|/)$ /www/msgs/badagent.html [F]
The "/" option in the match does *not* prevent access to subdirectories. The reason it's there is that a request for a directory index file that omits the "index.html" or "index.htm" (i.e., the request ends in a slash, like "GET /foo/") would otherwise be disallowed.
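To make that concrete, here is a small Python sketch of how the negated pattern sorts requests. It assumes the rule sits in the main server config, where the path still has its leading slash, and that Python's re module agrees with mod_rewrite on a pattern this simple. The paths are made-up examples.

```python
import re

# Pattern from the original RewriteRule, without the leading "!".
ALLOWED_SUFFIX = re.compile(r".*(html|htm|txt|/)$")

def is_forbidden(path):
    """Mimic the "!" negation: the rule fires when the path does NOT match."""
    return ALLOWED_SUFFIX.search(path) is None

# Hypothetical request paths for illustration.
for path in ["/foo/", "/foo/index.html", "/foo/bar/notes.txt", "/foo/logo.jpg"]:
    print(path, "->", "403" if is_forbidden(path) else "allowed")
```

Directory requests ending in a slash and pages in subdirectories get through; only the image request is refused.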
Duly noted on Inktomi... No, I haven't seen that "Inktomi Search" user-agent.
With your comment about "/" in mind, we may also want to add a RewriteCond to handle the case of URLs of both the form "http://www.yourdomain.com" and "http://www.yourdomain.com/", especially for the case where this code is used in .htaccess in a per-directory context, where the leading "/" is not available to RewriteRule:
RewriteCond %{HTTP_USER_AGENT} Ask.Jeeves [OR]
RewriteCond %{HTTP_USER_AGENT} ^FAST-WebCrawl [OR]
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia\_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} InfoSeek [OR]
RewriteCond %{HTTP_USER_AGENT} Inktomi [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Scooter [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teoma [OR]
RewriteCond %{HTTP_USER_AGENT} VoilaBot
RewriteCond %{REQUEST_URI} !(/$|^$)
RewriteRule !\.(html|htm|txt)$ /www/msgs/badagent.html [F]
Jim
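For anyone following along, a rough Python sketch of why the extra REQUEST_URI condition matters in a per-directory setup. It assumes a root-level .htaccess, where the RewriteRule sees the path with its leading slash removed while REQUEST_URI keeps it, and it assumes the user-agent conditions have already matched. The function name and the paths are hypothetical.

```python
import re

PAGE_SUFFIX = re.compile(r"\.(html|htm|txt)$")  # RewriteRule pattern, minus the "!"
DIR_OR_EMPTY = re.compile(r"(/$|^$)")           # RewriteCond pattern, minus the "!"

def is_forbidden(request_uri, use_cond=True):
    """Assumes the bot's user-agent has already matched one of the UA conditions."""
    # In a root-level .htaccess the rule is matched against the path without "/".
    rule_path = request_uri[1:] if request_uri.startswith("/") else request_uri
    if use_cond and DIR_OR_EMPTY.search(request_uri):
        return False  # the negated RewriteCond fails, so the rule does not apply
    return PAGE_SUFFIX.search(rule_path) is None

# Hypothetical requests for illustration.
for uri in ["/", "/foo/", "/index.html", "/pics/a.jpg"]:
    print(uri, "without cond:", is_forbidden(uri, use_cond=False),
          "with cond:", is_forbidden(uri))
```

Without the condition, a request for the bare domain reaches the rule as an empty string and gets refused; with it, directory-style requests pass and everything else is still filtered by extension.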