Spider Trap needs GOOD Spider list

Forum Moderators: open

Message Too Old, No Replies

Spider Trap needs GOOD Spider list

Having setup a good spider trap I need to exclude the good ones...

scottspence

10:41 am on Jul 8, 2002 (gmt 0)

Hi,

I have setup a Spider Trap and currently exclude the following Spiders from the trap:

.googlebot.com Googlebot
.av.com FAST-WebCrawler
.northernlight.com Gulliver
.infoseek.com InfoSeek
.lycos.com Lycos
.av.pa-x.dec.com Scooter
.directory.mozilla.org Robozilla
.inktomi.com Slurp
.inktomisearch.com Slurp
.quicknet.nl appie

I am looking for a list of 'good' bots :) that are valid indexes for search engines and not email harvesters or other such nastiness :(

Does such a list exist? What important bots have I missed from the list above?

Thanks for your help.

Cheers

Scott

fathom

11:46 am on Jul 8, 2002 (gmt 0)

try here

[iplists.com...]

Dreamquick

12:01 pm on Jul 8, 2002 (gmt 0)

Scott,

First I'm probably going to ask a *very* stupid question, but if you want to identify good/bad spiders (presumably by behaviour) then surely the bad ones are bad irregardless of whether they are spambots or just a name-brand SE having a bad day?

But getting somewhere back onto the topic...

The most reliable method to get a list of good robots is to make a list of all the UAs that have visited your site so far - once you have this you can begin to identify bots which visit your site.

Not only can you see which good bots you haven't got in your list, this also gives you real UA data to test your rules against.

Tony

Woz

12:46 pm on Jul 8, 2002 (gmt 0)

Scott,

the challenge is that if you assume a bot is bad unless it is on the good list, then how do you know when a new "good" bot comes along?

Better to assume all bots are good unless they do something bad at which time you add them to the spidertrap list.

So rather than setting up your Spider Trap to exclude Good Bots from the trap, set it up to trap the bad ones on the spidertrap list you will be building.

Onya
Woz

scottspence

2:06 pm on Jul 8, 2002 (gmt 0)

Hi Tony,

Good point about the spider behaviour.

<quote>
Bad ones are bad irregardless of whether they are spambots or just a name-brand SE having a bad day?
</quote>

But I have to gently balance the benefits of having my website indexed and having a bot not respect my robots.txt file or do some other 'low level nastiness'. I monitor my log files reguarly and if even the nice ones overstep my boundaries then I will stop them.

In addition, my spider trap records every spider, and only those that I like are allowed to access the site, the spider names do not at times lend to obvious

Woz,

Good point but what about a 'good' spam bot who harvests all email addresses. I don't want that happening even if they do adhere to the robots.txt commands! In addition my spider trap is not designed to measure how many files one bot harvests within a given period - too many (usually harvesters) request 100+ file every second or minute. I'd rather this bandwidth was used for humans not feeding a database somewhere.

The flaw is that I must accept (at this stage) the intensity of spiders from 'friendly' bots. I will get have to fine tune my trap to control these ones!

Thanks Fathom for the link - very useful indeed!

Cheers

Scott

Dreamquick

3:28 pm on Jul 8, 2002 (gmt 0)

Scott,

I do not believe there is such a thing as a "good" spambot - if they are honest their true UA will be in a list of banned bots and if they aren't honest well then that makes them bad...

<rant>
Once they start cloaking their true UA it all takes on the feel of an arms race - for every attempt they make to hide there will always be something that gives them away (malformed UA, fixed referrer or just sloppily coded requests).

If it's of any conselation most spambots are so dumb it's almost not funny - those simple substitutions / entity encodes you see explained / criticised so often will stop 99% of the "big-name" spambots as they run purely on pattern matching... Why is this? Because the people that wrote them are lazy, they value capture volume over improved hit-to-miss ratios and most of all they clearly still make money from selling their over-priced "applications" to wannabe e-marketeers.

That said I have seen one spambot that can decypher the simple substitutions and get a working e-mail out the other end, also it used the client's IE user-agent string for it's own UA, resulting in a near-perfect stealth (I said near-perfect as it used a sloppy http request header which gave the game away).
</rant>

Tony

scottspence

3:43 pm on Jul 8, 2002 (gmt 0)

Hi Tony,

yes, you are right about SPAM bots but there are more lurgies out there - robots that index your whole site just for the fun, I had Altavista index my entire site about 10 times in 1 day - causing a huge bandwidth hog and costing me money! Having said this my spider trap will not catch altavista out (yet) but will stop others.

My trap uses more than just the UA I also use the domain name and can use the IP address, I have hidden links but tell the good bots not to touch them. I need to id the good ones to tell them the secret of where the traps are. I'd rather look for good than bad spiders (and the SPAM bots can go to hell!) the good ones tend to keep their UA (right?)

Cheers

Scott
PS nice humans get a pretty image that the bots do not tend to download! Explaining that they must contact me to get taken off the banned list

jdMorgan

4:49 pm on Jul 8, 2002 (gmt 0)

Since no-one has brought this up specifically, you need to be very careful
about what you are doing and how it is implemented. Altavista started using a
new UA recently, for example. If you have a "good guy" list, and reject requests
from anyone else, then you would have rejected their new spider (It was
Scooter_QA or somesuch name). And what if Google changes their spider name?

The danger is that a SE will come out with a new robot, which you will exclude.
You will then be removed from that search engine's index until such time as you
allow their new UA, and they get back around to crawling your site.

Another factor with banning UA's is that some are used for good AND evil.
Directories like Yahoo or ODP will send a human reviewer out to look at your
site. Sometimes they will use a tool like libwww-perl/5.53 which is also used
by site scrapers. In that case, if you ban them, you're out of the directory.
A similar thing happens when SEs send proxy agents out to check your site to
see if you are cloaking. Again, if you exclude them, you're going to be
dropped from their index.

The difficulty of this situation is what leads to the high number of posts on
this subject. It IS an on-going battle, and you have to make the right trade
offs. One member here has essentially banned half of all satellite ISP users
from his site (including me) because we must use a proxy UA and its caching
behaviour is sloppy. He thinks that's a good trade-off. I don't, but it's his
site and his business, and only he can make that decision.

Just make sure the cost of what you are doing is acceptable to you. If getting
dropped from the search engines is acceptable, then you can use a "Good Guy"
list instead of a "Bad Guy" list. Or you can ban proven-bad agents and a few
will get through before you ban them. However, you won't run the risk of losing
all of your traffic from one or more search engines. You may want to check out
the cloaking threads, since those members also have to deal with this problem.

HTH,
Jim

scottspence

5:26 pm on Jul 8, 2002 (gmt 0)

Thanks Jim,

I see a big hole in creating a bad guy list: there are too many of them.

And they all have dynamic IPs meaning that they make me block innocent people (such as you suggested) from entering the site.

I am doing something different, I will 'purge' the dynamically created nasty list every so often. All I need is two requests to ban a bot. I am prepared to put up with 2 requests every 24 hours from these 'orible monsters.

There is of course an alternative to my good bot list:
I could assume that if a bot is nice enough to read my robots.txt and obey it then they must be a friend and I should warn all bots that read my robots.txt about my traps.

- Those that don't read robots.txt do not get told about my traps - and get caught.

- Those who read my robots.txt file but don't obey it also get caught.

- Some like Googlebot can do no wrong and are too good for business - I can tolerate them doing bad things (after all they may be having a bad day!).

Any flaws to my cunning plan?

What does SE mean anyway?

Cheers

Scott :-)

mack

5:34 pm on Jul 8, 2002 (gmt 0)

if you go to searchengineworld you can use one of the robots.txt files that Brett has been good enough to offer.

If I remember rightly there three files///one to exclude alll, one to allow all and one to aonly alow the good guys. Since I used it i have had less hassle from harvester bots on my log files.

jdMorgan

9:22 pm on Jul 8, 2002 (gmt 0)

Scott,

Sorry - SE is "Search Engine." For other acronyms you may see used here, take
a look at the WebmasterWorld glossary link at the top of the page. Also, the
WebmasterWorld site search link may be very helpful to you in building your
user agent lists. Search for "User-agent", "Disallow", and ".htaccess".

Even the "good guys" may read your Disallowed files, so be careful. Due to
symantic details, robots are allowed to read disallowed pages, they are only
disallowed from indexing them - listing them in search results. At least, that
is Googlebot's interpretation of the rules, and is cause for concern in your
robot-warning plan. See this thread:
[webmasterworld.com...]

There *are* a ton of bad guys out there - good luck doing battle! :)

Jim

scottspence

11:32 pm on Jul 8, 2002 (gmt 0)

Oh Boy Jim,

This *really* throws me. Google does not obey robots.txt - anarchy prevails. Well, I guess I will just have to keep a watchful eye on the catch-em logs. I never expected a completely automated response to 'orrible bots but I have enough logs to keep me busy on my linux box!

SE = Search Engine - it is obvious now!

I will let you know about any successes I have!

Cheers

Scott

jdMorgan

12:33 am on Jul 9, 2002 (gmt 0)

Scott,

It'a a bloody lot of work to keep up with the ogres... The rule I use is to
strike a balance between the time spent deleting spam e-mails and the time
spent tweaking .htaccess filters...

I also removed almost all e-mail addresses from our site, and replaced them
with secure e-mail forms.

However, if you block larbin, Indy Library, Beijing Express, any UA with
"email" or "e-mail" in it, plus a few others, you'll stop about half of the
problem spam-wise.

The fact that Googlebot and other spiders do read Disallowed files - even
though they won't show them in their search results - derailed a lot of our
efforts to simplify things by allowing good guys instead of disallowing bad
guys. If you read Keymaster's post - the one I cited above - you'll see that
it caught a lot of us off-guard.

Best,
Jim

Key_Master

12:50 am on Jul 9, 2002 (gmt 0)

The fact that Googlebot and other spiders do read Disallowed files - even though they won't show them in their search results - derailed a lot of our efforts to simplify things by allowing good guys instead of disallowing bad guys.

Googlebot (or any other robots.txt behaving spider) is not supposed to crawl prohibited directories or files. Sometimes they do. Why? Most often it's a robots.txt problem. I believe the rest are caused by buggy spider software on the search engine side.

If Googlebot (or any other spider) is crawling your prohibited files, check your robots.txt. If it's valid, you need to let the search engines know what is going on. They are NOT supposed to do this.

jdMorgan

4:44 am on Jul 9, 2002 (gmt 0)

K-M,

Yeah, following Googleguy's post on your thread, I am keeping a sharp eye out
for Googlebot actually loading Disallowed pages. I'm not sure I understand
Google's distinction of a linked page versus a spidered page. As far as I'm
concerned, if Googlebot finds a link to a page, it should check the robots.txt
at the root of that page's site immediately, and drop the link cold if
robots.txt Disallows it... or at least not make it visible on Google until it
has checked robots.txt. Then if there is no robots.txt or if the page is not
Disallowed in robots.txt, it can/should load it and check for a robots metatag.
If the metatag says "follow", then grab the links on that page. And if the
metatag also says "noindex" - then drop that page and follow the links if
permitted. Jeez, that would be easier to write correctly and completely using
"case" statements wouldn't it? ;)

I think this "linked pages" thing may be why I sometimes see off-limits pages
listed on Google. I haven't been paying that much attention recently, though -
I have been too busy fighting off spambots. But if I search by domain name, I
can usually turn up a few Disallowed files, especially during the dance. Google
does drop them eventually, but they hang around for a few days at least. I
think it may have to do with a delay between when Googlebot finds a link to a
page and when it checks robots.txt on that site. However, your post woke me
up, and now I'm watching.

My robots.txt is valid. I use Brett's nifty robots.txt validator every time I
upload a new one, too. I've also used the one at tardis.uk.something - can't
find that URL right now...

I do believe that if there is anything wrong with Googlebot's logic, it will
get fixed. Google's "policy of engagement" with us here on wmw,
and some of their actions in response to posts here, lead me to believe that
they do want to get it right - a refreshing contrast to another SE I've had
problems with in the past, whose policy was, "when in doubt, ban the domain."
They got carries away with that banning a couple of years ago, and now are
moribund...

Also, if I do catch Googlebot stumbling into the wrong place, I qualify for
Googleguy's "bonus points" - no cloaking or tricky stuff other than spambot
blocks - so maybe I can get a cool T-Shirt or something... :)

If I have anything to report after the next update, I will.

Thanks for the heads-up post!
Jim