Undesirable spiders that obey robots.txt

David_1cog

8:04 pm on Jul 10, 2004 (gmt 0)

10+ Year Member



I've spent almost the entire day researching the best methods to prevent undesirable bots / spiders / content strippers / email harvesters from accessing my sites. The best solution IMHO is [webmasterworld.com...]

What I now need is a list of common spiders that offer little or no value and obey robots.txt. I've spent hours hunting for an up-to-date list but without success (not helped by the fact the site search facility seems to be broken and keeps returning 404s - fortunately I can use Google to search the site).

jdMorgan posted June 2003 [webmasterworld.com...] :


User-agent: almaden
User-agent: ASPSeek
User-agent: baiduspider
User-agent: dumbbot
User-agent: Generic
User-agent: grub-client
User-agent: MSIECrawler
User-agent: nexabot
User-agent: NPBot
User-agent: OWR_Crawler
User-agent: psbot
User-agent: rabaz
User-agent: RPT-HTTPClient
User-agent: ScoutAbout
User-agent: semanticdiscovery
User-agent: TurnitinBot
User-agent: Wget
Disallow: /

Given how quickly the spider market changes, does anyone have an up-to-date list they'd like to share? :)

volatilegx

8:19 pm on Jul 10, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I believe "baiduspider" is a spider for a legitimate search engine.

David_1cog

8:48 pm on Jul 10, 2004 (gmt 0)

10+ Year Member



baiduspider works for a Chinese search engine (http://www.baidu.com/) and is therefore a waste of bandwidth to me and the majority of non-Chinese site owners. It's a good example of a spider that is legitimate, obeys robots.txt (?) but has no value for most of us (even though there are 1.5 billion potential customers over there ;)).

victor

9:01 pm on Jul 10, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Check out WebmasterWorld's robots.txt for a long list.

But bear in mind that it bans things like Google's image-search spider as being a waste of space for this site -- it may be a welcome visitor to many others.

digitalv

9:04 pm on Jul 10, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Isn't there a way to allow specific robots and disallow everything else?

victor

9:50 pm on Jul 10, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> allow specific robots and disallow everything else?

Sadly not, as far as I know. At least, not with robots.txt. You'd need to be thinking about .htaccess on an Apache server.
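For what it's worth, the original robots.txt convention does describe one way to do this for robots that honor the file: a record with an empty Disallow for the robot you want, followed by a `*` record that disallows everything. Below is a minimal Python sketch that checks such a file with the standard library's `urllib.robotparser`. The `WHITELIST_ROBOTS_TXT` contents and the `allowed` helper are illustrative assumptions, and "Googlebot" only stands in for whichever robot you want to admit. It does nothing about robots that ignore robots.txt; those still need server-side blocking.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist file: one named robot gets an empty Disallow
# (meaning "nothing is disallowed"), every other robot is shut out.
WHITELIST_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Return True if `agent` may fetch `path` according to `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Print the verdict for a few user-agent names mentioned in this thread.
    for agent in ("Googlebot", "psbot", "Wget"):
        print(agent, allowed(WHITELIST_ROBOTS_TXT, agent, "/index.html"))
```

This only tells you how Python's parser reads the file; a given spider's own parser may behave differently.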

There is a short-cut method that would halve the length of WebmasterWorld's robots.txt.

Instead of:


User-agent: AskJeeves
Disallow: /

User-agent: Teoma
Disallow: /

User-agent: Jeeves
Disallow: /

User-agent: WebVac
Disallow: /

User-agent: Stanford
Disallow: /

You can do it like this:


User-agent: AskJeeves
User-agent: Teoma
User-agent: Jeeves
User-agent: WebVac
User-agent: Stanford
Disallow: /

Some people say that can confuse a few spiders, but I've never noticed a problem on the sites I use the short-cut format on.
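As a rough sanity check, the two formats above can be compared with Python's standard `urllib.robotparser`. This is only a sketch: `LONG_FORM`, `SHORT_FORM` and the `blocked_agents` helper are names made up for the illustration, and the result says how Python's parser reads the file, not how any particular spider does.

```python
from urllib.robotparser import RobotFileParser

# The five user-agent names from the example above.
AGENTS = ["AskJeeves", "Teoma", "Jeeves", "WebVac", "Stanford"]

# One record per robot, separated by blank lines.
LONG_FORM = "\n\n".join(f"User-agent: {a}\nDisallow: /" for a in AGENTS) + "\n"

# One record with several User-agent lines sharing a single Disallow.
SHORT_FORM = "\n".join(f"User-agent: {a}" for a in AGENTS) + "\nDisallow: /\n"


def blocked_agents(robots_txt, agents, path="/"):
    """Return the agents from `agents` that may NOT fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [a for a in agents if not parser.can_fetch(a, path)]


if __name__ == "__main__":
    print(blocked_agents(LONG_FORM, AGENTS))
    print(blocked_agents(SHORT_FORM, AGENTS))
    # A robot that is not listed should not be blocked by either file.
    print(blocked_agents(SHORT_FORM, ["Googlebot"]))
```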

David_1cog

10:31 pm on Jul 10, 2004 (gmt 0)

10+ Year Member



I appreciate everyone's input, but could we keep this on-topic: "does anyone have an up-to-date list... of current, undesirable bots... they'd like to share". This helps everyone - not just me.

> Check out WebmasterWorld's robots.txt for a long list.

It hasn't been updated for some time AFAIK and it's a *long* list, much of which is unnecessary (?).

jdMorgan

10:45 pm on Jul 10, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> There is a short-cut method that would halve the length of WebmasterWorld's robots.txt.

I believe you're referring to this: [webmasterworld.com...]

However, the following robots can't handle that method:

appie
Gigabot/
ia_archiver
IlTrovatore-Setaccio
NationalDirectory-WebSpider
Pompos/
Scooter/ (obsolete)

baiduspider either can't handle it, or they're still disregarding robots.txt. For the spiders above, I use mod_rewrite to feed them a simplified robots.txt depending on user-agent.
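One way to produce such a simplified file is to expand every grouped record into one record per robot. The Python sketch below does only that; it is not the mod_rewrite rule itself, and `expand_grouped_records` is a hypothetical helper written for this illustration. It assumes a new record starts at the first User-agent line that follows a rule line, and it drops comments and blank lines.

```python
def expand_grouped_records(robots_txt: str) -> str:
    """Rewrite records so each User-agent line gets its own copy of the rules."""
    records = []            # list of (agent_lines, rule_lines) pairs
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue        # blank lines and comments are dropped
        field = line.split(":", 1)[0].strip().lower()
        if field == "user-agent":
            if rules:       # a rule was already seen, so a new record begins
                records.append((agents, rules))
                agents, rules = [], []
            agents.append(line)
        else:
            rules.append(line)
    if agents:
        records.append((agents, rules))
    # Emit one stand-alone record per user-agent line.
    blocks = [
        "\n".join([agent, *rec_rules])
        for rec_agents, rec_rules in records
        for agent in rec_agents
    ]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    grouped = "User-agent: AskJeeves\nUser-agent: Teoma\nDisallow: /\n"
    print(expand_grouped_records(grouped))
```

The expanded text could then be saved as a second file and served to the robots listed above, while everyone else gets the grouped version.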

As far as a list of "bad" 'bots that *do* obey robots.txt goes, there are a few that sometimes do, but sometimes don't.

Here's a relatively recent list I've used:

User-agent: ASPSeek
User-agent: asterias
User-agent: baiduspider
User-agent: BravoBrian
User-agent: dloader(NaverRobot)
User-agent: Dumbot
User-agent: EgotoBot
User-agent: Gaisbot
User-agent: Generic
User-agent: grub-client
User-agent: [almaden.ibm.com...]
User-agent: InfoNaviRobot
User-agent: Jyxobot
User-agent: Larbin
User-agent: MSIECrawler
User-agent: NaverRobot
User-agent: NexaBot
User-agent: NPBot
User-agent: obot
User-agent: OWR_Crawler
User-agent: PhpDig
User-agent: psbot
User-agent: puf
User-agent: QuepasaCreep
User-agent: rabaz
User-agent: Reaper
User-agent: RPT-HTTPClient
User-agent: ScoutAbout
User-agent: semanticdiscovery
User-agent: Steeler
User-agent: Teleport Pro
User-agent: TurnitinBot
User-agent: TutorGig
User-agent: vsecrawler
User-agent: Wget
Disallow: /

Again, these user-agents are unwelcome on one of my sites, but may be useful on yours. In some cases, they've behaved maliciously, and in others, their "service" is just not useful to the site, and therefore, not worth the bandwidth.

Jim

fiestagirl

1:52 am on Jul 11, 2004 (gmt 0)

10+ Year Member



I believe that the new UA for turnitinbot is: SlySearch (slysearch@slysearch.com)

jdMorgan

2:03 am on Jul 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmmm...

I thought it was the other way around... I thought they changed *to* TurnitinBot, because of the negative connotations of "Sly Search". But I haven't seen either one in a while, so I'll add that one back in.

Whoever named that and "QuePasa Creep" needs to re-think their "webmaster appeal factor."

Thanks,
Jim

David_1cog

8:38 am on Jul 11, 2004 (gmt 0)

10+ Year Member



Jim, thanks for the list - exactly what I needed.

slysearch / turnitinbot - [turnitin.com...] - looks like slysearch is no longer with us.

fiestagirl

6:03 pm on Jul 11, 2004 (gmt 0)

10+ Year Member



Thanks. I guess that I was hallucinating that I'd seen that in my logs recently.
I have, however, seen Scooter this month.

jdMorgan

12:43 am on Jul 19, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Scooter!?!? Wow, thanks for the heads-up on that - I almost removed it from the allowed list recently...

Jim

Vec_One

2:44 pm on Jul 20, 2004 (gmt 0)

10+ Year Member



My site automatically diverts visitors to the appropriate language. Spiders can't traverse this until a special argument is created to invite them in.

Consequently, I'd like to have a complete list of undesirable spiders, whether they obey robots.txt or not. Can anyone point me toward something like that?

wilderness

3:58 pm on Jul 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> I'd like to have a complete list of undesirable spiders

I'm not sure, nor are many others, that such a list exists.
Each webmaster decides which bots and/or IP ranges are beneficial or detrimental to their website's targeted traffic.

Here's a very lengthy, old and complicated thread:

[webmasterworld.com...]

I'm sure it's more than you bargained for or requested.

Edited by wilderness:
Bull previously spent plenty of time and effort in accumulating these lists:
[webmasterworld.com...]
[webmasterworld.com...]

balam

10:01 pm on Jul 25, 2004 (gmt 0)

10+ Year Member



> I'd like to have a complete list of undesirable spiders

> I'm not sure, nor are many others, that such a list exists.

As wilderness points out, "undesirable" is often a personal judgement call.

For example, Quepasa Creep...

While jdMorgan correctly points out a "branding" problem, I've been trying & would love for that bot to visit me.

From the Quepasa site (quepasa.com):

In 1999, quepasa became a [American] national sensation almost overnight [...] Shortly thereafter, quepasa begin leading industry peers in popularity surveys - eventually beating out Yahoo! en Español, Starmedia, and others, as the most recognizable Internet brand to U.S. Hispanics [...] Since inception, Quepasa has been headquartered in Phoenix, Arizona.

And from just one of many articles on the subject, Hispanics: Reaching and Influencing a Growing Market [dmnews.com]:

The Hispanic population is now the largest ethnic group in the United States, surpassing African-Americans. Hispanic spending power is expected to grow to nearly $1 trillion in five years.

Consider that Hispanic advertising has been growing 15 percent to 20 percent in the past five years. Marketing to 38 million Hispanics is no longer a matter of choice. It has become necessary for survival.

I've never heard of a reason to ban Quepasa beyond their using the word "creep" in their UA. I haven't seen them violate robots.txt on the one dinky site they do visit. I want access to that market, with their dollars, pesos, quetzals and, uh, colons! (Colon = currency of Costa Rica) But, as the saying goes, "your mileage may vary."

Then there's something like TurnitinBot... If you have a site with lots of well-written articles that could be plagiarized by students, then you might very well welcome the bot, knowing that perhaps some cheating student will be caught ripping off your content. OTOH, maybe you don't like the idea of TurnItIn making money on your content, since their service isn't free to those who employ it. Or maybe all you have is a Flash site, and visits by TurnitinBot are nothing but a waste of your - and their - bandwidth.

So, undesirable or not? Your call.

<sarcasm>
Since I want to be helpful, here are some undesirable bots that obey robots.txt: Googlebot, Slurp, Gigabot... ;)
</sarcasm>

wilderness

12:30 am on Jul 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> While jdMorgan correctly points out a "branding" problem, I've been trying & would love for that bot to visit me.

Jim,
sticky me a rewrite to use instead of 403s for Quepasa?
I'll send them to you :)