What I now need is a list of common spiders that offer little or no value and obey robots.txt. I've spent hours hunting for an up-to-date list but without success (not helped by the fact the site search facility seems to be broken and keeps returning 404s - fortunately I can use Google to search the site).
jdMorgan posted June 2003 [webmasterworld.com...] :
User-agent: almaden
User-agent: ASPSeek
User-agent: baiduspider
User-agent: dumbbot
User-agent: Generic
User-agent: grub-client
User-agent: MSIECrawler
User-agent: nexabot
User-agent: NPBot
User-agent: OWR_Crawler
User-agent: psbot
User-agent: rabaz
User-agent: RPT-HTTPClient
User-agent: ScoutAbout
User-agent: semanticdiscovery
User-agent: TurnitinBot
User-agent: Wget
Disallow: /
allow specific robots and disallow everything else?
Sadly not, as far as I know. At least, not with robots.txt. You'd need to be thinking about .htaccess on an Apache server.
There is a short-cut method that would halve the length of WebmasterWorld's robots.txt.
Instead of:
User-agent: AskJeeves
Disallow: /

User-agent: Teoma
Disallow: /

User-agent: Jeeves
Disallow: /

User-agent: WebVac
Disallow: /

User-agent: Stanford
Disallow: /
You can do it like this:
User-agent: AskJeeves
User-agent: Teoma
User-agent: Jeeves
User-agent: WebVac
User-agent: Stanford
Disallow: /
Some people say that can confuse a few spiders, but I've never noticed a problem on the sites I use the short-cut format on.
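One way to sanity-check the short-cut format is to run both versions through a parser and compare the results. The sketch below uses Python's standard-library urllib.robotparser; the constant and function names are made up for illustration. It only shows how that one parser reads the two forms, and says nothing about how any real spider behaves.

```python
from urllib.robotparser import RobotFileParser

# Spider names taken from the example in the post above.
AGENTS = ["AskJeeves", "Teoma", "Jeeves", "WebVac", "Stanford"]

# Long form: one record per spider, separated by blank lines.
LONG_FORM = "\n\n".join(f"User-agent: {name}\nDisallow: /" for name in AGENTS)

# Short-cut form: several User-agent lines sharing a single Disallow.
SHORT_FORM = "\n".join(f"User-agent: {name}" for name in AGENTS) + "\nDisallow: /"


def blocked_agents(robots_txt, agents, path="/"):
    """Return the agents that this robots.txt text bars from `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [name for name in agents if not parser.can_fetch(name, path)]


if __name__ == "__main__":
    for label, text in (("long", LONG_FORM), ("short-cut", SHORT_FORM)):
        print(label, blocked_agents(text, AGENTS + ["Googlebot"]))
```

If both forms report the same blocked names and leave an unlisted agent such as Googlebot alone, this parser treats them as equivalent. A spider with a stricter parser may still disagree, which is the concern raised above.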
does anyone have an up-to-date list... of current, undesirable bots... they'd like to share. This helps everyone - not just me.
Check out WebmasterWorld's robots.txt for a long list.
I believe you're referring to this: [webmasterworld.com...]
However, the following robots can't handle that method:
appie
Gigabot/
ia_archiver
IlTrovatore-Setaccio
NationalDirectory-WebSpider
Pompos/
Scooter/ (obsolete)
baiduspider either can't handle it, or they're still disregarding robots.txt. For the spiders above, I use mod_rewrite to feed them a simplified robots.txt depending on user-agent.
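The selection step described here can be sketched in Python. This is not the mod_rewrite rule set itself, which isn't shown in the thread; it is only a stand-in for the decision that rule set would make. The file names robots.txt and robots-simple.txt and the function name are assumptions for illustration, and the match is a plain case-insensitive substring test against the names listed above.

```python
# User-agent fragments for spiders said to need the one-record-per-agent
# format (copied from the list above; the trailing slashes are part of it).
SIMPLE_FORMAT_AGENTS = (
    "appie",
    "Gigabot/",
    "ia_archiver",
    "IlTrovatore-Setaccio",
    "NationalDirectory-WebSpider",
    "Pompos/",
    "Scooter/",
)


def choose_robots_file(user_agent, full="robots.txt", simple="robots-simple.txt"):
    """Pick which robots.txt variant to serve for a given User-Agent header.

    The file names are hypothetical placeholders.
    """
    ua = user_agent.lower()
    if any(fragment.lower() in ua for fragment in SIMPLE_FORMAT_AGENTS):
        return simple
    return full


if __name__ == "__main__":
    print(choose_robots_file("Gigabot/2.0"))
    print(choose_robots_file("Mozilla/5.0 (compatible; Googlebot/2.1)"))
```

On Apache, the same decision would normally be made by a rewrite condition on the User-Agent header, so the spider always asks for /robots.txt and never sees the alternate file name.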
As far as a list of "bad" 'bots that *do* obey robots.txt, there are a few that sometimes do, but sometimes don't.
Here's a relatively recent list I've used:
User-agent: ASPSeek
User-agent: asterias
User-agent: baiduspider
User-agent: BravoBrian
User-agent: dloader(NaverRobot)
User-agent: Dumbot
User-agent: EgotoBot
User-agent: Gaisbot
User-agent: Generic
User-agent: grub-client
User-agent: [almaden.ibm.com...]
User-agent: InfoNaviRobot
User-agent: Jyxobot
User-agent: Larbin
User-agent: MSIECrawler
User-agent: NaverRobot
User-agent: NexaBot
User-agent: NPBot
User-agent: obot
User-agent: OWR_Crawler
User-agent: PhpDig
User-agent: psbot
User-agent: puf
User-agent: QuepasaCreep
User-agent: rabaz
User-agent: Reaper
User-agent: RPT-HTTPClient
User-agent: ScoutAbout
User-agent: semanticdiscovery
User-agent: Steeler
User-agent: Teleport Pro
User-agent: TurnitinBot
User-agent: TutorGig
User-agent: vsecrawler
User-agent: Wget
Disallow: /
Again, these user-agents are unwelcome on one of my sites, but may be useful on yours. In some cases, they've behaved maliciously, and in others, their "service" is just not useful to the site, and therefore, not worth the bandwidth.
Jim
I thought it was the other way around... I thought they changed *to* TurnitinBot, because of the negative connotations of "Sly Search". But I haven't seen either one in a while, so I'll add that one back in.
Whoever named that and "QuePasa Creep" needs to rethink their "webmaster appeal factor."
Thanks,
Jim
slysearch / turnitinbot - [turnitin.com...] - looks like slysearch is no longer with us.
Consequently, I'd like to have a complete list of undesirable spiders, whether they obey robots.txt or not. Can anyone point me toward something like that?
I'd like to have a complete list of undesirable spiders
I'm not sure, nor are many others, that such a list exists.
Each webmaster decides which bots and/or IP ranges are beneficial or detrimental to their website's targeted traffic.
Here's a very lengthy, old, and complicated thread:
[webmasterworld.com...]
I'm sure it's more than you bargained for or requested.
Edited by wilderness:
Bull previously spent plenty of time and effort in accumulating these lists:
[webmasterworld.com...]
[webmasterworld.com...]
> I'm not sure nor are many others that such a list exists.
As wilderness points out, "undesirable" is often a personal judgement call.
For example, Quepasa Creep...
While jdMorgan correctly points out a "branding" problem, I've been trying & would love for that bot to visit me.
From the Quepasa site (quepasa.com):
In 1999, quepasa became a [American] national sensation almost overnight [...] Shortly thereafter, quepasa begin leading industry peers in popularity surveys - eventually beating out Yahoo! en Español, Starmedia, and others, as the most recognizable Internet brand to U.S. Hispanics [...] Since inception, Quepasa has been headquartered in Phoenix, Arizona.
And from just one of many articles on the subject, Hispanics: Reaching and Influencing a Growing Market [dmnews.com]:
The Hispanic population is now the largest ethnic group in the United States, surpassing African-Americans. Hispanic spending power is expected to grow to nearly $1 trillion in five years. Consider that Hispanic advertising has been growing 15 percent to 20 percent in the past five years. Marketing to 38 million Hispanics is no longer a matter of choice. It has become necessary for survival.
I've never heard of a reason to ban Quepasa beyond their using the word "creep" in their UA. I haven't seen them violate robots.txt on the one dinky site they do visit. I want access to that market, with their dollars, pesos, quetzals and, uh, colons! (Colon = currency of Costa Rica) But, as the saying goes, "your mileage may vary."
Then there's something like TurnitinBot... If you have a site with lots of well-written articles that could be plagiarized by students, then you might very well welcome the bot, knowing that perhaps some cheating student will be caught ripping off your content. OTOH, maybe you don't like the idea of TurnItIn making money on your content, since their service isn't free to those who employ it. Or maybe all you have is a Flash site, and visits by TurnitinBot are nothing but a waste of your - and their - bandwidth.
So, undesirable or not? Your call.
<sarcasm>
Since I want to be helpful, here are some undesirable bots that obey robots.txt: Googlebot, Slurp, Gigabot... ;)
</sarcasm>