Forum Moderators: goodroi
How about a disallow spider list?
It's a much shorter list, because most bad spiders ignore robots.txt. As with the user-agents in the almost-perfect ban list, some webmasters may disagree with certain entries, and others may want to add more.
Here's mine:
User-agent: almaden
User-agent: ASPSeek
User-agent: baiduspider
User-agent: dumbBot
User-agent: Generic
User-agent: grub-client
User-agent: MSIECrawler
User-agent: NexaBot
User-agent: NPBot
User-agent: OWR_Crawler
User-agent: psbot
User-agent: rabaz
User-agent: RPT-HTTPClient
User-agent: ScoutAbout
User-agent: semanticdiscovery
User-agent: TurnitinBot
User-agent: Wget
Disallow: /
Some entries are there only because I'm waiting to see whether they will obey robots.txt, e.g. dumbBot. Others are either just an annoyance, or use my server bandwidth to make a profit and are not welcome. Again, my list is short because these are the only ones I have seen that obey (or may obey) robots.txt.
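To see how a well-behaved spider is supposed to interpret a list like the one above, here's a minimal sketch using Python's standard-library robots.txt parser. The rules and bot names are taken from the list in this post; the page path is just an example.

```python
# Sketch: how a polite crawler checks robots.txt before fetching a page.
# Multiple User-agent lines share the Disallow rule that follows them,
# exactly as in the list posted above.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: dumbBot
User-agent: Wget
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Named bots are shut out of the whole site; unlisted bots are unaffected.
print(parser.can_fetch("dumbBot", "/page.html"))    # False
print(parser.can_fetch("Wget", "/page.html"))       # False
print(parser.can_fetch("Googlebot", "/page.html"))  # True
```

Of course, this check is purely voluntary, which is the whole point of the thread: a bot that never fetches or honors robots.txt has to be stopped at the server instead.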
In case anyone from one of the issuing organizations drops by... I insist that any robot which wishes to spider my site meet the following requirements:
I have encouraged our friends at Nutch to especially heed item #3; Grub.org failed to do this, and is now unwelcome as a result.
No, I don't have delusions of grandeur, and my sites are not important, but the list above is just "proper netiquette" for robots. I hope my few 403-Forbidden responses will be noticed, and my custom 403 error page might be read, but I doubt it.
Jim
I agree completely with you on this... this is one of the main reasons why I have spent so much time, over the years, accumulating user agents and applying them to my banned listings. I've been doing this via my Apache configs, but I see now that it appears to be more beneficial to use .htaccess for this purpose; at the least, it keeps me from having to restart the server each time I make an update.
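For anyone curious what that .htaccess approach looks like, here's a minimal sketch for Apache 2.x using mod_setenvif. The bot names are just examples pulled from the list earlier in this thread; the exact directives you'd use depend on your Apache version and loaded modules.

```apache
# Sketch: deny requests from known bad user-agents via .htaccess.
# Matching is case-insensitive and by substring.
BrowserMatchNoCase "Wget" bad_bot
BrowserMatchNoCase "psbot" bad_bot
BrowserMatchNoCase "grub-client" bad_bot

# Apache 2.2-style access control: allow everyone except flagged agents.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Unlike robots.txt, this works against bots that ignore the rules entirely, since the server refuses the request with a 403 instead of relying on the bot's good manners.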
If a site has a robots.txt file and certain pages are disallowed from being indexed, does the spider just take the URL of a disallowed page and index it without any meta information?
thanks in advance.
seonut
Google and AJ/Teoma do that - they list any URL they find anywhere. They are still in compliance with robots.txt, though, in that they do not fetch and index the page - they just list the link. Other SEs may do this too, but I'm not aware of any other major U.S. engines that behave this way.
In order to prevent the link from being listed, you have to allow Google and AJ/Teoma to fetch the page, but include the <meta name="robots" content="noindex"> meta-tag on the page itself. Inefficient, but it works.
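Concretely, that means the page itself has to carry the tag, something like this (the surrounding markup is just illustrative):

```html
<!-- The robot must be ALLOWED to fetch this page in robots.txt,
     otherwise it never sees the noindex instruction below. -->
<head>
  <title>Example page</title>
  <meta name="robots" content="noindex">
</head>
```

It feels backwards, but blocking the fetch in robots.txt only stops the crawl, not the bare URL listing; the noindex tag stops the listing, and the robot can only read it if the fetch is permitted.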
Jim
Thanks for the information. This has answered my question.
Thanks again
seonut