Selectively excluding spider from subdirectory pages

Can I completely block subdirectories from one search engine?

Webdetective

5:35 pm on Jul 10, 2005 (gmt 0)

I use automated page-generation software in individual subdirectories. The resulting pages are a no-go on one search engine but are OK with the others. Is it possible to exclude just that one search engine from those pages via robots.txt, and thereby avoid any penalties from that search engine for having them?

If a search engine's spider is excluded from the "offending" pages, is that as good as not having the pages at all?

What is the robots.txt exclusion syntax for Yahoo's Slurp?

Span

6:28 pm on Jul 10, 2005 (gmt 0)

User-agent: Slurp
Disallow: /folder/

[help.yahoo.com]

Webdetective

8:16 pm on Jul 10, 2005 (gmt 0)

Will doing so protect my site from any possible Yahoo penalties if I use this for pages Yahoo doesn't like?

Also, what if I am already using "User-agent: *" to keep all search engines out of some pages for security reasons?

Can I use multiple instances of User-agent: in robots.txt to cover specific search engines and sub-directories in my site?

Example:

User-agent: *
Disallow: /cgi-bin/
User-agent: Slurp
Disallow: /folder/

Thanks
Fred

Span

9:15 pm on Jul 10, 2005 (gmt 0)

User-agent: *
Disallow: /cgi-bin/

User-agent: Slurp
Disallow: /folder/

Yes, you can use multiple instances of "User-agent:" in your robots.txt. Your example excludes all robots from your cgi-bin and tells Slurp to not spider "/folder/".

There are some tutorials and a robots.txt validator at Search Engine World [searchengineworld.com].
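
For anyone who wants to check a file like this locally, here is a minimal sketch using Python's standard-library urllib.robotparser. The page paths and the "Googlebot" agent name are hypothetical, chosen only for illustration, and this parser shows how one implementation reads the file, not necessarily how Yahoo's own crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The two-record robots.txt from the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: Slurp
Disallow: /folder/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical paths, used only to exercise the rules.
test_paths = ("/folder/page.html", "/cgi-bin/script.cgi", "/index.html")

for agent in ("Slurp", "Googlebot"):
    for path in test_paths:
        verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
        print(f"{agent:10} {path:22} {verdict}")
```

One thing to watch for: in this parser, as in the original exclusion standard, a robot that has a record of its own follows only that record, and the "*" record applies to robots that match nothing else. So with the file above, Slurp is kept out of /folder/ but is not kept out of /cgi-bin/. If Slurp should stay out of both, repeating "Disallow: /cgi-bin/" inside the Slurp record would cover it.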

Webdetective

12:38 am on Jul 11, 2005 (gmt 0)

Yes, you can use multiple instances of "User-agent:" in your robots.txt. Your example excludes all robots from your cgi-bin and tells Slurp to not spider "/folder/".

Span,
Good. However, will doing so protect me from a search engine's penalties for having those particular pages on my site, given that its robot has been excluded from them? Some of my subdirectory pages might be a problem for Yahoo, but not for Google or MSN.

Also, if I want to exclude other pages with slightly different names (e.g. index1.html, index2.html, index3.html, etc.), could I use the following to cover them all:

User-agent: Slurp
Disallow: /index?.html

Thanks

Dijkgraaf

12:48 am on Jul 11, 2005 (gmt 0)

It is safer just to use:
User-agent: Slurp
Disallow: /index

That should disallow any URLs starting with /index.
Wildcards are not part of the standard, although some bots do support them.
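
A quick way to see the prefix matching in action is the sketch below, again using Python's standard-library urllib.robotparser. The paths are hypothetical, and the parser is only one implementation of the standard's prefix rule, so treat it as an illustration and not as a statement of what Slurp itself does.

```python
from urllib.robotparser import RobotFileParser

# The prefix rule suggested above.
PREFIX_RULES = """\
User-agent: Slurp
Disallow: /index
"""

prefix_parser = RobotFileParser()
prefix_parser.parse(PREFIX_RULES.splitlines())

# Hypothetical paths: three numbered index pages plus a few others.
candidate_paths = [
    "/index1.html",
    "/index2.html",
    "/index3.html",
    "/index.html",
    "/indexes/list.html",
    "/about.html",
]

# Collect every path the Slurp record blocks.
blocked_paths = [
    path for path in candidate_paths
    if not prefix_parser.can_fetch("Slurp", path)
]
print(blocked_paths)
```

Note that the rule is a plain "starts with" test, so it also catches /index.html itself and anything under a directory such as /indexes/. If the main index page should stay visible to Slurp, a prefix this short is too broad.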

Webdetective

10:51 pm on Jul 11, 2005 (gmt 0)

If a site is under penalty (de-indexed) for something like having too many software-generated pages, yet is still being spidered, and the suspect pages are then either blocked for Slurp in robots.txt or removed altogether, could the site eventually be re-indexed?