Yahoo! Slurp Now Supports Wildcards in robots.txt


lexipixel - 2:38 am on Nov 7, 2006 (gmt 0)


If all we are interested in is these three bots (and for most of us that may be the case), then using:

User-agent: *

should be enough now? Yes? No?

-bouncybunny

No. Larger sites with mixed dynamic and static content, user/member login areas, subscription-only content, etc. need to keep the bots out of certain areas, and being able to wildcard-match partial strings will go a long way towards cleaning dynamic URLs out of the SERPs (on Yahoo! only, if they are the only ones to adopt these ROBOTS.TXT operators).

Boiled down, it looks like they added two special characters for pattern matching in Disallow (and Allow) statements:

* - matches a sequence of characters

$ - anchors at the end of the URL string
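
To make those two operators concrete, here is a minimal Python sketch of how a crawler might test a path against such a pattern. It is not Yahoo!'s code; the function names are made up, and it assumes '*' can match an empty run of characters and that a pattern without '$' behaves as a prefix match.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics (not confirmed by the announcement): '*' matches any
    run of characters, including none; a trailing '$' anchors the match at
    the end of the URL; everything else is matched literally as a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then put '.*' wherever a '*' stood.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def pattern_matches(pattern, path):
    # re.match only matches at the start of the path, i.e. a prefix test.
    return pattern_to_regex(pattern).match(path) is not None
```

With this sketch, "/*?sessionid" catches "/page.php?sessionid=123", and "/*.gif$" catches "/images/a.gif" but not "/images/a.gif?x=1".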

They also mention and demonstrate their support for the Allow directive (which confuses me a bit)...

I've always thought of it like a filter.

Disallow: /pattern/ (defined, true, "on")

- or -

default (not defined, not "true", "off")

A defined 'Disallow' is sort of a double negative, where "allow" is the same as "not disallow".

I wonder if Slurp would obey:


User-Agent: Yahoo! Slurp
Disallow: /calendar/archive
Allow: /calendar/archive/2006/11/*.htm

Meaning "don't crawl anything in the calendar archives, except this month's static (.htm) event files"...
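
Here is a rough Python sketch of how a crawler could resolve that record. The big assumption, which I have not seen confirmed for Slurp, is that the longest matching pattern wins and that a path matching nothing is allowed by default. The helper names are hypothetical.

```python
import re

def _to_regex(pattern):
    # Assumed semantics: '*' = any run of characters, trailing '$' = end of URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

# (directive, pattern) pairs taken from the example record above.
RULES = [
    ("disallow", "/calendar/archive"),
    ("allow", "/calendar/archive/2006/11/*.htm"),
]

def is_allowed(path, rules=RULES):
    """Return True if the path may be crawled under the assumed tie-break."""
    verdict, best_len = True, -1  # nothing matched: default is "allowed"
    for directive, pattern in rules:
        # ASSUMPTION: the longest (most specific) matching pattern decides.
        if _to_regex(pattern).match(path) and len(pattern) > best_len:
            verdict, best_len = (directive == "allow"), len(pattern)
    return verdict
```

Under these assumed rules, "/calendar/archive/2006/11/party.htm" is crawlable and "/calendar/archive/2005/03/old.htm" is not. Note that "/calendar/archive/2006/11/party.html" also gets through, since the Allow pattern has no trailing '$'.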

Something like that could be useful when tied to a content management system that auto-updates ROBOTS.TXT (so long as other bots either obey or ignore the same syntax).
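
As a sketch of the CMS idea, a small Python function could regenerate the record for the current month. The function name and the /calendar/archive/YYYY/MM/ layout are assumptions carried over from my example above, not anything a particular CMS provides.

```python
from datetime import date

def build_slurp_record(today):
    """Build the Slurp record that opens up only the current month's archive.

    Hypothetical helper: assumes archives live under
    /calendar/archive/<year>/<month>/ as in the example record.
    """
    month_path = f"/calendar/archive/{today:%Y}/{today:%m}/"
    lines = [
        "User-Agent: Yahoo! Slurp",
        "Disallow: /calendar/archive",
        f"Allow: {month_path}*.htm",
    ]
    return "\n".join(lines) + "\n"
```

A scheduled job could call build_slurp_record(date.today()) on the first of each month and write the result into ROBOTS.TXT alongside the records for other bots.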


Thread source: http://www.webmasterworld.com/robots_txt/3144662.htm
Brought to you by WebmasterWorld: http://www.webmasterworld.com