Yahoo! Slurp Now Supports Wildcards in robots.txt
OldWolf
msg:3144664
6:54 am on Nov 3, 2006 (gmt 0)

You can now use '*' in robots directives for Yahoo! Slurp to wildcard match a sequence of characters in your URL. You can use this symbol in any part of the URL string you provide in the robots directive. For example,

User-Agent: Yahoo! Slurp
Allow: /public*/
Disallow: /*_print*.html
Disallow: /*?sessionid

[ysearchblog.com...]
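
For anyone who wants to play with the matching, here's a rough sketch of how those patterns could be interpreted -- assuming '*' simply means "any sequence of characters" and the rest of the directive is an ordinary prefix match. The URLs below are made up for illustration:

import re

def pattern_to_regex(pattern):
    # Treat '*' as "any sequence of characters"; everything else is literal.
    # robots.txt rules are prefix matches against the URL-path (plus query string).
    return re.compile('^' + '.*'.join(re.escape(part) for part in pattern.split('*')))

rules = ['/public*/', '/*_print*.html', '/*?sessionid']
urls = ['/public_html/page.html', '/card_print.html', '/page.php?sessionid=abc123', '/about.html']

for url in urls:
    for rule in rules:
        if pattern_to_regex(rule).match(url):
            print(url, 'is matched by', rule)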

 

tedster
msg:3148001
10:56 pm on Nov 6, 2006 (gmt 0)

Here's another nice piece of news from Yahoo (same link):

Oh, by the way, if you thought we didn't support the 'Allow' tag, as you can see from these examples, we do.

Well done, Yahoo. I love to see the web advancing toward what should be a new standard.

StupidScript
msg:3148014
11:05 pm on Nov 6, 2006 (gmt 0)

Disallow: /*?sessionid

Is that supposed to keep Slurp from crawling based on links it finds hard-coded in someone else's site?

AFAIK, this type of dynamic parameter would be appended to the URI during the visit, and not hard-coded into the filename, so I wonder where it would come into play.

I ask because I regularly see various robots coming in from a link where an excited visitor has added a link to my site on their site ... including this type of tracking parameter, and it messes with a couple of things: bot identification and session management.

<edit>A personal note: I don't mind added functionality, but this strikes me as a little political. Why would Yahoo implement their own set of codes? Why not go through the proper channels and get ALLOW and this type of wildcard use into the standard [robotstxt.org]? It's really irritating when companies start to roll out their own personal extensions to any standard. It's almost as if they don't care about the infrastructure; they just want some press. We should expect to see threads in here about "Hey ... bots are crawling my site even though I used ALLOW and wildcards to limit them!"</edit>

[edited by: StupidScript at 11:13 pm (utc) on Nov. 6, 2006]

bouncybunny
msg:3148160
1:29 am on Nov 7, 2006 (gmt 0)

I'm trying to work out if and how this differs from Google's usage, and MSN's for that matter.

If all we are interested in is these three bots (and for many that may be the case), then using:

User-agent: *

should be enough now? Yes? No?

lexipixel
msg:3148214
2:38 am on Nov 7, 2006 (gmt 0)

If all we are interested in is these three bots (and for most of us that may be the case), then using:

User-agent: *

should be enough now? Yes? No?

-bouncybunny

No. For larger sites with mixed dynamic and static content, user/member login areas, subscription-only content, etc., keeping the bots out of certain areas is needed, and being able to wildcard-match partial strings will go a long way towards cleaning dynamic URLs out of the SERPs (on Yahoo!, at least, if they are the only ones to adopt these ROBOTS.TXT operators).

Boiled down, it looks like they added the use of two special characters for pattern matching in Disallow (and 'Allow') statements:

* - matches a sequence of characters

$ - anchors at the end of the URL string

They also mention and demonstrate how they allow the Allow directive (which confuses me a bit)...

I've always thought of it like a filter.

Disallow: /pattern/ (defined, true, "on")

- or -

default (not defined, not "true", "off")

A defined state for 'Disallow' is sort of a double negative, where "allow" is the same as "not disallow".

I wonder if Slurp would obey:


User-Agent: Yahoo! Slurp
Disallow: /calendar/archive
Allow: /calendar/archive/2006/11/*.htm

Meaning "don't crawl anything in the calendar archives, except this month's static (.htm) event files"...

Something like that could be useful when tied to a content management system that auto-updates ROBOTS.TXT (so long as other bots obey or ignore the same syntax).
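
The precedence question is exactly the part that isn't spelled out, but here's a little sketch of one plausible way a crawler might resolve it -- assuming (and this is only an assumption, not anything Yahoo has documented) that the most specific, i.e. longest, matching pattern wins and that no match means "allowed":

import re

def to_regex(pattern):
    # '*' matches any sequence of characters; rules are prefix matches.
    return re.compile('^' + '.*'.join(re.escape(p) for p in pattern.split('*')))

# (allow?, pattern) pairs from the example above
rules = [(False, '/calendar/archive'),
         (True, '/calendar/archive/2006/11/*.htm')]

def allowed(url):
    # Assumption: the longest (most specific) matching pattern wins;
    # on a tie, Allow wins; if nothing matches, the URL is allowed by default.
    matches = [(len(pat), allow) for allow, pat in rules if to_regex(pat).match(url)]
    return max(matches)[1] if matches else True

print(allowed('/calendar/archive/2005/06/old.html'))   # False -- blocked
print(allowed('/calendar/archive/2006/11/event.htm'))  # True  -- crawlable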

ashear
msg:3148497
10:46 am on Nov 7, 2006 (gmt 0)

Nice, I think Eric Brewer would be proud!

bouncybunny
msg:3148603
12:57 pm on Nov 7, 2006 (gmt 0)

No. For larger sites with mixed dynamic and static content, user/member login areas, subscription-only content, etc., keeping the bots out of certain areas is needed, and being able to wildcard-match partial strings will go a long way towards cleaning dynamic URLs out of the SERPs (on Yahoo!, at least, if they are the only ones to adopt these ROBOTS.TXT operators).

I think you misunderstood what I was saying, but that's still an interesting post.

My question was aimed more at what the differences were between the three main robots' rules. Wildcards are indeed useful. What I was trying to ask was whether it would be necessary to specify different rules for each bot, or whether simply using wildcards in one set of rules would cover all bases.

lexipixel
msg:3149144
7:12 pm on Nov 7, 2006 (gmt 0)

My question was aimed more at what the differences were between the three main robots' rules.

-bouncybunny

From your first post, it appears you want to just allow all 'bots to crawl and index everything on your site:

User-agent: *

Specifying the User-Agent: rule is only half of it --- you also need to Allow/Disallow some or all directories where the 'bots can go.


User-agent: *
Disallow:

...as I said before, "Disallow: " (with nothing specified to 'disallow') is in effect a double negative; "to not disallow" is the same as "to allow"...

Further, a ROBOTS.TXT file containing only:


User-agent: *
Disallow:

is pretty much the same as having no ROBOTS.TXT at all (except for all the 404 errors that will be generated by 'bots requesting the file if you don't have one).

ROBOTS.TXT is the control file for the "Standard for Robot Exclusion" --- the rules were written to keep certain bots out of certain file areas ("exclude")... the default operation of most bots is INDEX, FOLLOW (everything).
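
A quick way to see the "empty Disallow equals allow everything" point for yourself: Python's standard robotparser (which only does plain prefix matching -- no wildcard support -- but that doesn't matter here) treats an empty Disallow exactly like an allow-everything file. The URLs are made up:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(['User-agent: *', 'Disallow:'])

# Everything is crawlable -- the same result you'd get with no robots.txt at all.
print(rp.can_fetch('Slurp', 'http://www.example.com/'))               # True
print(rp.can_fetch('Slurp', 'http://www.example.com/any/page.html'))  # True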

StupidScript
msg:3149151
7:23 pm on Nov 7, 2006 (gmt 0)

1) What is anyone's guess as to how Y's spider would behave by default? If it's not Disallowed, and it's not Allowed ... would the spider crawl it? Wouldn't that make Allow pretty meaningless? After all, if it's not Disallowed ...

2) Can anyone explain how/why the example for /*?sessionid would work? Does anyone have filenames that include a query string on their server? What's the point, and why is this instruction a useful addition to robots.txt, which is meant to instruct spiders/bots on where they can and can't crawl?

Thanks. It seems like a lot of noise about some fairly useless proprietary modifications to the standard. (I know GGL and MSN have their own "standards", too, but that doesn't make it less irritating.)

[edited by: StupidScript at 7:27 pm (utc) on Nov. 7, 2006]

incrediBILL
msg:3149543
12:37 am on Nov 8, 2006 (gmt 0)

Woo hoo!

Everything we've not wanted and MORE!

Can Yahoo say CACHE server?

Can Yahoo direct all their stinking bots to use the ONE copy of my page they just downloaded?

Now THAT would be an improvement!

Funny, Google just did this for all their bots, but we wouldn't want to hold Google up as an example to Yahoo of how to do something right, as that would just be mean and unproductive...

[edited by: incrediBILL at 12:39 am (utc) on Nov. 8, 2006]

jdMorgan
msg:3149718
4:43 am on Nov 8, 2006 (gmt 0)

SS,

1) What is anyone's guess as to how Y's spider would behave by default? If it's not Disallowed, and it's not Allowed ... would the spider crawl it? Wouldn't that make Allow pretty meaningless? After all, if it's not Disallowed ...

If a URL-path-prefix is not explicitly Disallowed or explicitly Allowed, then it is implicitly allowed, and by default, it will be crawled -- Much the same as if the robots.txt file were non-existent or blank.

2) Can anyone explain how/why the example for /*?sessionid would work? Does anyone have filenames that include a query string on their server? What's the point, and why is this instruction a useful addition to robots.txt, which is meant to instruct spiders/bots on where they can and can't crawl?

They're treating arguments to Disallow and Allow as URL-paths, not filenames. So, this addition and this example are very good news for people who don't want the same dynamic (page) URL with dozens of session IDs in the query strings crawled and then possibly treated as duplicate content, simply because people have linked to them, or because the Webmaster left the "stats" page accessible to robots. The ability to tell the robot not to crawl URLs with appended session IDs is a decent back-end fix for those sites whose owners don't have the technical know-how to disable sessions for crawler user-agents.
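
To make that concrete with some made-up URLs: under the "prefix match, with * as a wildcard" reading, a Disallow: /*?sessionid rule catches the session-ID variants while leaving the clean URL crawlable. A quick sketch (the URLs are hypothetical):

import re

# '*' = any sequence of characters; the rule is a prefix match on the URL-path plus query string.
rule = re.compile('^' + '.*'.join(re.escape(p) for p in '/*?sessionid'.split('*')))

for url in ['/widgets/blue-widget.html',
            '/widgets/blue-widget.html?sessionid=8f3a91',
            '/cart.php?sessionid=8f3a91&item=42']:
    print(url, '->', 'blocked' if rule.match(url) else 'crawlable')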

[soapbox]
We talk of "the Standard", and some folks get somewhat excited about defending it as sacrosanct. However, it was never "The Standard," but rather, "A Standard for Robot Exclusion." It was never voted on by any official body; it is only a de facto standard. So, compliance with this "standard" is entirely voluntary.

However, it would be very, very nice if the search providers would lay down their arms for a short time and get together to expand and modernize the Standard, and then document the result in a formal way. The function of robots.txt files should not be an area for competition, but rather, for cooperation.

There are many things about the Standard and its extensions that are problematic:

Multiple User-agent record support, as in:

User-agent: googlebot
User-agent: slurp
Disallow: /cgi-bin

versus

User-agent: slurp googlebot
Disallow: /cgi-bin

Both methods are described in proposed Standards, but which (if any) robots support both?

Precedence versus specificity of User-agent fields: Some 'bots accept the first record containing a User-agent prefix which matches their name, or contains "*" -- whichever comes first. This is the method specified in the original Standard.

In order to make up for common Webmaster errors, though, other robots accept the record containing the "best match" on the User-agent string, regardless of the record order. No matter which method is used, this choice should be documented and available on-line.
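
Just to illustrate how the two interpretations can pick different records for the same file -- everything below is hypothetical, not any particular robot's documented behaviour:

# Which record applies to a given bot? Two interpretations are in the wild:
#  A) the first record whose User-agent matches the bot name or is '*', in file order
#  B) the record whose User-agent is the best (longest) match, regardless of order

records = [('*', ['Disallow: /private']),
           ('slurp', ['Disallow: /private', 'Disallow: /cgi-bin'])]

def first_match(bot):
    for ua, rules in records:
        if ua == '*' or ua.lower() in bot.lower():
            return ua, rules

def best_match(bot):
    named = [(len(ua), ua, rules) for ua, rules in records
             if ua != '*' and ua.lower() in bot.lower()]
    if named:
        _, ua, rules = max(named)
        return ua, rules
    return first_match(bot)  # fall back to the '*' record

print(first_match('Yahoo! Slurp'))  # picks the '*' record, because it comes first
print(best_match('Yahoo! Slurp'))   # picks the 'slurp' record, the most specific match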

Precedence of Disallow and Allow for non-mutually-exclusive partial paths. This can be fixed (e.g. "Allow" always overrides "Disallow"), or can be based on directive order, but it certainly should be documented.

Clarification of "case-insensitive substring match" in the Standard. I'd like to see this changed to "case-insensitive prefix-match." So that, for example, "msnbot-Media" would stop trying to crawl where only "msnbot/" is allowed, and getting itself 403'ed on my sites as a result (Hey, see that trailing slash? -- "msnbot-" does not match "msnbot/").

Documentation of behaviour for unsupported directives: If a 'bot doesn't support all 'modern' directives (for example Crawl-delay), I'd like to see an explicit declaration of behaviour, as in, "If we don't recognize a directive, we A) ignore it, B) ignore the record in which it appears, C) consider the robots.txt file to be invalid and leave, or D) consider robots.txt to be invalid and henceforth have our way with your site." Any of these is fine with me, as I have bigger guns to back up robots.txt, but I'd like to see it in writing.

Finally, I'd like to see a return to something like 'the good old days' when new search companies appeared and simply copied the AltaVista Scooter robot documentation -- Standardization, rather than the current "Balkanization" of robots.txt handling and documentation.
[/soapbox]

Jim

bouncybunny
msg:3152161
4:44 am on Nov 10, 2006 (gmt 0)

From your first post, it appears you want to just allow all 'bots to crawl and index everything on your site:

I wasn't trying to give an example of a robots file. I'm just wondering, now that they all support wildcards, whether it is now necessary to specify a separate set of rules for each of the three main bots, or whether using User-agent: *, followed by the allow/disallow rules, would affect msnbot, googlebot, and slurp in identical ways. If not, I'm interested in the differences.

Asia_Expat
msg:3154780
4:01 am on Nov 13, 2006 (gmt 0)

bouncybunny, I thought your question was quite clear... and I have the same question, but it appears to me that if you want to create a single set of rules for all bots, then there is no need to specify rules for each bot, as the wildcard rules/protocols appear to be the same.
Someone please tell me if I'm wrong.

goodroi
msg:3158710
4:35 pm on Nov 16, 2006 (gmt 0)

Yesterday at PubCon I was sitting with some people from Google Webmaster Central in the site crawlability session, and I found out there are some slight differences. If my memory is working correctly, I think the way Yahoo and Google handle the ? is not identical.

A safe thing to do is to validate your robots.txt. Google Webmaster Central has a tool which will tell you how each Google bot will respond to your robots.txt.

bouncybunny
msg:3161776
8:36 pm on Nov 19, 2006 (gmt 0)

Thanks folks.

So it may not be as clear as we would like?

It's a shame there is no standard that they all follow.

incrediBILL
msg:3161841
9:58 pm on Nov 19, 2006 (gmt 0)

Can Yahoo say CACHE server?

Can Yahoo direct all their stinking bots to use the ONE copy of my page they just downloaded?

For those who didn't attend, I asked this very question at the last session of PubCon, just to watch Tim Mayer look very uncomfortable on stage. Sorry, Tim.

Had someone from Yahoo addressed this in the forum, or possibly invited me to their little Yahoo party (I felt soooo snubbed hahaha), maybe I'd have kept quiet. :)

He did say they were working on it, but offered no further information or timeline, so I'll believe it when I see it.

Reid
msg:3162028
2:54 am on Nov 20, 2006 (gmt 0)

It appears to me that Yahoo and Google use the wildcard in exactly the same way. I'm not sure about MSN, but they do seem to be following the robots.txt 'standard'. So there are no differences between googlebot and slurp when using the * or the $, but using the wildcard in directives within a User-Agent: * record would confuse all the other bots that don't understand * within a URL. So User-Agent: * effectively becomes User-Agent: (all who understand wildcards within URLs).
So I wouldn't use wildcard URLs under User-Agent: *.
So now if we want wildcards in URLs, we have to do something like:

User-Agent: googlebot
Disallow: /*thing

User-Agent: slurp
Disallow: /*thing

User-Agent: *
Disallow: /something
Disallow: /anything
Disallow: /anotherthing

Kind of redundant. I agree with Jim -- they could do a lot if they got together and made a more complex standard that everyone could agree on.
Also, the use of Allow is useless since the default is Allow; that's also part of the 'standard', but pretty useless in my opinion.

[edited by: Reid at 3:18 am (utc) on Nov. 20, 2006]
