Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Custom rule for google, url args and wordpress feeds
Kurt




msg:3348126
 1:09 am on May 24, 2007 (gmt 0)

Current robots file:

User-agent: Googlebot
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$

site:example.com shows link #2 as www.example.com/feed/
Google Webmaster Tools shows that this URL is NOT covered by the robots exclusion. Why is my wildcard rule not working? Disallow: /*/feed/$

I cannot seem to locate Google's official policy on robots.txt wildcards and other optional tokens.

I also have a url in google that I want out, and would like to use robots to do so. The url is
/?p&paged=16

I take it that would be an impossible url to block? It has been 404'd for ages, but google hits it all the time.

Finally, in wordpress, I have urls of:
/category/personal/page/2/
/page/3/

I am no longer sure how the second URL is reached, but both are in the SERPs. Is there any good reason I should let Google crawl those pages at all, and would:
Disallow: /page/
Disallow: /category/
those two rules take care of it for me?

I do have <meta name="robots" content="noindex,follow"/> in each of the above cases, put into the page dynamically, but perhaps the robots file is a bit more forceful?
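As a rough check of the two proposed rules, here is a minimal sketch using Python's standard-library robots.txt parser, which does plain prefix matching. The host www.example.com comes from the post; the helper name blocked and the paths /about/ and /2007/05/page/2/ are made up for illustration. It only models how a prefix-matching crawler would read the file, not how Google itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The two rules proposed in the post, under the same Googlebot section.
rules = """\
User-agent: Googlebot
Disallow: /page/
Disallow: /category/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

def blocked(path):
    """Return True if a Googlebot request for this path would be disallowed."""
    # 'blocked' is a hypothetical helper, not part of any robots.txt tooling.
    return not parser.can_fetch("Googlebot", "http://www.example.com" + path)

for path in ["/page/3/", "/category/personal/page/2/", "/about/", "/2007/05/page/2/"]:
    print(path, "blocked" if blocked(path) else "allowed")
```

Both URLs from the post come out blocked. Note that the rules are prefixes: a path that merely contains /page/ further along, such as the hypothetical /2007/05/page/2/, is not covered by Disallow: /page/.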

 

phranque




msg:3348295
 5:56 am on May 24, 2007 (gmt 0)

the official search engine site for the robots.txt specification is [robotstxt.org...]
as far as i know the only "wildcarding" allowed is for user agent specification:
User-agent: *

there is no wildcarding or regular expression support for filenames.
(so the '$' isn't doing any good either)

Kurt




msg:3348332
 6:40 am on May 24, 2007 (gmt 0)

I see numerous mentions of the $ being used to mean 'end of name'. But I am looking for official Google docs to support this, which I cannot find. I only find sites that refer to those docs, and none of them are what I would call official.

vincevincevince




msg:3348461
 9:22 am on May 24, 2007 (gmt 0)

$ comes from regular expressions... not robots.txt, make sure you're not confusing yourself with mod_rewrite rules

goodroi




msg:3348671
 1:58 pm on May 24, 2007 (gmt 0)

The official robots.txt protocol does not support wildcards, or "pattern matching" as Google calls it. Google did bend the official protocol, and they support more features. Just remember that Google and Yahoo are exceptions to the rule. Most other spiders will not support wildcards, aka pattern matching.

The official Google answer about pattern matching in robots.txt is here [google.com...]
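To illustrate what a spider without pattern matching does with such a rule, here is a small sketch using Python's standard-library parser. As far as I know it treats the rule path as a literal prefix, so the * and $ carry no special meaning. The URL /blog/feed/ is a made-up example, and other crawlers may behave differently.

```python
from urllib.robotparser import RobotFileParser

# One of the wildcard rules from the original post.
rules = [
    "User-agent: Googlebot",
    "Disallow: /*/feed/$",
]

strict = RobotFileParser()
strict.parse(rules)

# A prefix-only parser looks for paths that literally begin with "/*/feed/$",
# so an ordinary feed URL is still considered fetchable.
allowed = strict.can_fetch("Googlebot", "http://www.example.com/blog/feed/")
print("allowed" if allowed else "blocked")
```

So the same file can mean one thing to Google and another to a crawler that only implements the original protocol.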

Achernar




msg:3348714
 2:28 pm on May 24, 2007 (gmt 0)

Current robots file:

User-agent: Googlebot
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$

site:example.com shows link #2 as www.example.com/feed/
Google Webmaster Tools shows that this URL is NOT covered by the robots exclusion. Why is my wildcard rule not working? Disallow: /*/feed/$

There is no way "/*/feed/" would match "/feed/". Even if "*" matched the empty string, you would still be trying to match "//feed/" against "/feed/".
Try this instead:

Disallow: */feed/$
Disallow: */feed/rss/$
Disallow: */trackback/$

But as others have said, wildcards are understood only by google and yahoo.
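The difference between the two patterns can be sketched with a simplified model of the wildcard syntax: "*" matches any run of characters, a trailing "$" anchors the end of the URL, and anything else is a prefix match. The function name google_style_match is hypothetical, and this is only an approximation of the documented behaviour, not Google's actual matcher.

```python
import re

def google_style_match(pattern: str, path: str) -> bool:
    """Simplified model: '*' is a wildcard, a trailing '$' means end of URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        # With '$' the whole path must match the pattern.
        return re.fullmatch(regex, path) is not None
    # Without '$' the pattern only has to match the start of the path.
    return re.match(regex, path) is not None

print(google_style_match("/*/feed/$", "/feed/"))                    # False
print(google_style_match("/*/feed/$", "/category/personal/feed/"))  # True
print(google_style_match("*/feed/$", "/feed/"))                     # True
```

Under this model the original rule needs at least one extra path segment before /feed/, which is why the top-level /feed/ slips through, while the pattern without the leading slash covers both cases.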


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved