Custom rule for google, url args and wordpress feeds - Sitemaps, Meta Data, and robots.txt forum at WebmasterWorld - WebmasterWorld

Forum Moderators: goodroi


Custom rule for google, url args and wordpress feeds

Kurt

1:09 am on May 24, 2007 (gmt 0)

10+ Year Member

Current robots file:

User-agent: Googlebot
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$

site:example.com shows link #2 as www.example.com/feed/
Google Webmaster Tools shows that this URL is NOT covered by the robots exclusion. Why is my wildcard rule, Disallow: /*/feed/$, not working?

I cannot seem to locate Google's official policy on robots.txt wildcards and other optional tokens.

I also have a URL in Google's index that I want removed, and I would like to use robots.txt to do so. The URL is
/?p&paged=16

I take it that would be an impossible URL to block? It has returned a 404 for ages, but Google requests it all the time.

Finally, in WordPress, I have URLs such as:
/category/personal/page/2/
/page/3/

I am no longer sure how the second URL is reached, but both are in the SERPs. Is there any good reason I should let Google crawl those pages at all, and would these two rules take care of it for me?
Disallow: /page/
Disallow: /category/

I do have <meta name="robots" content="noindex,follow"/> on each of the above pages, inserted dynamically, but perhaps the robots.txt file is a bit more forceful?

phranque

5:56 am on May 24, 2007 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

the official search engine site for the robots.txt specification is [robotstxt.org...]
as far as i know the only "wildcarding" allowed is for user agent specification:
User-agent: *

there is no wildcarding or regular expression support for filenames.
(so the '$' isn't doing any good either)
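
A minimal sketch of the plain prefix matching described above, assuming a hypothetical helper named is_disallowed (it is not part of any library). Under this reading, '*' and '$' inside a path are ordinary characters, so the wildcard rules from the first post match nothing useful, while the plain /page/ and /category/ rules do match by prefix:

```python
def is_disallowed(path, rules):
    """Return True if path starts with any non-empty Disallow value.

    Hypothetical helper: plain prefix matching only, with no special
    meaning given to '*' or '$'.
    """
    return any(bool(rule) and path.startswith(rule) for rule in rules)


# Rules taken from the first post in this thread.
wildcard_rules = ["/*/feed/$", "/*/feed/rss/$", "/*/trackback/$"]
prefix_rules = ["/page/", "/category/"]

# Treated literally, the wildcard rules do not cover a real feed URL.
print(is_disallowed("/blog/feed/", wildcard_rules))

# The plain rules cover both WordPress paging URLs by prefix.
print(is_disallowed("/page/3/", prefix_rules))
print(is_disallowed("/category/personal/page/2/", prefix_rules))
```

A crawler that only implements this behaviour would treat the three wildcard lines as if they were not there.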

Kurt

6:40 am on May 24, 2007 (gmt 0)

10+ Year Member

I see numerous mentions of $ being used to mean 'end of name'. But I am looking for official Google docs to support this, and I cannot find them; I only find sites that refer to those docs, and none of them are what I would call official.

vincevincevince

9:22 am on May 24, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

$ comes from regular expressions, not robots.txt. Make sure you're not confusing it with mod_rewrite rules.

goodroi

1:58 pm on May 24, 2007 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

The official robots.txt protocol does not support wildcards, or "pattern matching" as Google calls it. Google extended the official protocol and supports more features. Just remember that Google and Yahoo are exceptions to the rule; most other spiders will not support wildcards (pattern matching).

The official Google answer about pattern matching in robots.txt is here [google.com...]

Achernar

2:28 pm on May 24, 2007 (gmt 0)

10+ Year Member

Top Contributors Of The Month

Current robots file:
User-agent: Googlebot
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
site:example.com shows link #2 as www.example.com/feed/
Google webmaster tools shows that url is NOT in the robots exclusion. Why is my wildcard not working Disallow: /*/feed/$

There is no way "/*/feed/" would match "/feed/". Even if "*" matches the empty string, the pattern becomes "//feed/", which still does not match "/feed/".
Try this instead:

Disallow: */feed/$
Disallow: */feed/rss/$
Disallow: */trackback/$

But as others have said, wildcards are understood only by Google and Yahoo.
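
The matching argument above can be checked with a rough Python sketch. It assumes the commonly described extension semantics ('*' matches any run of characters, including none; a trailing '$' anchors the end of the URL; the pattern is matched from the start of the path). The function names pattern_to_regex and matches are hypothetical, and this is not any crawler's actual implementation:

```python
import re


def pattern_to_regex(pattern):
    """Translate a wildcard-style robots.txt path pattern into a regex.

    Assumed semantics (a sketch, not an official matcher):
    '*' matches any run of characters, a trailing '$' anchors the end
    of the path, and every other character is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, path):
    """Return True if the pattern matches the path from its first character."""
    return pattern_to_regex(pattern).match(path) is not None


for pattern in ("/*/feed/$", "*/feed/$", "/feed/$"):
    for path in ("/feed/", "/blog/feed/"):
        print(pattern, path, matches(pattern, path))
```

Under these assumptions "/*/feed/$" covers "/blog/feed/" but not "/feed/", while "*/feed/$" covers both. Whether a given crawler accepts a pattern that does not begin with "/" is not something this sketch can confirm; adding a separate "Disallow: /feed/$" line next to the existing rule covers the root feed without relying on that.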