


Custom rules for Google, URL args, and WordPress feeds

     
1:09 am on May 24, 2007 (gmt 0)

New User

5+ Year Member

joined:Apr 26, 2007
posts:5
votes: 0


Current robots file:

User-agent: Googlebot
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$

site:example.com shows result #2 as www.example.com/feed/
Google Webmaster Tools shows that URL is NOT covered by the robots exclusion. Why is my wildcard rule not working: Disallow: /*/feed/$

I cannot seem to locate Google's official policy on robots.txt wildcards and other optional tokens.

I also have a URL in Google's index that I want out, and would like to use robots.txt to do so. The URL is
/?p&paged=16

I take it that would be an impossible URL to block? It has returned 404 for ages, but Google hits it all the time.

Finally, in WordPress, I have URLs like:
/category/personal/page/2/
/page/3/

I am no longer sure how the second URL is even reached, but both are in the SERPs. Is there any good reason I should let Google crawl those pages at all, and would these two rules take care of it for me?
Disallow: /page/
Disallow: /category/

I do have <meta name="robots" content="noindex,follow"/> on each of the pages above, added dynamically, but perhaps the robots.txt file is a bit more forceful?

5:56 am on May 24, 2007 (gmt 0)

Administrator

WebmasterWorld Administrator phranque

joined:Aug 10, 2004
posts:10551
votes: 10


the official site for the robots.txt specification is [robotstxt.org...]
as far as i know the only "wildcarding" allowed is in the user-agent line:
User-agent: *

there is no wildcard or regular expression support for paths.
(so the '$' isn't doing any good either)

6:40 am on May 24, 2007 (gmt 0)

New User

5+ Year Member

joined:Apr 26, 2007
posts: 5
votes: 0


I have seen numerous mentions of $ being used to mean 'end of URL'. But I am looking for official Google docs to support this, which I cannot find; I only find sites pointing to those docs, and none of them are what I would call official.
9:22 am on May 24, 2007 (gmt 0)

Senior Member from MY 

WebmasterWorld Senior Member vincevincevince

joined:Apr 1, 2003
posts:4847
votes: 0


$ comes from regular expressions, not robots.txt. Make sure you're not confusing yourself with mod_rewrite rules.
1:58 pm on May 24, 2007 (gmt 0)

Administrator from US 

WebmasterWorld Administrator goodroi

joined:June 21, 2004
posts:3106
votes: 91


The official robots.txt protocol does not support wildcards, or "pattern matching" as Google calls it. Google bent the official protocol and supports extra features. Just remember that Google, and also Yahoo, are exceptions to the rule. Most other spiders will not support wildcards aka pattern matching.

The official Google answer about pattern matching in robots.txt is here [google.com...]
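To illustrate the kind of pattern matching Google documents (this is a hedged sketch of the extended syntax, not a rule set for any real site): '*' matches any sequence of characters, and a trailing '$' anchors the rule to the end of the URL. Neither token is part of the base protocol, so other crawlers may take them literally.

```
User-agent: Googlebot
# '*' matches any run of characters (Google/Yahoo extension)
Disallow: /*?
# '$' anchors the match at the end of the URL (Google/Yahoo extension)
Disallow: /*.pdf$
```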

2:28 pm on May 24, 2007 (gmt 0)

Full Member

5+ Year Member

joined:Dec 3, 2006
posts:257
votes: 0


Current robots file:

User-agent: Googlebot
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$

site:example.com shows result #2 as www.example.com/feed/
Google Webmaster Tools shows that URL is NOT covered by the robots exclusion. Why is my wildcard rule not working: Disallow: /*/feed/$

There is no way "/*/feed/$" would match "/feed/". Even if "*" matches the empty string, you're still trying to match "//feed/" against "/feed/".
Try this instead:

Disallow: */feed/$
Disallow: */feed/rss/$
Disallow: */trackback/$

But as others have said, wildcards are understood only by Google and Yahoo.
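If you want to sanity-check rules offline, Google-style pattern matching can be approximated by translating each rule into a regular expression: escape literal characters, turn '*' into '.*', and honor a trailing '$' as an end anchor. The function names below are my own; this is a sketch of the matching logic, not Googlebot's actual implementation.

```python
import re

def google_pattern_to_regex(rule: str) -> str:
    # Translate a Google-style robots.txt path rule into a regex.
    # '*' matches any run of characters; a trailing '$' anchors the end.
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return "^" + regex + ("$" if anchored else "")

def matches(rule: str, path: str) -> bool:
    # Rules without '$' match any URL path that starts with the pattern.
    return re.match(google_pattern_to_regex(rule), path) is not None

# '/*/feed/$' needs at least one '/' inside the '*', so '/feed/' escapes it:
print(matches("/*/feed/$", "/feed/"))           # False
print(matches("/*/feed/$", "/category/feed/"))  # True
print(matches("/feed/$", "/feed/"))             # True
```

This also shows why adding a plain Disallow: /feed/$ rule alongside the wildcard one would cover the site-root feed.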