User-agent: *
Disallow: /admin
Disallow: /node/add
Disallow: /aggregator

User-agent: ia_archiver
Disallow: /

User-agent: Googlebot
Disallow: /topic
Disallow: /user/register
Disallow: /user/login
Disallow: /comment
Disallow: /*/feed$

Here are the pages that I think *should* be excluded but aren't:
www.example.com/topic/heating
www.example.com/comment/reply/14
www.example.com/wine-2/feed
Moreover, here are some pages I would like to exclude, but I'm not sure how to do it using robots.txt:
www.example.com/blog?page=11
www.example.com/home?page=13
The robots.txt has been up for a month and a half now. Should I wait longer for it to take effect? Thanks in advance for any help!
Robots.txt takes effect as soon as you upload it. Every time googlebot visits your site, the first thing it does is check your robots.txt. Googlebot then looks to see if you have specific user agent instructions for it. If you list specific instructions for the user agent googlebot, googlebot will ignore the generic instructions and only follow the specific instructions.
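To see what that means for the robots.txt posted above, here is a rough sketch using Python's standard urllib.robotparser, which also prefers a named record over the * record. It is only an approximation of googlebot: this parser does not understand wildcards, so the /*/feed$ line is left out, and the bot name SomeOtherBot is made up for the comparison.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the question, minus the wildcard line
# (urllib.robotparser only does plain prefix matching).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Disallow: /node/add
Disallow: /aggregator

User-agent: ia_archiver
Disallow: /

User-agent: Googlebot
Disallow: /topic
Disallow: /user/register
Disallow: /user/login
Disallow: /comment
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot has its own record, so the generic /admin rule is not applied to it.
googlebot_admin = parser.can_fetch("Googlebot", "http://www.example.com/admin")

# A bot with no record of its own (hypothetical name) falls back to the * record.
other_admin = parser.can_fetch("SomeOtherBot", "http://www.example.com/admin")

print("Googlebot may fetch /admin:", googlebot_admin)    # True
print("SomeOtherBot may fetch /admin:", other_admin)     # False
```

So with the file as posted, /admin is closed to generic bots but open to googlebot, because the googlebot record never mentions it.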
Your current robots.txt has several problems with it.
Disallow: /topic should be Disallow: /topic/ (notice the trailing slash). This is why googlebot is indexing www.example.com/topic/heating.
I assume you do not want googlebot to crawl your /admin/ folder so you should place a copy of the generic instructions under the googlebot line.
To block pages based on URL wildcards (aka pattern matching), add this line: Disallow: /*blog?* (this will block all URLs that contain "blog?"). For more information, see Google's robots.txt documentation [google.com]
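If it helps to see how such a pattern behaves, here is a minimal sketch that turns a wildcard pattern into a regular expression, assuming the usual Google-style meaning of * (any run of characters) and a trailing $ (end of URL). The helper names pattern_to_regex and is_blocked are my own, and this is not Google's actual matcher.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern using * and a trailing $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any run of characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(pattern, path):
    # Patterns are matched from the start of the path, query string included.
    return pattern_to_regex(pattern).match(path) is not None

print(is_blocked("/*blog?*", "/blog?page=11"))       # True
print(is_blocked("/*blog?*", "/home?page=13"))       # False
print(is_blocked("/*/feed$", "/wine-2/feed"))        # True
print(is_blocked("/*/feed$", "/wine-2/feed/extra"))  # False
```

Note that /*blog?* does nothing for www.example.com/home?page=13, so that URL would need a pattern of its own.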
Also, when in doubt, you can test your robots.txt with Google's robots.txt validator. It is located within Google Sitemaps [google.com]
While Google and some of the other major search engines will look for the most specific User-agent record that applies to them, many search engines will not. All that is required by the robots.txt Standard is that a robot accept the first record which matches (or partially matches) its user-agent name, or a User-agent record specifying "*", whichever comes first.
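As a rough illustration of that first-match reading, here is a sketch of how a strictly minimal robot might pick its record. The function name first_matching_record and the list-of-tuples layout are my own assumptions, and the wildcard line from the original file is omitted; this is not taken from any real crawler.

```python
# Each record: (list of User-agent tokens, list of Disallow paths),
# in the same order as the robots.txt posted in this thread.
RECORDS = [
    (["*"], ["/admin", "/node/add", "/aggregator"]),
    (["ia_archiver"], ["/"]),
    (["Googlebot"], ["/topic", "/user/register", "/user/login", "/comment"]),
]

def first_matching_record(records, robot_name):
    """Return the rules of the first record that is '*' or (partially) matches the name."""
    name = robot_name.lower()
    for agents, rules in records:
        for agent in agents:
            # Partial match: the record's token appears inside the robot's name.
            if agent == "*" or agent.lower() in name:
                return rules
    return []

# With the * record listed first, a first-match robot never reaches its own record.
print(first_matching_record(RECORDS, "Googlebot"))
# ['/admin', '/node/add', '/aggregator']

# Moving the * record to the end lets the named record win.
REORDERED = RECORDS[1:] + RECORDS[:1]
print(first_matching_record(REORDERED, "Googlebot"))
# ['/topic', '/user/register', '/user/login', '/comment']
```

Under this reading, listing the specific records first and the * record last gives the same result for both kinds of robot.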
See A Standard for Robot Exclusion [robotstxt.org], and interpret it very carefully.
Jim