Confused over Google spider and robots.txt

Forum Moderators: goodroi

flowermark

10:53 am on Feb 14, 2007 (gmt 0)

Hi everyone. My website has fewer than one hundred pages of content, but Google has indexed 1,500 pages. Most of them are junk pages associated with comments, feeds, tags, etc. This is our robots.txt:

User-agent: *
Disallow: /admin
Disallow: /node/add
Disallow: /aggregator
User-agent: ia_archiver
Disallow: /
User-agent: Googlebot
Disallow: /topic
Disallow: /user/register
Disallow: /user/login
Disallow: /comment
Disallow: /*/feed$

Here are some pages that I think *should* be excluded but aren't:

www.example.com/topic/heating
www.example.com/comment/reply/14
www.example.com/wine-2/feed

Moreover, here are some pages I would like to exclude, but I'm not sure how to do it using robots.txt:

www.example.com/blog?page=11
www.example.com/home?page=13

The robots.txt has been up for a month and a half now. Should I wait longer for it to take effect? Thanks in advance for any help!

goodroi

5:45 pm on Feb 14, 2007 (gmt 0)

hi flowermark,

Robots.txt takes effect as soon as you upload it. Every time Googlebot visits your site, the first thing it does is check your robots.txt. Googlebot then looks to see whether you have specific user-agent instructions for it. If you list specific instructions for the user agent Googlebot, Googlebot will ignore the generic instructions and follow only the specific instructions.

Your current robots.txt has several problems with it.
Disallow: /topic should be Disallow: /topic/ (notice the trailing slash). This is why googlebot is indexing www.example.com/topic/heating.

I assume you do not want googlebot to crawl your /admin/ folder so you should place a copy of the generic instructions under the googlebot line.

To block pages based on URL wildcards (aka pattern matching), add this line: Disallow: /*blog?* (this will block all URLs that contain "blog?"). For more information, see Google's information about robots.txt [google.com].
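To see which URLs a wildcard line would catch, here is a rough Python sketch of this kind of pattern matching. It assumes "*" stands for any run of characters and a trailing "$" anchors the end of the URL, which is how the lines in this thread are meant to work; the function names pattern_to_regex and is_blocked are made up for illustration and are not part of any Google tool.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a wildcard Disallow value into a compiled regex.

    Assumed semantics: "*" matches any run of characters and a
    trailing "$" pins the match to the end of the URL path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then rejoin the pieces with ".*"
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_blocked(pattern: str, path: str) -> bool:
    """Return True if the path (including any query string) matches."""
    # re.match only matches at the start, like a robots.txt prefix rule
    return pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    checks = [
        ("/*blog?*", "/blog?page=11"),
        ("/*blog?*", "/home?page=13"),
        ("/*/feed$", "/wine-2/feed"),
        ("/*/feed$", "/wine-2/feed/rss"),
    ]
    for pattern, path in checks:
        print(pattern, path, is_blocked(pattern, path))
```

Under these assumptions the "blog?" line catches /blog?page=11 but not /home?page=13, so the home pages would need a similar line of their own.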

Also, when in doubt, you can test your robots.txt with Google's validator. It is located within Google Sitemaps [google.com].

jdMorgan

6:02 pm on Feb 14, 2007 (gmt 0)

Blank lines are required after Disallow and before the next User-agent, and also at the end of the file.
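As a rough illustration, the Python sketch below writes out a robots.txt with a blank line between records and at the end of the file, and also repeats the generic rules under Googlebot as suggested earlier in the thread. The rule paths are copied as originally posted; the names GENERIC_RULES, RECORDS and render_robots_txt are hypothetical helpers, not part of any standard tool.

```python
from typing import List, Tuple

# Rules from the "*" record, as posted at the top of the thread
GENERIC_RULES = ["/admin", "/node/add", "/aggregator"]

# Each record is (user-agent, list of Disallow paths).
# The Googlebot record repeats the generic rules before its own.
RECORDS: List[Tuple[str, List[str]]] = [
    ("*", GENERIC_RULES),
    ("ia_archiver", ["/"]),
    (
        "Googlebot",
        GENERIC_RULES
        + ["/topic", "/user/register", "/user/login", "/comment", "/*/feed$"],
    ),
]


def render_robots_txt(records: List[Tuple[str, List[str]]]) -> str:
    """Render records as text, one blank line after every record."""
    blocks = []
    for agent, rules in records:
        lines = ["User-agent: " + agent]
        lines += ["Disallow: " + rule for rule in rules]
        blocks.append("\n".join(lines))
    # "\n\n" between records leaves a blank line; the final "\n\n"
    # leaves a blank line at the end of the file as well.
    return "\n\n".join(blocks) + "\n\n"


if __name__ == "__main__":
    print(render_robots_txt(RECORDS), end="")
```

Any path changes discussed in the thread can be made in the lists without touching the rendering function.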

While Google and some of the other major search engines will look for the most specific User-agent record that applies to them, many search engines will not. All that is required by the robots.txt Standard is that a robot accept the first record that matches (or partially matches) its user-agent name, or a User-agent record specifying "*", whichever comes first.
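The difference between the two behaviors can be sketched in Python. This is only an illustration under assumptions: "partially matches" is taken to mean a case-insensitive substring match, "most specific" is taken to mean the longest matching name, and the function names first_match and most_specific are invented here. It is not how any particular search engine is implemented.

```python
from typing import List, Tuple

Record = Tuple[List[str], List[str]]  # (user-agent names, Disallow paths)

# The records in the order they appear in the posted robots.txt
RECORDS: List[Record] = [
    (["*"], ["/admin", "/node/add", "/aggregator"]),
    (["ia_archiver"], ["/"]),
    (["Googlebot"], ["/topic", "/user/register", "/user/login",
                     "/comment", "/*/feed$"]),
]


def first_match(records: List[Record], robot_name: str) -> List[str]:
    """Take the first record naming the robot or "*", in file order."""
    name = robot_name.lower()
    for agents, rules in records:
        for agent in agents:
            if agent == "*" or agent.lower() in name:
                return rules
    return []


def most_specific(records: List[Record], robot_name: str) -> List[str]:
    """Prefer the longest matching name; "*" counts as length zero."""
    name = robot_name.lower()
    best_rules: List[str] = []
    best_length = -1
    for agents, rules in records:
        for agent in agents:
            if agent == "*":
                length = 0
            elif agent.lower() in name:
                length = len(agent)
            else:
                continue
            if length > best_length:
                best_rules, best_length = rules, length
    return best_rules


if __name__ == "__main__":
    print("first match:  ", first_match(RECORDS, "Googlebot"))
    print("most specific:", most_specific(RECORDS, "Googlebot"))
```

With the "*" record listed first, a first-match robot never reaches its own record, which is why putting the specific records ahead of the "*" record is the safer ordering.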

See A Standard for Robot Exclusion [robotstxt.org], and interpret it very carefully.

Jim

flowermark

12:56 pm on Feb 15, 2007 (gmt 0)

Thank you so much for your help. I feel kind of honored getting TWO moderators answering my question!

Your tips are very helpful. I will consult with my technical guy and let him know about it. I'll report back and let you guys know how it turns out.