Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

Confused over Google spider and robots.txt

flowermark, 5+ Year Member

Msg#: 3252492 posted 10:53 am on Feb 14, 2007 (gmt 0)

Hi everyone. My website has less than one hundred pages of content, but Google has indexed 1,500 pages. Most of them are junk pages associated with comments, feeds, tags, etc. This is our robots.txt:

User-agent: *
Disallow: /admin
Disallow: /node/add
Disallow: /aggregator
User-agent: ia_archiver
Disallow: /
User-agent: Googlebot
Disallow: /topic
Disallow: /user/register
Disallow: /user/login
Disallow: /comment
Disallow: /*/feed$

Here are the pages that I think *should* be excluded but aren't:


Moreover, here are some pages I would like to exclude, but I'm not sure how to do it using robots.txt:


The robots.txt has been up for a month and a half now; should I wait longer for it to take effect? Thanks in advance for any help!
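As a side note, one rough way to sanity-check how standards-following parsers read a file like the one above is Python's built-in urllib.robotparser. It only implements the original 1994 standard, so it ignores Google's * and $ wildcards (the /*/feed$ line is omitted below), and example.com stands in for the real domain:

```python
from urllib.robotparser import RobotFileParser

# Rules from the post above, minus the /*/feed$ wildcard line,
# which this parser would not interpret the way Google does.
robots_txt = """\
User-agent: *
Disallow: /admin
Disallow: /node/add
Disallow: /aggregator

User-agent: ia_archiver
Disallow: /

User-agent: Googlebot
Disallow: /topic
Disallow: /user/register
Disallow: /user/login
Disallow: /comment
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The Googlebot record applies to Googlebot, so its rules are honored:
print(rp.can_fetch("Googlebot", "http://example.com/user/login"))      # False
# But the generic record is ignored for Googlebot, so /admin stays open
# to it unless those rules are copied under the Googlebot record:
print(rp.can_fetch("Googlebot", "http://example.com/admin/settings"))  # True
# Bots with no record of their own fall back to User-agent: *
print(rp.can_fetch("SomeBot", "http://example.com/admin/settings"))    # False
```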



WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month

Msg#: 3252492 posted 5:45 pm on Feb 14, 2007 (gmt 0)

hi flowermark,

Robots.txt takes effect as soon as you upload it. Every time googlebot visits your site, the first thing it does is check your robots.txt. Googlebot then looks to see if you have specific user-agent instructions for it. If you list specific instructions for the user agent googlebot, googlebot will ignore the generic instructions and only follow the specific instructions.

Your current robots.txt has several problems with it.
Disallow: /topic should be Disallow: /topic/ (notice the trailing slash). This is why googlebot is indexing www.example.com/topic/heating.

I assume you do not want googlebot to crawl your /admin/ folder so you should place a copy of the generic instructions under the googlebot line.

To block pages based on URL wildcards (aka pattern matching), add this line: Disallow: /*blog?* (this will block all URLs that contain "blog?"). For more information see Google's robots.txt documentation [google.com]

Also, when in doubt you can test your robots.txt with Google's robots.txt validator. It is located within Google Sitemaps [google.com]
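For the curious, the wildcard syntax described above can be approximated with a small translation to a regular expression. This is only a sketch, not Google's actual implementation: '*' matches any run of characters, a trailing '$' anchors the match to the end of the path, and an unanchored pattern only needs to match a prefix of the path:

```python
import re

def googlebot_match(pattern: str, path: str) -> bool:
    """Approximate Google-style robots.txt pattern matching."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' becomes '.*'; every other character is matched literally.
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    if anchored:
        regex += "$"
    # re.match anchors at the start and allows a prefix match,
    # which mirrors how unanchored robots.txt rules behave.
    return re.match(regex, path) is not None

print(googlebot_match("/*blog?*", "/2007/05/blog?page=2"))  # True
print(googlebot_match("/*blog?*", "/about"))                # False
print(googlebot_match("/*/feed$", "/some-post/feed"))       # True
print(googlebot_match("/*/feed$", "/some-post/feed/atom"))  # False
```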


WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time, 10+ Year Member

Msg#: 3252492 posted 6:02 pm on Feb 14, 2007 (gmt 0)

Blank lines are required after Disallow and before the next User-agent, and also at the end of the file.

While Google and some of the other major search engines will look for the most specific User-agent record that applies to them, many search engines will not. All that is required by the robots.txt Standard is that a robot accept the first record which matches (or partially matches) its user-agent name, or a User-agent record specifying "*" whichever comes first.

See A Standard for Robot Exclusion [robotstxt.org], and interpret it very carefully.
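Combining the two replies above, a cleaned-up version of the original file might look like the sketch below: the generic rules are duplicated under the Googlebot record, the blog wildcard line is goodroi's suggestion (honored by Google but not by engines that only implement the original standard), records are separated by blank lines, and the * record comes last so literal-minded parsers find their own record first.

```
User-agent: ia_archiver
Disallow: /

User-agent: Googlebot
Disallow: /admin
Disallow: /node/add
Disallow: /aggregator
Disallow: /topic
Disallow: /user/register
Disallow: /user/login
Disallow: /comment
Disallow: /*blog?*
Disallow: /*/feed$

User-agent: *
Disallow: /admin
Disallow: /node/add
Disallow: /aggregator
```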



flowermark, 5+ Year Member

Msg#: 3252492 posted 12:56 pm on Feb 15, 2007 (gmt 0)

Thank you so much for your help. I feel kind of honored to have TWO moderators answering my question!

Your tips are very helpful. I will consult with my technical guy and let him know about it. I'll report back and let you guys know how it turns out.
