
Sitemaps, Meta Data, and robots.txt Forum

    
Confused over Google spider and robots.txt
flowermark
10:53 am on Feb 14, 2007 (gmt 0)

Hi everyone. My website has less than one hundred pages of content, but Google has indexed 1,500 pages. Most of them are junk pages associated with comments, feeds, tags, etc. This is our robots.txt:

User-agent: *
Disallow: /admin
Disallow: /node/add
Disallow: /aggregator
User-agent: ia_archiver
Disallow: /
User-agent: Googlebot
Disallow: /topic
Disallow: /user/register
Disallow: /user/login
Disallow: /comment
Disallow: /*/feed$

Here are some pages that I think *should* be excluded but aren't:

www.example.com/topic/heating
www.example.com/comment/reply/14
www.example.com/wine-2/feed

Moreover, here are some pages I would like to exclude, but I'm not sure how to do it using robots.txt:

www.example.com/blog?page=11
www.example.com/home?page=13

The robots.txt has been up for a month and a half now. Should I wait longer for it to take effect? Thanks in advance for any help!

goodroi
5:45 pm on Feb 14, 2007 (gmt 0)

hi flowermark,

Robots.txt takes effect as soon as you upload it. Every time googlebot visits your site, the first thing it does is check your robots.txt file. Googlebot then looks to see whether you have instructions for its specific user agent. If you list specific instructions for the user agent googlebot, googlebot will ignore the generic instructions and follow only the specific instructions.

Your current robots.txt has several problems.
Disallow: /topic should be Disallow: /topic/ (note the trailing slash). This is why googlebot is indexing www.example.com/topic/heating.

I assume you do not want googlebot to crawl your /admin/ folder, so you should place a copy of the generic instructions under the googlebot line.
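Putting those suggestions together, the Googlebot section of the file might look something like this (a sketch only; keep or drop lines to match your own site):

```txt
User-agent: Googlebot
Disallow: /admin
Disallow: /node/add
Disallow: /aggregator
Disallow: /topic/
Disallow: /user/register
Disallow: /user/login
Disallow: /comment
Disallow: /*/feed$
```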

To block pages based on URL wildcards (aka pattern matching), add this line: Disallow: /*blog?* (this will block all URLs that contain "blog?"). For more information, see Google's robots.txt documentation [google.com].
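To see how Google-style wildcard rules play out against the URLs in question, here is a small sketch in Python. It is not any official parser, just a hand-rolled approximation of the pattern rules goodroi describes: * matches any run of characters, a trailing $ anchors to the end of the URL, and everything else matches as a prefix.

```python
import re

def robots_pattern_to_regex(pattern):
    # Translate a Google-style robots.txt path pattern into a regex.
    # '*' matches any sequence of characters; a trailing '$' anchors
    # the pattern to the end of the URL path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(path, disallow_rules):
    # A path is blocked if any Disallow pattern matches from its start.
    return any(robots_pattern_to_regex(r).match(path) for r in disallow_rules)

rules = ["/topic/", "/comment", "/*/feed$", "/*blog?*"]
print(is_disallowed("/topic/heating", rules))   # True
print(is_disallowed("/wine-2/feed", rules))     # True
print(is_disallowed("/blog?page=11", rules))    # True
print(is_disallowed("/home?page=13", rules))    # False
```

Note the last case: /*blog?* does not catch /home?page=13, so a separate rule (e.g. Disallow: /*home?*) would be needed for that URL pattern.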

Also, when in doubt you can test your robots.txt with Google's robots.txt validator. It is located within Google Sitemaps [google.com].

jdMorgan
6:02 pm on Feb 14, 2007 (gmt 0)

Blank lines are required after Disallow and before the next User-agent, and also at the end of the file.

While Google and some of the other major search engines will look for the most specific User-agent record that applies to them, many search engines will not. All that is required by the robots.txt Standard is that a robot accept the first record which matches (or partially matches) its user-agent name, or a User-agent record specifying "*", whichever comes first.
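The practical consequence of that first-match rule can be sketched in a few lines of Python (a hypothetical illustration, not any real crawler's code): a strictly standard-compliant robot scans the records in file order and takes the first one that matches its name or is "*".

```python
def pick_record(records, agent):
    # Per the 1994 robots.txt standard: return the rules of the FIRST
    # record whose User-agent line matches (or partially matches) the
    # robot's name, or a record for "*", whichever comes first in the file.
    agent = agent.lower()
    for user_agents, rules in records:
        for ua in user_agents:
            if ua == "*" or ua.lower() in agent:
                return rules
    return []

# A file where the generic record comes before the Googlebot record:
records = [
    (["*"], ["/admin"]),
    (["Googlebot"], ["/topic/"]),
]
print(pick_record(records, "Googlebot/2.1"))  # ['/admin']
```

A bot following the letter of the standard stops at the "*" record and never sees the Googlebot-specific rules below it, which is why relying on Google's more-lenient "most specific record wins" behavior is risky for other crawlers.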

See A Standard for Robot Exclusion [robotstxt.org], and interpret it very carefully.

Jim

flowermark
12:56 pm on Feb 15, 2007 (gmt 0)

Thank you so much for your help. I feel kind of honored getting TWO moderators answering my question!

Your tips are very helpful. I will consult with my technical guy and let him know about it. I'll report back and let you guys know how it turns out.
