Sitemaps, Meta Data, and robots.txt Forum

    
Robots.txt disallow everything in folder, but not folder itself
sequence
msg:3987324
1:55 pm on Sep 10, 2009 (gmt 0)

Hi there, here's the situation:

I have about 10 brand pages like these, which are very important to keep indexed:
domain.com/brand1/
domain.com/brand2/
domain.com/brand3/
etc.

Next, we have a lot of clickouts that need to be blocked by robots.txt. These clickouts live as IDs under the brands:
domain.com/brand1/123/
domain.com/brand1/456/
domain.com/brand1/789/
domain.com/brand2/010/
domain.com/brand2/111/
domain.com/brand3/213/
etc.

How do we block the latter links without disallowing the brand pages?

 

jdMorgan
msg:3987366
3:04 pm on Sep 10, 2009 (gmt 0)

There's no good way to do this that will work for all robots. You should really put URLs you don't want spidered into a separate directory, or divide the brands directory into spiderable and non-spiderable directories, such as

/brands/public/brand/ and /brands/private/brand/
or
/brands-public/brand/ and /brands-private/brand/
or
/brands/brand-public/ and /brands/brand-private/
etc.

That is, spidering should be considered in the design of the directory layout.

For Google and some other major search engines, you can use the "Allow:" directive and/or wild-card paths in robots.txt. But many search engines don't support "Allow:" and wild-card paths because they are not part of the original Standard for Robot Exclusion. That leaves you with the on-page (HTML meta-tag) robots control method, which may or may not be applicable to your situation. Or look into the X-Robots-Tag HTTP header -- but again, this is not supported by all robots.
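For the engines that do honor "Allow:" and wild-cards, a sketch of what that could look like for one brand (paths are the example ones from this thread; the "$" end-anchor and rule-precedence behavior described here are Google's, not part of the original standard):

```
User-agent: Googlebot
# Block everything under the brand folder...
Disallow: /brand1/
# ...but re-allow the folder page itself.
# "$" anchors the end of the URL, and Google lets the
# longer (more specific) Allow rule win over the Disallow.
Allow: /brand1/$
```

Note this is placed under "User-agent: Googlebot" deliberately: under a plain "User-agent: *" record, an engine that ignores "Allow:" would see only the Disallow and drop the brand page too, which is exactly the risk described above.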

Really, the best approach is to consider file organization, spiderability, access-control, and cacheability as a fundamental part of directory-layout design...

Jim

sequence
msg:3987375
3:15 pm on Sep 10, 2009 (gmt 0)

Ok thanks for the information. So the best way is to move clickouts to a subfolder, say:
domain.com/brand1/go/123/
domain.com/brand1/go/456/
domain.com/brand1/go/789/
domain.com/brand2/go/010/
domain.com/brand2/go/111/
domain.com/brand3/go/213/

And then
User-agent: *
Disallow: /brand1/go
Disallow: /brand2/go
Disallow: /brand3/go

That wouldn't hurt the brand pages themselves, would it?

jdMorgan
msg:3987391
3:34 pm on Sep 10, 2009 (gmt 0)

No, it won't "hurt" the brand pages. Only /brand<numbers>/go/<numbers>, /brand<numbers>/go/, and /brand<numbers>/go (if they exist) would be Disallowed.

Robots.txt uses prefix matching: any URL-path that begins with the specified string is affected.
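As a quick check, Python's standard urllib.robotparser applies the same prefix-matching rule, so you can verify the proposed rules before deploying them (domain.com and the paths are the example values from this thread):

```python
from urllib.robotparser import RobotFileParser

# One of the rules proposed above, fed to the parser as lines.
rules = [
    "User-agent: *",
    "Disallow: /brand1/go",
]

rp = RobotFileParser()
rp.parse(rules)

# The brand page itself does not start with "/brand1/go", so it stays fetchable.
print(rp.can_fetch("*", "http://domain.com/brand1/"))         # True

# Any URL-path beginning with the prefix is disallowed.
print(rp.can_fetch("*", "http://domain.com/brand1/go/123/"))  # False

# Prefix matching also catches paths that merely share the prefix,
# e.g. a hypothetical /brand1/gold/ -- write "Disallow: /brand1/go/"
# with a trailing slash if that matters for your URL layout.
print(rp.can_fetch("*", "http://domain.com/brand1/gold/"))    # False
```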

Jim
