
Sitemaps, Meta Data, and robots.txt Forum

    
Blocking Spiders
How to block spiders from crawling your site
Brian0275
Msg#: 4343689 posted 7:03 pm on Jul 25, 2011 (gmt 0)

We are getting ready to implement a feature on our site that would create massive duplicate content. It is good for users but bad for spiders, and we have weighed several options for avoiding the problem:

1) rel="canonical" - I don't think this is the best option, because spiders would still crawl all the duplicate URLs and use up our resources.

2) Block the major bot User Agents from seeing this feature - would need ongoing updates, and we could never block every bot.

3) Meta noindex, nofollow, noarchive - could be a workable solution (see the sketch below).
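
For reference, options 1 and 3 each come down to a line or two in the <head> of the duplicate pages. A rough sketch, assuming a hypothetical duplicate URL like /widgets?sort=price whose preferred version is /widgets:

<!-- Option 1: point spiders at the preferred URL -->
<link rel="canonical" href="http://www.example.com/widgets">

<!-- Option 3: let spiders fetch the page but keep it out of the index -->
<meta name="robots" content="noindex, nofollow, noarchive">

Either way the spider still has to fetch the page before it sees the tag, which is why neither option saves crawl resources on our end.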

What do you guys suggest, or what have you had the best luck with?

 

penders
Msg#: 4343689 posted 8:40 pm on Jul 25, 2011 (gmt 0)

4) Block known spider IPs? (Is there a list of known spider IPs?) ... this would probably need the same ongoing updating as the User Agents.

5) Robots.txt?
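
For option 5, a minimal robots.txt sketch, assuming the duplicate pages all sit under a hypothetical /feature/ path:

User-agent: *
Disallow: /feature/

Compliant crawlers would then skip those URLs entirely, which also saves the crawl resources that rel="canonical" wouldn't. Bots that ignore robots.txt are another matter, of course.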

tangor
Msg#: 4343689 posted 8:50 pm on Jul 25, 2011 (gmt 0)

Work smart, not hard... WHITELIST the bots you allow and disallow all the rest. These days the list of who's invited in is MUCH SMALLER than the list you'd have to block.

g1smd
Msg#: 4343689 posted 8:57 pm on Jul 25, 2011 (gmt 0)

Bad bots ignore robots.txt, so you need .htaccess rules to shut the door in their face.

Good bots obey robots.txt, so you can keep them out of various parts of the site quite easily if those URLs have an easy-to-recognise pattern or feature in them.
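
Something along these lines in .htaccess (mod_rewrite syntax, with made-up user agent strings standing in for whatever shows up in your logs):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} badbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} evilscraper [NC]
RewriteRule .* - [F,L]

That sends a 403 Forbidden to anything matching those strings; the robots.txt rules take care of the well-behaved crawlers.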

tangor
Msg#: 4343689 posted 9:12 pm on Jul 25, 2011 (gmt 0)

Thanks for the reminder that I left out a bit, g1smd... .htaccess, of course, is how it is done. As for robots.txt, the same applies: whitelist what you allow and disallow all the rest... then deal with the unruly bots that don't play nice. Mea culpa for leaving that out, since we say this many, many times in many, many threads... my bad!
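
To illustrate the whitelist idea on the robots.txt side (a sketch only, pick your own guest list, with /feature/ again standing in for wherever the duplicate pages live):

User-agent: Googlebot
Disallow: /feature/

User-agent: Bingbot
Disallow: /feature/

User-agent: *
Disallow: /

The invited bots get the run of the site minus the duplicate-content area; everything else that bothers to read the file is told to stay out. The .htaccess side works the same way in principle, except it is enforced rather than requested.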

g1smd
Msg#: 4343689 posted 9:15 pm on Jul 25, 2011 (gmt 0)

It's a multi-faceted approach. Once you have a month of raw site logs, you will have details for well over 90% of the bots that might access your site.

After that, you'll just need a minor tweak to the rules now and again.
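
If the server is Apache writing the standard combined log format, one quick way to pull that list out of a month of logs (assuming the file is called access.log) is:

awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn | head -50

The sixth quote-delimited field in the combined format is the User-Agent string, so this lists the fifty most frequent ones; anything you don't recognise gets looked up and, if need be, added to the block list.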
