What's the Best Way to Keep All Spiders/Bots Out?


inbound - 3:34 pm on Oct 2, 2011 (gmt 0)


I'm launching a new site that I don't want indexed by any search engine (apart from a few static pages), and I certainly don't want people running bots against it, as each page will make one or more BOSS API calls at a cost to me.
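
For the "few static pages only" part, a robots.txt that allows named pages and disallows the rest is the usual starting point, though it only affects crawlers that choose to obey it. Below is a minimal sketch that checks such a file with Python's standard urllib.robotparser. The page paths and the "ExampleBot" name are assumptions made up for illustration, not details of the site described here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow two static pages, disallow everything else.
# The page names are placeholders, not taken from the thread.
ROBOTS_TXT = """\
User-agent: *
Allow: /about.html
Allow: /contact.html
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(path: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if a rule-abiding crawler may fetch the given path."""
    return parser.can_fetch(user_agent, path)
```

With these rules, may_crawl("/about.html") is allowed and may_crawl("/search?q=test") is refused. Badly behaved bots ignore robots.txt entirely, which is where the techniques listed next come in.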

I'm aware of several techniques such as:

Rate limiting
Honeypots
IP Banning
UA Banning
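
Two of the items above, rate limiting and UA banning, can be sketched in a few lines. This is only an illustration under stated assumptions: the class name, the limits, and the blocked user-agent fragments are all hypothetical, and the state lives in process memory, so it would not survive a restart or work across several servers.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each client key."""

    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested
        self.hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key: str) -> bool:
        now = self.clock()
        recent = self.hits[key]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


# Hypothetical fragments; a real list would need regular maintenance.
BLOCKED_UA_FRAGMENTS = ("curl", "wget", "python-requests")


def ua_is_blocked(user_agent: str) -> bool:
    """Reject empty user agents and ones containing a blocked fragment."""
    ua = user_agent.lower()
    return not ua or any(fragment in ua for fragment in BLOCKED_UA_FRAGMENTS)
```

A request handler might call ua_is_blocked(...) first and then limiter.allow(client_key), where the key is whatever identifies a client (an IP address, or something finer-grained). User-agent strings are trivially forged, so the UA check only filters out the laziest bots.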

But there's a very specific issue: people are likely to want to "crawl" the site by issuing lots of search queries rather than by following links. Another thing (that may help a bit) is that users are going to be UK only, but I might want to allow US, CA, IE, AU and NZ usage too (mainly so webmasters/press can try it and write about it).
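
The country restriction could look roughly like the sketch below. It assumes some IP-to-country lookup exists (a GeoIP database or a CDN header, for instance); that lookup is passed in as a plain function here because no particular provider is named in this thread. Note that the ISO 3166-1 code for the UK is GB. The function and constant names are hypothetical.

```python
from typing import Callable, Optional

# Countries mentioned above, as ISO 3166-1 alpha-2 codes (UK is "GB").
ALLOWED_COUNTRIES = frozenset({"GB", "US", "CA", "IE", "AU", "NZ"})


def country_allowed(
    ip: str,
    lookup: Callable[[str], Optional[str]],
    allowed: frozenset = ALLOWED_COUNTRIES,
) -> bool:
    """Return True if the IP resolves to an allowed country.

    `lookup` is a stand-in for whatever geolocation source is used; it
    should return a two-letter country code, or None if the IP is unknown.
    """
    code = lookup(ip)
    # Assumption: unknown locations are refused. Allowing them instead is
    # an equally defensible policy, depending on how strict the site is.
    return code is not None and code.upper() in allowed
```

For example, with a toy table such as {"203.0.113.5": "GB"}, calling country_allowed("203.0.113.5", table.get) passes, while an address missing from the table is refused. IP geolocation is approximate and proxies defeat it, so this is a coarse filter rather than a guarantee.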

Effectively, it's the same issue that search engines have, and I'm unsure of the best way to deal with several things (such as IP addresses shared by many simultaneous users; is AOL still like that?). I don't want to use ANY feature that could be seen as a privacy issue (so no dropping cookies without permission, although I'd be happy to use a short-lived server-side identifier).
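
One possible reading of "short-lived server-side identifier" is a keyed hash of the IP address and user agent, computed with a random salt that is thrown away and regenerated periodically. Nothing is stored on the client, and identifiers from different periods cannot be linked once the old salt is gone. The sketch below assumes that reading; the class name and the one-hour lifetime are hypothetical. Including the user agent also helps a little with shared IPs, since users behind one proxy often run different browsers.

```python
import hashlib
import hmac
import secrets
import time


class EphemeralIdentifier:
    """Derive a per-client key that only stays stable for `lifetime` seconds."""

    def __init__(self, lifetime: float = 3600.0, clock=time.monotonic):
        self.lifetime = lifetime
        self.clock = clock  # injectable so rotation can be tested
        self._salt = secrets.token_bytes(32)
        self._salt_born = clock()

    def _current_salt(self) -> bytes:
        now = self.clock()
        if now - self._salt_born >= self.lifetime:
            # Rotate: the old salt is discarded, so old identifiers can no
            # longer be recomputed or linked to new ones.
            self._salt = secrets.token_bytes(32)
            self._salt_born = now
        return self._salt

    def identify(self, ip: str, user_agent: str) -> str:
        """Return a hex identifier for this IP and user agent pair."""
        message = f"{ip}\n{user_agent}".encode("utf-8")
        return hmac.new(self._current_salt(), message, hashlib.sha256).hexdigest()
```

The resulting string can serve as the key for a rate limiter in place of the bare IP address. Whether this counts as acceptable from a privacy standpoint is a judgment the sketch does not settle.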

I'm happy to read as much as required (so links to older, but still valid, threads would be handy too).


Thread source: http://www.webmasterworld.com/search_engine_spiders/4369764.htm