Forum Moderators: phranque
One tactic you might consider is to have programs monitor your log files for any bots that seem to behave rudely, like pulling your pages too fast, or not crawling properly. You can then shut them down temporarily and send a warning to yourself to check it out. Things like that might help.
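For what it's worth, here is a rough sketch of what such a log monitor could look like. It assumes your server writes Common Log Format lines; the function name `find_fast_clients` and the thresholds are made up for illustration, so tune them to your own traffic. It only flags IPs; shutting them out and emailing yourself is left to whatever your server setup allows.

```python
import re
from collections import defaultdict, deque
from datetime import datetime

# Client IP and timestamp of a Common Log Format line (assumed log format).
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

def find_fast_clients(lines, max_hits=10, window_seconds=10):
    """Return IPs that made more than max_hits requests within window_seconds.

    Hypothetical helper: the thresholds are illustrative, not recommendations.
    """
    recent = defaultdict(deque)  # ip -> request times inside the current window
    flagged = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        ip, stamp = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        hits = recent[ip]
        hits.append(when)
        # Drop requests that have fallen out of the time window.
        while (when - hits[0]).total_seconds() > window_seconds:
            hits.popleft()
        if len(hits) > max_hits:
            flagged.add(ip)
    return flagged
```

You could run this from cron over the tail of your access log and review the flagged IPs before blocking anything, since a legitimate search engine crawler can also be fast at times.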
I also make sure to plug my content with a copyright notice and a link back to my site that is only visible to robots. I don't mind that search engines see it, not a big deal, but if someone else scrapes my site, they have to do some tricky programming to get rid of all my randomized (and randomly placed) copyright notices... ;-)
Hope this helps.
Good luck!
Also install base href tags, in case they scrape the whole page. This helps search engines determine where the original content came from.
[edited by: Lorel at 8:22 pm (utc) on Mar. 20, 2008]
User-agent: *
Disallow: /
This is the worst advice I have ever seen! All "good" bots will obey robots.txt and stop crawling, which causes your site to fall out of Google, Yahoo and Live, but all the bad bots and scrapers will still come and rip your site.
This will stop both good and bad bots, depending on whether they obey the robots.txt file.
The best way is to search for the robots.txt file hack, whereby you throw off the bad robots by listing a file in your robots.txt that isn't otherwise in use. Block any crawler that tries to access that file. A similar trick is to set up a hidden email address and block any IP that tries to email that account.
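As a rough illustration of the trap idea, something like the sketch below could work. It assumes robots.txt contains a line such as `Disallow: /bot-trap/`, that nothing on the site links to that path, and that the access log is in Common Log Format. The path `/bot-trap/` and both function names are invented for this example. The `Deny from` lines it prints use the older Apache 2.2-style access syntax, so check what your own server version expects before pasting them into .htaccess.

```python
import re

# Client IP and requested path of a Common Log Format line (assumed log format).
REQUEST = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)')

def find_trap_visitors(lines, trap_path="/bot-trap/"):
    """Return IPs that requested the trap path disallowed in robots.txt.

    trap_path is a hypothetical path; well-behaved bots never request it.
    """
    visitors = set()
    for line in lines:
        match = REQUEST.match(line)
        if match and match.group(2).startswith(trap_path):
            visitors.add(match.group(1))
    return visitors

def deny_rules(ips):
    """Format the IPs as Apache 2.2-style 'Deny from' lines, sorted as strings."""
    return ["Deny from " + ip for ip in sorted(ips)]

if __name__ == "__main__":
    sample = [
        '10.0.0.9 - - [31/Mar/2008:11:01:00 +0000] "GET /bot-trap/page.html HTTP/1.1" 200 120',
        '10.0.0.5 - - [31/Mar/2008:11:01:02 +0000] "GET /index.html HTTP/1.1" 200 900',
    ]
    print("\n".join(deny_rules(find_trap_visitors(sample))))
```

The point of the design is that only a crawler that reads robots.txt and ignores it (or mines it for paths) ever reaches the trap, so the good bots are left alone.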
Next time, read my post, lammert. It was stated correctly and answered the more general question that was being asked: "How to limit crawlers/spiders, in general, from viewing the site." Not only the bad ones.
The second question was inquiring about the list of good sites. But .htaccess is the best way to block access for the bad sites only, which answers the second question. How you find these sites and set that up is in a post that was deleted.
[edited by: phranque at 11:01 am (utc) on Mar. 31, 2008]
[edit reason] please see TOS #24 [webmasterworld.com] [/edit]
I was thinking the same thing about the bad advice, but I have been out of this so long, I thought maybe I was missing something.
[edited by: Eric_in_Tennessee at 8:05 pm (utc) on April 2, 2008]
[webmasterworld.com...]