maximillianos - 1:29 pm on Mar 21, 2010 (gmt 0)
Thanks. I'll check that command out.
My "system" has nabbed over 15 scrapers this week alone. I've had to tweak the limits it checks for, and how much of the log file it reviews, to be sure I'm not catching a heavy-pageviewing visitor... Right now I've bumped the cron job interval down to 5 minutes, and I only look at 7000 lines of the log file on each run. Then I have a second job that runs at a longer interval and does a broader check of the log files in case there is a slow scraper...
This morning it blocked one that had taken 150 pages in 90 seconds... who knows how many of my 100,000 pages it would have gotten if my program hadn't caught it.
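A minimal sketch of this kind of log check, in Python, might look like the following. It is illustrative only and not the actual script described in this post: the function names, the Apache-style log format, and the thresholds (more than 100 hits in any 90-second window) are all assumptions, and the blocking step itself is left out. Only the 7000-line figure comes from the post.

```python
import re
from collections import defaultdict, deque
from datetime import datetime

# Assumption: Apache/Nginx "common" or "combined" log format, e.g.
# 1.2.3.4 - - [21/Mar/2010:13:29:00 +0000] "GET /page HTTP/1.1" 200 512
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def tail_lines(path, count=7000):
    """Return the last `count` lines of the log file (hypothetical helper)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return list(deque(handle, maxlen=count))


def find_scrapers(lines, max_hits=100, window_seconds=90):
    """Return the IPs with more than `max_hits` requests inside any
    `window_seconds` span. Assumes the lines are in chronological order."""
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not look like log entries
        ip, stamp = match.groups()
        try:
            when = datetime.strptime(stamp, TIME_FORMAT)
        except ValueError:
            continue  # skip unparseable timestamps
        window = recent[ip]
        window.append(when)
        # Drop requests that have fallen out of the sliding window.
        while (when - window[0]).total_seconds() > window_seconds:
            window.popleft()
        if len(window) > max_hits:
            flagged.add(ip)
    return flagged
```

A cron job could call something like `find_scrapers(tail_lines("/path/to/access.log"))` every few minutes and pass the result to whatever does the blocking. A second, less frequent job could reuse the same function with a larger line count and a longer window to catch slow scrapers. The log path here is a placeholder.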
I go back and forth on this issue. On one hand, my site has survived 10 years without such a system in place. On the other hand, a few years ago a new competitor in my niche scraped 15,000 pages of content from my site to kick-start their site... and they got away with it (long story).
If I'd had this system in place back then, it might have stopped them. I guess it helps me sleep better at night knowing there is a good chance it will thwart the next copy-cat site that tries to steal my pages.
I know a sophisticated person can get around it, but then again there is not much any of us can do against the most advanced tactics.