Yes, I know scrapers are free to ignore robots.txt.
When the spidering began, I originally assumed they were interested in the new URLs that had become available (even though this content was already available via port 80).
But then it occurred to me that maybe we were being scraped because we were asking search engines not to spider our content. That is, if our pages aren't in the SERPs under our own domain, the content might be more attractive to steal.
I don't have a lot of experience with robots.txt changes, so I'm interested in anyone's impressions here.
For example, on my sites I usually add a few fun entries to robots.txt, like:
Disallow: /MeanPeople/
Disallow: /Headaches/
Disallow: /SuperDuperSecretFolder/
Since those directories are nothing like my real directories, I know the only way someone would try to access them is by reverse engineering my robots.txt. Be careful, though: if you have smart but mischievous friends who like to play around on your site, you may end up blocking them too.
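If you want to see who is actually taking the bait, here is a rough sketch in Python that scans access log lines for requests to those decoy folders. It assumes your server writes Common Log Format lines; the function name, the regex, and the sample log entries are all made up for illustration, so adjust them to whatever your server really logs.

```python
import re

# Decoy paths from the robots.txt entries above; no real content lives there.
HONEYPOT_PREFIXES = ("/MeanPeople/", "/Headaches/", "/SuperDuperSecretFolder/")

# Assumption: Common Log Format. Captures the client address and the request path.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)')

def find_honeypot_visitors(log_lines):
    """Return the set of client addresses that requested a decoy path."""
    offenders = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        ip, path = match.groups()
        if path.startswith(HONEYPOT_PREFIXES):
            offenders.add(ip)
    return offenders

# Hypothetical sample entries using documentation-only IP ranges.
sample = [
    '203.0.113.7 - - [10/Oct/2008:13:55:36 -0700] "GET /SuperDuperSecretFolder/ HTTP/1.1" 404 209',
    '198.51.100.4 - - [10/Oct/2008:13:56:01 -0700] "GET /index.html HTTP/1.1" 200 5120',
]
print(find_honeypot_visitors(sample))
```

What you do with the resulting addresses (deny rules, firewall, or just keeping an eye on them) is up to you. Given the warning about curious friends, reviewing the list by hand before blocking anything is probably the safer choice.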