Forum Moderators: goodroi


Scrapers checking robots.txt

Spiders arrived on heels of robots.txt deployment


timster

3:34 pm on Dec 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A site I work on was recently heavily spidered (almost immediately) after a few small changes. The first change was to deploy a robots.txt file that disallows all spidering. The second change made several thousand pages available via https.
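For reference, a robots.txt that disallows all spidering (the standard syntax, not a copy of the poster's actual file) is just:

```
User-agent: *
Disallow: /
```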

Yes, I know scrapers are free to ignore robots.txt

When the spidering began, I originally assumed they were interested in the new URLs that became available (even though this content was already available via port 80).

But then it occurred to me, maybe we were being scraped because we were asking search engines not to spider our content. That is, if our pages aren't in the SERPs under our own domain, the content might be more attractive to steal.

I don't have a lot of experience with robots.txt changes, so I'm interested in anyone's impressions here.

goodroi

5:34 pm on Jan 2, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



It is possible that a competitor was monitoring your robots.txt, but that is not very common. My gut feeling is that it was just coincidence. Many webmasters don't bother looking at robots.txt.

timster

4:26 pm on Jan 3, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the response. Yes, it certainly could have been a coincidence. I appreciate you sharing your impression.

goodroi

1:29 am on Jan 4, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If you are not feeling comfortable and think someone is out to get you, then you can create a bot trap. List a folder only in robots.txt, then record any IP that attempts to access that folder and block the IP. Make sure you monitor the process so you know who you are blocking.

For example on my sites I usually add a few fun entries to robots.txt like:

Disallow: /MeanPeople/
Disallow: /Headaches/
Disallow: /SuperDuperSecretFolder/

Since those directories are nothing close to my real directories, I know the only way someone would try to access them is if they were reverse-engineering my robots.txt. If you have smart but mischievous friends who like to play around on your site, you may end up blocking them.
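The trap goodroi describes can be sketched in a few lines of server-side code. This is a minimal illustration, not anything from the thread: the trap paths echo the example robots.txt entries above, and `record_and_block` is a hypothetical helper you would wire into your own request handling (and whose blocklist you would review by hand before actually denying anyone, as advised).

```python
# Honeypot folders listed in robots.txt but never linked anywhere on the site.
# Any request for these paths implies the client read robots.txt and ignored it.
TRAP_PATHS = {"/MeanPeople/", "/Headaches/", "/SuperDuperSecretFolder/"}


def is_trap_hit(request_path, trap_paths=TRAP_PATHS):
    """Return True if the requested path falls under a honeypot folder."""
    return any(request_path.startswith(p) for p in trap_paths)


def record_and_block(ip, request_path, blocklist):
    """If the request hits a trap folder, add the client IP to the blocklist.

    `blocklist` is a set of banned IPs; in practice you would persist it
    and review it before enforcing a block, per the advice above.
    Returns True if the IP was recorded, False otherwise.
    """
    if is_trap_hit(request_path):
        blocklist.add(ip)
        return True
    return False
```

In a real deployment this check would run on every request, with the blocklist consulted before serving any page.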