Blocking site scraper programs

Forum Moderators: phranque

flyerguy

1:43 pm on Jan 25, 2005 (gmt 0)

I need to stop people from stealing my content. It's really amazing the depths that people will sink to in order to make a buck.

I realize there are many ways to spoof and still get my content, but I figure anyone smart enough to beat the more advanced techniques is smart enough to realize there are easier and safer ways to make money off the internet.

Can someone please give me a game plan to fend off these losers?

Here is my latest user agent report:

User agent                Hits    Share
MSIE                   8472334  73.879%
Mozilla                1650366  14.391%
Mozilla (compatible)    337238   2.941%
msnbot                  245095   2.137%
Wget                    187283   1.633%
Opera                   136936   1.194%
Teleport Pro             84723   0.739%
Googlebot                83936   0.732%
Java1.3.1                79265   0.691%
Mediapartners-Google     72397   0.631%
WebStripper              22580   0.197%
WebCopier v4.0           11125   0.097%
ia_archiver              10936   0.095%
InternetSeer.com          9075   0.079%
WebSnatcher               5514   0.048%
SiteSucker                4555   0.040%
Anarchie                  4017   0.035%
Pompos                    3570   0.031%
psbot                     3397   0.030%
FlashGet

bcolflesh

1:48 pm on Jan 25, 2005 (gmt 0)

[webmasterworld.com...]

too much information

1:52 pm on Jan 25, 2005 (gmt 0)

Have you thought of trying cookies? You could set a cookie for every visitor, then check the value of the cookie and if the cookie didn't exist you could just serve up a 403 or some other type of useless content.

Then you only have to worry about allowing the legitimate bots through, which can be done by IP or by their UA.
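
To make the cookie idea concrete, here is a minimal sketch in Python (the thread itself names no language, and the original poster is on a Windows server, so treat this as logic to port rather than drop-in code). The cookie name, the entry path, and the crawler tokens are all made-up values for illustration. It also assumes that one entry page hands out the cookie, since otherwise every first-time visitor would be refused.

```python
from http.cookies import SimpleCookie

# Illustrative values only; none of these names come from the thread.
COOKIE_NAME = "visited"
ENTRY_PATH = "/"
ALLOWED_BOT_TOKENS = ("googlebot", "msnbot", "mediapartners-google")


def decide(path, cookie_header, user_agent):
    """Return (status_code, set_cookie_value_or_None) for one request."""
    ua = (user_agent or "").lower()
    # Let known crawlers through by UA substring. This is spoofable;
    # checking the requesting IP as well would be stricter.
    if any(token in ua for token in ALLOWED_BOT_TOKENS):
        return 200, None

    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    if COOKIE_NAME in cookies:
        return 200, None

    # First visit: the entry page is served and hands out the cookie.
    if path == ENTRY_PATH:
        return 200, f"{COOKIE_NAME}=1; Path=/"

    # Request for an inner page without the cookie: refuse it.
    return 403, None
```

For example, `decide("/catalog.html", "", "Wget/1.9.1")` yields a 403, while the same request with `"visited=1"` in the cookie header is served. The trade-off is that a real visitor who follows a deep link without passing through the entry page, or who has cookies disabled, is refused too.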

flyerguy

2:00 pm on Jan 25, 2005 (gmt 0)

That thread looks great, however I'm on a Windows Server.

bcolflesh

2:06 pm on Jan 25, 2005 (gmt 0)

[webmasterworld.com...]

flyerguy

2:15 pm on Jan 25, 2005 (gmt 0)

Thanks bcolflesh - I will look into the IIS rewrite apps, as installing them would also fix up my dynamic storefront nicely.

As a quick short-term fix, do you know of a comprehensive robots.txt that knocks off all the major scraper programs?

bcolflesh

2:28 pm on Jan 25, 2005 (gmt 0)

[webmasterworld.com...]

Most bad bots ignore the robots.txt file.
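
A small Python sketch of why that is: robots.txt is only a request, and the check happens inside the client. The example uses the standard library's `urllib.robotparser` to play the part of a well-behaved crawler. The two-entry robots.txt and the helper name `polite_client_may_fetch` are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A tiny robots.txt of the kind discussed here (entries chosen for illustration).
ROBOTS_TXT = """\
User-agent: WebStripper
Disallow: /

User-agent: Wget
Disallow: /
"""

parser = RobotFileParser()
# Record that rules were loaded; can_fetch() returns False until a load time is set.
parser.modified()
parser.parse(ROBOTS_TXT.splitlines())


def polite_client_may_fetch(user_agent, url):
    """What a crawler that honours robots.txt would conclude for this URL."""
    return parser.can_fetch(user_agent, url)
```

A client identifying itself as `WebStripper/2.0` gets `False` for `/catalog.html`, and one identifying itself as `Googlebot/2.1` gets `True`. Nothing on the server enforces either answer: a scraper that never runs this check, or that changes its user agent string, simply requests the page.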

encyclo

2:28 pm on Jan 25, 2005 (gmt 0)

a comprehensive robots.txt that knocks off all the major scraper programs?

Try: [webmasterworld.com...] ;)

Bear in mind that many scraper programs ignore robots.txt completely, so you will still need to look at IP banning etc. as well as the above. Also, don't just copy the list - you need to make your own judgement about the bots listed.

Too slow ;)

flyerguy

2:35 pm on Jan 25, 2005 (gmt 0)

Yeah I have seen the 'Input your own User Agent ID' section of programs such as Web Reaper.

Seeing all these scrapers in my logs with their stock IDs makes me think that these thieves are especially stupid, since by not spoofing right off the bat they are calling my attention to the need for protective measures.

Maybe a no-go from the start will be enough to dissuade a percentage.
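
A "no-go from the start" for clients that announce themselves with a stock ID could look roughly like the sketch below. It is written in Python as a WSGI wrapper purely for illustration; on IIS the same test would live in a rewrite filter or in the page code. The blocklist is a hypothetical selection of names from the user agent report above, and whether each one deserves a ban is a judgement call.

```python
# Hypothetical blocklist, picked from the user agent report for illustration.
BLOCKED_TOKENS = (
    "wget",
    "teleport pro",
    "webstripper",
    "webcopier",
    "websnatcher",
    "sitesucker",
)


def is_blocked(user_agent):
    """True if the User-Agent header contains a blocklisted token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_TOKENS)


def block_scrapers(app):
    """Wrap a WSGI app so blocklisted user agents get a 403 instead of content."""
    def wrapped(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return wrapped
```

This only stops clients that keep their default user agent string, which is exactly the percentage that a flat refusal might dissuade. Anyone who fills in a browser-like ID passes straight through.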

flyerguy

2:40 pm on Jan 25, 2005 (gmt 0)

May I ask the purpose of strings such as:

User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows XP)
Disallow: /

Why would you want to ban IE browsers? Or is this only to prevent spidering by such agents? I'm confused.

bcolflesh

2:42 pm on Jan 25, 2005 (gmt 0)

...or is this only to prevent spidering by such agents?

Yes

flyerguy

3:07 pm on Jan 25, 2005 (gmt 0)

I currently have the following tags in my site's pages:

<meta content="ARCHIVE" name="ROBOTS">
<meta content="ALL,INDEX,FOLLOW" name="ROBOTS">

Which one will trump the other, the robots.txt or these tags? Should I remove the meta tag declarations altogether?

topr8

3:12 pm on Jan 25, 2005 (gmt 0)

Which one will trump the other, the robots.txt or these tags? Should I remove the meta tag declarations altogether?

The point is, as mentioned above, that most scrapers won't obey either robots.txt or robots meta tags, so in the context of this thread the question is irrelevant.

sharbel

1:43 pm on Jan 26, 2005 (gmt 0)

Requiring a cookie isn't going to stop anything. You can simply sniff the HTTP stream with a network analyzer to see what your browser sends back and forth when viewing the site, then write your scraper to request and provide the same info. If your site requires a cookie, the network analyzer will tell the scraper exactly which cookies to send.

encyclo

4:43 pm on Jan 26, 2005 (gmt 0)

<meta content="ARCHIVE" name="ROBOTS">
<meta content="ALL,INDEX,FOLLOW" name="ROBOTS">

In both cases these meta tags are specifying the default for all robots: in other words, if you want the bots to index, follow and archive, you don't need to specify it. Those tags are therefore not required at all.