Forum Moderators: open
The reason to note this is that you do not want, say, Googlebot to crawl your entire site through a proxy and get hit with a duplicate content penalty, or have someone else earn money by inserting their own Google AdSense ads.
Make sure robots.txt only allows the bots you wish to crawl and index the website. I suggest only the top 3 or 4, which in my opinion are Google, Yahoo, MSN, and Ask Jeeves.
The following checks will also stop major search engines that are unknowingly crawling through a transparent proxy server, sparing the website duplicate content penalties as a side benefit.
(A) DNS check: look up the IP to get the hostname, then check the resolved hostname against the known patterns for the search engine in question. If they do not match, mark the IP as banned and return an appropriate message.
(B) Then do a lookup on the hostname to see if it resolves back to a list of IP addresses that contains the IP you started with.
Something to watch out for: some fake bots will have their IP address resolve to a hostname that is simply the IP address itself, and thus would pass the test, so this case must be explicitly tested for and bounced by default. For example, the IP "10.0.0.1" would resolve to the "hostname" "10.0.0.1".
MSN, Yahoo, Google, and Ask Jeeves all support this functionality currently; others may as well. The purpose of this check is to prevent others from spoofing well-known crawlers by setting up their DNS records to resolve their IPs to a well-known search engine hostname; since they do not control how that hostname resolves back to an IP, they will get caught by this check.
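Roughly, the double lookup looks like this (a Python sketch; the hostname suffixes are only my assumptions about each engine's crawler hosts, so check and adjust them for the bots you actually whitelist):

import ipaddress
import socket

# Assumed crawler hostname suffixes -- verify these for the engines you allow.
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com",
                    ".crawl.yahoo.net",
                    ".search.msn.com",
                    ".ask.com")

def verify_crawler_ip(ip):
    """True only if ip reverse-resolves to an allowed hostname AND that
    hostname forward-resolves back to the same ip."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # (A) reverse lookup
    except socket.herror:
        return False

    try:                                                      # bounce "hostnames"
        ipaddress.ip_address(hostname)                        # that are just an IP
        return False
    except ValueError:
        pass

    if not hostname.lower().endswith(ALLOWED_SUFFIXES):       # pattern check
        return False

    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname) # (B) forward lookup
    except socket.gaierror:
        return False

    return ip in forward_ips                                  # must round-trip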
One thing we do, at an early stage, is check against the IP blocks of known datacenters/hosting ranges/colos. From real-life experience, most of the scrapers come from those.
These most often are: ThePlanet (EV1), nLayer, SoftLayer, GNAC, Abovenet, ISPRIME, bluehost, NOC4HOSTS, SCHLUND (1&1), KEYWEB, OVH (the French have a sense of humor), and my recent favorite of them all, Netdirekt.
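A minimal sketch of that range check, assuming you keep the ranges as CIDR blocks (the two ranges below are placeholders, not real datacenter blocks):

import ipaddress

DATACENTER_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "192.0.2.0/24",        # placeholder ranges only -- substitute your own list
    "198.51.100.0/24",
)]

def from_datacenter(ip):
    """True if the visitor's IP falls inside a known hosting/datacenter range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)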
That cuts down on a lot of overhead. But then again, I remember someone (I think it was Martinibuster) mentioning something that stuck in my head: Give your site to scrapers, they provide backlinks....
We do, but it's custom.
There are also different kinds of scrapers. Chances are there are only a few (hundred) that are targeting your niche, so tracking where they scrape your stuff from and where they host it gets to be a lot of fun. OK, I'll stop with the ranting (Slayer stuff :) ), but more to come at a later date.
Blend27
Give your site to scrapers, they provide backlinks....
While Martini's thesis was correct at some level, it only covered a few scrapers: the "legit" scrapers trying to build actual resource sites, and they are few and far between. It's what we call a "BAD IDEA".
Giving your site to the scrapers normally results in them competing with you for your own content, trying to hijack your SERPs, or associating your site with bad keywords, nasty neighborhoods and worse.
When it comes to scrapers, just say 403.
The value of this header is usually a URL pointing to an xml file describing the supported features of the browser and phone.
Most major web browsers will send this header along with the request, which tells the web servers what the browser can accept. I have only listed the major browser providers, but this is usually safe for all known browsers except for a few mobile browsers, which is why the mobile browser checks are in place earlier.
1. Hosting providers exist that offer robust bot blocking as a value added feature or benefit of their hosting plans. (They needn't make it an across the board offer of all hosting plans.)
It seems to me that such an offer, "as a value added service" - a service that would also cut down on bandwidth and server load - would be attractive.
Of course, those hosting providers that host scrapers and bots likely wouldn't be first to offer the service. ;-P
2. Absent #1, an entity such as CPanel would offer bot identification and blocking software-as-a-service on a per-server basis, with routine automatic updates.
Under version #2, the service could offer opt-out functions, built into the bot blocking control panel, which would allow user control of IP range or block exclusions, etc.
The downside of bot blocking is the risk of blocking a friendly bot. IF the search engines are a contributing cause of bad bot activity, then one would think that the search engines, as an expression of their do-no-harm policy, would support efforts and initiatives to identify, track and block bad bots. At the very least, one would think that the search engines would provide the bot blocking services with a ready remedy to reverse the effects of any improvident blocking of their bots.
If the search engines aren't helping third-party software developers or hosting firms to identify, track and block bad bots what would be the reason(s)?
Number 3 doesn't look good to me; blocking everyone else in robots.txt does not help new search engines entering the market. From what I have seen, bad bots do not fetch robots.txt anyway. Blocking a bot that fetches robots.txt but then accesses a blocked page would be more reasonable.
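A rough sketch of that trap idea (the path name and in-memory ban list here are made up; the point is that anything requesting a URL your robots.txt disallows gets banned):

BANNED_IPS = set()                      # in memory here; use a file/DB in practice
TRAP_PREFIX = "/secret-trap/"           # hypothetical path, Disallowed in robots.txt

def allow_request(ip, path):
    """Return False if the request should be refused (403)."""
    if path.startswith(TRAP_PREFIX):
        BANNED_IPS.add(ip)              # it ignored the Disallow rule -> ban it
        return False
    return ip not in BANNED_IPS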
Even if you block a specific bot, it is trivial to crawl your site through one of the many public caching services.
4. ..... Generally I prefer to send the user a 403 status code with no further content, so as not to waste valuable bandwidth on bad bots, and not to supply the bot owners with information on how to sneak around the anti-scraping measures put in place on the website.
Yes, but the 403 itself may already be too much information for them. It may be more fun to send them a 200 with an empty page containing just a few blank characters, or a short piece of alternate content.
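Both approaches are only a few lines. A rough WSGI sketch, where is_bad_bot() is just a stand-in for whatever detection you already run:

KNOWN_BAD_AGENTS = {"BadBot/1.0"}       # hypothetical placeholder list

def is_bad_bot(environ):
    # stand-in for whatever detection you already run
    return environ.get("HTTP_USER_AGENT", "") in KNOWN_BAD_AGENTS

def application(environ, start_response):
    if is_bad_bot(environ):
        # Option A: a bare 403, no body to waste bandwidth on
        start_response("403 Forbidden", [("Content-Length", "0")])
        return [b""]
        # Option B instead: pretend all is well and serve a blank page
        # start_response("200 OK", [("Content-Type", "text/html")])
        # return [b" "]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html>the real page</html>"]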
I once even came across a public forum discussion where one scraper script kiddie complained that his scraping script might have a bug when trying to scrape www.<mysite>.com, because it was getting no content, and asked what to do about that bug.
Lol. That made my day, of course.
Kind regards,
R.
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: Teoma
Disallow:
Please note that bad bots will mostly ignore your robots.txt.
There are also lists of bad bot IPs; load them into your firewall using a shell script or a PHP script - [spamhaus.org...]
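As a sketch, loading such a list straight into iptables might look like this (it assumes you've already downloaded the list locally, one CIDR per line with ';' starting a comment, and that the script runs as root on a box with iptables):

import subprocess

def load_drop_list(path="drop.txt"):
    """Feed every CIDR range in the list to iptables as a DROP rule."""
    with open(path) as fh:
        for line in fh:
            cidr = line.split(";")[0].strip()   # strip trailing comments
            if not cidr:
                continue
            subprocess.run(["iptables", "-A", "INPUT",
                            "-s", cidr, "-j", "DROP"], check=True)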
Hope this helps someone.
I am very quiet about my security (and it's quiet to my human visitors), but one thing I don't allow is any open source spider. Don't get me wrong, I'm all for open source, but online, believe it or not, I'm mostly business minded. Commercial spiders are allowed on my site so long as they aren't redistributable in any way (open source or not).
Beyond that it only takes a little ingenuity to combat bad bots. It's not really that difficult and the logic behind it is a lot of fun. :)
- John
It's not really that difficult
Most of them aren't that difficult, but when you get into the wonderful world of commercial spybots it can become quite difficult, because they don't want to be found in the first place and have the money to hide quite effectively.
Imagine, if you would, a bot using MSIE's default user agent and 100 different IP addresses from various locations around the world: how easy would it be to spot even a sequential scan of your site that hops from service provider to service provider, or country to country?
Now imagine how easy that is to accomplish when you get your hands on a list of 6K open proxies operating from random locations around the world, which means I could probably scrape 100K pages from any web site without getting caught unless they block this proxy list.
OK, now imagine this proxy list isn't public and it's run and used by a private consortium of customers that need to operate without being detected...
If they read robots.txt, log the IP addresses and User-Agents, and whether, by the rules outlined in robots.txt, they should be banned or not. I usually assume anything that reads the robots.txt file is a bot, or someone snooping around who is up to no good.

Make sure robots.txt only allows the bots you wish to crawl and index the website. I suggest only the top 3 or 4, which in my opinion are Google, Yahoo, MSN, and Ask Jeeves.

I left out that I also have a dynamic robots.txt, similar to WebmasterWorld here, where only the bots I want are shown the version that doesn't block everything. So if they are not whitelisted to receive the unblocked robots.txt, the next file they take will get them blocked by default.
As for new search engines, I stopped being an early adopter when I had to pay the high bandwidth bill every month that allowing all the new spiders free access caused. I like to have some profits, and eat nicely, thank you very much.
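For what it's worth, a bare-bones sketch of that dynamic robots.txt idea might look like the following; the whitelist test is passed in as a callable (for example the DNS round-trip check sketched earlier in the thread), and everything about the storage is simplified:

OPEN_ROBOTS = (            # the permissive/whitelist version shown earlier in the thread
    "User-agent: Googlebot\n"
    "Disallow:\n"
)
CLOSED_ROBOTS = (          # what everyone else sees
    "User-agent: *\n"
    "Disallow: /\n"
)

robots_log = []            # IPs that fetched robots.txt -- almost always bots

def robots_txt(ip, is_verified_crawler):
    """Return the robots.txt body to serve this visitor."""
    robots_log.append(ip)                 # keep a record of who fetched it
    if is_verified_crawler(ip):           # whitelisted, verified search bot
        return OPEN_ROBOTS
    return CLOSED_ROBOTS                  # everyone else: everything disallowed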
Now imagine how easy that is to accomplish when you get your hands on a list of 6K open proxies operating from random locations around the world which means I could probably scrape 100K pages from any web site without getting caught unless they block this proxy list.

OK, now imagine this proxy list isn't public and it's run and used by a private consortium of customers that need to operate without being detected...
Unique concept!
Imagine how the scheming, collective group above (as well as many of those within "the www" who banter about the theme of "free access" or "public domain") perceive a collective group of webmasters at SSID discussing limiting the access of possible infractors? ;)
Saw where some universities were taking part in a public-domain archive of their libraries through Archive.org (I believe), rather than Google, because Google presented too many possibilities for future access restrictions and future paid (subscription) access.
On a side note, I always find it fun to feed known scrapers either mashed-up copy, or text that's completely reversed so it's backwards and mixed up, and/or to feed them their own tail through a proxy. Yes, it's mean, but so is using my content to display ads and trashing our rankings.
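A toy sketch of that sort of mangling (nothing fancy, just reverse each word and shuffle the order before the page goes out to a known scraper IP):

import random

def mangle(text):
    """Reverse each word and shuffle the order -- garbage for the scraper."""
    words = [word[::-1] for word in text.split()]
    random.shuffle(words)
    return " ".join(words)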
We recently signed up with HackerSafe, and one of the things that threw a flag in their system was disclosing ANY directories within robots.txt.
-- bot blockers then the bots would easily work around them. It is possible to make bots unidentifiable from normal visitors. --
But 99% of them (the scrapers) have no clue how to do it. I've compiled a long list of hosting/datacenter ranges over the past couple of years. Anything that comes from those gets an automatic boot. The ones that do know what they are doing would most likely rent a server and a big PIPE.
The rest get caught in later subroutines/logic, plus random spider traps.
It is possible to make bots unidentifiable from normal visitors.
They can try, but even trying to look 100% human they often fail.
The problem they have is looking too human slows the scraping process down and uses a significant amount of bandwidth just to hide their activities.
It's all a cat and mouse game; I've automatically snared some impressive bots, but they still hit bot traps and can't respond to unknown situations, which makes them vulnerable.
Any thoughts on using an IP list such as the one at [iplists.com...] to cloak a sitemap? I have noticed that most scrapers are using my sitemaps in their scripts. If I disallow all except known spider IPs from viewing the sitemaps, might this help?
Restricting access to your sitemap file is a good idea to stop unwanted bots from reading it. Whether you use the lists of IPs offered at that site is your choice. I think what you are asking is whether the public lists provided are good enough to build your sitemap access list with. Personally, I would only use those lists as a small part of the solution, not the whole of it. I use a bunch of smaller tests to validate the allowed bots, so as to weed out the fake bots without having to maintain lists of IPs for each bot.
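As a sketch, gating the sitemap can be as simple as this, again taking a crawler verification routine as a callable (for example the DNS round-trip check from earlier); anything that doesn't pass gets a bare 403:

def serve_sitemap(ip, is_verified_crawler):
    """Return (status, body) -- the real sitemap only for verified crawlers."""
    if is_verified_crawler(ip):
        with open("sitemap.xml", "rb") as fh:
            return 200, fh.read()
    return 403, b""                       # everyone else gets nothing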
Thanks, I had come to the conclusion today that I would run a risk relying on public IP lists, and had found a few pages about spider traps and such that I thought might be a better solution. Also, as you say, keeping any kind of IP list updated could be difficult.

Is it these kinds of spider traps that you use? I saw a few, such as a little test to see if the bot followed the robots.txt rules; that sounded like a good idea. Do you do stuff like that?
Yes, I use a custom-written module for Asp.Net which allows me to filter out the unwanted traffic by following the steps I outlined in my posts here, and then some.