What's the Best Way to Keep All Spiders/Bots Out?
Only want a couple of pages crawled
I'm launching a new site that I don't want in any search engine (apart from a few static pages) - and I certainly don't want people running bots against it (as each page will have one/several BOSS API calls, at a cost to me).
I'm aware of several techniques such as:
But there's a very specific issue, people are likely to want to "crawl" the site by doing lots of search queries rather than link-based crawling. Another thing (that may help a bit) is that users are going to be UK only - but I might want to allow US,CA,IE,AU,NZ usage too (mainly so webmasters/press can try it and write about it).
Effectively, it's the same issue that Search Engines have and I'm unsure of the best way to deal with several things (such as IP addresses that have many simultaneous users - is AOL still like that?). I don't want to use ANY feature that could be seen as a privacy issue (so no cookie dropping without permission - although I'd be happy to use a short-lived server side identifier).
I'm happy to read as much as required (so links to older, but still valid, threads would be handy too).
I should add that hardware load balancers and firewalls are within the budget, so if they can help in this respect (I'm thinking of the ease of banning IP ranges), I'd be delighted to hear about your good experiences.
|is that users are going to be UK only - but I might want to allow US,CA,IE,AU,NZ usage too (mainly so webmasters/press can try it and write about it). |
I used something similar for more than ten years, based upon IP.
The most effective method is to "deny all", then "allow the UK ranges".
For the other country ranges, I would suggest monitoring your "raw logs" (after a while the 403's will jump out at you), and then making very small Class D exceptions, on the fly, for those you wish to grant access.
To allow the entire US and CAN IP's is far too extensive a task.
The Oceanic (AU and NZ) are easy enough to separate from the other APNIC IP's.
I've no experience focused upon Ireland's IP's, at least as an individual group, and thus I'm not aware of how extensive that category is.
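The "deny all, then allow the UK ranges" policy can be sketched in application code as well as in a firewall or .htaccess file. A minimal illustration using Python's standard `ipaddress` module; the two CIDR blocks are placeholders for illustration only, since a real UK list runs to thousands of entries:

```python
import ipaddress

# "Deny all, then allow the UK ranges": a request is rejected unless its
# source IP falls inside an explicit allow list. These two blocks are
# placeholders, not a vetted UK list.
UK_RANGES = [ipaddress.ip_network(n) for n in ("81.128.0.0/11", "86.128.0.0/10")]

def is_allowed(ip_string):
    """Return True only if the IP falls inside an allowed range."""
    ip = ipaddress.ip_address(ip_string)
    return any(ip in net for net in UK_RANGES)
```

The same structure extends naturally to the later suggestion of adding small, hand-vetted non-UK exceptions: they simply become extra entries in the allow list.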
Thanks, I think the /8 exceptions would work for an established site that wants to grant access for legitimate users overseas.
There are a few problems that I can see with that for us though:
1. The site will be brand new so no previous publicity; I think webmasters will be interested in it no matter where they are from - so I don't want to alienate people at a time when they might want to look at it in order to discuss/write about it.
2. There might (still to decide) be tools on the site designed for webmasters, these should appeal to all English language site owners even though our (non-webmaster) audience would be very much UK specific.
3. I hope that traffic will be enough to make manual scanning of logs a bit of an issue.
I suppose that I really need to decide on whether the webmaster tools will be part of it - if so I could do an alpha and beta stage that could be invite only (with people getting a number of invites each); that could serve as a source for IP exceptions (that would still be subject to other anti-bot measures).
|I think the /8 exceptions would work for an established site that wants to grant access for legitimate users overseas. |
IMO a flawed and very bad decision, at least using that as your sole criteria.
For others; /8 suggests allowing by Class A.
Told you I need help. I thought you meant allowing (first 3 * 8 bits of IP).anything(last 8) - which I mistakenly put as /8; it's a long time since I used such terminology and I should have checked. Are you suggesting something different?
Do you mean opening up single IPs? Do you think that allowing class C blocks allows people to crawl from lots of separate IP's in that range (or are there other reasons why that's a really poor decision?).
Also, do you find that people change IP often in the States?
Two tiers of blocking.
Most IPs are shut out at the gate. (I'm sorry, robot, but if your long-time goal is to end war and wipe out world hunger, don't put your base of operations at Hetzner.)
The question marks get redirected. For a while I had up to half a dozen ranges getting the "I don't like your face" page, which says (quote) "If you believe I have misjudged you, borrow a friend's computer and send me an e-mail explaining who you are and what you're doing." Here I guess it would have to be "a friend in another country". My version of the page has no links of any kind, but you could make a throwaway e-mail address and change it as needed.
First, what you're attempting is no simple matter.
1) deny all
2) allow UK IP ranges. There are some sites that will provide you with all the UK ranges (at least with some degree of accuracy); however, these ranges will NOT be a copy-and-paste solution. You'll still need to combine and condense IP's.
3) then go back and allow non-UK ranges based upon raw log activity. Be very careful when allowing these IP's: NOT broad /8 (Class A) ranges, but rather a restricted Class D range, e.g. one that allows only 0-32 in the final octet.
In some instances you may wish to open up a full Class C range; however, if your wish is to restrict access from the masses, you should be very careful with these ranges.
Each instance of an exception to allow an IP range would require an analysis and a decision (based upon raw log activity and research of the IP range), not something you can trust to software or automation.
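The "combine and condense IP's" step mentioned above can be automated; a sketch using Python's `ipaddress.collapse_addresses`, which merges adjacent or overlapping blocks into fewer, larger CIDR entries. The three /19 blocks below are illustrative, not taken from a real country list:

```python
import ipaddress

# Country lists often contain adjacent blocks that can be merged into fewer,
# larger CIDR entries before they go into a config file.
raw = ["212.56.96.0/19", "212.56.128.0/19", "212.56.160.0/19"]
merged = list(ipaddress.collapse_addresses(ipaddress.ip_network(n) for n in raw))
# The two aligned /19s (128.0 and 160.0) collapse into a single /18;
# the 96.0/19 block is not aligned with them and stays separate.
```

This only shortens the list; each entry still deserves the manual analysis described above before it goes into an allow list.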
|Do you mean opening up single IPs? |
|Also, do you find that people change IP often in the States? |
"Often"? Generally speaking, no.
The North American (at least within the US and CAN) IP ranges, even though dynamic, are fairly consistent (unless the user resets their modem daily). With the majority using broadband connections these days, their routers are left on 24/7, maintaining the same dynamic IP for months at a time.
In Europe the provided dynamic ranges are far broader in comparison and may vary across both Class A's and B's.
In addition, malicious visitors will come from a variety of IP's based upon proxies, server farms, and even more.
Wilderness, thanks for taking the time to explain all that. The more I look into this, the more of a minefield it appears to be.
For UK use, note that IPs can be anywhere from /11 down to /22 or even /23 for broadband/DSL providers. Allowing a full /16 is likely to end in a mix of UK, UA, RU, NL etc.
Some UK ranges should in any case not be permitted. The fact it is listed as UK does not mean it's broadband/DSL: it may be a server farm, which should normally be excluded.
Also, in some cases IP by country can be a bit hit/miss: the latest allocations are not always included in country lists that may only have been in use for a year or less - sometimes as long as two or three years.
It is also quite feasible for a bad bot to originate from a compromised computer (ie it's got a virus and become part of a botnet). These can be on popular broadband ranges more often than server ranges, but do not rule out the latter: there are a lot of compromised servers around at the moment.
My personal view is that you have to define a range of header field conditions (not exclusively User-Agents) and IP ranges that are acceptable. If you are restricting by country, that still does not mean you can ignore bad bots and scrapers. Remember that scrapers are often run by "seo" people who run from broadband lines, often in your own country!
|The North American (at least within the US and CAN) IP ranges even though dynamic are fairly consistent (unless the user resets their modem daily). With the majority using broad band connections these days their routers are left on 24/7 maintaining the same dynamic IP for months at a time. |
That's not what I'm seeing. A quick riffle through the logs shows my IP address changing every one to two weeks on no discernible schedule. (I never touch the modem, and the cat can't reach its power switch.) If you're on a pretty big ISP, the range can be substantial:
69.111.nnn (change after 9 days)
67.117.nnn (change after almost 4 weeks)
67.122.nnn (change after 6 days)
71.141.nnn (about two weeks so far)
That's on DSL. Similarly I've seen a local cable modem change from 74... to 75... (this is someone I know personally).
Thanks Lucy, that helps by effectively making it a certainty that certain things are out of the question. In the UK I had to choose a specific broadband provider to make sure I got a static IP.
|(such as IP addresses that have many simultaneous users - is AOL still like that?) |
AOL simultaneously distributes even a single user's hits across scores of its servers.
Vice-versa, too, I suppose, but I don't have enough simultaneous users from aol.com anymore to be sure.
I've had ATT/SBC for nearly ten years and have NEVER been assigned a PPPoX Pool - Rback range.
Perhaps your variation in dynamic assignments is related to LSA.
I monitored my websites visitors for more than ten years and then compared those IP's to email communications in order to confirm identities, making adjustments in the process. Also supported the efforts with tracerts.
Thus, I've sympathy for your indifference; however, what I previously provided are accurate statements.
You're just saying that so I'll go off in search of a place that can translate it into English :-P
All I know is that when Suddenlink took over from Cox, their connection speed promptly became so lousy that things actually improved when I switched to DSL to save money. It's slower now that I live three blocks further away (it's an Official Dividing Line, so I also pay less). But I don't have dates to go with the before-and-after IPs.
Is it possible you live in a more densely populated area than I do? My address range apparently covers everything from San Francisco on north. When I feed my IP into one of the free tools, they take a pin, stick it into the map at random and say "Here you are!"
Anyway, I just realized I have a much simpler test. I looked myself up on a forum where I've got moderator access (the "look up all IP addresses" kind), with posts going back to something like 2003. Turns out that even when I was on a cable modem, which is supposed to be fixed IP, I somehow racked up nine different IPs, although one of them accounted for about half of all posts. On DSL I've had almost 300 addresses over the past 5 years.
Also a handful calling themselves ppp. I have no idea what those are, except that they're in the same range as some of the dsl's.
###. There are addresses here I wouldn't even recognize as my own. These are just the DSLs, not the cable modem.
No matter what you do, if any link gets out, the bots will find it. And if found will crawl. Only way to deal with that is to ip ban (problematic) or password the product.
Securing the web is extremely difficult.
|people are likely to want to "crawl" the site by doing lots of search queries rather than link-based crawling. |
Separate issue... I have a few sites like this where I offer site-based search (no bots allowed, run my own index). Keeps it all "in-house".
tangor, it's the kind of measures that you'll put in place to stop abuse of the search you offer that's interesting to me.
The issue that I can see is people wanting to run automated queries that will cost me money and give a poorer experience to others using the site legitimately.
Have you considered the other approach - rather than banning bots, allow only humans?
Allowing a few queries then checking the user is a human with a captcha is a fairly common approach on webmaster-targeted sites (such as for whois, speedtest, dns test, geo ip, etc).
That way, robots don't get to see much, don't get to use the API calls.
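The "a few queries free, then a human check" approach can be sketched as a per-visitor counter over a rolling window. The threshold and window below are assumptions to tune; in production the counter would be keyed by IP or by a short-lived server-side session ID, and stored somewhere shared rather than in a module-level dict:

```python
import time

FREE_QUERIES = 5         # queries allowed before a human check (assumed threshold)
WINDOW_SECONDS = 3600    # rolling window length (assumed)

_counts = {}             # server-side only: {identifier: [query timestamps]}

def needs_captcha(identifier):
    """Return True once an identifier exceeds its free-query allowance."""
    now = time.time()
    recent = [t for t in _counts.get(identifier, []) if now - t < WINDOW_SECONDS]
    recent.append(now)
    _counts[identifier] = recent
    return len(recent) > FREE_QUERIES
```

Once `needs_captcha` flips to True, the search form serves a captcha instead of results, and a pass resets (or exempts) that identifier.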
If you want to be listed in search engines such as Google do not block US IPs.
|Allowing a few queries then checking the user is a human with a captcha is a fairly common approach on webmaster-targeted sites (such as for whois, speedtest, dns test, geo ip, etc). |
Indeed, this is likely to be the method for the areas that are aimed at webmasters. Unfortunately, the main part of the site (being aimed at average users with no huge carrot enticing them) would probably not support any, even small, hurdles.
|If you want to be listed in search engines such as Google do not block US IPs. |
That's part of the issue; I don't want any more than a couple of static pages to be listed in any search engine (and those will be treated differently) - the site behind the search interface is for humans only. I know that might sound strange, our other sites welcome search engine traffic, but the new site just does not suit being returned as a search result...
You might have already considered this but here goes:
Put the static pages and common files in / with relatively loose root-level .htaccess controls, then put the search interface (and any supporting files) in a subdir with tight dir-level .htaccess controls? Or alternatively, in a subdomain?
I do the latter with a search interface and its databases and symlink the subdomain's .htaccess to the main site's because I want common blocks. (The files could just as easily be distinct but that would require much more maintenance.) The subdomain search interface is public but the front-end 'form' requires the same-domain referrer to work. FWIW
I would use a combination of tricks, with honeybot to ban bots by IP.
Something I just thought of, is to have a anonymous login page, not a page that requires a password, but just test to see if they are a real user. If real, then allow that session to view the rest of the pages.
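That anonymous "are you a real user?" gate fits the short-lived server-side identifier mentioned earlier in the thread: after the check passes, issue a signed, expiring token that carries no personal data at all. A minimal sketch using an HMAC over nothing but an expiry timestamp:

```python
import hashlib
import hmac
import secrets
import time

SECRET = secrets.token_bytes(32)   # per-deployment server secret

def issue_token(ttl_seconds=1800):
    """Issue a short-lived anonymous token once the visitor passes the human check."""
    expires = str(int(time.time()) + ttl_seconds)
    sig = hmac.new(SECRET, expires.encode(), hashlib.sha256).hexdigest()
    return expires + ":" + sig

def token_valid(token):
    """Accept only unexpired tokens bearing a genuine signature."""
    try:
        expires, sig = token.split(":")
    except ValueError:
        return False
    good = hmac.new(SECRET, expires.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, good) and int(expires) > time.time()
```

Because the token contains only an expiry time and a signature, there is nothing in it to raise the privacy concerns mentioned at the start of the thread.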
RE hurdles and limiting to humans...
This whole thing is mostly way over my head, but you said you expect most navigation to be via search. Can you use something like hashcash (which requires a "proof of work" to test whether human or not - transparent to the user doing normal human things like typing and clicking, but blocks bots fairly effectively) and then any dynamic pages generated via search are simply not accessible except via search?
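A hashcash-style check is a proof-of-work: the client burns CPU finding a nonce whose hash meets a difficulty target, while the server verifies with a single cheap hash. A minimal sketch (the difficulty value is an assumption, tuned so a browser solves it in well under a second while bulk querying becomes expensive):

```python
import hashlib
from itertools import count

DIFFICULTY = 12  # leading zero bits required (assumed; ~4096 hashes on average)

def _leading_zero_bits(digest):
    bits = bin(int.from_bytes(digest, "big"))[2:].zfill(len(digest) * 8)
    return len(bits) - len(bits.lstrip("0"))

def verify(challenge, nonce):
    """Server side: one hash confirms the work was done."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return _leading_zero_bits(digest) >= DIFFICULTY

def solve(challenge):
    """Client side: brute-force a nonce that meets the difficulty target."""
    for nonce in count():
        if verify(challenge, nonce):
            return nonce
```

In practice the solve step would run in JavaScript in the visitor's browser against a server-issued, single-use challenge; only `verify` runs server-side.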
I like the idea of verifying human interaction... I guess that could be built into the ajax that is likely to be part of the interface anyway.
Even as I type I can see some very nice, neat, unobtrusive ways to validate that user actions have happened on a specific page ID - without privacy issues, as pages would be given random IDs not attached to anyone, and those IDs would be cheap/quick to store (in memory on the day, in a slower storage method beyond that). There will be ways to get around them (isn't there always?), but it's probably enough to make it less likely people will bother. Also, it would not need to break the ability of people to link to the results (bookmark or email); you just need to store that it was a valid, human-triggered search at one point.
I'm in the UK and I had a similar problem. I now have a .htaccess file several thousand lines long that I have built up over the last decade, which blocks IP addresses that bots have come from worldwide as well as /8 blocks on APNIC. Most of the blocks are outside the UK but there are certain universities and others in Britain that I've blocked for persistent problems. It doesn't seem to slow down performance one iota. I'll let you have a copy if you wish.
I've used honeypots in the past but every now and again a real search engine has strayed into it with all sorts of complicated results.
I pulled the UK IP's via GeoLite [maxmind.com] (downloaded their GeoLite CSV format, dated Aug 2011), then sorted by country and IP.
The entire set of UK IP's from the GeoLite file runs to more than 5500 lines.
It's possible that some of these 5500+ lines could be condensed; however, considering the UK has very small class ranges, I doubt the count would drop much.
There are some IP to Regex conversion tools online, however they do a bad job, adding unnecessary syntax.
I converted their first 28 lines to RegEx (I'm not about to go through 5500 lines of conversion just to be a nice guy or make a point).
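Rather than converting ranges to RegEx at all, the GeoLite start/end pairs can be turned directly into CIDR blocks, which Apache's `Deny from` / `Require ip` directives accept natively. A sketch using Python's `ipaddress.summarize_address_range` (the example ranges are arbitrary, not asserted to be real UK allocations):

```python
import ipaddress

def range_to_cidrs(start, end):
    """Turn a GeoLite-style start/end address pair into minimal CIDR blocks."""
    return [str(net) for net in ipaddress.summarize_address_range(
        ipaddress.ip_address(start), ipaddress.ip_address(end))]

# A range that happens to align to a power-of-two boundary collapses to a
# single block; misaligned ranges produce a short list of blocks instead.
```

Running each CSV row through this and collapsing the results gives a clean allow list with none of the "unnecessary syntax" the online regex converters add.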
There are some very good leads for me to follow up (some of which I'm delving into now and finding just how deep the rabbit hole is). Thanks to everyone who has posted.
But there are still issues to cover...
It seems to me that any counter-measure could be navigated by someone with enough resources (and hunger for the data) so I'm also going to put more thought into how to offer webmaster features that will reduce the desire to scrape the data.
Our problems are similar in nature to (but by no means the scale of) those that Google must deal with all the time - I can imagine people wanting to scrape data mainly to see if they appear in it (which is very similar to the way SERPs were/are targeted to monitor positions).
Does anyone have any experience of offering (less) data through an API or tool that has had a material effect on scraper volume? My guess is that some smarter people will take the API route, but many (probably easy-to-block) bots will still be let loose (by idiots) on sites from which they could legitimately get what they want.
An important point is that the data on the site will not be editorial in nature - it's data that will have snippets of textual data and have inferred data by the way it is presented (e.g. if there are 20 items that match a query and those 20 are from 2 different categories, there will be 2 rows of data shown which group data into the categories - although the categories might not be shown/disclosed in any way).
Am I going to be fighting an endless battle with bots or is there a way to satisfy those that have the ability (at a higher cost to them) to circumvent counter-measures?
|Am I going to be fighting an endless battle with bots or is there a way to satisfy those that have the ability (at a higher cost to them) to circumvent counter-measures? |
You're going in circles (chasing your own tail).
You've been provided multiple solutions by longtime participants in this forum, and yet, rather than tackling the work and solutions, you're still looking for a one-shot copy-and-paste solution where no such thing exists.
|The most effective method is to "deny all", then "allow the UK ranges". |
|then go back allow non-UK ranges based upon raw log activity. |
|Only way to deal with that is to ip ban |
|Put the static pages and common files in / with relatively loose root-level .htaccess controls, then put the search interface (and any supporting files) in a subdir with tight dir-level .htaccess controls? Or alternatively, in a subdomain? |
|I'm in the UK and I had a similar problem. I now have a .htaccess file several thousand lines long |