'Building a whitelist of IPs.'
That's pretty much the gist of it.
Another option that people have gone with is something like this:
Blocking Badly Behaved Bots [webmasterworld.com]
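As a concrete sketch of the whitelist idea (my own illustration, not taken from the linked thread): the usual way to whitelist search-engine bots safely is a double DNS check, since user-agent strings and even reverse-DNS records can be faked on their own. Something like this in PHP, where the trusted host suffixes are assumptions and should be checked against each engine's own documentation:

<?php
// Verify a visitor claiming to be a search-engine bot:
// reverse lookup (IP -> hostname), check the hostname is under a
// trusted domain, then forward lookup (hostname -> IP) to confirm.
function is_genuine_se_bot($ip) {
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false; // no usable PTR record, so not verifiable
    }
    // Assumed suffixes for illustration only.
    $trusted = array('.googlebot.com', '.google.com', '.search.msn.com');
    $ok = false;
    foreach ($trusted as $suffix) {
        if (substr($host, -strlen($suffix)) === $suffix) {
            $ok = true;
            break;
        }
    }
    // Forward-confirm: the hostname must resolve back to the same IP.
    return $ok && gethostbyname($host) === $ip;
}
?>

IPs that pass can be added to the permanent whitelist, so the lookups only need to run once per new address.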
Please excuse my lack of understanding; I'm not a programmer. However, when I find the right solution I will ask my programmers to implement it.
If I understand that thread correctly: [webmasterworld.com...]
- It's a system that bans based on request frequency over time (sketched below).
- It includes whitelisted IPs, e.g. for search engines (SEs).
- It's mainly for sites with bandwidth issues.
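For reference, the core of such a frequency-based system can be quite small. A rough PHP sketch, where the 60-second window, 30-request threshold, and file locations are all assumptions rather than figures from that thread:

<?php
// Count this visitor's requests in a sliding window; refuse service
// above the threshold. Whitelisted (e.g. verified SE) IPs are skipped.
$ip        = $_SERVER['REMOTE_ADDR'];
$window    = 60;  // seconds
$threshold = 30;  // max requests per window
$whitelist = array(); // verified SE IPs would go here

if (!in_array($ip, $whitelist)) {
    $file = sys_get_temp_dir() . '/hits_' . md5($ip);
    $hits = is_file($file) ? unserialize(file_get_contents($file)) : array();
    $now  = time();
    $hits = array_filter($hits, function ($t) use ($now, $window) {
        return $t > $now - $window; // keep only hits inside the window
    });
    $hits[] = $now;
    file_put_contents($file, serialize($hits), LOCK_EX);
    if (count($hits) > $threshold) {
        header('HTTP/1.1 403 Forbidden');
        exit('Too many requests.');
    }
}
?>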
I'm trying to decide if this is the solution I need.
We manage approximately 100 sites (mostly business brochure-type sites of 20-200 pages, averaging 1,000 page views per month). Bandwidth is not a big problem for us. I just hate rogue bots because they:
- mess up the stats,
- steal content for scraper sites,
- probe for vulnerabilities,
- ...and probably do other things I'm not aware of.
I am happy to allow any SE bot. I just want to ban the scrapers, vulnerability-seekers, and any non-SE bots.
Does this script do what I'm looking for? [webmasterworld.com...]
Or would a honeypot approach be better? A typical good bot obeys robots.txt, while a bad bot has no reason to obey it and may even treat it as a signpost to the good stuff. Would the following be a more appropriate system for me?
- In robots.txt, something like: Disallow all from folder X.
- Then, on every page, a hidden link to X/index.html.
- On X/index.html, put a meta robots exclusion tag AND a hidden link to X/notallowed.htm.
- Ban anything that arrives at X/notallowed.htm (a sketch of this setup follows the list).
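To make the four steps concrete, here is a minimal sketch, assuming Apache with PHP; the folder name "bot-trap" stands in for X, and all paths are placeholders. In robots.txt:

User-agent: *
Disallow: /bot-trap/

On every page, a link invisible to humans but followable by bots:

<a href="/bot-trap/index.html" style="display:none">&nbsp;</a>

And the final banning page (a PHP stand-in for X/notallowed.htm) that records the visitor and shuts the door:

<?php
// Anything reaching this page has ignored robots.txt, the meta tag,
// and a hidden link: log it and append a deny rule to .htaccess.
// Test carefully before letting a script write to a live .htaccess.
$ip = $_SERVER['REMOTE_ADDR'];
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-';
file_put_contents('/path/to/badbots.log',
    date('c') . ' ' . $ip . ' ' . $ua . "\n", FILE_APPEND | LOCK_EX);
file_put_contents('/path/to/.htaccess',
    'Deny from ' . $ip . "\n", FILE_APPEND | LOCK_EX);
header('HTTP/1.1 403 Forbidden');
exit('Forbidden.');
?>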
I'm sure that many who are concerned about rogue visitors would agree that a multi-pronged approach is needed (or at least desirable). A honeypot on its own isn't enough to catch all the malicious visitors; neither is .htaccess (on Apache web servers) alone, nor a Perl program [webmasterworld.com] or PHP script to catch them. But combined, they make a much stronger defence.
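For the .htaccess prong, the rules involved look something like this (Apache 2.2-style syntax; the addresses and user-agent pattern are placeholders):

# Deny IPs collected by the honeypot
Order Allow,Deny
Allow from all
Deny from 203.0.113.45
Deny from 198.51.100.0/24

# Refuse an obviously bad user-agent outright
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} badbot [NC]
RewriteRule .* - [F]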
Hi Balam. Yes, that Perl program (or PHP script) is the kind of honeypot I mean.
Honeypot:
A note for new users: install the robots.txt exclusion described above several days (even a week) before "going live" with the script. Many legitimate robots don't re-read robots.txt on every visit to your site; give them some time to learn that they shouldn't swallow the bait.

Yes, I would probably do this. However, would my suggested two-step process, with <meta name="robots" content="noindex,nofollow"> on each page, help? I.e. a spider may not have seen the new robots.txt, but if it is following links to reach the second-step banning page, it will already have hit a page carrying the anti-spider meta tag (see the sketch below). It's the major SE spiders that I don't want to upset, and they should obey that tag.
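As a sketch of that two-step idea, the first trap page (X/index.html) might look like this, so a compliant spider that missed the updated robots.txt still gets an exclusion instruction before it ever reaches the banning page:

<html>
<head>
  <!-- belt and braces: compliant spiders stop here -->
  <meta name="robots" content="noindex,nofollow">
  <title>Nothing to see here</title>
</head>
<body>
  <!-- only a bot ignoring robots.txt AND the meta tag follows this -->
  <a href="notallowed.htm" style="display:none">&nbsp;</a>
</body>
</html>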
What kind of automated bad bots would escape such a honeypot? If just a few escape it, that's fine; if I can stop the majority I will be happy (and consider further tactics later).