Brett's measures to prevent rogue bots


kapow

7:20 pm on Jan 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



WebmasterWorld was out of Google in December because of Brett's measures to prevent a plague of rogue bots. I see WebmasterWorld is back in Google now. I've been trying to find out what Brett's solution was/is. I tried looking through this thread: Attack of the Robots, Spiders, Crawlers.etc: [webmasterworld.com...]
and can only find a note about 'building a white list of ips'. Is that the solution? Is there more to it? Can anyone update me, as this issue drives me crazy too.

jecasc

8:54 pm on Jan 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As far as I remember, the solution had to do with cloaking robots.txt.

jatar_k

8:58 pm on Jan 20, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



no it didn't actually

'building a white list of ips.'

that's pretty much the gist of it

another option that people have gone with is something like this:
Blocking Badly Behaved Bots [webmasterworld.com]

kapow

8:18 pm on Jan 21, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Jatar_k

Please excuse my lack of understanding; I'm not a programmer. However, when I find the right solution I will ask my programmers to implement it.

If I understand right, this thread: [webmasterworld.com...]
- is a system of banning based on frequency of requests over time,
- includes whitelisted IPs, e.g. for SEs,
- is mainly for sites with bandwidth issues.
I'm trying to decide if this is the solution I need.
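
To check my understanding, here is a rough PHP sketch of that kind of frequency-based ban with a whitelist. The thresholds, file paths and IP prefix are placeholders I've made up, not anything from the actual script in that thread:

<?php
// Sketch of frequency-based banning with an IP whitelist.
// All file names, prefixes and thresholds below are placeholders.

$ip = $_SERVER['REMOTE_ADDR'];

// Whitelisted IPs (e.g. known search-engine crawler ranges) are never banned.
$whitelist = array('66.249.');          // example prefix only
foreach ($whitelist as $prefix) {
    if (strpos($ip, $prefix) === 0) {
        return;                         // let it through, no counting
    }
}

$window  = 10;                          // seconds per window
$maxHits = 20;                          // max requests allowed per window
$file    = '/tmp/hits_' . md5($ip);     // per-IP timestamp log

// Keep only the hits inside the current window, then record this one.
$recent = array();
if (is_readable($file)) {
    foreach (file($file) as $t) {
        if ((int)$t > time() - $window) {
            $recent[] = (int)$t;
        }
    }
}
$recent[] = time();
file_put_contents($file, implode("\n", $recent));

// Too many requests too fast: refuse and stop.
if (count($recent) > $maxHits) {
    header('HTTP/1.1 403 Forbidden');
    exit('Request rate exceeded.');
}
?>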

We manage approx 100 sites (e.g. business brochure type sites), mostly 20-200 pages, averaging 1000 page views per month. Bandwidth is not a big problem for us. I just hate the rogue bots because they:
- mess up the stats,
- steal content for scraper sites,
- probe for vulnerabilities,
- ...and probably do other things I am not aware of.
I am happy to allow any SE bot. I just want to ban the scrapers, vulnerability-seekers, and any non-SE bots.

Does this script do what I'm looking for? [webmasterworld.com...]
Or would a honeypot approach be better? E.g.:
A typical good bot obeys robots.txt, while a bad bot has no reason to obey robots.txt and may even see robots.txt as a signpost to the good stuff. Would the following be a more appropriate system for me?
- In robots.txt, something like: Disallow all from folder X.
- Then on every page, a hidden link to X/index.html.
- On X/index.html, put a meta robots exclusion tag AND a hidden link to X/notallowed.htm.
- Ban anything that arrives at X/notallowed.htm (rough sketch of that step below).
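
Here is a rough PHP sketch of that last step, assuming the trap page is a script (e.g. notallowed.php) rather than a static .htm file; all the paths are placeholders:

<?php
// X/notallowed.php - sketch of the trap endpoint (placeholder paths).
//
// robots.txt would carry something like:
//   User-agent: *
//   Disallow: /X/
// Anything that ignores that and follows the hidden links lands here.

$ip = $_SERVER['REMOTE_ADDR'];
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Append the offender to the ban list that every other page checks.
$fh = fopen('/path/to/banned_ips.txt', 'a');    // placeholder path
if ($fh) {
    fwrite($fh, $ip . "\n");
    fclose($fh);
}

// Keep a log line for cleaning up the stats later (placeholder path).
error_log(date('Y-m-d H:i:s') . "\t" . $ip . "\t" . $ua . "\n",
          3, '/path/to/honeypot.log');

header('HTTP/1.1 403 Forbidden');
exit('Access denied.');
?>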

balam

3:04 am on Jan 22, 2006 (gmt 0)

10+ Year Member



> I'm trying to decide if this is the solution I need.

I'm sure that many who are concerned about rogue visitors would agree that a multi-pronged approach is needed (or at least desired). A honeypot, on its own, isn't enough to catch all the malicious visitors; neither is using .htaccess (on Apache web servers), nor using a Perl program [webmasterworld.com] or PHP script to catch them. But combined together, they make a much stronger defence.
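
For example, the honeypot's ban list only bites if every page actually consults it. A minimal PHP sketch of that per-page check, assuming the trap writes one banned IP per line to a file (the path is a placeholder):

<?php
// Include at the top of every page: deny requests from banned IPs.
// The ban list is the file written by the honeypot trap (placeholder path).
$banFile = '/path/to/banned_ips.txt';
if (is_readable($banFile)) {
    foreach (file($banFile) as $banned) {
        if (trim($banned) === $_SERVER['REMOTE_ADDR']) {
            header('HTTP/1.1 403 Forbidden');
            exit('Access denied.');
        }
    }
}
?>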

kapow

12:31 pm on Jan 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



After reviewing that loooong thread, all I could see from Brett's comments was a preference for an IP white list (lots of other solutions were discussed, of course). However, Brett doesn't seem to be using an IP white list; I removed my cookies and visited WebmasterWorld using another ISP, and I was able to access WebmasterWorld threads OK without my login (as Googlebot is clearly doing too, but wasn't a couple of months ago). Have I missed something? I really wanted to know Brett's solution. Or is it that Brett isn't telling? I would understand if that is the case (and find my own way).

Hi Balam, Yes, that Perl program (or PHP script) is the kind of honeypot I mean.
Honeypot:

A note for new users: Install the robots.txt exclusion described above several days (even a week) before "going live" with the script. Many legitimate robots don't read a new copy of robots.txt every time they access your site; give them some time to find out that they shouldn't swallow the bait.
Yes, I would probably do this. However, would my suggestion of a 2-step process with <meta name="robots" content="noindex,nofollow"> on each page help? I.e. the spider may not have seen the new robots.txt, but if it's following links to get to the '2nd-step banning page' it will have hit a previous page with an anti-spider meta tag. It's the major SE spiders that I don't want to upset, and they should obey that tag. A rough example of that first-step page is below.
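
Something like this is what I have in mind for X/index.html (all names are placeholders, and the trap link assumes the trap is a script, e.g. notallowed.php, rather than a static .htm):

<!-- Sketch of X/index.html, the first-step page. -->
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<title>Nothing to index</title>
</head>
<body>
<!-- Hidden from humans; a good spider stops at the meta tag above,
     a bad bot follows the link into the trap. -->
<a href="notallowed.php" style="display:none">not for robots</a>
</body>
</html>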

What kind of automated bad bots would escape such a honeypot? If just a few escape it, that's fine; if I can stop the majority I will be happy (and consider further tactics later).