I'm wondering if I should be constantly monitoring for new user agents to ban, or if I should simply set my sites to only allow certain user agents.
Seems to me that if I only allow user agents with strings that start with "Mozilla" or contain "msn", "google", "yahoo", or "mediabot", I would be pretty safe. (I don't use robots.txt; I send non-approved user agents to a 404 page.)
I don't want to block any legitimate users, but this seems like it would let 99% of them through. Am I missing something?
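To make it concrete, here's roughly the check I have in mind. Just a sketch in Python; the token list and the 404 handling are placeholders for whatever your server actually does:

ALLOWED_TOKENS = ("msn", "google", "yahoo", "mediabot")  # illustrative list only

def is_allowed(user_agent: str) -> bool:
    # Allow anything that starts with "Mozilla" or contains one of the tokens.
    if user_agent.startswith("Mozilla"):
        return True
    ua = user_agent.lower()
    return any(token in ua for token in ALLOWED_TOKENS)

# In the request handler, a non-approved UA would get the 404 treatment, e.g.:
# if not is_allowed(request.headers.get("User-Agent", "")):
#     return not_found_response()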
If you scroll down a user-agent list to where the Mozilla entries begin, you'll find that even "staying current" with that list will take some time.
What's the quantity of Mozilla UAs compared to ALL the others? I haven't a clue, nor is it something I'm inclined to spend my time on.
The most effective and least time consuming plan is using a bot trap.
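For illustration, a trap can be as simple as a hidden URL that no human visitor ever requests; anything that hits it gets its IP written to a blocklist. A bare-bones sketch, assuming a Python/WSGI front end; the path and filename are made up:

BLOCKLIST = "/var/tmp/bot_blocklist.txt"   # placeholder location
TRAP_PATH = "/dont-follow-this-link/"      # hidden link, disallowed for humans by design

def load_blocklist():
    try:
        with open(BLOCKLIST) as f:
            return set(line.strip() for line in f)
    except FileNotFoundError:
        return set()

def application(environ, start_response):
    ip = environ.get("REMOTE_ADDR", "")
    # Anything requesting the trap URL gets recorded.
    if environ.get("PATH_INFO") == TRAP_PATH:
        with open(BLOCKLIST, "a") as f:
            f.write(ip + "\n")
    # Previously trapped IPs are refused.
    if ip in load_blocklist():
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]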
Making personal judgements on traffic patterns is something that has not made its way into the computer. At least as of yet ;)
Don
I do see instances where I would be blocking bots that have followed the rules in the past and that I wouldn't mind continuing to allow. I can write exceptions for them, though it's a manual process, and by the time I've done this they may not come back.
Then there is the major flaw that it would be easy for a rogue programmer to fake a normal browser user agent string. The only way to detect that is to count hits by IP address, I guess.
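Counting by IP is easy enough to script against the access log. Something like this (the log path and threshold are just examples) would flag the heaviest clients, assuming combined log format where the client IP is the first field:

from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"   # placeholder
THRESHOLD = 1000                           # pick your own idea of "suspicious"

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        ip = line.split(" ", 1)[0]
        counts[ip] += 1

for ip, hits in counts.most_common():
    if hits < THRESHOLD:
        break
    print(f"{ip}\t{hits}")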
I should have mentioned that the reason I'm bringing this up in the first place is that over the weekend I was hit by a bot from an "engineering company" that doesn't even have a search engine. Their explanation was that they are just looking for web pages that contain engineering information. While doing this, their spider brought one of my servers to a crawl, and it took me an hour to figure out what the source of the problem was. BTW, their web site states that their spider doesn't obey robots.txt, only noindex. This means that they would still have to crawl each page to find the noindex....