Forum Moderators: open


Blocking all but Mozilla/G/Y/M

Is this too simple?

         

dataguy

12:44 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I operate a number of web directories and sites with a lot of original content, and I'm constantly banning user agents because it looks like they're scraping my pages rather than crawling them for legitimate use.

I'm wondering if I should be constantly monitoring for new user agents to ban, or if I should simply set my sites to allow only certain user agents.

Seems to me that if I only allowed user agents with strings that start with "Mozilla" or contain "msn", "google", "yahoo", or "mediabot", I would be pretty safe. (I don't use robots.txt; I send non-approved user agents to a 404 page.)

I don't want to block any legitimate users, but this seems like it would let 99% of them through. Am I missing something?
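For what it's worth, here's a minimal sketch of the idea in .htaccess, assuming Apache with mod_rewrite (the /404.html file name is just a placeholder for whatever error page you use):

    RewriteEngine On
    # Allow UAs that start with "Mozilla" or contain msn/google/yahoo/mediabot;
    # anything else is internally rewritten to the 404 page.
    RewriteCond %{HTTP_USER_AGENT} !^Mozilla [NC]
    RewriteCond %{HTTP_USER_AGENT} !(msn|google|yahoo|mediabot) [NC]
    RewriteRule !^404\.html$ /404.html [L]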

Glamba

11:32 pm on Feb 1, 2005 (gmt 0)

10+ Year Member



That's your policy decision.

Points to consider:

  1. What about agent-free requests? (See the sketch after this list.)
  2. Sift through your logs to see if there are any unusual agents (e.g. image search engines, text-only browsers, mobile access, proxy users, monitoring services ...) that you want to keep.
  3. Agents are easy to fake.
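
On point 1, a blank or missing User-Agent can be handled explicitly rather than left to fall through. A small mod_rewrite sketch, reusing the hypothetical /404.html page from the post above:

    RewriteEngine On
    # %{HTTP_USER_AGENT} expands to an empty string both when the header is
    # blank and when it is missing entirely, so this catches agent-free requests.
    RewriteCond %{HTTP_USER_AGENT} ^$
    RewriteRule !^404\.html$ /404.html [L]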

wilderness

11:50 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



dataguy,
Here's a link (thanks to Glenn):
[icabot.com...]

If you scroll down to where Mozilla begins, you'll find that even "staying current" with this list will take some time.

What's the quantity of Mozilla UAs compared to ALL the others?
I haven't a clue, nor is it something I'm inclined to spend my time on.

The most effective and least time-consuming plan is using a bot trap.
Making personal judgements about traffic patterns is something that hasn't made its way into the computer. At least not yet ;)
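
For anyone who hasn't built one, a rough sketch of the moving parts (the trap path and addresses below are made up):

    # robots.txt -- well-behaved crawlers are told to stay out of the trap:
    User-agent: *
    Disallow: /trap/

    # A link to /trap/ is hidden where no human will click it. Anything that
    # fetches it ignored robots.txt, so a small script behind /trap/ records
    # the visitor's IP and appends it to a deny list in .htaccess, e.g.:
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.15
    Deny from 198.51.100.0/24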

Don

wruppert

11:57 pm on Feb 1, 2005 (gmt 0)

10+ Year Member



You'll keep out plenty of good folks and fail to stop some of the worst. You should reconsider not using robots.txt. I use it to let the big guys go deep and the others go shallow (a sketch below).

Plenty of scrapers use fake IE6 user agents. I examine my logs and deny their IPs via .htaccess.
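
Roughly like this, with hypothetical paths and a couple of well-known crawler names standing in for my real setup:

    # robots.txt -- the big guys go deep, everyone else stays shallow:
    User-agent: Googlebot
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: *
    Disallow: /archive/
    Disallow: /listings/

    # .htaccess -- scrapers spotted in the logs get denied by address:
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.15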

dataguy

3:54 pm on Feb 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, I've been poring over my logs for the past two days, and I can't find any instance where this method would keep an actual visitor from seeing my web pages, except for a few mobile devices and one blank user agent (out of about 50,000 page views).

I do see instances where I'd be blocking some bots that have followed the rules in the past and that I wouldn't mind continuing. I can write exceptions for them, though it's a manual process, and by the time I've done it they may not come back.

Then there's the major flaw that it would be easy for a rogue programmer to fake a normal browser user agent string. I guess the only way to detect that is to count hits by IP address.
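
Something like this against the raw access log would at least surface the heavy hitters, assuming a standard common/combined log format where the client IP is the first field:

    # Top 20 requesting IPs, busiest first:
    awk '{print $1}' access_log | sort | uniq -c | sort -rn | head -20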

I should have mentioned that the reason I brought this up in the first place is that over the weekend I was hit by a bot from an "engineering company" that doesn't even have a search engine. Their explanation was that they're just looking for web pages that contain engineering information. While doing this, their spider brought one of my servers to a crawl, and it took me an hour to figure out the source of the problem.

BTW, their web site states that their spider doesn't obey robots.txt, only noindex. This means they still have to crawl each page to find the noindex....

wilderness

5:45 pm on Feb 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



bot from an "engineering company"

There's an old thread on this.