
Apache Web Server Forum

    
Don't waste time blocking bots, OPT-IN bots and control your content
blacklisting is a no-win endless game
incrediBILL
msg:1512513
9:03 pm on Mar 15, 2006 (gmt 0)

Here's an EXAMPLE of the approach I'm taking to allowing robots to access my server using the OPT-IN or whitelist method. Everyone else typically uses the blacklist approach, which is nuts: new bots appear every day, and many now generate random user-agent strings just to bypass your block lists. This method allows only the good bots and browsers, and it punts blank user-agent strings as well.

I wouldn't use the following AS-IS without a bit more work, especially if you have RSS feeds and such, since you'll need to allow those readers in. But it gives you an idea of how to approach this from the OPT-IN angle as opposed to the OPT-OUT blacklist route everyone currently uses, which is a massive waste of time spent monitoring and chasing new bots.

PROS: The list is short, executes fast, and is easy to maintain, needing only infrequent updates as opposed to a daily battle with all the new bots.

CONS: If one of the major engines changes its bot name you might bounce it, but so far this hasn't been a problem.

#allow just search engines we like, we're OPT-IN only

#a catch-all for Google
BrowserMatchNoCase Google good_pass

#a couple for Yahoo
BrowserMatchNoCase Slurp good_pass
BrowserMatchNoCase Yahoo-MMCrawler good_pass

#looks like all MSN bots start with msnbot or Sand
BrowserMatchNoCase ^msnbot good_pass
BrowserMatchNoCase SandCrawler good_pass

#don't forget ASK/Teoma
BrowserMatchNoCase Teoma good_pass
BrowserMatchNoCase Jeeves good_pass

#allow Firefox, MSIE, Opera etc., will punt Lynx, cell phones and PDAs, don't care
BrowserMatchNoCase ^Mozilla good_pass
BrowserMatchNoCase ^Opera good_pass

#Let just the good guys in, punt everyone else to the curb
#which includes blank user agents as well
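#(note: no space is allowed after the comma in the Order directive below)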
<Files *>
Order Deny,Allow
Deny from all
Allow from env=good_pass
</Files>

Other refinements that I'm using but didn't show, just to keep it simple here: after the first pass of allow/deny, filter anything that gets past starting with Mozilla for words like CRAWL, DOWNLOAD, or HTTRACK, and punt those downloaders to the curb as well.
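A rough sketch of that second pass, placed after the BrowserMatch lines above, might look like this -- untested, and the keyword list is just an example:

#second pass: anything that earned a pass by claiming Mozilla but
#contains downloader keywords loses it again
BrowserMatchNoCase (crawl|download|httrack) !good_pass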

Additionally, on my site anything identifying itself as Google, Yahoo, MSN, etc. is allowed by IP only, to stop user-agent spoofing.

Those ranges of IPs, as best I have them, are as follows (a sketch of putting them to use comes after the list):

MSN has blocks
64.4.0.0 - 64.4.63.255
207.68.128.0 - 207.68.207.255
65.52.0.0 - 65.55.255.255
207.46.0.0 - 207.46.255.255

Yahoo has blocks
66.228.160.0 - 66.228.191.255
66.196.64.0 - 66.196.127.255
68.142.192.0 - 68.142.255.255
72.30.0.0 - 72.30.255.255

Google has blocks
64.233.160.0 - 64.233.191.255
66.249.64.0 - 66.249.95.255
72.14.192.0 - 72.14.239.255
216.239.32.0 - 216.239.63.255

Gigablast has blocks
66.154.103.0 - 66.154.103.255
64.62.168.0 - 64.62.168.255
66.154.102.0 - 66.154.102.255

Teoma has blocks
65.214.44.0 - 65.214.47.255
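
To give an example of putting those ranges to work, a mod_rewrite sketch for the Google blocks might look something like this -- the regexes are only a rough translation of the ranges above, so verify them before relying on this:

#punt anything claiming to be Googlebot that comes from outside
#the Google ranges listed above
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^64\.233\.(1[6-8][0-9]|19[01])\.
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteCond %{REMOTE_ADDR} !^72\.14\.(19[2-9]|2[0-2][0-9]|23[0-9])\.
RewriteCond %{REMOTE_ADDR} !^216\.239\.(3[2-9]|[45][0-9]|6[0-3])\.
RewriteRule .* - [F]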

Remember, this is purely for informational purposes to aid you in your own quest to be bot-free; no claims are made for usability. Be careful not to block something you're currently relying on, so check your logs and web stats before doing anything rash. My site has been running like this for over 5 months now and my search engine rankings couldn't be better.

Lastly, your robots.txt file can carry minimal information for all bots and never be updated, unless you just want to coddle the other spiders that honor robots.txt; they'll never get through this firewall in the first place.

 

jdMorgan
msg:1512514
1:25 am on Mar 16, 2006 (gmt 0)

Much as we Webmasters ask that good robots fetch and obey robots.txt, we Webmasters should do all robots the courtesy of posting a valid robots.txt to politely Disallow those unwanted.

I agree with the premise of this thread, and use a whitelist/blacklist approach as well. Actually, there are several layers: IP blacklist, UA and IP whitelist, spoofed UA blacklist.

I should also note that since either SetEnvIf or mod_rewrite can set environment variables, those variables can also be used in Server-Side Includes, PHP, and Perl scripts for serving (or not serving) cookies or other data to robots.
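
As a rough sketch (the variable name and path here are placeholders only):

#set by mod_setenvif, as in the whitelist above
BrowserMatchNoCase ^Mozilla good_pass

#mod_rewrite can test the same variable...
RewriteEngine On
RewriteCond %{ENV:good_pass} !=1
RewriteRule ^feeds/ - [F]

#...and CGI/PHP/SSI scripts receive it as an ordinary environment
#variable, e.g. getenv("good_pass") in PHP or $ENV{'good_pass'} in Perl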

Jim

incrediBILL
msg:1512515
2:57 am on Mar 16, 2006 (gmt 0)

Webmasters should do all robots the courtesy of posting a valid robots.txt

Well, I'm split on this topic because of 2 things.

a) I didn't invite the sudden flood of bots, whether they honor robots.txt or not, and I don't feel obligated to keep up with them or clean up after them. For instance, NUTCH honors robots.txt, but everyone runs Nutch now, so if you allow Nutch, every nutjob who fires it up this morning gets in. Except I don't allow Nutch, so they all bounce off without warning. Too bad.

b) Telling the world via robots.txt which bots you allow is a HUGE hole in your blocking security that the scrapers can use to get past .htaccess for any bot you don't have redundantly protected with an IP filter.

If you feel you must use a robots.txt file I recommend something like this only:


User-agent: bad_bot_1
User-agent: bad_bot_2
User-agent: bad_bot_3
User-agent: bad_bot_4
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin/

This way only the blocked bots you feel obligated to be nice to are known; the allowed bots remain a secret that can't be divulged to scrapers and used against you.

Of course in the end they'll all just claim to be Firefox, Internet Explorer or Opera and then you have to use real-time profiling scripts to catch them and stop them, and a high volume of them are doing that already.

jdMorgan
msg:1512516
4:06 am on Mar 16, 2006 (gmt 0)

Not that I want to disagree with your premise, but *any* robots listed in robots.txt will 'give away' the ones you do and don't allow, whether you define the 'allowed' set or the 'disallowed' set. The two sets are mutually-exclusive, so defining either set also defines the other, at least insofar as identifying candidate UAs for spoofing.

The problem of spoofing exists whether or not either set is defined, and as you said, IP-address-range- and behaviourally-based methods can be used to protect against that.

I also agree that "Bots for hire" like Nutch, Larbin, etc. are a problem, especially since I was unable to convince Nutch that they should, under their licensing agreement, require that users specify the using organization and valid contact info in the User-agent string (This was in a thread a year or more ago in the search engine spiders forum).

There's no technical requirement to use robots.txt to 'politely' deny access to known-good-but-not-useful-to-the-specific-site robots, but I support the cooperative model of the Web -- to the extent possible. Firewall, IPtables, robots.txt, mod_access, mod_rewrite, Key_master script, xlcus/AlexK script -- All can and should play a part.

Jim

vortech
msg:1512517
4:42 am on Mar 16, 2006 (gmt 0)

This post caught my eye in recent posts, and even though I use IIS I am also going this route. I have a database of good-guy spider IPs from IPlists, but that list contains some partial IPs which I can't use in my code; my code looks for an exact match on the IP. Can anyone point me to a complete IP list of good-guy spiders that has every full IP in it? I wish I knew how to generate it from the partials. It's a long shot, but anybody?

thanks,

vortech

incrediBILL
msg:1512518
5:31 am on Mar 16, 2006 (gmt 0)

Vortech,

You'll need to change your code to match on network prefixes (a variable number of bits per IP) instead of exact addresses; otherwise you'll have one HUGE database for no particular reason, as Google, Yahoo, etc. use a boatload of IPs during a single crawl.
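
In Apache terms, for example, the Google ranges I listed above collapse to just a handful of prefix entries rather than one row per address -- the masks here are my own conversion, so double-check them:

#allow by network prefix instead of exact IP
<Files *>
Order Deny,Allow
Deny from all
Allow from 64.233.160.0/19
Allow from 66.249.64.0/19
Allow from 72.14.192.0/19
Allow from 72.14.224.0/20
Allow from 216.239.32.0/19
</Files>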

jdMorgan,

I agree anything in the robots.txt list gives clues, which is why mine is void of all bot names and I just bounce them; certainly don't list the allowed bots, just the disallowed ones. I too agree it all should play a part, but the scrapers have changed the rules, and robots.txt is kind of useless except for the handful of good crawlers left.

It's more or less an obsolete concept IMO, but I still put up a minimal file that looks something like this:

User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin/
Disallow: /images/

Those not allowed bounce off the hard way.

Call me a bad netizen if you want, but the level of scraper/crawler escalation drove me to the point I'm at today.

Pfui
msg:1512519
4:10 am on Mar 20, 2006 (gmt 0)

1.) I've been whitelisting since November and once I got things going, it's saved me a TON of obsessive-compulsive .htaccess tweaking time. That said, I'm still not totally happy with things -- next on my list, still, is figuring out how to throttle [webmasterworld.com] apparently real visitors ripping through everything as rapidly as the worst robots. But for those of us who don't need or want to kowtow to every search engine wannabe alpha-beta-whatever hogging the buffet table, the whitelist approach is great.

2.) Bill, Jim, Lurkers:

How do you handle what I call 'stealth' engines? I'm talking about the majors, overwhelmingly Google but also Microsoft and Jeeves and Yahoo/Inktomi, a.k.a. those SEs whose visits appear to be 'regular people' using 'regular browsers' but they're hailing from plain IPs, and they're hardly 'regular' anything.

For example, I require Googlebot to hail from .googlebot.com or it gets curbed. (My 'curb' is an info page on a separate IP explaining what's going on, logging IP info, and urging real people to touch base, because sometimes, they do get snagged.)
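
In Apache that check sketches out roughly like this (untested as written; my real setup redirects to the curb page rather than just returning a 403):

#anything claiming to be Googlebot must pass a double-reverse DNS
#check for googlebot.com; everything else falls through untouched
SetEnvIfNoCase User-Agent Googlebot claims_googlebot
<Files *>
Order Deny,Allow
Deny from env=claims_googlebot
Allow from googlebot.com
</Files>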

But lately, increasingly, Google is running all over the place, and often, and inhumanly quickly, AND without heeding robots.txt, and without any ID other than its IPs being irritatingly familiar to me. 'They' often take graphics, run JS and CGI scripts (Disallowed in robots.txt), snag a favicon.ico, the whole nine yards -- but rarely do 'they' follow redirects to the curb and arrive on the other server. (This latter aspect has been a real tell-tale sign of bot activity, at least for me.) And no Google employee has ever e-mailed me about getting back in...

3.) The UNusual Suspects

One stealth Google IP went totally nuts on 03-01 (see P.S.). Here are two more 'regular' ones:

64.233.172.18
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050225 Firefox/1.0.1
Date Page Status Referrer
03/05 14:30:49 /dir3/ 403 http://www.example.com/links.htm
03/05 14:30:51 /dir3/ 403 http://www.example.com/links.htm
03/05 14:30:52 /dir3/dir4/example.jpg 200 /dir3/
03/05 14:30:52 /favicon.ico 200 -
03/05 14:30:55 /dir3/file1.html 403 /dir3/
03/05 14:30:58 /dir3/file2.html 403 /dir3/file1.html
03/05 14:31:01 /dir3/file6.html 403 /dir3/file1.html
03/05 14:31:01 /dir3/file3.html 403 /dir3/file1.html
03/05 14:31:07 /dir3/file4.html 403 /dir3/file1.html
03/05 14:31:08 /dir3/file5.html 403 /dir3/file1.html

72.14.194.31
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1
Date Page Status Referrer
03/12 21:18:32 / 200 -
03/12 21:18:32 /cgi-bin/[counter script]
03/12 21:18:37 /dir1/file1.html 403 /
03/12 21:18:39 /dir1/file1.html 200 -
03/12 21:18:48 /dir1/file2.html 403 /dir/welcome.html
03/12 21:18:49 /dir1/file2.html 200 -
03/12 21:25:41 /dir1/file3.html 403 /dir/welcome.html
03/12 21:25:41 /dir1/file3.html 403 /dir/welcome.html
03/12 21:25:42 /dir1/file2.html 200 -

(As you can see, sometimes I'm all over the map with what to do with these guys, and where -- 200s, 302s, 403s -- but that's another, ongoing, tweak-fix topic:)

Other browser-running G 'visitors' have also hailed from the following IPs using a mix of older, new, and other UAs. From intermittent access_log checks:

64.233.160.136
64.233.172.2
64.233.173.4
64.233.172.21
64.233.173.73
64.233.173.77
64.233.173.100
64.233.173.124
64.233.178.136
66.102.6.136
72.14.192.14
72.14.194.18
72.14.194.21
72.14.194.29

Java/1.5.0_04
MovieTrack
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50215)
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) Gecko/20050414 Firefox/1.0.3
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5

4.) I don't think that just because something/someone hails from a Google or other major SE IP, 'they' should be able to go wherever they want to, particularly when their hit rates and referrers suggest crawling. Have you encountered these kinds of atypical Googlers? Would you leave the Welcome mat out for all Google IPs, despite the clearly and repeatedly suspect conduct?

...
P.S.
Yet Another unidentified Googler, this time gone totally goofy. I'm including the entire thing so you can get a clear idea of session times, hit rates, and redundancy! An employee? I don't think so...

64.233.173.73
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1
Date Page Status Referrer
03/01 20:59:28 /dir1/file33.html 403 /dir1/file1.html
03/01 20:59:28 /dir1/file33.html 403 /dir1/file1.html
03/01 20:59:28 /dir1/file33.html 403 /dir1/file1.html
03/01 20:59:28 /dir1/file33.html 403 /dir1/file1.html
03/01 21:05:03 /dir1/file32.html 403 /dir1/file1.html
03/01 21:05:03 /dir1/file32.html 403 /dir1/file1.html
03/01 21:05:03 /dir1/file32.html 403 /dir1/file1.html
03/01 21:05:03 /dir1/file32.html 403 /dir1/file1.html
03/01 21:05:03 /dir1/file32.html 403 /dir1/file1.html
03/01 21:05:03 /dir1/file32.html 403 /dir1/file1.html
03/01 21:13:37 /dir1/file23.html 403 /dir1/file1.html
03/01 21:13:37 /dir1/file23.html 403 /dir1/file1.html
03/01 21:13:37 /dir1/file23.html 403 /dir1/file1.html
03/01 21:13:37 /dir1/file23.html 403 /dir1/file1.html
03/01 21:13:37 /dir1/file23.html 403 /dir1/file1.html
03/01 21:13:37 /dir1/file23.html 403 /dir1/file1.html
03/01 21:13:37 /dir1/file23.html 403 /dir1/file1.html
03/01 21:15:20 /dir1/file21.html 403 /dir1/file1.html
03/01 21:15:20 /dir1/file21.html 403 /dir1/file1.html
03/01 21:15:20 /dir1/file21.html 403 /dir1/file1.html
03/01 21:15:20 /dir1/file21.html 403 /dir1/file1.html
03/01 21:15:20 /dir1/file21.html 403 /dir1/file1.html
03/01 21:15:20 /dir1/file21.html 403 /dir1/file1.html
03/01 21:15:20 /dir1/file21.html 403 /dir1/file1.html
03/01 21:15:20 /dir1/file21.html 403 /dir1/file1.html
03/01 21:15:22 /dir1/file20.html 403 /dir1/file1.html
03/01 21:15:22 /dir1/file20.html 403 /dir1/file1.html
03/01 21:15:22 /dir1/file20.html 403 /dir1/file1.html
03/01 21:15:22 /dir1/file20.html 403 /dir1/file1.html
03/01 21:15:22 /dir1/file20.html 403 /dir1/file1.html
03/01 21:15:22 /dir1/file20.html 403 /dir1/file1.html
03/01 21:15:22 /dir1/file20.html 403 /dir1/file1.html
03/01 21:15:22 /dir1/file20.html 403 /dir1/file1.html
03/01 21:15:25 /dir1/file18.html 403 /dir1/file1.html
03/01 21:15:25 /dir1/file18.html 403 /dir1/file1.html
03/01 21:15:25 /dir1/file18.html 403 /dir1/file1.html
03/01 21:15:25 /dir1/file18.html 403 /dir1/file1.html
03/01 21:15:25 /dir1/file18.html 403 /dir1/file1.html
03/01 21:15:25 /dir1/file18.html 403 /dir1/file1.html
03/01 21:15:25 /dir1/file18.html 403 /dir1/file1.html
03/01 21:15:25 /dir1/file18.html 403 /dir1/file1.html
03/01 21:15:28 /dir1/file17.html 403 /dir1/file1.html
03/01 21:15:28 /dir1/file17.html 403 /dir1/file1.html
03/01 21:15:28 /dir1/file17.html 403 /dir1/file1.html
03/01 21:15:28 /dir1/file17.html 403 /dir1/file1.html
03/01 21:15:28 /dir1/file17.html 403 /dir1/file1.html
03/01 21:15:28 /dir1/file17.html 403 /dir1/file1.html
03/01 21:15:28 /dir1/file17.html 403 /dir1/file1.html
03/01 21:15:28 /dir1/file17.html 403 /dir1/file1.html
03/01 21:15:29 /dir1/file16.html 403 /dir1/file1.html
03/01 21:15:29 /dir1/file16.html 403 /dir1/file1.html
03/01 21:15:29 /dir1/file16.html 403 /dir1/file1.html
03/01 21:15:29 /dir1/file16.html 403 /dir1/file1.html
03/01 21:15:29 /dir1/file16.html 403 /dir1/file1.html
03/01 21:15:29 /dir1/file16.html 403 /dir1/file1.html
03/01 21:15:29 /dir1/file16.html 403 /dir1/file1.html
03/01 21:15:29 /dir1/file16.html 403 /dir1/file1.html
03/01 21:15:30 /dir1/file15.html 403 /dir1/file1.html
03/01 21:15:30 /dir1/file15.html 403 /dir1/file1.html
03/01 21:15:30 /dir1/file15.html 403 /dir1/file1.html
03/01 21:15:30 /dir1/file15.html 403 /dir1/file1.html
03/01 21:15:30 /dir1/file15.html 403 /dir1/file1.html
03/01 21:15:30 /dir1/file15.html 403 /dir1/file1.html
03/01 21:15:30 /dir1/file15.html 403 /dir1/file1.html
03/01 21:15:30 /dir1/file15.html 403 /dir1/file1.html
03/01 21:15:35 /dir1/file14.html 403 /dir1/file1.html
03/01 21:15:35 /dir1/file14.html 403 /dir1/file1.html
03/01 21:15:35 /dir1/file14.html 403 /dir1/file1.html
03/01 21:15:35 /dir1/file14.html 403 /dir1/file1.html
03/01 21:15:36 /dir1/file14.html 403 /dir1/file1.html
03/01 21:15:36 /dir1/file14.html 403 /dir1/file1.html
03/01 21:15:36 /dir1/file14.html 403 /dir1/file1.html
03/01 21:15:36 /dir1/file14.html 403 /dir1/file1.html
03/01 21:15:38 /dir1/file11.html 403 /dir1/file1.html
03/01 21:15:38 /dir1/file11.html 403 /dir1/file1.html
03/01 21:15:38 /dir1/file11.html 403 /dir1/file1.html
03/01 21:15:38 /dir1/file11.html 403 /dir1/file1.html
03/01 21:15:38 /dir1/file11.html 403 /dir1/file1.html
03/01 21:15:38 /dir1/file11.html 403 /dir1/file1.html
03/01 21:15:38 /dir1/file11.html 403 /dir1/file1.html
03/01 21:15:38 /dir1/file11.html 403 /dir1/file1.html
03/01 21:15:43 /dir1/file10.html 403 /dir1/file1.html
03/01 21:15:43 /dir1/file10.html 403 /dir1/file1.html
03/01 21:15:43 /dir1/file10.html 403 /dir1/file1.html
03/01 21:15:43 /dir1/file10.html 403 /dir1/file1.html
03/01 21:15:43 /dir1/file10.html 403 /dir1/file1.html
03/01 21:15:43 /dir1/file10.html 403 /dir1/file1.html
03/01 21:15:43 /dir1/file10.html 403 /dir1/file1.html
03/01 21:15:43 /dir1/file10.html 403 /dir1/file1.html
03/01 21:15:47 /dir1/file09.html 403 /dir1/file1.html
03/01 21:15:47 /dir1/file09.html 403 /dir1/file1.html
03/01 21:15:47 /dir1/file09.html 403 /dir1/file1.html
03/01 21:15:47 /dir1/file09.html 403 /dir1/file1.html
03/01 21:15:47 /dir1/file09.html 403 /dir1/file1.html
03/01 21:15:47 /dir1/file09.html 403 /dir1/file1.html
03/01 21:15:47 /dir1/file09.html 403 /dir1/file1.html
03/01 21:15:47 /dir1/file09.html 403 /dir1/file1.html
03/01 21:15:50 /dir1/file08.html 403 /dir1/file1.html
03/01 21:15:50 /dir1/file08.html 403 /dir1/file1.html
03/01 21:15:50 /dir1/file08.html 403 /dir1/file1.html
03/01 21:15:50 /dir1/file08.html 403 /dir1/file1.html
03/01 21:15:50 /dir1/file08.html 403 /dir1/file1.html
03/01 21:15:50 /dir1/file08.html 403 /dir1/file1.html
03/01 21:15:50 /dir1/file08.html 403 /dir1/file1.html
03/01 21:15:50 /dir1/file08.html 403 /dir1/file1.html
03/01 21:19:17 /dir1/file45.html 403 /dir1/file1.html
03/01 21:19:17 /dir1/file45.html 403 /dir1/file1.html
03/01 21:19:17 /dir1/file45.html 403 /dir1/file1.html
03/01 21:19:17 /dir1/file45.html 403 /dir1/file1.html
03/01 21:19:17 /dir1/file45.html 403 /dir1/file1.html
03/01 21:19:17 /dir1/file45.html 403 /dir1/file1.html
03/01 21:19:17 /dir1/file45.html 403 /dir1/file1.html
03/01 21:19:17 /dir1/file45.html 403 /dir1/file1.html
03/01 21:19:17 /dir1/file45.html 403 /dir1/file1.html
03/01 21:19:17 /dir1/file45.html 403 /dir1/file1.html
03/01 21:24:41 /dir1/file34.html 403 /dir1/file1.html
03/01 21:24:41 /dir1/file34.html 403 /dir1/file1.html
03/01 21:24:41 /dir1/file34.html 403 /dir1/file1.html
03/01 21:24:41 /dir1/file34.html 403 /dir1/file1.html
03/01 21:24:41 /dir1/file34.html 403 /dir1/file1.html
03/01 21:24:41 /dir1/file34.html 403 /dir1/file1.html
03/01 21:24:41 /dir1/file34.html 403 /dir1/file1.html
03/01 21:24:41 /dir1/file34.html 403 /dir1/file1.html
03/01 21:24:41 /dir1/file34.html 403 /dir1/file1.html
03/01 21:24:41 /dir1/file34.html 403 /dir1/file1.html
03/01 21:24:42 /dir1/file34.html 403 /dir1/file1.html
03/01 21:31:17 /dir1/file40.html 403 /dir1/file1.html
03/01 21:31:17 /dir1/file40.html 403 /dir1/file1.html
03/01 21:31:17 /dir1/file40.html 403 /dir1/file1.html
03/01 21:31:17 /dir1/file40.html 403 /dir1/file1.html
03/01 21:31:17 /dir1/file40.html 403 /dir1/file1.html
03/01 21:31:17 /dir1/file40.html 403 /dir1/file1.html
03/01 21:31:17 /dir1/file40.html 403 /dir1/file1.html
03/01 21:31:17 /dir1/file40.html 403 /dir1/file1.html
03/01 21:31:17 /dir1/file40.html 403 /dir1/file1.html
03/01 21:31:17 /dir1/file40.html 403 /dir1/file1.html
03/01 21:31:17 /dir1/file40.html 403 /dir1/file1.html
03/01 21:31:17 /dir1/file40.html 403 /dir1/file1.html
03/01 21:31:17 /dir1/file40.html 403 /dir1/file1.html
###

incrediBILL
msg:1512520
10:24 pm on Mar 20, 2006 (gmt 0)

I don't throttle the major SEs and barely pay attention to them, but I've seen Google doing some strange hits which could be from that nasty Web Accelerator thingy, and I've also seen people going through Google for cell phones or PDAs; not sure exactly.

Also, don't forget that Google Labs may be playing with something that's running amok; you never know.

stressedoutStaff
msg:1512521
6:41 am on Mar 22, 2006 (gmt 0)

Hello all,

I'm a new user here. All the work Brett did for this site with regard to bots got a lot of notice all over the Internet, so I thought this would be a good place to ask for help, and incrediBILL really thinks like me: a whitelist-only method instead of hundreds of blacklisted bots. I sticky-mailed incrediBILL to ask how to allow only browsers and kick everything else to the curb, and thanks, incrediBILL, for sharing in this thread; that helped a lot.

The following day I used your code above, but I got an Internal Server Error. Two things could have caused this: I'm using a Sun RaQ server, which has some internal control factors built into it, and certain .htaccess entries make its web server check think the server isn't working; or it could have been the order of the .htaccess statements, since I already have quite a detailed .htaccess file with rewriting, hotlink-blocking rules and custom error pages, and maybe your block needs to be placed before one of them. I got around the error with a piece of an old block-bad-bots statement, since I figured the line the server was complaining about was
<Files *>
Order Deny, Allow
Deny from all
Allow from env=good_pass
</Files>

so I changed that block to this one:

<Limit GET POST PUT HEAD>
order deny,allow
deny from all
allow from env=good_pass
</Limit>

making this my final code:

BrowserMatchNoCase ^Mozilla good_pass
BrowserMatchNoCase ^Opera good_pass
<Limit GET POST PUT HEAD>
order deny,allow
deny from all
allow from env=good_pass
</Limit>

This allows only browsers with Mozilla and Opera user agents and kicks everything else, and it works. My big question is: is this as effective as yours, using

<Limit GET POST PUT HEAD>

instead of the

<Files *>

you used? I'm not an expert at .htaccess by any means, but I can follow the syntax; I just wonder whether the <Limit GET POST PUT HEAD> version is as strict as yours.

It works in both Internet Explorer and Mozilla Firefox, although I still get lj2569.inktomisearch.com, which I assume is Yahoo (or somebody spoofing a Yahoo robot) on a daily basis, and I'm still getting weird entries like

customer-reverse-entry.216.93.179.108
214.177.uio.satnet.net
ool-182d8d0c.dyn.optonline.net
maryk.me.vt.edu

most of which look spoofed, with IPs that don't exist. So I've added IP denying after your statement above to cut down the spoofers still using a Mozilla user agent, but I'm wondering if your rule can be narrowed down even more, from

BrowserMatchNoCase ^Mozilla good_pass (which allows any user agent that starts with Mozilla)

down to

BrowserMatchNoCase ^Mozilla/[4-5]\.0 good_pass (which would only allow Mozilla and IE user agents; I'm not sure if my syntax is correct)

That would cut out Mozilla 1, 2 and 3 and only allow 4.0 and 5.0, which is what most browser user agents start with, at least for IE and Mozilla. Maybe it could be narrowed down further with a Windows/Mac/Unix OS check, since I bet most of them differ after the "Mozilla/5.0 (" part, so that would cut down a lot more semi-spoofed agents. Ultimately I wish browsers had one string that identified strictly a browser and nothing else, since we don't care to see any bot ever again; we rely on word of mouth alone and ultimately want only real users on browsers allowed on.

If anybody can tell me whether <Limit GET POST PUT HEAD> is as good as the <Files *> incrediBILL is using, I'd appreciate it, since I have no doubt incrediBILL has the right idea versus trying to keep up with randomly changing bot names, and I sure don't want to resort to login-only methods just to keep all this junk at bay. I'll keep watching your posts, but if you can narrow down the Mozilla statement or have any advice on making this browser-only, I would really appreciate it. Great subject that could help a lot of us out.
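
P.S. To be concrete, the narrowed-down version I'm picturing would be something like this (untested, and the pattern may still need work):

#only user agents starting with Mozilla/4.0 or Mozilla/5.0, plus Opera
BrowserMatchNoCase ^Mozilla/[4-5]\.0 good_pass
BrowserMatchNoCase ^Opera good_pass
<Limit GET POST PUT HEAD>
order deny,allow
deny from all
allow from env=good_pass
</Limit>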

incrediBILL
msg:1512522
1:24 am on Apr 9, 2006 (gmt 0)

It could mostly be that I don't actually use .htaccess to block the bots; what I posted was a SAMPLE of how I would do it. Remember, I said a few times it was a SAMPLE ONLY, not to be used without testing ;)

I'm using a rather complex script behind the scenes that profiles the little beasties: logging all sorts of information about them, quarantining IP addresses dynamically, and banning them if they turn out to be repeat offenders rather than some one-shot DHCP address.

Just wanted to give people an idea of how to go about what I'm doing at the most basic level, so I hope this gives you a nice starting point.
