Forum Moderators: phranque


Trap/Favicon question


Joe Belmaati

1:42 am on Nov 1, 2004 (gmt 0)

10+ Year Member



It seems that the Safari browser under Mac OS X sends requests for favicon.ico files with an empty User-Agent string. As I have my .htaccess file set up to get rid of visitors with empty User-Agent strings, I wanted to exclude requests for favicon and robots.txt files from being banned. Can anyone look over this code and let me know whether it achieves what I am trying to do?

# Forbid if blank (or "-") Referer *and* UA, except for HEAD requests from caching proxies (such as AOL)
RewriteCond %{REQUEST_METHOD}!^HEAD$
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteCond %{REQUEST_URI}!^.*robots\.txt$
RewriteCond %{REQUEST_URI}!^.*\.ico$
RewriteCond %{REQUEST_URI}!/getout\.php$
RewriteRule .* /getout.php [L]

Thank you very much in advance.
Sincerely
Joe Belmaati
Copenhagen Denmark

jdMorgan

2:12 pm on Nov 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That should work, but you may end up with a very large number of trapped IP addresses. You might consider just responding with a 403-Forbidden response instead:

# Forbid if blank (or "-") Referer *and* UA, except for HEAD requests from caching proxies (such as AOL)
RewriteCond %{REQUEST_METHOD} !^HEAD$
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !\.ico$
RewriteCond %{REQUEST_URI} !^/custom403\.html$
RewriteRule .* - [F]

Also, you can shorten the code and speed it up by combining the REQUEST_URI patterns:

# Forbid if blank (or "-") Referer *and* UA, except for HEAD requests from caching proxies (such as AOL)
RewriteCond %{REQUEST_METHOD} !^HEAD$
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteCond %{REQUEST_URI} !(^/robots\.txt¦\.ico¦^/custom403\.html)$
RewriteRule .* - [F]

Replace the broken pipe "¦" characters above with solid pipes before use.

Jim

Joe Belmaati

5:00 pm on Nov 1, 2004 (gmt 0)

10+ Year Member



Thanks a lot, Jim. How would I use your code without a redirect to a custom 403?

jdMorgan

6:57 pm on Nov 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Even if you don't have a custom 403 page, it will work anyway. But you can remove the reference to "custom403.html" from the code in order to eliminate unnecessary processing.
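
With that reference removed, the shortened version becomes (again, replace the broken pipe "¦" characters with solid pipes before use):

# Forbid if blank (or "-") Referer *and* UA, except for HEAD requests from caching proxies (such as AOL)
RewriteCond %{REQUEST_METHOD} !^HEAD$
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteCond %{REQUEST_URI} !(^/robots\.txt¦\.ico)$
RewriteRule .* - [F]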

Jim

Joe Belmaati

11:02 pm on Nov 1, 2004 (gmt 0)

10+ Year Member



Thank you very much. I know that I have asked too many times already, but for the life of me I can't get the last part of my .htaccess file to work. Basically, I want to do what is mentioned above (which seems to work) and to do the same for requests for _vti or other Office files. But when I activate the last part I get a 500 Internal Server Error. Any idea how I can make everything in the .htaccess file happily co-exist? Any help is much appreciated!

Sincerely,
Joe Belmaati
Copenhagen Denmark

SetEnvIf Request_URI "^(/403.*\.htm¦/robots\.txt)$" allowsome

# Don't look in my htaccess file
SetEnvIf Request_URI "/\.ht" getout
# This IP can do what it wants (disregard the #'s - they are real numbers in my htaccess file)
SetEnvIf Remote_Addr "^##\.##\.###\.###$" allowsome
#

<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

Options +FollowSymLinks
RewriteEngine on
RewriteBase /
# Don't look in my htaccess file
RewriteRule ^\.ht - [F]
RewriteCond %{REMOTE_ADDR} ^80\.196\.101\.240$
RewriteRule .* - [L]
# Various bots
RewriteCond %{HTTP_USER_AGENT} ^WinHttp\.WinHttpRequest\.\d+ [NC,OR]
# Address harvesters
RewriteCond %{HTTP_USER_AGENT} ^(autoemailspider¦ExtractorPro) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^E?Mail.?(Collect¦Harvest¦Magnet¦Reaper¦Siphon¦Sweeper¦Wolf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DTS.?Agent¦Email.?Extrac) [NC,OR]
RewriteCond %{HTTP_REFERER} iaea\.org [NC,OR]
# Download managers
RewriteCond %{HTTP_USER_AGENT} ^(Alligator¦DA.?[0-9]¦DC\-Sakura¦Download.?(Demon¦Express¦Master¦Wonder)¦FileHound) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Flash¦Leech)Get [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Fresh¦Lightning¦Mass¦Real¦Smart¦Speed¦Star).?Download(er)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Gamespy¦Go!Zilla¦iGetter¦JetCar¦Net(Ants¦Pumper)¦SiteSnagger¦Teleport.?Pro¦WebReaper) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(My)?GetRight [NC,OR]
# Image-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(AcoiRobot¦FlickBot¦webcollage) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Express¦Mister¦Web).?(Web¦Pix¦Image).?(Pictures¦Collector)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image.?(fetch¦Stripper¦Sucker) [NC,OR]
# "Gray-hats"
RewriteCond %{HTTP_USER_AGENT} ^(Atomz¦BlackWidow¦BlogBot¦EasyDL¦Marketwave¦Sqworm¦SurveyBot¦Webclipping\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (girafa\.com¦gossamer\-threads\.com¦grub\-client¦Netcraft¦Nutch) [NC,OR]
# Site-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(eCatch¦(Get¦Super)Bot¦Kapere¦HTTrack¦JOC¦Offline¦UtilMind¦Xaldon) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(Auto¦Cop¦dup¦Fetch¦Filter¦Gather¦Go¦Leach¦Mine¦Mirror¦Pix¦QL¦RACE¦Sauger) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(site.?(eXtractor¦Quester)¦Snake¦ster¦Strip¦Suck¦vac¦walk¦Whacker¦ZIP) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCapture [NC,OR]
# Tools
RewriteCond %{HTTP_USER_AGENT} ^(curl¦Dart.?Communications¦Enfish¦htdig¦Java¦larbin) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (FrontPage¦Indy.?Library¦RPT\-HTTPClient) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (libwww¦lwp¦PHP¦Python¦www\.thatrobotsite\.com¦webbandit¦Wget¦Zeus) [NC,OR]
# Unknown
RewriteCond %{HTTP_USER_AGENT} ^(Crawl_Application¦Lachesis¦Nutscrape) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[CDEFPRS](Browse¦Eval¦Surf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Demo¦Full.?Web¦Lite¦Production¦Franklin¦Missauga¦Missigua).?(Bot¦Locat) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (efp@gmx\.net¦hhjhj@yahoo\.com¦lerly\.net¦mapfeatures\.net¦metacarta\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Industry¦Internet¦IUFW¦Lincoln¦Missouri¦Program).?(Program¦Explore¦Web¦State¦College¦Shareware) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Mac¦Ram¦Educate¦WEP).?(Finder¦Search) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Moz+illa¦MSIE).?[0-9]?.?[0-9]?[0-9]?$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9][0-9]?.\(compatible[\)\ ] [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NaverRobot [NC,OR]
# Email
RewriteCond %{REQUEST_URI} (mail.?form¦form¦form.?mail¦mail¦mailto)\.(cgi¦exe¦pl)$ [NC,OR]
# Various
RewriteCond %{REQUEST_URI} ^/(bin/¦cgi/¦cgi\-local/¦sumthin) [NC,OR]
RewriteCond %{THE_REQUEST} ^GET\ /?http [NC,OR]
# Forbid if UA is a single word - case-insensitive, A-Z only
RewriteCond %{HTTP_USER_AGENT} ^[a-z]+$ [NC]
RewriteCond %{REQUEST_URI}!/getout\.php$
RewriteRule .* /getout.php [L]
# Forbid if blank (or "-") Referer *and* UA, except for HEAD requests from caching proxies (such as AOL)
RewriteCond %{REQUEST_METHOD}!^HEAD$
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteCond %{REQUEST_URI}!(^/robots\.txt¦\.ico¦^/custom403\.html)$
RewriteRule .* - [F]
# Frontpage Office etc
#RewriteCond %{REQUEST_URI} ^/(MSOffice¦_vti) [NC,OR]
#RewriteCond .* - [F]

Joe Belmaati

11:06 pm on Nov 1, 2004 (gmt 0)

10+ Year Member



PS. I also tried removing the [L] flag - same result. If I put the _vti RewriteCond into the block right before it (the one with the empty Referer string) I simply get a 404 when I try to request, say, [mydomain.com...]

jdMorgan

11:22 pm on Nov 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Remove the [OR] from the RewriteCond just before the RewriteRule. You cannot have an [OR] on the last RewriteCond.

Make sure you have a space between each "}" and "!", and make sure that you have changed all broken pipe "¦" characters to solid pipe characters.
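
With those fixes, the commented-out section at the end would become:

# Frontpage Office etc
RewriteCond %{REQUEST_URI} ^/(MSOffice¦_vti) [NC]
RewriteRule .* - [F]

Note that the last line must be a RewriteRule, not another RewriteCond - "RewriteCond .* - [F]" is itself invalid and will also trigger a 500 error.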

This rule will block "MARTINI", which is a robot from LookSmart:


# Forbid if UA is a single word - case-insensitive, A-Z only
RewriteCond %{HTTP_USER_AGENT} ^[a-z]+$ [NC]

If that is important to you, create an exception using a RewriteCond.

Jim

balam

4:16 am on Nov 2, 2004 (gmt 0)

10+ Year Member



This rule will block "MARTINI" [...] If that is important to you, create an exception using a RewriteCond.

This is how I handle this situation:

# Forbid visitor if UA is a single word - case-insensitive, A-Z only
RewriteCond %{HTTP_USER_AGENT} ^[a-z]+$ [NC]
# ...some exemptions though...
RewriteCond %{HTTP_USER_AGENT} !^DeepIndex$
RewriteCond %{HTTP_USER_AGENT} !^FavOrg$
RewriteCond %{HTTP_USER_AGENT} !^MantraAgent$
RewriteCond %{HTTP_USER_AGENT} !^MARTINI$
RewriteRule !403\.html$ - [F]

I see that Joe is using an .htaccess file that I posted some time ago... A nice little ego boost to see that "code" keep popping up.

Joe Belmaati

7:35 am on Nov 2, 2004 (gmt 0)

10+ Year Member



Thank you so much, guys. That solved everything. Let me take this opportunity to thank both Jim and Balam for all the help. I found the long .htaccess thread some time ago and decided to go through my access log looking for some of the things that were mentioned in that thread. Sure enough: lots of requests for formmail scripts and other malicious behavior. With the help provided by this community I feel that my site is a safer place for my members and me - so thank you VERY MUCH :D

balam

1:02 am on Nov 4, 2004 (gmt 0)

10+ Year Member



> Let me take this opportunity to thank both Jim and Balam for all the help.

Always happy when I can offer some help. Here's another (not so?) little tidbit you'll want to know about if you are in, or are trying to get into, the DMOZ directory...

From what I've been able to gather, the editors of DMOZ have a custom-made link-checking program named "TulipChain" that they use to verify the existence of sites in the directory. It's written in Java and uses other "toolbox" software. Here's the UA (or a recent version thereof):

TulipChain/6.02 (http://ostermiller.org/tulipchain/) Java/1.4.0_03 (http://java.sun.com/) Windows_XP/5.1 RPT-HTTPClient/0.3-3

It's important to note the "RPT-HTTPClient/0.3-3" part of the UA, since RPT-HTTPClient is contained in the second RewriteCond of the "Tools" section in the .htaccess posted in message 5, above. Specifically:

RewriteCond %{HTTP_USER_AGENT} (FrontPage¦Indy.?Library¦RPT\-HTTPClient) [NC,OR]

I've had trouble with "visitors" using RPT-HTTPClient (though, to be honest, I can't quite remember what it is), but I don't want to ban DMOZ, so near the top of my .htaccess I have:

RewriteCond %{HTTP_USER_AGENT} ^TulipChain
RewriteRule (.*) - [L]

If you are concerned about this, I would add the above two lines after your "Don't look in my htaccess file" section and before your "Various bots" section.

Also worth noting: "Java" is part of the first RewriteCond in the "Tools" section, and it also appears in the TulipChain UA. Since that RewriteCond requires that the UA start with Java (or one of the other expressions it tests for), it will not stop TulipChain. But if that important caret (^) is ever removed and the two extra lines I offered above aren't added to your .htaccess, then that RewriteCond will ban TulipChain as well (or, more accurately, first).
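
In context, the top of the ruleset would then look something like this:

# Don't look in my htaccess file
RewriteRule ^\.ht - [F]
# Let DMOZ's TulipChain link checker pass before any of the bot tests
RewriteCond %{HTTP_USER_AGENT} ^TulipChain
RewriteRule (.*) - [L]
# Various bots
RewriteCond %{HTTP_USER_AGENT} ^WinHttp\.WinHttpRequest\.\d+ [NC,OR]
...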

jdMorgan

3:36 am on Nov 4, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is a good example of what can happen if you copy something like the close to perfect .htaccess ban list [webmasterworld.com] without *fully* understanding how each and every user-agent in the list is used. It is certainly not a "plug and play" operation.

Jim

Joe Belmaati

7:40 am on Nov 4, 2004 (gmt 0)

10+ Year Member



Once again, a huge thanks. I did read the entire "close to perfect..." thread, and thought that put me in "some sort of position" to make an informed choice about what to add and what not to. I have since found some pitfalls that I have had to make provisions for (with Jim's help), most notably that people using the IE Discuss toolbar were getting trapped because of the _vti rewrite condition. Now I am sending them to a custom 403 page that explains how to turn the toolbar off and surf right on. Furthermore, I am keeping a vigilant eye on who I am trapping, and I am unbanning when the DNS resolves to things like pacbell.net, comcast.net, etc.
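
For anyone facing the same problem, that part of my file now looks roughly like this (custom403.html is simply the name I use for the explanation page):

# Serve an explanation page instead of the bare 403
ErrorDocument 403 /custom403.html
# Frontpage Office etc
RewriteCond %{REQUEST_URI} !^/custom403\.html$
RewriteCond %{REQUEST_URI} ^/(MSOffice¦_vti) [NC]
RewriteRule .* - [F]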

Thank you once again :)

Sincerely,
Joe Belmaati
Copenhagen Denmark

TreoRenegade

1:48 pm on Nov 12, 2004 (gmt 0)

10+ Year Member



I wanted to update one of the bot-trap threads, but they're closed, so this seems the next best location. While I was strongly tempted to just copy and paste one of the bot-trap scripts for automated .htaccess banning, I decided to keep it manual. My concern: there are relatively new aggregator-style bots out there built exclusively for indexing blogs, and I wasn't certain of their intelligence.

Sure enough, despite setting an off-limits area in the robots.txt file over a month ago, and despite waiting until this week to set up the trap, within the first 24 hours, three bots were snagged, two of which are related to blogs and/or PDA usage:

IP address: 64.157.224.100
Domain name: sync00.avantgo.com
User agent: Mozilla/4.0 (compatible; AvantGo 5.2; FreeBSD)

IP address: 198.87.83.123
Domain name: www.syndic8.com
User agent: Syndic8/1.0 (http://www.syndic8.com/ )

Lesson learned! Dumb bots, but not devious. Thus, I'll be sticking with manual htaccess banning, even though that's a little extra work.
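
For what it's worth, the off-limits area is nothing exotic - just a Disallow rule in robots.txt along these lines (the directory name here is made up):

User-agent: *
Disallow: /trap/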

jdMorgan

2:53 pm on Nov 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This thread [webmasterworld.com] is the most recent continuation of the Close to perfect htaccess ban list thread, and is still open.

You can simply add an exclusion to the script or to the htaccess code you use to redirect to the script, in order to avoid banning WAP requests or anything else you wish to permit.

As an example, let's assume you have cloaked the bad-bot.pl script using mod_rewrite: instead of Disallowing bad-bot.pl in robots.txt and putting links to it in your pages, you use "all_private.html". Then, the .htaccess code might look like this:


# Redirect bad-bot bait files to IP banning script. Exclusions are to avoid banning search engines and
# AvantGo WAP proxies, Google proxies, and WebTV. AvantGo, Google WAP proxy, and WebTV may display the
# link to the spider trap, so users may click on it. Search engines should not attempt to fetch files
# starting with "all_private" because this is disallowed in robots.txt. However, the following
# exclusion list is a "safety net."
RewriteCond %{HTTP_USER_AGENT} !(Ask\ Jeeves¦FAST-.*WebCrawler/¦Fluffy¦GalaxyBot/¦Gigabot/¦Googlebot/) [NC]
RewriteCond %{HTTP_USER_AGENT} !(ia_archiver¦MARTINI¦Mercator¦msnbot/¦Overture-WebCrawler/¦Robozilla) [NC]
RewriteCond %{HTTP_USER_AGENT} !(Scooter/¦Scrubby/¦Slurp¦Steeler/¦Submission\ Spider) [NC]
RewriteCond %{HTTP_USER_AGENT} !(Teoma¦Vagabondo/¦VoilaBot¦Zealbot¦ZyBorg/) [NC]
RewriteCond %{HTTP_USER_AGENT} !(AvantGo¦Blazer¦Google\ .*\ Proxy¦Tulipchain¦WebTV¦Xenu) [NC]
RewriteRule ^all_private /cgi/bad-bot.pl [L]
# /private is an empty directory which is password-protected. User agents excluded above will get a 401
# authentication-required response if they ignore robots.txt and attempt to fetch all_private.
RewriteRule ^all_private /private/login.html [L]

The list of user-agents in the code will vary, depending on which user-agents are critical to your site. Any user-agent listed will be allowed to violate robots.txt, so keep this list short. I also suggest an additional positive test on each listed user-agent to make sure that it is a valid one and not a spoof. As a rough sketch of the idea (not production code), a claimed Googlebot could be required to reverse-resolve to a googlebot.com host - note that %{REMOTE_HOST} is only useful if HostnameLookups is enabled on the server:
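
# Rough sketch: forbid anything claiming to be Googlebot that does not
# reverse-resolve to a googlebot.com host (requires HostnameLookups On,
# which adds a DNS lookup to every request)
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_HOST} !\.googlebot\.com$ [NC]
RewriteRule .* - [F]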

Jim