
Perl Server Side CGI Scripting Forum

A Close to perfect .htaccess ban list
toolman




msg:441824
 3:30 am on Oct 23, 2001 (gmt 0)

Here's the latest rendition of my favorite ongoing artwork... my beloved .htaccess file. I've become quite fond of my little buddy, the .htaccess file, and I love the power it gives me to exclude vermin, pestoids and undesirable entities from my web sites.

Gorufu, littleman, Air, SugarKane: you guys see any errors or better ways to do this? Anybody got a bot to add before I stick this in every site I manage?

Feel free to use this on your own site and start blocking bots too.

(the top part is left out)

<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org
RewriteRule .* - [F]

 

SomeCallMeTim




msg:442004
 12:56 pm on Dec 4, 2002 (gmt 0)

Is there a way to use upside's method of:

SetEnvIf Remote_Addr ^12\.40\.85\. getout
SetEnvIfNoCase User-Agent ^Microsoft.URL getout

<Limit GET POST>
order allow,deny
allow from all
deny from env=getout
</Limit>

but return something more ambiguous than a 403 so that the person trying to grab the site is confused...say a 304 for page not modified for example?

Is upside's method more expensive than using rewrite?

Thanks
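(As an aside, and only a sketch: the SetEnvIf/deny approach can only produce a 403, and mod_rewrite can't fake an arbitrary status like 304. Its [G] flag does return a minimal 410 Gone, though, which is at least less obvious than a Forbidden. Reusing one of the user-agent conditions from earlier in the thread:)

RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [NC]
RewriteRule .* - [G]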

webmasta




msg:442005
 8:45 pm on Dec 6, 2002 (gmt 0)
Back to the bot list... I am new here (first post) and found this thread very interesting, so I tried it on my site.

I downloaded the new version 4.37 of BlackWidow and used the last updated list that Superman posted. Somehow BlackWidow is getting past the .htaccess file and downloading the site. However, when I tried it from http://www.wannabrowser.com/ and changed the UA to BlackWidow, it was blocked, getting the 403 from there.

Any takes on that? This new BlackWidow comes with a ton of plugins that could get around almost anything, including encrypted sites, and it can decode script-generated URLs.

Next, I noticed "Web Spider" in my logs. Any idea who this is?

Also, while searching for some offline browsers I came across www.matuschek.net/software/jobo/index.html. Seems like this JoBo is a smartass: it gives you the option to change the UA to anything you want! Of course, when I added JoBo to the .htaccess file it was blocked, but when I changed the UA to Mozilla it downloaded the entire site!

This one is dangerous! Guess we might as well start thinking that instead of blocking unwanted bots we only allow the good ones. It's just as hard, or even harder, to keep on top of the bad ones and add them to the list as it is to keep on top of the good ones.

webmasta

[edited by: jatar_k at 9:04 pm (utc) on Dec. 6, 2002]

jdMorgan




msg:442006
 3:05 am on Dec 7, 2002 (gmt 0)

Wow - Looks like this thread woke up again!

pmkpmk,
You'd be right - spoofing a known-harmless user-agent is a popular technique. Most site scrapers don't know how to change the UA though, or don't bother to do it. For those who do, there are other measures to dispatch them - see the last part of this post.

Andy_White,
Yes, the virtual account has to have sufficient "permissions" for .htaccess to do its job, so AllowOverride needs to be set up correctly. In addition, many setups will require the Options +FollowSymlinks directive to precede the RewriteEngine on directive in per-directory .htaccess.
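(In other words, a per-directory .htaccess along these lines, with the Options line first, is the usual minimal starting point:)

Options +FollowSymlinks
RewriteEngine on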

okidata,
Banning by country doesn't work well for several reasons: First, the domain name for the requesting IP must be available in reverse DNS in order to get the country code as remote_host. That's not always true, and reverse DNS is slow. Second, you are banning an IP assigned to the ISP, not necessarily the user or a group of users. The user and ISP could be in different countries by virtue of the international nature of telecom. Even banning by IP numbers assigned to countries doesn't work all that well. There is no central "map" to tell you what address blocks go with what country - they are assigned piecemeal and willy-nilly. There are, however, some nice subscription services to make the info available to you - big buck$, though...

Part two. Use order "deny,allow", "deny from" and "allow from" to get past your problem.

The following bans two IP addresses, but allows all to access 403.html and robots.txt:

SetEnvIf Remote_Addr ^12\.148\.209\.196$ banit
SetEnvIf Remote_Addr ^65\.80\.255\.116$ banit
SetEnvIf Request_URI "^(/403\.html|/robots\.txt)$" allowit
<Files *>
order deny,allow
deny from env=banit
allow from env=allowit
</Files>

upside,
You don't need mod_rewrite to serve a custom error page. All you need is ErrorDocument 403 /your403page.html and/or ErrorDocument 404 /your404page.html in your .htaccess file at web root.

SomeCallMe ... Tim?
Neet!
No, you can't return a bogus server code. You could redirect to a PERL script and start a very long delay, or just execute a die without creating an HTML response, I suppose, but it's hardly worth it, IMHO. Think of 403-Forbidden responses as a raised middle finger, and take joy in sending them! You want to send the shortest response possible to bad-bots, while still giving enough info to an unintentionally-denied visitor to allow him/her to fix problems such as misconfigured Norton Internet Security settings.

Webmasta,
This anomaly is probably due to the fact that you are running blackwidow from inside your server.

JoBo leaves its url in your server log in order to get your attention so you'll buy it. Bad strategy with this group, eh?

Blocking by allowing only "known good" user-agents doesn't work well - I tried it myself. The problem is that major search engines and directories come up with new UA variations and new IP addresses all the time - too hard to keep up with and the penalty might be getting dropped from the search engine. Also, even legitimate browsers have thousands of variations of UA layout.

Even "Mozilla" can't download my site - a multi-defense solution is needed, though:

--

As some have pointed out in earlier posts, there is no perfect solution. My UA ban list is about four times larger than the ones posted here, and server workload is negligible because I have "compressed" the UA list. But my sites are small - hits in the hundreds or thousands per day, but rarely more. So your approach may need to be different than mine. I block by UA, http referer, request method, remote host, remote address (IP address), requested protocol, and combinations of several of these.
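(Purely as an illustration of what combining several of those criteria can look like - this is not Jim's actual ruleset, and the referer host and IP range below are made-up examples:)

RewriteCond %{REQUEST_METHOD} !^(GET|HEAD|POST)$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_REFERER} ^http://(www\.)?bad-referer\.example\.com [NC,OR]
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
RewriteRule .* - [F]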

I have also implemented a version of the bad-bot banning script available in the archives here at WebmasterWorld. Search for "malicious robots PERL script" using WebmasterWorld site search for more info. This script and its associated traps tend to be very good at catching heretofore-unreported user-agents that come to your site and try to have their way with it. They'll get a few pages or objects, and then the door gets slammed in their face. Next stop - WebmasterWorld Search Engine Spider Identification forum to report them. :)

HTH,
Jim

webmasta




msg:442007
 4:13 am on Dec 7, 2002 (gmt 0)

Thanks jd...
>>This anomaly is probably due to the fact that you are running blackwidow from inside your server.

But no... BlackWidow is sitting on a normal machine with internet access. I just called up the URL of the site and it shows up in browser mode. I would think that BlackWidow would be sending its own UA to get that URL?

Something doesn't seem right... it grabbed the entire site as if the .htaccess file wasn't there, yet when I tried to access the same site from Wannabrowser with BlackWidow as the UA it was blocked. Could it be that BW is spoofing when used as a browser?

I checked my logs... I see nothing there about BlackWidow.
Hmmm

webmasta

webmasta




msg:442008
 5:06 am on Dec 7, 2002 (gmt 0)

more..

I am getting this a lot in my logs: "libwww-perl". I tracked it to this website: www.linpro.no/lwp/

Scroll down the page on that site and you'll see two bots that I don't see in any of the .htaccess lists so far in this thread: webPluck and webMirror. But then again, there are good bots based on that library also.

webmasta

webmasta




msg:442009
 8:18 am on Dec 8, 2002 (gmt 0)

Further to BlackWidow 4.37:

I did a script to trap the UA for BlackWidow when used in browser mode. Seems like the spider is using the UA of whatever default browser you have on board. I couldn't tell the difference from the printout when I was using IE or BlackWidow; both UA strings were the same... :o

And of course I was able to download the entire site while scanning... all packaged nicely and laid out like a picnic table. I know it was bad before, but now it's a predator in disguise.

Obviously, creative thinking is needed...

webmasta

58sniper




msg:442010
 4:29 am on Dec 12, 2002 (gmt 0)
jdMorgan -

I'm attempting to use your correction of my code, and it doesn't seem to be blocking.

RewriteCond %{HTTP_REFERER} ^http://(www\.)?flipdog\.com [NC]
RewriteRule .* /simple.php?sid=robots [F,L]

It is still letting flipdog.com through. Also, I'm getting traffic from bsb.jobs.flipdog.com that I need to block as well.
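(For what it's worth, an untested sketch of a referer pattern that would also cover subdomains such as bsb.jobs.flipdog.com:)

RewriteCond %{HTTP_REFERER} ^http://([^/]+\.)?flipdog\.com [NC]
RewriteRule .* - [F,L]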

I'm also using:
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain.com.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(dev\.)?mydomain.com.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://localhost/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://myipaddress$ [NC]
RewriteRule \.(gif|jpg|zip|pdf)$ http://www.mydomain.com/dev/apology.gif [R,L]

And this does work when viewing the flipdog site. At least my images are not there. But the content, including formatting, is.

My legal staff is sending them a nasty gram today, but I need to do something ASAP.

[edited by: jatar_k at 4:56 pm (utc) on Dec. 12, 2002]

Edge




msg:442011
 5:49 pm on Dec 26, 2002 (gmt 0)

The following entry clearly doesn't work in my website's .htaccess:

"RewriteCond %{HTTP_USER_AGENT} ^MSIECrawler [OR] "

Is there an alternative entry for this browser within .htaccess? MSIECrawler actually shows as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; .NET CLR 1.0.3705; MSIECrawler)"

MSIECrawler does check my robots.txt, so I disallowed it there.

Suggestions?

Thanks in advance!

webmasta




msg:442012
 2:47 am on Dec 27, 2002 (gmt 0)

Edge....

>>>> RewriteCond %{HTTP_USER_AGENT} ^MSIECrawler [OR]

Seems like you're looking to match MSIECrawler at the start of the UA string when it actually appears at the end.

>>>MSIECrawler actually shows as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; .NET CLR 1.0.3705; MSIECrawler)"

Try:

RewriteCond %{HTTP_USER_AGENT} MSIECrawler [OR]

without the ^. The ^ tries to match "MSIECrawler" at the start of the string, and when it doesn't find it there it just moves on to the next condition; it won't look further down the string to the end.

hope this helps...
webmasta

maxidrom11




msg:442013
 1:51 pm on Jan 7, 2003 (gmt 0)

I have recently published the following .htaccess and the server gives me a 500 error. Could you check whether something is incorrect, please?

ErrorDocument 401 /custompage.html
ErrorDocument 403 /custompage.html
ErrorDocument 404 /custompage.html
ErrorDocument 500 /custompage.html
RewriteOptions +FollowSymLinks
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule !^custompage\.html$ /custompage.html [L]
RewriteCond %{REMOTE_ADDR} ^217\.113\.22[4-8]\. [OR]
RewriteCond %{REMOTE_ADDR} ^217\.113\.229\.([0-9]|[1-9][0-9]|1[01][0-9]|12[0-7])$
RewriteRule .* [somesite.com...]

[edited by: jatar_k at 6:42 pm (utc) on Jan. 7, 2003]
[edit reason] removed specifics [/edit]

SomeCallMeTim




msg:442014
 1:40 am on Jan 9, 2003 (gmt 0)

Has anyone in this thread ever thought about the fact that these apps can all change their user-agent string and add a referrer by themselves?

Which is why we went with the Spambot Trap solution. The Google cache of the page is at:

cached page [216.239.39.100]

It is working nicely for us.

hakre




msg:442015
 3:33 am on Jan 9, 2003 (gmt 0)

Nice trap. And these traps should also generate pages with dozens of bogus email addresses, which will spam the databases of these robots. If they can't get enough, feed them to death ;-)

pmkpmk




msg:442016
 8:17 am on Jan 9, 2003 (gmt 0)

Anybody already tried SugarPlum? www.devin.com/sugarplum/

"Sugarplum employs a combination of Apache's mod_rewrite URL rewriting rules and perl code. It combines several anti-spambot tactics, includling fictitious (but RFC822-compliant) email address poisoning, injection with the addresses of known spammers (let them all spam each other), deterministic output, and "teergrube" spamtrap addressing.

Sugarplum tries to be very difficult to detect automatically, leaving no signature characteristics in its output, and may be grafted in at any point in a webserver's document tree, even passing itself off as a static HTML file. It can optionally operate deterministically, producing the same output on many requests of the same URL, making it difficult to detect by comparison of multiple HTTP requests.

Friday, 09/27/2002: Sugarplum 0.9.8 is available. This is a major revision, based on a "two years hence" review of evolved spammer tactics, countermeasure viability, and various public feedback. This release is much quicker, easier to install and maintain, and about half the size. See the changelog for details. "

xlcus




msg:442017
 2:47 am on Jan 12, 2003 (gmt 0)

Slightly off topic, but a related subject...
If you're trying to block crawlers and bots that rapidly hit your server and put it under heavy load, and you have access to PHP, you might want to take a look at the script I posted to this thread [webmasterworld.com].

It identifies, on the fly, WebCrawlers rapidly requesting pages without the need for a black-list of bots.

neslon




msg:442018
 9:45 pm on Feb 8, 2003 (gmt 0)

What a great thread! I've incorporated your "latest and greatest" list into my own very out-of-date list of harvesters/bandwidth-suckers. My old list had a few entries that yours didn't, but these may be obsolete:

RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^SearchExpress [OR]
RewriteCond %{HTTP_USER_AGENT} ^ZyBorg [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebBandit [OR]

I was surprised that there was no mention of mod_throttle [snert.com] for those who run their own Apache servers. I've only just started playing with it, but it seems to be an absolutely tremendous tool, even if only for pinpointing in real time, at a glance, who's eating all the bandwidth. And there are facilities for delaying/refusing requests from client IPs that make too many requests. The one downside seems to be slightly skimpy documentation.

At any rate, this is a great site (my first post here) and a tremendous resource. Thanks!

jatar_k




msg:442019
 10:18 pm on Feb 8, 2003 (gmt 0)

Welcome to WebmasterWorld neslon

andreasfriedrich




msg:442020
 8:38 pm on Feb 9, 2003 (gmt 0)

mod_throttle, Apache::SpeedLimit [httpd.apache.org], et al. are mentioned in quite a few threads around here. Try either the site search or Google to locate them :).

WindSun




msg:442021
 4:41 pm on Feb 12, 2003 (gmt 0)

This seems to work for blocking most of the email harvesters, but I am not sure it is the most efficient way to do it (in .htaccess):

SetEnvIfNoCase User-Agent "EmailCollector/1.0" spam_bot
SetEnvIfNoCase User-Agent "EmailSiphon" spam_bot
SetEnvIfNoCase User-Agent "EmailWolf 1.00" spam_bot
SetEnvIfNoCase User-Agent "ExtractorPro" spam_bot
SetEnvIfNoCase User-Agent "Crescent Internet ToolPak HTTP OLE Control v.1.0" spam_bot
SetEnvIfNoCase User-Agent "Mozilla/2.0 (compatible; NEWT ActiveX; Win32)" spam_bot
SetEnvIfNoCase User-Agent "CherryPicker/1.0" spam_bot
SetEnvIfNoCase User-Agent "CherryPickerSE/1.0" spam_bot
SetEnvIfNoCase User-Agent "CherryPickerElite/1.0" spam_bot
SetEnvIfNoCase User-Agent "NICErsPRO" spam_bot
SetEnvIfNoCase User-Agent "WebBandit/2.1" spam_bot
SetEnvIfNoCase User-Agent "WebBandit/3.50" spam_bot
SetEnvIfNoCase User-Agent "webbandit/4.00.0" spam_bot
SetEnvIfNoCase User-Agent "WebEMailExtractor/1.0B" spam_bot
SetEnvIfNoCase User-Agent "autoemailspider" spam_bot
Order Allow,Deny
Allow from all
Deny from env=spam_bot
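(One possible tightening, offered only as a sketch: SetEnvIfNoCase matches a regular expression anywhere in the header, so the list can be collapsed into a couple of alternations and the exact version strings dropped:)

SetEnvIfNoCase User-Agent "(EmailCollector|EmailSiphon|EmailWolf|ExtractorPro|CherryPicker|NICErsPRO|WebBandit|WebEMailExtractor|autoemailspider)" spam_bot
SetEnvIfNoCase User-Agent "(Crescent Internet ToolPak|NEWT ActiveX)" spam_bot
Order Allow,Deny
Allow from all
Deny from env=spam_bot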

Panicschat




msg:442022
 9:30 pm on Feb 14, 2003 (gmt 0)
I have been reading through this thread and have found it to be extremely interesting and useful. I particularly like the helpful content from Superman, Toolman and Key_Master.

I have been mucking around with my .htaccess for some months, trying to block people who have been doing various nefarious things like hotlinking, downloading my content to display on other sites, and grabbing my entire web site.

Hotlinking is taken care of. I have a little separate .htaccess in each subdirectory of the root directory that reads as follows:

RewriteEngine on

RewriteCond %{HTTP_REFERER} !^http://mydomain.org/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www.mydomain.org/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://myotherdomain.org/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www.myotherdomain.org/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www.myotherdomain.org/index.html/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www.mydomain.org/newindex.htm/.*$ [NC]
RewriteRule .*\.(jpg|jpeg|gif|png|bmp)$ http://www.mydomain.org/403.shtml [R,NC]

That works just fine. No problems with that at all. Note that it allows access to my pictures from both of my domains.

In my root directory I have the following .htaccess file. Obviously most of my nefarious visitors are locals. Yes, I am blocking out whole ISPs, which will affect a huge number of visitors, but that's okay, as it is part of my intention. I am working on reducing the number of IP addresses listed by adding machine names that correspond to IP ranges. Trust me, that does work.

I have a couple of questions:
Is there a way to write this so that people do get to see the 403 error? Currently they don't see it.
I know I can use something like "deny from 61.95.30." but can I also use "deny from 61.95."? Note the second one has just two octets.
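(For reference: yes, mod_access accepts partial IP addresses as well as network/CIDR ranges, so either of these forms should work - a sketch:)

deny from 61.95.
deny from 61.95.0.0/16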

Also, before I go, here's something useful for all of you who are effectively blocking out people who steal your web site content. Ever thought those people will just go to the Google cache and steal content from there? Then this line in your HTML head will fix that:
<META NAME="ROBOTS" CONTENT="NOARCHIVE">

ErrorDocument 403 403.shtml

<Limit GET>
order allow,deny
deny from 61.95.30.
deny from 63.148.99.
deny from 64.12.183.
deny from 64.229.81.
deny from 65.92.21.
deny from 65.94.39.
deny from 65.95.181.
deny from 65.95.185.
deny from 128.250.6.
deny from 128.250.9.
deny from 128.250.15.
deny from 128.250.16.
deny from 129.78.64.
deny from 139.134.64.
deny from 144.135.25.
deny from 147.188.192.
deny from 195.239.232.
deny from 202.12.144.
deny from 203.40.140.
deny from 203.40.160.
deny from 203.40.161.
deny from 203.40.162.
....(many more of these starting with 203.)
deny from 204.83.211.
deny from 205.191.171.
deny from 207.44.200.
deny from 207.156.7.
deny from 207.172.11.
deny from 209.90.147.
deny from 209.178.220.
deny from 210.49.20.
deny from 210.49.21.
deny from 210.49.22.
deny from 210.50.16.
deny from 211.28.51.
deny from 211.28.96.
deny from 211.28.219.
deny from 212.95.252.
deny from 216.12.216
deny from 216.16.1.
deny from 216.218.129.
deny from .adnp.net.au
deny from .alphalink.com.au
deny from .comindico.com.au
deny from .csu.edu.au
deny from .da.uu.net
deny from .gil.com.au
deny from .iprimus.net.au
deny from .labyrinth.net.au
deny from .netspace.net.au
deny from .nsw.bigpond.net.au
deny from .optusnet.com.au
deny from .ozemail.com.au
deny from .sympatico.edu.ca
deny from .tmns.net.au
deny from .usyd.edu.au
deny from .vic.bigpond.net.au
allow from all
</Limit>

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon
RewriteRule /*$ http://www.crimestoppers.com.au/ [L,R]
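(On the first question above: one commonly used approach, assuming the custom page really does live at /403.shtml, is to reference it with a leading slash and give it its own <Files> section so the deny list no longer applies to that one file - a sketch only, untested against this exact setup:)

ErrorDocument 403 /403.shtml
<Files "403.shtml">
order allow,deny
allow from all
</Files>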

lorax




msg:442023
 5:12 am on Feb 15, 2003 (gmt 0)

My gawd, forget to look at a thread and see what happens. Great stuff has happened since I last read this thread.

Looks like I got here too late to look at the Spambot trap.

Panicschat




msg:442024
 10:17 am on Feb 15, 2003 (gmt 0)
Oh yes, a couple of things I forgot to mention. I know I don't really need the index.html on the hotlink code, but for a while I did and just left it there.

I have also tried the main .htaccess with the full path to the 403 document, e.g. ErrorDocument 403 http://www.mydomain.org/403.shtml. Still no deal. I know I'm doing something wrong. I know this has to be rewritten, but I'm stuffed if I know how. :) I have an idea from the code in this discussion, but I prefer not to 500 my entire site by stuffing up my .htaccess. ;) Ideas and suggestions welcome.

roelbaz




msg:442025
 8:27 pm on Feb 16, 2003 (gmt 0)

Hello all,

Read this thread, but still have a problem. I have:

ErrorDocument 401 /error/errorbot.php3?error=401
ErrorDocument 403 /error/errorbot.php3?error=403
ErrorDocument 404 /error/errorbot.php3?error=404
ErrorDocument 500 /error/errorbot.php3?error=500

How do I call the 403 document (errorbot.php3?error=403) from the RewriteRule:

{....}
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule ^.* - [F,L]

Tried some of the suggestions, but I still get:

"Additionally, a 403 Forbidden
error was encountered while trying to use an ErrorDocument to handle the request."

Can I change the RewriteRule to incorporate the ErrorDocument, and if so, how?

grz, roel

beachbum




msg:442026
 11:17 pm on Feb 17, 2003 (gmt 0)

Panicschat; roelbaz

Think about this: you've decided to ban these bots or IPs from viewing ANY file on your site... now you want to serve them your custom error file... a file they are banned from seeing.

It can be done, and 'how' was discussed elsewhere in this thread (I believe... although I can't find it now).

BUT... the theme of this thread has been how to get bad bots off of your site as quickly and efficiently as possible, minimizing the load on your server and your bandwidth.

SO... why do you want to serve up a custom error page? I also have custom error pages (pretty ones, complete with my navigation links) which I serve up to misguided humans who may need and benefit from a little help. But who thinks that bad bots are actually reading their 'helpful' error pages? :-) Why be 'friendly' to them and waste YOUR resources? Why not dispatch them as quickly as possible? :-)

Hello ALL!

Very helpful thread! I did much of this before ever discovering this forum....so, naturally I did a few things a little different. I'll give some examples of what I've done, and perhaps you'll give me some feedback on doing things one way vs. another.

I've seen this condition frequently, in this forum:
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]

I use this instead:
RewriteCond %{HTTP_USER_AGENT} ^(.*)WebBandit(.*) [NC,OR]

The (.*) says I don't care where in the UA string it appears; if it's in there anywhere, it's gone! The [NC] makes the match case-insensitive.
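(Minor aside, purely as a sketch: because an unanchored regex already matches anywhere in the string, the wrapping (.*) groups aren't strictly needed, and the shorter equivalent is:)

RewriteCond %{HTTP_USER_AGENT} WebBandit [NC,OR]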

I'm currently using this rule:
RewriteRule ^(.*) - [F]

But, I have played around with this one:
RewriteRule ^(.*) [127.1.1.1...] [R=permanent,L]

I know this last rule will take longer to return an error to a browser, but will either rule move the load off my server quicker than the other? I realize that if the banned UA came from a machine running a server, this rule 'might' create a problem on that machine....but, gee, that's their problem, right? :-)

Any thoughts, pro or con would be appreciated!

thermoman




msg:442027
 12:33 am on Feb 18, 2003 (gmt 0)

Anybody already tried SugarPlum? www.devin.com/sugarplum/

"Sugarplum employs a combination of Apache's mod_rewrite URL rewriting rules and perl code. It combines several anti-spambot tactics, includling fictitious (but RFC822-compliant) email address poisoning, injection with the addresses of known spammers (let them all spam each other), deterministic output, and "teergrube" spamtrap addressing.

Hi,

I'm using another way to fool the spammers' spiders:

<snip>

This is _not_ a guestbook - you have been warned ;-)

greetings from germany,
Marcel.

richard




msg:442028
 1:02 am on Feb 27, 2003 (gmt 0)

A very impressive thread. My two bits of mod_rewrite knowledge are not worth adding to it, except to reiterate what others have said: read the Apache documentation [httpd.apache.org].

A minor aside: some time ago, as I was first tackling mod_rewrite and thought I'd discovered a minor bug (can't remember what it was, except that it was a bug in my logic - total naivety), I sent the good man Ralf S. Engelschall an email, only to get a "Mail delivery failed: returning message to sender". The reason being <sigh>Mr Engelschall was getting too much spam</sigh>.

P.S.
I liked andreasfriedrich's "If you care about freedom be permissive, if you are paranoid be restrictive."

And a tiny bit of irony, including self irony, the definition of an expert: x is an uncertain quantity and spurt is a drip under pressure ;-).

Another post closer to being a "preferred member".

Brett_Tabke




msg:442029
 2:06 am on Feb 28, 2003 (gmt 0)

A nice article I found in the referrer logs, from Mark Pilgrim's Diveintomark.org site:

Mouse over it - this crowd will just love this URL [diveintomark.org]! Thanks Mark.

lorax




msg:442030
 2:35 am on Feb 28, 2003 (gmt 0)

That's a beaut Brett - Thanks!

Hester




msg:442031
 12:12 pm on Feb 28, 2003 (gmt 0)

I have these questions:

1) What are the bot companies doing with all the data they take?

2) Isn't it illegal?

3) I've checked my webstats but all I see are lists of IP addresses and unknown names. How can I tell what is good and what is bad? None of the names published here seem to be in my list. (Some obvious search engines are though.)

4) To reiterate a previous post, what's to stop all robots announcing themselves as legitimate browsers? (See point 2!)

5) It's only a matter of time before bots can decipher JavaScript URLs too. Is it even worth trying to fight them when extra bandwidth is fairly cheap, so long as it doesn't impact the genuine user?

Brett_Tabke




msg:442032
 4:48 pm on Feb 28, 2003 (gmt 0)
Let's try to stay on topic here in this mega multi-year thread. Hester, please start a new thread in the Spider Identification forum for the side-topic issues.
Oaf357




msg:442033
 1:01 am on Mar 7, 2003 (gmt 0)

So you only need to put the .htaccess in your root directory? What if the robots enter from another area?

WolfHawk




msg:442034
 6:29 am on Mar 7, 2003 (gmt 0)

Hello everyone...

I'm new to both this forum and Perl & CGI scripts. I've been doing a lot of late-night studying to learn it all quickly, but it's just not possible to learn what I need to know in the short amount of time I have.

To make a long story short and to the point, I'm setting up my first web site and while searching for information about keeping nasty bots away from my site, I found this forum.

The information and knowledge I've come across in this thread is spectacular. However, when I added the rewrite script I found here to my current .htaccess file, I discovered a problem. While it does a great job of keeping nasty bots away and gives me an easy way to ban things not only by browser and name but also by IP address, for some unknown reason every time I click the submit button on any of my online submission forms I get a 403 error message.

This is the only problem that adding rewrite rules appears to be causing. I went through a process of elimination by removing each rewrite rule one at a time until I only had the

RewriteEngine On
RewriteRule ^.* - [F,L]

portion left and I still kept getting a 403 error message every time I clicked on the submission button on any of my forms. Once I removed the remaining section of the rewrite script, my forms began functioning again.

Can any one help me out with this?

For your information, when I first went into my .htaccess file I found the following content already in it. I'm aware this was already brought up in this thread, but I couldn't find any response to the previous similar inquiry from veenerz...

# -FrontPage-

IndexIgnore .htaccess */.?* *~ *# */HEADER* */README* */_vti*

<Limit GET POST>
order deny,allow
deny from all
allow from all
</Limit>
<Limit PUT DELETE>
order deny,allow
deny from all
</Limit>
AuthName www.mydomainname.com
AuthUserFile /the /path/to/a/file.here
AuthGroupFile /and/the/path/to/another/file.here

The rewrite script was easy for me to configure and use, but this mod_access stuff with order deny,allow etc. I just can't understand or figure out.
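(A short annotated reading of the FrontPage block above, as a sketch of how mod_access evaluates it: with "order deny,allow", the Deny directives are checked first, then the Allow directives, and a request matching both is allowed; anything matching neither is also allowed.)

<Limit GET POST>
order deny,allow
# "deny from all" matches every client...
deny from all
# ...but "allow from all" also matches, and with order deny,allow the Allow wins,
# so GET and POST remain open to everyone.
allow from all
</Limit>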

Any and all assistance will be greatly appreciated.

Wolf
