Perl Server Side CGI Scripting Forum

A Close to perfect .htaccess ban list
toolman
msg:441824 - 3:30 am on Oct 23, 2001 (gmt 0)

Here's the latest rendition of my favorite ongoing artwork... my beloved .htaccess file. I've become quite fond of my little buddy, the .htaccess file, and I love the power it gives me to exclude vermin, pestoids, and undesirable entities from my web sites.

Gorufu, littleman, Air, SugarKane? Do you guys see any errors or better ways to do this? Anybody got a bot to add before I stick this in every site I manage?

Feel free to use this on your own site and start blocking bots too.

(the top part is left out)

<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
# forbid any request referred from this spambot domain
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org [NC]
RewriteRule .* - [F]
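A quick way to check a list like this from the command line (Wget's -U option sets the user-agent string it sends; substitute your own domain for example.com):

wget -U "EmailSiphon" http://www.example.com/

A blocked agent should get a 403 Forbidden back instead of the page, while a plain wget request is still caught by the ^Wget line above.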

 

WolfHawk
msg:442034 - 6:29 am on Mar 7, 2003 (gmt 0)

Hello everyone...

I'm new to both this forum and Perl & CGI scripts. I've been doing a lot of late-night studying to learn it all quickly, but it's just not possible to learn what I need to know in the short amount of time I have.

To make a long story short and to the point, I'm setting up my first web site and while searching for information about keeping nasty bots away from my site, I found this forum.

The information and knowledge I've come across in this thread is spectacular. However, when I added the rewrite script I found here to my current .htaccess file, I discovered a problem. While it does a great job of keeping nasty bots away and gives me an easy way to ban things not only by browser and name but by IP address as well, for some unknown reason every time I click the submit button on any of my online submission forms I get a 403 error message.

This is the only problem that adding rewrite rules appears to be causing. I went through a process of elimination by removing each rewrite rule one at a time until I only had the

RewriteEngine On
RewriteRule ^.* - [F,L]

portion left and I still kept getting a 403 error message every time I clicked on the submission button on any of my forms. Once I removed the remaining section of the rewrite script, my forms began functioning again.

Can anyone help me out with this?
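One commonly suggested workaround (a sketch only; cgi-bin/formmail.cgi below is a stand-in for whatever script actually receives the form) is to pass the form handler through with an [L] rule before any of the ban rules are tested:

RewriteEngine On
# let the form handler through before the ban rules run
RewriteRule ^cgi-bin/formmail\.cgi$ - [L]
# ...ban conditions and the final RewriteRule ^.* - [F] follow here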

For your information, when I first went into my .htaccess file I found the following content already in it. I'm aware this was already brought up in this thread, but I couldn't find any response to the previous similar inquiry from veenerz...

# -FrontPage-

IndexIgnore .htaccess */.?* *~ *# */HEADER* */README* */_vti*

<Limit GET POST>
order deny,allow
deny from all
allow from all
</Limit>
<Limit PUT DELETE>
order deny,allow
deny from all
</Limit>
AuthName www.mydomainname.com
AuthUserFile /the /path/to/a/file.here
AuthGroupFile /and/the/path/to/another/file.here

The rewrite script was easy for me to configure and use, but this "mod_access" stuff with order deny,allow etc. I just can't understand or figure out.
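For reference, order deny,allow just tells mod_access to evaluate all Deny directives first and all Allow directives second, with the later group winning when both match. Read that way, the FrontPage block above is less mysterious (this is an annotated copy of it, not new directives):

<Limit GET POST>
order deny,allow
deny from all # first pass: deny everybody...
allow from all # ...second pass: allow everybody back in, so GET and POST are open
</Limit>
<Limit PUT DELETE>
order deny,allow
deny from all # no allow line here, so PUT and DELETE stay denied for everyone
</Limit>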

Any and all assistance will be greatly appreciated.

Wolf

DerekT
msg:442035 - 5:24 am on Mar 8, 2003 (gmt 0)

If anyone would like to prevent "web copiers" or "offline browsers" without the need to update an .htaccess file, visit this thread for a great PHP solution.

[webmasterworld.com...]

It monitors page requests, and if a user requests too many within a set timeframe, they are given a custom 503 message.

Initially I used a long .htaccess file to prevent these programs; however, it didn't always work, and I always had to add user-agents to the file when new programs were released. It also doesn't protect against these programs when people change their user-agent to IE or Netscape.

Once I placed this script on my site, I caught 8 different (unique) people over a 24-hour period trying to leech my site. They all had normal browser user-agent settings, so .htaccess wouldn't help. Since my site is all PHP and MySQL generated, this copying really hit my server hard. Some were requesting up to 17 pages a second!

Now that they are caught in realtime, my server is performing much better and my regular visitors are very happy.

If you visit the thread, notice a few changes I added to ensure Googlebot is exempted from the limits and can request as many pages as it wishes.
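The linked solution is PHP, but the shape of the idea fits in a few lines of Perl too. A minimal sketch, assuming a CGI setup where this runs before the page is built; the /tmp/hitcount counter file, the 60-second window, and the 30-request ceiling are all made-up values, and spotting Googlebot by user-agent string alone is spoofable:

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;        # O_RDWR, O_CREAT for the tie below
use SDBM_File;    # small DBM file to hold per-IP hit counters

my $ip       = $ENV{REMOTE_ADDR}     || '0.0.0.0';
my $ua       = $ENV{HTTP_USER_AGENT} || '';
my $window   = 60;    # seconds per counting window
my $max_hits = 30;    # requests allowed per window

unless ($ua =~ /Googlebot/i) {    # let the crawler through untouched
    tie my %hits, 'SDBM_File', '/tmp/hitcount', O_RDWR | O_CREAT, 0644
        or die "cannot tie counter file: $!";
    my ($count, $start) = split /:/, ($hits{$ip} || '0:' . time);
    ($count, $start) = (0, time) if time - $start > $window;   # new window
    $hits{$ip} = ++$count . ":$start";
    untie %hits;

    if ($count > $max_hits) {
        print "Status: 503 Service Unavailable\n";
        print "Retry-After: $window\n";
        print "Content-type: text/html\n\n";
        print "<p>Too many requests; please wait $window seconds.</p>\n";
        exit;    # stop here; the leecher never gets the real page
    }
}
# ...fall through and generate the page as usual.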

StopSpam
msg:442036 - 5:31 pm on Mar 12, 2003 (gmt 0)

I recently saw a post from someone who had written Perl code that could block a robot based on the number of attempts to connect to the server. The only problem with it: it blocked Google as well when it came to index, etc...

I can no longer find that post. Can anyone send me the URL of the post on this forum? I'd like to check the code again.

I have spent a whole day trying to find it with site search, but I can't find it ;-(

I saw it a few days back...

Oaf357
msg:442037 - 6:01 pm on Mar 12, 2003 (gmt 0)

Can someone display the "latest" version of their .htaccess file, please.

jatar_k
msg:442038 - 6:32 pm on Mar 12, 2003 (gmt 0)

Was it this one, StopSpam?
Blocking badly behaved runaway WebCrawlers [webmasterworld.com]

It is PHP though; I am not sure if that's the one you mean.

And welcome to WebmasterWorld, WolfHawk. :)

StopSpam
msg:442039 - 6:57 pm on Mar 12, 2003 (gmt 0)

jatar_k, thank you very much...
This is indeed the post I was looking for, I think. I had already given up on finding it again...

So really, thanks. From now on I'll flag posts that I find interesting so I can easily find them again.

;-)

DerekT
msg:442040 - 7:35 pm on Mar 12, 2003 (gmt 0)

StopSpam

If you had looked at my post, you would have seen the reference to the code.

StopSpam
msg:442041 - 7:56 pm on Mar 12, 2003 (gmt 0)

You are right, sorry; credit to you as well, you found it first.

;-) I had read your message, but I guess my mind was somewhere else at the moment I wrote the reply, and I forgot about you. Sorry.

Thanks.

I'm trying to write code that blocks a bot by IP for making multiple connections to a site, but I don't want to use a separate data file as a counter... I want to keep it all in the code...

DerekT
msg:442042 - 8:09 pm on Mar 12, 2003 (gmt 0)

StopSpam,

You could use a single flat text file and load it into an array, but I haven't tried that. It would also probably use more CPU than writing separate files per IP. Another possibility would be a MySQL database, but then you would have even more overhead from reads/writes to the database under heavy load.

I have been hit really hard by these programs since my site hosts over 20,000 images and movies. I changed the line in the code to use 4,096 IP MD5 hashes instead of the 256 the script has by default and have had no performance problems.

I even customized the 503 page that is displayed. The page explains why they are seeing the message (to prevent leeching, slow performance for regular visitors, etc.) and even has a JavaScript countdown timer that starts at 60 and, when it reaches 0, forwards them to the page/image/movie they originally requested.

StopSpam
msg:442043 - 8:33 pm on Mar 12, 2003 (gmt 0)

Wow, I am impressed by how well it works for you...

What I want is to write a piece of Perl code to stop brute-force attacks on a password-protected directory. Let's say that after 10 wrong attempts the script takes action against the user and forwards them to a different page, or something like that.
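Nothing in this thread implements that, but one rough way to approximate it is a log-watching script run from cron. A sketch only, with the assumptions flagged: the access log path and .htaccess path below are placeholders, the log is in common/combined format, and failed logins show up as 401 responses:

#!/usr/bin/perl
use strict;
use warnings;

my $log      = '/var/log/apache/access_log';          # assumed path
my $htaccess = '/home/site/public_html/.htaccess';    # assumed path, must be writable
my $limit    = 10;                                    # wrong attempts allowed

# Count 401 (authorization failed) responses per IP address.
my %failures;
open my $in, '<', $log or die "cannot read $log: $!";
while (<$in>) {
    # common/combined format: IP ident user [date] "request" status size ...
    my ($ip, $status) = /^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3})/;
    $failures{$ip}++ if defined $status && $status == 401;
}
close $in;

# Append a deny line for each IP over the limit.
open my $out, '>>', $htaccess or die "cannot append to $htaccess: $!";
for my $ip (sort grep { $failures{$_} > $limit } keys %failures) {
    print $out "deny from $ip\n";
}
close $out;

Note that repeated runs would append duplicate lines; real code would also remember which IPs are already banned.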

DerekT
msg:442044 - 8:51 pm on Mar 12, 2003 (gmt 0)

StopSpam,

Check your sticky mail.

residuals
msg:442045 - 4:40 am on Mar 27, 2003 (gmt 0)

Wow, just some comments and humor, no questions yet ;)

As someone else stated in a previous post, if I recall... this has been one hell of a read. I read every single post in one sitting tonight. To hell with books!

I just wanted to note, humorously/seriously, that the bad people behind all these bots and spammers may end up reading this thread after they search Google to find out why their "system" isn't working any more, LOL. All the effort you've put into this thread may well be read by the spammers and end up helping them. Still, the harder you make things for beginners, and for robots that wouldn't read these forums anyway (unless robots get so smart that they can read forums), the better.

Maybe this forum should be encrypted in some language no one can read, with only the "good" people getting the decryption software. But to figure out who is bad and who is good, we are back to square one again... lol.

I find this possibly the most interesting and mind-exercising forum I've ever visited!

DrDoc
msg:442046 - 6:07 am on Mar 27, 2003 (gmt 0)

So you only need to put the .htaccess in your root directory. What if the robots enter from another area?

If a robot enters, say, directly at www.example.com/deep/path/to/some/file.html, the server will look for an .htaccess file in each directory along the way, starting at the root, and process any directives it finds before sending the page.
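Spelled out, and assuming the main server config permits overrides at all (the <Directory> path below is just an example), the merge order for that request looks like this:

# httpd.conf must allow .htaccess processing somewhere up the tree:
<Directory "/var/www/html">
    AllowOverride All
</Directory>

# Then for /deep/path/to/some/file.html Apache reads, in order:
#   /.htaccess
#   /deep/.htaccess
#   /deep/path/.htaccess
#   /deep/path/to/.htaccess
#   /deep/path/to/some/.htaccess
# with each later file able to add to or override the earlier ones.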

pmkpmk
msg:442047 - 2:41 pm on Apr 1, 2003 (gmt 0)

Yep. My file is probably outdated too...

Somewhere I read a snippet of code that automatically places bots who request a page forbidden by robots.txt into the ban list. Can't find it anymore though...

Anyone with more details?
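The usual shape of that trick (a sketch under stated assumptions, not necessarily the code pmkpmk remembers) is a trap directory that robots.txt forbids, served by a script that bans whoever ignores the rule. Here robots.txt would carry "Disallow: /trap/", the script below would be what /trap/ serves, and the .htaccess path is a placeholder the web server must be able to write:

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);    # LOCK_EX for safe concurrent appends

my $htaccess = '/home/site/public_html/.htaccess';    # assumed path
my $ip = $ENV{REMOTE_ADDR} || '';

# Only record something that looks like a dotted-quad address.
if ($ip =~ /^\d{1,3}(\.\d{1,3}){3}$/) {
    open my $fh, '>>', $htaccess or die "cannot append to $htaccess: $!";
    flock $fh, LOCK_EX;
    print $fh "deny from $ip\n";    # mod_access picks this up on the next request
    close $fh;
}

print "Content-type: text/plain\n\n";
print "This area is disallowed by robots.txt.\n";

Well-behaved spiders never request /trap/, so only robots that ignore robots.txt end up banning themselves.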

pmkpmk
msg:442048 - 3:38 pm on Apr 1, 2003 (gmt 0)

Here's my list. And it has a problem too - a syntax error hidden somewhere. Can anybody help?

XBitHack on
Options +FollowSymLinks
RewriteEngine On

RewriteCond %{HTTP_REFERER} iaea\.org [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "DTS Agent" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} "Fetch API Request" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} "Indy Library" [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "LINKS ARoMATIZED" [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} "Microsoft URL Control" [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "^DA \d\.\d+" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} "^Download" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} "^Internet Explore" [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4.0$" [OR] # dumb bot
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/\?\?$" [OR] # formmail attacker
RewriteCond %{HTTP_USER_AGENT} "compatible ; MSIE 6.0" [OR] # spambot (note extra space before semicolon)
RewriteCond %{HTTP_USER_AGENT} "efp@gmx\.net" [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} "mister pix" [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} ^Atomz [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EasyDL/\d\.\d+ [OR] # OD
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlickBot [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} ^FrontPage [OR] # stupid user trying to edit my site
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^IE\ \d\.\d\ Compatible.*Browser$ [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^MSIECrawler [OR] # IE’s "make available offline" mode
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^NG [OR] # unknown bot
RewriteCond %{HTTP_USER_AGENT} ^NPBot [OR] # NameProtect spybot
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^PersonaPilot [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^Sqworm [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SurveyBot [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^[A-Z]+$ [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} anarchie [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} cherry.?picker [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} crescent [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} e?mail.?(collector¦magnet¦reaper¦siphon¦sweeper¦harvest¦collect¦wolf) [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} express [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} extractor [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} flashget [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} getright [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} go.?zilla [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} grabber [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} httrack [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} imagefetch [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} net.?(ants¦mechanic¦spider¦vampire¦zip)[NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} nicerspro [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} ninja [NC,OR] # Download Ninja OD
RewriteCond %{HTTP_USER_AGENT} offline [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} snagger [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} tele(port¦soft) [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} vayala [OR] # dumb bot, doesn’t know how to follow links, generates lots of 404s
RewriteCond %{HTTP_USER_AGENT} web.?(auto¦bandit¦collector¦copier¦devil¦downloader¦fetch¦hook¦mole¦miner¦mirror¦reaper¦sauger¦sucker¦site¦snake¦stripper¦weasel¦zip) [NC,OR] # ODs
RewriteCond %{REMOTE_ADDR} "^63\.148\.99\.2(2[4-9]¦[3-4][0-9]¦5[0-5])$" [OR] # Cyveillance spybot
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]¦1[3-9][0-9]¦2[0-4][0-9]¦25[0-5])$ [OR] # NameProtect spybot
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]¦2[0-4][0-9]¦25[0-5])$ [OR] # NameProtect spybot
RewriteCond %{REMOTE_ADDR} ^64\.140\.49\.6([6-9])$ [OR] # Turnitin spybot

RewriteCond %{HTTP_USER_AGENT} ^Zeus

RewriteRule !err_¦robots\.txt - [F,L]

[edited by: jatar_k at 4:36 pm (utc) on April 1, 2003]
[edit reason] sidescroll, had to shrink a line [/edit]

StopSpam
msg:442049 - 5:24 pm on Apr 1, 2003 (gmt 0)

What does this line do?

RewriteRule !err_¦robots\.txt - [F,L]

Can someone explain to me what the F and L mean, or point me to a site that explains it? ;-)

I'd like to learn this, as it's powerful coding ;-)

pmkpmk, thank you big time!

Tamsy
msg:442050 - 7:41 am on Apr 2, 2003 (gmt 0)

Hi pmkpmk

Check your syntax at this line:
RewriteCond %{HTTP_USER_AGENT} net.?(ants¦mechanic¦spider¦vampire¦zip)[NC,OR] # OD

It should read:
RewriteCond %{HTTP_USER_AGENT} net.?(ants¦mechanic¦spider¦vampire¦zip) [NC,OR] # OD

You forgot the [Space] between ..¦zip) and [NC,OR]

pmkpmk
msg:442051 - 10:40 am on Apr 2, 2003 (gmt 0)

StopSpam: I'm not an expert on this topic - rather doing some educated guessing combined with cut'n'paste :-)

The meaning of the "F" and "L" flags was discussed (much) earlier in this thread. The "!err_¦robots\.txt" means that the rule applies to ALL files EXCEPT files beginning with "err_" (in my case those are my error documents, like err_403.html) and robots.txt (in order to give a bot a chance to see where it is not wanted).
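For the record: in mod_rewrite flags, [F] answers the request with 403 Forbidden and [L] makes this the last rule processed for the request, so the pair is the standard hard-ban idiom ("BadBot" below is only a placeholder agent name):

RewriteCond %{HTTP_USER_AGENT} ^BadBot
RewriteRule .* - [F,L]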

franklin dematto
msg:442052 - 6:11 am on Apr 3, 2003 (gmt 0)

Numerous UAs have been tagged in this thread. Could someone classify them? Many people only want to block one kind or the other. For instance, I want to block e-mail harvesters, but don't mind if people download my site for offline viewing or even archiving. And I want to avoid false positives.

ratboy
msg:442053 - 8:59 pm on Apr 9, 2003 (gmt 0)

This is really useful info. I've been wondering how much energy to put into attempting to block spiders, and this has given me enough to make an educated move. Too bad the way for spider programmers to bypass this .htaccess method is so ridiculously easy, but it seems like these techniques will help at least over the short term. Thanks especially to toolman and superman; you guys really put out some good stuff, saving us all a lot of work and many hours of pointless trial and error.

ratboy
msg:442054 - 9:16 pm on Apr 9, 2003 (gmt 0)

Oh, rather than clutter up this section with more samples, I've put what I gather is more or less the version that includes most of the stuff people have added into a text file at tech.ratmachines.com/downloads/sample_wbmw.txt

If there are more things that should be added please post them,
Thanks

notsleepy
msg:442055 - 7:44 pm on Apr 10, 2003 (gmt 0)

ratboy: Good idea on the central location for the file.

I think I have one more for you to add:

RewriteCond %{HTTP_USER_AGENT} ^GornKer [OR]

I couldn't find any information on it but it never touched my robots.txt.

ratboy
msg:442056 - 12:39 am on Apr 11, 2003 (gmt 0)

Thanks, I'll keep it as up to date as I can. The thing that quickly became obvious from reading this really educational discussion was that the technique I had been wanting to use, a robots.txt exclusion, was a complete waste of time, since the only thing any self-respecting spider/crawler programmer would do with that information is seek out the areas that were explicitly denied.

The .htaccess file idea seems like a much better stopgap measure: more versatile and easier to implement. I'll stop in now and then and see if there is anything more to add to it. Kudos to WebmasterWorld for having forums and contributors that can actually teach you something and not waste your time.

Oaf357
msg:442057 - 1:48 am on Apr 11, 2003 (gmt 0)

Okay. I tried to implement the central .htaccess file but got some unusual errors. Any ideas if there is anything missing from that file that would keep it from working?

ratboy
msg:442058 - 6:44 am on Apr 11, 2003 (gmt 0)

Oaf357 - I don't claim any expertise in this stuff; all I can say is that this is what I cut and pasted directly out of this forum, with a few spider additions, which shouldn't change how the script runs. You might try cutting out the first lines of

RewriteCond %{HTTP_REFERER} q=guestbook [NC,OR]

(just the referer ones with guestbook) and see if that makes a difference. Also cut out the first-line source comment, just to be on the safe side, then see if you get the same errors.

I've been running it for a few days without any errors, but that's just one server at one webhost, so I can't tell you there's nothing wrong with it. Maybe some of the other people who have contributed can take a look at tech.ratmachines.com/downloads/sample_wbmw.txt and let us know.

Here are the first and last lines of the script, however, in case someone can spot an error (the dots represent the cut-out part):
===============================

RewriteEngine On
RewriteCond %{HTTP_REFERER} q=guestbook [NC,OR]
.....................
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^ZyBorg

RewriteRule ^.* - [F,L]

===============================

You might want to post exactly what errors you got; then somebody might be able to help you. I'm not very good at this stuff, but some of the people on this forum are.

ratboy
msg:442059 - 8:07 pm on Apr 11, 2003 (gmt 0)

Here is a useful piece from [apache-server.com...] on basic .htaccess troubleshooting. It might help.
==============================================
==============================================

Troubleshooting

Here are some of the most common problems I've seen people have (or have had myself) with .htaccess files. One thing I should stress first, though: the server error log is your friend. You should always consult the error log when things don't seem to be functioning correctly. If it doesn't say anything about your problem, try boosting the message detail by changing your LogLevel directive to debug (or adding a LogLevel debug line if you don't have a LogLevel already).

'Internal Server Error' page is displayed when a document is requested
This indicates a problem with your configuration. Check the Apache error log file for a more detailed explanation of what went wrong. You probably have used a directive that isn't allowed in .htaccess files, or have a directive with incorrect syntax.

.htaccess file doesn't seem to change anything
It's possible that the directory is within the scope of an AllowOverride None directive. Try putting a line of gibberish in the .htaccess file and force a reload of the page. If you still get the same page instead of an 'Internal Server Error' display, then this is probably the cause of the problem. Another slight possibility is that the document you're requesting isn't actually controlled by the .htaccess file you're editing; this can sometimes happen if you're accessing a document with a common name, such as index.html. If there's any chance of this, try changing the actual document and requesting it again to make sure the change is visible.

I've added some security directives to my .htaccess file, but I'm not getting challenged for a username and password
The most common cause of this is having the .htaccess directives within the scope of a Satisfy Any directive. Explicitly disable this by adding a Satisfy All to the .htaccess file, and try again.
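As a sketch of that last fix (the AuthUserFile path is a placeholder, and the directives assume a standard Basic-auth setup):

AuthType Basic
AuthName "Members Only"
AuthUserFile /path/to/.htpasswd
Require valid-user
Satisfy All    # require the password even when an Allow rule already matches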

jdMorgan
msg:442060 - 5:38 am on Apr 12, 2003 (gmt 0)

Syntax errors in the list posted here:

Don't use quotes for mod_rewrite patterns. That's for RedirectMatch syntax.
Comments should be on their own line - Otherwise, you will get warnings if you have that log-level set.

So,
RewriteCond %{HTTP_USER_AGENT} "Microsoft URL Control" [OR] # spambot
should be
# spambot
RewriteCond %{HTTP_USER_AGENT} Microsoft\ URL\ Control [OR]

In the rule,
RewriteRule !err_¦robots\.txt - [F,L]
the underscore in _"err_" needs to be escaped - precede it with a "\".
Also, the alternates in the pattern probably need to be delimited with parentheses:
RewriteRule !(err\_¦robots\.txt) - [F,L]

Also, the broken vertical pipe "¦" character above must be changed to a solid vertical pipe before it can be used in .htaccess.

HTH,
Jim

jdMorgan
msg:442061 - 5:41 am on Apr 12, 2003 (gmt 0)

pmkpmk,

Bad bot script: [webmasterworld.com...]

(See the links at the top of that thread for even more "historical" information on the subject.)

Jim

ladymindy
msg:442062 - 12:17 am on May 11, 2003 (gmt 0)

I added an .htaccess script as described in this forum. It took away all instances of the offline browsers except one. I just noticed an entry for Teleport Pro in my logs
(Agent: Teleport Pro/1.29.1718) for my message board. How did this get through when I used the statement:
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR] ?

Did I write this wrong?
Thanks

ladymindy

boxturt
msg:442063 - 3:57 am on May 11, 2003 (gmt 0)

I've been poring over this forum for hours; trying, tweaking, etc. I have learned so much!

No problem blocking Teleport Pro. Except then I discovered it can be set to disguise itself as
(compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Hotbar 4.0), as well as a few other things.

Now I'm really confused. I can't very well block that, right?!
Suggestions?

Ty

Oaf357
msg:442064 - 4:18 am on May 11, 2003 (gmt 0)

Anything can be disguised. I could give my browser the same agent string as the Googlebot.

You win some, you lose some.
