Forum Moderators: DixonJones

Message Too Old, No Replies

How to protect from site copiers like teleport?

How to stop undesired crawlers and spiders

         

silverbytes

1:42 pm on Sep 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In my logs a spider called "Stupid email harvester" is listed...

I wonder how to exclude or stop undesired spiders and crawlers, I was thinking in products that copies sites and stuff like that...

Experiences?

pendanticist

1:48 pm on Sep 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Googling "perfect .htaccess [google.com]" brings up many starting points depending on your specific needs.

Pendanticist.

silverbytes

1:15 am on Sep 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



sounds good but I´m newbie. Is it a file I can upload to .. where?

pendanticist

3:47 pm on Sep 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>Is it a file I can upload to .. where?

Ah, that's the simple part. :)

Create it (the file) with notepad and save it as .htaccess.txt.

(Do NOT use MSWord or any other application that transfers formatting as it'll drive you nuts!)

From there, you fill it in with what you want/need and upload to the same place your index.html goes.

Two things to always

[red][b]be aware of[/b][/red]
here in WebmasterWorld's forums, especially when considering copy and paste:

  • Vertical pipes come out as broken pipes.
  • (under certain conditions) Whitespaces are removed.

    Also, be vary aware that you can totally discomboobulate your domain if you go at this without a full understanding of what .htacess does. I'm still learning myself.

    I might also suggest A Close to perfect .htaccess ban list - Part 2 [webmasterworld.com] as required reading.

    If you run into trouble: ask as a stand-alone question/thread (if you can't find it elsewhere) and you'll get all the help you need.

    Oh, one more thing. Once you get your .htaccess file functional - copy it, rename (the extension) as '.bkp' rather than '.txt' (for backup) and upload immediately. That way, you have your latest version sitting there in case you need it.

    Hope this helps.

    Pendanticist.

  • silverbytes

    3:52 pm on Sep 18, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Thanks! Sounds like I´m gonna destroy my site...
    I feel some scary, thought it was less complicated.

    pendanticist

    4:00 pm on Sep 18, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Don't worry about it, you'll do fine. The precautions are important, but Do Not let that stand in your way.

    Believe me, the rewards make the consternation and effort well worth the time and energy it takes to build one.

    Pendanticist.

    claus

    4:18 pm on Sep 18, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    First off, your site will not be destroyed - what you do is simply to upload one extra file to your document root, you don't delete anything so don't worry too much everything can be restored. This extra file can be quite powerful, however.

    I really just wanted to add this:

    The real name for this file must be ".htaccess" - there is no file extension and there is a dot before the file name.

    When you create the file on your pc, just call it "htaccess.txt" - then, after you have uploaded it to your server, rename it to ".htaccess" otherwise it won't work.

    When you do this, the file will disappear from view. The dot-before-filename simply means "this file is invisible". No problem, it's still there and it will work.

    Problems? Restore:

    Make an empty file with any name and upload it. After upload, rename it to ".htaccess" and it will replace the one causing problems - then everything will revert to "normal". Here, your off-line backup of the file will be valuable.

    /claus

    keyplyr

    6:15 pm on Sep 18, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




    after you have uploaded it to your server, rename it to ".htaccess"... When you do this, the file will disappear from view

    Well, it depends on your server config and the ftp client. I see mine ;)

    bcolflesh

    6:18 pm on Sep 18, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    ...the file will disappear from view

    Add -al to the filemask of your FTP client - then you'll see all your files.

    silverbytes

    12:49 am on Sep 19, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Thank you! All of you!

    Is this ready to copy and paste?:

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^attach [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BackWeb [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bandit [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Buddy [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Collector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Copier [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DA [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DISCo\ Pump [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Download\ Wonder [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Downloader [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Drip [OR]
    RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FileHound [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetSmart [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
    RewriteCond %{HTTP_USER_AGENT} ^gotit [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Grabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
    RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
    RewriteCond %{HTTP_USER_AGENT} ^HTTrack [OR]
    RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Iria [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JOC [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JustView [OR]
    RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^lftp [OR]
    RewriteCond %{HTTP_USER_AGENT} ^likse [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Magnet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mag-Net [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Memo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mirror [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetZip [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Ninja [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
    RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Pockey [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Pump [OR]
    RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Reaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Recorder [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Siphon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Snake [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SpaceBison [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Stripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Vacuum [OR]
    RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Webster [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Whacker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Xaldon
    RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SearchExpress [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ZyBorg [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebBandit [OR]
    RewriteCond %{HTTP_USER_AGENT} FrontPage [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
    RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
    RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
    RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
    RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
    RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
    RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Zeus

    RewriteRule ^.* - [F,L]

    Some could be repeated.. is that bad?
    I pasted from several threads....

    What do you say?

    bcolflesh

    1:10 am on Sep 19, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Some could be repeated.. is that bad?

    Definitely remove your dupes - putting them in alphabetical order will make future editing easier as well.

    silverbytes

    1:44 am on Sep 19, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Well here´s my final list, please if you find something wrong tell me!

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^attach [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BackWeb [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bandit [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Buddy [OR]
    RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Collector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Copier [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DA [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DISCo\ Pump [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Download\ Wonder [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Downloader [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Drip [OR]
    RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FileHound [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FrontPage [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetSmart [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
    RewriteCond %{HTTP_USER_AGENT} ^gotit [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Grabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
    RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
    RewriteCond %{HTTP_USER_AGENT} ^HTTrack [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Iria [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JOC [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JustView [OR]
    RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^lftp [OR]
    RewriteCond %{HTTP_USER_AGENT} ^likse [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Magnet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mag-Net [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Memo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
    RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mirror [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetZip [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Ninja [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
    RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Pockey [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Pump [OR]
    RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Reaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Recorder [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SearchExpress [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Siphon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Snake [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SpaceBison [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Stripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Vacuum [OR]
    RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebBandit [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Webster [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Whacker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Xaldon
    RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Zeus
    RewriteCond %{HTTP_USER_AGENT} ^ZyBorg [OR]
    RewriteRule ^.* - [F,L]

    Specially what is RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]?

    jdMorgan

    2:11 am on Sep 19, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    silverbytes,

    You've got a problem waiting to bite you, there. All lines that start with RewriteCond must end with [OR] except the last one. If you look down near the end of your list, you'll see some problems.

    Also, the RewriteRule can be simplified to:


    RewriteRule .* - [F]

    The Mozilla.*NEWT line blocks a user-agent that starts with "Mozilla" and has "NEWT" in the middle somewhere.

    Now, you need to add a line to block "Stupid email harvester" :)

    Ref: Introduction to mod_rewrite [webmasterworld.com]

    Jim

    claus

    9:15 am on Sep 19, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    silverbytes, you may want to consider the last line "ZyBorg" for a moment. I'm not sure you would want to ban that one. The full user-agent string of that one is one of these two:

    Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http://www.WISEnutbot.com)
    Mozilla/4.0 compatible ZyBorg/1.0 DLC (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)

    As you see, it's a real Search Engine spider. On the other hand, if this specific search engine has caused problems to you, you might want to do it anyway.

    There's a few threads in "Search Engine Spider Identification" about it: [webmasterworld.com...] the most recent being this one with the little flattering name:

    "ZyBorg/1.0 violates robots.txt"
    [webmasterworld.com...]

    It seems as there are some sites where it behaves nicely and others where it does not. Post #34 of that thread mentions an alternative approach - in stead of banning it altogether, just enforce your robots.txt on it. That way you still get spidered on the right pages.

    /claus


    Added:

    Just saw that you were not banning it anyway. Your last line bans a User-Agent whose name starts with ZyBorg:

    RewriteCond %{HTTP_USER_AGENT} 
    ^
    ZyBorg [OR]

    - due to the little ^ sign, that means "from the beginning of the string", so you have no problems here. If you want to ban ZyBorg you must remove that sign, but then again, you would not want to ban #2 of the two user-agents posted above (it's a link-checker), so it would have to be:

    RewriteCond %{HTTP_USER_AGENT} ZyBorg/1\.0\ \( [OR]

    I've included the space and the parenthesis.

    spud01

    9:32 am on Sep 19, 2003 (gmt 0)

    10+ Year Member



    Well its all dandy if your site resides on a *nix server where the .htaccess file works, but how do Windows based webservers try to do the same thing, if the .htaccess file does not work on them.

    Is there a way to ban all those spam harverters?

    thx.

    p.s: nice list of bot harversters silverbytes, thx for sharring that with us.

    claus

    10:21 am on Sep 19, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    spud01, you're right we tend to think the whole world is Apache sometimes, sorry about that.

    >> how do Windows based webservers try to do the same thing

    You need to look into a feature called IIS ISAPI filters i believe. I know the term for it, but i don't really speak any IIS.

    Otherwise it can be done via the "Internet Information Services Manager" as explained in this thread:
    [webmasterworld.com...]

    Or via ASP as this thread deals with:
    [webmasterworld.com...]

    /claus

    Woz

    10:36 am on Sep 19, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    The best bot banner script for ASP that I have found is this one [evolvedcode.net] which I found very easy to setup and add rogue bots. You could perhaps add it into global.asa but I prefer to add it into every page to ensure nothing sneaks past. Very fast and doesn't seem to slow things down that I can tell.

    Onya
    Woz

    silverbytes

    4:08 pm on Sep 19, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    I FOUND this I think it´s cool!

    [fleiner.com...]

    how to trap undesired bots, and those who violate robots.txt

    jdMorgan

    4:23 pm on Sep 19, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    silverbytes,

    Be careful, there are errors on that site.

    For example, this code


    RewriteEngine on
    Options +FollowSymlinks
    RewriteBase /

    where the order of the first two directives is incorrect. Options should precede RewriteEngine on.

    For a nice robots trap script, try this thread [webmasterworld.com], and follow it back to preceding threads to get the background. Make sure you read message #13 in the cited thread - it contains an important correction to the modified scripts.

    Jim

    silverbytes

    11:18 pm on Sep 20, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Ok, I will drop that link.

    Now here´s my final list, with all annoyances I detected and others taken from what upset other people too; it took me a lot of work and you can use it freely, and please advise of any errors!

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} 209.237.232.84 [OR]
    RewriteCond %{HTTP_USER_AGENT} ^http://www.iaea.org [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Advanced Email Extractor [OR]
    RewriteCond %{HTTP_USER_AGENT} ^attach [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Au2Email [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BackWeb [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bandit [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Buddy [OR]
    RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Collector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Copier [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DA [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DISCo\ Pump [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Download\ Wonder [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Downloader [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Drip [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DTS Agent [OR]
    RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Fetch API Request [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FileHound [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FrontPage [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetSmart [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
    RewriteCond %{HTTP_USER_AGENT} ^gotit [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Grabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
    RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
    RewriteCond %{HTTP_USER_AGENT} ^HTTrack [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Almaden.ibm.com/cs/crawler [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
    RewriteCond %{HTTP_USER_AGENT} ^InternetSeer.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Iria [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Java1.x.x [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JOC [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JustView [OR]
    RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Linkman [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^lftp [OR]
    RewriteCond %{HTTP_USER_AGENT} ^likse [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Magnet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mag-Net [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Memo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Microsoft Data Access [OR]
    RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mirror [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*Indy [OR]
    RewriteCond %{HTTP_USER_AGENT} ^MSFrontPage [OR]
    RewriteCond %{HTTP_USER_AGENT} ^MS FrontPage [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetResearchServer[OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetZip [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Ninja [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NPBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
    RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Pockey [OR]
    RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Pump [OR]
    RewriteCond %{HTTP_USER_AGENT} ^QuepasaCreep [OR]
    RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Reaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Recorder [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SearchExpress [OR]
    RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Siphon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Snake [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SpaceBison [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Stripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Stupid email harvester [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Vacuum [OR]
    RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Webster [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebGather [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Whacker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Xaldon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ZyBorg/1\.0\ \( [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Zeus
    RewriteRule .* - [F]

    Any feedback about it? :)

    jdMorgan

    11:50 pm on Sep 20, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    silverbytes,

    You probably want this for your first RewriteCond line:


    RewriteCond %{REMOTE_ADDR} ^209\.237\.232\.84$ [OR]

    Alternately, if this is a referer rather than a visitor, use {HTTP_REFERER}

    Please help save WebmasterWorld disk space and don't post that long list again until you get more comments on it. :)

    Jim

    claus

    12:05 pm on Sep 21, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Added:For new readers of this thread as well as for silverbytes:

    You really would want to read through the entire 20-something page thread called "A close to perfect .htaccess ban list" (!). A lot of long posts can be avoided by careful reading of that huge thread. Especially, in part 2 of it, there are some very good tips on optimizing those banning statements:

    (1) A Close to perfect .htaccess ban list [webmasterworld.com]
    (2) A Close to perfect .htaccess ban list - Part 2 [webmasterworld.com]


    silverbytes, you still have quite a few duplicates, although you probably don't see them. Also, you must escape the dots in the whole list like this

    \. (put a slash before them)

    -otherwise they mean "any character" which can be a dot but also anything else, and a dot is what you want, so the slash makes sure that it is a dot.

    You must do the same thing with spaces. Otherwise they will give you errors. Always put a backslash before them. Here are three lines i have corrected, but the third will be omitted further down in this post:

    RewriteCond %{HTTP_USER_AGENT} ^http://www\.iaea\.org [OR] 
    RewriteCond %{HTTP_USER_AGENT} ^Almaden\.ibm\.com/cs/crawler [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Stupid\ email\ harvester [OR]

    I would not correct this one, as the dot can be a space or whatever else:

    RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR] 

    This one:

    RewriteCond %{HTTP_USER_AGENT} ^Java1.x.x [OR]

    ...i would write this way, as anything could follow "Java" (numbers, dots, spaces, characters).

    RewriteCond %{HTTP_USER_AGENT} ^Java [OR] 

    Removing duplicates - or: "Reducing the number of lines"

    The .htaccess file has some strong tools available to reduce the size of your .htaccess file itself, as well as the processing time. For reference, please read the full WebmasterWorld thread "A Close to perfect .htaccess ban list", especially part 2.

    Here are a few lines that i would reduce to one:

    RewriteCond %{HTTP_USER_AGENT} ^Advanced Email Extractor [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Stupid email harvester [OR]

    - the above becomes:

    RewriteCond %{HTTP_USER_AGENT} Email.?(Coll¦Extr¦Harv¦Siph¦Wolf) [NC,OR] 

    Which is ("stupid email" or just) "Email" followed by any charater (eg. a space). The "?" means that this character does not need to be there, it's optional.

    Then follows either of: "Harv", "Extr", "Siph", "Coll", "Wolf".

    NC means that the User-Agent can use lowercase or UpperCase and this does not matter, eg. "emailsiphon" or "Emailsiphon" or "EmailSiphon". Be sure to use [NC,OR] on all the lines that are relevant (except of course, those that have only numbers.)

    Important!
    Be sure to replace the broken pipe "¦" with one you enter from your keyboard, otherwise it will not work.

    Likewise, these become one:

    RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Microsoft Data Access [OR]

    RewriteCond %{HTTP_USER_AGENT} ^Microsoft.(Data¦URL) [NC,OR] 

    And these:

    RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Download\ Wonder [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Downloader [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SmartDownload
    RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
    [OR]

    - become:

    RewriteCond %{HTTP_USER_AGENT} ^(Smart¦Mass)?.?Download [OR] 

    The "?" means that the expression in parenthesis can be there or not.

    And these:

    RewriteCond %{HTTP_USER_AGENT} ^FrontPage [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^MSFrontPage [OR]
    RewriteCond %{HTTP_USER_AGENT} ^MS FrontPage [OR]

    - become:

    RewriteCond %{HTTP_USER_AGENT} ^(MS.?)?FrontPage [NC,OR]

    And these:

    RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetSmart [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]

    -become:

    RewriteCond %{HTTP_USER_AGENT} ^Get(Right¦Smart¦Web) [NC,OR] 

    And these:

    RewriteCond %{HTTP_USER_AGENT} ^InternetSeer.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]

    - become:

    RewriteCond %{HTTP_USER_AGENT} InternetSeer.com [NC,OR] 

    And these:

    RewriteCond %{HTTP_USER_AGENT} ^JOC [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]

    - become:

    RewriteCond %{HTTP_USER_AGENT} ^JOC [OR] 

    - you already had no. 2 in your no. 1 statement!

    And these:

    RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]

    - become:

    RewriteCond %{HTTP_USER_AGENT} ^Image\ (Strip¦Suck) [NC,OR] 

    And these:

    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]

    - become:

    RewriteCond %{HTTP_USER_AGENT} ^Offline\ (Explorer¦Navigator) [NC,OR] 

    And these:

    RewriteCond %{HTTP_USER_AGENT} ^Xaldon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]

    - become:

    RewriteCond %{HTTP_USER_AGENT} ^Xaldon [OR] 

    - you already had no. 2 in your no. 1 statement!

    And these:

    RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*Indy [OR]

    - become:

    RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*(NEWT¦Indy) [OR] 

    And these:

    RewriteCond %{HTTP_USER_AGENT} ^Magnet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mag-Net [OR]

    - become:

    RewriteCond %{HTTP_USER_AGENT} ^Mag.?net [NC,OR] 

    And these:

    RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]

    - become:

    RewriteCond %{HTTP_USER_AGENT} ^Super(Bot¦HTTP) [OR] 

    And these:

    RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetResearchServer[OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetZip [OR]

    - become (one line):

    RewriteCond %{HTTP_USER_AGENT} ^Net.?(Ants¦ResearchServer¦Spider¦Vampire¦Zip) [OR] 

    And these:

    RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebGather [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Webster [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]

    - become (three lines - i had them on one, but i thought the three would be more accurate):

    RewriteCond %{HTTP_USER_AGENT} ^Web.?(Auto¦Band¦Fetc¦Gath¦Imag¦Leac¦Reap¦Saug¦Strip¦Suck¦Whack¦ZIP) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website.?(eXtr¦Ques) [NC,OR]

    You could even catch all these by this one rule, but as it will catch all starting with "Web" it might catch too much:

    RewriteCond %{HTTP_USER_AGENT} ^Web [NC,OR]

    You could optimize even more using the OR operator "¦" (remember to replace it with one entered from your keyboard), but i'll leave the rest up to you.

    Although the post is very long now, i think it is appropriate to include the lines from your last post that i did not optimize in the above, so that you know which are left. They are listed below.

    If you want feedback on some rules that you want to make, please post only these rules and not the whole list, we already have it now ;)

    /claus


    Un-optimized statements:

    RewriteCond %{HTTP_USER_AGENT} ^Au2Email [OR] 
    RewriteCond %{HTTP_USER_AGENT} ^attach [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BackWeb [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bandit [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Buddy [OR]
    RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Collector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Copier [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DA [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DISCo\ Pump [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Drip [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DTS Agent [OR]
    RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Fetch API Request [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FileHound [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
    RewriteCond %{HTTP_USER_AGENT} ^gotit [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Grabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
    RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
    RewriteCond %{HTTP_USER_AGENT} ^HTTrack [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
    RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Iria [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JustView [OR]
    RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Linkman [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^lftp [OR]
    RewriteCond %{HTTP_USER_AGENT} ^likse [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Memo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mirror [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Ninja [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NPBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
    RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Pockey [OR]
    RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Pump [OR]
    RewriteCond %{HTTP_USER_AGENT} ^QuepasaCreep [OR]
    RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Reaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Recorder [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SearchExpress [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Siphon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Snake [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SpaceBison [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Stripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Vacuum [OR]
    RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Whacker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
    RewriteCond %{HTTP_USER_AGENT} ZyBorg/1\.0\ \( [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Zeus

    silverbytes

    3:40 pm on Sep 22, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Wonder reply! May I rate it? ;)

    Thanks a lot! How important is .httaccess to be short?
    Mine is actually 5.39 k ...

    claus

    5:14 pm on Sep 23, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    I'm not able to tell if 5K is a problem on your server and for your site(s), it all depends on so much.

    The thing about htaccess is that it's read for each and every request that comes into your site. So, if you've got a gazillion pageviews a day it's really critical to optimize, whereas if you've got a single pageview it will be less important.

    There are people in the two threads i posted a link to that claim that they have no performance problems with very large htaccess files. There is really no way to tell you if 5K is a problem on your server and for your amount of traffic, apart from saying "try it". Personally, i don't think it will be a major problem, but don't take my word for it - try it and decide if you think so yourself. If it infuences performance, it will do so mostly when traffic is at a high level (during peak-hours).

    silverbytes

    7:43 pm on Sep 23, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    What if I quit some entries in .htaccess and put it into robots.txt?

    lets say half of annoyances and use

    User-agent: Annoyance
    Disallow: /

    Is it a good or bad idea?

    jdMorgan

    12:08 am on Sep 24, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    On one of my sites, .htaccess is currently 26kB. It's a lightly-used site (<1000 visitors per day), and I haven't trimmed the lines inserted into .htaccess to ban bad robots recently. No-one notices.

    The only entries that will work in robots.txt are the ones that control legitimate spiders from "good" companies. Bad 'bots will ignore your robots.txt; some of them won't even fetch it. As wilderness says, "Use robots.txt to request that robots stay out" of certain subdirectories and pages. The good robots will stay out, and the bad (or broken) ones won't.

    One way to speed up .htaccess is to take care of the most-often received requests first. The sooner mod_rewrite exits, for example, the sooner the server can finish up with any other modules you also use in .htaccess. Start with your most general RewriteRules -- the ones that are needed most often, and procede to your most detailed rules -- the ones that are rarely needed. Use the [L] flag whenever possible to stop rewriting unless you actually need to continue rewriting the URL of the current HTTP request. Since doing a multi-step rewrite is pretty rare, you can almost always use the [L] flag.

    Jim