
Forum Moderators: Ocean10000 & incrediBILL


Can anyone add anything else to this?

     
10:29 pm on Jan 19, 2011 (gmt 0)

10+ Year Member



Some excellent info here. The list below seems to be the latest. Are there any others to add, especially for email grabbers and scrapers, please?
<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.your-site.com.* - [F]
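A note on the last two lines of the list: as far as I know, a RewriteRule pattern is matched against the requested URL-path, not against a full http:// URL, so the negated pattern on the final line is true for every request and the pair reduces to "forbid anything referred from that site". Assuming that is the intent (the list does not say), a minimal sketch would be:

```apache
RewriteEngine on
# Assumed intent: refuse requests that arrive with this one referrer.
# Dots are escaped so they match a literal "." rather than any character.
# The trailing "$" is dropped so referring pages below the home page match too.
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org [NC]
RewriteRule .* - [F]
```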
11:09 pm on Jan 19, 2011 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I hate to be a naysayer, but I have a list of literally thousands.

That's another reason why I whitelist instead of blacklist: the list is so long it puts a load on the server.
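For comparison, a whitelist turns the logic around: name the User Agents you accept and refuse the rest. The sketch below is only an illustration, not the poster's actual rules; the browser prefixes and crawler names are placeholder assumptions you would replace from your own logs.

```apache
RewriteEngine on
# Always let robots.txt through so well-behaved crawlers can read it.
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# Accept UAs that begin like mainstream browsers (assumed list).
RewriteCond %{HTTP_USER_AGENT} !^(Mozilla|Opera) [NC]
# Accept named crawlers whose UA does not begin like a browser (assumed list).
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|Slurp) [NC]
# Conditions are ANDed by default, so everything else gets 403 Forbidden.
RewriteRule .* - [F]
```

Many unwanted bots send a UA that begins with "Mozilla", so a UA whitelist on its own is probably only a first filter.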
2:25 pm on Jan 21, 2011 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.your-site.com.* - [F]


This is really a lame effort for your introduction to this forum :(

You simply copied and pasted the opening submission from a very long and old thread [webmasterworld.com] (2001)!

I kinda doubt your reading progressed past page or even approached page 13 of that thread [webmasterworld.com] (as well as anything in between).

Most of the UAs in the 2001 thread are long gone (no longer used). Many participants in that long thread were actually making inquiries for help and submitting copied and pasted lines from other sources (such as you did here). Many of those copied and pasted lines were invalid in 2001, and had you taken the time to read the entire thread, you would have realized that.

There are no copy-and-paste or off-the-shelf solutions, as others have explained.
You simply start with a few lines and learn as you go along. Not only the syntax, but the User Agents and deceptions used by non-compliant visitors (bots, harvesters or otherwise).
3:18 pm on Jan 21, 2011 (gmt 0)

10+ Year Member



"This is really a lame effort for your introduction to this forum"
Don't worry, I wasn't trying to impress you!
10:58 pm on Jan 21, 2011 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



@ Dexie

Hi there,

Most webmasters today use a combination of User Agent blocking (via htaccess), IP blocking (via htaccess or server config), and request type/method filtering (via scripting or server config).
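As a rough illustration of how those three layers might sit together in one .htaccess file (a sketch only; the IP range is a reserved documentation range standing in for a real offender, and the access-control lines use Apache 2.2 syntax to match the "deny from all" seen earlier in the thread):

```apache
# 1) User Agent blocking (mod_rewrite)
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^Email [NC]
RewriteRule .* - [F]

# 2) Request method filtering: refuse anything other than GET, HEAD or POST
RewriteCond %{REQUEST_METHOD} !^(GET|HEAD|POST)$
RewriteRule .* - [F]

# 3) IP blocking (Apache 2.2 style)
#    192.0.2.0/24 is a placeholder range, not a known bad network.
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
```

Apache 2.4 replaces Order/Allow/Deny with Require directives, so the third part would need translating there.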

The User Agents (UAs) in your list are very old, and the list can be made more succinct.

The anchor attribute "^" means "starts with" so all UAs that start with the same letters or words can be combined. Example:

These three lines:

RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector

Can be condensed to:

RewriteCond %{HTTP_USER_AGENT} ^Email [NC,OR]

Any UA that starts with "Email" would be blocked. It is probably safe to do so because IMO there are no legit UAs that actually start that way. Also the added "NC" allows for either upper or lower case (Email or email.)
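Putting the condensed line into a complete rule might look like the sketch below. The grouping of the other names is my own choice for illustration. One detail to watch when condensing: EmailCollector was the last condition in the original list and so carried no [OR]; whichever condition ends up last must likewise have none, since as far as I know a stray [OR] on the final condition can make the rule fire for every request.

```apache
RewriteEngine on
# One case-insensitive prefix replaces EmailSiphon, EmailWolf and EmailCollector.
RewriteCond %{HTTP_USER_AGENT} ^Email [NC,OR]
# Alternation groups several unrelated prefixes into one condition.
RewriteCond %{HTTP_USER_AGENT} ^(CherryPicker|ExtractorPro|LinkWalker) [OR]
# The last condition in the chain carries no [OR].
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac
RewriteRule .* - [F]
```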

Only you can determine what UA is a threat. Example: Some view the Internet Archive as a valuable resource while others see it as copyright infringement. Keep an eye on your raw logs and over time you'll get a better idea what to block and what to let through.

And keep asking questions :)
11:27 pm on Jan 21, 2011 (gmt 0)

10+ Year Member



Many thanks keyplyr for that friendly and helpful answer, it's much appreciated ;-)

When I first started to look into this subject a couple of months ago, I was thinking of just whitelisting the good guys, so that all others were naturally excluded, but now, it seems as though it's probably best the other way around - blacklisting all the bad guys, and all others are let in - is that how you see it?

Also, to block any bad bots, you need to see the IP addresses for them - how do you find that out please? Or is that in the raw logs you referred to? If so, how do I get to those raw logs please?

Thanks.

Dexie
11:54 pm on Jan 21, 2011 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Also, to block any bad bots, you need to see the IP addresses for them - how do you find that out please? Or is that in the raw logs you referred to? If so, how do I get to those raw logs please?


"raw access logs" are dependent upon your server/host.
Most "shared hosting" providers offer standard "raw access logs"; some require you to turn that option on via the control panel (CP), while others have logs on by default.

Other hosting providers use scripts or stats to provide a crippled form of logs, which are a real bother. In most instances, if this is the only option your host provides for logs, it would be in your best interest to acquire a new host that does provide "raw access logs".

Taking the IP from the "raw access logs", you do a lookup at ARIN (North America, which is working rather poorly these days, and intentionally so), RIPE (Europe), or APNIC (Asia and Oceania). There are a few other smaller registrars; AfriNIC (Africa) is a real pest in spurts.

There are DNS and IP tools websites that combine all the registrar search options together, with new ones appearing all the time.

However, the registrars provide the best and most accurate recent data.
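For reference, on Apache the raw access log is usually written in the "combined" format, which records the two fields this thread cares about: the client IP and the User Agent. The directives below are the stock definition as I understand it. They belong in the main server config rather than .htaccess, so on shared hosting they are the host's to set, and the log path shown is an assumption.

```apache
# %h = client IP, %t = time, %r = request line, %>s = status, %b = bytes sent,
# followed by the Referer and User-Agent request headers.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
# Hypothetical path; hosts choose their own location.
CustomLog logs/access_log combined
```

The first field of each log line is the IP you would take to the registrar lookup, and the last quoted field is the UA you would match in a RewriteCond.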
12:01 am on Jan 22, 2011 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



There are a couple of dozen terms for User Agents that are perfectly unacceptable to webmasters and have been for a long while, and yet bots and other tools continue to utilize these words in their UAs.

There are two threads near the top of the page that were intended to assist newbies (all new threads are automatically posted below these two priority threads).

One of these is Quick primer on identifying bot activity [webmasterworld.com]; Section 8 of Ocean's opening post provides some common abusive UA names.
9:01 pm on Jan 22, 2011 (gmt 0)

10+ Year Member



Many thanks wilderness, that info is very useful.
9:31 pm on Jan 22, 2011 (gmt 0)

10+ Year Member



Keyplyr made quite a helpful post above as well. I just wondered if there is any way of writing the syntax so that the word "email" is matched in whatever case and wherever it comes in the User Agent. How would you do that?
9:43 pm on Jan 22, 2011 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



#keyplyr's example means "BEGINS WITH", designated by the caret anchor. The trailing NC in brackets designates NO CASE (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} ^Email [NC,OR]

#Same example, however ABSENT the caret anchor it means "CONTAINS", which is any location in the UA
RewriteCond %{HTTP_USER_AGENT} Email [NC,OR]
9:59 pm on Jan 22, 2011 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Dexie,
There are fundamental anchors that you need to understand in both mod_setenvif and mod_rewrite.

1) Begins with: designated by a leading caret character.
2) Ends with: designated by a trailing dollar-sign character.
3) Contains (anywhere in the UA): no leading or trailing anchor character.
4) You may also use (and sometimes must use) both a leading caret and a trailing dollar sign, which means "begins with and ends with".
5) In some instances you may also use quotes to designate "EXACTLY AS"; in my example, "exactly as" would include the blank space.

Please note: example (5), "exactly as", doesn't usually work in mod_rewrite, but does function as intended in mod_setenvif.
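The five cases above could be sketched as follows. The UA strings are illustrative only, not a recommended block list, and the variable name bad_bot is an assumption of mine.

```apache
RewriteEngine on
# 1) Begins with "Email"
RewriteCond %{HTTP_USER_AGENT} ^Email [NC,OR]
# 2) Ends with "Extractor"
RewriteCond %{HTTP_USER_AGENT} Extractor$ [NC,OR]
# 3) Contains "email" anywhere in the UA
RewriteCond %{HTTP_USER_AGENT} email [NC,OR]
# 4) Begins and ends: the whole UA is the bare word "Wget", nothing before or after
RewriteCond %{HTTP_USER_AGENT} ^Wget$
RewriteRule .* - [F]

# 5) mod_setenvif with a quoted pattern, so the blank space is part of the match.
SetEnvIfNoCase User-Agent "^Microsoft URL" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

As I understand it, the quotes only keep the space inside the pattern; the caret and dollar sign still decide where in the UA the match must fall. In mod_rewrite a space is commonly written with a backslash instead, as in ^Microsoft\ URL.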

Don
 
