
Search Engine Spider and User Agent Identification Forum

    
Can anyone add anything else to this?
Dexie




msg:4255436
 10:29 pm on Jan 19, 2011 (gmt 0)

Some excellent info here. This list below seems to be the latest - any others to add? Especially for email grabbers and scrapers please?
<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.your-site.com.* - [F]

 

incrediBILL




msg:4255453
 11:09 pm on Jan 19, 2011 (gmt 0)

I hate to be a naysayer, but I have a list of literally thousands.

That's another reason why I whitelist instead of blacklist, the list is so long it puts a load on the server.
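
For readers who have not seen the whitelist approach in mod_rewrite, a minimal sketch follows. It is not incrediBILL's actual rule set, and the three permitted patterns are illustrative assumptions only. Every condition is negated and there is no [OR], so the conditions are ANDed and the request is refused only when the User Agent matches none of them.

```apache
RewriteEngine on

# Always let robots.txt through so compliant bots can read it.
RewriteRule ^robots\.txt$ - [L]

# Whitelist sketch: the patterns below are assumptions, not a vetted list.
# Negated conditions without [OR] are ANDed together.
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/ [NC]
RewriteCond %{HTTP_USER_AGENT} !^Opera/ [NC]
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteRule ^ - [F]
```

A short allow list like this is checked in a handful of conditions, whereas a blacklist grows with every new bot. Many unwanted bots also claim to be Mozilla, so a whitelist this loose would need further checks in practice.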

wilderness




msg:4256163
 2:25 pm on Jan 21, 2011 (gmt 0)

<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.your-site.com.* - [F]


This is really a lame effort for your introduction to this forum :(

You simply copied and pasted the opening submission from a very long and old thread [webmasterworld.com] (2001)!

I kinda doubt your reading progressed past page or even approached page 13 of that thread [webmasterworld.com] (as well as anything in between).

Most of the UAs in the 2001 thread are long gone (no longer used). Many participants in that long thread were actually making inquiries for help and submitting copied and pasted lines from other sources (such as you did here). Many of those copied and pasted lines were invalid in 2001, and had you taken the time to read the entire thread, you would have realized that.

There are not any copy and paste, or off-the-shelf solutions, as others have explained.
You simply start with a few lines and learn as you go along. Not only the syntax, but the User Agents and deceptions used by non-compliant visitors (bots, harvesters or otherwise).

Dexie




msg:4256189
 3:18 pm on Jan 21, 2011 (gmt 0)

"This is really a lame effort for your introduction to this forum"
Don't worry, I wasn't trying to impress you!

keyplyr




msg:4256445
 10:58 pm on Jan 21, 2011 (gmt 0)

@ Dexie

Hi there,

Most webmasters today use a combination of User Agent blocking (via htaccess), IP blocking (via htaccess or server config), and request type/method filtering (via scripting or server config).
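
As a rough illustration of the IP blocking and method filtering mentioned above, here is a minimal mod_rewrite sketch. The address range 192.0.2.0/24 is a documentation-only placeholder, and the list of permitted methods is an assumption; adjust both to your own site.

```apache
RewriteEngine on

# IP blocking: 192.0.2.x is a placeholder range, not a real offender.
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\. [OR]
# Method filtering: refuse anything other than GET, HEAD or POST.
RewriteCond %{REQUEST_METHOD} !^(GET|HEAD|POST)$
RewriteRule ^ - [F]
```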

The example of User Agents (UA) used in your list is very old. Plus it can be made more succinct.

The anchor attribute "^" means "starts with" so all UAs that start with the same letters or words can be combined. Example:

These three lines:

RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector

Can be condensed to:

RewriteCond %{HTTP_USER_AGENT} ^Email [NC,OR]

Any UA that starts with "Email" would be blocked. It is probably safe to do so because IMO there are no legit UAs that actually start that way. Also the added "NC" allows for either upper or lower case (Email or email.)
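
Applied to a few more names from the original list, the same idea might look like the sketch below. The groupings are only one possible arrangement, and whether each name still deserves blocking is for you to judge from your logs. Note that the last condition carries no [OR]; a trailing [OR] on the final RewriteCond is a common copy-and-paste mistake.

```apache
RewriteEngine on

# All three "Email..." agents in one prefix match.
RewriteCond %{HTTP_USER_AGENT} ^Email [NC,OR]
# Alternation groups agents that share a prefix.
RewriteCond %{HTTP_USER_AGENT} ^Web(Bandit|EMailExtrac) [NC,OR]
# Last condition: no [OR] here.
RewriteCond %{HTTP_USER_AGENT} ^(CherryPicker|ExtractorPro|Teleport|Wget) [NC]
RewriteRule ^ - [F]
```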

Only you can determine what UA is a threat. Example: Some view the Internet Archive as a valuable resource while others see it as copyright infringement. Keep an eye on your raw logs and over time you'll get a better idea what to block and what to let through.

And keep asking questions :)

Dexie




msg:4256455
 11:27 pm on Jan 21, 2011 (gmt 0)

Many thanks keyplyr for that friendly and helpful answer, it's much appreciated ;-)

When I first started to look into this subject a couple of months ago, I was thinking of just whitelisting the good guys, so that all others were naturally excluded, but now, it seems as though it's probably best the other way around - blacklisting all the bad guys, and all others are let in - is that how you see it?

Also, to block any bad bots, you need to see the IP addresses for them - how do you find that out please? Or is that in the raw logs you referred to? If so, how do I get to those raw logs please?

Thanks.

Dexie

wilderness




msg:4256467
 11:54 pm on Jan 21, 2011 (gmt 0)

Also, to block any bad bots, you need to see the IP addresses for them - how do you find that out please? Or is that in the raw logs you referred to? If so, how do I get to those raw logs please?


"raw access logs" are dependent upon your server/host.
Most "shared hosting" providers offer standard "raw access logs", some providers require you turn that option on via CP, while others have logs on by default.

Other hosting providers use scripts or stats to provide a crippled form of logs, which are a real bother. In most instances, if this is the only option your host provides for logs, it would be in your best interest to acquire a new host that does provide "raw access logs".

Taking the IP from the "raw access logs", you do an inquiry at ARIN (North America, which is working rather poorly these days, and by their intention), RIPE (Europe), or APNIC (Asia and Oceania). There are a few other smaller registrars; AfriNIC (Africa) is a real pest in spurts.

There are DNS and IP tools websites that combine all the registrar search options together, with new ones appearing all the time.

However the registrars provide the best and most accurate recent data.
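
If you run your own Apache server rather than shared hosting, the fields that matter here come from the log format. The sketch below is the widely used "combined" format; the log file path is an assumption, and these directives belong in the server config, not in .htaccess.

```apache
# Server config (httpd.conf or a vhost), not .htaccess.
# %h is the remote host (the IP address unless hostname lookups are on);
# the last quoted field is the User-Agent header.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog logs/access_log combined
```

The first field of each log line is the address you would look up at the registrars, and the last quoted field is the User Agent string to match against.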

wilderness




msg:4256470
 12:01 am on Jan 22, 2011 (gmt 0)

There are a couple of dozen terms for User Agents that are perfectly unacceptable to webmasters and have been for a long while, and yet bots and other tools continue to utilize these words in their UAs.

There are two threads near the top of the page that were intended to assist newbies (all new threads are automatically posted below these two priority threads).

One of these, Quick primer on identifying bot activity [webmasterworld.com], provides some common UA abuse names in Section 8 of Ocean's opening post.

Dexie




msg:4256750
 9:01 pm on Jan 22, 2011 (gmt 0)

Many thanks wilderness, that info is very useful.

Dexie




msg:4256773
 9:31 pm on Jan 22, 2011 (gmt 0)

Keyplyr made quite a helpful post above as well. I just wondered if there is any way of writing the syntax so that the word email is matched in whatever case and wherever it comes in the UA. How would you do that?

wilderness




msg:4256780
 9:43 pm on Jan 22, 2011 (gmt 0)

#keyplyr's example provides "BEGINS" with, designated by the caret anchor. The trailing NC in brackets designates NO CASE
RewriteCond %{HTTP_USER_AGENT} ^Email [NC,OR]

#Same example however ABSENT caret anchor designates "CONTAINS" , which is any location in UA
RewriteCond %{HTTP_USER_AGENT} Email [NC,OR]

wilderness




msg:4256786
 9:59 pm on Jan 22, 2011 (gmt 0)

Dexie,
There are fundamental anchors that are required understanding in both mod_setenvif and mod_rewrite.

1) Begins with; designated by leading-caret character
2) Ends with; designated by trailing-dollar-sign character
3) contains; (anywhere in UA) and absent any leading or trailing character.
4) You may also (sometimes required) use both a leading caret and trailing dollar sign, which is explained as both "begins with and ends with".
5) In some instances you may also use quotes to designate "EXACTLY AS", and in my example "exactly as" would include the blank space.

Please note: example (5) "exactly as" doesn't usually work in mod_rewrite, however it does function as intended in mod_setenvif.
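
The anchors listed above can be sketched as follows. "ExampleBot", "Example Bot" and "harvest" are hypothetical strings chosen for illustration, the environment variable name bad_bot is arbitrary, and the Order/Allow/Deny lines assume the Apache 2.2-era access syntax in use when this thread was written. For the quoted case, the sketch also anchors the pattern at both ends, on the assumption that an exact match of the whole UA is wanted.

```apache
# --- mod_rewrite ---
RewriteEngine on
# 1) Begins with: leading caret
RewriteCond %{HTTP_USER_AGENT} ^Email [NC,OR]
# 2) Ends with: trailing dollar sign
RewriteCond %{HTTP_USER_AGENT} ExampleBot$ [OR]
# 3) Contains: no anchor at either end
RewriteCond %{HTTP_USER_AGENT} harvest [NC,OR]
# 4) Begins with and ends with: the whole UA, blank space escaped
RewriteCond %{HTTP_USER_AGENT} ^Example\ Bot$
RewriteRule ^ - [F]

# --- mod_setenvif ---
# 5) Quotes let the pattern carry a blank space; the anchors make it exact.
SetEnvIf User-Agent "^Example Bot$" bad_bot
# Case-insensitive "begins with"
SetEnvIfNoCase User-Agent "^Email" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```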

Don
