
Search Engine Spider and User Agent Identification Forum

    
eMail Harvester UA's
Identification by incoming spam
weesnich

Msg#: 1668 posted 12:08 pm on Feb 14, 2003 (gmt 0)

While reading the German mail-abuse newsgroup I found a posting you may find interesting.

Someone set up a script on his site to feed spiders with email addresses that encode the requesting IP and the time of access. From any incoming spam he can then identify the spider run that harvested the address.

His results: [st.thermoman.de...]
(Page in German, but the results speak for themselves.)

The results are a bit Germany-centred, as you'd expect from a German web page (t-dialin.net is the dial-in pool of Germany's biggest ISP), but they show that today's email harvesters mostly send IE user agents, which can hardly be blocked by .htaccess.

If anyone knows of similar experiments, I'm highly interested.

 

bull

Msg#: 1668 posted 2:20 pm on Feb 14, 2003 (gmt 0)

The referer would be of particular interest, as some of these programs search Google for e.g. "guestbook", "gästebuch" etc. As a result, I deny requests with anything guestbook-like in the referer (most of them, of course, with IE as the UA...).

Dreamquick

Msg#: 1668 posted 3:04 pm on Feb 14, 2003 (gmt 0)

As weesnich points out, you are mostly going to see IE UAs when these things decide to harvest emails "undercover": if they want to appear inconspicuous, they will want to look like the majority of the other traffic.

For what it's worth, I often see a lot of suspicious Netscape 4.x traffic, and while I could blame some of it on proxy caches, it doesn't feel quite right to blame it all on proxies.

I thought I'd share the three most useful bits of anti-spambot advice I have found. They will hinder most of the "dumb" harvesters but won't fool a well-coded one; thankfully the dumb ones are the most common :)

1) If you need to put email addresses on a page, you should strategically use character encoding on the addresses you want to protect.

This is a really simple defence, but it works because the majority of off-the-shelf harvesters are dumb as dirt and focused purely on speed, meaning they won't decode the page before they start looking for email addresses.

This means the clues to where an email address sits on the page can't be found, which makes the addresses much harder to capture. Equally, any reasonably standards-compliant browser, from Netscape to Mozilla to IE to Lynx, will render the address correctly, because it understands what's going on and decodes the page!
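
A minimal sketch of the idea, for illustration only (this is not Dreamquick's code, and the address and helper name are placeholders):

# Encode each character of an address as an HTML decimal entity. A naive
# harvester scanning the raw HTML never sees "@" or the domain, while any
# browser decodes the entities and renders a normal mailto link.
def entity_encode(addr):
    return ''.join('&#%d;' % ord(c) for c in addr)

addr = 'user@example.com'  # placeholder address
print('<a href="mailto:%s">%s</a>' % (entity_encode(addr), entity_encode(addr)))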

2) Spambots often ship with a default spambot UA in addition to a user-defined UA but they make a very half-hearted job of pretending to be a real browser!

The main reason for this is the same as earlier - they are after speed and not quality because at the end of the day it's a crowded marketplace and the guy with the biggest number to put on the box looks the best to the sort of people who buy this stuff.

With some trial and error you can analyse the headers a request carries and compare them to what you would expect from that type of browser. Spambot requests often include vastly different headers from real requests, for example missing some of the really trivial, common ones.
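
A rough sketch of that kind of header sanity check (assuming the request headers are available as a dict; the header list and threshold are assumptions, not Dreamquick's actual rules):

# Real MSIE requests normally carry Accept, Accept-Language and friends;
# a request claiming to be MSIE but missing most of them is suspect.
EXPECTED_FOR_MSIE = ('Accept', 'Accept-Language', 'Accept-Encoding')

def looks_like_fake_msie(headers):
    if 'MSIE' not in headers.get('User-Agent', ''):
        return False
    missing = [h for h in EXPECTED_FOR_MSIE if h not in headers]
    return len(missing) >= 2  # arbitrary threshold - tune against your own logs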

3) Dumb spambots can't handle cookies correctly. Again, this is a really simple thing to check, but it's a dead giveaway for a badly coded crawler.

You often find that they just keep collecting and replaying cookies from every site they visit, presenting a big list of cookies that look like nothing your site uses. You may also find that they send back bits of a cookie they aren't supposed to.

No browser does this, because it's totally insecure behaviour, and none of the professional crawlers do it either, so whenever you see it you know you have a crawler coded by a total amateur.
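
A hypothetical sketch of the cookie giveaway (the cookie names are made up; adjust to whatever your site actually sets):

# A request presenting cookie names this site never issues is replaying
# cookies collected elsewhere - something no real browser would do.
SITE_COOKIES = {'session_id', 'prefs'}  # assumed names

def carries_foreign_cookies(cookie_header):
    names = {part.split('=', 1)[0].strip()
             for part in cookie_header.split(';') if '=' in part}
    return bool(names - SITE_COOKIES)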

- Tony

amoore

Msg#: 1668 posted 3:28 pm on Feb 14, 2003 (gmt 0)

This is absolutely brilliant. It is one of the few ways of verifying that particular "browsers" are indeed malicious spiders. I think I may implement my own to see what kind of results I get.

This type of verifiable proof is one major piece that is missing from my plan to set up a realtime blackhole list for malicious spiders, similar to those maintained for open relays and spam-generating mail servers. More on that is at [gotany.org...] I would love to hear any more comments about that plan. Unfortunately, by the time a spam is received at one of these coded addresses, the information about the useragent and IP address of the spider is rather old, and it's my conjecture that this information loses value very quickly as time progresses.

I would also love to hear more about similar experiments or plans for them.

thermoman

Msg#: 1668 posted 6:42 am on Feb 15, 2003 (gmt 0)

Thanks for mentioning my page and my spamtrap to fool spammers :-)

I was editing my .htaccess and searching the web for some new UAs to block when I found this forum and this thread.

Short explanation of my spamtrap: my page contains a hidden email address in a mailto link, attached to a 1x1 transparent gif. The address is generated dynamically and encodes the IP address of the requesting client plus the date and time. A catch-all account receives all the spam sent to these addresses, so I can figure out which IP spidered my email address and at which moment. In addition, I automatically look up the UA string in the access log, and that gives the spamtrap statistic.
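
A minimal sketch of the kind of trap address thermoman describes (not his actual script; the domain is a placeholder and the encoding scheme is an assumption):

# Encode the requesting IP and the access time into the local part of a
# unique address served behind a catch-all. Any spam that later arrives
# there identifies the harvesting run that picked the address up.
import binascii
import time

def trap_address(client_ip, domain='trap.example.org'):
    stamp = '%s-%d' % (client_ip, int(time.time()))
    local = binascii.hexlify(stamp.encode()).decode()
    return '%s@%s' % (local, domain)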

For now my .htaccess contains, amongst other things:

SetEnvIf Remote_Addr ^12\.148\.209\.19[67]$ client_is_bad
SetEnvIf Remote_Addr ^12\.148\.196\.14(2|[4-9])$ client_is_bad
SetEnvIf Remote_Addr ^195\.154\.174\.[0-9]+$ client_is_bad
SetEnvIf Remote_Addr ^211\.101\.[45]\.[0-9]+$ client_is_bad
SetEnvIf Remote_Addr ^195\.145\.98\.(3[2-9]|[45][0-9]|6[0-3])$ client_is_bad
SetEnvIf Remote_Addr ^63\.148\.99\.(22[4-9]|2[3-5][0-9])$ client_is_bad
SetEnvIfNoCase Request_URI "^[^?]*(owssvr|vti_bin|_vti_|strmver|exploit|victim|MSOffice|DCShop|msadc|winnt|system32|script|autoexec|_mem_bin|NULL\.idq)" client_is_bad
SetEnvIfNoCase User-Agent "^-?$" client_is_bad
SetEnvIfNoCase User-Agent "atSpider" client_is_bad
SetEnvIfNoCase User-Agent "autoemailspider" client_is_bad
SetEnvIfNoCase User-Agent "CherryPicker" client_is_bad
SetEnvIfNoCase User-Agent "DSurf" client_is_bad
SetEnvIfNoCase User-Agent "DTS Agent" client_is_bad
SetEnvIfNoCase User-Agent "EliteSys\ Entry" client_is_bad
SetEnvIfNoCase User-Agent "Email(Collector¦Wolf¦Siphon)" client_is_bad
SetEnvIfNoCase User-Agent "ExactSeek\ Crawler" client_is_bad
SetEnvIfNoCase User-Agent "ExtractorPro" client_is_bad
SetEnvIfNoCase User-Agent "Indy Library" client_is_bad
SetEnvIfNoCase User-Agent "Internet Explore 5.x" client_is_bad
SetEnvIfNoCase User-Agent "larbin.*unspecified.*mail" client_is_bad
SetEnvIfNoCase User-Agent "LinkWalker" client_is_bad
SetEnvIfNoCase User-Agent "Mail Sweeper" client_is_bad
SetEnvIfNoCase User-Agent "MSFrontPage" client_is_bad
SetEnvIfNoCase User-Agent "munky" client_is_bad
SetEnvIfNoCase User-Agent "NICErsPRO" client_is_bad
SetEnvIfNoCase User-Agent "NPBot-" client_is_bad
SetEnvIfNoCase User-Agent "Robozilla" client_is_bad
SetEnvIfNoCase User-Agent "Roverbot" client_is_bad
SetEnvIfNoCase User-Agent "sitecheck\.internetseer\.com" client_is_bad
SetEnvIfNoCase User-Agent "Telesoft" client_is_bad
SetEnvIfNoCase User-Agent "WebBandit" client_is_bad
SetEnvIfNoCase User-Agent "WebEmailExtrac" client_is_bad
SetEnvIfNoCase User-Agent "Zeus.*Webster" client_is_bad
SetEnvIfNoCase User-Agent "^Microsoft\ Data\ Access\ Internet\ Publishing\ Provider\ Protocol\ Discovery$" client_is_bad
SetEnvIfNoCase User-Agent "^Microsoft\ URL\ Control\ -\ 6\.00\.8169$" client_is_bad
SetEnvIfNoCase User-Agent "^Mozilla/3\.01\ \(compatible;\)$" client_is_bad
SetEnvIfNoCase User-Agent "DigExt\)$" client_is_bad

Perhaps this is useful for someone.

Greetings from Germany,
Marcel.

ncw164x

Msg#: 1668 posted 1:09 pm on Feb 16, 2003 (gmt 0)

Why ban Robozilla?

wilderness

Msg#: 1668 posted 1:21 pm on Feb 16, 2003 (gmt 0)

ncw
Welcome to WebmasterWorld.

Good point.
Looks like the last four lines (after Zeus) were later additions.

"DigExt" not a good addition either.

bird

Msg#: 1668 posted 2:13 pm on Feb 16, 2003 (gmt 0)

Very interesting experiment!

Although...:

"Robozilla"

Obviously, someone doesn't want to be found in Google...

"^Mozilla/3\.01\ \(compatible;\)$"

This is usually a proxy server. The client behind it may or may not be a bad robot, most of the time it's some innocent visitor.

"DigExt\)$"

This is a browser-extension, which is installed by roughly 6% of all MSIE users according to my logs. Are you sure you want to block all of those?

thermoman

Msg#: 1668 posted 2:23 pm on Feb 16, 2003 (gmt 0)

> Why ban Robozilla?

I think I'd found this on a site listed as an email harvester. Having now searched the web, I think I'd better disable the ban on this UA ;-)

> Looks like the last four lines (after Zeus)
> were later additions.

No, only the last 2 are new.

> ^Mozilla/3\.01\ \(compatible;\)$"

My logs are full of these UAs ignoring robots.txt ...

> DigExt\)$"

> This is a browser-extension, which is installed
> by roughly 6% of all MSIE users according to my
> logs. Are you sure you want to block all of those?

I've scanned my logs for spider activity and found DigExt only in harvesting agents. This add-on brings crawling functionality to IE, doesn't it?

greetings,
Marcel.

ncw164x

Msg#: 1668 posted 2:58 pm on Feb 16, 2003 (gmt 0)

I scanned my logs and found this "_". There were thousands of them, so I added it to the ban list, only to realise that Googlebot also has a reference to them. Phew, out it came very quickly.

wilderness

Msg#: 1668 posted 3:12 pm on Feb 16, 2003 (gmt 0)

<snip>I've scanned my logs for spider activity and found DigExt only in harvesting agents. This add-on brings crawling functionality to IE, doesn't it?</snip>

Marcel,
DigExt is primarily the term for a non-Pentium processor.
AMD K6-2, mine :(
Athlon and some others.

If you sticky-mail me a URL and a specific target that you can watch for in your logs, I'll visit and show you.
I'm not harvesting anything, short of bad bots and non-friendly visitors to my sites :)

Don

ncw164x

Msg#: 1668 posted 3:36 pm on Feb 16, 2003 (gmt 0)

DigExt

When someone has used "Make available offline"
followed by:

"If this favorite links to other pages, would you like to make those pages available offline too? [y/n] ... Download pages [xxx] links deep from this page"

The useragent is: Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)

It then proceeds to crawl the site with zero wait time between requests.

wilderness

Msg#: 1668 posted 3:48 pm on Feb 16, 2003 (gmt 0)

ncw,
You may well be correct, as Marcel may have been in including it in his denies.
I just cannot recall having had enough intrusions or crawls related to DigExt to associate it with any particular wrongdoing.

I would more likely look for a more specific definition in the UA.
In addition, I would pay more attention to the feedback provided by the IP range.
The same goes for all the IE and Mozilla strings and others contained in UAs. They are so common that denying based on the browser is more overbearing than I desire to be.
In fact, yesterday I started a thread, "Underprotecting in IP ranges", in which I included a deny on an IE patch. I had to choke on my words and remove that SetEnv almost immediately because of a valued visitor.

Don

thermoman

Msg#: 1668 posted 3:52 pm on Feb 16, 2003 (gmt 0)

> DigExt is primarily the term for a non-Pentium processor.

Why the hell does MS send this info in UAs? I think it's as ncw164x mentioned: it's the crawling function in MSIE that produces this addition to the UA.

I banned DigExt (but have now disabled the ban) because of my spamtrap statistic (link at top). Who uses this 'make those pages available offline' function? Modem users, perhaps?

Marcel.

Other topic: what do you send to people who are on your ban list? Only 'forbidden', or do you redirect them to a page explaining why they are banned? I do the latter, but the text is only in German right now.

ncw164x

Msg#: 1668 posted 4:11 pm on Feb 16, 2003 (gmt 0)

Error 403 Page
"You Are Not Allowed To Access This Server"
with text relating to why they are not allowed to view the site

bird

Msg#: 1668 posted 4:14 pm on Feb 16, 2003 (gmt 0)

> ^Mozilla/3\.01\ \(compatible;\)$"
> My logs are full of these UAs ignoring robots.txt ...

Not surprisingly, as most of them won't be robots. A proxy (and/or caching server) is not supposed to respect robots.txt, as it only executes the requests of its users. Those can be many different users in parallel, and you won't have any way to distinguish the good from the bad.

> I banned DigExt because of my spamtrap statistic (link at top). Who uses this 'make those pages available offline' function? Modem users, perhaps?

Most likely, yes.

Note also, that this is not technically an autonomous robot, so it is not bound to robots.txt either. All the requests made this way are more or less explicitly requested by a human.

Robots.txt is only relevant for software that crawls without human intervention, and decides in a fully automated way which URLs to fetch.

wilderness

Msg#: 1668 posted 4:59 pm on Feb 16, 2003 (gmt 0)

<snip>What do u send to people who are on your ban list? Only 'forbidden' or do you redirect them to an explaining site telling them why they are banned? I do so, but text is only in german right now.</snip>

I've found that custom page redirects, or any form of notification other than the standard 403 denied, are not good practice for me.

At one time I had a custom 403 page which a DE (German) visitor kept reloading for hours in an attempt to overload the processor and gain entry. He was not successful, although he was able to grab a couple of images. The event served notice to me, and I removed the custom redirect pages.

thermoman

Msg#: 1668 posted 5:21 pm on Feb 16, 2003 (gmt 0)

@bird: I know what robots.txt is for ;-)

@wilderness: I send them a white page with only a little bit of text, nothing more, so they can't load the processor any more than if I sent them a normal 403 Forbidden page.

Marcel.

ncw164x

Msg#: 1668 posted 5:24 pm on Feb 16, 2003 (gmt 0)

If there is a run of 403s from the same IP number going on for hours, then that range of IP numbers gets added to ipfw.
They then don't get anywhere near the server; it's up to you whether you remove the rule at a later date (the following day, week, etc.) or leave it in for life.
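
A rough sketch of how such repeat 403 offenders could be pulled out of a log (assuming the default common/combined Apache log format; the log path and the threshold of 100 are assumptions, and this is not ncw164x's actual setup):

# Count 403 responses per client IP and print the worst offenders,
# as candidates for an ipfw deny rule.
from collections import Counter

hits = Counter()
with open('/var/log/apache/access_log') as log:
    for line in log:
        parts = line.split()
        if len(parts) > 8 and parts[8] == '403':  # status field in common log format
            hits[parts[0]] += 1                   # client IP is the first field

for ip, count in hits.most_common():
    if count >= 100:
        print(ip, count)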

Romeo

Msg#: 1668 posted 8:22 pm on Feb 16, 2003 (gmt 0)

Locking people out just on UA strings is a bad idea, because a lot of innocent people get kicked out, while most of the bad bots don't use 'special', recognisable UAs but come in hiding behind masqueraded UA strings pretending they are just user browsers. You will end up locking out all 'Mozilla' or 'MSIE' users some day this way...
A client is free to send any UA string it wants.
In my eyes, the only valid filter criterion is the IP address, but YMMV.
Regards,
R.

weesnich

Msg#: 1668 posted 12:25 pm on Feb 17, 2003 (gmt 0)

@thermoman: Welcome, and I'm glad you joined the discussion. Thank you for compiling and publishing these statistics.

@romeo: Some time ago many spambots could be identified by UA, because spammers weren't thinking about it. It took some time for the spammers to recognise this, and some more time to change their tools and bring them to "market". Seems like they finally did.
I posted the link to thermoman's site because it was the first hard evidence I have seen that most spambots pretend to be ordinary IEs. I think many people were already suspecting this, but evidence is another thing. Many times when I see a spider run in the logfiles I just have no clue what the purpose of the run was, especially when the spider's IP resolves to a dial-in of a more or less popular ISP. A similar statistic for a US site might look different - who knows.

The huge majority of the harvester runs seen on thermoman's page originate from dial-ins. Sadly, banning dial-ins by IP from Germany's biggest ISP and from AOL is not an option for me. It would be nice if ISPs would at least terminate the accounts of such customers, but as far as I know Telekom (t-dialin.net) has refused to. They argued that spamming is illegal (it is in Germany), but spidering isn't. Even if we block the IPs of suspicious dial-ins, spammers may one day spider through open proxies.

That means the easy .htaccess approach to banning spammers has lost much of its effect. We need more complex measures like bot traps and/or the things mentioned by Dreamquick. I have personally used a wild mix of hex and decimal Unicode and ASCII codes for the email contacts on my website for over a year now, with good success. There are plenty of other alternatives like JavaScript links, Flash links etc., which all have drawbacks. If I have more time I'll think about this cookie thing; it sounds interesting.

BTW: Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)
is clearly fake; no MSIE ever put a space between "compatible" and ";". But I would not bet on spammers making this mistake again.

thermoman

Msg#: 1668 posted 10:25 pm on Mar 9, 2003 (gmt 0)

Hi,

<snip>The referer would be quite of interest, as some programs seek google for e.g. "guestbook", "gästebuch" etc. As a result, I denied those containing anything guestbook- like in the referer (most of course with IE as UA...)</snip>

I've added referer support to my script, so you can now see where the bad email spiders came from. The URL is in the first post: [webmasterworld.com...]

Like bull said, most of them searched Google, MSN etc. for "gästebuch" (German for guestbook) plus an additional word.

Because almost all of them came with "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)" as the UA, I now block this:

In the .htaccess in the root of my domain dirs:

SetEnvIfNoCase User-Agent "^Mozilla/4\.0\ \(compatible;\ MSIE\ 5\.0;\ Windows\ NT;\ DigExt\)$" client_is_suspicious1
SetEnvIfNoCase Referer "(q|query)=.*g(%C3%A4|ae)stebuch" client_is_suspicious2

In the .htaccess of the domain dir where the bad boys aren't acceptable:

RewriteCond %{ENV:client_is_suspicious1} ^1$
RewriteCond %{ENV:client_is_suspicious2} ^1$
RewriteRule ^.*$ [somewhere.but.not.here...] [R,L]

Perhaps the referer info is useful to some of you.

Greetings from Germany.

bird

Msg#: 1668 posted 10:52 pm on Mar 9, 2003 (gmt 0)

Ah, with the guestbook connection, the DigExt ban starts to make a lot more sense. Very interesting observation!

Now am I glad I don't have a guestbook on any of my sites... ;)

WebJoe

Msg#: 1668 posted 11:24 pm on Mar 9, 2003 (gmt 0)

A general question first: am I the only one in this forum whose hosting provider runs MS-only servers with no support for .htaccess?

Anyway, I found a way to deal with unwanted visitors very similar to what thermoman does, with the only downside that (I think) it produces more server load. But my provider has never complained, so I never pushed for a way to support .htaccess.
If anyone is interested in how I did it, I'd be happy to explain, but I didn't want to post a link to my site here...

I just wanted to note that I also redirect banned visitors to a page where they find an explanation of why they were redirected.

Re: DigExt and IE ignoring robots.txt: if a user adds a site to his favourites and chooses to make it available offline, two things happen (just tested with IE 5.5 & 6):
- IE grabs the robots.txt and follows its rules
- it adds "MSIECrawler" to the UA string

I hope this helps.
