Mozilla/3.Mozilla/2.01 (Win95; I)

from nucleus.com

         

idiotgirl

11:09 am on Oct 5, 2001 (gmt 0)

10+ Year Member Top Contributors Of The Month



Host: 207.34.94.87
Agent: Mozilla/3.Mozilla/2.01 (Win95; I)

What in the HECK is this thing? It came through my sites like wildfire - scraped up every graphic, tried to roar through my cgi-bins, and blatantly disregarded robots.txt. I think poor ban_bot.cgi had a freaking meltdown. Weird - for some reason it didn't look like ban_bot.cgi 'punished' it (redirected it to another URL)??? I'm stumped.

I traced the IP to some dialup service in Canada, but haven't a clue why this thing was so aggressive... and rude. Anyone else spotted this animal? Yikes.

Idiotgirl

littleman

6:34 am on Oct 6, 2001 (gmt 0)



I see that UA all the time from a wide range of IPs. Often it acts like a normal browser. I'm not sure what the application is; if anyone knows, speak up. The IP points to pm4-0-s0-i87-cgy.nucleus.com, and the whole class C looks like that. Rwhois also points to Nucleus as the end user. You could email them and ask what they were up to.

idiotgirl

8:11 am on Oct 6, 2001 (gmt 0)




Right after I posted it came sweeping through again - through the same domains. It just cruised back in and proceeded to suck up everything in its path. I haven't seen that UA in my logs before. If I had, I sure didn't notice that kind of behavior. I'd recently moved several domains to that box. Maybe that caused the fervor?

Anyway, I added the UA to my ban_bot text file and banned the IP just... because. I had also added Mozilla/2.0 to my ban_bot text file, but I see that one is now showing up as Ask Jeeves. What a dilemma. (Hmmm - where'd my regex book go, anyway?)
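On the regex question, here is a minimal sketch in Python. It does not reflect how ban_bot.cgi actually matches entries, since its rules are not shown in this thread. The idea is to anchor the pattern at both ends so only the exact agent string is banned, and agents that merely start with Mozilla/2.0 are left alone. The names BANNED_UA and is_banned are made up for illustration.

```python
import re

# Exact agent string reported in this thread. re.escape() neutralises the
# dots, slash, and parentheses so they match literally.
BANNED_UA = re.compile(r"^" + re.escape("Mozilla/3.Mozilla/2.01 (Win95; I)") + r"$")

def is_banned(user_agent):
    """Return True only when the whole User-Agent string matches."""
    return BANNED_UA.match(user_agent.strip()) is not None
```

With this, a hypothetical agent such as "Mozilla/2.0 (compatible; Ask Jeeves)" would not be caught, because the anchors require the full string to match.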

Is it just me, or should we (webmasters, admins, SEOs) all start wearing crime-fighter super-hero capes?

Will

8:32 am on Oct 8, 2001 (gmt 0)



Hi idiotgirl,

I have only ever seen this UA being used by a product called EmailSiphon (others might know it as Sonic) from www.earthonline.com.

As the name implies, this is email extraction software - it is multithreaded so this would explain why it hits sites quickly (depending on bandwidth, it can submit upwards of 10 requests to the same site at once).

EmailSiphon allows UA spoofing (I think about 30 or so separate UAs can be selected), but the one you have seen is the default. The IP address will belong to whichever spammer is running the software :(

If you want to identify it irrespective of the UA used, try using the HTTP_ACCEPT header which stays the same at all times. This should look like:

www/source, text/html, video/mpeg, image/jpeg, image/x-tiff,image/x-rgb, image/x-xbm, image/gif, */*, application/postscript

with "www/source" and "application/postscript" being the unusual entries.
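A rough sketch of that check, in Python, assuming the Accept value really is constant as described above. The function and constant names are illustrative, and the test is deliberately loose: it only looks for the two unusual entries rather than demanding the full string.

```python
# Accept value quoted above, copied verbatim (including the missing space
# after "image/x-tiff,").
SIPHON_ACCEPT = ("www/source, text/html, video/mpeg, image/jpeg, image/x-tiff,"
                 "image/x-rgb, image/x-xbm, image/gif, */*, application/postscript")

# The two entries singled out as unusual.
UNUSUAL_TYPES = {"www/source", "application/postscript"}

def media_types(accept):
    """Split an Accept value into lowercase media types, dropping
    parameters such as ';q=0.8'."""
    return [part.split(";")[0].strip().lower()
            for part in accept.split(",") if part.strip()]

def looks_like_siphon(accept):
    """True when both unusual entries appear in the Accept value."""
    return UNUSUAL_TYPES.issubset(media_types(accept))
```

Matching on the two odd entries instead of the whole string means small differences in spacing or ordering would not break the check, at the cost of a few more false positives.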

Hope this helps!

idiotgirl

3:34 pm on Oct 8, 2001 (gmt 0)




Aha! Well - EmailSiphon would certainly explain its persistence. It attempted access again last night using a different IP and was shut out. I did email nucleus.com to ask what it (the UA) was, why it was coming from them, etc., but have received no response. Like someone would blurt out, "Oh - we are just harvesting email addresses. Nothing to worry about."

Now, about this HTTP-ACCEPT header... is that in my config files under mime-types or my logging program or in the actual HTML??? Sorry, I'm not-so-bright sometimes.

Will

9:18 am on Oct 9, 2001 (gmt 0)



HTTP_ACCEPT (note underscore, not hyphen) is the server variable that holds the Accept header, which the client sends as part of the HTTP request when it accesses your page.

To examine it you need to use some form of server-side scripting. The actual method used to read the information depends on what system you are using. For example, in ASP you would need

<%
' Read the Accept header the client sent with this request
Dim TheData
TheData = Request.ServerVariables("HTTP_ACCEPT")
%>

but it would vary according to whether you use Perl, JSP, etc.
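As one more illustration, here is roughly what the same lookup might look like in a Python CGI script. This is only a sketch: it assumes the script runs under plain CGI, where the server passes request headers as HTTP_* environment variables, and read_accept is a made-up helper name.

```python
import os

def read_accept(environ=os.environ):
    """Return the client's Accept header as exposed by CGI, or '' if absent."""
    return environ.get("HTTP_ACCEPT", "")

if __name__ == "__main__":
    # Minimal CGI response that echoes the value back as plain text.
    print("Content-Type: text/plain")
    print()
    print(read_accept())
```

Taking the environment as a parameter makes it easy to try the function with a plain dictionary before putting it on a live server.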

A word of warning - this is not the "holy grail" of spider detection - I would not recommend banning based on the contents of this variable alone. That said, you can often glean a lot of useful spider identification info from this and several other server variables.
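One way to act on that warning is to treat each variable as a hint and only flag a visitor when several hints agree. The sketch below is in Python, and the signals, equal weights, and threshold are all assumptions chosen for illustration, not tested values. The hit_disallowed_path flag is something the caller would have to work out separately, for example by checking whether the request touched a path that robots.txt disallows.

```python
# Default agent string reported in this thread.
DEFAULT_SIPHON_UA = "Mozilla/3.Mozilla/2.01 (Win95; I)"

def suspicion_score(user_agent, accept, hit_disallowed_path):
    """Count how many independent hints point at an email harvester."""
    score = 0
    if user_agent.strip() == DEFAULT_SIPHON_UA:
        score += 1
    accept_lower = accept.lower()
    # Count the Accept header as a single hint, and only when both
    # unusual entries are present.
    if "www/source" in accept_lower and "application/postscript" in accept_lower:
        score += 1
    if hit_disallowed_path:
        score += 1
    return score

def should_flag(user_agent, accept, hit_disallowed_path, threshold=2):
    """Flag only when at least `threshold` hints agree."""
    return suspicion_score(user_agent, accept, hit_disallowed_path) >= threshold
```

Because the Accept header counts as one hint, it can never trigger a flag by itself at the default threshold of 2.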