Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Curious sequence of events
phred




msg:3794957
 10:24 pm on Nov 26, 2008 (gmt 0)

Noticed something strange in the logs today. Could be coincidence, but then again I believe the Brooklyn Bridge already has at least one owner.

My site is driven through a central routing application that Apache launches from the browser's URL/URI request. The software router manages sessions, manages security (checks/updates IP and UA data, feeds 403s, etc.), manages internal states, logs, and routes to the appropriate sub-application. So I can see and log things like direct navigation to a sub-page from a browser without an existing session. Unfortunately my hosting company doesn't allow direct Apache log access, so I can't see exactly what was requested.

31:02 *IP1*/ No Session - routing to home
31:02 *IP2*/ No Session - routing to home
31:03 *IP3*/ No Session - routing to home

That’s a normal first entry to the site – basic url only and no previous/existing session.

31:14 *IP1*/Products/ No Session - routing to products
31:14 *IP2*/Products/ No Session - routing to products
31:15 *IP3* routing to products

IP1 and IP2 are behaving like a bot – each scraped a direct navigation URL/URI from the home page and then started a new “browser” session. I regenerate a session, log and route. IP3 is behaving like a user who clicked on the link in the home page and therefore has an existing session, so I just route.

32:25 *IP2*/History/ No Session - routing to history
32:26 *IP3* routing to history

IP1 has gone but IP2 and IP3 are behaving as above.

Note the relative timing, the 10-11 second gaps between the groups of accesses, and the exact same navigation within the site. IP3 looks like a real user; however, the relative timing and identical navigation are just too much of a coincidence. Highly suspicious.

IP1 – 65.46.48.xxx – Mozilla/4.0 - XO Communications
IP2 - 204.246.129.xxx – Mozilla/4.0 – ViaWest Internet Services, Inc.
IP3 - 204.54.36.xxx - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; MS-RTC LM 8; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022) - Deere & Company
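
The pattern described above (sessionless requests from other IPs landing on a route a second or so before a request that does carry a session) can be picked out of a router log mechanically. The following is only a minimal sketch: the `Hit` record, the `find_shadowed_hits` helper and the 2-second window are assumptions made for illustration, not part of the poster's router, and the sample data is transcribed from the log excerpt above using the IP1/IP2/IP3 placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    stamp: str         # "mm:ss" as printed in the router log
    ip: str            # placeholder label, e.g. "IP1"
    route: str         # sub-application the router chose
    has_session: bool  # False when the router logged "No Session"


def to_seconds(stamp: str) -> int:
    """Convert an "mm:ss" stamp to a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def find_shadowed_hits(hits, window=2):
    """Pair each request that has a session with the sessionless requests
    from other IPs that reached the same route at most `window` seconds earlier."""
    pairs = []
    for user in hits:
        if not user.has_session:
            continue
        shadows = [
            h.ip for h in hits
            if not h.has_session
            and h.ip != user.ip
            and h.route == user.route
            and 0 <= to_seconds(user.stamp) - to_seconds(h.stamp) <= window
        ]
        if shadows:
            pairs.append((user.ip, user.route, shadows))
    return pairs


# Transcribed from the log excerpt in the post (hypothetical structure).
LOG = [
    Hit("31:02", "IP1", "home", False),
    Hit("31:02", "IP2", "home", False),
    Hit("31:03", "IP3", "home", False),
    Hit("31:14", "IP1", "products", False),
    Hit("31:14", "IP2", "products", False),
    Hit("31:15", "IP3", "products", True),
    Hit("32:25", "IP2", "history", False),
    Hit("32:26", "IP3", "history", True),
]

for ip, route, shadows in find_shadowed_hits(LOG):
    print(f"{ip} -> {route}: preceded by {', '.join(shadows)}")
```

The first group of hits is not reported because IP3 had no session yet on its first entry; only the products and history requests are paired.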

 

wilderness




msg:3795086
 2:52 am on Nov 27, 2008 (gmt 0)

IP2 - 204.246.129.zzz – Mozilla/4.0 – ViaWest Internet Services, Inc.

Had an unidentified bot from that IP during 2006.
204.246.129.zzz - - [01/Sep/2006:06:38:42 -0700] "GET /MyFolder/MySubFolder/ HTTP/1.1" 200 4951 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

IP1 – 65.46.48.zzz – Mozilla/4.0 - XO Communications

XO has presented many problems over a long period.
At one time, I had many of their provider ranges denied.
Some time back I removed those denials; however, occasionally I still see what might be called "un-explainable" activity from their ranges.

Samizdata




msg:3795087
 2:53 am on Nov 27, 2008 (gmt 0)

The first two IPs with the "Mozilla/4.0" user-agent are proxies, the last is that of the actual user.

I took a quick look at my logs and saw the exact sequence - in my case both proxies (with the same IP ranges you specified) were served a 403 and the real user following up was served the requested content with no ill-effects.

Highly suspicious

An understandable reaction, but I would say the only thing you have to worry about is where to find a decent hosting company that allows you access to the raw Apache logs.

...

Samizdata




msg:3795088
 2:58 am on Nov 27, 2008 (gmt 0)

Some time back I removed those denials

To clarify, I don't deny the IPs but block on the "Mozilla/4.0" user-agent.
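
Blocking on the user-agent rather than the IP can be as simple as an exact string comparison. This is a rough sketch under the assumption that only the bare string "Mozilla/4.0" (with nothing after it) should be refused; the `status_for` function is hypothetical and is not the poster's actual rule.

```python
BARE_UA = "Mozilla/4.0"  # the complete user-agent string sent by the two proxies


def status_for(user_agent: str) -> int:
    """Return 403 for the bare 'Mozilla/4.0' string, 200 for anything else."""
    return 403 if user_agent.strip() == BARE_UA else 200


print(status_for("Mozilla/4.0"))
print(status_for("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"))
```

An exact match matters here: a prefix test would also refuse every ordinary MSIE browser of the period, since their strings begin with the same token.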

...

Megaclinium




msg:3828721
 3:07 am on Jan 18, 2009 (gmt 0)

I had something similar happen

user from address
205.124.145.xx
Utah education network
205.118.0.0 - 205.127.255.255

came in to my site from a valid link
UA mutates in middle of retrieving pages
referrer goes blank so I start serving up 302s
already thinking not really a user but a bot
they start bypassing the next page in middle of grabbies

65.46.48.194 hits right in middle of sequence
XO communications
65.44.0.0 - 65.47.255.255
maybe an AV service site for the education network?

I think between mutating UA and no referrer to my sub pages must be a zombie bot

Strange thing is it comes back for some of the 302's pages normally

saw previous link in 2002 about them
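
A UA that "mutates" partway through a visit can be spotted by grouping requests per session and counting distinct user-agent strings. The sketch below assumes the log can be reduced to (session id, user-agent) pairs; the function name and the sample values are made up for illustration.

```python
from collections import defaultdict


def sessions_with_changing_ua(requests):
    """requests: iterable of (session_id, user_agent) pairs.

    Return {session_id: [user agents in order of first appearance]}
    for every session that presented more than one user agent."""
    seen = defaultdict(list)
    for session_id, user_agent in requests:
        if user_agent not in seen[session_id]:
            seen[session_id].append(user_agent)
    return {sid: uas for sid, uas in seen.items() if len(uas) > 1}


# Hypothetical sample: session "a" switches UA mid-visit, session "b" does not.
SAMPLE = [
    ("a", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"),
    ("b", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"),
    ("a", "Mozilla/4.0"),
    ("a", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"),
]

print(sessions_with_changing_ua(SAMPLE))
```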

thetrasher




msg:3828906
 3:13 pm on Jan 18, 2009 (gmt 0)

65.46.48.194 hits right in middle of sequence
XO communications
65.44.0.0 - 65.47.255.255
maybe an AV service site for the education network?
Try RWhois:
65.46.48.192/30 = (a proxy-server manufacturer)
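
For anyone unsure how the /30 relates to the wider XO range, Python's standard `ipaddress` module can do the arithmetic. This is just a worked illustration using the addresses quoted in this thread.

```python
import ipaddress

# The /30 reported by RWhois: a block of four addresses.
block = ipaddress.ip_network("65.46.48.192/30")
print(block[0], "-", block[-1], f"({block.num_addresses} addresses)")

# The address seen in the logs falls inside that block.
hit = ipaddress.ip_address("65.46.48.194")
print(hit in block)

# The wider XO range from plain whois, expressed as CIDR.
xo = list(ipaddress.summarize_address_range(
    ipaddress.ip_address("65.44.0.0"),
    ipaddress.ip_address("65.47.255.255"),
))
print(xo)
print(block.subnet_of(xo[0]))
```

So the /30 is a four-address slice (65.46.48.192 through 65.46.48.195) of the 65.44.0.0/14 allocation, which is why plain whois shows only the provider while RWhois shows the customer.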

jdMorgan




msg:3828940
 4:51 pm on Jan 18, 2009 (gmt 0)

Yeah, 65.46.48.192-195 is Bluecoat:

Hardware proxy appliances for corporate networks offering web caching, virus scanning, content filtering, instant messaging control and bandwidth management.

Their caching/filtering proxy used to use a UA string that would match this regex pattern:

^Mozilla/4\.0\ \(compatible;\ MSIE\ 6\.0;\ Bluecoat\ DRTR\)$

I'm not sure if that's still true, but I used that to explicitly allow their non-browser behaviors.

Jim
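
The pattern above is written with escaped spaces, as is usual in Apache rewrite conditions, but it behaves the same way in other regex engines. As a quick check, here is a small sketch in Python; the `is_bluecoat_drtr` helper is a made-up name, and whether current appliances still send this string is not confirmed in this thread.

```python
import re

# The pattern quoted in the post, unchanged.
BLUECOAT_UA = re.compile(
    r"^Mozilla/4\.0\ \(compatible;\ MSIE\ 6\.0;\ Bluecoat\ DRTR\)$"
)


def is_bluecoat_drtr(user_agent: str) -> bool:
    """True only when the whole user-agent string matches the pattern."""
    return BLUECOAT_UA.match(user_agent) is not None


print(is_bluecoat_drtr("Mozilla/4.0 (compatible; MSIE 6.0; Bluecoat DRTR)"))
print(is_bluecoat_drtr("Mozilla/4.0"))
```

Because the pattern is anchored at both ends, the bare "Mozilla/4.0" string seen earlier in the thread does not match it.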

Megaclinium




msg:3829172
 1:30 am on Jan 19, 2009 (gmt 0)

I was less bothered by the apparent AV scan in the middle than by the original 'user' who changed UAs in the middle of accessing one page. I see that sometimes. I only ban them if they are stupid enough to cause errors.

The original user from 205.118 tried to grab media without a referrer SOME of the time. That I found real odd. Maybe a zombified student? Or someone hacked into Utah's network?

Megaclinium




msg:3829189
 1:43 am on Jan 19, 2009 (gmt 0)

Try RWhois:
65.46.48.192/30 = (a proxy-server manufacturer)

I don't mean to sound stupid, but how do I get more detail than the whois that showed XO Communications range?

Is this with the > search prefix function I saw in an early WebmasterWorld post?

wilderness




msg:3829230
 3:23 am on Jan 19, 2009 (gmt 0)

I don't mean to sound stupid, but how do I get more detail than the whois that showed XO Communications range?

Megaclinium,
For backbones and/or providers that do not have their large blocks of IPs broken down into subnets or commercial-customer sub-ranges, we basically use any method we can beg, borrow or steal.

Tracert or ping in some instances offers some focus.
Subnet searches are possible at ARIN in some instances; however, the difficulty in obtaining results, along with ARIN's 256-result output limit, adds frustration.

Faced with a backbone provider's range, where there is a good possibility that a portion of the range is the culprit, you generally need multi-conditional restrictions/denies.

There are many of us that "have" and "have had" either entire providers denied access or large ranges from specific providers. These denials are a result of unaccountability that is very similar to the frustration that XO Communications (and other providers) leave webmasters to contend with.

BTW, in some instances, if you sticky mail a person who offers an insight in the forum, you may gain additional references that cannot be presented in the open forum (i.e., because of the charter).

Don
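
On the RWhois question: a registry whois record sometimes carries a "ReferralServer:" line that names the provider's own RWhois server, and that server is where the customer-level sub-ranges live. The sketch below only extracts that referral from whois text you already have. The sample record is entirely hypothetical (documentation addresses and an example.net host), and 4321 is assumed as the default port when none is given.

```python
from urllib.parse import urlsplit


def referral_servers(whois_text: str):
    """Return (host, port) pairs from 'ReferralServer:' lines
    that use the rwhois:// scheme."""
    servers = []
    for line in whois_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "referralserver":
            continue
        url = urlsplit(value.strip())
        if url.scheme == "rwhois" and url.hostname:
            # Assumption: fall back to 4321 when the URL has no port.
            servers.append((url.hostname, url.port or 4321))
    return servers


# Hypothetical whois output, for illustration only.
SAMPLE = """\
NetRange:       192.0.2.0 - 192.0.2.255
OrgName:        Example Backbone Provider
ReferralServer: rwhois://rwhois.example.net:4321
"""

print(referral_servers(SAMPLE))
```

Not every provider publishes a referral, which is when the tracert, ping and ARIN subnet-search methods mentioned above come into play.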

jdMorgan




msg:3830076
 3:47 am on Jan 20, 2009 (gmt 0)

"Sam Spade" is your friend -- as is Google. :)

Jim

Megaclinium




msg:3831056
 6:35 am on Jan 21, 2009 (gmt 0)

Thanks, Wilderness;

My theory also is that even if XO Communications or Level 3 ARE backbone providers, it doesn't make sense to me that an END USER would actually be using an IP address in their blocks. They might be traversing these same network segments, but (and I may be wrong) it is more likely some server running something that I don't want on my system. Same with the Amazon cloud.

So I recently got a scraper or other unexplainable hit that isn't what a normal user on a browser could do. It resolved to Level 3, and I had no problem deep-sixing the entire block. (Picture Patty and Selma smoking a butt and saying 'Oh, that felt good' :)

I occasionally run into problems with this. I checked some 302s caused by multi-IP hits that don't share the referrer even though they come from one user session, like AOL does. I looked one up and it resolved to ABCDCorp. ABCDCorp is listed on NYSE, "major defense contractor" blah blah blah. The block only has 5 addresses, yes 5, xxx.2 to xxx.7.

Then a friend in one of our clubs emails me (from home), at bottom, his name and below that "Sr Scientist", plus "ABCDCorp"

So he was trying, from work, to access a link from a newsletter I sent out. Oh well. I re-enabled it, but due to their stupid setup and my leech prevention the page won't display completely for him at work.

I won't turn it off tho. This seems to be a great way to get scrapers to show up as 404s so you notice them quickly. Too bad for the occasional site that can't share the referrer among multiple IP sessions. AOL Europe I banned for a while, as they sent, for one session, what looked like differing UAs spread out among different IPs, despite it being the same session grabbing files on one page in order. Then they sent '-' thru as one UA! Maybe an AV checking member web page access?
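
The leech prevention described in this post boils down to a referrer test on media requests. Here is a minimal sketch of that idea, assuming a single site host and a 302 for anything that does not arrive with a matching referrer; `OWN_HOST` and `media_status` are hypothetical names, and this is not the poster's actual setup.

```python
from urllib.parse import urlsplit

OWN_HOST = "www.example.com"  # hypothetical site host


def media_status(referer: str) -> int:
    """200 when the Referer header names our own host,
    302 otherwise (blank, '-', or a foreign site)."""
    host = urlsplit(referer).hostname if referer else None
    return 200 if host == OWN_HOST else 302


print(media_status("http://www.example.com/gallery/"))
print(media_status("-"))
print(media_status(""))
```

As the ABCDCorp story shows, a rule like this also catches legitimate visitors whose proxy strips or fails to share the referrer, so it trades some false positives for visibility.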


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved