Forum Moderators: open
1.) Currently 99.9% of visitors whose UAs send referers show "http://www.example.com/" after they come in the front door and load graphics and html pages. (If they use a bookmark or type in the site name, there's no initial same-site referer, of course, but from then on they all show "http://www.example.com/".)
Well, almost all.
In the past two days at least four separate 'visitors' showed a no-slash referer. The UAs varied, as did the Hosts/IPs.
2.) Confused yet? Sorry! Here's an example of a normal referer, where example.com is my site. Note how the referer ends in a slash:
nnn.nnn.nnn.nnn - - [14/Sep/2009:11:16:37 -0700] "GET /dir/file.gif HTTP/1.1" 200 6079 "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"
And here's an abnormal-referer hit from the same host in the same second -- no slash:
nnn.nnn.nnn.nnn - - [14/Sep/2009:11:16:37 -0700] "GET /dir/file.html HTTP/1.1" 403 1499 "http://www.example.com" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"
(The latter 403'd because of the fake ref.)
3.) If you eyeball your raw access_log(s), you'll see the normal, same-site, top-level, slash-ending referers. Do you see any no-slash abnormalities?
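If grep is handier than eyeballing, something like this pulls the slashless hits out of a combined-format log. The sample data is inlined here purely for illustration; on a live server you'd point the grep at your real access_log, and example.com is a stand-in for your own hostname:

```shell
# Build a two-line sample log (combined format) just for illustration;
# on a real server, grep your actual access_log instead.
printf '%s\n' \
  '1.2.3.4 - - [14/Sep/2009:11:16:37 -0700] "GET /dir/file.gif HTTP/1.1" 200 6079 "http://www.example.com/" "Mozilla/4.0"' \
  '1.2.3.4 - - [14/Sep/2009:11:16:37 -0700] "GET /dir/file.html HTTP/1.1" 403 1499 "http://www.example.com" "Mozilla/4.0"' \
  > /tmp/sample_log1
# A referer that ends right at the hostname -- closing quote after
# ".com", no trailing slash -- is the abnormal case:
grep -E '"https?://(www\.)?example\.com"' /tmp/sample_log1
# -> prints only the /dir/file.html line
```

The closing quote in the pattern is what does the work: a normal same-site referer has a slash between ".com" and the quote, so only the abnormal hits match.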
Looking through my logs, I haven't seen any slashless referrals in a long time (many, many years). I've never seen any that switch user-agents as you described, but obviously that's a huge red flag...
Well-spotted!
Jim
I've received no "reports" through my 403-Forbidden contact form complaining about this, either. So 99.99% sure it's just 'bots.
Note that I can be sure it's OK to 403 these guys: if that slash were genuinely missing from a legitimate home-page request (say, a typed-in slashless URL not already auto-corrected by the browser), the server would issue a redirect adding the slash before serving the page. So I should never see any 'real' referrers of "example.com" with no slash.
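For the record, the test itself is tiny in mod_rewrite terms. This is a sketch only, with example.com standing in for the real hostname:

```apache
# Sketch: no legitimate visitor can send our own hostname as a
# referer *without* the trailing slash, so 403 anything that does.
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?example\.com$ [NC]
RewriteRule .* - [F]
```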
Jim
Some of the old UAs:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.14) Gecko/2009082707 Firefox/3.0.14
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; MDDS)
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0
Beats heck outta me. Both slashless referer forms --
http://example.com
http://www.example.com
-- appeared in all grepped logs, for html and image files; sometimes from the first hit to /, sometimes only after that hit. A few misc. observations:
1.) Platforms were overwhelmingly PC (one unknown; one PDA; no Mac), and likewise the UAs, but for one Opera, one nasty --
POE-Component-Client-HTTP/0.65 (perl; N; POE; en; rv:0.650000)
-- a few Chromes, and a bunch of BlackBerries:
BlackBerry8700/4.2.1 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/100
All BlackBerries came from the same IP range (206.53.153.nn), the rDNS of which is blackberry.net.
2.) Hosts ranged from individual ISP accounts like Comcast to major leaguers like foxinc.com, edwardjones.com and .bbc.co.uk. The latter make me wonder if there's some network-related process going on.
But...
3.) Another ill-fitting puzzle piece is that the percentage of bare IPs (no rDNS) was much, much higher in the 'fake ref' group than is usually seen in other groups grepped for X,Y,Z. Manually checking a few of the IPs showed a mix of server farms and apparently 'normal' ISPs.
4.) Known bad Hosts and IPs were mixed in with the maybe-okay Hosts (e.g., the big corporations), making whitelisting, or more narrow blacklisting, really tricky.
Bottom line?
Clear as mud. Still. Thoughts?
Since the IPs are 'all over the map,' just continue blocking them by missing-slash referrer, and forget about it. If you *ever* get an e-mail complaint, I'll be surprised. The only ones worth the least bit of worry are the mobile UAs -- since phones are typically rushed to market, their HTTP compliance is quite spotty...
Jim
The most irksome part for me is blocking the ones that first hit without any referer. They're able to snag everything page-related before they even try a fake ref on-site. Of course, the second they fake it, they're 403'd, but they still get in that first bite. Some even switch back and forth, starting out w/o a fake, then faking it, then not.
Fortunately, this is all more geekily intriguing (& procrastination-worthy) than seriously problematic, at least at this moment.
If only I could adapt the likes of Key_Master's Perl script [webmasterworld.com] from way back and automate the 403 process for the repeat fakers. Then new ones would get one hit, but that'd be it.
# Rewrite self-referring page requests to Key_Master script
RewriteCond %{REQUEST_URI}>%{HTTP_REFERER} ^(.+)$
RewriteCond %1 ^/somepage\.html>http://www\.example.com/somepage\.html$ [OR]
RewriteCond %1 ^/otherpage\.html>http://www\.example.com/otherpage\.html$ [OR]
RewriteCond %1 ^/thirdpage\.html>http://www\.example.com/thirdpage\.html$
RewriteRule ^ /bad_bot.pl [L]
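Not Key_Master's script, but a rough shell sketch of the same 'automate the 403' notion: harvest the IPs that sent a missing-slash self-referer and emit Apache deny lines for them. The log path, hostname, and output file are all assumptions to adapt (sample data inlined so the sketch is self-contained):

```shell
# Sketch: build a deny list from IPs that sent a slashless self-referer.
# Assumes 'combined' log format; substitute your real access_log path
# and hostname. Sample log inlined for illustration only.
printf '%s\n' \
  '1.2.3.4 - - [14/Sep/2009:11:16:37 -0700] "GET /a.html HTTP/1.1" 403 1499 "http://www.example.com" "Mozilla/4.0"' \
  '5.6.7.8 - - [14/Sep/2009:11:17:02 -0700] "GET /b.html HTTP/1.1" 200 900 "http://www.example.com/" "Mozilla/5.0"' \
  > /tmp/sample_log2
# Field 1 of the combined format is the client IP; the regex matches
# only referers with no slash after the hostname.
awk '$0 ~ /"https?:\/\/(www\.)?example\.com"/ { print "Deny from " $1 }' \
  /tmp/sample_log2 | sort -u > /tmp/deny_list.conf
cat /tmp/deny_list.conf
# -> Deny from 1.2.3.4
```

Run from cron against the live log and Include the output in httpd.conf, and the repeat fakers would get that one first bite and nothing after.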
Jim
I specifically exclude directory 'index page' request URLs (requested URLs ending with a slash) from the missing-slash-referrer test for the reason Umbra stated.
Also, some legitimate search and SEO tool robots will omit the slash, so additional exceptions may be needed on your site.
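Sketched out, those exceptions might look like this -- the directory-index exclusion plus one placeholder robot UA; the names are illustrative, not tested rules:

```apache
# Sketch: skip the missing-slash-referer 403 when the request itself is
# a directory index (requested URL ends in a slash), or when the UA is
# a robot you've decided to allow (Googlebot here as a placeholder).
RewriteCond %{REQUEST_URI} !/$
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?example\.com$ [NC]
RewriteRule .* - [F]
```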
Jim