Forum Moderators: open
1.) Currently 99.9% of visitors whose UAs send referers show "http://www.example.com/" after they come in the front door and load graphics and html pages. (If they use a bookmark or type in the site name, there's no initial same-site referer, of course, but from then on they all show "http://www.example.com/".)
Well, almost all.
In the past two days at least four separate 'visitors' showed a no-slash referer. The UAs varied, as did the Hosts/IPs.
2.) Confused yet? Sorry! Here's an example of a normal referer, where example.com is my site. Note how the referer ends in a slash:
nnn.nnn.nnn.nnn - - [14/Sep/2009:11:16:37 -0700] "GET /dir/file.gif HTTP/1.1" 200 6079 "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"
And here's an abnormal-referer hit from the same host in the same second -- no slash:
nnn.nnn.nnn.nnn - - [14/Sep/2009:11:16:37 -0700] "GET /dir/file.html HTTP/1.1" 403 1499 "http://www.example.com" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"
(The latter 403'd because of the fake ref.)
3.) If you eyeball your raw access_log(s), you'll see the normal, same-site, top-level, slash-ending referers. Do you see any no-slash abnormalities?
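If grep is handier than eyeballing, something like this pulls the slashless hits out of a combined-format log. The sample data is inlined here purely for illustration; on a live server you'd point the grep at your real access_log, and example.com is a stand-in for your own hostname:

```shell
# Build a two-line sample log (combined format) just for illustration;
# on a real server, grep your actual access_log instead.
printf '%s\n' \
  '1.2.3.4 - - [14/Sep/2009:11:16:37 -0700] "GET /dir/file.gif HTTP/1.1" 200 6079 "http://www.example.com/" "Mozilla/4.0"' \
  '1.2.3.4 - - [14/Sep/2009:11:16:37 -0700] "GET /dir/file.html HTTP/1.1" 403 1499 "http://www.example.com" "Mozilla/4.0"' \
  > /tmp/sample_log1
# A referer that ends right at the hostname -- closing quote after
# ".com", no trailing slash -- is the abnormal case:
grep -E '"https?://(www\.)?example\.com"' /tmp/sample_log1
# -> prints only the /dir/file.html line
```

The closing quote in the pattern is what does the work: a normal same-site referer has a slash between ".com" and the quote, so only the abnormal hits match.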
Looking through my logs, I haven't seen any slashless referrals in a long time (many, many years). I've never seen any that switch user-agents as you described, but obviously that's a huge red flag...
Well-spotted!
Jim
I've received no "reports" through my 403-Forbidden contact form complaining about this, either. So 99.99% sure it's just 'bots.
Note that I can be sure it's OK to 403 these guys: if that slash were genuinely missing from a legitimate home-page request (say, a typed-in slashless URL not already auto-corrected by the browser), the server would issue a redirect adding the slash before serving the page. So I should never see any 'real' referrers of "example.com" with no slash.
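For the record, the test itself is tiny in mod_rewrite terms. This is a sketch only, with example.com standing in for the real hostname:

```apache
# Sketch: no legitimate visitor can send our own hostname as a
# referer *without* the trailing slash, so 403 anything that does.
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?example\.com$ [NC]
RewriteRule .* - [F]
```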
Jim
Some of the old UAs:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.14) Gecko/2009082707 Firefox/3.0.14
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; MDDS)
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0
Beats heck outta me. Both slashless referer forms --
http://example.com
http://www.example.com
-- appeared in all grepped logs, for html and image files; sometimes from the first hit to /, sometimes only after that hit. A few misc. observations:
1.) Platforms were overwhelmingly PC (one unknown; one PDA; no Mac), and likewise the UAs, but for one Opera, one nasty --
POE-Component-Client-HTTP/0.65 (perl; N; POE; en; rv:0.650000)
-- a few Chromes, and a bunch of BlackBerries:
BlackBerry8700/4.2.1 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/100
All BlackBerries came from the same IP range (206.53.153.nn), the rDNS of which is blackberry.net.
2.) Hosts ranged from individual ISP accounts like Comcast to major leaguers like foxinc.com, edwardjones.com and .bbc.co.uk. The latter make me wonder if there's some network-related process going on.
But...
3.) Another ill-fitting puzzle piece is that the percentage of bare IPs (no rDNS) was much, much higher in the 'fake ref' group than is usually seen in other groups grepped for X,Y,Z. Manually checking a few of the IPs showed a mix of server farms and apparently 'normal' ISPs.
4.) Known bad Hosts and IPs were mixed in with the maybe-okay Hosts (e.g., the big corporations), making whitelisting, or more narrow blacklisting, really tricky.
Bottom line?
Clear as mud. Still. Thoughts?
Since the IPs are 'all over the map,' just continue blocking them by missing-slash referrer, and forget about it. If you *ever* get an e-mail complaint, I'll be surprised. The only ones worth the least bit of worry are the mobile UAs -- since phones are typically rushed to market, their HTTP compliance is quite spotty...
Jim
The most irksome part for me is blocking the ones that first hit without any referer. They're able to snag everything page-related before they even try a fake ref on-site. Of course, the second they fake it, they're 403'd, but they still get in that first bite. Some even switch back and forth, starting out w/o a fake, then faking it, then not.
Fortunately, this is all more geekily intriguing (& procrastination-worthy) than seriously problematic, at least at this moment.
If only I could adapt the likes of Key_Master's Perl script [webmasterworld.com] from way back and automate the 403 process for the repeat fakers. Then new ones would get one hit, but that'd be it.
# Rewrite self-referring page requests to Key_Master script
RewriteCond %{REQUEST_URI}>%{HTTP_REFERER} ^(.+)$
RewriteCond %1 ^/somepage\.html>http://www\.example.com/somepage\.html$ [OR]
RewriteCond %1 ^/otherpage\.html>http://www\.example.com/otherpage\.html$ [OR]
RewriteCond %1 ^/thirdpage\.html>http://www\.example.com/thirdpage\.html$
RewriteRule ^ /bad_bot.pl [L]
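Not Key_Master's script, but a rough shell sketch of the same 'automate the 403' notion: harvest the IPs that sent a missing-slash self-referer and emit Apache deny lines for them. The log path, hostname, and output file are all assumptions to adapt (sample data inlined so the sketch is self-contained):

```shell
# Sketch: build a deny list from IPs that sent a slashless self-referer.
# Assumes 'combined' log format; substitute your real access_log path
# and hostname. Sample log inlined for illustration only.
printf '%s\n' \
  '1.2.3.4 - - [14/Sep/2009:11:16:37 -0700] "GET /a.html HTTP/1.1" 403 1499 "http://www.example.com" "Mozilla/4.0"' \
  '5.6.7.8 - - [14/Sep/2009:11:17:02 -0700] "GET /b.html HTTP/1.1" 200 900 "http://www.example.com/" "Mozilla/5.0"' \
  > /tmp/sample_log2
# Field 1 of the combined format is the client IP; the regex matches
# only referers with no slash after the hostname.
awk '$0 ~ /"https?:\/\/(www\.)?example\.com"/ { print "Deny from " $1 }' \
  /tmp/sample_log2 | sort -u > /tmp/deny_list.conf
cat /tmp/deny_list.conf
# -> Deny from 1.2.3.4
```

Run from cron against the live log and Include the output in httpd.conf, and the repeat fakers would get that one first bite and nothing after.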
Jim
I specifically exclude directory 'index page' request URLs (requested URLs ending with a slash) from the missing-slash-referrer test for the reason Umbra stated.
Also, some legitimate search and SEO tool robots will omit the slash, so additional exceptions may be needed on your site.
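Sketched out, those exceptions might look like this -- the directory-index exclusion plus one placeholder robot UA; the names are illustrative, not tested rules:

```apache
# Sketch: skip the missing-slash-referer 403 when the request itself is
# a directory index (requested URL ends in a slash), or when the UA is
# a robot you've decided to allow (Googlebot here as a placeholder).
RewriteCond %{REQUEST_URI} !/$
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?example\.com$ [NC]
RewriteRule .* - [F]
```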
Jim