"Partials."
Not sexy, I know. But that pretty much sums up what are hundreds of hits to truncated -- or "partial" -- filenames.
For example, say I have these files:
/dir1/a01.Alphabet-List.html
/dir2/b02.Alphabetical.html
/dir3/c03.Alphabeta.html
/dir4/d04.Alphanumeric.html
All of the files are stable. The case-sensitive file names and file system haven't changed for five-plus years, and neither have the server software, the DNS, the IP, etc. But suddenly, visitors from host names and IPs apparently all in the U.S. -- including a well-known federal agency, a national non-profit, a university, a culinary establishment, an international engineering consultancy, a state government, numerous national telcos and cablecos -- using non-Macintosh UAs (Hmmm... an MSIE 6.0; Windows NT 5.1 thing? See P.S.) and giving no referer info are ALL requesting, and getting errors on, these kinds of variations:
/dir1/a01.Alphabet-List.ht
/dir2/b02.Alph
/dir3/c03.Alphabeta.h
/dir4/d04.Alphanu
See what I mean by "partials"? The hits are to partial filenames. As of right now, maybe 10 or so files in multiple directories, each with maybe three truncated variations. And each suspect visitor only hits one partial filename one time. Other visitors -- thousands of them -- have no problems whatsoever.
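For anyone who wants to pull these out of their own logs, here is a minimal sketch in Python. It assumes you already have a list of requested paths (from the access log) and a list of the real files on the site; the function name `find_partials` and the sample lists are made up for illustration. A request counts as a "partial" when it is not a real file itself but is a strict prefix of one.

```python
def find_partials(requested, known):
    """Return {requested_path: [real paths it is a strict prefix of]}."""
    partials = {}
    for req in requested:
        # Exact hits and directory requests are not partials.
        if req in known or req.endswith("/"):
            continue
        matches = [full for full in known if full.startswith(req)]
        if matches:
            partials[req] = matches
    return partials


known_files = [
    "/dir1/a01.Alphabet-List.html",
    "/dir2/b02.Alphabetical.html",
    "/dir3/c03.Alphabeta.html",
    "/dir4/d04.Alphanumeric.html",
]
requests_seen = [
    "/dir1/a01.Alphabet-List.ht",
    "/dir2/b02.Alph",
    "/dir3/c03.Alphabeta.html",  # a normal, complete request
    "/no/such/page.html",        # an ordinary 404, not a partial
]

result = find_partials(requests_seen, known_files)
for req, fulls in result.items():
    print(req, "->", fulls[0])
```

A very short prefix can match more than one real file, which is why each partial maps to a list rather than a single path.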
Huh-wha?
At first I thought some site coded a lot of links incorrectly. Really incorrectly. But then I realized there are too many hits, to too many filename variations, across too many different directories. And unlike most of my linkees, none of the partials ever come in on or stick around to go to ANY other pages. They hit my custom error page and they're gone.
Drat. So much for the Incoming Links Theory.
So after a week of watching the partials hit and run, I decided to refresh/redirect them to a special IP with instructions on how to e-me for access. I figured I'd get at least a couple of people touching base and then I could ask them from whence they came.
No such luck. Some partials follow the redirect, some don't. And those that do -- no e-mails.
So I'm left to wonder: Did some crawler make a mess of my URLs? Why are all these visitors suddenly hitting on similarly wrong URLs? What in the heck is going on?
Thoughts?
---
P.S.
For you UA sleuths, a sampling. Curiously, all are MSIE 6.0; Windows NT 5.1:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; NYU-2002; SV1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; FunWebProducts; (R1 1.5); .NET CLR 1.1.4322)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FunWebProducts; .NET CLR 1.1.4322; .NET CLR 1.0.3705)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; (R1 1.5); .NET CLR 1.0.3705; .NET CLR 1.1.4322)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; InfoPath.1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; FunWebProducts; InfoPath.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
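To check which parts of the UA strings really are shared, a rough Python sketch follows. It simply splits the parenthesised part of each string on semicolons and intersects the results; the helper names `ua_tokens` and `common_tokens` are invented here, and only three of the strings above are used as sample input.

```python
def ua_tokens(ua):
    """Split the parenthesised part of a User-Agent string into tokens."""
    start = ua.find("(")
    end = ua.rfind(")")
    if start == -1 or end <= start:
        return set()
    return {tok.strip() for tok in ua[start + 1:end].split(";") if tok.strip()}


def common_tokens(user_agents):
    """Return the tokens present in every User-Agent string."""
    token_sets = [ua_tokens(ua) for ua in user_agents]
    return set.intersection(*token_sets) if token_sets else set()


samples = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; NYU-2002; SV1)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; FunWebProducts; (R1 1.5); .NET CLR 1.1.4322)",
]
shared = common_tokens(samples)
print(sorted(shared))
```

Everything outside the shared set (SV1, FunWebProducts, the .NET CLR versions and so on) varies from visitor to visitor.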
Well at least I know that we are not alone in this. I've been tearing out my hair, trying to investigate.
The only observation that I have to add is the 40 character "limit".
Let's see:
- I use Apache (1.3.x) on a Linux box.
- Status codes are 302 with the redirect to the special page.
- The full URLs I'm seeing are always 47 characters from http to the end. I'd been looking at the file names but fiestagirl's 40-character limit observation made me revisit the total URL length including directory names. Sure enough, ALL are 47 except for one at 51 -- 47 if you lop off the triple-dubya prefix.
How odd that we're seeing similar, but different, limits.
Hmm...
Given that at least two of us started to see partials on darn near the same day, and that the browsers' basic platform is the same, and that the suspect visitors don't appear to be 'real'...
Was there a recent browser or system upgrade vis-a-vis MSIE 6 and/or NT 5.1 (for U.S. systems) that might suggest an URL-truncating bug? Perhaps in connection with bookmark-, favicon- or link-checking?
I'm a Mac person so here's hoping there are MSIE 6 and/or NT 5.1 folks here who might help solve this ongoing mystery.
Remember that the hostname (domain) and the URL-path are sent separately. So if you subtract the length of your domain name from the 40/47-character limit, you'll probably find a common URL-path length. In other words, I suspect that fiestagirl's domain is seven characters shorter than pfui's, or eleven with the "www." prefix.
What I mean is that HTTP/1.1 requests are sent like this:
GET /index.html HTTP/1.1
Host: www.example.com
So, from the evidence here, it is the "GET" line that has the character limit, and the Host header can vary in length independently, accounting for the reported differences in the overall URL length.
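To compare numbers across sites, it may help to measure the host and the URL-path separately. The Python sketch below does that with the standard library's `urlsplit`; the example.com URLs and the helper names `url_lengths` and `request_head` are placeholders, not anyone's real data.

```python
from urllib.parse import urlsplit


def url_lengths(url):
    """Return the total, host and path lengths of a URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return {"total": len(url), "host": len(parts.netloc), "path": len(path)}


def request_head(url):
    """Build the request line and Host header a client would send."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return f"GET {path} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"


# Same truncated path, with and without the "www." prefix.
a = url_lengths("http://example.com/dir2/b02.Alph")
b = url_lengths("http://www.example.com/dir2/b02.Alph")
print(a)
print(b)
print(request_head("http://www.example.com/dir2/b02.Alph"))
```

The two totals differ by the four characters of "www." while the path length stays the same, which is the pattern to look for when comparing the 40- and 47-character reports.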
Jim
(I feel a movie coming on...)
Seriously, if yes, and if a scraper suddenly spawned more tentacles than Medusa, its creator strikes me as a lot smarter than any GET length-limiting 'dumbness' might suggest.
I mean, shoot, the servers I'm seeing are addressed to everything from a federal alphabet entity to apparently individual cable accounts. The latter I half-expect to be at risk for trouble from 'outside.' But the former --
Thank God the partials are generating errors, not payloads.
So, seeing as how who/whatever's behind the partials is still successfully cloaked, any tips on how to forbid the hits from the get-go?
jomaxx provided a ton of data about the IPs -- "2,800 requests of this type from 1,400 unique IP addresses and 200+ unique browser identifiers" -- and bobothecat reports seeing this snafu, too. Here's jomaxx's info: "000s of Truncated Page Requests from Many IPs" [webmasterworld.com]
I have nowhere near as many hits as does jomaxx but their observations -- "I did a reverse IP lookup on the top 15. 4 of them resolved to universities/colleges around the US" -- echo mine. A surprising majority of hits hail from .edu hosts (in no particular order):
Columbia, Fordham, James Madison, Prince George's Community College, Binghamton, Sacred Heart, Indiana, Nevada, Texas, South Carolina, and Western Oregon Univ.
Aside from one lone hit from Finland, and one from the UK, I'm still seeing U.S.-only hits, ranging from government entities in Alaska, Michigan, Nevada and Wisconsin, to the FCC and the VA, plus assorted PPPoX IPs and these ISPs, my Top Ten:
.br.br.cox.net
.hr.hr.cox.net
.lv.lv.cox.net
.oc.oc.cox.net
.ri.ri.cox.net
.dyn.optonline.net
.dsl.chcgil.ameritech.net
.ma.charter.com
.hsd1.ma.comcast.net
.hsd1.wa.comcast.net
Is anyone else seeing a preponderance of those Usual Suspects, too?
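If you want to tally your own Usual Suspects, here is a small Python sketch that counts already-resolved hostnames by their trailing labels. The hostnames are made up to resemble the suffixes listed above, and `suffix_counts` is a hypothetical helper; keeping only the last two labels is a crude grouping that will misfile domains like example.co.uk.

```python
from collections import Counter


def suffix_counts(hostnames, labels=3):
    """Count hostnames by their last `labels` dot-separated labels."""
    counts = Counter()
    for host in hostnames:
        parts = host.lower().rstrip(".").split(".")
        counts[".".join(parts[-labels:])] += 1
    return counts


# Made-up hostnames shaped like the suffixes listed above.
hosts = [
    "ip70-1-2-3.lv.lv.cox.net",
    "ip68-4-5-6.oc.oc.cox.net",
    "c-24-7-8-9.hsd1.ma.comcast.net",
    "c-67-1-2-3.hsd1.wa.comcast.net",
    "host1.example.edu",
]
by_isp = suffix_counts(hosts, labels=2)
for suffix, count in by_isp.most_common():
    print(count, suffix)
```

Raising `labels` to 3 or 4 separates the regional pools (ma.comcast.net versus wa.comcast.net) instead of lumping each ISP together.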
Only a few "partials" have ever come back a second time, and all still always hit too-short URLs and no other pages, always sans referers. The UAs are also still variations of "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1" -- with one 5.0 yesterday.
Coming Up: Got Mobile?
.
P.S.
Thanks to Receptional and engine for clearing the floobydust out of this thread's bit bucket;)
[edited by: Pfui at 12:02 am (utc) on May 11, 2006]
Mobile devices?
For example, Sitescooper (Sitescooper.org [sitescooper.org]). You can cloak its UA. Plus here's a snippet from an older [backpan.perl.org] Perl script (emphasis mine):
if (!defined $name{$url}) {
    $name{$url} = $url;
    if ($url =~ m,/([^/]+)$,) {
        $_ = $1;
        if (length ($_) > 40) {
            # trim out spare stuff to keep it short.
            s,^([^:]+://[^/]+)/.*/([^/]+$),$1/.../$2,i;
            $name{$url} = $_;
        } else {
            $name{$url} = $_;
Not being a Perl pro, I'm probably reading that all wrong:) But seeing as how Google's Transcoder UA is causing me all sorts of grief right now because it's coming in through IPs, I've got mobiles on my mind. So I was just wondering...
Might mobiles be behind the "partials"?
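For anyone else squinting at the Perl above, here is a rough Python port of just the substitution, to see what it would do to a string. This is a reading of the quoted snippet only, not of the whole script; the URL is a placeholder and `shorten` is a made-up name.

```python
import re

# Same pattern as the Perl s,,,i above: keep scheme://host and the last
# path segment, and replace everything in between with "...".
PATTERN = re.compile(r"^([^:]+://[^/]+)/.*/([^/]+$)", re.IGNORECASE)


def shorten(text):
    """Apply the substitution; text that does not match comes back unchanged."""
    return PATTERN.sub(r"\1/.../\2", text)


full = "http://www.example.com/dir1/sub/a01.Alphabet-List.html"
last_segment = full.rsplit("/", 1)[1]

print(shorten(full))
print(shorten(last_segment))
```

If this reading is right, the substitution collapses the middle directories of a whole URL into "/.../" and leaves the file name intact. Applied to the last path segment alone, which is what $_ holds in the snippet as quoted, it changes nothing, because there is no "://" to match. Either way it does not chop characters off the end of a file name, so it would not by itself explain the partials.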
I tried to get some information using JavaScript on my 404 page, but the first 5 or 10 of the cases we're discussing all had JavaScript disabled and executed the NOSCRIPT block instead.
I have to make some kind of a configuration change to my server in order to make my 404 page handle PHP. I'm not quite curious enough to do that yet. I was a few days ago, but as soon as I verified that a bunch of people were seeing the same thing, the matter became a little less pressing.