Forum Moderators: DixonJones

Suddenly, scads of hits to "partial" filenames.

Okay, sleuths. Who -- or what -- goes there?

         

Pfui

10:46 pm on Apr 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Since April 4, I've seen the strangest things in my logs, on just one site. And I'm stumped as to what to call this oddity other than what I've been calling the suspect visitors:

"Partials."

Not sexy, I know. But that pretty much sums up what are hundreds of hits to truncated -- or "partial" -- filenames.

For example, say I have these files:

/dir1/a01.Alphabet-List.html
/dir2/b02.Alphabetical.html
/dir3/c03.Alphabeta.html
/dir4/d04.Alphanumeric.html

All of the files are stable, the case-sensitive file names and file system haven't changed for five-plus years, neither has the server software, nor the DNS, IP, etc. But suddenly, ISPs via host names and IPs apparently just from the U.S. -- including a well-known federal agency, a national non-profit, a university, a culinary establishment, an international engineering consultancy, a state government, numerous national telcos and cablecos -- using non-Macintosh UAs (Hmmm... an MSIE 6.0; Windows NT 5.1 thing? See P.S.), giving no referer info -- ALL are hitting/erring these kinds of variations:

/dir1/a01.Alphabet-List.ht
/dir2/b02.Alph
/dir3/c03.Alphabeta.h
/dir4/d04.Alphanu

See what I mean by "partials"? The hits are to partial filenames. As of right now, maybe 10 or so files in multiple directories, each with maybe three truncated variations. And each suspect visitor only hits one partial filename one time. Other visitors -- thousands of them -- have no problems whatsoever.

Huh-wha?

At first I thought some site coded a lot of links incorrectly. Really incorrectly. But then I realized there are too many hits, to too many filename variations, across too many different directories. And unlike most of my linkees, none of the partials ever come in on or stick around to go to ANY other pages. They hit my custom error page and they're gone.

Drat. So much for the Incoming Links Theory.

So after a week of watching the partials hit and run, I decided to refresh/redirect them to a special IP with instructions on how to e-me for access. I figured I'd get at least a couple of people touching base and then I could ask them from whence they came.

No such luck. Some partials follow the redirect, some don't. And those that do -- no e-mails.

So I'm left to wonder: Did some crawler make a mess of my URLs? Why are all these visitors suddenly hitting on similarly wrong URLs? What in the heck is going on?

Thoughts?

---
P.S.
For you UA sleuths, a sampling. Curiously, all are MSIE 6.0; Windows NT 5.1:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; NYU-2002; SV1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; FunWebProducts; (R1 1.5); .NET CLR 1.1.4322)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FunWebProducts; .NET CLR 1.1.4322; .NET CLR 1.0.3705)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; (R1 1.5); .NET CLR 1.0.3705; .NET CLR 1.1.4322)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; InfoPath.1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; FunWebProducts; InfoPath.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
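
A minimal sketch of how "partials" like those described above might be picked out of a list of requested paths: a request counts as a partial when it is a proper prefix of a known file path but not an exact match. The function name `find_partials` and the sample request list are assumptions for illustration; only the four example file paths come from the post.

```python
def find_partials(known_paths, requested_paths):
    """Return (requested, full) pairs where requested is a truncated known path."""
    known = set(known_paths)
    hits = []
    for req in requested_paths:
        # Exact matches are normal requests; empty strings and bare
        # directory paths are not treated as truncated filenames.
        if not req or req in known or req.endswith("/"):
            continue
        for full in known_paths:
            if full.startswith(req) and len(req) < len(full):
                hits.append((req, full))
                break
    return hits


# File paths taken from the example in the post.
KNOWN = [
    "/dir1/a01.Alphabet-List.html",
    "/dir2/b02.Alphabetical.html",
    "/dir3/c03.Alphabeta.html",
    "/dir4/d04.Alphanumeric.html",
]

# Hypothetical requests, as they might be extracted from an access log.
REQUESTED = [
    "/dir1/a01.Alphabet-List.ht",
    "/dir2/b02.Alph",
    "/dir3/c03.Alphabeta.html",
    "/dir4/d04.Alphanu",
    "/favicon.ico",
]

PARTIALS = find_partials(KNOWN, REQUESTED)
for req, full in PARTIALS:
    print(f"{req}  (truncated from {full})")
```

Matching on prefixes rather than on status codes alone separates truncated requests from ordinary 404s such as a missing favicon.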

cgrantski

1:05 pm on Apr 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What status code are these hits getting? Anything other than 404 would be interesting... Also check the time-to-serve field if your logs record it (IIS can log it as time-taken; Apache can too, via %T in a custom LogFormat).

fiestagirl

8:46 pm on Apr 18, 2006 (gmt 0)

10+ Year Member



Same here.
Since 4/5.
Partial file names - ALWAYS 40 CHARACTERS - from www to the end.
All MSIE 6 and NT 5.1
IPs from all over the US.
No referrers.
Most request the partial url 2-5 times.
Never attempt to use another link on the custom 404.

Well at least I know that we are not alone in this. I've been tearing out my hair, trying to investigate.

The only observation that I have to add is the 40 character "limit".

Alex_Miles

8:54 pm on Apr 18, 2006 (gmt 0)

10+ Year Member



I found these in my stats as well, listed under 404s

Pfui

9:59 pm on Apr 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Good sleuthing, gang! Thanks for your replies and three cheers you could make sense out of what I was trying to describe. ('Twas a bit like explaining how to tie one's shoes -- in writing.)

Let's see:

- I use Apache (1.3.x) on a Linux box.
- Status codes are 302 with the redirect to the special page.

- The full URLs I'm seeing are always 47 characters from http to the end. I'd been looking at the file names but fiestagirl's 40-character limit observation made me revisit the total URL length including directory names. Sure enough, ALL are 47 except for one at 51 -- 47 if you lop off the triple-dubya prefix.

How odd that we're seeing similar, but different, limits.
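
One thing worth checking when comparing the two limits is where the counting starts: "http://" is seven characters long, so a URL that is 47 characters counted from http is 40 characters counted from www. A rough sketch of tallying lengths both ways follows; the `lengths` helper, the example.com hostname, and the cut at 47 characters are all assumptions for illustration, not data from anyone's logs.

```python
from collections import Counter


def lengths(url):
    """Measure one URL three ways: whole, minus the scheme, minus scheme and www."""
    no_scheme = url.split("://", 1)[-1]
    no_www = no_scheme[4:] if no_scheme.startswith("www.") else no_scheme
    return {
        "from_http": len(url),
        "from_www": len(no_scheme),
        "no_www": len(no_www),
    }


# Hypothetical full URLs, cut to 47 characters to imitate the reported hits.
FULL = [
    "http://www.example.com/dir1/a01.Alphabet-List.html",
    "http://www.example.com/dir4/d04.Alphanumeric.html",
]
SAMPLE = [u[:47] for u in FULL]

# One length histogram per way of counting.
TALLY = {
    key: Counter(lengths(u)[key] for u in SAMPLE)
    for key in ("from_http", "from_www", "no_www")
}

for key, counts in TALLY.items():
    print(key, dict(counts))
```

If real log data shows a single spike under one of these measures, that measure is probably the one the truncating client applies.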

Hmm...

Given that at least two of us started to see partials on darn near the same day, and that the browsers' basic platform is the same, and that the suspect visitors don't appear to be 'real'...

Was there a recent browser or system upgrade vis-a-vis MSIE 6 and/or NT 5.1 (for U.S. systems) that might suggest an URL-truncating bug? Perhaps in connection with bookmark-, favicon- or link-checking?

I'm a Mac person so here's hoping there are MSIE 6 and/or NT 5.1 folks here who might help solve this ongoing mystery.

jdMorgan

11:33 pm on Apr 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is probably some dumb scraper using those other servers as open proxies, with a character-count limit on the request URL-path.

Remember that the hostname (domain) and the URL-path are sent separately. So if you subtract the length of your domain name from the 40/47-character limit, you'll probably find a common URL-path length. In other words, I suspect that fiestagirl's domain is seven characters shorter than Pfui's, or eleven with the "www." prefix.

What I mean is that HTTP/1.1 requests are sent like this:

GET /index.html HTTP/1.1
Host: www.example.com

So, from the evidence here, it is the "GET" line that has the character limit, and the Host header can vary in length independently, accounting for the reported differences in the overall URL length.
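
To make the point concrete, here is a small sketch of how a client that truncates only the request path would produce different overall URL lengths on different hosts. The `build_request` helper, the 20-character limit, and both hostnames are hypothetical; nothing here is known about the actual client behind these hits.

```python
from urllib.parse import urlsplit


def build_request(url, path_limit=None):
    """Return (raw request text, URL as it would appear in a log)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path_limit is not None:
        # Hypothetical client-side limit applied to the path only.
        path = path[:path_limit]
    request = f"GET {path} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"
    logged_url = f"{parts.scheme}://{parts.netloc}{path}"
    return request, logged_url


PATH = "/dir1/a01.Alphabet-List.html"
REQ_A, URL_A = build_request("http://www.example.com" + PATH, path_limit=20)
REQ_B, URL_B = build_request("http://www.longer-example.org" + PATH, path_limit=20)

print(REQ_A.splitlines()[0])       # same request line on both hosts
print(len(URL_A), len(URL_B))      # overall lengths differ by the hostname gap
```

The request line is identical for both hosts, while the reconstructed URL lengths differ by exactly the difference in hostname length.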

Jim

Pfui

7:34 am on Apr 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Jim, you're the server pro plus I admit I'm not the least bit savvy about proxies. So are you saying it's possible all the servers I'm/we're seeing are being used as (presumably) unwitting relays -- that they're all actively compromised in some way? By something born and widely viral in, say, 24 hours?

(I feel a movie coming on...)

Seriously, if yes, and if a scraper suddenly spawned more tentacles than Medusa, its creator strikes me as a lot smarter than any GET length-limiting 'dumbness' might suggest.

I mean, shoot, the servers I'm seeing are addressed to everything from a federal alphabet entity to apparently individual cable accounts. The latter I half-expect to be at risk for trouble from 'outside.' But the former --

Thank God the partials are generating errors, not payloads.

So, seeing as how who/whatever's behind the partials is still successfully cloaked, any tips on how to forbid the hits from the get-go?

Pfui

11:40 pm on May 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And the hits go on.

jomaxx provided a ton of data about the IPs -- "2,800 requests of this type from 1,400 unique IP addresses and 200+ unique browser identifiers" -- and bobothecat reports seeing this snafu, too. Here's jomaxx's info: "000s of Truncated Page Requests from Many IPs" [webmasterworld.com]

I have nowhere near as many hits as does jomaxx but their observations -- "I did a reverse IP lookup on the top 15. 4 of them resolved to universities/colleges around the US" -- echo mine. A surprising majority of hits hail from .edu hosts (in no particular order):

Columbia, Fordham, James Madison, Prince George's Community College, Binghamton, Sacred Heart, Indiana, Nevada, Texas, South Carolina, and Western Oregon Univ.

Aside from one lone hit from Finland, and one from the UK, I'm still seeing U.S.-only hits, ranging from government entities in Alaska, Michigan, Nevada and Wisconsin, to the FCC and the VA, plus assorted PPPoX IPs and these ISPs, my Top Ten:

.br.br.cox.net
.hr.hr.cox.net
.lv.lv.cox.net
.oc.oc.cox.net
.ri.ri.cox.net
.dyn.optonline.net
.dsl.chcgil.ameritech.net
.ma.charter.com
.hsd1.ma.comcast.net
.hsd1.wa.comcast.net
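
A quick sketch of how resolved hostnames like these might be tallied by provider, assuming the reverse lookups have already been done. The `provider` helper and the sample hostnames are made up for illustration, and keeping only the last two labels is a naive rule that would misgroup domains such as those under .co.uk.

```python
from collections import Counter


def provider(hostname, labels=2):
    """Reduce a hostname to its last few labels, e.g. 'cox.net'."""
    parts = hostname.strip(".").lower().split(".")
    return ".".join(parts[-labels:])


# Hypothetical reverse-DNS results in the style of the list above.
HOSTS = [
    "host1.br.br.cox.net",
    "host2.lv.lv.cox.net",
    "host3.hsd1.ma.comcast.net",
    "host4.hsd1.wa.comcast.net",
    "host5.dyn.optonline.net",
    "host6.ma.charter.com",
]

BY_PROVIDER = Counter(provider(h) for h in HOSTS)
for name, count in BY_PROVIDER.most_common():
    print(f"{count:3d}  {name}")
```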

Is anyone else seeing a preponderance of those Usual Suspects, too?

Only a few "partials" have ever come back a second time, and all still always hit too-short URLs and no other pages, always sans referers. The UAs are also still variations of "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1" -- with one 5.0 yesterday.

Coming Up: Got Mobile?

.
P.S.
Thanks to Receptional and engine for clearing the floobydust out of this thread's bit bucket;)

[edited by: Pfui at 12:02 am (utc) on May 11, 2006]

Pfui

12:00 am on May 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is a quickie. (Well, kinda-sorta:) I'm still iffy that we're seeing a robot, at least not the usual kind of robot (or scraper; crawler; spider; whatever). So I've been trying to think of what else might want/need to truncate URLs to 40 characters or fewer --

Mobile devices?

For example, Sitescooper (Sitescooper.org [sitescooper.org]). You can cloak its UA. Plus here's a snippet from an older [backpan.perl.org] Perl script (emphasis mine):

if (!defined $name{$url}) {
    $name{$url} = $url;
    if ($url =~ m,/([^/]+)$,) {
        $_ = $1;
        if (length ($_) > 40) {
            # trim out spare stuff to keep it short.
            s,^([^:]+://[^/]+)/.*/([^/]+$),$1/.../$2,i;
            $name{$url} = $_;
        } else {
            $name{$url} = $_;
        }
    }
}

Not being a Perl pro, I'm probably reading that all wrong:) But seeing as how Google's Transcoder UA is causing me all sorts of grief right now because it's coming in through IPs, I've got mobiles on my mind. So I was just wondering...

Might mobiles be behind the "partials"?

jomaxx

2:13 am on May 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Probably a waste of time, but I wonder if anyone has tried identifying this spider using PHP variables such as HTTP_ACCEPT_LANGUAGE, HTTP_ACCEPT_CHARSET, HTTP_ACCEPT_ENCODING, etc.
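
A rough sketch of that idea, written in Python rather than PHP: group requests by the combination of Accept-style headers they send, so that a client reusing one header set behind many IPs stands out. The variable names follow the CGI convention that PHP's $_SERVER and Python's WSGI environ both use; the `fingerprint` helper and the sample header values are assumptions, not captured traffic.

```python
from collections import Counter

# Header variables named in the post, in CGI/WSGI form.
FIELDS = (
    "HTTP_ACCEPT_LANGUAGE",
    "HTTP_ACCEPT_CHARSET",
    "HTTP_ACCEPT_ENCODING",
)


def fingerprint(environ):
    """Return the tuple of header values; a missing header becomes ''."""
    return tuple(environ.get(name, "") for name in FIELDS)


def group_by_fingerprint(requests):
    """Count how many requests share each header combination."""
    return Counter(fingerprint(env) for env in requests)


# Hypothetical request environments for the 404 hits under study.
REQUESTS = [
    {"HTTP_ACCEPT_LANGUAGE": "en-us", "HTTP_ACCEPT_ENCODING": "gzip, deflate"},
    {"HTTP_ACCEPT_LANGUAGE": "en-us", "HTTP_ACCEPT_ENCODING": "gzip, deflate"},
    {},  # a client that sent none of these headers
]

GROUPS = group_by_fingerprint(REQUESTS)
for combo, count in GROUPS.most_common():
    print(count, combo)
```

A large share of hits with empty or identical values across unrelated IPs would hint at one piece of software rather than many independent browsers.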

Pfui

8:46 pm on May 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



jomaxx, am not a PHP person, sorry. But one Q for you, please? What program do you use such that you're able to distill such exquisite tracking/logging info? (Now watch it be PHP-specific...) TIA

jomaxx

11:46 pm on May 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oh, I use a custom log processing program to analyze activity (including identifying 404's each day). For additional research I usually open up the log file directly using TextPad.

I tried to get some information using Javascript on my 404 page, but the first 5 or 10 of the cases we're discussing all had Javascript disabled and executed the NOSCRIPT block instead.

I have to make some kind of a configuration change to my server in order to make my 404 page handle PHP. I'm not quite curious enough to do that yet. I was a few days ago, but as soon as I verified that a bunch of people were seeing the same thing, the matter became a little less pressing.