Is it possible to block matching pairs of Hosts with one RewriteRule?

Spider(?) using twin Comcast accounts in 2 states at exact same time.

         

Pfui

10:18 am on Jul 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm being spidered/crawled/scraped by someone using techniques I've never seen before and don't know how to combat. If you're curious but want to skip the minutiae, feel free to jump from #1 to #6:)

1.) In a nutshell, someone's using "pairs" of Comcast accounts simultaneously and they keep changing the accounts, and browsers* -- and the accounts are in more than one state.

2.) Yeah, I know that sounds slightly nutty but look at this session excerpt (xx = obfuscated), and the seamless switch between Minnesota and Washington:

c-24-16-243-xx.hsd1.mn.comcast.net - - [10/Jul/2006:23:59:38 -0700]
c-24-16-243-xx.hsd1.wa.comcast.net - - [10/Jul/2006:23:59:38 -0700]
c-24-16-243-xx.hsd1.mn.comcast.net - - [10/Jul/2006:23:59:38 -0700]
c-24-16-243-xx.hsd1.wa.comcast.net - - [10/Jul/2006:23:59:38 -0700]
c-24-16-243-xx.hsd1.mn.comcast.net - - [10/Jul/2006:23:59:39 -0700]
c-24-16-243-xx.hsd1.wa.comcast.net - - [10/Jul/2006:23:59:39 -0700]
c-24-16-243-xx.hsd1.mn.comcast.net - - [10/Jul/2006:23:59:39 -0700]
c-24-16-243-xx.hsd1.wa.comcast.net - - [10/Jul/2006:23:59:39 -0700]

3.) Thus far I've rewritten the pairs as I've seen them, but yesterday, for example, they switched Hosts (and browsers) about every hour, and looking through my logs they've apparently been at this for quite some time (dangit).

RewriteCond %{REMOTE_HOST} ^c-24-22-194-xx\.hsd1\.wa\.comcast\.net$ [OR]
RewriteCond %{REMOTE_HOST} ^c-24-22-194-xx\.hsd1\.mn\.comcast\.net$ [OR]
RewriteCond %{REMOTE_HOST} ^c-24-16-243-xx\.hsd1\.wa\.comcast\.net$ [OR]
RewriteCond %{REMOTE_HOST} ^c-24-16-243-xx\.hsd1\.mn\.comcast\.net$ [OR]
RewriteCond %{REMOTE_HOST} ^c-24-126-246-xx\.hsd1\.mn\.comcast\.net$ [OR]
RewriteCond %{REMOTE_HOST} ^c-24-126-246-xx\.hsd1\.ca\.comcast\.net$

4.) Then last night, while I was manually blocking every Minnesota+Washington pair I spotted coming in the door, they switched to Minnesota+California!

[11/Jul/2006:01:25:49] - /index.html - GET - 24.126.246.xx - c-24-126-246-xx.hsd1.ca.comcast.net
[11/Jul/2006:01:25:49] - /index.html - GET - 24.126.246.xx - c-24-126-246-xx.hsd1.mn.comcast.net
[11/Jul/2006:01:25:50] - /index.html - GET - 24.126.246.xx - c-24-126-246-xx.hsd1.ca.comcast.net
[11/Jul/2006:01:25:50] - /index.html - GET - 24.126.246.xx - c-24-126-246-xx.hsd1.mn.comcast.net
[11/Jul/2006:01:25:51] - /index.html - GET - 24.126.246.xx - c-24-126-246-xx.hsd1.ca.comcast.net
[11/Jul/2006:01:25:51] - /index.html - GET - 24.126.246.xx - c-24-126-246-xx.hsd1.mn.comcast.net

That IP verified as the California Host, btw. Not Minnesota. Ditto the multiple Washington IPs used -- those verify. So it looks like the spoofs are all the .mn.comcast.net Minnesotas. So I added the following, with a redirect to a private e-me-for-access page so I wouldn't lose too many legit Minnesotans:

RewriteCond %{REMOTE_HOST} \.hsd1\.mn\.comcast\.net$ [OR]

Didn't slow the bad guy a bit. They just hit the special page w/ another faux Minnesota account and kept on hitting the main site with a California account. Again simultaneously, and seamlessly.

5.) I don't know how anyone can use the same IP with different Hosts in different states, but they're doing it, and on the fly, too. I did find an interesting record of the same thing -- here's Google's cache [66.102.7.104] of the page. Look at the last two July entries in the "HF Propagation Logger" section ("wp4, n2's into wa.state" -- huh?) and you'll see .mn.comcast.net and .wa.comcast.net Hosts with identical in-name IPs.

6.) So-o-o I don't know what's going on with all of the above but I want to stop it. Do you think there's any way to craft a mod_rewrite rule such that --

If a .comcast.net visitor uses two accounts with the identical IP-as-Host-prefix at the exact same time, both get blocked?

(And if there isn't a mod_rewrite solution, do you have any suggestions as to how to stop what's going on?) TIA!

*
P.S.
Some of the pairs' UAs:

"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 3.1)"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; MSDigitalLocker Vista 1.3; SV1; (R1 1.5); SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; WinFX RunTime 3.0.50727; .NET CLR 1.1.4322)"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Tablet PC 1.7; .NET CLR 2.0.50727; InfoPath.1)"
"Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4"
"Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/416.11 (KHTML, like Gecko) Safari/416.12"

jdMorgan

2:37 pm on Jul 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As far as Apache is concerned, there is no such thing/concept as "simultaneous" requests; Apache handles each request independently, in whatever order they arrive, and no request has any effect on, or dependency on, any previous or subsequent request.

This is because HTTP is a "stateless" protocol; the server itself does not track the previous or current state of "user sessions" in any way. Apache receives a request for a resource, serves that resource, and promptly forgets all about it.

This is one reason that cookies were invented: to track the state of a user session across multiple requests, and even across multiple sessions. And for clients that do not support (or that disable) cookies, we have the "SessionID=1a3f8c"-style query string tracking method.

I'd suggest a modified version of xlcus' and AlexK's "runaway bot" script, published in the WebmasterWorld PHP forum library. The modification would be to change it to use RDNS lookups instead of straight IP addresses, and to mask off the specific IP embedded in the hostname returned by the lookup. It's an involved project, as you will also have to modify the hashing function used to 'remember' previous requests during the sample period.

A simpler approach might be to detect Comcast, and present a 'captcha' page, explaining the current problem and asking them to type in what they see in order to enter your site proper. The amount of script work and testing for this approach would be far less than the other method.
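The detection half of the captcha idea could be sketched in mod_rewrite roughly as follows. This is only an illustration under assumptions: /challenge.php and the captcha_ok cookie are hypothetical names (the challenge script would have to set that cookie after a correct answer), and REMOTE_HOST only holds a hostname when the server performs hostname lookups, as the REMOTE_HOST conditions earlier in the thread already presume.

```apache
RewriteEngine On
# Skip the (hypothetical) challenge page itself so the redirect cannot loop.
RewriteCond %{REQUEST_URI} !^/challenge\.php$
# Match any Comcast hostname, whatever state label it carries.
RewriteCond %{REMOTE_HOST} \.comcast\.net$ [NC]
# Let through visitors who already carry the (hypothetical) pass cookie.
RewriteCond %{HTTP_COOKIE} !captcha_ok=1
# Everyone else from Comcast is sent to the challenge page.
RewriteRule ^ /challenge.php [R=302,L]
```

Any images or stylesheets the challenge page loads would need their own exclusion conditions, and a cookie this simple can be forged, so treat it as a speed bump rather than a lock.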

This is either a bot-net, or a whole bunch of consumer PCs somehow configured as open proxies. If the level of accesses begins to affect your site's performance, then file a Denial of Service (DoS) report with Comcast and let them investigate -- they have far better tools and infrastructure to track this kind of abuse.

Jim

incrediBILL

2:58 pm on Jul 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think you may be confused about 'pairs' because of what looks like arbitrary hostnames being picked from a reverse DNS lookup.

Did a manual lookup from my server, check this out:

nslookup 24.16.243.x
x.243.16.24.in-addr.arpa name = c-24-16-243-x.hsd1.mn.comcast.net.
x.243.16.24.in-addr.arpa name = c-24-16-243-x.hsd1.wa.comcast.net.

nslookup 24.126.246.1
x.246.126.24.in-addr.arpa name = c-24-126-246-x.hsd1.mn.comcast.net.
x.246.126.24.in-addr.arpa name = c-24-126-246-x.hsd1.ca.comcast.net.

I'm getting two responses for each IP address in that group, each claiming a different geolocation according to Comcast's naming conventions. They aren't exactly pairs of IPs or locations. Comcast MAY be transitioning IPs to a new geolocation, or their reverse DNS is just a mess; who knows with Comcast.

I use Comcast and checked my IP address and mine was normal.

The variety of UAs is the only amusing thing.

Comcast is a hotbed of scraping, but without seeing your log I can't tell just from this information if it's a real scrape or just a case of DNS confusion at work here.
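If the two names really are the same address seen through a doubled reverse DNS record, then each "pair" can at least be collapsed into one condition with a regex alternation. A rough sketch, with xx standing for the octet obfuscated in this thread (substitute the real value before use):

```apache
RewriteEngine On
# One condition per address: the state label is the only part that varies.
RewriteCond %{REMOTE_HOST} ^c-24-16-243-xx\.hsd1\.(mn|wa)\.comcast\.net$ [OR]
RewriteCond %{REMOTE_HOST} ^c-24-126-246-xx\.hsd1\.(mn|ca)\.comcast\.net$
# Refuse matching requests with 403 Forbidden.
RewriteRule ^ - [F]
```

Replacing (mn|wa) with [a-z][a-z] would cover any two-letter state label, at the cost of being a little less specific.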

[edited by: incrediBILL at 3:14 pm (utc) on July 11, 2006]

Pfui

5:17 pm on Jul 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks, gents, for your thoughtful replies! Overnight, they continued the MN+WA and MN+CA switching, always getting blocked for all MNs but still hitting things like there was nothing wrong. And no e-mails to moi.

Then this morning, MN+CT.

We have over a quarter million files in archive so my first goal is to protect those (and bandwidth!). So I just added blocks for the various states in the archived areas and hope I don't get snowed under by e-mails from too many real Comcast people all day long:)

Jim, I'll look into the captcha thing ASAP, thanks. Last Fall, I had trouble getting the Perl version of the bad bot script to work (the allow,deny aspects) but I'm on the cusp of revisiting it because of whatever is going on.

Bill, your "pairs" lookups certainly resemble what I'm seeing. Thing is, I thought non-biz (or regular rate) people with Comcast had a static IP until they rebooted, at which point they were assigned another one. But even if this is a new network-wide mapping of sorts, it's only for some Comcast Hosts, and there's always a MN connection.

I'll be happy to e- or Sticky you a chunk o' log for your processing pleasure (you, too, Jim, seeing as how I know youse guys:) But it all pretty much boils down to this --

c-24-2-220-xx.hsd1.mn.comcast.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
07/11 09:28:24 /index.html
07/11 09:28:35 /index.html

c-24-2-220-xx.hsd1.ct.comcast.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
07/11 09:27:59 /index.html
07/11 09:28:14 /index.html
07/11 09:28:55 /index.html
07/11 09:29:13 /index.html
07/11 09:29:28 /index.html

I left the graphics out of those excerpts but those retrievals look the same -- almost as if MN+? is a huge network of caching servers, but each with completely different names and IPs. Or something!

[edited by: Pfui at 5:20 pm (utc) on July 11, 2006]

incrediBILL

8:18 pm on Jul 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That's because this block of IPs seems to show the same duality:

nslookup 24.2.220.x
x.220.2.24.in-addr.arpa name = c-24-2-220-x.hsd1.ct.comcast.net.
x.220.2.24.in-addr.arpa name = c-24-2-220-x.hsd1.mn.comcast.net.

If they're loading images and pages, probably not scrapers even if your stats program is ping-ponging between the two names. Very odd.

If you want, Sticky me ALL activity for 24.2.220.x and I'll tell you what I think about it.

Also, I don't think trying to firewall them by host name is effective at all:

RewriteCond %{REMOTE_HOST} ^c-24-22-194-xx\.hsd1\.wa\.comcast\.net$ [OR]

Just block the actual IP address, like 24.22.194.xx; otherwise you have to do two entries for the same IP.
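As a sketch of what that could look like, the six hostname conditions from the first post reduce to three address conditions on REMOTE_ADDR, which needs no DNS lookup at all. The xx is the obfuscated octet from this thread and must be replaced with the real value; the rule that follows (a plain 403) is an assumption, since the original rule line was never posted.

```apache
RewriteEngine On
# Match on the client address itself, so the doubled hostnames no longer matter.
RewriteCond %{REMOTE_ADDR} ^24\.22\.194\.xx$ [OR]
RewriteCond %{REMOTE_ADDR} ^24\.16\.243\.xx$ [OR]
RewriteCond %{REMOTE_ADDR} ^24\.126\.246\.xx$
# Refuse matching requests with 403 Forbidden.
RewriteRule ^ - [F]
```

Dropping the final octet and the trailing $ (for example ^24\.22\.194\.) would block the whole /24, which is broader than anything suggested above.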