Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0;)

Repeated hits by this UA, with thedomain.com as referer.

montclairguy

5:49 pm on Dec 14, 2006 (gmt 0)

10+ Year Member



I have a bot hammering my site repeatedly with this forged User-agent:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0;)

Additionally, the Referer is always exactly my domain name, i.e. [my_domain_name.com...]

It's definitely a bot of some type and has been going on for I don't know how long. I've only recently started investigating this, since bandwidth and machine usage have been through the roof.

I see this forged IE User-agent coming from multiple IPs, most of them in the 69.230.*.* range, which are SBC / Pacific Bell dial-up or DSL accounts. I've started banning these IPs as they come up, but this is hardly a solution as these people (or person?) can just reconnect to get a new IP.

Anyone else seeing this activity?

[edited by: volatilegx at 7:13 pm (utc) on Dec. 14, 2006]
[edit reason] fixed broken user agent [/edit]

wilderness

8:25 pm on Dec 14, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Try this:
# UA ends with "5.0;)"
RewriteCond %{HTTP_USER_AGENT} 5\.0;\)$
# from IP 69.230.*.*
RewriteCond %{REMOTE_ADDR} ^69\.230\.
RewriteRule .* - [F]

[edited by: volatilegx at 3:18 pm (utc) on Dec. 15, 2006]
[edit reason] fixed unintended smiley [/edit]

wilderness

8:26 pm on Dec 14, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



the smiley face above should be a parenthesis

CORRECTION

the smiley face replaces both the semicolon and the parenthesis

montclairguy

9:08 pm on Dec 14, 2006 (gmt 0)

10+ Year Member



Thanks wilderness, but I believe that would block anybody actually using that version of the browser -- assuming that User-agent is even a valid one for IE. And, most of the hits are coming from there, but not all of them.

I was really more curious if anyone is seeing this type of activity in their logs.... lots of fast hits with that User-agent with your domain name as the referer for every hit.

incrediBILL

9:28 pm on Dec 14, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've seen some activity in this range that got snared in my bot blocker challenges, but they seem to come in short bursts of 30 pages or less.

wilderness

12:13 am on Dec 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks wilderness, but I believe that would block anybody actually using that version of the browser -- assuming that User-agent is even a valid one for IE. And, most of the hits are coming from there, but not all of them.

The lines I provided ONLY deny access to that UA, and only IF it comes from that IP range.
You may also expand the IP lines by adding other ranges.

Using multiple criteria reduces the chance of blocking innocent visitors.

montclairguy

2:36 pm on Dec 15, 2006 (gmt 0)

10+ Year Member



Hi Wilderness -- that was kinda my point, but you said it better.

I would risk blocking a real person (assuming "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0;)" really is an IE user agent) from pacbell's dynamically allocated DSL block. Also, for more scrutiny, the rule should test that the referer is the domain name.

That would all be great, if that user-agent was NOT a true IE UA, but I believe it is.

Unfortunately, this bot isn't taking the bait of my current bot trap (i.e. a non-visible link to a page denied in robots.txt, linked to a banning script), so I'm going to have to program something into my current shopping cart script to watch for this behavior and ban for 24-48 hours.

[edited by: volatilegx at 3:01 pm (utc) on Dec. 20, 2006]
[edit reason] fixed unintended smiley [/edit]
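
For reference, the extra referer test suggested above could be added to the earlier rule set roughly as follows. This is only a sketch: example.com stands in for the real domain, and the exact form of the forged referer (with or without http://, www. or a trailing slash) is an assumption that would need checking against the logs.

```apache
# Sketch only: example.com is a placeholder for the real domain.
RewriteEngine On
# UA ends with "5.0;)" (stray semicolon before the closing parenthesis)
RewriteCond %{HTTP_USER_AGENT} 5\.0;\)$
# Request comes from 69.230.*.*
RewriteCond %{REMOTE_ADDR} ^69\.230\.
# Referer is the bare domain and nothing else
RewriteCond %{HTTP_REFERER} ^(http://)?(www\.)?example\.com/?$ [NC]
RewriteRule .* - [F]
```

All three conditions must match before the request is refused, so a genuine visitor from the same IP block is only caught if the UA and the referer also fit the pattern.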

thetrasher

7:30 pm on Dec 15, 2006 (gmt 0)

10+ Year Member



That would all be great, if that user-agent was NOT a true IE UA, but I believe it is.
Why do you think that this is an IE UA? Does any real IE UA end with a semicolon and a parenthesis? (No)

BTW: Welcome to WebmasterWorld!

Umbra

7:34 pm on Dec 15, 2006 (gmt 0)

10+ Year Member



Why do you think that this is an IE UA? Does any real IE UA end with a semicolon and a parenthesis? (No)

I see tons of examples, from Mozilla/4.0 (compatible;) (proxy?) to Google Wireless Transcoder's user agent. Admittedly, these are not actual IE browsers...

[edited by: Umbra at 7:39 pm (utc) on Dec. 15, 2006]

[edited by: volatilegx at 3:02 pm (utc) on Dec. 20, 2006]
[edit reason] fixed unintended smiley [/edit]

motorhaven

9:57 pm on Dec 17, 2006 (gmt 0)

10+ Year Member Top Contributors Of The Month



A lot of scrapers are now using Google translation and wireless "proxies" to do their dirty deeds.

incrediBILL

2:38 am on Dec 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A lot of scrapers are now using Google translation and wireless "proxies" to do their dirty deeds.

Glad to see I'm not the only one who noticed this trend.

wilderness

11:49 am on Dec 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A lot of scrapers are now using Google translation and wireless "proxies" to do their dirty deeds.

Glad to see I'm not the only one who noticed this trend.

It's always been Jim's position/courtesy to allow access to those who jump through the hoops required by translator tools (anybody who's ever used one may attest to the difficulty).

However, from my point of view, if the IP comes from a registry whose region holds no possibility of benefit to your website(s), there is no need to allow access from a translator tool when access is not normally allowed from that IP range.

Proxies and colo's are of the same nature (at least in my book) and demand denial.

My own pages simply contain too much text for the small wireless devices. Hopefully a time will come that we as webmasters will be able to determine the difference between a cell phone with web access and a laptop with either a wireless or cell phone connection.

Even most of the new web accelerators provide the IP range of the tool rather than the user and also demand the aforementioned attention.

As usual, each webmaster must determine what is beneficial or detrimental to their own website(s).

jdMorgan

12:32 am on Dec 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I see tons of examples, from Mozilla/4.0 (compatible;) (proxy?) to Google Wireless Transcoder's user agent.

I sent a note to Google for the Transcoder application group. Hopefully, they'll fix their UA string.

Jim

Umbra

4:27 pm on Dec 19, 2006 (gmt 0)

10+ Year Member



Proxies and colo's are of the same nature (at least in my book) and demand denial.

Even AOL proxies? I started a thread here [webmasterworld.com] about handling proxies... but that didn't seem to produce any good conclusion yet.

wilderness

5:19 pm on Dec 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Even AOL proxies? I started a thread here about handling proxies... but that didn't seem to produce any good conclusion yet.

Unfortunately I'm stuck with the perils of AOL :(

Many of the visitors of widgets that come to my sites are AOL users because of the reliable flexibility in dial-up access when they move from state-to-state or even state-to-province.

Applying my denial to these AOL ranges is not an option for me.

Colo's and non-AOL proxies are a different issue for me.

Don

Umbra

5:31 pm on Dec 19, 2006 (gmt 0)

10+ Year Member



Colo's and non-AOL proxies are a different issue for me.

How do you distinguish non-AOL proxies from regular static/dynamic IPs?

wilderness

9:49 pm on Dec 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How do you distinguish non-AOL proxies from regular static/dynamic IPs?

It's a most difficult task to confirm anything with AOL (as I'm sure you're aware).

As you previously mentioned, "AOL does not abide by meta tags".

Nearly ALL my pages are meta-tagged; No Cache.
My images are in their own folders and excluded in robots.txt. (AOL is the only provider that I allow exceptions for in spidering images. ANY other bot or provider would be denied, regardless of how many customers they represent.)

I have the same ranges (class C) both attempting to access pages and images with blank referers (or those short UAs that begin and end with Mozilla) and then later returning to images only with full UAs.

It's merely a matter of "assumption" on my part; however, after watching it happen for more than seven years, it's safe to assume it's not a casual visitor just browsing or scraping.

wilderness

10:00 pm on Dec 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My apologies Umbra (for the previous response), it's been an unusually tough past three weeks and I jump from fire to fire with the flames only getting higher ;)

How do you distinguish non-AOL proxies from regular static/dynamic IPs?

There are pages across the internet that list active proxy servers. I haven't had the need to visit any such page in a long while.

My denial procedures are unusually harsh and many pests were denied long ago. As a result my current crawls/spiders are rather limited compared to most webmasters (and newbies).
I make note of what I refer to as "Snoops" and follow up on any subsequent activity by "Snoops" that take action.

Nearly ALL of RIPE, APNIC and LACNIC are denied access to my sites and that preference is NOT POSSIBLE for most webmasters.
The RIPE, APNIC and LACNIC denials allow me to focus on other areas.

incrediBILL

1:28 am on Dec 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



BTW, I ran into another web site offline reader (aka ripper) program today that by default uses "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1.)" as the user agent and doesn't even bother including their product name anywhere in the UA.

Good for ripping sites in stealth mode but a horrible marketing strategy.

Let's face it, most of the web tools are now hiding by design and it's going to get a lot worse before it gets better.

wilderness

6:04 am on Dec 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Let's face it, most of the web tools are now hiding by design and it's going to get a lot worse before it gets better.

Bill,
About the only comfort we have ;)
is that many major providers have broken formerly large blocks into smaller and more localized subsets.

It's too bad that all providers don't follow that same lead.

Umbra

7:40 pm on Dec 21, 2006 (gmt 0)

10+ Year Member



I've added Header append Cache-Control: "private" to .htaccess

I understand that this should be a reliable way of stopping proxies from caching files.

Will see if this stops those silly Mozilla/4.0 (compatible;) user agents, and those that don't pay attention to the cache control headers will likely be banned.

jdMorgan

7:59 pm on Dec 21, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I understand that this should be a reliable way of stopping proxies from caching files.

It is, indeed. Now, whenever someone behind that caching proxy requests resources from your site, and they don't have a copy cached in their browser, the request will be passed through that proxy to your server, resulting in more traffic from that proxy. Is that what you want?

The problem is that most caching proxies are 'good' and they save us and the network a lot of bandwidth. That's why ISPs and corporations use them. But there's no reliable prima-facie way to tell a 'good' caching proxy from a 'bad' caching proxy or a non-caching anonymous proxy that is being used for nefarious purposes. You can look at the Via and X-Forwarded-For request headers, but support is so spotty that it's even less reliable than trying to use referrer-based access control.

But bear in mind that most proxies are good things -- good for their users, good for your server, and good for the network. Just like anything else, though, they can be abused.

Jim

Umbra

8:10 pm on Dec 21, 2006 (gmt 0)

10+ Year Member



...resulting in more traffic from that proxy. Is that what you want?

I will be comparing before and after to determine how significant the increase in bandwidth is. I've assumed that proxies are becoming less prevalent, if only because dialup is going the way of the dinosaur, but I'll see what happens. If there is a significant increase in bandwidth, I'll weigh that against time saved while analyzing log files -- less chasing after strange hits (i.e., proxy requests) that don't seem to correspond to normal browsing patterns.

jdMorgan

4:23 pm on Dec 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'd predict that caching proxies will become more and more prevalent -- Their purpose is to reduce traffic between subnets, and as overall 'net traffic increases due to high-speed connections making such things as video distribution more popular, the need to decrease network traffic will only become greater.

ISPs and corporations have to choose between increasing the bandwidth of their connections to the 'net, or adding caching proxies to avoid redundant 'net traffic. Obviously, if a resource can be cached locally, doing so is a very effective way to reduce traffic into and out of a corporation's or ISP's subnet, and avoids the recurrent cost of leasing higher-capacity network connections.

Because cacheable resources won't be repeatedly requested after being cached, this can lead to the appearance of 'atypical browsing patterns.' So be careful when analyzing logs to bear this in mind.

One thing you can do is to use a "Web beacon" -- a small 1x1 transparent .gif, for example. Mark it as uncacheable. Then, even if a visitor is sitting behind a caching proxy, his/her browser will be forced to re-fetch that small (43-byte) image from your server on each page (re)load. You can use this to assure yourself that the user-agent does in fact load images, so it's not a text-only scraper.
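
A minimal sketch of marking such a beacon uncacheable from .htaccess, assuming mod_headers is available and the image is saved as beacon.gif (a hypothetical filename, not one from this thread):

```apache
# Sketch only: "beacon.gif" is a made-up name for the 1x1 transparent image.
<IfModule mod_headers.c>
  <Files "beacon.gif">
    # Tell browsers and proxies not to store or reuse this response
    Header set Cache-Control "no-cache, no-store, must-revalidate"
    # HTTP/1.0 caches do not understand Cache-Control
    Header set Pragma "no-cache"
  </Files>
</IfModule>
```

An IP that requests pages but never fetches the beacon is a candidate for a text-only scraper, though a browser with images turned off would look the same.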

As I said before, most caching proxies are good -- good for the 'net and good for your server. They help to maximize performance across the 'net and minimize the load on your server by eliminating redundant traffic. The fact that some (mostly anonymous, non-caching) proxies are used to abuse our sites does not make all proxies bad.

Jim

Umbra

12:57 am on Jan 4, 2007 (gmt 0)

10+ Year Member



Unfortunately I'm stuck with the perils of AOL :(

Not sure I understand the whole story, but according to Wikipedia, AOL introduced X-Forwarded-For headers in December 2006 and no longer hides the user's real IP behind a shared proxy IP.

Umbra

1:18 am on Jan 4, 2007 (gmt 0)

10+ Year Member



As I said before, most caching proxies are good -- good for the 'net and good for your server. They help to maximize performance across the 'net and minimize the load on your server by eliminating redundant traffic.

Maybe, but with so many dynamic pages out there and/or content that changes daily, I wonder if proxies are really so useful in that regard, at least for very dynamic websites.

Also, I've had the no-public-cache headers in place for 2 weeks now, and so far I don't see any real increase in bytes per session. I don't know if that's because a) it's the holidays, b) proxy users are a tiny percentage of internet traffic at this time, and/or c) many public proxies ignore the Cache-Control: private header (because they're not HTTP/1.1 compliant or are just badly behaved).

wilderness

3:58 am on Jan 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



X-Forwarded-For headers

Umbra,
I realize that I'm supposed to be a "whiz kid", however I haven't a clue what such a thing is.
In addition, I've gotten by just fine for quite a long while without knowing.

I don't have my own servers and my sites have always been hosted.

In 1997 or 1998 I was deeply involved in an email discussion list on widgets.
Having poked around in some Usenet, I had found access to ARIN and started documenting the many AOLers that I was emailing with their known location.
Many times the AOL locale name was consistent; however, it could change to another AOL locale randomly and change back just as randomly.
There was absolutely NO consistency at all.

Nor did AOL have any information as to what AOL locale utilized what range.
Perhaps that's changed today?

EX:
This is somebody that I've emailed with for more than 10-years. AOL is and has always been used for the connection.

Does this tell you anything "imo-m28.mx.aol.com"

Don

edited by Wilderness.

DNS provides the following (which is absolutely useless)
http ://www.dnsstuff.com/tools/lookup.ch?name=imo-m28.mx.aol.com&type=A

jdMorgan

4:37 am on Jan 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



X-Forwarded-For is an eXtended HTTP header, which indicates the IP address that is requesting your pages through the proxy. Some proxies pass this information, and some don't. The ones that don't are called anonymous proxies, although it's the user who is anonymous, not the proxy itself.

So using mod_rewrite variable naming conventions, the logic might be:
If %{REMOTE_ADDR} or %{REMOTE_HOST} is a known proxy
..If %{HTTP:X-Forwarded-For} IP is non-blank
....Log %{HTTP:X-Forwarded-For} as a "known-user-IP by proxy" visit
..Else
....Log %{REMOTE_ADDR} as an "unknown-user-IP/known-proxy-IP" visit
..EndIf
EndIf

Since this X-Forwarded-For header is not logged by standard logging configurations on off-the-shelf shared hosting, and since most Webmasters on such hosting won't be able to get into the server config to change mod_log_config settings to include it in their access logs, the remaining option is to use on-page PHP or Perl scripting via server-side includes to log this info to a separate file.
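
For anyone who does have access to the server config, adding the header to the access log might look like the sketch below. The format nickname proxy_combined and the log path are made up for illustration. The rest is the usual combined log format with one extra field.

```apache
# Server or virtual-host config only; LogFormat is not allowed in .htaccess.
# "proxy_combined" and "logs/access_log" are hypothetical names.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{X-Forwarded-For}i\"" proxy_combined
CustomLog logs/access_log proxy_combined
```

Requests that arrive without the header are logged with "-" in the last field.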

Once you've nailed down a troublesome proxy user's IP address(es) with %{HTTP:X-Forwarded-For}, you could use


SetEnvIf X-Forwarded-For ^bad\.IP\.address\.here$ getout

just like you would for a bad user-agent or Remote_Addr.

I haven't actually done this last step in a long time, so the variable name for SetEnvIf might not be 'just right', but it works the same way as, for example, detecting Google Accelerator prefetches with


SetEnvIf X-moz prefetch getout

If you had a particularly nasty IP address abusing your site, you could deny from *both* Remote_Addr and X-Forwarded-For, though trying to cover both cases for a lot of IP addresses would bloat the size of your deny list.
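
Put together, denying on both variables might look like the sketch below. The address 192.0.2.10 is a documentation placeholder, not a real offender, and the Order/Allow/Deny lines are the Apache 1.3/2.x access-control syntax (newer Apache releases use Require instead). Because X-Forwarded-For can carry a comma-separated list of addresses, the pattern looks for the address anywhere in the list rather than anchoring the whole header value.

```apache
# Sketch only: 192.0.2.10 is a placeholder address.
# Flag the request if the address appears anywhere in X-Forwarded-For
SetEnvIf X-Forwarded-For "(^|[, ])192\.0\.2\.10([, ]|$)" getout
# Flag the request if it comes directly from the same address
SetEnvIf Remote_Addr ^192\.0\.2\.10$ getout

Order Allow,Deny
Allow from all
Deny from env=getout
```

Keep in mind that X-Forwarded-For is supplied by the client side of the connection and can be forged, so it is better suited to blocking than to granting access.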

Jim

balam

5:27 am on Jan 4, 2007 (gmt 0)

10+ Year Member



> Does this tell you anything "imo-m28.mx.aol.com"

That's an AOL Mexico [aol.com.mx] hostname, mi amigo. I'm not familiar enough with the Mexican hostnames (or AOL's coverage in the country) to be able to peg it any better than that, but I can tell you that the prefix ("imo") does not (seem to) match any of the Mexican states.

Since that hostname is in AOL's 64.12.*.* block, I expect it's being forwarded. If it's being forwarded, and if it's true that AOL is now passing X-Forwarded-For, you could very well see a LACNIC IP in X-Forwarded-For.

wilderness

5:38 am on Jan 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hey Jim,
Seems to me that we've had this discussion
previously ;)

Is the explanation regarding headers a bunch of gibberish, or does it say that a visitor using a proxy while connected from the IP range XXX.XXX.XXX.XXX would actually show up as a visitor from ZZZ.ZZZ.ZZZ.ZZZ?

Or could either IP range possibly be the log entry (depending on the proxy configuration)?

Under this possibility, it would seem in theory that a user/visitor could log in at Comcast and then be shown as Qwest or any other provider?

If so?

This is not what AOL does.
AOL has their own server ranges, which are not explained anywhere in AOL's support, and the customer is switched between servers randomly as the need arises by the AOL router/server.
This makes confirmation of locale/identity quite impossible.

In the event that an AOL user logs in to another provider and then logs on to AOL the initial provider log will always be provided in emails as "original"

The AOL server that I provided in the previous response is somebody that I KNOW has utilized AOL and ONLY AOL for his connection for the past ten years.
In addition, the possibility exists that he could have another server assigned to his AOL emails in seconds, minutes or days, depending on server load and AOL's discretion. (No consistency or explanation, even when I documented the process from a variety of locales across the country.)

Don

This 33 message thread spans 2 pages.