Forum Moderators: open
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0;)
Additionally, the Referer is always exactly my domain name, i.e. [my_domain_name.com...]
It's definitely a bot of some type, and it has been going on for I don't know how long. I've only recently started investigating, since bandwidth and machine usage have been through the roof.
I see this forged IE User-agent coming from multiple IPs, most of them in the 69.230.*.* range, which are SBC / Pacific Bell dial-up or DSL accounts. I've started banning these IPs as they come up, but this is hardly a solution, as these people (or person?) can just reconnect to get a new IP.
Anyone else seeing this activity?
[edited by: volatilegx at 7:13 pm (utc) on Dec. 14, 2006]
[edit reason] fixed broken user agent [/edit]
I was really more curious if anyone is seeing this type of activity in their logs.... lots of fast hits with that User-agent with your domain name as the referer for every hit.
Thanks wilderness, but I believe that would block anybody actually using that version of the browser -- assuming that User-agent is even a valid one for IE. And most of the hits are coming from that range, but not all of them.
The lines I provided ONLY deny access to that UA, and only IF it comes from that IP.
You may also expand the IP lines by adding other ranges.
The multiple criteria reduce the chance of blocking innocent visitors.
I would risk blocking a real person (assuming "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0;)" really is an IE user agent) from PacBell's dynamically allocated DSL block. Also, for more scrutiny, the rule should test that the referer is the domain name.
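A multi-criteria rule like the one being discussed might be sketched in mod_rewrite along these lines. This is only an illustration: the IP range, the exact UA string, and the referer host (`my_domain_name.com`, the placeholder used earlier in this thread) would all need to be adapted, and the request is denied only when all three conditions match.

```apache
# .htaccess sketch -- forbid the request only when ALL three match:
# the SBC/PacBell range, the exact suspect UA, and a self-referer.
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^69\.230\.
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.0;\)$"
RewriteCond %{HTTP_REFERER} ^http://(www\.)?my_domain_name\.com [NC]
RewriteRule .* - [F]
```

Because the conditions are ANDed, a genuine IE user on that DSL block who arrives from an outside referer (or with no referer) would still get through.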
That would all be great if that user-agent were NOT a true IE UA, but I believe it is.
Unfortunately, this bot isn't taking the bait of my current bot trap (i.e. a non-visible link to a page denied in robots.txt, linked to a banning script), so I'm going to have to program something into my current shopping cart script to watch for this behavior and ban for 24-48 hours.
[edited by: volatilegx at 3:01 pm (utc) on Dec. 20, 2006]
[edit reason] fixed unintended smiley [/edit]
Why do you think that this is an IE UA? Does any real IE UA end with a semicolon and a parenthesis? (No)
I see tons of examples, from Mozilla/4.0 (compatible;) (proxy?) to Google Wireless Transcoder's user agent. Admittedly, these are not actual IE browsers...
[edited by: Umbra at 7:39 pm (utc) on Dec. 15, 2006]
[edited by: volatilegx at 3:02 pm (utc) on Dec. 20, 2006]
[edit reason] fixed unintended smiley [/edit]
A lot of scrapers are now using Google translation and wireless "proxies" to do their dirty deeds.
Glad to see I'm not the only one who has noticed this trend.
It's always been Jim's position/courtesy to allow access to those that jump through the hoops required by translator tools (anybody who's ever used one may attest to the difficulty).
However, from my point of view, what if the IP comes from a registry whose region holds no possible benefit to your website(s)?
There's no need to allow access from a translator tool when access is not normally allowed from that IP range.
Proxies and colo's are of the same nature (at least in my book) and demand denial.
My own pages simply contain too much text for the small wireless devices. Hopefully a time will come that we as webmasters will be able to determine the difference between a cell phone with web access and a laptop with either a wireless or cell phone connection.
Even most of the new web accelerators provide the IP range of the tool rather than the user and also demand the aforementioned attention.
As usual, each webmaster must determine what is beneficial or detrimental to their own website(s).
Proxies and colo's are of the same nature (at least in my book) and demand denial.
Even AOL proxies? I started a thread here [webmasterworld.com] about handling proxies... but that didn't seem to produce any good conclusion yet.
Even AOL proxies? I started a thread here about handling proxies... but that didn't seem to produce any good conclusion yet.
Unfortunately I'm stuck with the perils of AOL :(
Many of the widget visitors that come to my sites are AOL users, because of the reliable flexibility of dial-up access when they move from state to state, or even state to province.
Applying my denial to these AOL ranges is not an option for me.
Colo's and non-AOL proxies are a different issue for me.
Don
How do you distinguish non-AOL proxies from regular static/dynamic IPs?
It's a most difficult task to confirm anything with AOL (as I'm sure you're aware).
As you previously mentioned, AOL does not abide by meta tags.
Nearly ALL my pages are meta-tagged; No Cache.
My images are in their own folders and excluded in robots.txt. (AOL is the only provider that I allow exceptions for in spidering images. ANY other bot or provider would be denied, regardless of how many customers they represent.)
I have the same ranges (class C) both attempting to access pages and images with blank referers (or those short UAs that begin and end with Mozilla), and then later returning to images-only requests with full UAs.
It's merely a matter of "assumption" on my part, however after watching it happen for more than seven years?
It's safe to assume it's not a casual visitor just browsing or scraping.
How do you distinguish non-AOL proxies from regular static/dynamic IPs?
There are pages across the internet that list active proxy servers. I haven't had the need to visit any such page in a long while.
My denial procedures are unusually harsh and many pests were denied long ago. As a result my current crawls/spiders are rather limited compared to most webmasters (and newbies).
I make note of what I refer to as "Snoops", follow up on any subsequent activity by those "Snoops", and take action.
Nearly ALL of RIPE, APNIC and LACNIC are denied access to my sites and that preference is NOT POSSIBLE for most webmasters.
The RIPE, APNIC and LACNIC denials allow me to focus on other areas.
Good for ripping sites in stealth mode but a horrible marketing strategy.
Let's face it, most of the web tools are now hiding by design and it's going to get a lot worse before it gets better.
Let's face it, most of the web tools are now hiding by design and it's going to get a lot worse before it gets better.
Bill,
About the only comfort we have ;)
is that many major providers have broken formerly large blocks into smaller and more localized subsets.
It's too bad that all providers don't follow that same lead.
I understand that this should be a reliable way of stopping proxies from caching files.
Will see if this stops those silly Mozilla/4.0 (compatible;) user agents, and those that don't pay attention to the cache control headers will likely be banned.
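For reference, the header change being described could be applied with mod_headers along these lines. This is a sketch, not the poster's actual configuration, and the file extensions covered are an assumption:

```apache
# Allow browsers to cache pages, but tell shared/proxy caches not to.
# (Only HTTP/1.1-compliant proxies are obliged to honor this.)
<FilesMatch "\.(html|php)$">
    Header set Cache-Control "private"
</FilesMatch>
```

Note, as discussed later in the thread, that badly behaved or pre-HTTP/1.1 proxies may simply ignore this header.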
It is, indeed. Now, whenever someone behind that caching proxy requests resources from your site, and they don't have a copy cached in their browser, the request will be passed through that proxy to your server, resulting in more traffic from that proxy. Is that what you want?
The problem is that most caching proxies are 'good' and they save us and the network a lot of bandwidth. That's why ISPs and corporations use them. But there's no reliable prima-facie way to tell a 'good' caching proxy from a 'bad' caching proxy or a non-caching anonymous proxy that is being used for nefarious purposes. You can look at the Via and X-Forwarded-For request headers, but support is so spotty that it's even less reliable than trying to use referrer-based access control.
But bear in mind that most proxies are good things -- good for their users, good for your server, and good for the network. Just like anything else, though, they can be abused.
Jim
...resulting in more traffic from that proxy. Is that what you want?
I will be comparing before and after to determine how significant the increase in bandwidth is. I've assumed that proxies are becoming less prevalent, if only because dialup is going the way of the dinosaur, but I'll see what happens. If there is a significant increase in bandwidth, I'll weigh that against time saved while analyzing log files -- less chasing after strange hits (i.e., proxy requests) that don't seem to correspond to normal browsing patterns.
ISPs and corporations have to choose between increasing the bandwidth of their connections to the 'net, or adding caching proxies to avoid redundant 'net traffic. Obviously, if a resource can be cached locally, doing so is a very effective way to reduce traffic into and out of a corporation's or ISP's subnet, and avoids the recurrent cost of leasing higher-capacity network connections.
Because cacheable resources won't be repeatedly requested after being cached, this can lead to the appearance of 'atypical browsing patterns.' So be careful when analyzing logs to bear this in mind.
One thing you can do is to use a "Web beacon" -- a small 1x1 transparent .gif, for example. Mark it as uncacheable. Then, even if a visitor is sitting behind a caching proxy, his/her browser will be forced to re-fetch that small (43-byte) image from your server on each page (re)load. You can use this to assure yourself that the user-agent does in fact load images, so it's not a text-only scraper.
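An uncacheable beacon of the kind described could be configured roughly as follows; the filename and the exact set of directives are illustrative:

```apache
# Force every page view to re-fetch the 1x1 beacon image,
# even for visitors sitting behind a caching proxy.
<Files "beacon.gif">
    Header set Cache-Control "no-cache, no-store, must-revalidate"
    Header set Pragma "no-cache"
    Header set Expires "0"
</Files>
```

The Pragma and Expires headers are belt-and-suspenders for older HTTP/1.0 caches that don't understand Cache-Control.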
As I said before, most caching proxies are good -- good for the 'net and good for your server. They help to maximize performance across the 'net and minimize the load on your server by eliminating redundant traffic. The fact that some (mostly anonymous, non-caching) proxies are used to abuse our sites does not make all proxies bad.
Jim
As I said before, most caching proxies are good -- good for the 'net and good for your server. They help to maximize performance across the 'net and minimize the load on your server by eliminating redundant traffic.
Maybe, but with so many dynamic pages out there and/or content that changes daily, I wonder if proxies are really so useful in that regard, at least for very dynamic websites.
Also, it's been 2 weeks now since I introduced the no-public-cache headers, and so far I don't see any real increase in bytes per session. I don't know if that's because of a) the holidays, b) proxy users being a tiny percentage of internet traffic at this time, and/or c) many public proxies ignoring the Cache-Control: private header (because they're not HTTP/1.1 compliant, or are just badly behaved).
X-Forwarded-For headers
Umbra,
I realize that I'm supposed to be a "whiz kid", however I haven't a clue what such a thing is.
In addition, I've gotten by just fine for quite a long while without knowing.
I don't have my own servers and my sites have always been hosted.
In 1997 or 1998 I was deeply involved in an email discussion list on widgets.
Having poked around in some Usenet, I had found access to ARIN and started documenting the many AOLers that I was emailing with their known location.
Many times the AOL locale name was consistent however it could change to another AOL locale randomly and change back just as randomly.
There was absolutely NO consistency at all.
Nor did AOL have any information as to what AOL locale utilized what range.
Perhaps that's changed today?
EX:
This is somebody that I've emailed with for more than 10-years. AOL is and has always been used for the connection.
Does this tell you anything? "imo-m28.mx.aol.com"
Don
edited by Wilderness.
DNS provides the following (which is absolutely useless):
http://www.dnsstuff.com/tools/lookup.ch?name=imo-m28.mx.aol.com&type=A
So using mod_rewrite variable naming conventions, the logic might be:
If %{REMOTE_ADDR} or %{REMOTE_HOST} is a known proxy
..If %{HTTP:X-Forwarded-For} IP is non-blank
....Log %{HTTP:X-Forwarded-For} as a "known-user-IP by proxy" visit
..Else
....Log %{REMOTE_ADDR} as a unknown-user-IP/known-proxy-IP visit
..EndIf
End
Since this X-Forwarded-For header is not logged by standard logging configurations on off-the-shelf shared hosting, and since most Webmasters on such hosting won't be able to get into the server config to change mod_log_config settings to include it in their access logs, the remaining option is to use on-page PHP or Perl scripting via server-side includes to log this info to a separate file.
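For those who do have access to the server config, the header can be captured with a custom log format. A sketch using mod_log_config (the format name and log path are arbitrary; "-" is logged when the header is absent):

```apache
# The standard combined log format, extended with a final field
# holding the X-Forwarded-For request header.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{X-Forwarded-For}i\"" combined_xff
CustomLog logs/access_log combined_xff
```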
Once you've nailed down a troublesome proxy user's IP address(es) with %{HTTP:X-Forwarded-For}, you could use
SetEnvIf X-Forwarded-For ^bad\.IP\.address\.here$ getout
I haven't actually done this last step in a long time, so the variable name for SetEnvIf might not be 'just right', but it works the same way as, for example, detecting Google Accelerator prefetches with
SetEnvIf X-moz prefetch getout
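In either case, the `getout` environment variable only takes effect once it's referenced in an access-control block, something like this (Apache 1.3/2.0-era Order/Allow/Deny syntax, matching the directives above):

```apache
# Deny any request that matched one of the SetEnvIf rules above.
Order Allow,Deny
Allow from all
Deny from env=getout
```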
Jim
That's an AOL Mexico [aol.com.mx] hostname, mi amigo. I'm not familiar enough with the Mexican hostnames (or AOL's coverage in the country) to be able to peg it any better than that, but I can tell you that the prefix ("imo") does not (seem to) match any of the Mexican states.
Since that hostname is in AOL's 64.12.*.* block, I expect it's being forwarded. If it's being forwarded, and if it's true that AOL is now passing X-Forwarded-For, you could very well see a LACNIC IP in X-Forwarded-For.
Is the explanation regarding headers a bunch of gibberish, or does it say that a visitor using a proxy while connected from the IP range #*$!.#*$!.#*$!.#*$! would actually show up as a visitor from ZZZ.ZZZ.ZZZ.ZZZ?
Or could either IP range possibly end up as the log entry (depending on the proxy configuration)?
Under this possibility!
It would seem, in theory, that a user/visitor could log in at Comcast and then be shown as Qwest or any other provider?
If so?
This is not what AOL does.
AOL has their own server ranges, which are not explained anywhere in AOL's support documentation, and the customer is switched between servers randomly, as the need arises, by the AOL router/server.
This makes confirmation of locale/identity quite impossible.
In the event that an AOL user logs in to another provider and then logs on to AOL, the initial provider will always be shown in the email headers as the "original".
The AOL server that I provided in the previous response is somebody that I KNOW has utilized AOL and ONLY AOL for his connection for the past ten years.
In addition, the possibility exists that he could have another server assigned to his AOL emails in seconds, minutes or days, depending on server load and AOL's discretion. (No consistency or explanation, even when I documented the process from a variety of locales across the country.)
Don