
Mysterious User Agents


ratman

10:46 pm on Sep 30, 2002 (gmt 0)

10+ Year Member



Hi All,

I have just been going through my server logs and noticed these UAs:

63.155.196.249 - Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; AT&T CSM6.0)
64.156.198.78 - Mozilla/5.0 (X11; Linux i686; en-US; rv:1.0rc5; OBJR)
213.121.69.199 - Mozilla/4.0 (compatible; MSIE 5.5; Windows 95; sniffout_or_w9x)
64.0.99.201 - Mozilla/4.0 (compatible; MSIE 5.01; Windows 98; BROADPAGE; NetCaptor 6.5.0)
62.252.64.6 - IE 4 Win XP
62.251.22.163 - Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.1) Gecko/20020826
64.246.44.19 - lwp-trivial/1.35
64.246.44.19 - PHP/4.2.1
203.88.129.166 - DA 4.0
202.188.200.186 - contype
12.252.45.24 - Mozilla/9.9
213.122.107.212 - Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Circle0701)

I don't recognise any of them. They have each either made too many requests in a short time, read the robots.txt file and totally ignored it, attempted to break into password-protected areas, or done nothing wrong at all (I am just curious!). I have traced the IPs, but most belong to commercial companies (AT&T, etc.). Some others I have already researched using this forum (this list was twice as long).

One I have traced and banned already (in case some of you haven't heard of it yet):

61.6.159.128 - Mozilla/4.0 (compatible; MSIE 6.0; Win32 <a href="http://www.zylox.com/ua.asp">Internet Research Software</a> )

I also seem to be getting a lot of hits from FrontPage. Is there any real way to block it using .htaccess?
I can't block by IP address because the hits come from several different addresses.
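A minimal .htaccess sketch for that kind of User-Agent block (assuming mod_setenvif is available, and assuming these clients send "FrontPage" somewhere in their UA string - check your logs for the exact token):

```apache
# Hypothetical sketch: deny requests whose User-Agent mentions FrontPage.
# The "FrontPage" pattern is an assumption - adjust it to what your logs show.
SetEnvIfNoCase User-Agent "FrontPage" kick_me_out
<Limit GET POST>
    Order Allow,Deny
    Allow from all
    Deny from env=kick_me_out
</Limit>
```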

Thanks
ratman

jdMorgan

3:34 am on Dec 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



spinnercee,

SetEnvIfNoCase User-Agent "^Scooter" kick_me_out

You're blocking AltaVista's "Scooter"? You might want to allow Scooter/3 and disallow Scooter/1, but why block both?
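For anyone following along, that SetEnvIf line only bites when paired with a Deny rule. A hedged sketch of the allow-Scooter/3, deny-Scooter/1 approach (patterns illustrative - the actual UA strings look like "Scooter/3.2"):

```apache
# Sketch: block only the 1.x Scooter, let 3.x crawl.
SetEnvIfNoCase User-Agent "^Scooter/1" kick_me_out
<Limit GET>
    Order Allow,Deny
    Allow from all
    Deny from env=kick_me_out
</Limit>
```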

Jim

spinnercee

3:44 am on Dec 5, 2002 (gmt 0)

10+ Year Member



Good Question, but I have a good answer :)

The server is a file-transfer-only httpd... it does not deliver any indexable content. So I banned the agent because, in context, it was reading all the .exe files (exactly 8192 bytes each) and dropping the connection, filling both my error and access logs... and I just got tired of chasing IPs.

But you bring up another good question... I only saw "Scooter/3.2". Is there a difference between them (in function and/or purpose)?

jdMorgan

1:50 am on Dec 6, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, Got it...

Scooter/1.0 was turned loose again a while ago, and was indexing without regard to robots.txt. You should be able to find a recent thread on it here in this forum.

Scooter/3.2.SFO has been well-behaved, at least on my sites... And it gives me an AV "fresh tag", too! :)

Jim

spinnercee

4:13 am on Dec 6, 2002 (gmt 0)

10+ Year Member



Thanks JD... I'll revisit the issue and un-ban the bot for now... but if he comes back at me like that again... I'm gonna... :)

jdMorgan

4:40 am on Dec 6, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



spinnercee,

I would block "^Scooter/1" in your .htaccess, since it has definitely ignored robots.txt. But on the site you describe above (with nothing but .exe files) I would also disallow Scooter using robots.txt - unless there are some .exe files you want it to index(?). The intent is to stop Scooter/1 by any means possible, but to politely inform Scooter/3 and other variants that you don't want the site spidered.
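A robots.txt sketch along those lines for a download-only server (the blanket disallow already covers Scooter, but a named section makes the intent explicit):

```
# Politely ask compliant spiders to stay out entirely.
User-agent: Scooter
Disallow: /

User-agent: *
Disallow: /
```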

I am granting AV the benefit of the doubt, since Scooter/1 disappeared shortly after some complaints about its recent misbehaviour were lodged. Scooter/3 is welcome to most of my site, and it has been well-behaved. So Scooter/1 is blocked for now, and Scooter/3 is only disallowed in semi-private areas and in areas where indexing would be pointless. But I'm running plain-vanilla, mostly-HTML sites, and your case sounds different.

Jim

spinnercee

6:15 am on Dec 7, 2002 (gmt 0)

10+ Year Member



I hear you, JD -- My case is somewhat different, but I DO want to continue to enjoy the AV traffic.

My situation is this: I have two Apache servers running on different ports. I use two servers, instead of one with a virtual-host setup, because I want a different connection limit for each server --- the main httpd contains all the index stuff, while the second server only delivers downloads to clients and is connection-limited. This is because I am running this particular site at the small end of a residential DSL line (128K out), and I don't want file hounds and download accelerators consuming my full bandwidth.
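Roughly, the download server's httpd.conf would differ from the main one only in its port and concurrency cap. A hypothetical sketch for Apache 1.3 on Win32, where ThreadsPerChild (rather than the Unix MaxClients) limits simultaneous connections:

```apache
# Hypothetical second-server httpd.conf fragment (download server only).
Port 8080            # the main indexable server stays on port 80
ThreadsPerChild 5    # cap simultaneous downloads on the Win32 build
Timeout 60           # drop stalled connections sooner
```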

Now you guys who "know" are probably saying "just use mod_throttle" or something to that effect, but the fact is I just can't seem to find one that will work properly with my version of Apache [Apache/1.3.12 (Win32)]. There is a reason why I use this particular version as well.

I don't like the idea of throwing a 403 at a spider, and that brings up a question: can 4xx status codes cause a spider to ignore the anchor (or alt) text in the referring page that led to the 4xx code?

guabito

12:52 pm on Dec 7, 2002 (gmt 0)

10+ Year Member



I had this one in the logs today -

Scooter/3.3.QA.pczukor

It also resolves to AltaVista.

spinnercee

4:16 am on Dec 9, 2002 (gmt 0)

10+ Year Member



I was visited by this user agent today: "HTTP agent"

Any ideas? This client was retrieving href links that were commented out in a source document (the 404 messages initially got my attention).

Its reverse IP resolved to "dorm76-128.bridgewater.edu".

jdMorgan

5:09 am on Dec 9, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



spinnercee,

I don't know the answer to your question above about 404s and link text.

Regarding "HTTP agent", I would assume this is somebody playing with 'bot scripts in his dorm. My site would automatically ban a user-agent behaving as you describe.

Did "HTTP agent" fetch robots.txt? If not, I'd ban it manually, even if it didn't get auto-banned. Your mileage may vary on this point.

Jim

upside

9:31 pm on Dec 9, 2002 (gmt 0)

10+ Year Member



I just got a hit from Metacarta with an IP of 66.28.23.147; their user agent was "Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6c human-guided@lerly.net". They didn't obey robots.txt.

spinnercee

3:17 pm on Dec 10, 2002 (gmt 0)

10+ Year Member



Hey JD, the 404 (not found) stuff was just how I discovered that the requests were for the commented-out code section.

The client made only four requests: one for the page with the comments, and three for the commented-out stuff. I know it wasn't a bot; I was just interested in knowing whether there is a client that normally uses such a UA string.

I wouldn't ban an agent on those grounds, but I would keep an eye on its behavior, either manually, or by setting up a log for "watchables"...

upside: Lynx is a tool used by webmasters (and hackers alike)... it's essentially a text-mode browser. I use it to "see what the bots see", since it does not render graphics and such. It can also be used to display HTTP response headers, like the tool here on WebmasterWorld. That's another one I wouldn't ban, nor expect to respect robots.txt. Possibly SE editors use it to see what a page looks like sans images, javascript and styles. It can also traverse links, so it may exhibit bot-like behavior, but again, unless it stresses your server in some way (it shouldn't, since it will only be interested in .html files), welcome it. [lynx.browser.org...]

guabito

8:24 pm on Dec 11, 2002 (gmt 0)

10+ Year Member



On the topic of user agents - it may be worthwhile to have your server logs reviewed by engineers and legal staff. This past week they started poring over the logs of the company my wife owns. They made some valid points regarding user agents:

1. Don't count on your hosting company to block user agents unless those agents are going to bring the server down. Hosts make money from the sale of space and bandwidth - the more bandwidth, the more money in their pockets. (We have been very lucky in this regard, as our hosting company is very cooperative in blocking abusive agents above our access level.)

2. If you do not do (or solicit) business in a geographic region, shut down those IPs. In other words, if you don't do business in China, don't expose yourself to user agents from that part of the world in the first place. This would be very difficult for an "informational" website, but if you are selling goods or services and don't have clients in a region, it makes sense.

3. Have a strong user agreement, privacy policy and copyright notice posted on your site (written by legal staff). Include in your policy any user agents or browsing actions that would violate it. For example, one of our policies prohibits the collection of data or information for third-party profit without the company's consent.

4. If you block an agent or an IP range and they continue to come at you even after e-mail notification, have them served with legal notification. If the agents still continue to come at you, and they are operating out of the United States or the European Union, the odds are very high that you have civil recourse.

5. Related to the above... if you own a server or rent server time and space, you "own the bandwidth". The fact that it is connected to the Internet is not an open invitation for uninvited user agents to consume that bandwidth - this is especially true with a strong User Agreement (UA) in place. If your front door is unlocked, it does not mean anyone off the street can pass through it without knocking.

6. If there are any suspected illegal activities in your logs, report them to the proper authorities - not necessarily to the "abuse" departments, which may be spending 99.99 percent of their time sorting out spam reports.

Related to the above, there was an amazing incident in the logs this morning... We had been working with the legal staff and engineers the past week, and the latest log had a massive number of hits (more than could be posted on several pages on this site) from three IPs using several agents. The hits were so frequent that they degraded the server. Two of the IPs were registered to a company in Portland, Oregon that had already been posted on WebmasterWorld, so they went 403; the other hits went 200, except when the user agent was prohibited. That IP was registered to another company, also in Portland, Oregon, that offers a web content-filtering service. The log indicated that the agents and IPs were acting "in concert", and although many of the hits were going 403, they stressed the server by hitting the .htaccess file so many times.

To make a long story short, after the legal/engineers saw the logs, their little stunt got them a complaint to the FBI, etc., etc. User Agents - BANNED OR NOT - KNOWN OR NOT - are NEVER allowed to hit your server in a way that would cause degraded server service (or shut the server down).

I know the staff we are working with have been drafting and sending out the legal notices for locations in the United States, and next week they will do the same for some firms in the United Kingdom and France through their European offices. Banning user agents is one way, but you do have recourse, depending on your individual situation and geographic location.

spinnercee

9:16 pm on Dec 11, 2002 (gmt 0)

10+ Year Member



To make a long story short, after the legal/engineers saw the logs, their little stunt got them a complaint to the FBI, etc., etc. User Agents - BANNED OR NOT - KNOWN OR NOT - are NEVER allowed to hit your server in a way that would cause degraded server service (or shut the server down).

This kind of stuff scares me in a way... much ado about nothing... it is possible that your spidering friends were totally legitimate, or that their behavior was unintended...

I don't like to ban IPs, UAs or browse patterns, because I feel it goes against the free nature and purpose of the internet as a whole... If a browse pattern constitutes an attack, fine, I'll throw you 403s all day, but I won't call the cops... Far be it from me to determine why --- there are far more "novice" browsers (people) and ill-behaved software out there than truly malicious people --- many times we just don't speak the same language, so they can't understand "one at a time". But that's par for the course, and I wouldn't have it any other way.

guabito

12:42 pm on Dec 13, 2002 (gmt 0)

10+ Year Member



I can see spinnercee's point - to a degree. They do claim there was an issue with their software. That still does not make what they did legal. In fact, it is not legal.

I also look at other legal questions - why were three IPs registered to two different firms in the same city? One of the firms had no information on the web, and thus no privacy policy covering what they do with the information.

But, as they said, it all depends on the type of site you have, how you do business, and other market issues. The other question... is it the competition attempting to gain marketing information, etc., on your company? We are banning the IPs until they are proven innocent.

Many people don't care about some agents, like ia_archiver, since it goes to archive.org. That's OK with me, but the fact is the information is collected by United Layer and then passed off to archive.org. So, where is the information going? Well, they have no stated policy on where the information goes...

Like I say, we were informed that it is a personal/company decision. We have decided to take very strong measures to halt many of the problems and abuses indicated by the logs.

In fact, most of the company's business will be migrated to a 1-800 number with Citrix Client/Server.

guabito

12:57 pm on Dec 13, 2002 (gmt 0)

10+ Year Member



On user agents - I wanted to make note of two user agents that were found in the logs. The first one was a bit of a surprise, until the engineers reviewing the logs informed us that a user agent can be given almost any string (as long as it does not violate the copyright or trademarks of other agents).

In this specific case I had requests with referer "-" and agent "-" going 200. This was fine, except that the .htaccess had a condition sending null-referer, null-agent requests to 403. How did they do it? The agent they used was literally named "-"!

The .htaccess was modified to stop agents named "-", and now all "-" agents go 403. So, these people were using a visual trick to gain access.
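A hedged sketch of that fix (assuming mod_rewrite is available; the pattern treats an empty UA and a literal "-" UA the same way):

```apache
RewriteEngine On
# Forbid requests whose User-Agent header is absent, empty, or literally "-".
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule .* - [F]
```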

This bot was seen in the logs and did not obey robots.txt. I did not see it on WebmasterWorld, so here it is:

Agent: lcabot
Accept: */*

This goes to/comes from:

142.177.168.175

Stentor National Integrated Communications (Canada)

Watch out for those "-" agents. They may not be what they seem!

guabito

4:17 pm on Dec 14, 2002 (gmt 0)

10+ Year Member



I was thinking a little bit more about spinnercee's post last night as I was doing some of my own browsing.

I think this is my point of view on user agents, bots, etc. (known or unknown): when a company owns and develops a site, and pays for the bandwidth, legal fees, taxes, etc., then I am of the firm opinion that the site has the right (and indeed it does) to ban user agents, IPs and bots without comment or any reason at all.

For example, if a company is "monitoring" a site, using that site's bandwidth to drive traffic to their own site and increase their business, why shouldn't they be banned? They are attempting to gain business at the expense of someone else's bandwidth without paying that site for its expenses - and bandwidth is only the start of the overhead involved (i.e. some of the above $$$).

I think the "Wild West" attitude of the Internet had to change. When everybody had unlimited VC $$$ to throw at the Internet, the bandwidth and overhead were, again, at someone else's expense. Those days are long gone, as are the days when bots, user agents and monitors could get free rides and an "open door" to do what they will at the expense of others.

There is a bottom line. Having a site on the web costs $$$ and when it is no longer cost effective it will die. Check out the web sites for sale on Ebay sometime - complete sites with domain names for $30.00, if that.

Now, if the site is informational, like one on the Civil War or the Spanish-American War, is not intended to make a profit, and the people putting it up don't care about bandwidth use, then let the bots and user agents roll and suck them dry. It is their choice.

My spouse's company had a top ranking on a major US directory, and that directory will receive notice to remove her site due to the actions of some "partner" user agents (intended or not). Here is a tidbit... out of all the years of top rankings, it resulted in one (1) new client. The survey indicated all other new clients came via word of mouth.
