
Apache Web Server Forum

welcome to the web?

Dan99
msg:4681886 - 5:17 pm on Jun 22, 2014 (gmt 0)

OK, this is a "welcome to the web" question, I know.

I run a small scale server that serves about a thousand document copies per week to my colleagues around the world. Apache 2.2 on Mac OS 10.8.

In my logs, I see a "malicious user" who, a few weeks ago, started downloading the same 20 MB file, four times in a row, every 20-40 minutes. The requesting IP is always different. Sometimes it's an IP that I've already denied service to, being on a standard China blacklist. But sometimes it's an IP that has no registered complaints. So how do I know it's the same malicious user? Because the request is ALWAYS for the same file, and ALWAYS four times. That is his or her hacker "signature".

So, OK, I just changed the filename slightly. My regular users will figure that out. But the requests keep coming with the old name. So instead of a 200 code, and a lot of megabytes, they're now getting a 404, and a few tens of bytes. Bandwidth-wise, there is no problem anymore.

But the requests are still littering my log. Any suggestions for mitigation? Is this a case where someone has infected machines around the world and commanded them to bang on me? Is he/she likely to get bored and go away? If the goal is to waste my bandwidth, they don't seem to be paying any attention to whether it's working, because it isn't anymore. Is there any way to notify the managers of these various IPs that their machines have been compromised?

I can handle malicious users, by banning their IP. No sweat. But this guy/gal is using LOADS of IPs to do the job. No way I can ban them all. I've been webserving for years, but this is the first time I've seen this.

 

aristotle
msg:4681900 - 6:49 pm on Jun 22, 2014 (gmt 0)

Can you post some examples of log entries for recent download attempts? Instead of blocking IPs, it might be possible to block them another way, such as through a user agent or referrer string.

Dan99
msg:4681902 - 7:14 pm on Jun 22, 2014 (gmt 0)

Well, here's a recent example. Of course, 27.37.104.67 is an unblacklisted Hong Kong IP address. They're getting a 302 code here because the filename they're requesting is wrong, and I've redirected their 404 to a page where they can look up the right one, which is how I treat my cooperative users. The document they're after used to be called Art_5-1-14.pdf which was the name of a 20MB file.

27.37.104.67 - - [22/Jun/2014:11:13:24 -0600] "GET /mydocs/tele/Art_5-1-14/Art_5-1-14.pdf HTTP/1.1" 302 235 "http://me.myself.andi.org/mydocs/tele/Art_5-1-14/"
27.37.104.67 - - [22/Jun/2014:11:13:24 -0600] "GET /mydocs/tele/Art_5-1-14/Art_5-1-14.pdf HTTP/1.1" 302 235 "http://me.myself.andi.org/mydocs/tele/Art_5-1-14/"
27.37.104.67 - - [22/Jun/2014:11:13:24 -0600] "GET /mydocs/tele/Art_5-1-14/Art_5-1-14.pdf HTTP/1.1" 302 235 "http://me.myself.andi.org/mydocs/tele/Art_5-1-14/"
27.37.104.67 - - [22/Jun/2014:11:13:28 -0600] "GET /mydocs/tele/Art_5-1-14/Art_5-1-14.pdf HTTP/1.1" 302 235 "http://me.myself.andi.org/mydocs/tele/Art_5-1-14/"

Now, the redirect they're coming from is just my folder holding that file. That folder still holds the file, but the file has a different name now.

Let me be clear. They aren't getting much of anything anymore from me. This is no longer a bandwidth issue. They are just making a request over and over and over and over. It's about littering my logs. (Yeah, small deal ...) So it has nothing to do with "blocking" their access. It has to do with making them stop requesting over and over and over and over. If it were one IP doing it, I could redirect that request to a super-long YouTube. I've done that, and it sure slows 'em down! That's a good way to get malicious users to go away. But I can't do that with a range of hundreds of seemingly random IPs.

lucy24
msg:4681905 - 8:03 pm on Jun 22, 2014 (gmt 0)

You can't stop the requests outright. If it's your own server you can sometimes set up a firewall so unwanted requests never even reach the server, but it doesn't sound appropriate here.

But there are two other aspects of the request you can look at.

One's the UA, which you haven't mentioned in your post. If it's something clearly robotic, you can and should block it on its own.

The other is referer. In the past I've been plagued with robots that request extra-large files. For me this means any one html file larger than 200k, so even if they request nothing else, it's still a drain. Happily they often come in with an auto-referer; this is probably intended to get past referer-based lockouts, but can backfire.

The bad news is that Apache by itself can't detect an auto-referer; for that you'd have to detour to a php or similar script. But if there's only a small number of possible files, you can code them individually:

RewriteCond %{HTTP_REFERER} /paston/paston2\.html$
RewriteRule ^ebooks/paston/paston2\.html - [F]


and so on. My current list is only about 10 specific pages.

I've also got a custom redirect for certain patterns that robots seem to latch onto. For example:

RewriteCond %{HTTP_REFERER} /(fonts)/$
RewriteRule ^hovercraft/april_(blues)\.html http://example.com/boilerplate/redirect.php?oldpage=%1&newpage=$1 [R=301,L]


The "redirect.php" page says something like "I'm awfully sorry, but you've accidentally replicated the behavior of a nasty robot" and then has human-accessible links to the old and new page. Currently I've only got three packages-- meaning a whopping six indexed arrays-- and it isn't likely to increase. Robots do get tired and go away, so you only need to deal with the currently active ones. Note that this has to be coded as a redirect, not a rewrite, or else human visitors will just keep getting bounced back to the same page.

If it were one IP doing it, I could redirect that request to a super-long YouTube.

Please don't do it. There may be times it's tempting to redirect unwanted visitors to the CIA or NSA or similar entity, but sooner or later you'll shoot yourself in the foot. If you must redirect-- and admittedly it does work well for some robots-- use either 127.0.0.1 or the offender's own IP. (The second version has the advantage that if it's an exceptionally stupid robot it may get them kicked off their hosting. Use this only with dedicated robots, not with botnets that include infected human machines.)
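If you ever do go the redirect route, the rule itself is simple; a rough, untested sketch using the filename from your logs (per-directory/.htaccess context assumed):

# Bounce the known-bad request back to the requester's own address.
# Only do this against dedicated robots, never infected human machines.
RewriteEngine On
RewriteRule ^mydocs/tele/Art_5-1-14/Art_5-1-14\.pdf$ http://%{REMOTE_ADDR}/ [R=302,L]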

Dan99
msg:4681910 - 8:17 pm on Jun 22, 2014 (gmt 0)

Thank you very much. That is helpful. A firewall isn't appropriate, since the website is basically public.

Excuse me. What is the "UA"?

The referer (as shown) is my own machine.

The redirect is exactly what I've done. The bad dude is asking for a file of a certain name. That name doesn't exist anymore, so it redirects to a text page that points to an index. As in, "What you're asking for isn't here by this name, but look in this index and find the right name." So far, they haven't done it. Suggesting a bot.

Gee, I really like the super-long YouTube trick. Why would that be shooting myself in the foot? In fact, I repelled a particularly annoying IP by redirecting it to a Daffy Duck video. They never came back. Maybe they don't like Daffy Duck? Interesting idea of redirecting them back to their own IP.

Yes, I'm assuming that a robot will eventually get tired and go away.

not2easy
msg:4681911 - 8:27 pm on Jun 22, 2014 (gmt 0)

UA is the User Agent, and it might look something like this:

"Mozilla/5.0 (Windows NT 6.3; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0"

It tells you which "browser" the robot is set up to mimic when it sends requests. It is easily spoofed and can be changed from one request to the next, though it usually isn't. Often enough there is something unique in there that you can use to block it.

aristotle
msg:4681913 - 8:40 pm on Jun 22, 2014 (gmt 0)

Yes, I'm assuming that a robot will eventually get tired and go away.

Maybe. But what you could be seeing is the ongoing creation of a giant botnet. If so, the individual requests you're seeing now are just tests that take place every time a new machine is infected. As long as the botnet is in the process of creation, this is all you'll see. But when the botnet gets big enough, all those machines could be unleashed simultaneously in a DDoS attack.

The reason I mention this is because I believe that this is exactly what I'm seeing on one of my sites. It started about four months ago, and the botnet, if that's what it is, appears to be growing at an average rate of about 200 new infected machines per day. With the help of other members of this forum, I've tried to set up a defense, but I have no idea if it will hold up if a full attack eventually comes.

Dan99
msg:4681915 - 8:55 pm on Jun 22, 2014 (gmt 0)

Very interesting. I haven't traced out all the machines, but I wrote down the IPs for twenty or so successive requests, and they were ALL different. I suspect there may be hundreds.

Not sure how you set up a defense. Am I supposed to deny service to each and every one of these? That would be some work. As I said, many of these IPs are already blacklisted.

As to the User Agent, where can I find that? My log doesn't currently record it. Can I set up my logging to do so? Might be interesting.

not2easy
msg:4681918 - 9:19 pm on Jun 22, 2014 (gmt 0)

The reason for trying to block via UA is that after you block unwanted traffic, it isn't that unusual to see them trying a different door - coming in via proxies. IPs are useful, but you want to collect all the information you can.

aristotle
msg:4681921 - 9:31 pm on Jun 22, 2014 (gmt 0)

It's hopeless to try to block individual IPs, as there could eventually be thousands of them.

In my case there are also numerous different user agents, but the same user agent is used across dozens of machines, so a UA list would be much shorter than an IP list. I've been able to block a lot of them individually because many of them are fakes built on genuinely old versions of Firefox, Opera, and Chrome. But some of them mimic up-to-date browsers, so they can't be blocked without also blocking some real humans.

My final and best line of defense is based on the self-referral blocking that Lucy discussed in her post above. In fact, if it weren't for her, I wouldn't have been able to set it up.

lucy24
msg:4681923 - 10:23 pm on Jun 22, 2014 (gmt 0)

As to the User Agent, where can I find that?

You did say it's your own server, right? If it weren't, I would advise you to change hosts, because omitting the UA from logs is inexcusable. I'm sure the Apache docs tell you how to set up your logs-- both what to show and what to omit, and how to format the parts you include. As I remember, it's a LogFormat string of percent codes, one letter each. In fact I'm surprised the UA isn't included by default, so check whether you've already overridden the usual settings.
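Something like the stock "combined" format should cover it (the log path is just an example; adjust for your own setup):

# The standard "combined" format records both Referer and User-Agent.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog /var/log/apache2/access_log combined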

If so, the individual requests you're seeing now are just tests that take place every time a new machine is infected. As long as the botnet is in the process of creation, this is all you'll see.

Huh, that's interesting. It would explain the enduring fascination with, say, my contact page. But how long does the setting-up stage last? Typically when I've recognized a pattern in a botnet, it goes on for months and months.

:: detour to check ::

index.php botnet: labeled since September 2012, but probably started several months earlier
contact botnet: going since July of last year, i.e. within a few weeks after I created the Contact page :(

Or is there some boilerplate botnet code that everyone uses, so these aren't actually all the same botnet? That would also explain why the index.php botnet's behavior changes periodically. It started out as exactly 4 pages in a consistent pattern; now it's exactly 10, winding up with three iterations of the same pair.

Gee, I really like the super-long YouTube trick.

This would only work on botnets that involve infected human browsers. Other robots typically don't follow redirects; they've got a shopping list and they stick to it. Now, if it is an infected human machine, redirecting to something time-consuming might make the human more likely to notice that something is up. Especially if it's something audible. Another approach would be redirecting to some lunatic-fringe site whose offenses include playing music unasked.

Dan99
msg:4681924 - 10:53 pm on Jun 22, 2014 (gmt 0)

Yes, I recall that User Agent logging option in my Apache log setup. It is not enabled by default. I will set it up. I frankly didn't see the need earlier; I do now.

Very interesting about this botnet. Is this botnet hitting everyone with precisely four requests as it is doing to me?

As to my YouTube trick, this perpetrator IS following redirects. Because when they hit me with a request for a file name that doesn't exist, I redirect them to a page of mine that tells them where to go to look for it. They get to that page (and thus have been redirected), but they don't do anything about it. Of course, they might only tolerate a redirect if it's to one of my own pages.

aristotle
msg:4681928 - 11:24 pm on Jun 22, 2014 (gmt 0)

Instead of redirecting or returning a 404, you need to return a 403 Forbidden. If you set it up properly, that uses much less bandwidth and fewer server resources.

Dan99
msg:4681941 - 1:08 am on Jun 23, 2014 (gmt 0)

Well, but the bot is trying to get the file with the old filename. Permitted users could be trying to do the same thing. So I can't respond with a 403 to everyone trying to get that filename. My only recourse is to redirect everyone to a note that says the filename has changed and points to the index where the new name can be found. A bot shouldn't be able to act on that; a genuine user can. So far, the bot hasn't.

The redirect just costs me about half a kilobyte, a 403 about a quarter. I can compare the two because some of the IPs this bot uses are already on my deny list and some aren't. No big deal either way.
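Roughly speaking, the redirect amounts to something like this (the notice page name here is made up; an ErrorDocument 404 pointing at a full URL would behave much the same way, with the client getting a 302):

# Send requests for the retired filename to a short "this file was renamed" note.
Redirect 302 /mydocs/tele/Art_5-1-14/Art_5-1-14.pdf /mydocs/file-renamed.html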

lucy24
msg:4681948 - 2:38 am on Jun 23, 2014 (gmt 0)

Instead of redirecting or returning a 404, you need to return a 403 Forbidden. If you set it up properly, that uses much less bandwidth and fewer server resources.

Well, that's where redirecting to someone else's site becomes attractive. A 301/302 response header is smaller than just about anything else you could come up with; the size only mounts up when your server is sending out content (such as a 403 page) at the same time.

Is your unwanted visitor picking up stylesheets and other non-page content? That's generally a pretty good identifier for infected machines vs. pure robots. Sure, some robots pick up everything, but most stop at html. Or pdf, as the case may be. If the redirect points to an html page, then presumably there's a stylesheet or two along with it. Or at least a favicon.

Very interesting about this botnet.

botnets are the internet's answer to serial killers. At any given time, there's a certain number of them at work, generally more than you'd prefer to contemplate. Sometimes you can only identify them by their signature-- and often only after the fact, like making four consecutive requests for the same page. (Four is funny. I get a lot of robots that make requests in sets of three.)

These are complete 200 requests, right? It's tricky with large pdf files because a lot of perfectly normal human browsers will break the file into multiple 206s.

Edit:
Your big pdf file is linked from somewhere, right? So if your on-site link has changed, then only a robot or botnet would be asking for the old filename citing the current html page as referer. Bookmarks don't come with a referer, and there's a limit to how long a page stays in a browser's cache. Technically I guess it could stick around forever-- until the browser slows to a crawl and you have to walk your computer-illiterate friend through Emptying The Cache ;)-- but if it's too old, the browser will make a fresh request even when it's got a cached copy. How long ago did you rename the pdf file and fix your local links?

Dan99
msg:4681949 - 2:59 am on Jun 23, 2014 (gmt 0)

My 404 redirect points to an html text page, though I suppose the favicon can be seen as well. No stylesheets go with it.

I've actually put in a "Header set accept-ranges none", so I don't get any 206s anymore.
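For anyone copying that idea, here's roughly how it can be scoped to just the big PDFs (mod_headers assumed; note it only changes the advertised header, so a client that insists on sending Range requests can still get a 206 out of Apache 2.2):

# Stop advertising range support for the PDFs.
<IfModule mod_headers.c>
    <FilesMatch "\.pdf$">
        Header set Accept-Ranges none
    </FilesMatch>
</IfModule>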

The original name for the big pdf file was listed in an index file, and the whole shebang is regularly indexed by search engines. In fact, as I said, I'm inviting those requesting using the old name to go into the index file to get the new one. So there is nothing to stop anyone from getting the new address.

As you say, only a zombie robot or botnet would keep banging on the old name and not even try to correct itself. I renamed the file about a week ago. So if the goal is to hog bandwidth, whoever it is really isn't trying very hard.

lucy24
msg:4681951 - 4:59 am on Jun 23, 2014 (gmt 0)

My 404 redirect

You've used this phrase twice now, I think, and it's making me uneasy. What exactly does it mean?

aristotle
msg:4681982 - 10:31 am on Jun 23, 2014 (gmt 0)

You can set up the 403 response so that it only uses 13 bytes.
See this thread: [webmasterworld.com ]
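Something along these lines works in Apache 2.2 (the exact byte count depends on the message you pick):

# A short plain-text body keeps the 403 response tiny.
ErrorDocument 403 "Forbidden."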

aristotle
msg:4681984 - 10:41 am on Jun 23, 2014 (gmt 0)

So I can't respond with a 403 to everyone trying to get that filename

That's why you need to set up a defense that lets the real humans through but blocks everything else with a 403. Then you can go back to the original file name and avoid telling people to look up the new file name.
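One rough, untested sketch of that sort of filter, reusing the referer angle Lucy described; it leans on the current renamed-file setup, the paths are the ones from your log excerpt, and .htaccess context is assumed:

# The bot asks for the retired filename and always sends a referer (its own
# directory); stale human bookmarks arrive with no referer at all and fall
# through to the friendly "file was renamed" notice.
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteRule ^mydocs/tele/Art_5-1-14/Art_5-1-14\.pdf$ - [F]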

Dan99
msg:4682002 - 11:37 am on Jun 23, 2014 (gmt 0)

What I mean by 404-redirect is that when someone would be handed a 404, they are instead redirected to a page that says "The file you asked for isn't here. Please go to this index page and find what you want." That's useful when I have to change a name (to make formatting consistent, perhaps), so someone who has that page bookmarked under the old name isn't simply turned away.

Setting up a defense that lets real humans through but blocks everything else with a 403 is, I suppose, what this thread is about. I'm still not quite sure how to do that.

aristotle
msg:4682006 - 12:17 pm on Jun 23, 2014 (gmt 0)

Setting up a defense that lets real humans through but blocks everything else with a 403 is, I suppose, what this thread is about. I'm still not quite sure how to do that.


You need to study the behavior of your opponent before you try to set up a defense. That's why we need to see your logs, including the user-agents.

aristotle
msg:4682009 - 1:02 pm on Jun 23, 2014 (gmt 0)

Lucy wrote:
botnets are the internet's answer to serial killers. At any given time, there's a certain number of them at work, generally more than you'd prefer to contemplate

Lucy, you said that botnets are "at work", but what are they at work doing? That's still not clear to me.

lucy24
msg:4682029 - 3:52 pm on Jun 23, 2014 (gmt 0)

That's still not clear to me.

That makes two of us ;) They obviously get something out of it, or there wouldn't be so ### many of them.

How does the "404 redirect" manifest itself at the server level? Seems like if you're redirecting all requests for the page, there should never be the opportunity for a 404 to come up at all. What's the actual response that is sent out?

Dan99
msg:4682052 - 4:55 pm on Jun 23, 2014 (gmt 0)

How does the "404 redirect" manifest itself at the server level? Seems like if you're redirecting all requests for the page, there should never be the opportunity for a 404 to come up at all. What's the actual response that is sent out?

Yes, I suppose that's true that a 404 never happens, at least from this website. The page I redirect to is about 500 bytes of text, pointing to my index page. The point being, if you don't know what you're requesting, see if you can find it here. That index page is public. No need to hide it.

Dan99
msg:4682054 - 4:59 pm on Jun 23, 2014 (gmt 0)

They obviously get something out of it, or there wouldn't be so ### many of them.


Exactly right. As I said, to the extent they want to lay waste to my bandwidth, they're doing a pretty crappy job of it. What they are doing, to be precise, is mild littering of my log. Not sure what they get out of that. That's hardly even an inconvenience to me.

lucy24
msg:4682092 - 8:46 pm on Jun 23, 2014 (gmt 0)

I suppose that's true that a 404 never happens, at least from this website.

Uh-oh. That really sounds like what Google calls a "soft 404". It's what they're looking for when they request a nonsense URL like bc896oe5utjjkb.html --if a site knows what's good for it, that request had better yield a 404!

Dan99
msg:4682101 - 9:00 pm on Jun 23, 2014 (gmt 0)

Uh-oh. That really sounds like what Google calls a "soft 404". It's what they're looking for when they request a nonsense URL like bc896oe5utjjkb.html --if a site knows what's good for it, that request had better yield a 404!

Sigh. I hear ya. All the more reason to come up with a smarter defense. That is, if my regular users make a mistake in the filename, I don't want to throw them out ungracefully.

Now, what exactly is the disadvantage of a "soft 404"? In this case, they're just getting redirected to a few hundred byte text file. I could just send them to a few hundred byte error page instead. What's the difference?

I have enabled User Agent logging, and I am collecting data on this perhaps-bot. Be right back.

not2easy
msg:4682108 - 9:38 pm on Jun 23, 2014 (gmt 0)

The disadvantage of a "soft 404" is that Google doesn't like them because they offer a poor user experience. If you aren't concerned about your site's ranking, they don't matter. Does your service depend on ranking? It didn't sound like that in the first post. If you can block your pests you won't have very many soft 404s anyway.

Dan99
msg:4682117 - 10:18 pm on Jun 23, 2014 (gmt 0)

Ah, thanks. No, I'm not concerned about the ranking and, in fact, I never even look at the ranking. Not quite clear why a soft 404 offers a poor user experience. Such that a hard 404 "FILE NOT FOUND!" is going to offer a better experience?

lucy24
msg:4682121 - 10:26 pm on Jun 23, 2014 (gmt 0)

To clarify: A redirect of some specific request isn't a "soft 404". It's only an issue when the site never serves 404s at all, but deals with everything by redirecting to some existing page, most often the front page. As a user, this drives me bonkers because it leaves me with no way of knowing whether I just misspelled my request, or whether the site itself has a bad link that I'd like to report, or... (It is not always the case that the things G### dislikes are the same things a human visitor dislikes. So it's nice when they coincide.)

I think somewhere in this thread I posted an example of a targeted redirect. Since the pattern includes a specific referer, search engines would never even see the redirect; the whole point is to intercept unwanted robotic behavior.

Overlapping...
Such that a hard 404 "FILE NOT FOUND!" is going to offer a better experience?

Well, that's why you make a custom 404 page. "I'm awfully sorry, but I can't find the page you asked for, so here are some other places you might try." Sure, the custom 404 might include a link to the front page-- but the user doesn't always benefit from being dumped there unasked.
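A minimal sketch of the difference (the page name is made up):

# A local path keeps the real 404 status while serving the custom page,
# so it isn't a soft 404. Pointing ErrorDocument at a full http:// URL
# would make Apache answer with a 302 redirect instead.
ErrorDocument 404 /custom404.html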
