
Apache Web Server Forum

    
Block Or Redirect Unknown Log Entries
Snorkasaurus




msg:4611877
 12:05 am on Sep 23, 2013 (gmt 0)

Greetings,

I have been seeing a lot of hits on one of my sites to /?s= and /?p=741. It is a WordPress site and /?p=741 is a valid URL to a post made back in 2010, while /?s= is just an "empty search" so WordPress returns the index page (also valid of course). The requests are coming from a number of hosting, colocation, and VPS providers around the world. The User-Agent strings used to vary significantly but in the last few days have settled on a mostly Macintosh/Firefox based string. I can't imagine that this is the work of a legitimate bot since it doesn't visit any other URLs and sometimes hits the /?s= URL from a dozen geographically diverse IPs within a matter of minutes.

The folks in the WordPress.org forums didn't seem to know what it is, and didn't really have any suggestions about what to do about it. I'm still bothered by it because this is taking up a significant portion of my logs, and of course using resources on my web server. To give an idea of scale, I have used environment variables to separate legitimate bots and "myself" from the access log... and the described traffic is now over 90% of my remaining log. It is a small site, so the traffic does not amount to hundreds of gigs, but the log entries are a pain.

I was thinking that if I move the "/?p=741" post to another number, I could redirect those two URLs to something non-existent such as http://localhost/. I am hoping that this would bounce them back at themselves as well as stop the log entries from being created. Unfortunately, my experience with mod_rewrite is almost nil and I have been unable to make it work. I do have mod_rewrite enabled (Apache 2.2) and am able to do basic redirection (such as the foo.html and bar.html example). One of the posts I found here gave me the impression that the ? causes the remainder of the URL to be interpreted as a "query string"... I started reading the respective mod_rewrite pages but was quickly overwhelmed.

Would a redirection actually get rid of the log entries and if so, could someone please give me a hand with creating it? I'm running Apache 2.2/Win x86 and have full access to httpd.conf but would use htaccess if it is more appropriate. Alternatively, if anyone has any idea what it is or of a better way to deal with it I am quite open to ideas or advice.

PS: If posting a sample of the log files would help I could do that (or post them to pastebin if that would be better).

HF,
Snork.

 

lucy24




msg:4611890
 1:47 am on Sep 23, 2013 (gmt 0)

the ? causes the remainder of the URL to be interpreted as a "query string"

It's not being "interpreted as" a query string. It IS a query string. For redirection purposes this means you can't put this part into the pattern of a RewriteRule, which only sees the path; it has to go in a RewriteCond like this:

RewriteCond %{QUERY_STRING} (^|&)s=($|&)

Whether you want to redirect or simply block is entirely up to you. Some individual robots do seem to go away faster if you redirect to a suitable contemplating-your-navel location such as 127.0.0.1 or even their originating IP-- and it shaves a few bytes off your server resources. But it's generally easier to block.
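Either choice can be sketched roughly as follows. This is only an illustration, assuming the rules sit in httpd.conf at server or virtual-host level and that the requests are for the site root; in .htaccess the pattern would be ^$ instead of ^/$. The trailing ? on the redirect target is there to drop the original query string.

```apache
RewriteEngine On

# Use ONE of the two options below, not both.

# Option 1: block. Answer empty-search requests on the root with 403 Forbidden.
RewriteCond %{QUERY_STRING} (^|&)s=($|&)
RewriteRule ^/$ - [F]

# Option 2: redirect. Send the same requests to the loopback address.
# RewriteCond %{QUERY_STRING} (^|&)s=($|&)
# RewriteRule ^/$ http://127.0.0.1/? [R=302,L]
```

Either way the request still reaches the server and still gets a response, just a much smaller one than a full WordPress page.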

Robots sometimes develop weird fixations with some specific URL. Often it's an especially long page. (There is presumably an explanation, but I haven't found it.) Redirecting a robot without concurrently redirecting well-intentioned humans is tricky, though. It isn't enough to rename the page in your internal links; what if someone has the page bookmarked, or follows a link from some distant admirer?

The requests are coming from a number of hosting, colocation, and VPS providers around the world.

Is it one of those infinitely expanding botnets? Blocking humanoid UAs is, again, tricky. But you should definitely look for recurring IPs and block them when possible.

At one time I had a fairly long clutch of fake referers blocked. Technically they're still there, in my htaccess. But after a while I stopped seeing them, because they're all coming in from IPs that by now I've blocked globally.

phranque




msg:4611909
 5:37 am on Sep 23, 2013 (gmt 0)

welcome to WebmasterWorld, Snork!


Would a redirection actually get rid of the log entries ...?

no.
it would be logged as a 301 (or 302) instead of a 200 response.

Snorkasaurus




msg:4612055
 3:48 pm on Sep 23, 2013 (gmt 0)

Hey lucy24... thanks for the tip, as soon as I used your RewriteCond it worked just fine.

Robots sometimes develop weird fixations with some specific URL. Often it's an especially long page. (There is presumably an explanation, but I haven't found it.)

I'm sure it can't be a legitimate robot since it seriously has not viewed anything but /?s= and /?p=741 in 2013. And the 741 post is a 5 sentence post with two links (one of which is a dead local link that was relocated ages ago). So fortunately for me, it is very unlikely that a human would want to access either URL.

Thanks for the warm welcome phranque, I checked and noticed that the log entries are of course still made. Do you know how I would be able to set an "env" (or two actually) at the same time as the RewriteRule so I can just dump them into a separate log?

HF,
S.

phranque




msg:4612129
 8:06 pm on Sep 23, 2013 (gmt 0)

I'm pretty sure the only way to switch log file locations is to restart the server.

lucy24




msg:4612148
 9:00 pm on Sep 23, 2013 (gmt 0)

Since logs are plain text files, it may be easier to let everything get logged and split them up in your log-processing routine instead. That part happens offline so you're not putting the server to extra work.

I can't quote any code, because my own site is so tiny I do it all in javascript. But the first thing I do is split out all 301 and 4xx responses and process them separately-- generally by ignoring all 403, tracking the redirects/404/410, and keeping an eye out for requests for errorstyles.css.

Snorkasaurus




msg:4612164
 10:41 pm on Sep 23, 2013 (gmt 0)

Sorry folks, I should have been much more clear about my logging question. What I have done is use environment variables to separate my logs into different files. For example, I have added a unique word to the User-Agent on my laptop (we'll call it "banana" for the sake of this post). Then I have used a SetEnvIf line like this:

SetEnvIf User-Agent "banana" donotlog

Then I added a bunch more SetEnvIf statements like this:

SetEnvIf User-Agent "Googlebot" donotlog searchengine
SetEnvIf User-Agent "bingbot" donotlog searchengine
SetEnvIf User-Agent "msnbot" donotlog searchengine


Then I set up my logs like this:

CustomLog C:/Logs/Accesslog.log combined env=!donotlog
CustomLog C:/Logs/SearchEngines.log combined env=searchengine


That way traffic from me is not logged, search engines get their own log, and my access.log should contain just humans who are actually reading my rants.

Now since my "human" log has become overrun with /?s= and /?p=741 requests, I thought it would be great if I could redirect them and add an environment variable so that I could push them to a separate log (and eventually maybe stop logging them). Unfortunately the documentation on SetEnvIf essentially says that if I use SetEnvIf with Request_URI that I should go see the mod_rewrite documentation for QUERY_STRING. LAWL!

Does that make a little more sense and does it sound like it would be possible to add "donotlog" and "garbage" to the /?s= and /?p=741 requests so that I can separate them out?

HF,
S.

phranque




msg:4612172
 1:12 am on Sep 24, 2013 (gmt 0)

you want this:
http://httpd.apache.org/docs/current/rewrite/flags.html#flag_e
With the [E], or [env] flag, you can set the value of an environment variable.

lucy24




msg:4612175
 1:23 am on Sep 24, 2013 (gmt 0)

Overlapping phranque, but luckily we don't contradict each other

Unfortunately the documentation on SetEnvIf essentially says that if I use SetEnvIf with Request_URI that I should go see the mod_rewrite documentation for QUERY_STRING.

Hee. But the real question is which executes first: mod_rewrite or mod_setenvif. Luckily you can also set environment variables in mod_rewrite; it's one of those obscure flags that nobody ever thinks about, like [CO] (which I personally use in rare cases).

Did you ever say what the "path" of the URL is? I assume it's always the same. It comes out something like

RewriteCond %{QUERY_STRING} (^|&)s=($|&)
RewriteRule path-of-url-here - [E=donotlog:specialdetour]

No [L] flag, because you're just setting the variable and not taking any other action. Unless, that is, you are taking action-- in which case you just add the [E] stuff to your existing redirect.

Snorkasaurus




msg:4612780
 2:17 am on Sep 26, 2013 (gmt 0)

Hey Folks,

Thanks for the tips, as it turns out that was exactly what I was looking for. And then, the story gets more interesting...

I ended up having to use [R,L,E=donotlog,E=garbage] to get it to work, so I am now redirecting them to localhost and keeping a separate log of the hits. Then I started going through the new garbage.log file and adding the IPs to iptables on my router to block them from port 80 and 443. Within 24 hours of this I started seeing what I would consider to be "normal" bot traffic from them! IPs from the same hosting companies (easily over a dozen) are suddenly hitting some URLs that they were not touching at all before... and when I say "not touching" I mean these IP ranges have been hitting /?s= and /?p=741 (and nothing else) with increasing frequency since the beginning of the year.
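For anyone landing here later, the whole arrangement looks roughly like the sketch below. It is an approximation rather than a copy of my config: it assumes the rules live in httpd.conf (use ^$ in .htaccess), the Garbage.log path is made up for the example, and the variables are written in the explicit E=VAR:VAL form with an arbitrary value of 1.

```apache
RewriteEngine On

# Match either the empty search or the one post the bots fixate on
RewriteCond %{QUERY_STRING} (^|&)s=($|&) [OR]
RewriteCond %{QUERY_STRING} (^|&)p=741($|&)
# Redirect to localhost, drop the query string (trailing ?),
# and tag the request with both environment variables
RewriteRule ^/$ http://localhost/? [R,L,E=donotlog:1,E=garbage:1]

# Logs from earlier in the thread, plus one for the tagged requests
CustomLog C:/Logs/Accesslog.log combined env=!donotlog
CustomLog C:/Logs/SearchEngines.log combined env=searchengine
# Hypothetical file name for the separate "garbage" log
CustomLog C:/Logs/Garbage.log combined env=garbage
```

The env= tests only check whether the variable is set, so the value itself doesn't matter. Note that the p=741 condition only makes sense once the real post has been moved to another number.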

Perhaps there is something quite meaningful to lucy24's comment:
Robots sometimes develop weird fixations with some specific URL

I expect that I will retain the redirection on the /?s= URL since it is redundant and monitor how they behave over the next little while.

Thanks for all the help eh... much appreciated!
S.
