Error breakdown

         

5kKate

4:25 pm on Mar 25, 2015 (gmt 0)

10+ Year Member



I'm trying to break down our errors to see what's causing them. Is there an easy way to see which request URIs are causing the most errors?

lucy24

7:26 pm on Mar 25, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, for a given definition of "Easy". Why don't you start by firing up your raw logs?

"Errors" is an awfully general term, since it could apply to any non-200 response. What in particular are you looking for?

In the case of 400-class errors you'll need to distinguish between genuine errors that might happen to a human, and errors triggered by blocked robots. If you don't already use Webmaster Tools, sign up. Ordinarily when people talk about wmt they mean google, but other search engines have them too, and you may actually get more detailed error information from bing's version.

Malign Ukrainians do tend to home in on particular pages and request them over and over again. It probably isn't worth the trouble of figuring out why. And, of course, any and all malign robots from previously unknown ranges will hit a lot of 404s in their quest for /wp-admin/ and /fckeditor/.

A 301/302 response is not technically an error, but if certain requests consistently meet a redirect, you'll want to know about it.
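
If you want a rough overview before digging in, something like this against a standard combined-format access log will tally responses by status code (the status is normally the ninth whitespace-separated field; adjust if your LogFormat differs):

# count responses by status code, most frequent first
awk '{print $9}' access.log | sort | uniq -c | sort -rn

That at least tells you whether you're mostly chasing 404s, 403s, or something in the 500 range.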

5kKate

6:23 pm on Mar 26, 2015 (gmt 0)

10+ Year Member



Thanks for the response! Yeah, I'm mostly interested in 4xx and 5xx errors. I can see in my access logs that there are a lot of calls to /wp-admin/, so you're on point about those malign robots. They make me a little nervous, though. Should I block the IPs to keep my site secure? I have Webmaster Tools, but doesn't that only show errors from Google's crawler? I was hoping to see errors from all the requests to my site.

I'm thinking I should use the Apache access logs, since they store every request and its status code. There are thousands of requests, but it'd be nice to see them broken out by request URI so I can filter out the ones I already know about and not miss something new that pops up. For example, if I do a site deployment I don't want to be missing any assets. We also use Tomcat to serve some of our application pages, and it'd be nice to see whether new releases are causing problems. I've been scrolling through the logs and trying to use grep to find new errors. Are there analyzer tools out there you use that give a summary report of this information?

lucy24

9:25 pm on Mar 26, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Are you on shared hosting or is this your own server? I'd have assumed shared, but Tomcat makes me think it's your own. If so, you may need to wait for someone who speaks Apache-- phranque? you out there? --as I don't know if Tomcat creates logs of its own.

If a request gets a 403 error, it is already blocked in some way, so you have done what you need to do. There is no way to keep blocked requests from getting logged. (Well, OK, there is if you have your own server, but even then it is very unlikely you would want to leave anything unlogged.)

A 500-class error merits closer inspection. One simple possibility is mod_security; I think it currently defaults to 503. (My host creatively uses 418 instead. This makes it very easy to analyze logs.) Sometimes they mean that a malign robot has walloped you so hard (over some-number-of concurrent connections) that the server starts returning 500s to protect itself.

One thing you can do is look at your Apache error logs. Normally they will be stored in the same place as your access logs. On shared hosting they typically include all 400-class responses (again, on your own server you can change the logging level to exclude 403s). The most common line will be

:: shuffling papers ::

[Thu Mar 26 10:27:02 2015] [error] [client 50.192.203.171] client denied by server configuration: /home/username/example.com/wp-admin
[Thu Mar 26 10:59:39 2015] [error] [client 109.98.4.7] client denied by server configuration: /home/username/example.com/, referer: http://semalt.semalt.com/crawler.php?u=http://example.com

Unfortunately, no logging level will give you more detail. The "client denied by server configuration" wording means that something, somewhere, triggered a 403 [F] response. Generally it's a blocked IP; occasionally there will be some head-scratching as you try to work out why some particular human was locked out. (Here I'm assuming the first was a blocked IP while the second was blocked by the semalt referer.)
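
If you want to see who is piling up those denials, a rough one-liner over the error log (substitute whatever yours is called; the sed just grabs whatever sits inside the [client ...] bracket, so it should survive minor format differences) is:

# count "client denied" entries per client IP, busiest first
grep 'client denied by server configuration' error.log | sed 's/.*\[client \([^]]*\)\].*/\1/' | sort | uniq -c | sort -rn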

Then there are 404 errors:
[Wed Mar 25 11:15:56 2015] [error] [client 193.201.224.176] File does not exist: /home/user/example.com/includes/fckeditor

Q: Why didn't this request get blocked? A: Because I actually have an /includes/ directory, and it's got an "Allow from all" override so I can use includes even on the error page. So this 404 request was accompanied by a bunch of 403s from the same page. These 404s are a good way to catch new requests for things like /wp-admin/ from previously unknown malign robots. The ones you find in WMT tend to be malformed links from other people's sites. Most often I see a legitimate URL with a bit of text at the end, suggesting that they didn't put the closing </a> tag in the right place. Or that the Googlebot found a plain-text http link and didn't know when to stop reading.
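
To see what the robots are hunting for, the same trick works on the 404 entries (again assuming the error-log format shown above; if a line carries a referer it will tag along, which is good enough for eyeballing):

# list the targets of "File does not exist" errors, most requested first
grep 'File does not exist' error.log | sed 's/.*File does not exist: //' | sort | uniq -c | sort -rn

Anything with wp-admin, fckeditor or phpmyadmin in it goes straight into the "robots" pile.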

Internal errors:
[Tue Mar 24 12:53:49 2015] [error] [client 66.127.aa.bb] unable to include "/includes/sharedlinks.php" in parsed file /home/user/example.com/boilerplate/forbidden.html

(This would be worrying, except that it was me testing a new page with an obvious glitch.)

mod_security errors:
[Thu Mar 19 03:00:55 2015] [error] [client 103.27.127.238] ModSecurity: Multipart parsing error: Multipart: Final boundary missing. [hostname "example.com"] [uri "/editors/ewebeditor//upload.asp"] [unique_id "{ snipped }"]
[Sun Mar 22 07:50:29 2015] [error] [client 91.121.169.22] ModSecurity: Access denied with code 418 (phase 1). Pattern match "^\\(\\) {" at REQUEST_HEADERS:Cookie. [file "/dh/apache2/template/etc/mod_sec2/hostname.conf"] [line "234"] [id "1990064"] [msg "CVE-2014-6271 - Bash Attack"] [hostname "example.com"] [uri "/"] [unique_id "{snipped}"]
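
If you want to know which rules fire most often, the [msg "..."] chunk is the easiest thing to pull out (assuming your entries carry one, as my second example does):

# tally ModSecurity hits by rule message
grep 'ModSecurity' error.log | grep -o '\[msg "[^"]*"\]' | sort | uniq -c | sort -rn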

There are some other formats, including a couple of User-Agents that never occur outside of unwanted robots. But I don't archive error logs so these are the only formats I can quote. (While hunting these down, I also made the unnerving discovery that some 418 requests don't show up in error logs. Don't know what this is about; I should probably ask.)

My personal log-wrangling routine pulls out all redirects and 400/500-class responses. The ones that can't be explained by automated methods (for example, a 301 followed by a non-301 for the same URL means someone used the wrong hostname) get a closer look.
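
In command-line terms the first step is roughly this (I happen to do it in javascript, but the idea is the same; status code assumed to be field 9 of a combined-format log):

# keep only redirects and 400/500-class responses for closer inspection
awk '$9 ~ /^(30[12]|4[0-9][0-9]|5[0-9][0-9])$/' access.log

Everything that comes out of it is either expected (a redirect you set up yourself, a robot bouncing off a block) or something to stare at.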

5kKate

4:53 pm on Mar 28, 2015 (gmt 0)

10+ Year Member



Wow, this is really awesome information! For your log-wrangling routine, did you set up a script or something to pull out all the 400/500-class responses? Or is it more manual, using grep or other command-line tools?

lucy24

8:21 pm on Mar 28, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



All my sites are small. How small? Sooo small that I process the raw logs in javascript. (I use HTML as a word processor. I am not alone.) There's a string of functions that do different things: discard requests from known harmless robots such as search engines; pull out requests from known botnets and anyone I'm currently evaluating for block-or-ignore status; pull out any remaining non-200 responses; flag possible robots for visual checking; process the rest into a pretty table.

The non-200s get a routine of their own, which begins by deleting any 301 that's immediately followed by a 403 on the same request (obviously these shouldn't occur, but I keep it for insurance). By default I delete all other 403s and 50x; 404s and 410s are pulled out for closer inspection.

not2easy

9:01 pm on Mar 28, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I use something similar with the raw access logs: I pull out the same 4xx and 5xx server responses, pull out all requests for robots.txt, wp-login or admin, and cross-check them against the list built up over time in the Spiders/UID forum here: [webmasterworld.com...] I run it two ways, so the same line gets checked more than once. When I do the pass that removes all the non-200/30x requests, it's easy to spot anomalies against the solid block of requests a human needs to view a page; almost always those are image requests. That list gets cross-checked against the blocked list, because I don't block without cause. I use a text editor to pull out the lines: BBEdit (Mac).
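
In command-line terms, that robots.txt / wp-login / admin pass amounts to something like this (I actually do it in BBEdit, but the idea is the same; the addresses come out as field 1 of a standard access log):

# list the IPs asking for robots.txt, wp-login or admin paths
grep -E 'robots\.txt|wp-login|admin' access.log | awk '{print $1}' | sort | uniq -c | sort -rn

The resulting IPs are what get checked against the Spiders/UID list before anything goes on the blocked list.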

phranque

12:48 am on Mar 29, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



if you have access to a *nix command line, you can get almost everything you need from your log files using grep, cut, sort, uniq, and occasionally sed or awk.
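
for example, to see which uris are generating the most errors in a combined-format access log (status code in field 9, request uri in field 7; adjust for your own LogFormat):

# most frequent uris returning a 4xx or 5xx status
awk '$9 ~ /^[45]/ {print $9, $7}' access.log | sort | uniq -c | sort -rn | head -20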

lucy24

8:02 am on Mar 29, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



From "command line" to "awk" is indeed a short step ;)

5kKate

12:37 am on Apr 1, 2015 (gmt 0)

10+ Year Member



Nice! I'm not an expert at JavaScript, but I used phranque's example and put together a little grep-and-cut pipeline to pull out the request URIs. Here's how it works:

ubuntu@ip-172-31-11-241:/var/log/apache2$ grep " 404 " access.log | cut -d ' ' -f 7
/manager/html
http://testp4.pospr.waw.pl/testproxy.php
/login.action
/cgi-bin/test-cgi
/muieblackcat
//phpmyadmin/scripts/setup.php

You can see some stuff that looks like those malign Ukrainians, haha, especially the muieblackcat. It looks like only 2 IPs are generating all these requests. I guess I can add those IPs to a deny rule in .htaccess just to be safe?

[edited by: Ocean10000 at 1:18 am (utc) on Apr 1, 2015]
[edit reason] unlinked [/edit]

lucy24

3:40 am on Apr 1, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



especially the muieblackcat

Oh my gosh that takes me back. Haven't seen those in years. Someone at the time must have told me what the file actually is, because it's the kind of thing I would have asked about, but I can't remember now.

:: detour to archived logs ::

Last one I can find was early 2012. I think most of them must have come before I started saving logs; the few that I find all had blank UAs, which alone would be enough for a lockout.

Denying IPs is the simplest and most efficient way to bar unwanted visitors. Save the complicated rules for things that involve "if A and B but not C, and then only when D=E". I also do a lot of mod_setenvif plus mod_authz-whatsit, as in:
# flag any request with a blank (or bare "-") User-Agent
BrowserMatch ^-?$ keep_out
...
# and refuse anything carrying the flag
Deny from env=keep_out
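
And for the two IPs you found, the plain-vanilla form in .htaccess would be something along these lines (Apache 2.2 syntax, which is what my Deny lines assume; the addresses here are placeholders, substitute the ones from your own log):

# allow everyone, then knock out the specific offenders
Order Allow,Deny
Allow from all
Deny from 192.0.2.11
Deny from 198.51.100.22

If the whole range turns out to be garbage, "Deny from 192.0.2." works too. (On Apache 2.4 without mod_access_compat the equivalent uses Require not ip inside a RequireAll block, but that's a separate conversation.)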

5kKate

11:47 pm on Apr 2, 2015 (gmt 0)

10+ Year Member



Awesome, this is perfect. Thanks for all the help!

phranque

7:42 pm on Apr 3, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



welcome to WebmasterWorld, 5kKate!

5kKate

8:19 pm on Apr 3, 2015 (gmt 0)

10+ Year Member



Thanks phranque! Happy to be here. Looking forward to getting to know everyone.