Forum Moderators: phranque
Tracking of 400 and 500 Errors Part II
[webmasterworld.com...]
Tracking of 400 and 500 Errors Part I
[webmasterworld.com...]
We've built an application that manages 400/500 errors for our sites. I've had it running on one site in particular for quite some time to fully test it and make sure it was doing everything I wanted and then some. Now we're slowly migrating the App over to other websites. Wow, what an eye opener!
I don't want to have to dig through logfiles to find what I want to know. In this instance, I want to know everything about the 404s that are being generated. So, we built an application to do just that. And, I want them via email. No problem!
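For anyone wanting to roll their own, the extraction step can be sketched in a few lines. This is a minimal sketch, not our actual App: it assumes a Combined Log Format access log, and the field layout may need adjusting for your server.

```python
import re
from collections import Counter

# Combined Log Format, e.g.:
# 1.2.3.4 - - [03/Jul/2008:13:04:00 +0000] "GET /x HTTP/1.1" 404 512 "ref" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def find_404s(log_lines):
    """Yield (ip, uri, referrer) for every 404 in the log."""
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if m and m.group("status") == "404":
            yield m.group("ip"), m.group("uri"), m.group("referrer")

def summarize(log_lines):
    """Count 404s per requested URI -- the digest you'd email yourself."""
    return Counter(uri for _, uri, _ in find_404s(log_lines))
```

From there it's a cron job and a call to your mailer of choice.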
Now that I have all that, what am I doing with it?
Well, just today, I probably added 10 more rules to a rewrite file to capture some inbounds going to pages that no longer exist. There is a replacement page for most so I can 301 those 404s and at least take advantage of them, can't I? I mean, these are historical pages that were removed years ago. But, guess what, there are still some links out there sending traffic to those pages.
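For the curious, the rules I mean look something like this. These are hypothetical entries (the real paths are site-specific): each retired page gets permanently redirected to its closest replacement.

```apache
# Hypothetical entries -- the real paths live in each site's rewrite file.
# Retired pages 301 to their closest replacements so inbound links still
# pay off instead of dead-ending in 404s.
RewriteEngine On
RewriteRule ^old-catalog\.asp$    /catalog/          [R=301,L]
RewriteRule ^products/(\d+)\.asp$ /product.php?id=$1 [R=301,L]
```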
Also, I can now see all the probing taking place, the Schmucks! They are sitting there pounding the site looking for ways to generate valid queries. And, it won't work because we have facilities in place to prevent that from happening. But, that doesn't stop them from probing and trying to find a hole to exploit.
I'm also finding invalid link references from really good sites. It's like, what were they thinking? Didn't they visit the link to verify that it was valid? I guess not. Well heck, I might as well 301 those too. No reason to return a 404 and lose the juice when I have a valid replacement page, is there?
And, with Yahoo!, MSN and Ask all requesting pages that have been gone for years, I'm going to send them somewhere now. Maybe I can get a little more traffic from the underdogs through our 404 handling.
I need to practice what I preach! I removed a group of 301s that were in place for at least 2 years. I figured, hey, those link references should be pretty much history by now. I was on a spring cleaning mission at that time. Now I know why I preach that once you implement a 301, it needs to stay forever and a day.
There I was, checking the 404s over the past 72 hours for a particular site and what do I find? Many of those 301s that I removed a little while back are still being requested by Yahoo!, MSN and Google. That's because there are still link references out on the web. Doh, now I know better. It was a quick fix. Just add the entries back in my .ini file and BAM, I'm now taking advantage of that 301 traffic no matter how small it may seem. I know, I know, I should have never removed the rules to begin with. Let this be a lesson learned. ;)
What else have I been finding you ask? Well, I learned that a very small number of web servers may be open for relay and can be used as a proxy server, not a good thing at all. There I was looking at 404s with another website's address. I immediately get on the horn to my programmer to see what's up. He explains what they were probing for and why the 404s were generated. I didn't know that stuff, I really didn't. Call me naive if you want but, tracking 404s at this level was not my forte in the past. I'm now making it a mission across the board. The information those 404 errors give me is a gold mine, it really is.
Edward, come on now, what are you on about this time?
I'm tellin ya' there's all sorts of stuff to be learned. For example, we may not prevent hotlinking in certain instances. A little while ago we renamed an entire library of images to break the leeching a bit as there were quite a few hotlinks out there. Little did I know that I probably should have left those alone. A recent patent from Microsoft hints at image search being influenced by hotlinking. That just gave me all sorts of ideas. So now I am updating some of those 404 hotlinked images. Oh, I'm being nice. I figure if someone is going to hotlink to one of our images from their myspace profile, and that profile gets at least 3-5 visits per day, I'm going to take advantage of that. Call me devious if you will. Why can't I have a little fun while doing all of this?
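If you'd rather have fun with hotlinkers than break their links, a rewrite along these lines does the trick. This is a sketch only; example.com and the image path are placeholders for your own. Any image request with a foreign referrer gets your branded replacement instead.

```apache
# Hypothetical sketch (example.com and the image path are placeholders).
# Serve a branded replacement image to hotlinkers instead of a 404.
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# Don't rewrite the replacement itself, or the rule would loop.
RewriteCond %{REQUEST_URI}  !^/images/branded-comeback\.png$
RewriteRule \.(gif|jpe?g|png)$ /images/branded-comeback.png [L]
```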
What were your last ten (10) 404s for?
So Edward, got any more interesting tidbits to share with us?
Sure do! One thing we are finding is that Google Analytics (GA) does cause 404s daily for many of our sites. My programmer explained it to me but I'm still not 100% sure on the details. What I understand is that there are two calls for the GA scripts, and one of those calls may fail. When it does, you'll know, because the destination URI gets google-analytics.com/ga.js appended to it. It's less than 1% on the one site I am actively monitoring now. But, that still means that my stats for that particular session are not accurate. No biggie based on the very small percentage.
Just recently I started seeing quite a few 404s with /MSOffice/cltreq.asp in the string, followed by various parameters. A quick search in Google and I find out that there are users with MS Office installed who have the Discussion Bar turned on in IE, and a query gets sent to the server to see if it supports Web Discussions. Hmmm, is that something I should look into? I've never done anything with that Discussion Bar in IE. What's it do? ;) Ah-ha, Log Spamming is still alive today. :)
I'm finding many invalid attempts at hacking this particular site's URI structure. 7 out of 10 of those come from a multitude of IPs in China. Hey, hack your own country's websites would ya!
I see constant probing for common vulnerabilities. Out of all the 404s for this one site, vulnerability probing is probably in the top 3 404s being generated. I can see the "dictionary" they are using because they will usually come in and BAM! generate about 40-50 404s in a period of a minute or two and be on their merry way.
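That burst pattern is easy to flag programmatically. Here's a rough sketch, assuming you've already pulled (ip, timestamp) pairs for your 404s; the threshold and window values are just the numbers from my example above, not magic constants.

```python
from collections import defaultdict

def find_burst_scanners(events, threshold=40, window=120):
    """events: iterable of (ip, unix_timestamp) pairs for 404 hits.

    Flags any client that racks up `threshold` or more 404s inside a
    `window`-second span -- the signature of a vulnerability scanner
    running through its dictionary.  Returns the set of flagged IPs.
    """
    by_ip = defaultdict(list)
    for ip, ts in events:
        by_ip[ip].append(ts)
    flagged = set()
    for ip, times in by_ip.items():
        times.sort()
        start = 0
        # Slide a window over the sorted timestamps.
        for end in range(len(times)):
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 >= threshold:
                flagged.add(ip)
                break
    return flagged
```

A legitimate visitor hitting the odd dead link never comes close to 40 misses in two minutes, so false positives should be rare.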
This morning I caught someone looking for query strings to generate. Hehehe, they can't get to them because we don't allow it. But they sure tried. Heck, they sat there for at least 10 minutes and ran about 40-50 attempts to no avail. Man, if we didn't have facilities in place to combat some of this stuff, the indexing routines of that particular site would be an absolute mess. I know what that person is looking for. I'm not going to let you sabotage my efforts that easily. ;)
Another thing I found was an incorrect link within a PDF training document on a .edu site. Someone changed the extension from .asp to .htm. You can believe I fixed that one real quick!
Maybe, just maybe I can get a few others involved here, yes?
Ah-ha, I've figured out why I'm the only one involved in these topics. They are only visible to me at the public level; that has to be it. It's jatar_k messing with me, I know you're out there!
So, back to my question, why would I want to see 401s and 403s? Anyone? Would this maybe clue me in to a potential hack attempt? What sorts of things could I learn from tracking these types of specific errors? Would someone concerned about the security of their platform be interested in these?
I'll be back to continue in a week or so...
I chuck them out for anybody pretending to be a roundtrip-dns-able search engine spider or attempting to abuse forms (among other things) and I want to know when people are doing these things.
Even though these are the ones I'm already stopping, watching the amateurs evolve helps you work out how to stop the less idiotic examples. To that end I grab GET, POST and SERVER variables in addition to the normal logging I do.
What I'm wondering is whether to stop chucking out the 403s (although still tracking the incidents that currently cause them).
(You're not tracking me are you?)
[edited by: Status_203 at 1:04 pm (utc) on July 3, 2008]