Monitoring and managing growth

How to know when the site's getting hammered or needs to expand

         

ianevans

3:35 am on May 3, 2005 (gmt 0)


Just wondering what software/utilities/solutions you use to monitor site performance and manage growth.

E.g.

One site I visit got hammered after appearing on CNN (same for getting Slashdotted, etc.) and managed to get extra servers from their provider. What would you use to monitor that situation and notify yourself that that situation was happening?

On the same note, what sort of yardsticks do you use to know when it's time to move from one dedicated server to separate web frontends and database servers? Are there ways of keeping those stats so you can see a trend and act on it?

Here's another example: right now our dedicated server is showing a load average of 0.04, 0.11, 0.16.

Over the last few days I've been unable to get on the site for a bit (too many mysql connections) and noticed loads of 16, 18, 20. That disappears shortly, but if I wasn't on the site myself, how would I know that was happening?

cgrantski

12:34 pm on May 3, 2005 (gmt 0)


If you have a decent hosting provider they will be using sophisticated software already for this and will be proactive in calling potential problems to your attention, and/or will do an assessment for you when asked - Compuware Vantage Suite is the one such tool I'm most familiar with but there are quite a few others. If this is a critical issue for you and your hosting provider doesn't have this in place, you might want to consider moving your site(s). So the first thing I'd do is ask them; they're your partner. Even if you host it yourself, your ISP should be able to help you if they're doing their job.

There are also lots of utility-level monitoring tools that hosting providers can use if they don't have the big diagnostic guns like Compuware offers. I wish I could name you some of these but I just deal with the data, not the tools, so I actually don't know.

At a minimum, you can watch your server logs for signs of trouble and look for patterns that help you predict problems. IIS logs can keep track of a great statistic called "time to serve" or something like that (the field is named time-taken in the W3C extended log format). Apache's standard log formats don't include it, but you can add it to a custom LogFormat with %D (microseconds) or %T (seconds). It's very valuable. Your stats program should (if it's doing its job) be able to give you a report of average time to serve, hour by hour through the day. In IIS it's measured in milliseconds, so you can get an idea of how bad it is for the end user. (Note - this is the time it takes for the server to process and send out a file once it has received a request, which is only a piece of the user's screen-painting lag experience.) My rule of thumb is that you should see no appreciable slowing of the server as the traffic goes up; in other words, the server should be able to handle anything your normal traffic does. If you start to see any kind of serious slowdown (a doubling of server time for an individual hour of the day, for example) then your server situation could be considered marginal.
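If your stats program doesn't produce that hour-by-hour report, here is a minimal sketch of one in Python. It assumes W3C extended log lines with a "#Fields:" header that includes the time and time-taken fields. The function name and the sample lines are made up for illustration, so treat it as a starting point rather than a finished tool.

```python
from collections import defaultdict

def hourly_average_time_taken(lines):
    """Return {hour: average time-taken in ms} from W3C extended log lines."""
    fields = []
    totals = defaultdict(lambda: [0, 0])  # hour -> [sum of ms, request count]
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The header tells us which column holds which field.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split()))
        try:
            hour = int(row["time"].split(":")[0])
            ms = int(row["time-taken"])
        except (KeyError, ValueError):
            continue  # skip lines without usable time / time-taken values
        totals[hour][0] += ms
        totals[hour][1] += 1
    return {hour: s / n for hour, (s, n) in sorted(totals.items())}

# Made-up sample lines, only to show the expected shape of the input.
sample = [
    "#Fields: date time cs-uri-stem sc-status time-taken",
    "2005-05-03 03:10:01 /index.html 200 120",
    "2005-05-03 03:40:15 /index.html 200 80",
    "2005-05-03 04:05:09 /index.html 200 400",
]
report = hourly_average_time_taken(sample)
for hour, avg in report.items():
    print(f"{hour:02d}:00  {avg:.0f} ms")
```

To use it on a real log, pass an open file object instead of the sample list, then compare the busy hours against the quiet ones.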

Remember that those hourly numbers are hourly AVERAGES, and the bad times can be only a few minutes long. That's why a modest increase in an hourly average should be paid attention to - it could be a sign of a huge problem that lasted only 20 minutes, but that was disguised by having the statistics in hourly chunks.
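To put rough numbers on that, here is a small Python sketch with invented figures: a server that normally answers in 100 ms but slows to 1000 ms for 20 minutes in the middle of the hour. It assumes the same number of requests every minute, which real traffic won't give you, so the exact values are illustrative only.

```python
from statistics import mean

def bucket_averages(per_minute_ms, bucket_minutes):
    """Average per-minute serve times over consecutive buckets of the given size."""
    return [
        mean(per_minute_ms[i:i + bucket_minutes])
        for i in range(0, len(per_minute_ms), bucket_minutes)
    ]

# Invented data: 20 normal minutes, a 20-minute slowdown, 20 normal minutes.
minutes = [100] * 20 + [1000] * 20 + [100] * 20

hourly = bucket_averages(minutes, 60)    # one number for the whole hour
five_min = bucket_averages(minutes, 5)   # twelve numbers for the same hour

print("hourly average:", hourly)
print("worst 5-minute average:", max(five_min))
```

The hourly figure comes out at 400 ms while the worst five-minute bucket shows the full 1000 ms, which is the kind of disguise described above.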

As for CNN coverage and so forth (this includes e-mail drops you might do, or roadblock ads on AOL, or other marketing actions) - these are things you should be planning for. If you know they are going to happen, make arrangements, even if you only have a few hours' notice. Then there are "duh" kinds of things, like doing your email drops in small chunks over the day - and make sure they go out a few minutes after the hour so that your hourly server time averages will contain the whole spike rather than diluting the spike's effect over two hours of stats. My rough rule of thumb here is that the drop size should be about 100,000 emails per server at a time - so a 3-server site trying to send out a million-email drop should split it into three mini-drops of roughly 333,000 each, a couple or three hours apart. Don't use this as hard and fast, because a lot depends on the open and clickthrough rate, whether the recipients are in a position to open the mails on receipt, and the capacity of the server farm and its internet connections. I'm sure others here have their own rules of thumb.
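The arithmetic behind that rule of thumb can be sketched in a few lines of Python. The function name, the 3-hour gap and the 5-minutes-past-the-hour start are assumptions picked from the ranges mentioned above, and the number of drops is simply rounded to the nearest whole number, so each drop can land a little over the 100,000-per-server figure.

```python
from datetime import datetime, timedelta

def plan_drops(total_emails, servers, first_hour,
               per_server=100_000, gap_hours=3, minutes_past=5):
    """Split an email drop into mini-drops sized by the per-server rule of thumb."""
    capacity = servers * per_server                    # emails per mini-drop, roughly
    drops = max(1, round(total_emails / capacity))     # nearest whole number of drops
    base, extra = divmod(total_emails, drops)          # spread any remainder evenly
    start = first_hour.replace(minute=minutes_past, second=0, microsecond=0)
    return [
        (start + timedelta(hours=i * gap_hours), base + (1 if i < extra else 0))
        for i in range(drops)
    ]

# The example from the post: a million emails on a 3-server site.
plan = plan_drops(1_000_000, servers=3, first_hour=datetime(2005, 5, 3, 9))
for send_at, size in plan:
    print(send_at.strftime("%H:%M"), size)
```

Sending a few minutes past the hour keeps each spike inside a single hourly bucket in the stats, as noted above.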

If you're going to do roadblock ads or something big on a major portal, plan for it and make serious arrangements in advance. I've seen roadblocks bring 6-server sites to their knees within a matter of minutes.

I hope this helps. It's a complicated operational issue and a slow user experience can be extremely difficult to track down because the relationships aren't at all linear and the causes can be elusive, to say the least.