Now if I call a page with a few code snippets, I think that the server calls up the page, parses it, generates the output and then serves it up. So, this process is repeated every time that page is called. This would seem to add significant processor overhead, since the script needs to be loaded into memory anew with each call. Not an issue on a small site, but it could become fairly significant on a larger site with a significant amount of traffic.
Do I understand the process so far?
If so, it would seem more efficient to call a script once that stayed resident in memory until it was exited. Such a script would generate pages on the fly from a central user interface, simply calling itself over and over, but presenting different content on each call.
But, my question is, does this model hold up, or will the server reload the entire script every time the page is called by a browser, thus creating that same overhead?
I suspect the answer is yes.
So, when it comes to scalability, what is the best approach? Should one keep the code on each page to the absolute minimum for that page to function? A complex dynamic site might have several thousand lines of code. It would seem a waste of resources to reload all of this with each call to the server.
Am I rambling? Is this a real issue, or have I taken myself off on a strange tangent?
To date I have worked on fairly small projects, but I am in the middle of one that promises to dwarf anything I have done to date. And, it may need to be scalable to fairly busy sites.
Can anyone give an understandable explanation of what is going on server side when my scripts are called?
WBF
Your description of what goes on behind the scenes at the server end is correct for CGI and server side processed pages such as ASP or PHP (AFAIK).
Your approach of 'a script ... that stayed resident in memory' is valid; that is what your web server and your database server do. But don't re-invent the wheel; class libraries for developing these sorts of application servers are available in your favourite language, say Java or C++. For more hints, google for terms such as Java, servlet, JBOSS, J2EE, ...
>>>"...does this model hold up, or will the server reload the entire script every time..."
The webserver needs to call something, but that could be a very simple application which just connects to your running application server. That's the basic concept, but this is hidden from you if you use something like JBOSS, J2EE, etc.
Other bits of advice: Applications for the web differ from traditional database applications in two main aspects: (1) the number of users on the system at any time is less predictable, (2) the web is stateless (hence the requirement for session objects). Because of this statelessness, what you need to do for each transaction is: [open the database, do all the database transacting as quickly as possible, and then close the database], i.e. interact with the database in short discrete bursts. Of course, opening and closing database connections is expensive, so connection pooling is used in heavy database applications.
Shawn
Another piece of generic advice:
If you're planning on a "database driven" site that could become huge, and the content of each page is based around generic parameters rather than anything user specific, then you should think about building a caching mechanism.
Very easy to do in any server side scripting language.
(1) You have a "cache" directory with write permissions to your script.
(2) When a page is requested, your script looks at the query string (URL) and derives the name of a cache file from the parameters.
(3) Looks in the cache directory to see if the file is present and recent enough for your liking, and simply serves the contents of the file if so.
(4) If there is no recent cache, then you run the database side of things and build the cache file there and then, and then serve the file as in (3).
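Steps (1) to (4) might look roughly like this. It is only a sketch in Python (the same idea works in any server side language), and the cache location, the 300 second freshness limit and the `build_page` callback are all assumptions for illustration, not anything from a particular site:

```python
import hashlib
import os
import tempfile
import time
from urllib.parse import parse_qsl

CACHE_DIR = os.path.join(tempfile.gettempdir(), "page_cache")  # hypothetical location
MAX_AGE_SECONDS = 300  # "recent enough for your liking"; an arbitrary choice


def cache_path(query_string):
    """Step 2: derive a cache file name from the request parameters."""
    # Sort the parameters so ?a=1&b=2 and ?b=2&a=1 share one cache file.
    params = sorted(parse_qsl(query_string, keep_blank_values=True))
    digest = hashlib.sha256(repr(params).encode("utf-8")).hexdigest()
    return os.path.join(CACHE_DIR, digest + ".html")


def serve_page(query_string, build_page):
    """Steps 3 and 4: serve a fresh cache file, or rebuild it first."""
    os.makedirs(CACHE_DIR, exist_ok=True)  # step 1: writable cache directory
    path = cache_path(query_string)
    try:
        fresh = time.time() - os.path.getmtime(path) < MAX_AGE_SECONDS
    except OSError:
        fresh = False  # no cache file yet
    if not fresh:
        html = build_page(query_string)  # the expensive database work
        tmp = path + ".%d.tmp" % os.getpid()
        with open(tmp, "w", encoding="utf-8") as f:
            f.write(html)
        os.replace(tmp, path)  # atomic swap, so readers never see half a file
    with open(path, encoding="utf-8") as f:
        return f.read()
```

Hashing the sorted parameters keeps odd characters from the URL out of the file name. Writing to a temporary file and renaming it means a visitor arriving mid-rebuild still gets a complete page.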
An alternative method is to have a background process that generates static pages from database content every few minutes (or seconds, depends how busy your site is).
What we're getting at is if you're planning a CGI based site for scalability, consider your options for reducing the amount of CGI required per hit.
So you can just concentrate on optimizing your application by using all those services provided by the container.
However such things IMHO will only play a role with very high intensity ECommerce-Applications, transactions and so on. Maybe with online-banking and a huge customerbase etc.
For your regular CMS-style db-driven websites, the effects are negligible IMHO. You can achieve a lot more with caching mechanisms etc. as mentioned, for example with the Smarty template engine.
It's not terribly difficult to build (or use an existing) database pool in the language of your choice, eliminating the need to open and close the database on every hit.
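To give an idea of the size of the job, here is a minimal pool sketched in Python on top of the standard `sqlite3` and `queue` modules. The class name, the pool size and the use of SQLite are assumptions made to keep the example self-contained; a real site would more likely use the pool that ships with its database driver or app server.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Keeps a fixed number of open connections and lends them out."""

    def __init__(self, database, size=4):
        self._idle = queue.Queue()
        for _ in range(size):
            # check_same_thread=False lets a connection be used by whichever
            # thread borrows it; the pool ensures one borrower at a time.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._idle.get(timeout=timeout)  # blocks while all are in use
        try:
            yield conn
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            self._idle.put(conn)  # hand back instead of closing

    def idle_count(self):
        return self._idle.qsize()
```

Each request then does `with pool.connection() as conn: ...`, keeps its database work short, and the connection goes back to the pool instead of being closed.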
The suggestions about caching are _VERY_ important also. I always try to design a site that can handle having no database connection and only using cached pages. Much better than giving an error.
daisho.
Seems a bit weird to me that we use databases and then go write code to do the exact thing the database is meant to do. Why is it that caching is so strongly recommended?
It's not hard to write a program which stores things in files, and retrieves things from files. And there is a well defined interface to database servers (viz. SQL). So if that were all there was to a database server, we wouldn't be so passionate about which ones we like or which are the best. So what does make a good database server? Well, its speed and efficiency, and one of the things which contributes to that is its caching algorithms.
And I am with bcc1234, that a decent servlet engine will hide other implementation nasties as well, such as maintaining a connection pool.
So why do we try and reinvent the wheel?
As I said, I'm no expert. So I'd really be interested to hear some views and even more interested to hear the results of any benchmarking tests anyone might have done.
Shawn
As an example (since I track page generation time), a page that calls the DB takes 0.14403903s but the same page once cached takes 0.00485301s. When you have a lot of traffic it makes a big difference, since the more connections to the database the slower it will go.
At first it really did seem strange to me to do such a thing but now I wouldn't go back.
daisho.
The reasoning for caching is that database connections and queries are expensive. If you have a method to detect changes, and you send a cached copy when there are none, you can handle _much_ more traffic.
I'll repeat myself - a decent app server will provide caching based on a uri. And a good EJB container will cache objects.
So when you have many requests to the same data at the same time, there won't be as many requests passed down to the database.
We all seem to agree on the things that allow applications to scale better. I guess that answers the original post.
The methods we use to achieve those goals differ, but that's another matter.
Do you know of any app servers for PHP or even ASP? I do not have experience with app servers but I've only heard about them in relation to Java (J2EE, JRE, JBoss etc) and have never come across one for PHP (though I admit I have never looked).
daisho.
As bcc1234 mentioned an APP Server would do that transparently but I know of no product such as this for PHP.
daisho.
After all, page generation code is almost never a bottleneck. If you want a design that scales well, lay out your database schema carefully.
Laying out your database schema is definitely an asset but that is not the only thing that makes a scalable design. Running a popular website can easily max out a database server with even simple queries. At that point there are 2 choices.
1. Caching
2. Buy more hardware and do replication
Because of the cost and no loss of functionality the caching option is _very_ appealing.
daisho.
>>>"Do you know of any app servers for PHP or even ASP? I do not have experience with app servers but I've only heard about them in relation to Java (J2EE, JRE, JBoss etc) and have never come across one for PHP (though I admit I have never looked)."
An application server is a servlet engine with additional services. Servlets are Java classes. This is a Java thing and there are no equivalents in the Perl/PHP world.
Persistent connections save a complicated query can take a while.
>>>"As bcc1234 mentioned an APP Server would do that transparently but I know of no product such as this for PHP."
That's not what I meant. mod_php does offer persistent connections. It basically keeps them open and maps them on a one-to-one basis. With PostgreSQL, for example, you would use pg_pconnect instead of pg_connect. So it's somewhat better than opening a new connection every time.
What I was saying is that EJB servers can cache entity beans, which means persistent data is kept in RAM and not read over and over from the database. All that introduces a lot of overhead and slows down performance, but increases scalability.
As far as caching of pages by an app server, that's a different thing. It has nothing to do with the database connections. You could as well configure squid in a perverted manner to achieve somewhat similar results.
I don't have much respect for EJB, so I would not advise it for 95% of projects out there, but an application server can make life a lot easier.
>>>"Laying out your database schema is definitely an asset but that is not the only thing that makes a scalable design. Running a popular website can easily max out a database server with even simple queries. At that point there are 2 choices."
Here is a practical explanation of what I said about EJB.
Let's say you have a popular message board.
And you have a category "Forums Index" and an object Category and a table category. Your reference number for that category is 1.
You have 50 requests and one update to your forum index in two seconds.
Here is how it goes:
1-4 goes on for the first few requests
1) client request
2) your app needs Category object with refnum 1
3) the bean with that category is called (primary key 1)
4) the db row with that category is fetched (primary key 1)
after that the caching logic kicks in
5-7 go for the other few requests made at approximately the same time
5) client request
6) your app needs Category object with refnum 1
7) the bean with that category is called (primary key 1), no db access
then some client updates the category 9-12
9) client request
10) your app needs Category object with refnum 1
11) the bean with that category is called (primary key 1)
12) the db row with that category is updated (primary key 1)
at the time of the update, other clients still get the object through 5-7
once the threshold drops - it gets back to 1-4 and frees up the memory
Since most sites have a few hot spots, this approach takes care of most of the load. If a request comes in for a category with refnum 135, it goes through 1-4. If many requests come in for 135, it gets cached and goes through 5-7 instead.
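For anyone who does not use EJB, the same flow can be sketched in a few lines of Python. This is not how an EJB container is implemented, just the idea: a read-through, write-through cache keyed by primary key. The `load_row` and `save_row` callbacks and the 2 second idle limit are assumptions for illustration.

```python
import time


class CategoryCache:
    """Read-through, write-through cache keyed by primary key."""

    def __init__(self, load_row, save_row, ttl_seconds=2.0, clock=time.monotonic):
        self._load_row = load_row      # hypothetical: fetches a row from the DB
        self._save_row = save_row      # hypothetical: writes a row to the DB
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}             # refnum -> (row, last_access_time)

    def get(self, refnum):
        now = self._clock()
        entry = self._entries.get(refnum)
        if entry is not None and now - entry[1] < self._ttl:
            row = entry[0]                  # steps 5-7: served from memory
        else:
            row = self._load_row(refnum)    # steps 1-4: go to the database
        self._entries[refnum] = (row, now)
        return row

    def update(self, refnum, row):
        self._save_row(refnum, row)         # steps 9-12: write to the database...
        self._entries[refnum] = (row, self._clock())  # ...and refresh memory

    def evict_idle(self):
        """Free entries nobody has asked for within the idle limit."""
        now = self._clock()
        idle = [k for k, (_, t) in self._entries.items() if now - t >= self._ttl]
        for refnum in idle:
            del self._entries[refnum]
```

Because every update goes through `update()`, readers never see stale data as long as this cache is the only writer. With several servers writing to the same database that assumption breaks and you need some form of invalidation between them.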
(In the meantime I found that for this special case using 'grep' instead of a RDBMS was a much better solution.)
Anyways, instead of just giving a pat answer I decided to give some personal experience.
I started my site on a Java/JSP engine with MySQL as the back end. This solution worked just fine for a while, but as traffic rose you could see the site bogging down. This was due to 2 things:
1) increased data in the system and hence bigger result sets being returned in response to queries
2) increased traffic and hence more people trying to get to the result sets
So my first thought was to cache material in memory (in the Resin JSP/Java server layer). This worked in that it improved the speed of certain transactions and once I reduced the loading, the overall transaction performance throughput improved too.
However, I was unprepared for what happened next. Since I was on a shared server the ISP pulled the plug on my site without any notice whatsoever and stated I was using too much of the machine. And there was no appeal, no reinstatement. That was that. They gave me access to my files and it was adios.
So I took out the in-memory caching and moved to another ISP who were happy that I signed up with them, and did not care so much that my transactions were slow, but did warn me against in-memory caching because I was honest and up front with them and talked it through.
Anyways the essential part of the solution came when I stepped away from the computer and started to think about my application in detail. What was the essential purpose of my application? What parts needed to be fast? What parts could be slow?
Essentially I broke out my application into sections. Some had to be real time, some could be cached.
Then it was a question of how to cache it. I decided that instead of doing the dynamic query and caching the query results, I was going to go the whole hog and produce static pages from the database, triggered by database updates and a timing mechanism.
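The "triggered by database updates and a timing mechanism" part can be sketched like this in Python. It is not the poster's actual Java/XSL setup; the class name, the `render` callback and the 900 second (15 minute) limit are assumptions used to show the shape of it.

```python
import os
import time


class StaticPublisher:
    """Regenerates a static page when data changes or the page gets too old."""

    def __init__(self, out_path, render, max_age_seconds=900, clock=time.time):
        self._out_path = out_path
        self._render = render          # hypothetical: builds HTML from the database
        self._max_age = max_age_seconds
        self._clock = clock
        self._dirty = True             # nothing published yet
        self._published_at = None

    def mark_dirty(self):
        """Call this from whatever code path updates the database."""
        self._dirty = True

    def tick(self):
        """Call periodically (cron, a timer thread); True means it republished."""
        too_old = (self._published_at is None
                   or self._clock() - self._published_at >= self._max_age)
        if not (self._dirty or too_old):
            return False
        html = self._render()
        tmp = self._out_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            f.write(html)
        os.replace(tmp, self._out_path)  # visitors never see a partial page
        self._dirty = False
        self._published_at = self._clock()
        return True
```

The web server then serves the generated file as plain static content, and the database is only touched when `tick()` decides a rebuild is due.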
At the same time I also decided to use XML documents and convert them using XSL stylesheets. Why? Well an xml document is a generate once, use many type of solution and using the stylesheets, changes over the site were easy to make. Simply generate the XML documents, then apply the relevant XSLs. I also made vast use of CSS as well to allow my now semi static site to be changed in a dynamic fashion if I selected today as a red day and tomorrow as a yellow day.
End of story for now. Traffic has quadrupled, but performance has IMPROVED. My main query pages are dynamic. No data is more than 15 minutes old. And the reality is, that is better than most people expected anyways.
Hope I have not bored anyone, but it's one way ahead. And certainly one way of stretching out any investment you may have made in hardware or services for your business.
mod_perl was designed to solve exactly the problem that the original poster described. With plain Perl CGI, the script and any libraries it uses must be compiled each time the script is called. mod_perl OTOH loads and compiles everything once, using the stored result to handle an indefinite number of requests.
And didn't I read somewhere that Amazon.com uses a customized mod_perl configuration? That should allay your scalability concerns.
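The compile-once idea is easy to demonstrate outside Perl. Below is a rough Python sketch, not mod_perl's actual mechanism: a long-running process keeps the compiled form of each script and only recompiles when the file's modification time changes. The class name and the convention that a script sets an `output` variable are made up for the example.

```python
import os


class ScriptCache:
    """Compiles each script once and reuses the result until the file changes."""

    def __init__(self):
        self._compiled = {}   # path -> (mtime_ns, code object)
        self.compiles = 0     # counter, only here to make the effect visible

    def load(self, path):
        mtime = os.stat(path).st_mtime_ns
        cached = self._compiled.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]                        # already compiled, reuse it
        with open(path, encoding="utf-8") as f:
            code = compile(f.read(), path, "exec")  # the per-request cost in plain CGI
        self.compiles += 1
        self._compiled[path] = (mtime, code)
        return code

    def handle_request(self, path, params):
        """Run the script for one request in a fresh namespace."""
        namespace = {"params": params}
        exec(self.load(path), namespace)
        return namespace.get("output")
```

Each request still runs the script from the top with its own variables. What is saved is the parsing and compiling, which is the overhead the original question was worried about.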
I'm in the caching camp... I'm also in favour of having complete control over what is cached and when it is refreshed or regenerated. Writing your own caching system is the way to ensure that, and to be honest it isn't that hard.
There are quite a few tools that will cache the compiled scripts (rather than cache the output). I use PHP Accelerator, mainly because it's free, but it is also used by Yahoo. If your scripts are small and called repeatedly then output caching may work better, but in my case I was looking to ensure even very infrequently requested pages were processed as quickly as possible.
1. The Zend engine parses and then compiles in RAM the scripts in question. Unless the file changes the code is not recompiled, and thus the binary code is "cached".
2. This can be improved further by using the Zend Encoder to optimise and then pre-compile the script.
3. From my understanding, the HTTP request is received by Apache (or 'x'), which on requesting a PHP page invokes the module handler for that extension, i.e. the PHP module. PHP then either compiles the script or pulls a pre-compiled binary from RAM (this is similar to server side javascript, which is compiled into a servlet once and then cached).
4. I don't know when PHP decides to compile and cache and when it doesn't, e.g. one line of PHP in a 100 line HTML document: is that compiled and cached? P'raps someone else could answer here.
5. Obviously this doesn't preclude a well designed backend - be it a traditional RDBMS database - or a 100,000 user LDAP directory ;)
6. As for general performance considerations:
We always use a reverse proxy cache server, as it will take all the load generated by static content off the webserver (which is probably where the PHP engine is running). So design the minimum content possible as dynamic to allow the proxy to offload it, or be clever about what you allow your proxy to cache via HTTP headers.
We have always had gzip compression switched on in Apache, which will massively increase the effective bandwidth available to your users (try zipping up an HTML doc and see what ratio you get). This is not free though, as your CPU is now busier. You can natively compress in PHP with the output handler, though I have no experience of this.
Some people will use an external compressor, though the issue here is to ensure that HTTP compression occurs before any SSL encryption as hash tables used in both are similar, and you'll get significantly less compression if you compress an SSL stream. Using hardware boxes here can be useful as you can provide a "round robin" type affair to load balance over multiple webservers.
Always ensure you are using a fully HTTP 1.1 compliant web server, and ensure that HTTP Pipelining and HTTP Persistent Connections are enabled. Obviously you need to make sure your reverse proxy can take advantage of these as well.
7. Finally if you're serious about this, check out the Zend Optimizer as well, as this apparently can increase run-time performance by up to 40%.
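On the gzip point in 6, if you want to see the ratio for yourself without zipping files by hand, a few lines of Python will do it. The sample page below is made up and deliberately repetitive, so substitute your own HTML to get a realistic figure.

```python
import gzip


def compression_ratio(data):
    """Return original size divided by gzip-compressed size."""
    compressed = gzip.compress(data, compresslevel=6)
    return len(data) / len(compressed)


# A made-up, repetitive HTML page standing in for real output.
rows = "".join("<tr><td>Item %d</td><td>In stock</td></tr>\n" % i for i in range(500))
page = ("<html><body><table>\n" + rows + "</table></body></html>").encode("utf-8")
print("%d bytes, ratio %.1f:1" % (len(page), compression_ratio(page)))
```

Markup-heavy pages tend to compress well because tags repeat so often. Already-compressed content such as JPEG images gains next to nothing.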
asp
P.S. One thing I completely forgot was - make sure you design your pages with xhtml and css. By using external css pages, not only can the browser cache them but so can your reverse proxy. This can have a massive benefit in controlling "page bloat".
Caching Tutorial for Web Authors and Webmasters [mnot.net]
...and particularly this section: Writing Cache-Aware Scripts [mnot.net]
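As a taste of what "cache-aware" means in practice, here is a small Python sketch of a script answering a conditional GET. The function name and the one hour `max_age` are assumptions, and it only handles the simple case of a single strong ETag in `If-None-Match` (the header may also carry a list or weak validators, which this ignores).

```python
import hashlib


def respond(body, request_headers, max_age=3600):
    """Return (status, headers, body) for a GET, honouring If-None-Match."""
    # A strong validator derived from the content itself.
    etag = '"%s"' % hashlib.sha256(body).hexdigest()[:16]
    headers = {
        "ETag": etag,
        "Cache-Control": "public, max-age=%d" % max_age,
    }
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""  # the client's or proxy's copy is still valid
    return 200, headers, body
```

The first response carries the ETag. When the browser or reverse proxy revalidates with `If-None-Match`, the script can answer 304 with an empty body instead of sending the whole page again.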
asp
Here is a good article on performance tuning PHP:
[php.weblogs.com...]
It addresses two types of caching, output gzip compression, recommendations on using multiple servers, performance tuning for both Apache and PHP settings, as well as Linux settings. It's a nice resource.
One thing I would suggest is to start visualizing how you would parallelize your database. If you can easily separate data amongst different machines, it is more efficient than just replication. And taking that into consideration before you start coding a large project can make your life far easier down the road.
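One simple way to picture separating data amongst machines is to pick the server from a hash of the record's key. A sketch in Python follows; the host names are hypothetical, and hashing on a user id is just one possible choice of key.

```python
import hashlib

# Hypothetical database hosts; replace with your own.
SHARDS = ["db0.example.internal", "db1.example.internal", "db2.example.internal"]


def shard_for(key, shards=SHARDS):
    """Map a record key to the machine that stores it."""
    # md5 is used only as a stable, evenly spread hash, not for security.
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]
```

Every query for a given key goes to one machine, so each server only holds and serves its own slice. The catch with plain modulo hashing is that adding a server remaps most keys, which is exactly the sort of thing that is easier to plan for before the project is coded.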