Scalability - Perl Server Side CGI Scripting forum at WebmasterWorld - WebmasterWorld



Scalability

what is the best approach to program design?


willybfriendly

2:17 am on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I am trying to get my head around the issue of scalability (if that is a word). That got my head going in the direction of just what is happening server side when a PHP script is run. I am not much of a hardware or systems guy, so even though this might be obvious to others it is not to me.

Now if I call a page with a few code snippets, I think that the server calls up the page, parses it, generates the output and then serves it up. So, this process is repeated every time that page is called. This would seem to add significant processor overhead, since the script needs to be loaded into memory anew with each call. Not an issue on a small site, but could become fairly significant on a larger site with a significant amount of traffic.

Do I understand the process so far?

If so, it would seem more efficient to call a script once that stayed resident in memory until it was exited. Such a script would generate pages on the fly from a central user interface, simply calling itself over and over, but presenting different content on each call.

But, my question is, does this model hold up, or will the server reload the entire script every time the page is called by a browser, thus creating that same overhead?

I suspect the answer is yes.

So, when it comes to scalability, what is the best approach? Should one keep the code on each page to the absolute minimum for that page to function? A complex dynamic site might have several thousand lines of code. It would seem a waste of resources to reload all of this with each call to the server.

Am I rambling? Is this a real issue, or have I taken myself off on a strange tangent?

To date I have worked on fairly small projects, but I am in the middle of one that promises to dwarf anything I have done to date. And, it may need to be scalable to fairly busy sites.

Can anyone give an understandable explanation of what is going on server side when my scripts are called?

WBF

ShawnR

4:19 am on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Wow, that is a monster question!

Your description of what goes on behind the scenes at the server end is correct for CGI and server side processed pages such as ASP or PHP (AFAIK).

Your approach of 'a script ... that stayed resident in memory' is valid; that is what your web server and your database server do. But don't re-invent the wheel; class libraries for developing these sorts of application servers are available in your favourite language, say Java or C++. For more hints, google for terms such as Java, servlet, JBOSS, J2EE, ...

>>>"...does this model hold up, or will the server reload the entire script every time..."
The webserver needs to call something, but that could be a very simple application which just connects to your running application server. That's the basic concept, but this is hidden from you if you use something like JBOSS, J2EE, etc.

Other bits of advice: Applications for the web differ from traditional database applications in two main aspects: (1) the number of users on the system at any time is less predictable, (2) the web is stateless (hence the requirement for session objects). Because of this statelessness, what you need to do for each transaction is: [open the database, do all the database transacting as quickly as possible, and then close the database]. i.e. interact with the database in short discrete bursts. Of course, opening and closing database connections is expensive, so connection pooling is used in heavy database applications.
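
A minimal sketch of the connection pooling idea, written in Python purely for illustration. The class name ConnectionPool, the function handle_request and the pool size are hypothetical, and sqlite3 stands in for whatever database server is actually used.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool; sqlite3 stands in for a real database server."""

    def __init__(self, database, size=2):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a connection be handed to another thread.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self):
        conn = self._idle.get()      # blocks while every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)     # hand it back to the pool instead of closing it


pool = ConnectionPool(":memory:", size=2)


def handle_request(pool):
    # Borrow a connection, do the database work quickly, give it back.
    with pool.connection() as conn:
        return conn.execute("SELECT 1 + 1").fetchone()[0]


print(handle_request(pool))
```

Each request still talks to the database in a short burst, but the cost of opening a connection is paid only when the pool is created.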

Shawn

dmorison

7:46 am on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Hi, WBF

Another piece of generic advice:

If you're planning on a "database driven" site that could become huge, and the content of each page is based around generic parameters rather than anything user specific, then you should think about building a caching mechanism.

Very easy to do in any server side scripting language.

(1) You have a "cache" directory with write permissions to your script.

(2) When a page is requested, your script looks at the query string (URL) and derives the name of a cache file from the parameters.

(3) Looks in the cache directory to see if the file is present and recent enough for your liking, and simply serves the contents of the file if so.

(4) If there is no recent cache, then you run the database side of things and build the cache file there and then, and then serve the file as in (3).
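
Steps (1) to (4) could look roughly like this. It is a sketch in Python rather than a drop-in script; CACHE_DIR, MAX_AGE_SECONDS, cache_path, serve and slow_page are all hypothetical names, and the five minute freshness limit is an arbitrary assumption.

```python
import hashlib
import os
import tempfile
import time

CACHE_DIR = tempfile.mkdtemp()   # step 1: a directory the script can write to (hypothetical)
MAX_AGE_SECONDS = 300            # hypothetical definition of "recent enough"


def cache_path(query_string):
    # Step 2: derive the cache file name from the request parameters.
    digest = hashlib.sha256(query_string.encode("utf-8")).hexdigest()
    return os.path.join(CACHE_DIR, digest + ".html")


def serve(query_string, build_page):
    path = cache_path(query_string)
    # Step 3: serve the cached file if it is present and recent enough.
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < MAX_AGE_SECONDS:
        with open(path, encoding="utf-8") as f:
            return f.read()
    # Step 4: otherwise do the expensive work, write the cache file, then serve it.
    html = build_page(query_string)
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(html)
    os.replace(tmp, path)        # rename into place so readers never see a partial file
    return html


def slow_page(query_string):
    # Stand-in for the database side of things.
    return "<p>results for " + query_string + "</p>"


first = serve("cat=1&page=2", slow_page)    # cache miss: builds and stores the page
second = serve("cat=1&page=2", slow_page)   # cache hit: read straight from the file
```

Hashing the query string keeps odd characters out of file names; it assumes that the same parameters in a different order may simply be cached twice.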

An alternative method is to have a background process that generates static pages from database content every few minutes (or seconds, depends how busy your site is).

What we're getting at is if you're planning a CGI based site for scalability, consider your options for reducing the amount of CGI required per hit.

jamesa

8:00 am on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

And yet another thought...

If the site is a finite, manageable size and the database is not updated very often (relative to the number of page views) then it may make sense to just generate static pages each time the database is updated, rather than building pages on-the-fly.

bcc1234

8:10 am on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

A decent servlet engine does all those things for you.
It will cache the pages you want to cache based on uri. It will maintain a database connection pool (periodically checking if those connections are still fine). It will keep most requested resources in the memory. It will manage application, session and request states.

So you can just concentrate on optimizing your application by using all those services provided by the container.

ruserious

9:36 am on May 6, 2003 (gmt 0)

10+ Year Member

In the Java World, there is the concept of Application Servers, which kind of do what you described (and a lot more). AFAIK there is no such equivalent in the PHP (or Perl) World. You could of course try to (re-)write such a thing in C++ and such, but that would be like re-inventing the wheel.

However such things IMHO will only play a role with very high intensity ECommerce-Applications, transactions and so on. Maybe with online-banking and a huge customerbase etc.

For your regular CMS-style db-driven websites, the effects are negligible IMHO. You can achieve a lot more with caching mechanisms etc. as mentioned, for example with the Smarty template engine.

willybfriendly

3:03 pm on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Thanks guys. Sounds like I have some serious reading to do. At least I have a direction to head in. I never cease to be amazed by the depth of knowledge on these boards.

WBF

drbrain

3:10 pm on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Don't forget that your OS is smart, so the file blocks will remain in memory somewhere in the disk or VM system, so you don't have to load the script back into memory (try it with any scripting language from the command prompt, on just about any OS, the first run hits the disk, the remaining ones don't).

It's not terribly difficult to build (or use an existing) database pool in the language of your choice, eliminating the need to open and close the database on every hit.

daisho

6:16 pm on May 6, 2003 (gmt 0)

10+ Year Member

If using PHP, Zend and ionCube (as well as others) provide caching programs as extensions to PHP. These parse and compile the script and leave it in memory. The next request will take the compiled version from memory rather than reading the file on disk and reparsing/compiling. This can save quite a bit of overhead.

The suggestions about caching are _VERY_ important also. I always try to design a site that can handle having no database connection and only using cached pages. Much better than giving an error.

daisho.

ShawnR

12:11 pm on May 7, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I'd like to challenge some of the views expressed. Not that I'm an expert, so perhaps all I'll be doing is showing my ignorance:

Seems a bit weird to me that we use databases and then go write code to do the exact thing the database is meant to do. Why is it that caching is so strongly recommended?

It's not hard to write a program which stores things in files, and retrieves things from files. And there is a well defined interface to a database server (viz. SQL). So if that were all there was to a database server, we wouldn't be so passionate about which ones we like or which are the best. So what does make a good database server? Well, its speed and efficiency, and one of the things which contributes to that is its caching algorithm.

And I am with bcc1234, that a decent servlet engine will hide other implementation nasties as well, such as maintaining a connection pool.

So why do we try and reinvent the wheel?

As I said, I'm no expert. So I'd really be interested to hear some views and even more interested to hear the results of any benchmarking tests anyone might have done.

Shawn

daisho

12:37 pm on May 7, 2003 (gmt 0)

10+ Year Member

The reasoning for caching is that database connections and queries are expensive. If you have a method to detect changes and if there is none you send a cached copy you can handle _much_ more traffic.

As an example (since I track page generation time) a page that calls to the DB takes 0.14403903s but the same page once cached takes 0.00485301s. When you have a lot of traffic it makes a big difference since the more connections to the database the slower it will go.
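
To put those two timings in perspective, here is a back-of-the-envelope calculation in Python. It assumes a single worker handling requests one after another, which is a simplification and not a claim about any particular server setup.

```python
uncached = 0.14403903   # seconds per page with a database call (figure quoted above)
cached = 0.00485301     # seconds per page served from cache (figure quoted above)

speedup = uncached / cached

# Upper bounds for one worker serving pages back to back (an assumption).
pages_per_second_uncached = 1 / uncached
pages_per_second_cached = 1 / cached

print(f"speedup: {speedup:.1f}x")
print(f"uncached: {pages_per_second_uncached:.1f} pages/s")
print(f"cached:   {pages_per_second_cached:.1f} pages/s")
```

The cached page is roughly 30 times cheaper to produce, so under that assumption one worker goes from about 7 pages per second to a little over 200.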

At first it really did seem strange to me to do such a thing but now I wouldn't go back.

daisho.

bcc1234

1:00 pm on May 7, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

The reasoning for caching is that database connections and queries are expensive. If you have a method to detect changes and if there is none you send a cached copy you can handle _much_ more traffic.

I'll repeat myself - a decent app server will provide caching based on a uri. And a good EJB container will cache objects.
So when you have many requests to the same data at the same time, there won't be as many requests passed down to the database.

We all seem to agree on the things that allow applications to scale better. I guess that answers the original post.

The methods we use to achieve those goals differ, but that's another matter.

daisho

2:07 pm on May 7, 2003 (gmt 0)

10+ Year Member

bcc1234, I fully agree with you. You can either set up an app server or do caching yourself: 2 different solutions for the same problem. Either way you are reducing the database queries.

Do you know of any app servers for PHP or even ASP? I do not have experience with app servers but I've only heard about them in relation to java (J2EE, JRE, JBoss etc) and have never come across one for PHP (though I admit I have never looked).

daisho.

ShawnR

3:14 pm on May 7, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

"...reasoning for caching is that database connections and queries are expensive. ..."

Ok, I get it now. You mean when using server side scripting languages ala cgi or php, where there is no way to keep the database connection. Thanks.

daisho

4:13 pm on May 7, 2003 (gmt 0)

10+ Year Member

Persistent connections save connection time, but a complicated query can still take a while. Then a loop to process the result to spit out HTML takes more time. If you just cache the HTML you save a lot of time.

As bcc1234 mentioned an APP Server would do that transparently but I know of no product such as this for PHP.

daisho.

Storyteller

4:16 pm on May 7, 2003 (gmt 0)

10+ Year Member

Keeping database connections in PHP is easy (the so-called persistent connections feature). Connection pooling is another story, and is probably database driver-dependent, or may not be possible at all. For example, Perl's Apache::DBI can't yet pool connections, which is sad.

After all, page generation code is almost never a bottleneck. If you want a design that scales well, lay out your database schema carefully.

daisho

4:36 pm on May 7, 2003 (gmt 0)

10+ Year Member

After all, page generation code is almost never a bottleneck. If you want a design that scales well, lay out your database schema carefully.

Laying out your database schema is definitely an asset but that is not the only thing that makes a scalable design. Running a popular website can easily max out a database server with even simple queries. At that point there are 2 choices.

1. Caching
2. Buy more hardware and do replication

Because of the cost and no loss of functionality the caching option is _very_ appealing.

daisho.

bcc1234

12:53 am on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Do you know of any app servers for PHP or even ASP? I do not have experience with app servers but I've only heard about them in relation to java (J2EE, JRE, JBoss etc) and have never come across one for PHP (though I admit I have never looked).

An application server is a servlet engine with additional services. Servlets are Java classes. This is a Java thing and there are no equivalents in the Perl/PHP world.

Persistent connections save connection time, but a complicated query can still take a while.
As bcc1234 mentioned an APP Server would do that transparently but I know of no product such as this for PHP.

That's not what I meant. mod_php does offer persistent connections. It basically keeps them open and maps them on a one-to-one basis. With PostgreSQL, for example, you would use pg_pconnect instead of pg_connect. So it's somewhat better than opening a new connection every time.

What I was saying, is that EJB servers can cache entity beans. Which means persistent data is kept in ram and not being read over and over from the database. All that introduces a lot of overhead and slows down performance, but increases scalability.

As far as caching of pages by an app server, that's a different thing. It has nothing to do with the database connections. You could as well configure squid in a perverted manner to achieve somewhat similar results.

I don't have much respect for EJB, so I would not advise it for 95% of projects out there, but an application server can make life a lot easier.

daisho

1:12 am on May 8, 2003 (gmt 0)

10+ Year Member

Sorry bcc1234, I do understand what you're saying. I was just not clear in my message. I was not saying that an app server is a synonym for persistent connections.

daisho.

bcc1234

1:16 am on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Laying out your database schema is definitely an asset but that is not the only thing that makes a scalable design. Running a popular website can easily max out a database server with even simple queries. At that point there are 2 choices.

Here is a practical explanation of what I said about EJB.

Let's say you have a popular message board.
And you have a category "Forums Index" and an object Category and a table category. Your reference number for that category is 1.

You have 50 requests and one update to your forum index in two seconds.
Here is how it goes:

Steps 1-4 apply to the first few requests

1) client request
2) your app needs Category object with refnum 1
3) the bean with that category is called (primary key 1)
4) the db row with that category is fetched (primary key 1)

after that the caching logic kicks in
5-7 apply to the other requests made at approximately the same time

5) client request
6) your app needs Category object with refnum 1
7) the bean with that category is called (primary key 1), no db access

then some client updates the category 9-12

9) client request
10) your app needs Category object with refnum 1
11) the bean with that category is called (primary key 1)
12) the db row with that category is updated (primary key 1)

at the time of the update, other clients still get the object through 5-7

once the threshold drops - it gets back to 1-4 and frees up the memory

Since most sites have a few hot spots, this approach takes care of most of the load. If a request comes in for a category with a refnum 135 - it goes through 1-4. If many requests come for 135 - it gets cached and goes through 5-7 instead.
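
The flow above can be mimicked outside of EJB. The following Python sketch is not how any EJB container is implemented; HotSpotCache, its threshold and the dict that plays the role of the category table are all hypothetical, and the eviction step ("once the threshold drops") is left out to keep it short.

```python
class HotSpotCache:
    """Hypothetical read-through cache that only keeps frequently requested rows."""

    def __init__(self, load_row, save_row, threshold=3):
        self._load_row = load_row      # stands in for the database fetch (step 4)
        self._save_row = save_row      # stands in for the database update (step 12)
        self._threshold = threshold
        self._hits = {}
        self._cached = {}

    def get(self, key):
        if key in self._cached:
            return self._cached[key]   # steps 5-7: no database access
        self._hits[key] = self._hits.get(key, 0) + 1
        row = self._load_row(key)      # steps 1-4: go to the database
        if self._hits[key] >= self._threshold:
            self._cached[key] = row    # the key is now considered "hot"
        return row

    def update(self, key, row):
        self._save_row(key, row)       # steps 9-12: write to the database
        if key in self._cached:
            self._cached[key] = row    # keep the cached copy in step with the database


table = {1: "Forums Index"}            # plays the role of the category table
db_reads = []


def load(key):
    db_reads.append(key)
    return table[key]


cache = HotSpotCache(load, table.__setitem__, threshold=3)
for _ in range(50):                    # 50 requests for the category with refnum 1
    cache.get(1)
cache.update(1, "Forum Index (renamed)")
print(len(db_reads), cache.get(1))
```

With a threshold of 3, only the first three of the 50 requests reach the "database"; the rest, including the read after the update, are answered from memory.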

Fischerlaender

7:01 pm on May 9, 2003 (gmt 0)

10+ Year Member

One example where caching is (nearly) mandatory:
For a very special site I had to use SQL queries which resulted in a fullscan in some cases. Because the table was large any fullscan took about ten seconds. I implemented a caching mechanism as already suggested here and the database stopped moaning ...

(In the meantime I found that for this special case using 'grep' instead of a RDBMS was a much better solution.)

Storyteller

8:10 pm on May 10, 2003 (gmt 0)

10+ Year Member

bcc1234, mod_perl handlers are the perfect equivalent of Java servlets. PHP doesn't have anything like this, true.

As to caching issues, the new MySQL 4.x has transparent query cache. I recently tried it and it works great.

MetropolisRobot

1:01 am on May 13, 2003 (gmt 0)

10+ Year Member

An interesting thread this one and a subject matter that has as many answers as there are people to dream them up.

Anyway, instead of just giving a pat answer I decided to share some personal experience.

I started my site on a Java/JSP engine with MySQL as the back end. This solution worked just fine for a while, but as traffic rose you could see the site bogging down. This was due to 2 things:

1) increased data in the system and hence bigger result sets being returned in response to queries

2) increased traffic and hence more people trying to get to the result sets

So my first thought was to cache material in memory (in the Resin JSP/Java server layer). This worked in that it improved the speed of certain transactions and once I reduced the loading, the overall transaction performance throughput improved too.

However, I was unprepared for what happened next. Since I was on a shared server the ISP pulled the plug on my site without any notice whatsoever and stated I was using too much of the machine. And there was no appeal, no reinstatement. That was that. They gave me access to my files and it was adios.

So I took out the in-memory caching and moved to another ISP, who were happy that I signed up with them and did not care so much that my transactions were slow, but did warn me against in-memory caching because I was honest and up front with them and talked it through.

Anyways the essential part of the solution came when I stepped away from the computer and started to think about my application in detail. What was the essential purpose of my application? What parts needed to be fast? What parts could be slow?

Essentially I broke out my application into sections. Some had to be real time, some could be cached.

Then it was a question of how to cache it. I decided that instead of doing the dynamic query and caching the query results, I was going to go the whole hog and produce static pages from the database, triggered by database updates and a timing mechanism.
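
As a rough illustration of "static pages triggered by database updates and a timing mechanism", here is a Python sketch. It is not the poster's actual system: OUTPUT_DIR, publish, update_article and refresh_stale are hypothetical names, a dict stands in for the database, and the 15 minute limit is borrowed from the figure mentioned later in this post.

```python
import os
import tempfile
import time

OUTPUT_DIR = tempfile.mkdtemp()      # hypothetical document root for generated pages
MAX_AGE_SECONDS = 15 * 60            # assumed limit: no page older than 15 minutes

articles = {"home": "Welcome"}       # stands in for the database


def publish(slug):
    """Render one record to a static HTML file and return its path."""
    path = os.path.join(OUTPUT_DIR, slug + ".html")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<h1>" + articles[slug] + "</h1>")
    return path


def update_article(slug, text):
    # Trigger 1: every database update republishes the affected page.
    articles[slug] = text
    return publish(slug)


def refresh_stale(now=None):
    # Trigger 2: a timer (cron, for example) republishes anything missing or too old.
    now = time.time() if now is None else now
    refreshed = []
    for slug in articles:
        path = os.path.join(OUTPUT_DIR, slug + ".html")
        if not os.path.exists(path) or now - os.path.getmtime(path) > MAX_AGE_SECONDS:
            publish(slug)
            refreshed.append(slug)
    return refreshed
```

The web server then only ever serves plain files from the output directory; the database is touched when content changes or when the timer fires, not on every page view.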

At the same time I also decided to use XML documents and convert them using XSL stylesheets. Why? Well an xml document is a generate once, use many type of solution and using the stylesheets, changes over the site were easy to make. Simply generate the XML documents, then apply the relevant XSLs. I also made vast use of CSS as well to allow my now semi static site to be changed in a dynamic fashion if I selected today as a red day and tomorrow as a yellow day.

End of story for now. Traffic has quadrupled, but performance has IMPROVED. My main query pages are dynamic. No data is more than 15 minutes old. And the reality is, that is better than most people expected anyways.

Hope I have not bored anyone, but it's one way ahead. And certainly one way of stretching out any investment you may have made in hardware or services for your business.

Islander

2:43 am on May 13, 2003 (gmt 0)

10+ Year Member

I'd like to expand on what Storyteller said about mod_perl:

mod_perl was designed to solve exactly the problem that the original poster described. With plain Perl CGI, the script and any libraries it uses must be compiled each time the script is called. mod_perl OTOH loads and compiles everything once, using the stored result to handle an indefinite number of requests.
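
The difference can be imitated in Python as a loose analogy (this is not mod_perl itself, and real gains depend on the script). SCRIPT_SOURCE is a made-up "script" of 200 small functions; one function compiles it on every call, the way plain CGI would, and the other reuses a code object compiled once.

```python
import timeit

# Hypothetical "script": stands in for a CGI program plus the libraries it loads.
SCRIPT_SOURCE = "\n".join(f"def handler_{i}(x):\n    return x + {i}" for i in range(200))
SCRIPT_SOURCE += "\nresult = handler_199(1)\n"


def cgi_style_request():
    # Plain CGI analogy: parse and compile the source on every request, then run it.
    namespace = {}
    exec(compile(SCRIPT_SOURCE, "<script>", "exec"), namespace)
    return namespace["result"]


COMPILED = compile(SCRIPT_SOURCE, "<script>", "exec")   # done once, at startup


def persistent_style_request():
    # Persistent interpreter analogy: reuse the already compiled code object.
    namespace = {}
    exec(COMPILED, namespace)
    return namespace["result"]


if __name__ == "__main__":
    print("compile every time:", timeit.timeit(cgi_style_request, number=200))
    print("compile once:      ", timeit.timeit(persistent_style_request, number=200))
```

Both functions return the same result; the timings printed at the end show how much of each "request" was spent on compilation alone.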

And didn't I read somewhere that Amazon.com uses a customized mod_perl configuration? That should allay your scalability concerns.

gethan

8:26 am on May 13, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

And yahoo are moving to php... [developers.slashdot.org]

I'm in the caching camp... I'm also in favour of having complete control over what is cached and when it is refreshed or regenerated. Writing your own caching system is the way to ensure that, and to be honest it isn't that hard.

RichD

11:08 am on May 13, 2003 (gmt 0)

10+ Year Member

After spending a lot of time optimising database queries and indexes to get the best performance (using EXPLAIN SELECT .. FROM .. WHERE ..), I found the next bottle neck was the time to parse and compile the php scripts at each request.
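
For anyone who has not used EXPLAIN before, here is a small self-contained illustration in Python. Note the swap: it uses SQLite's EXPLAIN QUERY PLAN through the standard sqlite3 module rather than MySQL's EXPLAIN SELECT, and the posts table and idx_posts_author index are made up for the example. The exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, body TEXT)")

QUERY = "SELECT id FROM posts WHERE author = ?"


def plan(connection):
    # The last column of each row is a human-readable description of one plan step.
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("someone",)).fetchall()
    return " | ".join(row[-1] for row in rows)


before = plan(conn)     # without an index: the plan describes a scan of the table
conn.execute("CREATE INDEX idx_posts_author ON posts (author)")
after = plan(conn)      # with the index: the plan names idx_posts_author

print(before)
print(after)
```

The idea is the same as with MySQL: ask the database how it intends to run the query, add or change an index, then ask again and compare.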

There are quite a few tools that will cache the compiled scripts (rather than cache the output). I use PHP Accelerator, mainly because it's free, but it is also used by Yahoo. If your scripts are small and called repeatedly then output caching may work better, but in my case I was looking to ensure even very infrequently requested pages were processed as quickly as possible.

aspr1n

12:11 pm on May 13, 2003 (gmt 0)

10+ Year Member

I'm not quite sure where the Java EJB stuff came in as I think the questioner mentioned PHP, I don't think he mentioned db caching either - so here's my take...

1. The Zend engine parses and then compiles in RAM the scripts in question. Unless the file changes the code is not recompiled, and thus the binary code is "cached".

2. This can be improved further by using the Zend Encoder to optimise and then pre-compile the script.

3. From my understanding, the HTTP request is received by Apache (or 'x'), which on a request for a PHP page invokes the module handler for that extension, i.e. the PHP module. PHP then either compiles the script or pulls a pre-compiled binary from RAM (this is similar to JSP, where a page is compiled into a servlet once and then cached).

4. I don't know when PHP decides to compile and cache and when it doesn't. ie. one line of php in a 100 line html document - is that compiled and cached? P'raps someone else could answer here.

5. Obviously this doesn't preclude a well designed backend - be it a traditional RDBMS database - or a 100,000 user LDAP directory ;)

6. As for general performance considerations:

We always use a reverse proxy cache server, as it will take all the load generated by static content off the webserver (which is probably where the PHP engine is running). So design the minimum content possible as dynamic to allow the proxy to offload it, or be clever about what you allow your proxy to cache via HTTP headers.
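
As a sketch of what "being clever via HTTP headers" can mean, the Python below builds Cache-Control and Last-Modified headers and answers conditional requests. The function names cache_headers and not_modified are hypothetical; the header names and directives themselves (public, private, max-age, no-store, If-Modified-Since) are standard HTTP.

```python
from email.utils import formatdate, parsedate_to_datetime


def cache_headers(last_modified_ts, max_age_seconds, shared=True):
    """Build response headers that tell browsers and proxies what they may cache."""
    if max_age_seconds <= 0:
        return {"Cache-Control": "no-store"}          # truly dynamic: never cache
    scope = "public" if shared else "private"          # private: browser only, not the proxy
    return {
        "Cache-Control": f"{scope}, max-age={max_age_seconds}",
        "Last-Modified": formatdate(last_modified_ts, usegmt=True),
    }


def not_modified(if_modified_since, last_modified_ts):
    """True when a conditional request can be answered with 304 and no body."""
    if not if_modified_since:
        return False
    try:
        since = parsedate_to_datetime(if_modified_since).timestamp()
    except (TypeError, ValueError):
        return False                                   # unparseable date: send the full page
    return int(last_modified_ts) <= since


headers = cache_headers(1_000_000_000, 600)            # cacheable by the proxy for 10 minutes
print(headers)
```

A script would send these headers along with the page, and check not_modified against the request's If-Modified-Since header before doing any expensive work.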

We have always had gzip compression switched on in Apache, which will massively increase the effective bandwidth available to your users (try zipping up an html doc and see what ratio you get). This is not free though, as your CPU is now busier. You can natively compress in PHP with the output handler - though I have no experience of this.
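
The "try zipping up an html doc" experiment takes a few lines with Python's standard gzip module. The page below is invented and very repetitive, so it compresses better than a typical real page would; treat the ratio as an illustration only.

```python
import gzip

# Hypothetical page: deliberately repetitive markup, so the ratio is optimistic.
row = '<tr><td class="cell">row</td></tr>'
html = ("<html><body><table>" + row * 500 + "</table></body></html>").encode("utf-8")

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)

print(f"{len(html)} bytes -> {len(compressed)} bytes ({ratio:.1%} of original)")
```

Running the same three lines over a saved copy of one of your own pages gives a more honest number.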

Some people will use an external compressor, though the issue here is to ensure that HTTP compression occurs before any SSL encryption as hash tables used in both are similar, and you'll get significantly less compression if you compress an SSL stream. Using hardware boxes here can be useful as you can provide a "round robin" type affair to load balance over multiple webservers.

Always ensure you are using a fully HTTP 1.1 compliant web server, and ensure that HTTP Pipelining and HTTP Persistent Connections are enabled. Obviously you need to make sure your reverse proxy can take advantage of these as well.

7. Finally if you're serious about this, check out the Zend Optimizer as well, as this apparently can increase run-time performance by up to 40%.

asp

P.S. One thing I completely forgot was - make sure you design your pages with xhtml and css. By using external css pages, not only can the browser cache them but so can your reverse proxy. This can have a massive benefit in controlling "page bloat".

aspr1n

6:51 pm on May 13, 2003 (gmt 0)

10+ Year Member

just wanted to add these in for ref:

Caching Tutorial for Web Authors and Webmasters [mnot.net]

...and particularly this section: Writing Cache-Aware Scripts [mnot.net]

asp

ggrot

4:40 am on May 14, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I've actually had to face the scalability problem with a frequent update (hard to cache) site. There are about 10-20 views of a page for each time it gets updated, so caching output is questionable in value at best.

Here is a good article on performance tuning PHP:
[php.weblogs.com...]

It addresses two types of caching, output gzip compression, recommendations on using multiple servers, performance tuning for both Apache and PHP settings, as well as Linux settings. It's a nice resource.

One thing I would suggest is to start visualizing how you would parallelize your database. If you can easily separate data amongst different machines, it is more efficient than just replication. And putting that into consideration before you start coding a large project can make your life far easier down the road.
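
One simple way to separate data amongst machines is to route each key to a shard by hashing it. The Python sketch below assumes the data can be split by user id; the host names in SHARDS and the function shard_for are hypothetical.

```python
import hashlib

# Hypothetical database hosts; in practice these would be connection settings.
SHARDS = ["db0.example.internal", "db1.example.internal", "db2.example.internal"]


def shard_for(user_id):
    """Map a key to one shard, the same one every time."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]


print(shard_for(42), shard_for(43))
```

The catch with plain modulo hashing is that changing the number of shards moves most keys to a different machine, which is one more reason to think about the split before coding starts.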

MetropolisRobot

10:42 pm on May 14, 2003 (gmt 0)

10+ Year Member

Sorry if I got a bit off topic with the Java/JSP story. I also use PHP in my site and use the same techniques with the XML and XSL to generate static pages.

Keep all the comments coming in this thread: it makes for a very good read and also is VERY useful.

This 38 message thread spans 2 pages.
