Forum Moderators: phranque

Message Too Old, No Replies

Mixed static and dynamic content compression

A split approach to achieve high compression with low CPU overhead

         

lammert

5:59 pm on Mar 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Description of the problem

I am serving websites from an Apache 2.0 server. These sites consist of a combination of static and dynamically generated content. Because a large percentage of my audience lives in parts of the world where internet access is slow (dial-up connections with 2 to 3 kByte per second max), I use Apache's compression feature to serve small files whenever possible. This is done with the mod_deflate module available in Apache 2.0. Users of Apache 1.3 could use mod_gzip instead.

Adding on-line compression is a nice feature, but it has some drawbacks. One is that mod_deflate compresses the content every time a browser requests it, which costs time and CPU overhead. For dynamically generated content there is no alternative, but static content is another issue. It doesn't feel good to see thousands of requests per day in your access_log for the same static file, with the compression algorithm started every time to generate the same output.

Furthermore, the mod_deflate module available in Apache 2.0 doesn't generate the optimal compressed file. The results are worse than can be achieved with native gzip compression. I have set the compression level to 1 with the directive DeflateCompressionLevel 1, but even when set to 9 the result is worse than regular gzip compression. For the files I tested the difference is somewhere between 10 and 25%. The size difference between compression levels 1 and 9 is so small that I decided to use the fastest compression method, thereby reducing CPU overhead and sending output to clients on broadband connections more quickly. They would otherwise see an annoying delay while large chunks of data are first compressed at a high compression level.
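To illustrate the diminishing returns of higher compression levels, here is a small Python sketch (the sample data is invented; real ratios depend on your files) comparing zlib output sizes at levels 1 and 9:

```python
import zlib

# Hypothetical sample: repetitive text, as JavaScript source tends to be.
data = b"function example() { return document.getElementById('x'); }\n" * 400

fast = zlib.compress(data, 1)   # level 1: fastest, lowest CPU cost
best = zlib.compress(data, 9)   # level 9: slowest, best ratio

print(len(data), len(fast), len(best))
```

For typical script files the level-9 output is only marginally smaller than level 1, which is what makes the low level attractive for on-the-fly compression.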

The optimal solution worked out for .js files

I have thought of a solution to the problem mentioned above. The main place where it occurs is with my external JavaScript files. They are quite large and most of them do not change often. Some do change often, however, and I don't want to generate a compressed copy of the file every time I make a change. So ideally it would be a system that can handle both pre-compressed files, processed with gzip at the maximum compression level, and plain JavaScript files which have no pre-compressed version and are handled by mod_deflate on the fly.
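As a sketch of the pre-compression step (directory and file names are just examples), a small Python script can write a maximum-compression .gz variant next to each plain language variant, which the content negotiation below can then pick up:

```python
import glob
import gzip
import os

def precompress(directory):
    """Write a .gz variant (gzip level 9) next to each plain .js.en file."""
    for path in glob.glob(os.path.join(directory, "*.js.en")):
        with open(path, "rb") as src, \
             gzip.open(path + ".gz", "wb", compresslevel=9) as dst:
            dst.write(src.read())

# Example (hypothetical path):
# precompress("/var/www/html/scripts")
```

Rerunning the script after editing a file regenerates its compressed twin, so the two copies stay in sync.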

Content negotiation should be taken into account, because not all browsers accept compressed content. Because I have a separate directory for all my .js files, I tested the following code:

<Directory "/var/www/html/scripts">

ExpiresActive On
ExpiresDefault "access plus 2 hours"

Options +MultiViews

AddEncoding x-gzip .gz
AddType application/x-javascript .gz

BrowserMatch ^Mozilla/4 no-gzip
BrowserMatch \bMSIE !no-gzip
Header append Vary User-Agent

AddOutputFilter DEFLATE js

</Directory>

A small explanation:
The Expires directives tell the browser and intermediate proxies that the retrieved content can be cached for the next two hours. This causes the browser to use the cached version instead of asking the server again whenever a new page is loaded which requires that JavaScript file. During one visitor session there will normally be no need to retrieve a file multiple times. If the visitor comes back another day, the 2-hour setting makes sure he receives a fresh version of the file.

The +MultiViews option switches automatic file search on whenever a file doesn't exist. For example, if the file test.js is requested but doesn't exist, Apache will search for all files matching the pattern test.js.* and decide which one to return to the browser.

The AddEncoding and AddType directives tell Apache that the .gz files in this directory are in fact gzip-encoded JavaScript files. If these lines are not present, Apache will return the default type application/x-gzip, which is not understood as JavaScript by the browser.

The BrowserMatch directives disable compression for specific old versions of browsers with bugs. The Header append directive tells intermediate caching proxies that the content may change depending on the browser identification string.

The last directive adds the default mod_deflate compression filter for regular js files found.

Process Flow if .js file exists

If the browser requests a .js file which exists, the browser type is checked to see if it is an old version. Furthermore, Apache checks if the browser accepts compressed content. If both checks are OK, the .js file is compressed by mod_deflate and sent to the browser. If compression is not possible, the script file is sent without modifications.

Process Flow if .js file does not exist

The browser request for a non-existent file (for example test.js) causes the MultiViews system to start. It looks for all files matching the pattern test.js.* and checks which file types are accepted by the browser. Either the plain version (test.js.en) or the gzipped version (test.js.en.gz) is returned.

Remaining questions

The system as mentioned above works. I have however some questions and remarks left for optimizing.

  • Every MultiViews search for files matching a specific pattern is an expensive operation. Although there are just a few dozen files in the scripts directory, this might have an impact on performance. Does anyone have an idea whether the MultiViews solution is more expensive than the gain of not having to compress on-line?
  • I would like to use a type-map instead of MultiViews, but this doesn't give me the flexibility of automatic detection of the existence of compressed files. I have to maintain a map file manually. Is there a possibility to combine the flexibility of MultiViews with the speed of type maps?
  • The BrowserMatch lines appear to be processed only when an actual .js file exists. The MultiViews content negotiation seems to bypass this additional check, which may cause gzipped JavaScript files to be sent to browsers which can't handle them. For clarity: some old versions of Netscape request gzip compression for all files, while in fact they can only handle compression correctly for .html files.
  • I am not happy with the AddType and AddEncoding directives for the .gz files. Although it works, it feels like a trick and it prevents me from serving compressed files of more than one type from any given directory. Alternatives would be appreciated.
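To illustrate the type-map alternative from the second point: a hand-maintained map file (here a hypothetical test.js.var, activated with AddHandler type-map .var) would look roughly like this, with qs values expressing preference for the compressed variant:

```
URI: test.js.en
Content-Type: application/x-javascript; qs=0.8

URI: test.js.en.gz
Content-Type: application/x-javascript; qs=1.0
Content-Encoding: x-gzip
```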

jdMorgan

6:12 pm on Mar 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Very nice post!

Having only read it twice, and with little time to think about it deeply, I'd just like to offer one very general idea. You might find some benefit from using mod_rewrite's RewriteCond %{REQUEST_FILENAME} -f and RewriteRules with the [T=] (type) flag to solve some of your file-exists and MIME-type variations. You could in fact replace content-negotiation with a series of 'file-exists' checks, with rewrites to the preferred files/filetypes based on the results. This could be 'centrally-controlled' with one or more RewriteMaps.
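Roughly something like the following (untested, so take the directive details with a grain of salt) to serve a pre-compressed variant when it exists:

```
RewriteEngine On

# If the client accepts gzip and a pre-compressed variant exists,
# serve it directly with the correct MIME type, and keep mod_deflate
# from compressing it a second time (E=no-gzip:1).
RewriteCond %{HTTP:Accept-Encoding} gzip
RewriteCond %{REQUEST_FILENAME}.gz -f
RewriteRule ^(.+\.js)$ $1.gz [T=application/x-javascript,E=no-gzip:1,L]

# Mark the rewritten responses as gzip-encoded.
<FilesMatch "\.js\.gz$">
    Header set Content-Encoding gzip
    Header append Vary Accept-Encoding
</FilesMatch>
```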

However, I have no idea what the performance implications would be.

Jim

lammert

6:22 pm on Mar 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I hadn't thought of using mod_rewrite for setting MIME types. I will take a dive in the manual and see if I can do some benchmark testing.

coopster

11:06 pm on Mar 23, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I agree, well thought out and documented post here.


... the mod_deflate module available in Apache 2.0 doesn't generate the optimal compressed file. The results are worse than can be achieved with native gzip compression.

How did you measure the compression results, lammert? Did you use DeflateFilterNote [httpd.apache.org] to place the compression ratio in a note for logging? The compression algorithm (deflate) used by gzip (also zip and zlib) is likely the same [gzip.org], which is why I wonder which method you used to measure the compression percentage differences. There is a note in the zlib FAQ, though, that I found interesting and thought this would be a good place to share:


Ok, so why are there two different formats? [zlib.net]

The gzip format was designed to retain the directory information about a single file, such as the name and last modification date. The zlib format on the other hand was designed for in-memory and communication channel applications, and has a much more compact header and trailer and uses a faster integrity check than gzip.

Apache configure [httpd.apache.org] searches automatically for an installed zlib library if your source configuration requires one (e.g., when mod_deflate is enabled). I was wondering if you had a choice, you know? Guess not. It makes sense though, once you think about what is happening, the page being compressed in memory and all. Guess the contributing developers knew what they were doing, eh? ;-)

MultiViews versus type-maps -- good question. Apache has definitely made it easy to set up Content Negotiation for a lazy programmer. I know of a large site running MultiViews with major traffic. Expensive? Well, if it is, it is certainly quite difficult to notice. I have never tested an installation with both structures set up to see which content is drawn or how Apache would determine priority. But I'm guessing as you are: it's either one way or the other, because if it finds a type-map it certainly shouldn't need to build one as a fallback. As I said though, I've never tested the theory. As far as maintaining the type-map manually goes, if you are adding a new file to the directory, wouldn't you just update the type-map at that time as well?

lammert

1:40 am on Mar 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



coopster, Your assumption about using another compression library could be right. The figures of the size difference between gzip and mod_deflate were obtained with the original pre-installed Apache installation on my dedicated host some time ago. At that time I tested different deflate compression levels and was actually quite disappointed that the results for level 9 were not better than level 6.

I recently installed a new Apache version and expected the ratio to be the same, therefore I didn't rerun the test. Because of your post I just reran the tests I did previously. Now both gzip and mod_deflate obviously use the same routines. I used the transmitted size in the access_log as the compressed size for mod_deflate.

Original: 23944 bytes
level 1 - mod_deflate: 7133 bytes
level 1 - gzip: 7141 bytes
level 9 - mod_deflate: 5866 bytes
level 9 - gzip: 5858 bytes

The remaining difference in length could well be the extra file information (name, timestamp) stored in the gzip header, as you suggested.
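A quick Python sketch (sample data made up) shows this overhead directly: the gzip spec stores an optional NUL-terminated file name in the header, so embedding a name like test.js grows the output by len(name) + 1 bytes over an anonymous gzip stream:

```python
import gzip
import io

data = b"var x = 0;\n" * 1000  # hypothetical sample payload

# gzip.compress stores no file name in the header.
plain = gzip.compress(data, 9, mtime=0)

# A GzipFile can embed one, as the gzip command-line tool does.
buf = io.BytesIO()
with gzip.GzipFile(filename="test.js", mode="wb",
                   compresslevel=9, fileobj=buf, mtime=0) as f:
    f.write(data)
named = buf.getvalue()

print(len(plain), len(named), len(named) - len(plain))
```

That matches the byte-level differences in the table above: the deflate stream itself is identical, only the container metadata differs.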

So this cleared one problem: the absolute difference between gzip and mod_deflate compression. I am currently running some benchmarks with different setups to see how compression and MultiViews influence performance. As soon as something reasonable comes out I will report it here.

coopster

4:18 pm on Mar 24, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Hey, that's great news. And thanks for posting your findings -- that is some very valuable information. It makes sense where you are headed here, and I for one am quite interested in your findings. Thanks for keeping us posted.