
Caching Algorithm of CERN httpd

This section describes the caching algorithm and cache design used in CERN httpd 3.0. Note that some of the features appear only in the final version 3.0, not in all the prereleases.


Which Documents Are Subject to Being Cached?

The following conditions must be met before a document can be cached:

Caching Only Certain Documents/Sites

For cases where only a few sites or document trees should be cached, httpd provides a way to specify a set of URL patterns; only documents whose URLs match one of these patterns are cached.
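For illustration, a configuration caching only the CERN and W3C document trees might use something like the following (the exact directive name and pattern syntax should be checked against the configuration reference; the host names are only examples):

    CacheOnly  http://info.cern.ch/*
    CacheOnly  http://www.w3.org/*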


Preventing Caching of Certain URLs

In the opposite case, when it is desirable to cache most of the Web except for a few URLs, httpd provides another directive for specifying a set of URL patterns that are never cached.
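For example, to cache everything except script output and documents served from the local host, a configuration might contain something like the following (again, the directive name and patterns are illustrative; see the configuration reference for the exact syntax):

    NoCaching  http://*/cgi-bin/*
    NoCaching  http://localhost/*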


Expires HTTP Response Header

When an Expires HTTP header is given by the remote server it is always honored; the cached file is considered up-to-date until that time is reached.

In particular, if the document expires immediately or within a very short time (a couple of minutes), it is never even written to a cache file. This saves resources, because the same file is very unlikely to be requested again within a minute. In fact, what appears to be a lifetime of one minute may well be nothing more than an inaccurate machine clock that is off by one minute, in which case the document should really be treated as already expired.

Documents with an invalid Expires header line are never cached.
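The following C sketch is not the actual httpd code, only an illustration of the rules above: an unparsable Expires value means the document is never cached, and a lifetime of at most a couple of minutes means the body is not worth writing to disk at all (parse_http_date is a hypothetical helper):

    #include <time.h>

    #define MIN_LIFETIME (2 * 60)             /* "a couple of minutes", in seconds */

    time_t parse_http_date(const char *date); /* hypothetical HTTP date parser */

    /* Illustration only: decide whether a response with the given Expires
     * header value (or NULL when the header is absent) should be written
     * to a cache file at all. */
    static int worth_caching(const char *expires, time_t now)
    {
        time_t expiry;

        if (expires == NULL)
            return 1;                         /* no Expires: later heuristics apply */
        expiry = parse_http_date(expires);
        if (expiry == (time_t) -1)
            return 0;                         /* invalid Expires: never cached */
        return difftime(expiry, now) > MIN_LIFETIME;
    }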

Use of the Expires header field is the only correct way to determine whether a document should be cached; strictly speaking, documents without this field should never be cached. However, since in practice the Expires header is only very rarely given by current HTTP servers, it is necessary to use approximate algorithms to calculate some kind of expiry date for documents that do not otherwise have one.

Furthermore, the Expires field is part of only the HTTP protocol, not of the other WWW protocols.


Last-Modified HTTP Response Header

Most current HTTP servers give the time of last modification for files that they return from their local filesystem. This can be used to approximate the expiry time: files that have changed recently will probably soon change again, whereas files that have remained the same for a long time are unlikely to suddenly change overnight.

CERN httpd handles this via a last-modified factor, or LM factor for short. This factor specifies the fraction of the time since last modification for which the file is approximated to remain valid.

For example, with an LM factor of 0.1, a file that was changed ten hours ago will be considered up-to-date for one hour, and a file that was modified a month ago will expire after three days.
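A small C sketch of this heuristic, with the numbers above repeated as comments (this illustrates the rule, it is not the actual httpd source):

    #include <time.h>

    /* Without an explicit Expires header, a document last modified "age"
     * seconds ago is assumed to stay valid for lm_factor * age more seconds. */
    static time_t approximate_expiry(time_t now, time_t last_modified,
                                     double lm_factor)
    {
        double age = difftime(now, last_modified);

        if (age < 0)
            age = 0;                /* clock skew: treat as just modified */
        return now + (time_t) (lm_factor * age);
    }

    /* With lm_factor 0.1:
     *   modified 10 hours ago  ->  expires 1 hour from now
     *   modified 30 days ago   ->  expires 3 days from now   */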

The LM factor can be specified separately for URLs matching different URL patterns.


Default Expiry Time

When neither an expiry date nor a last-modification time is given by the remote server, a general default value is used; again, this default value is configurable by URL pattern.

However, since documents without a Last-Modified field are very often produced on the fly by CGI scripts, it is safest to keep this value at zero, or very small. After all, most script responses should never be cached but rather regenerated by the script on every request, because the content usually changes.
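A configuration along these lines might look as follows (directive name, patterns and time syntax are illustrative and should be checked against the configuration reference): non-HTTP documents get a generous default, while HTTP documents without any expiry information are not cached at all:

    CacheDefaultExpiry  ftp:*     1 day
    CacheDefaultExpiry  gopher:*  2 days
    CacheDefaultExpiry  http:*    0 mins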

If a CGI script produces output that is valid for a certain time it should express this by returning an Expires header field. As an example, consider the result of a lookup in a database that is updated every night at 2:30; clearly the same query will return the same results at least until the next 2:30, so that time should be given as the expiry time.
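The following CGI program, written in C, is only a sketch of this idea (the database query itself is omitted); it computes the next 2:30 local time and sends it as the Expires header of the response:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t now = time(NULL);
        struct tm next = *localtime(&now);
        time_t expires;
        char date[64];

        /* Next 2:30 local time: today if it is still ahead, otherwise
           tomorrow (mktime() normalizes the overflowed day of month). */
        if (next.tm_hour > 2 || (next.tm_hour == 2 && next.tm_min >= 30))
            next.tm_mday++;
        next.tm_hour = 2;
        next.tm_min  = 30;
        next.tm_sec  = 0;
        next.tm_isdst = -1;
        expires = mktime(&next);

        strftime(date, sizeof(date), "%a, %d %b %Y %H:%M:%S GMT",
                 gmtime(&expires));

        printf("Content-Type: text/html\r\n");
        printf("Expires: %s\r\n\r\n", date);
        printf("<p>...results of the database lookup...</p>\n");
        return 0;
    }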


Guaranteed Cache Refresh Interval

Sometimes it is vital always to have up-to-date information from a certain site, regardless of the expiry times specified by the remote server or calculated by the proxy. In CERN httpd it is possible to configure a cache refresh interval for URLs matching a given pattern. This causes the proxy server to check that the file is still up-to-date whenever more than the maximum allowed time has passed since the last check, even if the file would otherwise still appear up-to-date according to its expiry date.

As a special case, specifying a refresh interval of zero causes a check to be made from the remote server on every cache access. This is ideal for users who always need the absolutely most up-to-date version, but who still want faster response times and savings in network costs. It is still cheaper than not caching at all, because all the checks are performed with a conditional GET request, which transfers the document only if it has changed and otherwise tells the proxy to use its cached copy.
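As an illustration (the directive name, pattern and time syntax should be checked against the configuration reference; the host name is invented), a site whose documents must always be re-checked, plus a one-week upper bound for everything else, might be configured along these lines:

    CacheRefreshInterval  http://very.important.site/*  0
    CacheRefreshInterval  *                              1 week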


Conditional GET Request

Every time httpd already has a cached version of a document, albeit an expired one, it issues a conditional GET request, which causes the document to be sent to the proxy only if it has changed since it was last written to the cache. If the document has changed it gets an expiry date in the normal fashion when it is written to the cache.

If the document hasn't changed, httpd recalculates a new expiry date for it, using the old last-modification date with the LM factor approximation; this is done, of course, only if no expiry date was explicitly given in the "Not Modified" response.
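An example of such an exchange between the proxy and the remote server (the URL and dates are only illustrative) would be:

    GET /hypertext/WWW/TheProject.html HTTP/1.0
    If-Modified-Since: Tue, 15 Nov 1994 08:12:31 GMT

    HTTP/1.0 304 Not Modified
    Date: Thu, 01 Dec 1994 16:00:00 GMT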


CERN httpd's Physical Cache Design

Currently, the cache file name is constructed from the URL in a very straightforward way: the access protocol (http, ftp, gopher) is used as the first path component, the hostname (with optional port part) as the second component, and the rest of the URL is taken directly as the rest of the pathname.
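For example, a document fetched from

    http://info.cern.ch/hypertext/WWW/TheProject.html

would end up in the cache as

    <cache-root>/http/info.cern.ch/hypertext/WWW/TheProject.html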

As CERN httpd was probably the first WWW proxy ever to provide caching, we picked this particular design to make it easy to debug the caching system, and occasionally to go and fix the cache by hand. It will eventually be replaced by something more efficient. The current design clearly has some flaws; for example, DNS aliases for host names cause the same documents to be cached multiple times. This could be solved by using the IP address string instead of the host name.


Using the Maximum Cache Capacity

CERN httpd is given a certain amount of disk space for its cache. When the specified limit is reached, httpd performs garbage collection, removing cache files that haven't been accessed lately or that have expired.

If disk space is a critical factor, that is, if it is desirable that httpd keep its cache at a minimum, it always removes all expired files during garbage collection. This is the default behaviour.

However, this is wasteful. Often files that have expired have not in fact changed, and a simple conditional GET request could verify this and make them up-to-date again (the "Not Modified" HTTP response contains a new expiry date, or one can be calculated from the data in the cache database).

CERN httpd has a mode in which it lets its cache fill up and removes only files that haven't been used in a long time (as usual, this time can be configured according to URL patterns). Expired files are kept until it is absolutely necessary to sacrifice some of them to make room for new ones. If a sufficient amount of disk space is available this situation is never reached, and httpd gets the optimal benefit from conditional GET requests.
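In configuration terms the relevant knobs are roughly the following (directive names and syntax are quoted loosely and should be checked against the configuration reference): the total cache size, whether garbage collection is enabled, and how long unused files may be kept per URL pattern:

    CacheSize    200 M
    Gc           On
    CacheUnused  *  2 weeks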


Standalone Cache Mode

It is possible to configure httpd never to connect to remote hosts, that is, to run in cache-only mode. This is useful when there is no network connection, e.g. on a portable machine that is not connected to the network for the moment.
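A minimal standalone setup might look like this (the directive names are given from memory and the cache directory is only an example; check the configuration reference):

    Caching         On
    CacheRoot       /wwwcache
    CacheNoConnect  On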


Ari Luotonen - Kevin Altis