The caching features described here are those of CERN httpd 3.0. Note that some of the features appear only in the final version 3.0, not in all the prereleases.
A document is never cached if an Authorization header field is in the request.
The rest of this section describes httpd's current cache design.
httpd provides a way to specify a set of URL patterns that should be exclusively cached. Another directive specifies a set of URL patterns that should never be cached.
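As a rough illustration of how such pattern lists might be applied when deciding whether a document is cacheable, here is a minimal Python sketch; the pattern syntax and the list names are illustrative assumptions, not httpd's actual directives or implementation.

    from fnmatch import fnmatch

    # Assumed example configuration: only URLs matching this list are cached ...
    CACHE_ONLY_PATTERNS = ["http://*.example.org/*"]
    # ... and these are never cached, even if they match the list above.
    NO_CACHE_PATTERNS = ["http://*/cgi-bin/*"]

    def is_cacheable(url):
        """Return True if the URL may be written to the cache."""
        if any(fnmatch(url, p) for p in NO_CACHE_PATTERNS):
            return False
        if CACHE_ONLY_PATTERNS:
            # When an exclusive list is given, only matching URLs are cached.
            return any(fnmatch(url, p) for p in CACHE_ONLY_PATTERNS)
        return True

    print(is_cacheable("http://www.example.org/docs/guide.html"))   # True
    print(is_cacheable("http://www.example.org/cgi-bin/query"))     # False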
If an Expires HTTP header is specified by the remote server, it is always used; the cache file is considered up-to-date until that time is reached. In particular, if the document expires immediately or within a very short time (a couple of minutes), it is never even written to the cache file; the same file is very unlikely to be requested again within a minute, so caching it would only waste resources. In fact, what appears to be a lifetime of one minute may well be an inaccurate machine clock that is off by one minute, and expiry should really be immediate.
Documents with an invalid Expires header line are never cached.
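A minimal sketch of this write-to-cache decision, assuming a two-minute threshold and standard-library date parsing (the names and the threshold are illustrative, not taken from httpd):

    from datetime import datetime, timedelta, timezone
    from email.utils import parsedate_to_datetime

    MIN_LIFETIME = timedelta(minutes=2)

    def should_write_to_cache(expires_header):
        """Decide whether a response is worth writing to a cache file."""
        if expires_header is None:
            return True                   # no Expires: expiry is approximated later
        try:
            expires = parsedate_to_datetime(expires_header)
        except (TypeError, ValueError):
            return False                  # invalid Expires header: never cached
        if expires.tzinfo is None:        # treat a date without a zone as UTC
            expires = expires.replace(tzinfo=timezone.utc)
        return expires - datetime.now(timezone.utc) > MIN_LIFETIME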
Use of the Expires header field is the only correct way to determine whether a document should be cached; strictly speaking, documents without this field should never be cached. However, since in practice the Expires header is extremely rarely given by current HTTP servers, it is necessary to use approximate algorithms to calculate some kind of expiry date for documents that don't otherwise have one.
Furthermore, the Expires field is part of the HTTP protocol only, not of the other WWW protocols.
CERN httpd handles this via a last-modified factor, or LM factor for short. This factor specifies the fraction of the time since last modification for which the file is approximated to remain valid.
For example, with an LM factor of 0.1, a file that was changed ten hours ago will be considered up-to-date for one hour, and a file that was modified a month ago will expire after three days.
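The approximation itself is simple; here is a minimal Python sketch, where the function name and the 0.1 default are chosen purely for illustration:

    from datetime import datetime, timedelta, timezone

    def approximate_expiry(last_modified, lm_factor=0.1, now=None):
        """Expiry = now + lm_factor * (time elapsed since last modification)."""
        now = now or datetime.now(timezone.utc)
        return now + (now - last_modified) * lm_factor

    # A file changed ten hours ago stays fresh for one more hour with factor 0.1:
    changed = datetime.now(timezone.utc) - timedelta(hours=10)
    print(approximate_expiry(changed))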
The LM factor can be specified differently for URLs matching different URL patterns.
However, since documents without a last-modified field are very often produced on the fly by CGI scripts, it is safest to keep this value at zero, or very small. After all, most script responses should never be cached but rather regenerated by the script every time, because the content usually changes.
If a CGI script produces output that is valid for a certain time, it should express this by returning an Expires header field. As an example, consider the result of a database lookup where the database is updated every night at 2:30; clearly the same query will return the same results at least until 2:30, so that time should be specified as the expiry time.
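A hypothetical CGI script for such a lookup might emit the header as in the following Python sketch; the nightly 2:30 update is taken to be in UTC purely to keep the example short:

    #!/usr/bin/env python3
    from datetime import datetime, timedelta, timezone
    from email.utils import format_datetime

    now = datetime.now(timezone.utc)
    next_update = now.replace(hour=2, minute=30, second=0, microsecond=0)
    if next_update <= now:
        next_update += timedelta(days=1)          # 2:30 has already passed today

    print("Content-Type: text/html")
    print("Expires: " + format_datetime(next_update, usegmt=True))
    print()                                       # blank line ends the CGI header
    print("<p>Database lookup result goes here.</p>")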
With httpd it is possible to configure a cache refresh interval for URLs matching a given pattern. This causes the proxy server to check that the file is still up-to-date if more than the maximum allowed time has passed since the last check, even if the file would still seem to be up-to-date according to its expiry date. As a special case, specifying a refresh interval of zero causes a check to be made against the remote server on every cache access. This is ideal for users who always need the absolutely most up-to-date version but still want faster response times and savings in network costs. It remains cheap because all the checks are performed with a conditional GET request, which sends the document only if it has changed, and otherwise tells the proxy to use the cache.
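The refresh decision can be sketched as follows (illustrative Python, not httpd's code); note how a zero interval forces a check on every access:

    from datetime import datetime, timezone

    def needs_refresh(expires, last_checked, refresh_interval):
        """True if the cached file must be re-checked with the remote server."""
        now = datetime.now(timezone.utc)
        if now >= expires:
            return True                   # already expired: revalidate anyway
        # With a zero refresh_interval this is true on every access.
        return now - last_checked >= refresh_interval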
When httpd already has a cached version of a document, albeit an expired one, it issues a conditional GET request, which causes the document to be sent to the proxy only if it has changed since it was last updated in the cache. If the document has changed, it gets an expiry date in the normal fashion when it is written to the cache.
If the document hasn't changed, httpd will recalculate a new expiry date for it, using the old last-modification date with the LM factor approximation; that is, of course, only if no expiry date was explicitly given in the "Not Modified" response.
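A minimal sketch of this revalidation step, using only the Python standard library; the function names and the 0.1 LM factor are illustrative assumptions, not httpd's internals:

    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime
    from http.client import HTTPConnection

    def lm_expiry(last_modified, lm_factor=0.1):
        now = datetime.now(timezone.utc)
        return now + (now - last_modified) * lm_factor

    def revalidate(host, path, last_modified_http):
        """Conditional GET: returns ('unchanged', new_expiry) or ('changed', body)."""
        conn = HTTPConnection(host)
        conn.request("GET", path, headers={"If-Modified-Since": last_modified_http})
        resp = conn.getresponse()
        if resp.status == 304:
            expires = resp.getheader("Expires")
            if expires:                   # explicit expiry given in the 304 reply
                return "unchanged", parsedate_to_datetime(expires)
            # otherwise fall back to the LM-factor approximation
            return "unchanged", lm_expiry(parsedate_to_datetime(last_modified_http))
        return "changed", resp.read()     # document changed: cache the new body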
In the cache directory the access protocol (http, ftp, gopher) is used as the first path component, the hostname (with an optional port part) as the second component, and the rest of the URL is taken directly as the rest of the pathname.
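A sketch of that mapping in Python (illustrative; the fallback name used for an empty path is an assumption):

    import os
    from urllib.parse import urlsplit

    def cache_path(cache_root, url):
        parts = urlsplit(url)                       # scheme, host[:port], path
        rest = parts.path.lstrip("/") or "index"    # assumed name for empty paths
        return os.path.join(cache_root, parts.scheme, parts.netloc, rest)

    print(cache_path("/var/cache/httpd",
                     "http://info.cern.ch:80/hypertext/WWW/TheProject.html"))
    # -> /var/cache/httpd/http/info.cern.ch:80/hypertext/WWW/TheProject.html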
As CERN httpd was probably the first WWW proxy ever to provide caching, we picked this particular design to make it easy to debug the caching system, and sometimes to go and fix the cache by hand. This will eventually be replaced by something more efficient. The current design clearly has some flaws; for example, DNS aliases for host names cause multiple caching of the same documents. This could be solved by using the IP number string instead of the host name.
httpd is given a certain amount of disk space for its cache. If the specified limit is reached, httpd performs garbage collection, removing cache files that haven't been accessed lately or that have expired.
If disk space is a critical factor, that is, if it is desirable that httpd keeps its cache to a minimum, it always removes all the expired files during garbage collection. This is the default behaviour.
However, this is wasteful: files that have expired often have not in fact changed, and a simple conditional GET request could be used to verify this and make them up-to-date again (the "Not Modified" HTTP response contains a new expiry date, or one can be calculated from the data in the cache database).
CERN httpd also has a mode in which it lets its cache fill up and removes only files that haven't been used in a long time (as usual, this time can be configured according to URL patterns). All expired files are kept until it is absolutely necessary to sacrifice some of them to make room for new ones. If a sufficient amount of disk space is available this situation is never reached, and httpd gets optimal performance from conditional GET requests.
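Both garbage-collection policies can be sketched with a single selection routine (illustrative Python; the entry fields and parameter names are assumptions, not httpd's data structures):

    import time

    def collect_garbage(entries, bytes_needed, keep_expired=True):
        """Pick entries to remove; each entry has 'size', 'expires', 'last_used'."""
        now = time.time()
        removed, freed, candidates = [], 0, list(entries)
        if not keep_expired:
            # Space-conservative default: drop every expired file right away.
            for e in [e for e in candidates if e["expires"] <= now]:
                removed.append(e); freed += e["size"]; candidates.remove(e)
        # Then sacrifice the least recently used files until enough space is freed.
        for e in sorted(candidates, key=lambda e: e["last_used"]):
            if freed >= bytes_needed:
                break
            removed.append(e); freed += e["size"]
        return removed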
It is also possible to configure httpd never to connect to remote hosts, that is, to run in cache-only mode. This is useful when there is no network connection, e.g. on a portable machine that is not connected to the network for the moment.