HTTP Proxy Cache

These are just a few notes from a day's tinkering with Proxy Cache in Apache 1.1.1 and Squid 1.0.5.

What is Proxy Cache?
There are three reasons for using a proxy server:

Because you are behind a firewall (for security) and you have to.
Because using a cache speeds up Web browsing significantly, for you and everyone else.
Because you don't have enough 'real' IP addresses for your machines.

If you are behind a firewall you are probably using one already. If not, you might consider installing one.

Netscape Navigator and other newer browsers have cacheing built in. On a single-user system, such as a PC on a phone line, this may be adequate. You can tune the cache parameters to make the cache larger, or to check entries more often. Incidentally, in Netscape, Reload does not always get a fresh copy of a document; it sends GET If-Modified-Since with Pragma: no-cache. Shift-Reload (holding down Shift while clicking Reload) will force all frames to be reloaded from source by sending Pragma: no-cache. For information on your disk cache in Netscape, type about:cache; for the RAM and image caches, type about:memory-cache or about:image-cache. For information about a document, see about:document.

Even on a single system, if you use more than one browser or have more than one user, a proxy cache may help since cached documents can be shared among all agents.

LAN systems

The real benefits accrue from using a proxy cache on a LAN with many users. Any new page accessed by anyone is stored in the cache. The next person to access that page gets the cached copy, at full LAN speed, rather than going to the source. This may be a thousand times faster, or more.
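The sharing effect can be sketched with a toy in-memory model. This is only an illustration, not how Apache or Squid store objects; the class name SharedCache and the fetch_from_origin callback are made up for the example.

```python
from typing import Callable, Dict


class SharedCache:
    """Toy model of a LAN proxy cache shared by every user (hypothetical)."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin    # slow path: goes out to the source
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> bytes:
        if url in self._store:
            self.hits += 1                 # served locally, at LAN speed
            return self._store[url]
        self.misses += 1
        body = self._fetch(url)
        self._store[url] = body            # the next user gets this copy
        return body
```

The first request for a URL is a miss and goes to the source; every later request for it, from any user of the same cache, is a hit.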

Systems for Windows:

(List from John R Buchan)

Configuring a browser for proxy

Some browsers may accept only one value, for instance (on Unix) by using setenv http_proxy http://somewhere.org:80/. Certain domains may be excluded, typically one's own domain, by using setenv no_proxy some.org,some.other.org. Other browsers, such as Netscape, have a more sophisticated scheme for supporting multiple proxies. Netscape has a scheme for automated proxy handling using JavaScript. Mosaic-2.7 also allows a list of proxies.
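As a rough sketch of how a client might interpret these two variables, the following picks a proxy for a URL from an environment mapping. The function name proxy_for is hypothetical, and the suffix matching on no_proxy is an assumption; real browsers differ in the details.

```python
from typing import Mapping, Optional
from urllib.parse import urlsplit


def proxy_for(url: str, environ: Mapping[str, str]) -> Optional[str]:
    """Return the proxy URL to use for `url`, or None to connect directly."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # no_proxy is a comma-separated list of domains to reach without a proxy.
    for domain in environ.get("no_proxy", "").split(","):
        domain = domain.strip().lower().lstrip(".")
        if domain and (host == domain or host.endswith("." + domain)):
            return None
    # Look up e.g. http_proxy for an http:// URL; None if it is not set.
    return environ.get(parts.scheme + "_proxy")
```

With http_proxy and no_proxy set as in the setenv examples, a request for www.some.org goes direct while any other http URL goes through somewhere.org.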

Bypassing cache

If an HTTP request has the Pragma: no-cache header set, then the cache is directed to get a new copy. It may, however, save the new copy itself. Using Reload on Netscape, Mosaic and Lynx (possibly all browsers) sends a request with this header.
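To try this from a script rather than a browser, a request carrying the header can be built with Python's standard urllib. This is a minimal sketch; build_reload_request is a made-up helper name, and nothing is sent until the request is passed to urllib.request.urlopen.

```python
import urllib.request
from typing import Optional


def build_reload_request(
    url: str, if_modified_since: Optional[str] = None
) -> urllib.request.Request:
    """Build a GET request that asks any cache on the path for a fresh copy."""
    request = urllib.request.Request(url)
    request.add_header("Pragma", "no-cache")
    if if_modified_since is not None:
        # Optional conditional GET, as a browser Reload might send.
        request.add_header("If-Modified-Since", if_modified_since)
    return request
```

Calling urllib.request.urlopen(build_reload_request(url)) would then fetch the document; urllib picks up http_proxy from the environment, so the request passes through the proxy if one is configured.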

Cacheable and uncacheable documents

Regular HTML files are usually cacheable. Cacheing agents may require a valid Last-Modified header, and may not cache objects that exceed a certain size or fall under other restrictions. HTML documents generated by CGI scripts can be made cacheable or not by generating an Expires header, though some agents may not cache URLs with "cgi-bin" or a query string. Documents requiring authorisation should not normally be cached. Netscape has an option to cache documents obtained from an SSL (Secure) server locally. If you turn this on, someone who gains access to your computer (perhaps by stealing it) can read all your recent secure transactions.
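The restrictions listed above could be combined into a single check along the following lines. This is a sketch of one possible policy, not the logic of any particular server; the function name, the switches and the size limit are all assumptions chosen for illustration.

```python
from typing import Mapping

MAX_OBJECT_BYTES = 4 * 1024 * 1024  # hypothetical size limit, not a real default


def looks_cacheable(
    url: str,
    response_headers: Mapping[str, str],
    body_size: int,
    request_was_authorised: bool = False,
    skip_dynamic_urls: bool = True,
    require_last_modified: bool = True,
) -> bool:
    """Apply the kinds of restrictions a cacheing agent may impose."""
    header_names = {name.lower() for name in response_headers}
    if request_was_authorised:
        return False                      # authorised documents stay private
    if skip_dynamic_urls and ("cgi-bin" in url or "?" in url):
        return False                      # some agents never cache these URLs
    if require_last_modified and "last-modified" not in header_names:
        return False
    if body_size > MAX_OBJECT_BYTES:
        return False
    return True
```

Switching skip_dynamic_urls or require_last_modified off models a more permissive agent, which is one way two caches can disagree about the same document.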

Note that different cache servers may interpret the HTTP specification in slightly different ways, so that a document cached by one may not be cached by another.

Cache Control and CGI

I obtained the following results with Apache 1.1.1 and Squid 1.0.5, using the cache test script:

Expires	Last-Modified	Apache	Squid
Tonight	Last Night	Cached	Cached
+1 minute	Last Night	Expires	Expires
Tonight	none	Not Cached	Cached
none	none	Not Cached	Cached
0	Last Night	Not Cached	Not Cached
Last Night	Last Night	Not Cached	Not Cached
Tonight	Tonight	Cached	Cached
Tonight	0	Not Cached	Cached

The default configuration of the Squid cache is not to cache URLs with "cgi-bin" or "?"; this has been commented out to obtain these results.

RFC1945 (the HTTP/1.0 specification) says that if the Expires date is equal to or earlier than the value of the Date header, the recipient must not cache the document. A value of zero (0) or an invalid date format should be treated as "expires immediately."
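The rule can be written down as a small check. This is a sketch using Python's standard email.utils date parser; the function names are made up, and treating an unparsable Date header as "do not cache" is my own conservative assumption rather than something the RFC states.

```python
from email.utils import mktime_tz, parsedate_tz
from typing import Optional


def http_date_to_epoch(value: Optional[str]) -> Optional[int]:
    """Parse an HTTP date into seconds since the epoch; None if missing or invalid."""
    if not value:
        return None
    parsed = parsedate_tz(value)
    if parsed is None:
        return None
    try:
        return mktime_tz(parsed)
    except (OverflowError, ValueError):
        return None


def expires_forbids_caching(expires: Optional[str], date: str) -> bool:
    """True if the Expires header says the document must not be cached."""
    if expires is None:
        return False          # no Expires header: this rule does not apply
    expires_at = http_date_to_epoch(expires)
    if expires_at is None:
        return True           # "0" or a malformed date: expires immediately
    date_at = http_date_to_epoch(date)
    if date_at is None:
        return True           # assumption: be conservative if Date is unusable
    return expires_at <= date_at
```

An Expires value of "0", or one equal to the Date header, gives True; an Expires value later than Date gives False.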

Suggested use in CGI scripts: If the output of the CGI script is really a static document, and is the same for the same query string (e.g. /cgi-bin/search?query=food), generate an appropriate Last-Modified date in RFC1123 format. If the output is considered valid for a particular length of time, generate the appropriate Expires header. If the output is immediately invalid, or depends on other data, generate an Expires header equal to the current time, or an illegal (0) value.
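A CGI script could produce these headers roughly as follows. This is a sketch in Python rather than Perl; cache_headers is a hypothetical helper, and it relies on email.utils.formatdate with usegmt=True to produce the RFC1123 date format.

```python
import time
from email.utils import formatdate
from typing import List, Optional


def cache_headers(
    last_modified: Optional[float] = None,
    lifetime: Optional[int] = None,
    now: Optional[float] = None,
) -> List[str]:
    """Return Last-Modified and/or Expires header lines for a CGI response.

    last_modified and now are seconds since the epoch; lifetime is in seconds.
    """
    now = time.time() if now is None else now
    lines = []
    if last_modified is not None:
        lines.append("Last-Modified: " + formatdate(last_modified, usegmt=True))
    if lifetime is not None:
        # A lifetime of 0 gives Expires equal to the current time,
        # which marks the output as immediately invalid.
        lines.append("Expires: " + formatdate(now + lifetime, usegmt=True))
    return lines
```

For output that stays valid for ten minutes, the script would print the lines from cache_headers(lifetime=600) before its Content-Type header and the blank line that ends the header block.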

In order to make better use of the global bandwidth, it is probably a good idea to make as many things cacheable as possible. This means, for instance, that if you have a Webcam showing the view from an office window, essentially looking at the weather, you might generate an Expires header 10 minutes or more in the future.
See this script (log-tail.pl) for an example.

Server-Side includes

Apache, NCSA, and some other servers allow server-side includes in HTML (.shtml) files. Since the contents of the document are composed of several included files, the server does not normally set a Last-Modified date or Content-Length. Accordingly, such documents are uncacheable (and will therefore load much more slowly for someone around the world who is using a cache). Apache supports an option known as XBitHack which allows a Last-Modified date to be sent. If you use this, you must touch the .shtml wrapper file any time the included files are changed, else people using a cache will not see your new document unless they explicitly Reload.
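The touching can be automated. The sketch below updates the wrapper's modification time whenever any included file is newer; the function name is made up, and you have to supply the list of included files yourself since this does not parse the .shtml file.

```python
import os
from typing import Iterable


def touch_if_includes_newer(wrapper: str, includes: Iterable[str]) -> bool:
    """Touch `wrapper` if any file in `includes` was modified after it.

    Returns True if the wrapper was touched.
    """
    wrapper_mtime = os.path.getmtime(wrapper)
    newest_include = max(
        (os.path.getmtime(path) for path in includes), default=wrapper_mtime
    )
    if newest_include > wrapper_mtime:
        os.utime(wrapper, None)   # set access and modification times to now
        return True
    return False
```

Run from cron or from whatever script updates the included files, this keeps the Last-Modified date sent for the wrapper in step with its contents.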

Content-Negotiation

If content negotiation is being used, to serve different languages or image types, then there is only one URL with possibly different contents. Accordingly, the later Apache servers do not set Last-Modified for content-negotiated documents. This problem is addressed in the draft HTTP 1.1 specification using the Vary header.
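One way to picture what Vary makes possible: a cache can store several variants under one URL by including the named request headers in its lookup key. This is only a sketch of the idea, based on my reading of the header; cache_key is a hypothetical function, and special cases such as "Vary: *" are ignored.

```python
from typing import Mapping, Tuple


def cache_key(
    url: str, request_headers: Mapping[str, str], vary: str = ""
) -> Tuple[str, ...]:
    """Build a lookup key from the URL plus the request headers named in Vary."""
    lowered = {name.lower(): value.strip() for name, value in request_headers.items()}
    varying = sorted({n.strip().lower() for n in vary.split(",") if n.strip()})
    key = [url]
    for name in varying:
        # A header the client did not send is recorded as an empty value.
        key.append(name + "=" + lowered.get(name, ""))
    return tuple(key)
```

With a Vary value of Accept-Language, requests for the same URL with "en" and with "fr" get different keys, so the English copy is never served to the French reader.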

Expiring documents

Proxy caches look at the Expires header and use it to set an expiry date in the cache. If one does not exist, a default lifetime is assumed. At this time I am unaware of proxies examining HTML content for META tags; however CERN and new Apache servers may use a metadata file scheme to generate extra fields such as Expires in the document head. CGI scripts may generate appropriate fields explicitly using e.g. the LWP Perl library. An invalid Expires header, such as a value of "0", makes the document uncacheable.
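The expiry calculation might look like the sketch below. The default lifetime of one day is an arbitrary figure chosen for the example, not the default of Apache or Squid, and cache_expiry is a made-up name.

```python
from email.utils import mktime_tz, parsedate_tz
from typing import Optional

DEFAULT_LIFETIME = 24 * 3600  # hypothetical default lifetime, in seconds


def cache_expiry(
    expires: Optional[str],
    fetched_at: float,
    default_lifetime: int = DEFAULT_LIFETIME,
) -> Optional[float]:
    """Return the time (epoch seconds) at which a cached copy expires.

    Returns None if the document should not be cached at all.
    """
    if expires is None:
        return fetched_at + default_lifetime   # no header: assume a default
    parsed = parsedate_tz(expires)
    if parsed is None:
        return None                            # invalid value such as "0"
    try:
        expires_at = mktime_tz(parsed)
    except (OverflowError, ValueError):
        return None
    return expires_at if expires_at > fetched_at else None
```

A document with no Expires header is kept for the default lifetime, one with a future Expires date is kept until that date, and one with "0" is not kept.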

Who is using Proxies

Although RFC1945 recommends not modifying the User-Agent field, certain proxies do and they can be counted. About 8% of hits here use one of the following proxies:

CERN-HTTPD
Harvest Cache
Squid Cache
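Counting them is a matter of scanning the User-Agent strings from the server log. In this sketch the marker substrings are my guess at what each proxy adds to the field, and count_proxy_hits is a hypothetical name; check your own logs for the exact text.

```python
from collections import Counter
from typing import Iterable

# Assumed substrings; the exact text each proxy inserts may differ.
PROXY_MARKERS = ("CERN-HTTPD", "Harvest", "Squid")


def count_proxy_hits(user_agents: Iterable[str]) -> Counter:
    """Count hits per proxy, given the User-Agent value of each hit."""
    counts: Counter = Counter()
    for agent in user_agents:
        lowered = agent.lower()
        for marker in PROXY_MARKERS:
            if marker.lower() in lowered:
                counts[marker] += 1
                break
        else:
            counts["direct or unidentified"] += 1
    return counts
```

Proxies that leave the User-Agent field alone land in the "direct or unidentified" bucket, so the proxy share found this way is a lower bound.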

All users behind firewalls must use a proxy, though cacheing is strictly a separate issue. SOCKS is a non-caching proxy scheme.

NLANR Cache Project

Check out the Distributed Cache project at NLANR!

The Future

HTTP 1.1 (in draft at ds.internic.net) has many more schemes for proxy servers. We will probably see a net of interacting proxy caches which will vastly speed up the entire Web. Anyone who has faster Net access than a 14.4 modem will benefit.

Another development is the pre-fetching proxy cache, for instance Wcol: WWW Collector. Here, the proxy actively seeks out related images and pages ahead of time. Originally written to speed up Mosaic (which would not display anything until all images had been fetched), it can co-operate with a hierarchical cache scheme to get related pages while the user reads the first one.

Testing Proxy Cache

It's not obvious how the cache is working, unless you have access to the proxy server (Apache 1.1.1 with -DEXPLAIN, for instance). You can experiment with the Netscape cache using the Cache Tester here.

Vancouver Webpages