See also an article about Proxy Cache
HTTP Proxy Cache
These are just a few notes from a day's tinkering with Proxy Cache in Apache 1.1.1 and
Squid 1.0.5.
What is Proxy Cache?
There are three reasons for using a proxy server:
- Because you are behind a firewall (for security) and you have to.
- Because using a cache speeds up Web browsing significantly, for you
and everyone else.
- Because you don't have enough 'real' IP addresses for your machines.
If you are behind a firewall you are probably using one already. If not,
you might consider installing one.
Netscape Navigator and other newer browsers have caching built in. On a single-user
system, such as a PC on a phone line, this may be adequate. You can tune the cache
parameters to make the cache larger, or to check entries more often. Incidentally,
Reload in Netscape does not always get a fresh copy of a document: it sends a GET with
If-Modified-Since and Pragma: no-cache. Shift-Reload (holding down Shift while clicking
Reload) will force all frames to be reloaded from source by sending Pragma: no-cache.
For information on your disk cache in Netscape, type about:cache; for the RAM and image
caches, type about:memory-cache or about:image-cache.
For information about a document, see about:document.
Even on a single system, if you use more than one browser or have more than one
user, a proxy cache may help since cached documents can be shared among all agents.
LAN systems
The real benefits accrue from using a proxy cache on a LAN with many users. Any new page
accessed by anyone is stored in the cache. The next person to access that page gets
the cached copy, at full LAN speed, rather than going to the source. This may be
a thousand times faster, or more.
Systems for Windows:
(List from John R Buchan)
Configuring a browser for proxy
Some browsers may accept only one value, set for instance (on Unix) with
setenv http_proxy http://somewhere.org:80/. Certain domains may be excluded,
typically one's own domain, with setenv no_proxy some.org,some.other.org.
Other browsers, such as Netscape, have a more sophisticated scheme for supporting
multiple proxies. Netscape has a
scheme for automated proxy handling using Javascript. Mosaic-2.7 also allows a list
of proxies.
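As a rough illustration of how a client reads these variables, here is a minimal Python sketch. It uses the environment helpers in Python's urllib.request module; the proxy and domain values are the placeholder ones from the setenv examples above, and uses_proxy is a name made up for this sketch. Other clients may match no_proxy entries differently.

```python
import os
import urllib.request

# Placeholder values copied from the setenv examples in the text.
os.environ["http_proxy"] = "http://somewhere.org:80/"
os.environ["no_proxy"] = "some.org,some.other.org"


def uses_proxy(host):
    """Return True if a request to `host` would be sent via the proxy."""
    # proxy_bypass_environment() consults the no_proxy variable.
    return not urllib.request.proxy_bypass_environment(host)


# Mapping of URL scheme to proxy URL, as read from the environment.
proxies = urllib.request.getproxies_environment()

print("http proxy:", proxies["http"])
print("www.example.com via proxy:", uses_proxy("www.example.com"))
print("www.some.org via proxy:", uses_proxy("www.some.org"))
```

Hosts inside an excluded domain are fetched directly; everything else goes through the proxy.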
Bypassing cache
If an HTTP request has the Pragma: no-cache header set, the cache is
directed to fetch a new copy from the source. It may, however, store that new
copy itself. Using Reload in Netscape, Mosaic and Lynx (possibly all browsers)
sends a request with this header.
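A script can ask for a fresh copy in the same way. The sketch below only builds the request and does not contact the network; build_fresh_request and the example URL are hypothetical names chosen for illustration.

```python
import urllib.request


def build_fresh_request(url):
    """Build a GET request that asks any cache on the path for a fresh copy."""
    return urllib.request.Request(url, headers={"Pragma": "no-cache"})


# Hypothetical URL, used only to show the resulting request.
request = build_fresh_request("http://www.example.com/index.html")
print(request.get_method(), request.full_url)
print("Pragma:", request.get_header("Pragma"))
```

Passing the request to urllib.request.urlopen() would send it, through the proxy if one is configured.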
Cacheable and uncacheable documents
Regular HTML files are usually cacheable. Caching agents may require a valid
Last-Modified header, and may refuse to cache objects over a certain size
or impose other restrictions.
HTML documents generated by CGI
scripts can be made cacheable or not by generating an Expires header, though
some agents may not cache URLs with "cgi-bin" or a query string.
Documents requiring
authorisation should not normally be cached. Netscape has an option
to cache documents obtained from an SSL (Secure) server locally. If
you turn this on, someone who gains access to your computer (perhaps by
stealing it) can read all your recent secure transactions.
Note that different cache servers may interpret the HTTP specification
in slightly different ways, so that a document cached by one may
not be cached by another.
Cache Control and CGI
I obtained the following results with Apache 1.1.1
and Squid 1.0.5, using the
cache test script:
Expires    | Last-Modified | Apache     | Squid      |
Tonight    | Last Night    | Cached     | Cached     |
+1 minute  | Last Night    | Expires    | Expires    |
Tonight    | none          | Not Cached | Cached     |
none       | none          | Not Cached | Cached     |
0          | Last Night    | Not Cached | Not Cached |
Last Night | Last Night    | Not Cached | Not Cached |
Tonight    | Tonight       | Cached     | Cached     |
Tonight    | 0             | Not Cached | Cached     |
The default configuration of the Squid cache is not to cache URLs with
"cgi-bin" or "?"; this has been commented out to obtain
these results.
RFC 1945 (the HTTP/1.0 specification) says that if the Expires date is equal to or
earlier than the value of the Date header, the recipient must not cache the document.
A value of zero (0) or an invalid date format should be considered equivalent to
"expires immediately."
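The rule can be written down as a small check. This is only a sketch of the comparison described above, not the logic of Apache or Squid; the function names are invented here, and treating an unparseable Date header as "do not cache" is an assumption of the sketch.

```python
from email.utils import mktime_tz, parsedate_tz


def parse_http_date(value):
    """Return a UTC timestamp for an HTTP date string, or None if invalid."""
    parsed = parsedate_tz(value or "")
    return mktime_tz(parsed) if parsed else None


def expires_permits_caching(date_header, expires_header):
    """Apply the Expires-versus-Date rule to one response."""
    date_ts = parse_http_date(date_header)
    expires_ts = parse_http_date(expires_header)
    if expires_ts is None:
        # "0" or a malformed date counts as "expires immediately".
        return False
    if date_ts is None:
        # Assumption of this sketch: no usable Date means no comparison.
        return False
    # Equal to or earlier than Date: must not be cached.
    return expires_ts > date_ts


date = "Sun, 06 Nov 1994 08:49:37 GMT"
print(expires_permits_caching(date, "Sun, 06 Nov 1994 09:49:37 GMT"))
print(expires_permits_caching(date, "0"))
```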
Suggested use in CGI scripts: If the output of the CGI script
is really a static document, and is the same for the same query string (e.g.
/cgi-bin/search?query=food), generate an appropriate
Last-Modified date in RFC 1123 format. If the output is considered valid for a particular
length of time, generate the appropriate Expires header. If the output
is immediately invalid, or depends on other data, generate an Expires header
equal to the current time, or an illegal (0) value.
In order to make better use of the global bandwidth, it is probably a good
idea to make as many things cacheable as possible. This means, for instance,
that if you have a Webcam showing the view from an office window,
essentially looking at the weather, you might generate an Expires header
10 minutes or more in the future.
See this script (log-tail.pl) for an example.
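The suggestions above can also be sketched in Python (this is not the log-tail.pl script, and cache_headers is a hypothetical helper). It formats Last-Modified and Expires as RFC 1123 dates in GMT; a lifetime of 0 yields an Expires equal to the current time, and 600 seconds matches the Webcam case.

```python
import time
from email.utils import formatdate


def cache_headers(last_modified=None, lifetime=None, now=None):
    """Build cache-related header lines for a CGI response.

    last_modified: Unix timestamp of the underlying data, or None to omit.
    lifetime: seconds the output stays valid (0 = expires immediately),
              or None to omit the Expires header.
    now: current Unix time; defaults to time.time().
    """
    now = time.time() if now is None else now
    lines = []
    if last_modified is not None:
        lines.append("Last-Modified: " + formatdate(last_modified, usegmt=True))
    if lifetime is not None:
        lines.append("Expires: " + formatdate(now + lifetime, usegmt=True))
    return lines


# Example: a Webcam image that caches may reuse for 10 minutes.
print("Content-Type: image/jpeg")
for line in cache_headers(last_modified=time.time(), lifetime=600):
    print(line)
print()  # blank line ends the CGI header block
```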
Server-Side includes
Apache, NCSA, and some other servers allow
server-side includes in HTML (.shtml) files. Since the content of the
document is composed of several included files, the server does not normally
set a Last-Modified date or Content-Length. Accordingly, such documents
are uncacheable (and will therefore load much more slowly for someone on the other
side of the world who is using a cache). Apache supports an option known as XBitHack
which allows a Last-Modified date to be sent. If you use this, you must touch the
.shtml wrapper file any time the included files are changed, or people using
a cache will not see your new document unless they explicitly Reload.
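Remembering to touch the wrapper is easy to forget, so it may be worth scripting. The following is a minimal sketch, assuming you know which files each wrapper includes; touch_if_stale and the file names in the usage note are hypothetical.

```python
import os


def touch_if_stale(wrapper_path, included_paths):
    """Touch the wrapper if any included file is newer than it.

    Returns True if the wrapper's modification time was updated.
    """
    wrapper_mtime = os.path.getmtime(wrapper_path)
    newest_include = max(
        (os.path.getmtime(path) for path in included_paths),
        default=wrapper_mtime,
    )
    if newest_include > wrapper_mtime:
        os.utime(wrapper_path, None)  # set access and modification time to now
        return True
    return False
```

For example, touch_if_stale("index.shtml", ["header.html", "footer.html"]) could be run after editing either included file.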
Content-Negotiation
If content negotiation is being used to serve different languages or image types,
then there is only one URL with possibly different contents. Accordingly, the later
Apache servers do not set Last-Modified for content-negotiated documents. This problem
is addressed in the draft HTTP 1.1 specification using the Vary header.
Expiring documents
Proxy caches look at the Expires header and use it to set an expiry date in
the cache. If one does not exist, a default lifetime is assumed. At this
time I am unaware of proxies examining HTML content for META tags; however,
CERN and newer Apache servers may use a metadata file scheme to add
extra fields such as Expires to the HTTP response header. CGI scripts may generate
appropriate fields explicitly, e.g. using the LWP Perl library. An invalid Expires
header, such as a value of "0", makes the document uncacheable.
Who is using Proxies
Although RFC1945 recommends not modifying the User-Agent field, certain proxies do
and they can be counted. About 8% of
hits here use one of the following
proxies:
- CERN-HTTPD
- Harvest Cache
- Squid Cache
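A count of this kind can be made from the User-Agent values in a server log. The sketch below assumes the proxy names listed above appear as substrings of the User-Agent string; the sample strings are made up for illustration and are not taken from the logs discussed here.

```python
from collections import Counter

# Proxy names from the list above; substring matching is an assumption.
PROXY_NAMES = ("CERN-HTTPD", "Harvest Cache", "Squid Cache")


def count_proxy_hits(user_agents):
    """Count User-Agent strings that mention one of the known proxies."""
    counts = Counter()
    for agent in user_agents:
        lowered = agent.lower()
        for name in PROXY_NAMES:
            if name.lower() in lowered:
                counts[name] += 1
                break  # count each hit once
    return counts


# Made-up sample values for illustration only.
sample_agents = [
    "Mozilla/2.0 (Win95; I) via Squid Cache version 1.0.5",
    "Mozilla/2.0 (Win95; I)",
    "Lynx/2.5 via proxy gateway CERN-HTTPD/3.0",
    "Mozilla/3.0 (X11; I) via Harvest Cache version 1.4",
]
counts = count_proxy_hits(sample_agents)
share = 100.0 * sum(counts.values()) / len(sample_agents)
print(dict(counts), "%.0f%% of hits via a known proxy" % share)
```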
All users behind firewalls must use a proxy, though caching is strictly
a separate issue. SOCKS is a non-caching proxy scheme.
NLANR Cache Project
Check out the Distributed Cache
project at NLANR!
The Future
HTTP 1.1 (in draft at ds.internic.net)
has many more schemes for proxy servers. We will probably see a network of
interacting proxy caches which will vastly speed up the entire Web.
Anyone who has faster Net access than a 14.4 modem will benefit.
Another development is the pre-fetching proxy cache, for instance
Wcol: WWW Collector.
Here, the proxy actively seeks out related images and pages ahead of time.
Originally written to speed up Mosaic (which would not display anything
until all images had been fetched), it can co-operate with a hierarchical
cache scheme to get related pages while the user reads the first one.
Testing Proxy Cache
It's not obvious how the cache is working, unless you have access to
the proxy server (Apache 1.1.1 with -DEXPLAIN, for instance). You can
experiment with the Netscape cache using the
Cache Tester here.
Vancouver Webpages