Any organisation operating a LAN with more than a dozen or so PCs should
consider using a cache server. Many software packages are available
for different platforms.
For low cost and moderate performance, consider
the Squid cache or
Apache caching server on
a PC running Linux. A system can
probably be implemented for under $1000 using a Pentium or 486 PC with
1 GB disk (no monitor required), or less if a pre-owned PC is used.
These packages will also run on
traditional Unix platforms such as Sun, SGI, Alpha, etc.
Apache will also function as a server, so that it is
possible to have a robust, moderate volume cache/server
combination for a few hundred dollars. Firewall software can
be put on the same machine, or a dedicated firewall used if
required.
For those more comfortable with a
commercial product (where you can blame someone when it fails),
or a GUI interface, consider
the Harvest Cache, or
Netscape Proxy Server. For Win95 or NT, other servers
are available such as WinProxy, WinGate,
Spaghetti,
Novell BorderManager,
etc.
A freeware Java proxy is available called
Wormhole.
Note: Proxies are open to abuse. Ensure that only authorized
users can use your proxy, especially for general (non-HTTP) networking.
Certain early versions of WinGate, for instance, shipped with
security features disabled.
Since many cache server functions (If-Modified-Since, Expires)
depend on time, it is important that the server (and the browser)
uses the correct GMT time. See e.g. the Time Page
for how to set your server's clock accurately.
Hierarchical cache agents are not allowed to cache objects which are
authenticated or obtained from a secure server (shttp transport protocol).
This is deliberate to decrease the security risk. Browsers are also
configured by default to not cache secure documents; otherwise, the
documents on disk (such as financial transactions) are vulnerable to
equipment theft or unauthorized access.
It is therefore not a good idea to serve a large number of graphics and
inconsequential documents from a secure server.
It has been pointed out that many browsers may be configured to issue
a warning when moving from secure to insecure pages, so serving
images in a secure page from a regular server is probably unworkable.
Therefore more care should be taken to reduce the total transfer size
from secure servers.
While VRML viewers driven from a Web browser benefit from local and
proxy cacheing in the browser for simple worlds, complex worlds contain
inlines, textures, and in the case of VRML2, sound and animation. The viewer
typically contains its own HTTP client to get these from the net. It
is important for production quality viewers to either have a local
cache and support for proxy, or if used as a plugin to use the host browser
to obtain objects. Some early viewers would fetch a new copy of an inline
for each instance in a world. VRweb, while not
having (version 1.1.2) a local cache, has support for proxy, and thus could
be used with a proxy server such as Apache or Squid.
Most modern browsers have support for Proxy Cache.
Netscape Navigator has a
powerful proxy configuration tool using Javascript, allowing
for multiple proxies and automatic fallback.
Internet Explorer also
has proxy support.
Any Unix agent such as Lynx or Mosaic built with the
CERN libwww library can be configured for simple proxy using
an environment variable, viz.:
setenv http_proxy "http://proxy.some.org:3128/"
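For agents written in Python, the standard library honours the same variable. The following is a minimal sketch, assuming the illustrative proxy address from the setenv line above; it only builds an opener and does not fetch anything.

```python
import os
import urllib.request

# Same illustrative proxy address as in the setenv example above.
os.environ["http_proxy"] = "http://proxy.some.org:3128/"

# urllib collects <scheme>_proxy variables from the environment.
proxies = urllib.request.getproxies_environment()

# An opener built from this mapping would route HTTP requests via the proxy.
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))

print(proxies["http"])
```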
You should upgrade to a newer browser (e.g. Netscape 3.0) which
allows unverified objects from the cache to be used. Typically,
the modification date is checked once per session, then the cached
copy is used with no further Net access. Consider growing your local
cache size if operating on a slow modem. If you regularly browse a large
number of graphic-rich sites, then the Netscape default 5 MB is probably
inadequate.
The Web can benefit from servers sharing icons. This may seem a heretical
viewpoint, and many webmasters use scripts to prevent other sites
referencing their images, but consider:
If I develop an icon or Java applet so wonderful that every page
on the planet uses it, and they all link to my page, what happens?
Do I get a billion hits a day? No; because as soon as a modern browser
loads it, it goes into local cache. Every other page that the user
looks at using my icon will use the cached copy. With proxy cache,
my icon will always be in the cache so that new users seeing my icon for
the first time will get a cached copy, not one from my server.
I will, however, get a number of If-Modified-Since requests from
users who use Reload on my icon, or who have not selected
"Verify Document: Never", which in this extreme case may be
sufficient to crash my server. I believe, though, that a powerful
server making public-domain icons available as links would be a
globally useful service. It only works if the URLs to the icons
are identical, since the same icon held under the same name but on
different servers is cached as a separate URL.
In any case, virtual servers running on one CPU should have no problem
with sharing icons.
Note: since URLs are often cached without
transformation (case folding, etc.), it is important that
shared objects are named consistently. For example, if the machine
123.456.789.001 has addresses www.org1.com, www.org2.com and org2.com,
then the following URLs, although referring to the same object, will
be considered unique for cacheing purposes:
http://123.456.789.001/some.gif
http://www.org1.com/some.gif
http://www.org2.com/some.gif
http://org2.com/some.gif
http://www.ORG1.COM/some.gif
Pages on a virtual server sharing a common pool of icons should therefore
use absolute URLs to refer to the icons, e.g. a page
"http://www.some-org.com/some.html" hosted on node
www.host-org.net should use
"http://www.host-org.net/icons/icon1.gif" rather than
"/common-icons/icon1.gif", where common-icons is a
symbolic link or shortcut to the directory "icons".
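The effect of inconsistent naming can be illustrated with a toy cache keyed on the exact URL string. This is only a sketch under that assumption (real cache agents differ in how much they normalise); the function name is hypothetical and the URLs are the ones from the examples above.

```python
def count_origin_fetches(urls):
    """Count how many requests miss a cache keyed on the raw URL string."""
    cache = set()
    fetches = 0
    for url in urls:
        if url not in cache:
            cache.add(url)   # first sight of this exact string: fetch and store
            fetches += 1
    return fetches

# Five different names for the same object: five separate fetches.
aliases = [
    "http://123.456.789.001/some.gif",
    "http://www.org1.com/some.gif",
    "http://www.org2.com/some.gif",
    "http://org2.com/some.gif",
    "http://www.ORG1.COM/some.gif",
]

# Five pages all using one absolute URL: a single fetch.
shared = ["http://www.host-org.net/icons/icon1.gif"] * 5

print(count_origin_fetches(aliases), count_origin_fetches(shared))
```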
Content Negotiation refers to a mechanism whereby the server and browser
can negotiate some characteristics of the document, such as language
or image type. See e.g.
draft-holtman-http-negotiation-03.txt for a recent draft.
Unfortunately, content negotiation does not co-exist happily with
proxy cache. Some effort has been made in the HTTP 1.1 protocol to
improve this. Meanwhile, rather than making large numbers of documents
negotiable, it is probably better to negotiate on e.g. natural language
at the document root and maintain separate trees of non-negotiable
pages in each language. Alternatively,
a negotiated redirect may be used, resulting
in only a small transfer from the origin server.
Since redirected documents (response 302), such as generated by server-side
imagemap, are not cached, using this requires a transfer from the
origin server, even though the actual document may be cached nearby.
Client-side imagemap (USEMAP), which is
understood by recent browsers (including Lynx 2.5), requires only one
transfer and can be fulfilled directly by cache.
Redirected responses (Status 302) are not usually cached by proxy caches
(although Netscape Navigator will use a locally cached page as long
as there is no "." in the directory name. I.e. if
there is a cached document "http://some.org/here/", then
a link to "http://some.org/here" will use it.)
Implicit redirects are often generated by giving a directory name as
a URL, as in the example above. The redirect will go to the origin
server. Appending a trailing "/" to the directory, or giving
the complete file URL, will avoid this. I.e. use
"http://some.org/~John/" or
"http://some.org/~John/index.html"
instead of "http://some.org/~John".
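A link checker could apply this rule automatically. The sketch below is a heuristic, not a definitive test: it assumes that a final path segment without a "." names a directory, which will be wrong for some sites. The function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def directory_url(url):
    """Append "/" when the last path segment looks like a directory name."""
    parts = urlsplit(url)
    last = parts.path.rsplit("/", 1)[-1]
    # Assumption: a segment with no "." in it is a directory, not a file.
    if last and "." not in last:
        parts = parts._replace(path=parts.path + "/")
    return urlunsplit(parts)

print(directory_url("http://some.org/~John"))
```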
Some CGI scripts parse the USER_AGENT field to determine the
capabilities of a browser (such as whether it is Netscape-compatible).
When the resulting document is cached, it is no longer true that
the server knows what browser is being used. Other techniques are available
such as Javascript functions, NOFRAMES, NOEMBED and NOSCRIPT tags, ALT
tags on images, which permit the document to be understood by a variety
of browsers without the server having to know which browser is used.
Where a page represents "Today's News", and it is rolled over
to "Yesterday's News" at night, consider giving it a unique URL
(perhaps using the Julian date), giving the index page
an expiry date of "tonight" and updating it nightly
with links from "Today", "Yesterday" to 961003.html, 961002.html etc.
In this fashion someone who read the news yesterday, then goes to the index page and clicks
"Yesterday's News", will see the page they have in their local (or site) cache,
instead of getting an identical page with a different name from the origin.
The New BC pages here
were
an excellent example of how not to do it, I'm ashamed to say, but
are now fixed.
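The nightly update described above might be scripted along these lines. This is a sketch only: the YYMMDD file naming follows the 961003.html example, while the function names and the choice of midnight GMT as "tonight" are assumptions.

```python
from datetime import date, datetime, time, timedelta, timezone
from email.utils import format_datetime

def news_filename(day):
    """Permanent, dated name for one day's news page, e.g. 961003.html."""
    return day.strftime("%y%m%d") + ".html"

def index_links(today):
    """Link targets for the index page, regenerated each night."""
    yesterday = today - timedelta(days=1)
    return {"Today": news_filename(today), "Yesterday": news_filename(yesterday)}

def index_expires(today):
    """Expires value for the index page: the midnight that ends `today` (GMT assumed)."""
    midnight = datetime.combine(today + timedelta(days=1), time(0, 0), tzinfo=timezone.utc)
    return format_datetime(midnight, usegmt=True)

print(index_links(date(1996, 10, 3)))
print("Expires:", index_expires(date(1996, 10, 3)))
```

Only the small index page changes each night; the dated pages keep their URLs and remain valid in any cache that holds them.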
Most Web pages are static - a document that was written at some
point in time, and may be updated occasionally. In between updates,
it doesn't change, which makes it cacheable. The URL of the page
completely determines the content that is seen. Some CGI scripts are
used in this fashion (for instance, the
Help Page
at AltaVista) and could usefully be made cacheable.
Some current cache agents assume that CGI scripts are uncacheable. Specifically,
they may check a URL for "cgi-bin" and "?" (query).
Generating a website from a database may make more sense in many cases than
using a traditional file-based webserver, but one should try to use path-based
URLs (such as "http://some.org/database/sales/1996") where
possible instead of query-based URLs
("http://some.org/database?sales+1996" or
"http://some.org/cgi-bin/lookup?sales+1996"). Clearly, where
the URL is generated by an HTML form, this is not possible, but often
many links between pages are static.
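The URL test mentioned above can be mimicked in a few lines to check how a site's links would fare. This is a sketch of the heuristic as described in the text, not the code of any particular cache agent; the function name is hypothetical.

```python
def assumed_uncacheable(url):
    """Mimic the test described above: "cgi-bin" or "?" marks a URL as dynamic."""
    return "cgi-bin" in url or "?" in url

for url in (
    "http://some.org/database/sales/1996",
    "http://some.org/database?sales+1996",
    "http://some.org/cgi-bin/lookup?sales+1996",
):
    print(url, "->", "skipped" if assumed_uncacheable(url) else "cacheable")
```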
Other CGI scripts
generate output that depends on some other factor, for instance, a
script may return an image from a Webcam, or the wind speed from
a weather station. At first sight these documents appear uncacheable, but
in fact may be usefully made cacheable in a controlled way. For instance,
in the case of the weather station, there is no need to give up-to-the-second
data via the Web. A script might generate a cacheable document giving
a Last-Modified time for when the measurement was made, and an Expires
time for when the next measurement will be made. If instant data is critical,
a browser may send GET with Pragma: no-cache (typically Shift-Reload), or
use some other transport protocol.
This Perl script
illustrates the kind of thing that can be done.
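The linked Perl script is not reproduced here. As an independent sketch of the same idea in Python, the headers for the weather-station case could be built as follows; the function name, the plain-text content type and the ten-minute measurement interval are assumptions.

```python
import calendar
from email.utils import formatdate

def weather_headers(measured_at, interval_seconds):
    """Headers for a reading taken at `measured_at` (seconds since the epoch, GMT)."""
    return [
        ("Content-Type", "text/plain"),
        # When the measurement was made.
        ("Last-Modified", formatdate(measured_at, usegmt=True)),
        # When the next measurement is due; caches may serve the page until then.
        ("Expires", formatdate(measured_at + interval_seconds, usegmt=True)),
    ]

measured = calendar.timegm((1996, 10, 8, 8, 0, 0))
for name, value in weather_headers(measured, 600):
    print(f"{name}: {value}")
```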
Some cache agents (Netscape Navigator, for instance) may make use of the
Content-Length header to characterize a document and to determine whether
a document is completely loaded. Without a Content-Length header, the
only way to signal the end of the object is to shut down the connection,
which defeats the purpose of a persistent connection.
It may be useful, therefore, to generate a valid Content-Length header.
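One way for a script to do this is to build the whole body first and count its bytes before emitting any headers. A minimal sketch, assuming UTF-8 output and a hypothetical helper name:

```python
def cgi_response(body_text, content_type="text/html; charset=utf-8"):
    """Return a complete CGI response with a Content-Length that matches the body."""
    body = body_text.encode("utf-8")
    header = (
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"   # length in bytes, not characters
        "\r\n"
    ).encode("ascii")
    return header + body

response = cgi_response("<p>caf\u00e9</p>")
print(response.decode("utf-8"))
```

The length must be counted after encoding, since a character outside ASCII occupies more than one byte.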
Different
browsers and proxy agents apply different heuristics to determine
what documents may be cached, while the HTTP specification lays down
some rules. For instance, documents retrieved using POST or requiring
authorisation cannot be cached. Documents with an expiry date of
"now", an expiry date in the past or an illegal expiry date
should not be cached. The Squid cache, in default configuration,
will not cache documents whose URL contains a query term ("?"),
or the string "cgi-bin". Apache 1.1.1 will not cache
documents without a Last-Modified header. Netscape Navigator caches
most documents except those whose Pragma header forbids it; cacheing
of secure documents is a user option. Accordingly, it may be difficult
to make a CGI script that is universally cacheable, especially if
input from a form is required. If form input is not essential, a URL
of the form www.some.org/cacheable-cgi/locator/search-term-here
may be used.
See also
the discussion on http-wg (active again April 1997).
Cookies
are useful for tracking a user through the system, but do not
work well with proxy cache,
since
Set-Cookie headers should not be cached. Some cache agents may
ignore cookie headers and cache all documents, some may drop
any document with a cookie, and some may be capable of stripping
the header and cacheing the document only.
If you require cookies to track customers, consider using
statistical sampling. Avoid generating a cookie for a page
unless it is strictly necessary (do not use a server
configured to set cookies on all objects). Try to keep the size of
pages which do set a cookie small; one technique may
be to set a cookie using a Redirect script.
Rotate
advertising banners from a pool of cacheable images rather than
generating disposable URLs.
Note that some proxy services may strip cookies from pages in order
to achieve a better hit rate; these organisations may include those
such as academic institutions which pay for Internet bandwidth but
supply their clients (such as students) with a free service.
Some sites (such as Vancouver Webpages...)
make use of CGI scripts to validate form data, often using a two-pass approach (your
submitted information looks like this; is this OK?). This reduces the need for offline
checking of data but is wasteful of bandwidth, especially if the user takes several attempts
to get it right.
Javascript
and Java
provide functionality to check form data on the user's workstation before it is submitted.
Of course, it is still necessary to provide a handler for browsers which do not support these
languages, in addition to checking that the validation is correct (not using an out-of-date
cached script, or a user-modified script designed to bypass certain checks).
Server-side includes can be useful on a server, but usually
these documents do not have a Last-Modified header,
which precludes their responding to an If-Modified-Since request.
Apache
provides a method (the XBitHack) of forcing the server to generate one. In this case,
the "wrapper" document must be "touched" or updated
whenever any of its daughter documents are changed. Server-side includes
also do not usually have a Content-Length header, which may cause problems
with a persistent connection, so they should probably be avoided where
possible (yes, there are a lot here
...).
If-Modified-Since is a powerful concept which ensures
cached pages do not become stale, while conserving bandwidth.
Pages are served with a Last-Modified date HTTP header typically
generated from the file timestamp. A cache stores this date
and time along with the page content. If the browser has
reason to suspect that the page is stale (it has been configured
to always ask, or the user used "Reload", or the Expires
date has passed) it sends an If-Modified-Since header with the request.
If the page has not changed, the server responds with a short status 304
message instead of the complete page, which is usually much quicker.
In order for this mechanism to work, the server must generate valid
Last-Modified headers and the server clock must be reliable. Some
Microsoft web servers reportedly are ill-behaved in this regard, making them
appear slow over long paths or cache hierarchies.
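The server-side decision can be sketched as follows. This is an illustration of the mechanism described above rather than any server's actual code; the function name is hypothetical, and an unparseable date is simply treated as if no If-Modified-Since header had been sent.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def status_for(if_modified_since, last_modified):
    """Return 304 if the client's copy is still current, otherwise 200."""
    if if_modified_since is None:
        return 200                      # unconditional GET: send the page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200                      # unparseable date: send the page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # Unchanged since the date the cache holds: a short 304 is enough.
    return 304 if last_modified <= since else 200

page_changed = datetime(1996, 10, 8, 8, 0, 0, tzinfo=timezone.utc)
print(status_for("Tue, 08 Oct 1996 08:00:00 GMT", page_changed))
print(status_for("Mon, 07 Oct 1996 08:00:00 GMT", page_changed))
```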
As stated above, the HTTP specification
defines the use for the Expires header. In the absence of an Expires
header, proxy agents guess a probable expiry date for a document based
on available data. For instance, a document that has recently changed
as evidenced by the Last-Modified header may be assumed to be updated
frequently, and flushed from the cache in a short time. Images may be
cached for longer than text. To assert control over the expiry date
of a document, an Expires header may be explicitly generated in
RFC1123 time format, for instance
Expires: Tue, 08 Oct 1996 08:00:00 GMT
On a server such as Apache, this may be generated in a CGI script,
specified in an
as-is document header,
or given in a CERN-style
.meta file.
".asis" files under Apache do not appear to
respond properly to If-Modified-Since. Apache 1.2 offers direct support
for the Expires header through the
mod_expires module.
Regardless of the value of Expires, a cache may expire files sooner
based on site configuration, lack of hits, etc. It should not cache
a file beyond the Expires date.
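The guess made in the absence of an Expires header is often some fraction of the document's age when it was fetched. The sketch below assumes a figure of 10% and a one-week ceiling; both numbers and the function name are illustrative assumptions, and real cache agents are configurable.

```python
def heuristic_lifetime(age_at_fetch_seconds, percent=10, ceiling_seconds=7 * 24 * 3600):
    """Guess how long to keep a document that carries no Expires header.

    age_at_fetch_seconds is (time fetched - Last-Modified): a document that
    changed recently is assumed to change again soon.
    """
    lifetime = max(age_at_fetch_seconds, 0) * percent // 100
    return min(lifetime, ceiling_seconds)

print(heuristic_lifetime(10 * 3600))         # modified ten hours before the fetch
print(heuristic_lifetime(365 * 24 * 3600))   # modified a year before the fetch
```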
Netscape Navigator (v3) appears to cache documents with a zero Expires date, but
then issue a GET with If-Modified-Since, thus checking the document
for changes. A document with a Pragma no-cache header is not cached.
Note: In HTTP/1.1 there is a richer set of cache-control headers
than in HTTP/1.0, including controls on maximum age, public and private
cache directives, etc.
Other techniques may use a small uncacheable inline image on pages that must
be metered, indicating that the page has probably been retrieved by a browser
with images enabled, or count redirects if URLs are redirected (redirects are
not cacheable in a public cache). It is not necessary to track every single
request to measure how many people have visited a page. One possibility
for a busy site is to turn on cache-busting for a few thousand hits at random
times during the week, which will enable statistical estimates of the
cached ratio, and thus of the total viewer counts.
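The arithmetic behind such an estimate is simple. In the sketch below the hourly rates and weekly total are made-up figures, and the function names are hypothetical; it assumes that during cache-busting every view reaches the origin server.

```python
def cached_ratio(normal_rate, busted_rate):
    """Fraction of views served from caches, from origin hit rates
    measured with and without cache-busting."""
    return 1 - normal_rate / busted_rate

def estimated_views(origin_hits, normal_rate, busted_rate):
    """Scale the origin's hit count up to an estimate of total views."""
    return origin_hits * busted_rate / normal_rate

# Made-up figures: 250 hits/hour normally, 1000 hits/hour while cache-busting,
# and 40000 origin hits over the week.
print(cached_ratio(250, 1000))
print(estimated_views(40000, 250, 1000))
```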
There's been some discussion on the Squid mail list recently (August 1997)
in regard to cache operation and copyright violation. The consensus among
cache operators is that automated operation of a Web cache may be considered
like operating a long piece of cable; information is stored in it temporarily
but is not considered "copied", much as multiple versions of TV programs
in transit in cable TV systems are not "copies". One respondent refers to
draft German (and possibly EU) legislation which addresses this issue.
Some people are interested in using proxy cache not for any benefit
in bandwidth conservation, but in order to disguise their IP address
from the webserver, or to avoid some domain restrictions that
may be in place at some gateway. The trojan "Ring Zero" in late 1999
used a distributed mapping technique to discover open Web proxies, possibly for this purpose.
Note that some proxy agents, such as Squid, make the original IP address
available using an HTTP header such as X-Forwarded-For, and do not
strip cookies or otherwise anonymize the original request. Those
requiring anonymous access may use a service such as
ZeroKnowledge, which is designed specifically for this purpose.
The icons should be copied to your server.
The image size may be adjusted on many browsers by changing the width and height
values.
Linking these icons is no longer unconditionally recommended.
However, in keeping with the spirit of Cache Now!, you should
share links to the icons, for instance,
all the virtual servers on one CPU should use the same copy,
and several servers within one organization can share one copy.
This will protect you from having broken icons on your page if this server
is unavailable, either from downtime or through network breaks. There
will be many If-Modified-Since requests which must be fulfilled at small
overhead.
You are welcome to copy the files and make links available.
Files may be mirrored from
ftp://vancouver-webpages.com/CacheNow/.
Please
let me know
if you do this.
There are two icons, a static PNG and an animated GIF. In Netscape
Navigator, animated GIFs may be stopped with the Stop button and
re-started by clicking the Location bar and pressing Return, but some
people don't like them.
Thanks to Grant Bayley for an improved (smaller) animated GIF (afaik
created with licensed software).
For the Cache Now! icons, the script
age.pl was previously used to artificially set the
modification date back, so that some cache flushing algorithms
would assume they are a stable file and cache them for longer.
Currently the Apache ExpiresByType directive is used to give
the icons an expiry time of one month after loading.
In future some HTTP/1.1 cache headers may be included.
Recently (October 2000) there have been reports of cache hits being
reported as intrusion attempts. In this cache model, as
implemented by
Akamai,
HTTP requests are made to the origin server but are filled
by a cache server located at a closer point on the net.
Some IDS systems report this as suspicious activity.
See for instance this
BCNet Tech Bulletin.