the Cache Now! campaign - details

LAN cache servers

Any organisation operating a LAN with more than a dozen or so PCs should consider using a cache server. Many software packages are available for different platforms.
For low cost and moderate performance, consider the Squid cache or Apache caching server on a PC running Linux. A system can probably be implemented for under $1000 using a Pentium or 486 PC with a 1 GB disk (no monitor required), or less if a pre-owned PC is used. These packages will also run on traditional Unix platforms such as Sun, SGI, Alpha, etc. Apache will also function as a web server, so it is possible to have a robust, moderate-volume cache/server combination for a few hundred dollars. Firewall software can be put on the same machine, or a dedicated firewall used if required.
For those more comfortable with a commercial product (where you can blame someone when it fails), or a GUI, consider the Harvest Cache or Netscape Proxy Server. For Win95 or NT, other servers are available, such as WinProxy, WinGate, Spaghetti, Novell BorderManager, etc.
A freeware Java proxy called Wormhole is also available.

Note: Proxies are open to abuse. Ensure that only authorized users can use your proxy, especially for general (non-HTTP) networking. Certain early versions of WinGate, for instance, shipped with security features disabled.

Correct Time

Since many cache server functions (If-Modified-Since, Expires) depend on time, it is important that the server (and the browser) uses the correct GMT time. See e.g. the Time Page for how to set your server's clock accurately.

Secure Servers and Cache

Hierarchical cache agents are not allowed to cache objects which are authenticated or obtained from a secure server (shttp transport protocol). This is deliberate to decrease the security risk. Browsers are also configured by default to not cache secure documents; otherwise, the documents on disk (such as financial transactions) are vulnerable to equipment theft or unauthorized access.

It is therefore not a good idea to serve a large number of graphics and inconsequential documents from a secure server.

It has been pointed out that many browsers may be configured to issue a warning when moving from secure to insecure pages, so serving images in a secure page from a regular server is probably unworkable. Therefore more care should be taken to reduce the total transfer size from secure servers.

VRML viewer Cache and Proxy support

While VRML viewers driven from a Web browser benefit from local and proxy cacheing in the browser for simple worlds, complex worlds contain inlines, textures, and in the case of VRML2, sound and animation. The viewer typically contains its own HTTP client to get these from the net. It is important for production quality viewers to either have a local cache and support for proxy, or if used as a plugin to use the host browser to obtain objects. Some early viewers would fetch a new copy of an inline for each instance in a world.
VRweb (version 1.1.2) has no local cache but does support proxies, and thus could be used with a proxy server such as Apache or Squid.

Browser support for Proxy Cache

Most modern browsers have support for Proxy Cache. Netscape Navigator has a powerful proxy configuration tool using Javascript, allowing for multiple proxies and automatic fallback. Internet Explorer also has proxy support. Any Unix agent such as Lynx or Mosaic built with the CERN libwww library can be configured for simple proxy using an environment variable, viz.:
setenv http_proxy "http://proxy.some.org:3128/"
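
Other HTTP clients follow the same environment-variable convention. As a minimal sketch, Python's standard urllib module picks up http_proxy as shown here; the proxy address is the placeholder from the setenv line above, not a real server.

```python
import os
import urllib.request

# Placeholder proxy address taken from the setenv example; not a real server.
os.environ["http_proxy"] = "http://proxy.some.org:3128/"

# getproxies() collects the <scheme>_proxy environment variables into a
# mapping of scheme -> proxy URL; urlopen() consults the same mapping by default.
proxies = urllib.request.getproxies()
print(proxies.get("http"))
```

Note that urllib deliberately ignores http_proxy when it runs inside a CGI script (that is, when REQUEST_METHOD is set in the environment).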

You should upgrade to a newer browser (e.g. Netscape 3.0) which allows unverified objects from the cache to be used. Typically, the modification date is checked once per session, then the cached copy is used with no further Net access. Consider increasing your local cache size if operating on a slow modem. If you regularly browse a large number of graphic-rich sites, then the Netscape default of 5 MB is probably inadequate.

Sharing Links

The Web can benefit from servers sharing icons. This may seem a heretical viewpoint, and many webmasters use scripts to prevent other sites referencing their images, but consider:
If I develop an icon or Java applet so wonderful that every page on the planet uses it, and they all link to my page, what happens? Do I get a billion hits a day? No; because as soon as a modern browser loads it, it goes into local cache. Every other page that the user looks at using my icon will use the cached copy. With proxy cache, my icon will always be in the cache so that new users seeing my icon for the first time will get a cached copy, not one from my server. I will, however, get a number of If-Modified-Since requests from users who use Reload on my icon, or who have not selected "Verify Document: Never", which in this extreme case may be sufficient to crash my server. I believe, though, that a powerful server making public-domain icons available as links would be a globally useful service. It only works if the URLs to the icons are identical, since the same icon held under the same name but on different servers is cached as a separate URL.

In any case, virtual servers running on one CPU should have no problem with sharing icons.

Note: since URLs are often cached without transformation (case folding, etc.), it is important that shared objects are named consistently. For example, if the machine 123.456.789.001 has addresses www.org1.com, www.org2.com and org2.com, then URLs that name the same object through each of these addresses will be treated as distinct for caching purposes.

Pages on a virtual server sharing a common pool of icons should therefore use absolute URLs to refer to the icons, e.g. a page "http://www.some-org.com/some.html" hosted on node www.host-org.net should use "http://www.host-org.net/icons/icon1.gif" rather than "/common-icons/icon1.gif", where common-icons is a symbolic link or shortcut to the directory "icons".
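
The effect of inconsistent naming can be illustrated with a small sketch. It assumes a cache that uses the URL string, exactly as requested, as its key; the function name cache_key and the icon path are made up for illustration.

```python
from urllib.parse import urljoin


def cache_key(url: str) -> str:
    """Assumed cache behaviour: the URL string is the key, with no
    case folding and no knowledge of host aliases."""
    return url


# Host names from the note above; the icon path is a hypothetical example.
aliases = ["www.org1.com", "www.org2.com", "org2.com"]
keys = {cache_key(f"http://{host}/icons/icon1.gif") for host in aliases}
print(len(keys))  # 3: one object, three separate cache entries

page = "http://www.some-org.com/some.html"
# A host-relative reference resolves against the page's own host name...
relative = urljoin(page, "/common-icons/icon1.gif")
# ...whereas an absolute reference is the same string on every virtual server.
absolute = urljoin(page, "http://www.host-org.net/icons/icon1.gif")
print(relative)
print(absolute)
```

Every virtual server that uses the relative form produces its own URL, and hence its own cache entry; the absolute form produces a single shared one.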

Content Negotiation

Content Negotiation refers to a mechanism whereby the server and browser can negotiate some characteristics of the document, such as language or image type. See e.g. draft-holtman-http-negotiation-03.txt for a recent draft. Unfortunately, content negotiation does not co-exist happily with proxy cache. Some effort has been made in the HTTP 1.1 protocol to improve this. Meanwhile, rather than making large numbers of documents negotiable, it is probably better to negotiate on e.g. natural language at the document root and maintain separate trees of non-negotiable pages in each language. Alternatively, a negotiated redirect may be used, resulting in only a small transfer from the origin server.

Use of Imagemap

Since redirected documents (response 302), such as generated by server-side imagemap, are not cached, using this requires a transfer from the origin server, even though the actual document may be cached nearby. Client-side imagemap (USEMAP), which is understood by recent browsers (including Lynx 2.5), requires only one transfer and can be fulfilled directly by cache.

Redirect

Redirected responses (status 302) are not usually cached by proxy caches. Netscape Navigator will, however, use a locally cached page as long as there is no "." in the directory name; i.e. if there is a cached document "http://some.org/here/", then a link to "http://some.org/here" will use it.

Implicit redirects are often generated by giving a directory name as a URL, as in the example above. The redirect will go to the origin server. Appending a trailing "/" to the directory, or giving the complete file URL, will avoid this. I.e. use "http://some.org/~John/" or "http://some.org/~John/index.html" instead of "http://some.org/~John".
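
A page author could check links for this problem mechanically. The following is only a sketch: the function name is hypothetical, and it guesses that a final path segment with no "." in it names a directory, which will not hold on every server.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def add_directory_slash(url: str) -> str:
    """Append "/" when the last path segment looks like a directory.

    Assumption: a final segment without a "." is a directory name.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    last_segment = posixpath.basename(path)
    if last_segment and "." not in last_segment:
        path += "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(add_directory_slash("http://some.org/~John"))
print(add_directory_slash("http://some.org/~John/index.html"))
```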

Parsing USER_AGENT

Some CGI scripts parse the USER_AGENT field to determine the capabilities of a browser (such as whether it is Netscape-compatible). When the resulting document is cached, it is no longer true that the server knows what browser is being used. Other techniques are available such as Javascript functions, NOFRAMES, NOEMBED and NOSCRIPT tags, ALT tags on images, which permit the document to be understood by a variety of browsers without the server having to know which browser is used.

Renaming Pages

Where a page represents "Today's News", and it is rolled over to "Yesterday's News" at night, consider giving it a unique URL (perhaps using the Julian date), giving the index page an expiry date of "tonight" and updating it nightly with links from "Today", "Yesterday" to 961003.html, 961002.html etc. In this fashion someone who read the news yesterday, goes to the index page and clicks "Yesterday's News", will see the page they have in their local (or site) cache, instead of getting an identical page with a different name from the origin.
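
The naming scheme might be sketched as follows. The function names are hypothetical, the file names follow the 961003.html pattern above, and "tonight" is taken to mean midnight GMT, which is an assumption; a site may prefer its own local midnight.

```python
from datetime import date, datetime, time, timedelta, timezone
from email.utils import format_datetime


def page_name(day: date) -> str:
    """Permanent file name for one day's news, e.g. 961003.html."""
    return day.strftime("%y%m%d") + ".html"


def index_links(today: date) -> dict:
    """Links the index page should carry after tonight's update."""
    return {
        "Today": page_name(today),
        "Yesterday": page_name(today - timedelta(days=1)),
    }


def index_expires(today: date) -> str:
    """Expires value for the index page: midnight GMT at the end of `today`."""
    midnight = datetime.combine(today + timedelta(days=1), time.min, tzinfo=timezone.utc)
    return format_datetime(midnight, usegmt=True)


print(index_links(date(1996, 10, 3)))
print(index_expires(date(1996, 10, 3)))
```

Only the small index page changes each night; the dated pages keep their URLs and stay valid in any cache that already holds them.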

The New BC pages here were an excellent example of how not to do it, I'm ashamed to say, but are now fixed.

Cacheable CGI

Most Web pages are static - a document that was written at some point in time, and may be updated occasionally. In between updates, it doesn't change, which makes it cacheable. The URL of the page completely determines the content that is seen. Some CGI scripts are used in this fashion (for instance, the Help Page at AltaVista) and could usefully be made cacheable.

Some current cache agents assume that CGI scripts are uncacheable. Specifically, they may check a URL for "cgi-bin" and "?" (query). Generating a website from a database may make more sense in many cases than using a traditional file-based webserver, but one should try to use path-based URLs (such as "http://some.org/database/sales/1996") where possible instead of query-based URLs ("http://some.org/database?sales+1996" or "http://some.org/cgi-bin/lookup?sales+1996"). Clearly, where the URL is generated by an HTML form, this is not possible, but often many links between pages are static.

Other CGI scripts generate output that depends on some other factor, for instance, a script may return an image from a Webcam, or the wind speed from a weather station. At first sight these documents appear uncacheable, but in fact may be usefully made cacheable in a controlled way. For instance, in the case of the weather station, there is no need to give up-to-the-second data via the Web. A script might generate a cacheable document giving a Last-Modified time for when the measurement was made, and an Expires time for when the next measurement will be made. If instant data is critical, a browser may send GET with Pragma:no-cache (typically Shift-Reload), or use some other transport protocol.
This Perl script illustrates the kind of thing that can be done.
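
For readers without access to that script, here is a rough Python sketch of the weather-station idea. It is not a translation of the Perl script; the function name, the ten-minute interval and the wind speed reading are all invented for illustration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def weather_response(measured_at: datetime, interval: timedelta, wind_speed_kmh: int) -> bytes:
    """Build CGI output (headers, blank line, body) for one measurement.

    Last-Modified is the time of the measurement; Expires is the time
    the next measurement is due, so caches may reuse the page until then.
    """
    body = f"Wind speed: {wind_speed_kmh} km/h\n".encode("ascii")
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
        ("Last-Modified", format_datetime(measured_at, usegmt=True)),
        ("Expires", format_datetime(measured_at + interval, usegmt=True)),
    ]
    head = "".join(f"{name}: {value}\r\n" for name, value in headers)
    return head.encode("ascii") + b"\r\n" + body


# Hypothetical reading taken at 08:00 GMT, with the next one due ten minutes later.
measured = datetime(1996, 10, 8, 8, 0, 0, tzinfo=timezone.utc)
print(weather_response(measured, timedelta(minutes=10), 12).decode("ascii"))
```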

Some cache agents (Netscape Navigator, for instance) may make use of the Content-Length header to characterize a document and to determine whether a document is completely loaded. Without a Content-Length header, the only way to signal the end of the object is to shut down the connection, which rules out a persistent connection. It may be useful, therefore, to generate a valid Content-Length header.

Different browsers and proxy agents apply different heuristics to determine what documents may be cached, while the HTTP specification lays down some rules. For instance, documents retrieved using POST or requiring authorisation cannot be cached. Documents with an expiry date of "now", an expiry date in the past or an illegal expiry date should not be cached. The Squid cache, in default configuration, will not cache documents whose URL contains a query term ("?"), or the string "cgi-bin". Apache 1.1.1 will not cache documents without a Last-Modified header. Netscape Navigator caches most documents except those whose Pragma header forbids it; cacheing of secure documents is a user option. Accordingly, it may be difficult to make a CGI script that is universally cacheable, especially if input from a form is required. If form input is not essential, a URL of the form www.some.org/cacheable-cgi/locator/search-term-here may be used.
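
The rules listed above can be combined into a rough predictor. This is only a sketch of the rules as described here, not the algorithm of Squid or any other agent; the function name is hypothetical and headers are passed as plain dictionaries with exactly these capitalisations.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def probably_cacheable(method, url, request_headers, response_headers, now):
    """Guess whether a conservative proxy would cache this response."""
    if method.upper() == "POST":
        return False
    if "Authorization" in request_headers:
        return False
    # Default Squid behaviour as described above: skip query and cgi-bin URLs.
    if "?" in url or "cgi-bin" in url:
        return False
    expires = response_headers.get("Expires")
    if expires is not None:
        try:
            expiry = parsedate_to_datetime(expires)
        except (TypeError, ValueError):
            return False  # illegal expiry date
        if expiry.tzinfo is None:
            expiry = expiry.replace(tzinfo=timezone.utc)
        if expiry <= now:
            return False  # expires "now" or in the past
    return True


now = datetime(1996, 10, 8, 9, 0, 0, tzinfo=timezone.utc)
print(probably_cacheable("GET", "http://some.org/database/sales/1996", {}, {}, now))
print(probably_cacheable("GET", "http://some.org/cgi-bin/lookup?sales+1996", {}, {}, now))
```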

See also the discussion on http-wg (active again April 1997).

Cookies

Cookies are useful for tracking a user through the system, but do not work well with proxy cache, since Set-Cookie headers should not be cached. Some cache agents may ignore cookie headers and cache all documents, some may drop any document with a cookie, and some may be capable of stripping the header and cacheing the document only.

If you require cookies to track customers, consider using statistical sampling. Avoid generating a cookie for a page unless it is strictly necessary (do not use a server configured to set cookies on all objects). Try to keep the size of pages which do set a cookie small; one technique may be to set a cookie using a Redirect script.

Rotate advertising banners from a pool of cacheable images rather than generating disposable URLs.

Note that some proxy services may strip cookies from pages in order to achieve a better hit rate; these organisations may include those such as academic institutions which pay for Internet bandwidth but supply their clients (such as students) with a free service.

For more cookie information, see e.g. HTTP Cookie Library FAQs by Matt Wright, or Andy's Cookie Notes. See also RFC 2109.
Here is more information on the Specification.

Form Validation

Some sites (such as Vancouver Webpages...) make use of CGI scripts to validate form data, often using a two-pass approach (your submitted information looks like this; is this OK?). This reduces the need for offline checking of data but is wasteful of bandwidth, especially if the user takes several attempts to get it right. Javascript and Java provide functionality to check form data on the user's workstation before it is submitted. Of course, it is still necessary to provide a handler for browsers which do not support these languages, in addition to checking that the validation is correct (not using an out-of-date cached script, or a user-modified script designed to bypass certain checks).

Server-side Includes

Server-side includes can be useful on a server, but usually these documents do not have a Last-Modified header, which precludes their responding to an If-Modified-Since request. Apache provides a method (the XBitHack) of forcing the server to generate one. In this case, the "wrapper" document must be "touched" or updated whenever any of its daughter documents are changed. Server-side includes also do not usually have a Content-Length header, which may cause problems with a persistent connection, so they should probably be avoided where possible (yes, there are a lot here ...).

If-Modified-Since

If-Modified-Since is a powerful concept which ensures cached pages do not become stale, while conserving bandwidth. Pages are served with a Last-Modified HTTP header, typically generated from the file timestamp. A cache stores this date and time along with the page content. If the browser has reason to suspect that the page is stale (it has been configured to always ask, or the user used "Reload", or the Expires date has passed) it sends an If-Modified-Since header with the request. If the page has not changed, the server responds with a short status 304 message instead of the complete page, which is usually much quicker. In order for this mechanism to work, the server must generate valid Last-Modified headers and the server clock must be reliable. Some Microsoft web servers are reportedly ill-behaved in this regard, making them appear slow over long paths or cache hierarchies.
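
The server's side of this exchange might be sketched as below. The function name is hypothetical, and a real server has more cases to handle; in this sketch an unparseable date is simply ignored and the full page is sent.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(last_modified: datetime, if_modified_since) -> int:
    """Return 304 if the client's copy is still current, otherwise 200.

    `last_modified` is a timezone-aware datetime (e.g. from the file
    timestamp); `if_modified_since` is the raw header value or None.
    """
    if if_modified_since is None:
        return 200
    try:
        client_copy = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable date: ignore the header and send the page
    if client_copy.tzinfo is None:
        client_copy = client_copy.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    if last_modified.replace(microsecond=0) <= client_copy:
        return 304
    return 200


modified = datetime(1996, 10, 8, 8, 0, 0, tzinfo=timezone.utc)
print(conditional_status(modified, "Tue, 08 Oct 1996 08:00:00 GMT"))
print(conditional_status(modified, "Mon, 07 Oct 1996 08:00:00 GMT"))
```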

The Expires header

As stated above, the HTTP specification defines the use of the Expires header. In the absence of an Expires header, proxy agents guess a probable expiry date for a document based on available data. For instance, a document that has recently changed as evidenced by the Last-Modified header may be assumed to be updated frequently, and flushed from the cache in a short time. Images may be cached for longer than text. To assert control over the expiry date of a document, an Expires header may be explicitly generated in RFC 1123 time format, for instance
Expires: Tue, 08 Oct 1996 08:00:00 GMT
On a server such as Apache, this may be generated in a CGI script, specified in an as-is document header, or given in a CERN-style .meta file. ".asis" files under Apache do not appear to respond properly to If-Modified-Since. Apache 1.2 offers direct support for the Expires header through the mod_expires module.

Regardless of the value of Expires, a cache may expire files sooner based on site configuration, lack of hits, etc. It should not cache a file beyond the Expires date.
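
The guessing described above might look something like this sketch. The 10% factor and the seven-day ceiling are illustrative assumptions, not values taken from any particular cache, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def estimated_expiry(fetched_at, last_modified=None, expires=None,
                     factor=0.1, max_lifetime=timedelta(days=7)):
    """Pick the time after which a cached copy should be revalidated.

    An explicit Expires always wins. Otherwise the copy is kept for a
    fraction of the document's age, so recently changed documents are
    flushed sooner than long-stable ones.
    """
    if expires is not None:
        return expires
    if last_modified is None:
        return fetched_at  # nothing to go on: treat as immediately stale
    age = max(fetched_at - last_modified, timedelta(0))
    return fetched_at + min(age * factor, max_lifetime)


fetched = datetime(1996, 10, 8, 8, 0, 0, tzinfo=timezone.utc)
print(estimated_expiry(fetched, last_modified=fetched - timedelta(days=10)))
print(estimated_expiry(fetched, last_modified=fetched - timedelta(days=365)))
```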

Netscape Navigator (v3) appears to cache documents with a zero Expires date, but then issues a GET with If-Modified-Since, thus checking the document for changes. A document with a Pragma: no-cache header is not cached.

Note: In HTTP/1.1 there is a richer set of cache-control headers than in HTTP/1.0, including controls on maximum age, public and private cache directives, etc.

Hit Metering

There is an Internet Draft draft-ietf-http-hit-metering-02.txt by Jeffrey Mogul and Paul Leach which addresses the issue of Hit Metering.

Other techniques may use a small uncacheable inline image on pages that must be metered, indicating that the page has probably been retrieved by a browser with images enabled, or count redirects if URLs are redirected (redirects are not cacheable in a public cache). It is not necessary to track every single request to measure how many people have visited a page. One possibility for a busy site is to turn on cache-busting for a few thousand hits at random times during the week, which will enable statistical estimates of the cached ratio, and thus of the total viewer counts.
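
The arithmetic behind such a sampling estimate can be sketched as follows. All the numbers are invented, the function name is hypothetical, and the estimate assumes that traffic during the cache-busting windows is comparable to traffic during the normal windows they are compared with.

```python
def estimate_views(normal_hits, normal_hours, busting_hits, busting_hours,
                   weekly_origin_hits):
    """Estimate the cached ratio and the total page views for a week.

    With cache-busting on, every view reaches the origin server; with it
    off, only cache misses do. The ratio of the two hit rates gives the
    fraction of views that caches normally absorb.
    """
    normal_rate = normal_hits / normal_hours
    busting_rate = busting_hits / busting_hours
    cached_ratio = 1 - normal_rate / busting_rate
    total_views = weekly_origin_hits / (1 - cached_ratio)
    return cached_ratio, total_views


# Invented figures: 500 hits in 2 normal hours, 2000 hits in 2 cache-busting
# hours, and 40000 hits logged at the origin over the whole week.
ratio, views = estimate_views(500, 2, 2000, 2, 40000)
print(ratio, views)
```

With these figures, three quarters of views are served from cache, so the 40000 logged hits correspond to roughly 160000 views.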

Cache and Copyright

There's been some discussion on the Squid mail list recently (August 1997) in regard to cache operation and copyright violation. The consensus among cache operators is that automated operation of a Web cache may be considered like operating a long piece of cable; information is stored in it temporarily but is not considered "copied", much as multiple versions of TV programs in transit in cable TV systems are not "copies". One respondent refers to draft German (and possibly EU) legislation which addresses this issue.

Proxy Cache and Privacy

Some people are interested in using proxy cache not for any benefit in bandwidth conservation, but in order to disguise their IP address from the webserver, or to avoid some domain restrictions that may be in place at some gateway. The trojan "Ring Zero" in late 1999 used a distributed mapping technique to discover open Web proxies, possibly for this purpose.

Note that some proxy agents, such as Squid, make the original IP address available using an HTTP header such as X-Forwarded-For, and do not strip cookies or otherwise anonymize the original request. Those requiring anonymous access may use a service such as ZeroKnowledge, which is designed specifically for this purpose.

Getting the Icons

The icons should be copied to your server. The image size may be adjusted on many browsers by changing the width and height values.

Linking these icons is no longer unconditionally recommended. However, in keeping with the spirit of Cache Now!, you should share links to the icons: for instance, all the virtual servers on one CPU should use the same copy, and several servers within one organization can share one copy. This will protect you from having broken icons on your page if this server is unavailable, either from downtime or through network breaks. There will be many If-Modified-Since requests which must be fulfilled at small overhead.
You are welcome to copy the files and make links available. Files may be mirrored from ftp://vancouver-webpages.com/CacheNow/. Please let me know if you do this.

See Cache Now! Partners for a list of sites mirroring the icons.

There are two icons, a static PNG and an animated GIF. In Netscape Navigator, animated GIFs may be stopped with the Stop button and re-started by clicking the Location bar and pressing Return, but some people don't like them.
Thanks to Grant Bayley for an improved (smaller) animated GIF (afaik created with licensed software).

For the Cache Now! icons, the script age.pl was previously used to artificially set the modification date back, so that some cache flushing algorithms would assume they are a stable file and cache them for longer. Currently the Apache ExpiresByType directive is used to give the icons an expiry time of one month after loading. In future some HTTP/1.1 cache headers may be included.

Cache Now! (animated GIF, 5.7kb) Cache Now! (PNG, 497 bytes)

<!-- animated GIF -->
<a href="http://vancouver-webpages.com/CacheNow/">
<img src="cache_now_anim.gif"
alt="Cache Now!" width=100 height=31></a>

<!-- static PNG -->
<a href="http://vancouver-webpages.com/CacheNow/">
<img src="cache_now.png" 
alt="Cache Now!" width=100 height=31></a>

Intrusion Detection & Cache

Recently (October 2000) there have been reports of cache hits being reported as intrusion attempts. In this cache model, as implemented by Akamai, HTTP requests are made to the origin server but are filled by a cache server located at a closer point on the net. Some IDS systems report this as suspicious activity. See for instance this BCNet Tech Bulletin.

Other Links

Sponsorship

Suggestions, corrections and sponsorship of this page are welcome. Please email

Statistics

Icon Backlinks (thanks for the support)

Access statistics for the CacheNow icons here.

Week of Nov. 3rd:
cache_now.png 14k hits, 18% No change (cache update request)
cache_now.anim.gif 43k hits, 8% No change

Hosted by Vancouver Webpages