HISTORY 

A while ago, I was looking for a way to cache my downloaded debian
<http://www.debian.org> (unstable) packages. I have some computers on a LAN, and
it seemed like a waste of time and bandwidth to download the same packages over
and over again for each computer. So I looked into some existing solutions... 

squid <http://www.squid-cache.org> 

Squid is a very well known proxy server, which (amongst others) maintains a
cache of downloaded files. Quite a few people seem to use squid for the purposes
I'm interested in. What bothers me is that these files are accessible only by
squid itself. It is (as far as I know) not possible to get a list of all cached
files, let alone to manually install or copy a file from cache. This is probably
a consequence of squids immense functionality, but for my purpose it seems most
logical to just store files as files. 

apt-proxy <http://apt-proxy.sourceforge.net> 

Apt-proxy is a proxy server written specially for the purpose of maintaining a
debian cache. The current version is a bash script, so speed is not optimal but
it easily beats my internet connection... Actually it isn't a real proxy server
but a normal http server that serves files from some configured remote hosts. 
These hosts are configured in both sources.list and the apt-proxy configuration
file. I believe either location has its pros and cons, but in requiring both
you'll end up with cons only. I've used it for a while, but I got a strange http
error every now and then that screwed up apt so I had to constantly restart. 
There is also a beta rewrite in python <http://www.python.org>, which I also
tried, but I don't remember being to satisfied with that either. 

apt-cacher <http://www.apt-cacher.org> 

This one looked really promising. It is a specialized debian proxy server, just
like apt-proxy, but it is written in perl as an apache module. Again, this isn't
a real proxy, but this time sources are defined in sources.list only so that's
an improvement. Moreover, this one works (!) and it even seems to store files in
the normal way that seems so obvious to me. But... when inspecting the files in
cache they appear to be somehow altered, for a reason I cannot understand, but
they can't be viewed nor installed manually. Apart from that I don't like the
idea of running an entire apache server for a debian cache only. 

debproxy <http://sourceforge.net/projects/debproxy> 

Yet another specialized proxy server, a true proxy this time. As you can see
from the "this project has not released any files" it never came out of cvs. I
can't remember if I ever had it up and running. However judging from the
description this project would have been ideal if it would have been finished. 
It is a real proxy, simply caching all files that are downloaded through it or
serving from cache if they're there already. Creator lordsuch
<http://sourceforge.net/users/lordsutch> named this kind of proxy a "replicating
proxy". This name seemed so obvious that I immediately googled for more... but
only debproxy showed up. That's when I began to realize that the proxy I was
looking for didn't yet exist. Hence my own attempt: 

REPLICATOR 

Replicator is a general purpose proxy server. That is, not specifically geared
towards debian caching. It can be used for any http client, for instance a web
browser. While browsing the web it will simply save all the files that go
through it. The data is transparently streamed to all clients simultaneously
requesting the same file, so everything is downloaded only once. When a
requested file is already in cache a check is added to the http header to see if
the file is modified since the time it was cached. If it isn't than the file is
served from cache, preventing it from being downloaded again. Real speed gains
are obtained when the server is configured not to check for modifications but to
serve the cached file directly. 

To use replicator for debian caching modify or create /etc/apt/apt.conf to
contain the following line: 
-- Acquire::http::Proxy "http://(HOST):(PORT)"; 
Now apt and similar will download through the replicator proxy, caching all
packages that are installed. This has worked flawlessly on my computer for quite
some time now. The advantage of a general proxy rather than a specialized one
becomes apparent when installing packages such as 'msttcorefonts' and
'flashplugin-nonfree'. These packages offer the possibility to use a proxy
server, so not only the packages themselves but also related files can be
cached. As of version 2.0 replicator is also suitable for maintaining a gentoo
<http://www.gentoo.org> package cache, mainly because of the new flat mode - see
below. And as of version 2.1 it is also possible to forward requests to an
external proxy server. 

In version 3.0 a lot of new features are introduced, such as... cache browsing! 
Yes, that's right, replicator can now act as a web server. Simply visit the
address where you have replicator configured to listen for proxy connections in
a browser and you'll get an html listing of the files in cache. And the great
thing is that the streaming principle still applies, so even incomplete files
can be served. Another new feature is client-side support for range requests, so
incomplete downloads can now be resumed. For a complete list of changes see the
changelog <changelog>. All this added functionality required some fairly 'deep'
changes in code. The interested will notice that the [LocalResponse] construct
has disappeared, which is what I personally like most about this version :-). 

The following command line options control the server: 

* [-p, --port PORT]: 
  The proxy port on which the server listens for http requests. Default is 8080. 
* [-i, --ip IP]: 
  The ip addresses from which access is allowed, optionally with wildcards ? and
  * for single and multiple digits respectivily. Defaults to localhost and can
  be used multiple times. 
* [-d, --dir DIR]: 
  The directory to store the cached files in. By default files are simply saved
  under the current directory. 
* [-a, --alias STR]: 
  Colon separated list of mirrors, i.e. hosts serving identical content. Useful
  for caching (debian) packages from different mirrors to the same directory,
  like this: 'cache/dir:host1/somepath:host2/other/path'. Can be used multiple
  times. 
* [-s, --static]: 
  Static mode: files are known to never change so files that are present are
  served from cache directly without contacting the server. Primarily useful for
  gentoo caching, not for debian because of the changing Packages.gz and Release
  files. 
* [-f, --flat]: 
  Flat mode: all files are saved in a single directory. Primary use for gentoo. 
* [-e, --external]: 
  Forward requests to an external proxy server, specified as host:port or
  username:password@host:port if the server requires authentication. 
* [-q, --quiet]: 
  Decrease verbosity. Messages on stdout all have a certain level of importancy. 
  By repeatedly selecting this option the least important messages are skipped. 

Daemon options: 

* [--daemon]: 
  Switches to daemon mode and activates the following options: 
* [--log LOG]: 
  Write output to a log file instead of stdout. 
* [--pid PID]: 
  Write the process id to a pid file instead of stdout. 
* [--user USER]: 
  Change the process user id after opening the log and pid file. 

Debugging options: 

* [--debug]: 
  Set the lowest possible logging level (the quiet option is ignored) and
  activate the following options: 
* [--intolerant]: 
  Crash on unhandled exceptions. 

When the server is started from [/etc/init.d] the command line options are read
from [/etc/default/http-replicator]. Here two variables are defined:
[DAEMON_OPTS] that contains the command line options for the server and
[MAINTENANCE_OPTS] that contains the options for the maintenance (cron) script. 
When replicator is just installed all files are in place but the server is not
yet activated. For that you must remove the 'exit' command in
[/etc/default/http-replicator]. This is done for two reasons. First, you might
not want to have the server running all the time but just use it from the
command line. As I said, package caching is not replicator's only purpose. 
Second, if you do want to have replicator running all the time you need to
create the caching directory first and set up the right permissions. You may
also want to change the port and select the static or flat operation mode before
starting the server. 

I already mentioned the cron maintenance script that comes with replicator. This
script helps keeping the package cache down in size. Without such script a
package cache tends to grow very rapidly to immense proportions. The maintenance
script simply deletes all packages of which more than a certain number of newer
versions are present. This number is set on the commandline with [--keep], which
is read from [/etc/default/http-replicator]. The script is disabled by setting
the number to zero or less. 

LINKS 

Some websites that are somehow related to this project: 

* replicator home <http://gertjan.freezope.org/replicator> 
* freshmeat project_page <http://freshmeat.net/projects/http-replicator> 
* gentoo specific replicator howto <http://gentoo-wiki.com/HOWTO_Download_Cache_for_LAN-Http-Replicator> 
* gentoo discussion forum <http://forums.gentoo.org/viewtopic.php?t=173226> with replicator ebuild 
* overview <http://www.xhaus.com/alan/python/proxies.html> of http proxies written in python 
* nice introduction <http://www.jmarshall.com/easy/http> to the HTTP protocol 
* complete HTTP/1.1 protocol <http://www.w3.org/Protocols/rfc2616/rfc2616.html> 
