TODO list for agedu
===================

 - flexibility in the HTML report output mode: expose the internal
   mechanism for configuring the output filenames, and allow the
   user to request individual files with hyperlinks as if the other
   files existed. (In particular, functionality of this kind would
   enable other modes of use like the built-in --cgi mode, without
   me having to anticipate them in detail.)

 - non-ASCII character set support
    + could usefully apply to --title and also to file names
    + how do we determine the input charset? Via locale, presumably.
    + how do we do translation? Importing my charset library is one
      heavyweight option; alternatively, does the native C locale
      mechanism provide enough functionality to do the job by itself?
    + in HTML, we would need to decide on an _output_ character set,
      specify it in a <meta http-equiv> tag, and translate to it from
      the input locale
       - one option is to make the output charset the same as the
         input one, in which case all we need is to identify its name
         for the <meta> tag
       - the other option is to make the output charset UTF-8 always
         and translate to that from everything else
       - in the web server and CGI modes, it would probably be nicer
         to move that <meta> tag into a proper HTTP header
    + even in text mode we would want to parse the filenames in some
      fashion, due to the unhelpful corner case of Shift-JIS Windows
      (in which backslashes in the input string must be classified as
      path separators or the second byte of a two-byte character)
       - that's really painful, since it will impact string processing
         of filenames throughout the code
       - so perhaps a better approach would be to do locale processing
         of filenames at _scan_ time, and normalise to UTF-8 in both
         the index and dump files?
          + involves incrementing the version of the dump-file format
          + then paths given on the command line are translated
            quickly to UTF-8 before comparing them against index paths
          + and now the HTML output side becomes easy, though the text
            output involves translating back again
          + but what if the filenames aren't intended to be
            interpreted in any particular character set (old-style
            Unix semantics) or in a consistent one?
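
   The Shift-JIS corner case above can be sketched as follows. This
   is only an illustration, not code from agedu: the function names
   are hypothetical, and the lead-byte ranges are the ones I believe
   Windows code page 932 uses (0x81-0x9F and 0xE0-0xFC). The point is
   that a 0x5C byte only counts as a path separator when it is not
   the second byte of a two-byte character.

```c
#include <assert.h>
#include <stddef.h>

/* Is c the first byte of a two-byte Shift-JIS character?
 * (Ranges assumed from Windows code page 932.) */
static int sjis_is_lead(unsigned char c) {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Return a pointer to the last backslash in s that is a genuine path
 * separator, or NULL if there is none. Hypothetical helper. */
const char *sjis_last_separator(const char *s) {
    const char *last = NULL;
    assert(s != NULL);
    while (*s) {
        if (sjis_is_lead((unsigned char)*s) && s[1]) {
            s += 2;              /* skip both bytes; the 2nd may be 0x5C */
        } else {
            if (*s == '\\')
                last = s;        /* a real separator */
            s++;
        }
    }
    return last;
}
```

   For example, in the byte string 'C:\' 0x95 0x5C '\f.txt' the
   second backslash byte belongs to a two-byte character, so only the
   first and third are separators. Every place that currently scans
   filenames for '/' would need this treatment, which is why
   normalising to UTF-8 at scan time looks more attractive.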

 - we could still be using more of the information coming from
   autoconf. Our config.h is defining a whole bunch of HAVE_FOOs for
   particular functions (e.g. HAVE_INET_NTOA, HAVE_MEMCHR,
   HAVE_FNMATCH). We could usefully supply alternatives for some of
   these functions (e.g. cannibalise the PuTTY wildcard matcher for
   use in the absence of fnmatch, switch to vanilla truncate() in
   the absence of ftruncate); where we don't have alternative code,
   it would perhaps be polite to throw an error at configure time
   rather than allowing the subsequent build to fail.
    + however, I don't see anything here that looks very
      controversial; IIRC it's all in POSIX, for one thing. So more
      likely this should simply wait until somebody complains.

 - run-time configuration in the HTTP server
    * I think this probably works by having a configuration form, or
      a link pointing to one, somewhere on the report page. If you
      want to reconfigure anything, you fill in and submit the form;
      the web server receives an HTTP GET with parameters and a
      Referer header, adjusts its internal configuration, and
      returns an HTTP redirect back to the referring page - which it
      then re-renders in accordance with the change.
    * All the same options should have their starting states
      configurable on the command line too.

 - curses-ish equivalent of the web output
    + try using xterm 256-colour mode. Can (n)curses handle that? If
      not, try doing it manually.
    + I think my current best idea is to bypass ncurses and go
      straight to terminfo: generate lines of attribute-interleaved
      text and display them, so we only really need the sequences
      "go here and display stuff", "scroll up", "scroll down".
    + Infrastructure work before doing any of this would be to split
      html.c into two: one part to prepare an abstract data
      structure describing an HTML-like report (in particular, all
      the index lookups, percentage calculation, vector arithmetic
      and line sorting), and another part to generate the literal
      HTML. Then the former can be reused to produce very similar
      reports in coloured plain text.
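
   A minimal sketch of the "lines of attribute-interleaved text"
   idea, assuming we hard-code the xterm 256-colour sequence (SGR
   48;5;N for the background) rather than looking anything up in
   terminfo. render_bar() and its parameters are hypothetical names,
   and the colour of each cell is taken as given: working that out
   is the job of the abstract report structure described above.

```c
#include <assert.h>
#include <stdio.h>

/* Render one bar into out: cell i gets background colour colours[i]
 * (an xterm 256-colour palette index). Returns the length of the
 * string written, or -1 if out is too small. */
int render_bar(char *out, size_t outlen, const unsigned char *colours,
               int width) {
    size_t used = 0;
    int i, n, prev = -1;
    assert(out != NULL && outlen > 0);
    out[0] = '\0';
    for (i = 0; i <= width; i++) {
        if (i == width)                  /* reset attributes at the end */
            n = snprintf(out + used, outlen - used, "\033[0m");
        else if (colours[i] == prev)     /* same colour: just one cell */
            n = snprintf(out + used, outlen - used, " ");
        else                             /* change background, then cell */
            n = snprintf(out + used, outlen - used, "\033[48;5;%dm ",
                         colours[i]);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;                   /* would not fit */
        used += (size_t)n;
        if (i < width)
            prev = colours[i];
    }
    return (int)used;
}
```

   Adjacent cells of the same colour share one escape sequence, so a
   long bar in a single colour costs one sequence plus one space per
   cell.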

 - abstracting away all the Unix calls so as to enable a full
   Windows port. We can already do the difficult bit on Windows
   (scanning the filesystem and retrieving atime-analogues).
   Everything else is just coding - albeit quite a _lot_ of coding,
   since the Unix assumptions are woven quite tightly into the
   current code.
    + If nothing else, it's unclear what the user interface properly
      ought to be in a Windows port of agedu. A command-line job
      exactly like the Unix version might be useful to some people,
      but would certainly be strange and confusing to others.

 - replacing the index data structure with a layered range tree.
    + The basic idea of the data structure:
       * store a search tree ordered by atime, in which every node
         stores an array of pairs (pathname index, cumulative size)
         describing the data in that node's subtree, ordered by the
         pathname index.
       * Also, alongside each element in a node's array is the index
         of the corresponding location in the sort order in the array
         at each child node.
    + To search the tree for a given (path index, atime) query point:
       * Start by doing a log-time search for the path index in the
         array at the root node.
       * Now, every time you descend to a child node, you can use the
         crosslinks between the arrays to find the corresponding
         search position in the new node's array, in constant time.
       * So your overall search still takes log time, and from every
         node you can read off the total size of stuff in that subtree
         less than the query path index, which is enough to find the
         total size of stuff in the desired quadrant.
    + This also enables a 3-sided query (i.e. specifying both start
      and end pathname index) in a single operation: just start by
      finding _both_ your pathname index endpoints in the root node's
      array, and then as you descend the tree you can read off the
      total size of stuff in each subtree _between_ those two
      endpoints, by subtracting the two cumulative sizes.
       * For the existing kind of query, this is only a minor
         optimisation, and doesn't affect the asymptotic performance
         of the query.
       * But the advantage is that it also lets you do an inverse
         lookup, of the form 'Given a pathname interval (subdirectory)
         and a target *size* X, find an atime T such that the amount
         of data in that subdir older than T is X.' In the current
         agedu data structure that takes log^2 time, because you have
         to do a log-time query for each step of a binary search over
         candidate atimes; but if you do a 3-sided search in this
         range tree system, you can use the subtree sizes instead of
         the tree sort order to decide which direction to go as you
         descend the tree, just like querying an ordinary annotated
         tree.
       * I think this inverse lookup could be helpful for speeding up
         the HTML report generation. The normal query answers the
         question 'Where should the boundary be between these two
         colour values?', and agedu runs that query once per output
         colour. But the inverse query answers 'What colour should
         _this pixel_ in the output bar be?', so for very small bars
         (of which there will be a lot), inverse queries would surely
         get the needed information faster.
          - Also, I quite like the idea of a hybrid strategy:
             + for a bar with fewer pixels than colours, determine the
               colour of its middle pixel by an inverse lookup, then
               recurse into each half
             + but if there are more pixels than colours, determine
               the location of the middle colour by a forward lookup,
               and recurse similarly
             + and at each level of the recursion, pick one of those
               strategies as appropriate.
    + Without having actually implemented it in practice, my guess is
      that this structure as described above is comparable in both
      space and time to the current one. It's still N log N space in
      principle; you still get to optimise to some degree by
      disregarding uninteresting differences between pathname indices
      (in this case, by compressing the arrays at each node, rather
      than by discarding intermediate states of the evolving tree);
      and it still takes log time to answer the existing kind of
      query, namely 'what is the total size of stuff in this subdir
      older than this atime?'.
    + There's also a way to compress the index to a smaller size, at
      the cost of query time, by replacing the array at each tree node
      with a bit-vector that just says which child each logical array
      element came from, and storing a cumulative element count and
      cumulative size per _word_ of the bit-vector instead of a
      cumulative size for every element. This means that at every step
      down the tree you can only compute the cumulative size
      approximately from the data actually stored at that tree node,
      and to fix up the errors you have to look at the two partial
      words at each end of your search and chase each individual
      element down the tree to the bottom to find its exact size. Done
      naively, this gives you O(log^3 n) lookup time (at each of the
      log n steps down the tree, there are log n extra points you have
      to sort out, each of which needs another log n pass to the
      bottom of the tree); you can improve on that by storing extra
      indices (also bit-packed) that allow you to jump multiple levels
      down the tree, permitting an assortment of tradeoffs that let
      you _almost_ get rid of a log factor - you can have O(n) space
      and O(log^{2+e} n) time for e as small as you want (paying for a
      smaller e with a larger constant factor), or conversely
      O(n log^e n) space and exactly O(log^2 n) query time, or the
      in-between compromise of O(n log log n) space and O(log^2 n
      log log n) query time.
       * source: "A Functional Approach to Data Structures and Its Use
         in Multidimensional Searching", B. Chazelle, SIAM J. Comput.
         17 (1988), 427-462
       * and I _think_ it should in principle be possible to vary the
         frequency of cumulative counts/sizes so as to trade off space
         and time smoothly between that compressed representation and
         the full layered range tree.
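
   To make the uncompressed structure concrete, here is a rough
   sketch under several assumptions of my own. It is not agedu code.
   The tree is an implicit balanced tree over atime ranks, stored one
   level at a time; each level holds a running cumulative size and a
   running count of entries that go to the left child, which plays
   the role of the crosslinks. All names (File, RangeTree, rt_new,
   rt_query, rt_inverse, below) are hypothetical, and there is no
   compression, error handling or freeing.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical input record; rank is filled in by rt_new(). */
typedef struct { long path, atime, size; int rank; } File;

typedef struct {
    int n;
    long *atimes;    /* access times, ascending (index = atime rank) */
    long *paths;     /* root node's array: pathname indices, ascending */
    long *cum[32];   /* cum[d][lo+i]-cum[d][lo]: bytes in first i entries */
    int *nleft[32];  /* nleft[d][lo+i]-nleft[d][lo]: how many go left */
} RangeTree;
static int by_atime(const void *x, const void *y) {
    const File *a = x, *b = y;
    return (a->atime > b->atime) - (a->atime < b->atime);
}
static int by_path(const void *x, const void *y) {
    const File *a = x, *b = y;
    return (a->path > b->path) - (a->path < b->path);
}
/* Number of entries of the sorted array v[0..n) that are less than key. */
static int below(const long *v, int n, long key) {
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (v[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
}
/* Node at depth d covers atime ranks [lo,hi); cur[lo..hi) is in path order. */
static void build(RangeTree *t, File *cur, File *tmp, int d, int lo, int hi) {
    int mid = lo + (hi - lo) / 2, l = lo, r = mid, i;
    for (i = lo; i < hi; i++) {
        int left = cur[i].rank < mid;
        t->cum[d][i + 1] = t->cum[d][i] + cur[i].size;
        t->nleft[d][i + 1] = t->nleft[d][i] + left;
        tmp[left ? l++ : r++] = cur[i];        /* stable split by child */
    }
    if (hi - lo > 1) {   /* the children read tmp and use cur as scratch */
        build(t, tmp, cur, d + 1, lo, mid);
        build(t, tmp, cur, d + 1, mid, hi);
    }
}
RangeTree *rt_new(const File *files, int n) {
    assert(n > 0);
    RangeTree *t = malloc(sizeof *t);
    File *cur = malloc(n * sizeof *cur), *tmp = malloc(n * sizeof *tmp);
    int i, d, levels = 1;
    while ((1LL << (levels - 1)) < n) levels++;   /* ceil(log2 n) + 1 */
    t->n = n;
    t->atimes = malloc(n * sizeof *t->atimes);
    t->paths = malloc(n * sizeof *t->paths);
    for (d = 0; d < levels; d++) {
        t->cum[d] = calloc(n + 1, sizeof **t->cum);
        t->nleft[d] = calloc(n + 1, sizeof **t->nleft);
    }
    for (i = 0; i < n; i++) cur[i] = files[i];
    qsort(cur, n, sizeof *cur, by_atime);
    for (i = 0; i < n; i++) { cur[i].rank = i; t->atimes[i] = cur[i].atime; }
    qsort(cur, n, sizeof *cur, by_path);
    for (i = 0; i < n; i++) t->paths[i] = cur[i].path;
    build(t, cur, tmp, 0, 0, n);
    free(cur); free(tmp);
    return t;
}

/* Forward query: bytes in files with path in [plo,phi) and atime < T. */
long rt_query(const RangeTree *t, long plo, long phi, long T) {
    int k = below(t->atimes, t->n, T), lo = 0, hi = t->n, d = 0;
    int a = below(t->paths, t->n, plo), b = below(t->paths, t->n, phi);
    long total = 0;
    if (k >= hi) return t->cum[0][b] - t->cum[0][a];
    while (k > lo) {                       /* here lo < k < hi */
        const int *nl = t->nleft[d++] + lo;    /* d is now the child level */
        int mid = lo + (hi - lo) / 2, la = nl[a] - nl[0], lb = nl[b] - nl[0];
        if (k >= mid) {                    /* whole left child is older */
            total += t->cum[d][lo + lb] - t->cum[d][lo + la];
            a -= la; b -= lb; lo = mid;
        } else { a = la; b = lb; hi = mid; }
    }
    return total;
}
/* Inverse query: the largest k such that, among the k oldest files
 * overall, those with path in [plo,phi) total at most x bytes. */
int rt_inverse(const RangeTree *t, long plo, long phi, long x) {
    int a = below(t->paths, t->n, plo), b = below(t->paths, t->n, phi);
    int lo = 0, hi = t->n, d = 0;
    while (hi - lo > 1) {
        const int *nl = t->nleft[d++] + lo;
        int mid = lo + (hi - lo) / 2, la = nl[a] - nl[0], lb = nl[b] - nl[0];
        long s = t->cum[d][lo + lb] - t->cum[d][lo + la];  /* left subtree */
        if (s <= x) { x -= s; a -= la; b -= lb; lo = mid; }
        else { a = la; b = lb; hi = mid; }
    }
    return (hi > lo && t->cum[d][lo + b] - t->cum[d][lo + a] <= x) ? hi : lo;
}
```

   Both queries do two binary searches at the root and then constant
   work per level. rt_inverse() chooses its direction from the
   subtree sizes rather than from the sort order, which is what
   avoids the extra log factor. It returns an atime rank; turning
   that into an actual atime T (and deciding what to do about ties)
   is left open here.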

 - A user requested what's essentially a VFS layer: given multiple
   index files and a map of how they fit into an overall namespace,
   we should be able to construct the right answers for any query
   about the resulting aggregated hierarchy by doing at most
   O(number of indexes * normal number of queries) work.

 - Support for filtering the scan by ownership and permissions. The
   index data structure can't handle this, so we can't build a
   single index file admitting multiple subset views; but a user
   suggested that the scan phase could record information about
   ownership and permissions in the dump file, and then the indexing
   phase could filter down to a particular sub-view - which would at
   least allow the construction of various subset indices from one
   dump file, without having to redo the full disk scan which is the
   most time-consuming part of all.
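
   A sketch of what the indexing-phase filter might look like,
   assuming the dump file gained owner and mode fields. DumpRecord,
   ViewFilter, view_accepts and view_total are hypothetical names and
   do not describe the current dump format; view_total() merely
   stands in for "feed the accepted records to the indexer".

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical dump-file record extended with ownership and mode. */
typedef struct {
    const char *path;
    unsigned long long size;
    long atime;
    uid_t uid;
    mode_t mode;
} DumpRecord;

/* Hypothetical sub-view selector. */
typedef struct {
    int match_uid;                /* if nonzero, require r->uid == uid */
    uid_t uid;
    mode_t mode_mask, mode_bits;  /* keep if (mode & mask) == bits */
} ViewFilter;

/* Does record r belong to the sub-view described by f? */
int view_accepts(const ViewFilter *f, const DumpRecord *r) {
    assert(f != NULL && r != NULL);
    if (f->match_uid && r->uid != f->uid)
        return 0;
    return (r->mode & f->mode_mask) == f->mode_bits;
}

/* Total size of the records that the sub-view keeps. */
unsigned long long view_total(const ViewFilter *f, const DumpRecord *recs,
                              size_t n) {
    unsigned long long total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        if (view_accepts(f, &recs[i]))
            total += recs[i].size;
    return total;
}
```

   A filter with match_uid set selects one user's files; a filter
   with mode_mask and mode_bits both 0004 selects world-readable
   ones. Each such filter would yield its own index file from the
   same dump.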
