======================
Gamera kd-tree library
======================

Introduction
------------

A *kd-tree* is multidimensional generalization of a binary
search tree. It can be used efficiently for range queries and nearest
neighbor searches, provided the dimension is not to high. In document
analysis problems, the dimension is typically two, so that kd-trees
can be a powerful utility for layout analysis problems.

For a detailed description of the present kd-tree implementation with
examples for use cases, see

  C. Dalitz: `Kd-Trees for Document Layout Analysis.`__
  In C. Dalitz (Ed.): "Document Image Analysis with the Gamera Framework."
  Schriftenreihe des Fachbereichs Elektrotechnik und Informatik,
  Hochschule Niederrhein, vol. 8, pp. 39-52, Shaker Verlag (2009)

.. __: http://lionel.kr.hsnr.de/~dalitz/data/publications/sr09-kdtree-layout.pdf

Additional background information on kd-trees can be found in the 
following literature:

- for an introduction to kd-trees and basic properties, see [deBerg2000]_
- more algorithms for kd-trees can be found in the original article 
  [Bentley1975]_
- neither of the above two references covers nearest neighbor
  searches in kd-trees; these are coverd in [Friedman1977]_


Examples
--------

Here is an example for looking up the three nearest neighbors to
a given point from a set of sample points:

.. code:: Python

   from gamera.kdtree import *

   points = [(1,4), (2,4), (1,5), (3,6), (8,9),
             (2,7), (4,4), (5,5), (4,6), (8,3)]
   nodes = [KdNode(p) for p in points]
   tree = KdTree(nodes)

   # neighbors to a sample point not from the set
   point = [5,6]
   k = 3
   knn = tree.k_nearest_neighbors(point, k)
   print "%i neighbors of (%i,%i):" % (k,point[0], point[1]),
   for node in knn:
       print "(%i,%i)" % (node.point[0], node.point[1]),
   print "" # final newline

   # neighbors to a sample point from the set
   # we must query k+1 neighbors, because one of them is
   # the sample point (the first entry in the returned list)
   point = [5,5]
   k = 3
   knn = tree.k_nearest_neighbors(point, k+1)
   print "%i neighbors of (%i,%i):" % (k,point[0], point[1]),
   for node in knn[1:]:
       print "(%i,%i)" % (node.point[0], node.point[1]),
   print "" # final newline

The property *KdNode.data* can store an arbitrary Python object
associated with the point. The following example represents each
connected component by its middle point and stores the actual
CC with the point in the node:

.. code:: Python

   ccs = image.cc_analysis()
   nodes = [KdNode([cc.center.x,cc.center.y], cc) for cc in ccs]

It is also possible to search for only those neighbors that fulfil
a certain predicate. To this end, you must define a function or callable
class that takes a KdNode as input and returns ``True`` when it fulfils
the search predicate. The following example only returns nearest neighbors
below the search point:

.. code:: Python

   # predicate definition as callable class
   class predicate(object):
       def __init__(self, point):
           self.point = point
       def __call__(self, node):
           return (point[1] > node.point[1])

   # find the three nearest neighbors below
   point = (5,6)
   knn = tree.k_nearest_neighbors(point, 3, predicate(point))

Beware however, that a search predicate can have the effect that
less than k neighbors are returned, because too few nodes fulfil
the predicate. In such a case, the runtime is O(n) instead of O(log(n)).
In other words, a poorly chosen search predicate can have a considerable
negative impact on the runtime. In some cases, modifying the distance metric
(method *set_distance*) can be an alternative to a search predicate
[Dalitz09]_.


The Kd-Tree Python API
----------------------


``KdNode`` objects
''''''''''''''''''

Each ``KdNode`` has two properties:

 *point*
   The geometric location of the node as a sequence of coordinate
   numbers. The coordinate numbers can be floats or ints.
   This is an immutable property, because changing the geometric
   location of nodes belonging to an already built kd-tree obviously
   breaks subsequent search operations.

 *data*
   An arbitrary Python object connected to the location *point*.

.. docstring:: gamera.kdtree KdNode

``KdTree`` objects
''''''''''''''''''

Each kd-tree is represented by instances of the ``KdTree`` class.
Even though there are general kd-tree algorithms to add and remove
nodes dynamically (see [Bentley1975]_), the present implementation
does not support alteration of a once built tree. This has the
consequence that tree nodes must be passed to the contructor of
``KdTree``.

A ``KdTree`` has the following (read only) properties:

 *dimension*
   The dimension of the kd-tree. This is automatically determined
   by the constructor.

.. docstring:: gamera.kdtree KdTree

.. docstring:: gamera.kdtree KdTree set_distance

.. docstring:: gamera.kdtree KdTree k_nearest_neighbors


The Kd-Tree C++ API
-------------------

The module ``gamera.kdtree`` is only a thin Python wrapper around
a C++ class ``KdTree``. This can also be used directly in C++ plugins.

Compilation and linkage
'''''''''''''''''''''''

The header file *kdtree.hpp* declares the necessary structures in
the namespace ``Gamera::Kdtree``. It is installed with the other gamera
header files, and can thus be included with

.. code:: CPP

   #include "geostructs/kdtree.hpp"
   using namespace Gamera::Kdtree;

The tricky part is getting your plugin module to be linked with the
actual kdtree implementation. This is achieved by adding the
source file ``kdtree.cpp`` to the ``cpp_sources`` property in the
plugin Python interface.

In the gamera core code, the following works:

.. code:: Python

   class ExampleModule(PluginModule):
       category = "MyPlugins"
       cpp_headers = ["myplugins.hpp"]
       cpp_sources = ["src/geostructs/kdtree.cpp"]
       functions = [myplugin1, myplugin2]
   module = ExampleModule()

In a toolkit, this will not work, because the path names in
the ``cpp_sources`` property are relative to the location
of the *setup.py* script. To allow for the use of the KdTree C++
class in toolkits, the source file *kdtree.cpp* is installed together
with Gamera. You can thus specify this file in ``cpp_sources``
as follows:

.. code:: Python

   class ExampleModule(PluginModule):
       # ...
       import os, gamera
       gamera_root = os.path.dirname(gamera.__file__)
       cpp_sources = [os.path.join(gamera_root, "src/geostructs/kdtree.cpp")]
       # ...

Querying the Gamera installation directory through the *__file__* property
of the gamera module is safer than using *get_python_lib()* from
``distutils.sysconfig``, because it also works when Gamera is not
installed into the standard location for python extensions.


Usage
'''''

For normal use of the kdtree, you will need the classes 
``CoordPoint`` (a typedef for ``vector<double>``), ``KdNode``, 
``KdNodeVector``, and ``KdTree``. Beside the property ``point``, a
``KdNode`` can also store an arbitrary pointer as ``data``. See
the header file *kdtree.hpp* for details.

Here is a usage example for a nearest neighbor search:

.. code:: CPP

   // set points for building the tree
   KdNodeVector nodes;
   double points[][2] = {
     {1,4}, {2,4}, {1,5}, {3,6}, {8,9},
     {2,7}, {4,4}, {5,5}, {4,6}, {8,3},
     {-20,-20} // array terminator
   };
   size_t i;
   for (i=0; points[i][0]>=-1.0; i++) {
     CoordPoint p;
     p.push_back(points[i][0]);
     p.push_back(points[i][1]);
     nodes.push_back(KdNode(p));
   }

   // build the tree
   KdTree tree(&nodes);

   // find the three nearest neighbors to (5,6)
   KdNodeVector neighbors;
   CoordPoint point(2);
   point[0] = 5;
   point[1] = 6;
   tree.k_nearest_neighbors(point, 3, &neighbors);

If you want to restrict the returned neighbors to only those fulfilling
a specific search predicate, you can define your search predicate as
a callable class (aka *functor*, i.e. a class with the call operator
``operator()`` overwritten [Stroustrup1997]_) derived from 
``KdNodePredicate`` as in the following example:

.. code:: CPP

   // only search for nodes on the right hand side of search point
   struct MyPredicate : public KdNodePredicate {
     CoordPoint point(2);
     MyPredicate(CoordPoint& p) {
       point = p;
     }
     bool operator()(const KdNode& kn) const {
       return point[0] < kn.point[0];
     }
   };

The call of the class only takes a single argument, the ``KdNode``, and
returns true when this node is an admissable search result. If this
criterion depends on some information from the actual search point (like
its *x*-component in the exampel above), this information must be passed to
the class constructor and stored within the class.

An instance of this class can then be passed as the third, optional argument
to ``KdTree.k_nearest_neighbors``:

.. code:: CPP

   KdNodeVector neighbors;
   CoordPoint point(2);
   point[0] = 5;
   point[1] = 6;
   MyPredicate predicate(point);
   tree.k_nearest_neighbors(point, 3, &neighbors, &predicate);


References
----------

.. [deBerg2000] M. de Berg, M. van Kreveld, M. Overmars, O. Schwarzkopf:
   *Computational Geometry.* Second edition, Springer (2000)

.. [Bentley1975] J.L. Bentley: *Multidimensional Binary Search Trees Used
   for Associative Searching.* Communications of the ACM 18,
   pp. 509-517 (1975)

.. [Friedman1977] J.H. Friedman, J.L. Bentley, R.A. Finkel:
   *An Algorithm for Finding Best Matches in Logarithmic Expected Time.*
   ACM Transcations on Mathematical Software 3, pp. 209-226 (1977)

.. [Stroustrup1997] Stroustrup, B. 1997. *The C++ Programming
   Language: Third Edition.*  Reading, MA: Addison-Wesley.

.. [Dalitz09] C. Dalitz: `Kd-Trees for Document Layout Analysis.`__
  In C. Dalitz (Ed.): "Document Image Analysis with the Gamera Framework."
  Schriftenreihe des Fachbereichs Elektrotechnik und Informatik,
  Hochschule Niederrhein, vol. 8, pp. 39-52, Shaker Verlag (2009)

.. __: http://lionel.kr.hsnr.de/~dalitz/data/publications/sr09-kdtree-layout.pdf
