-----------------------------------------
Release Notes for Trilinos Package Tpetra
-----------------------------------------

Development version (12.??)
---------------------------

* Fix #1683

Tpetra now enables GO = long long by default, regardless of whether
this is a CUDA build.

* Fix #1088

* Add CMake option for whether Tpetra should assume that MPI
  is CUDA aware (#1571)

Add a CMake option Tpetra_ASSUME_CUDA_AWARE_MPI, with associated macro
TPETRA_ASSUME_CUDA_AWARE_MPI defined in TpetraCore_config.h.  If the
CMake option is ON, Tpetra may assume that the MPI implementation it
uses is CUDA aware.  See #1571 for discussion, and #1088 for an
application.

Tpetra's CMake logic attempts to detect whether the MPI implementation
is CUDA aware.  If automatic detection does not succeed, Tpetra just
makes the safe assumption that MPI is not CUDA aware.

Currently, automatic detection requires OpenMPI.  If not using
OpenMPI, Tpetra conservatively assumes lack of CUDA awareness.  It
would be wise for us to extend detection to support other MPI
implementations, but for now, this covers a common use case for
Trilinos testing.

Automatic detection depends on running an executable.  This is
relevant for cross compilation, so I have added two measures to
protect against misleading results in that case:

  1. If CMAKE_CROSSCOMPILING is ON, Tpetra skips detection and prints
     a configure-time message telling users that they may set
     Tpetra_ASSUME_CUDA_AWARE_MPI explicitly.

  2. If users set Tpetra_ASSUME_CUDA_AWARE_MPI explicitly, Tpetra
     skips detection and uses the user's value.
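
The decision described above can be sketched in plain C++.  This is
only a model of the configure-time behavior, not Tpetra's CMake logic.
The names Setting and assumeCudaAwareMpi are hypothetical, and the
result for the cross-compiling case (fall back to "not CUDA aware") is
my reading of the "safe assumption" paragraph above.

```cpp
// Tri-state value: a CMake cache option can be unset, ON, or OFF.
enum class Setting { Unset, On, Off };

// Hypothetical model of the configure-time decision.
//   userValue:      what the user passed for Tpetra_ASSUME_CUDA_AWARE_MPI
//   crossCompiling: value of CMAKE_CROSSCOMPILING
//   detected:       result of automatic detection (Unset if it failed)
inline bool
assumeCudaAwareMpi (Setting userValue, bool crossCompiling, Setting detected)
{
  if (userValue != Setting::Unset) {
    return userValue == Setting::On; // explicit user value wins; no detection
  }
  if (crossCompiling) {
    return false; // cannot run the test executable (assumed safe default)
  }
  return detected == Setting::On; // failed detection means "not CUDA aware"
}
```

Note that an explicit user value takes precedence even when cross
compiling, which is why the configure-time message points users at
the option.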

* Tpetra now enforces CUDA >= 7.5 (see #1278)

Tpetra has required CUDA >= 7.5 for a while, if building with CUDA
enabled.  Now, Tpetra's CMake logic enforces CUDA_VERSION >= 7.5 at
configure time.  See #1278 for discussion.

Development version (12.10)
---------------------------

* Build time and size improvements (fix #700)

KokkosKernels now only pre-builds the sparse matrix-vector multiply
kernels that Tpetra needs.  Also, for integer Scalar types,
KokkosKernels no longer optimizes sparse matrix-vector multiply for
multiple right-hand sides.  It does so only for non-integer (e.g.,
floating-point) Scalar types.  This reduces build time and size.  (See
Github Issue #700.)  Furthermore, KokkosKernels now only pre-builds
sparse matrix-vector multiply for the default offset type.

* Removed "using Teuchos::*" declarations from Tpetra_ConfigDefs.hpp

Tpetra no longer imports Teuchos classes like Comm and RCP (among
others) into the Tpetra namespace.  This will help us eventually
remove all the Teuchos_*.hpp header file includes from
Tpetra_ConfigDefs.hpp, thus improving build time.

* MultiVector: Add new two-argument randomize(min,max)

* MultiVector: Get rid of old-interface DistObject methods

Tpetra::MultiVector implements the new DistObject interface.  Thus, it
no longer needs to provide implementations for the following three
old-interface DistObject methods:

  - createViews
  - createViewsNonConst
  - releaseViews

* Optimize Map::replaceCommWithSubset for MPI_COMM_SELF (#673)

* Fixed many other issues

Issues fixed include (but are not limited to) #699, #680, #638, #617,
#607, #603, #601, #597, #561, and #46.


Trilinos 12.8
-------------

* Stop creating Node instances explicitly!

Hi users!  Please don't create Node instances explicitly any more.
Tpetra::Map creates one for you, if you really need one.  You really
don't need Node instances: Map's constructors and nonmember
"constructors" don't need them any more, nor do Tpetra's Matrix Market
readers.

Creating Node instances explicitly causes issues with Kokkos
initialization.  Node will go away eventually, in favor of Kokkos
execution spaces and memory spaces.

* Lots of bug fixes, especially for CUDA

* Computing offsets in CrsGraph and CrsMatrix is now thread parallel

CrsGraph's and CrsMatrix's fillComplete method computes row offsets,
if they have not yet been computed.  This is now thread parallel.  It
uses Kokkos::parallel_scan.
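
For readers unfamiliar with the operation, here is a sequential sketch
of what "computing row offsets" means: a prefix sum over the number of
entries in each row.  It uses std::partial_sum as a stand-in for the
parallel scan and is not Tpetra's implementation.  The function name
computeRowOffsets is hypothetical.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Given the number of entries in each local row, compute CSR-style
// row offsets: offsets[i] is where row i starts, and offsets.back()
// is the total number of entries.
inline std::vector<std::size_t>
computeRowOffsets (const std::vector<std::size_t>& numEntriesPerRow)
{
  std::vector<std::size_t> offsets (numEntriesPerRow.size () + 1, 0);
  // Inclusive prefix sum written one slot to the right, so that
  // offsets[0] stays 0 and offsets[i+1] = offsets[i] + numEntriesPerRow[i].
  std::partial_sum (numEntriesPerRow.begin (), numEntriesPerRow.end (),
                    offsets.begin () + 1);
  return offsets;
}
```

For example, rows with 2, 0, and 3 entries give offsets 0, 2, 2, 5.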

* More BlockCrsMatrix kernels are thread parallel

* Interface changes to KokkosSparse::CrsMatrix (the "local" matrix)

The replaceValues and sumIntoValues methods now take "is_sorted" and
"force_atomic" arguments.  These methods now use binary search
(falling back to linear search for short rows) for the sorted case.
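
Here is a minimal sketch of that search strategy for one row with
sorted column indices.  It is an illustration only, not the
KokkosSparse code.  The function name findColumn and the cutoff of 16
entries are assumptions made for this example.

```cpp
#include <algorithm>
#include <cstddef>

// Return the position of 'col' among the sorted column indices of one
// row, or -1 if the row has no such entry.
inline std::ptrdiff_t
findColumn (const int* rowCols, std::size_t rowLength, int col)
{
  const std::size_t shortRow = 16; // hypothetical cutoff for "short" rows
  if (rowLength <= shortRow) {
    // Short row: a linear scan is cheap and branch friendly.
    for (std::size_t k = 0; k < rowLength; ++k) {
      if (rowCols[k] == col) {
        return static_cast<std::ptrdiff_t> (k);
      }
    }
    return -1;
  }
  // Long row: binary search; valid only because the row is sorted.
  const int* const end = rowCols + rowLength;
  const int* const it = std::lower_bound (rowCols, end, col);
  return (it != end && *it == col) ? (it - rowCols) : -1;
}
```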

Row views in KokkosSparse::CrsMatrix are no longer templated.  They
now use the ordinal type, rather than the offset type, for indexing.
This suffices as long as the number of entries in a row (counting
duplicates) fits in ordinal_type.  This has the beneficial side effect
of reducing the number of local sparse matrix-vector multiply kernel
instantiations.

* Got rid of LittleBlock and LittleVector (for Block* classes)

Instead, use the little_block_type, const_little_block_type,
little_vec_type, and const_little_vec_type typedefs in BlockCrsMatrix
and other related classes.  Underlying data layout has NOT changed
(yet), but constructors HAVE changed.  This is technically a
non-backwards-compatible interface change, but all these classes are
in an Experimental namespace anyway.

* Got rid of KokkosClassic::DefaultArithmetic

Stokhos was using this, so we had left it in place in previous
releases for backwards compatibility.  Now that no other packages
depend on it, we have gotten rid of it for good.  Its functionality
has been replaced by various functions in TpetraKernels.

The original idea behind DefaultArithmetic, as suggested in the name,
was that users could swap out this "default" implementation of
multivector operations with their own implementations.  This is
generally less useful than swapping out the implementation of sparse
matrix kernels (like sparse matrix-vector multiply or sparse
triangular solve).  As a result, Tpetra never had an implementation
(since at least January 2010) of multivector operations other than
DefaultArithmetic.

Trilinos 12.6
-------------

* Better CUDA testing

We added more nightly test builds with CUDA enabled (for running on
NVIDIA GPUs).  The builds test various combinations of CUDA with
different compiler versions and host thread parallelism options
(OpenMP, Pthreads, serial).  CUDA + GCC 4.7.2 is currently the
best-tested option, but we're using these tests to improve support for
other options.

* CrsMatrix, MultiVector, Vector: Added 'atomic' option to sumInto

The sumIntoLocalValues method in Tpetra::CrsMatrix, and the
sumIntoLocalValue and sumIntoGlobalValue methods in
Tpetra::MultiVector and Tpetra::Vector, now take an optional bool
'atomic' argument.  If true, the methods use Kokkos::atomic_add
(atomic +=); if false, they use (non-atomic) += as before.

This lets different threads call the methods concurrently on the same
entry/ies of the matrix, multivector, or vector.  To support this, I
also modified CrsMatrix::sumIntoGlobalValues so that it does not
change Teuchos::RCP reference counts, thus making it thread safe.

The default value of 'atomic' depends on the class' execution space.
If the execution space is Kokkos::Serial (no threads), atomic is false
by default; else, it is true by default.  This ensures that existing
MPI-only codes do not need to pay the (small integer factor) overhead
of atomic updates, while making sumInto always correct by default when
using thread parallelism.

If you know that different threads will never access the same entries
concurrently, you should set atomic=false for best performance.
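
The following sketch shows the idea of a defaulted 'atomic' flag that
depends on the execution space.  It uses std::atomic and made-up tag
types in place of Kokkos; SerialSpace, ThreadedSpace, DefaultAtomic,
and sumInto are hypothetical names, and the real methods use
Kokkos::atomic_add on plain entries rather than std::atomic.

```cpp
#include <atomic>

// Hypothetical tags standing in for Kokkos execution spaces.
struct SerialSpace {};
struct ThreadedSpace {};

// Default for the 'atomic' flag: false for serial, true otherwise.
template<class ExecSpace>
struct DefaultAtomic { static const bool value = true; };
template<>
struct DefaultAtomic<SerialSpace> { static const bool value = false; };

// Add 'delta' to 'entry'.
template<class ExecSpace>
void
sumInto (std::atomic<double>& entry, double delta,
         bool atomic = DefaultAtomic<ExecSpace>::value)
{
  if (atomic) {
    double old = entry.load ();
    // On failure, compare_exchange_weak reloads 'old' with the
    // current value, so the loop retries with fresh data.
    while (! entry.compare_exchange_weak (old, old + delta)) {}
  }
  else {
    // Separate load and store: only safe if no other thread updates
    // this entry at the same time.
    entry.store (entry.load (std::memory_order_relaxed) + delta,
                 std::memory_order_relaxed);
  }
}
```

A call like sumInto<SerialSpace> (x, 1.0) takes the non-atomic path by
default, while sumInto<ThreadedSpace> (x, 1.0) takes the atomic path
unless the caller passes false.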

* Block(Multi)Vector: Add "offset view" constructors (Bug 6450)

Tpetra::Experimental::{BlockMultiVector, BlockVector} now have two
"offset view" constructors.  They behave analogously to the offsetView
and offsetViewNonConst methods of Tpetra::MultiVector.

The constructors view an existing BlockMultiVector with a different
mesh Map, and an optional local row offset from which to start the
view on each process.  The offset is a mesh offset (it gets multiplied
by the block size internally in order to find the point offset).  The
two constructors differ only in that one lets you supply the new point
Map, while the other computes it for you.
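
As a small worked example of the offset arithmetic (a sketch with a
hypothetical function name, not the constructors' actual code):

```cpp
#include <cassert>
#include <cstddef>

// Convert a mesh (block) row offset into a point row offset.
inline std::size_t
meshToPointOffset (std::size_t meshOffset, std::size_t blockSize)
{
  assert (blockSize > 0);
  return meshOffset * blockSize;
}
```

With a block size of 3, a mesh offset of 2 starts the view at point
row 6.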

This fixes Bug 6450 (which was a feature request).

Trilinos 12.4
-------------

* Changed CMake option for setting default Node type

To set the default Node type, use the Tpetra_DefaultNode CMake option.
We support the old KokkosClassic_DefaultNode CMake option for
backwards compatibility.

Tpetra will eventually change from using Node types to using
Kokkos::Device types directly.  For now, though, if you wish to set
the default Node type explicitly, you must use one of the following:

  - Kokkos::Compat::KokkosCudaWrapperNode    (CUDA)
  - Kokkos::Compat::KokkosOpenMPWrapperNode  (OpenMP)
  - Kokkos::Compat::KokkosSerialWrapperNode  (Serial (no threads))
  - Kokkos::Compat::KokkosThreadsWrapperNode (Pthreads)

Tpetra normally only enables one Node type, so you only need to set
the default Node type if you have enabled more than one Node type.

* Rules for which Node type gets enabled by default

Tpetra only enables one Node type by default, whether or not ETI
(explicit template instantiation) is enabled.  Here are the rules for
which Node type gets enabled by default:

  1. If you're building with CUDA, Tpetra uses CUDA by default.
  2. Otherwise, if you're building with OpenMP, Tpetra uses OpenMP by
     default.
  3. Otherwise, if Kokkos enables the Serial execution space (if
     Kokkos_ENABLE_SERIAL is ON), Tpetra uses Serial by default.
  4. Otherwise, if Kokkos enables the Threads execution space (if
     Kokkos_ENABLE_PTHREAD is ON), Tpetra uses Threads by default.
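
The rules above amount to a fixed priority order.  Here is a sketch in
C++ that models them; BuildConfig and defaultNodeType are hypothetical
names, and the real selection happens in Tpetra's CMake logic.  The
Node type names are the ones listed earlier in these notes.

```cpp
#include <string>

// Which back-ends a (hypothetical) build enables.
struct BuildConfig {
  bool cuda;    // building with CUDA
  bool openmp;  // building with OpenMP
  bool serial;  // Kokkos_ENABLE_SERIAL
  bool pthread; // Kokkos_ENABLE_PTHREAD
};

// Apply rules 1-4 in order; the first enabled back-end wins.
inline std::string
defaultNodeType (const BuildConfig& cfg)
{
  if (cfg.cuda)    return "Kokkos::Compat::KokkosCudaWrapperNode";
  if (cfg.openmp)  return "Kokkos::Compat::KokkosOpenMPWrapperNode";
  if (cfg.serial)  return "Kokkos::Compat::KokkosSerialWrapperNode";
  if (cfg.pthread) return "Kokkos::Compat::KokkosThreadsWrapperNode";
  return ""; // no rule applies; these notes do not cover that case
}
```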

If you wish to enable other Node types, you may set the following
CMake options.  You do NOT need to set any of these options explicitly
if the Node type would be enabled by default anyway.

  - Tpetra_INST_CUDA (Kokkos_ENABLE_CUDA must be ON, and Trilinos must
    be built with CUDA; ON by default if building with CUDA)
  - Tpetra_INST_OPENMP (Kokkos_ENABLE_OPENMP must be ON, and Trilinos
    must be built with OpenMP support)
  - Tpetra_INST_PTHREAD (Kokkos_ENABLE_PTHREAD must be ON)
  - Tpetra_INST_SERIAL (Kokkos_ENABLE_SERIAL must be ON)

While it is legal to enable both the OpenMP and Pthreads back-ends in
the same executable, it is a bad idea.  Both back-ends spawn their own
worker threads, and those threads will fight over cores.

* Completely removed the "classic" version of Tpetra

You might recall that a while back, we split Tpetra into "classic"
(old) and "Kokkos refactor" (new) versions.  As of Trilinos 12.0, the
classic version was no longer supported, but we kept it in place for a
few users.  As of this release, we have removed the classic version
completely.

You no longer need to set Tpetra_ENABLE_Kokkos_Refactor to get the new
version of Tpetra.  It is ON (TRUE).  If you attempt to set it to OFF
(FALSE), Tpetra's CMake raises an error at configure time.  Just
enable Tpetra -- that's all you need to do!

This change affects both the Classic and Core subpackages of Tpetra.
All the "classic" Node types are gone now, along with their associated
computational kernels.  Use the Kernels subpackage of Tpetra for local
kernels.  (We left KokkosClassic::DefaultArithmetic in place for
Stokhos, but ONLY for Stokhos.)  The "classic" versions of Tpetra
classes are also now gone.  We have replaced them completely with
their "Kokkos refactor" versions.

You might have noticed that Doxygen had a hard time generating
documentation for the classes which had "classic" and "refactor"
versions.  These changes should fix that.  Furthermore, it's easier to
find header files for classes.  In particular, most of the header
files in the tpetra/core/src/kokkos_refactor directory now just have
trivial definitions and only remain for backwards compatibility.

* Improved build times and fewer .cpp files in source directory

Tpetra does a better job now of splitting up explicit instantiations
into separate .cpp files.  In some cases, it uses CMake to generate
those .cpp files automatically.  This means fewer .cpp files in
tpetra/core/src, so it's easier to find what you want.

* 128-bit floating-point arithmetic through Scalar = __float128

__float128 is a GCC language extension to C(++) that implements
128-bit (IEEE 754 quadruple-precision) floating-point arithmetic.  It
requires linking with libquadmath, which comes with GCC.

You must use GCC in order to try this feature.  Also, set the
following CMake variables:

Tpetra_INST_FLOAT128:BOOL=ON
CMAKE_CXX_FLAGS:STRING="-std=gnu++11 -fext-numeric-literals"
TPL_ENABLE_quadmath:BOOL=ON

You may also have to tell CMake where to find the libquadmath library
and quadmath.h header file:

quadmath_LIBRARY_DIRS:FILEPATH="${QUADMATH_LIB_DIR}"
quadmath_INCLUDE_DIRS:FILEPATH="${QUADMATH_INC_DIR}"

Here, ${QUADMATH_LIB_DIR} points to the directory containing the
libquadmath library (usually your GCC library directory), and
${QUADMATH_INC_DIR} points to the directory containing its header file
(quadmath.h).  For example, if you use a GCC installed in
$HOME/pkg/gcc-5.2.0, you might need to set those variables as follows:

QUADMATH_LIB_DIR=$HOME/pkg/gcc-5.2.0/lib
QUADMATH_INC_DIR=$HOME/pkg/gcc-5.2.0/lib/gcc/x86_64-unknown-linux-gnu/5.2.0/include

Trilinos likes to set the "-pedantic" flag, which causes warnings for
__float128 literals.  The build works regardless, but it would be more
pleasing to your eyes if you could figure out how to shut off the
warnings.

I implemented this because the Kokkos refactor of Tpetra broke QD
support (dd_real and qd_real -- "double-double" and "quad-double,"
128- resp. 256-bit floating-point arithmetic).  Applications were
asking for a work-around solution.

Trilinos 12.2
-------------

* Improvements to the "local" part of Tpetra::Map

Tpetra::Details::FixedHashTable implements the "local" part of
Tpetra::Map, where the "local part" is that which does not use MPI
communication.  For example, FixedHashTable knows how to convert from
global indices to local indices, for all the global indices known by
the calling process.
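
To make "local part" concrete, here is a sketch of global-to-local
index conversion for the indices one process knows about.  It uses
std::unordered_map and is not how FixedHashTable is implemented.  The
class name LocalIndexTable, its typedefs, and the use of -1 as the
"not found" value are assumptions made for this example.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Stand-in for the "local" part of a Map: index conversion for the
// global indices known by one process, with no MPI communication.
class LocalIndexTable {
public:
  typedef int local_ordinal;
  typedef long long global_ordinal;

  // Local index i corresponds to myGlobalIndices[i].
  explicit LocalIndexTable (const std::vector<global_ordinal>& myGlobalIndices)
    : lidToGid_ (myGlobalIndices)
  {
    for (std::size_t lid = 0; lid < lidToGid_.size (); ++lid) {
      gidToLid_[lidToGid_[lid]] = static_cast<local_ordinal> (lid);
    }
  }

  // Returns -1 if this process does not know the global index.
  local_ordinal getLocalIndex (global_ordinal gid) const {
    std::unordered_map<global_ordinal, local_ordinal>::const_iterator it =
      gidToLid_.find (gid);
    return it == gidToLid_.end () ? -1 : it->second;
  }

  // Throws std::out_of_range if lid is not a valid local index.
  global_ordinal getGlobalIndex (local_ordinal lid) const {
    return lidToGid_.at (static_cast<std::size_t> (lid));
  }

private:
  std::vector<global_ordinal> lidToGid_;
  std::unordered_map<global_ordinal, local_ordinal> gidToLid_;
};
```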

FixedHashTable now uses Kokkos for its data structures.  Its
initialization is completely Kokkos parallel, and its conversions
between global and local indices are Kokkos device functions.  This
achieves an important goal of making the local part of Tpetra::Map
functionality available for Kokkos parallel operations.

* Many Tpetra classes now split instantiations into multiple files

This matters only when explicit template instantiation (ETI) is ON.
(This _should_ be ON by default, but is not ON by default yet.)

The largest Tpetra classes (e.g., CrsGraph, CrsMatrix, and
MultiVector) now split their explicit instantiations into multiple
.cpp files.  This helps reduce build times and memory usage when ETI
is ON, and makes setting ETI ON an even more attractive option for
applications.

* Fixed Bugs 6335, 6336, 6377, and others

* Improved tests to catch errors on processes other than Process 0

* Improved CMake output and internal ETI-related documentation

Trilinos 12.0
-------------

* Tpetra now requires C++11

This requirement comes in part from Tpetra itself, and in part from
the Kokkos package, on which Tpetra depends.

* "Kokkos refactor" (new) version of Tpetra is the only version

We no longer enable or support the old ("classic") version of Tpetra.
The new ("Kokkos refactor") implementation of Tpetra is now the only
supported version.

Do not use any of the Node types in the KokkosClassic namespace.  We
do not support any of those Node types anymore.  Instead, use any of
the following Node types:

  - Kokkos::Compat::KokkosOpenMPWrapperNode (OpenMP)
  - Kokkos::Compat::KokkosCudaWrapperNode (NVIDIA CUDA)
  - Kokkos::Compat::KokkosSerialWrapperNode (no threads)
  - Kokkos::Compat::KokkosThreadsWrapperNode (Pthreads)

Each of these is a typedef for
Kokkos::Compat::KokkosDeviceWrapperNode<ExecSpace>, for the
corresponding Kokkos execution space.

* Set / rely on the default Node type as much as possible

Tpetra classes have a template parameter, "Node", which determines
what thread-level parallel programming model Tpetra will use.  This
corresponds directly to the "execution space" concept in Kokkos.

Tpetra classes have a default Node type.  Users do NOT need to specify
this explicitly.  I cannot emphasize this enough:

IF YOU ONLY EVER USE THE DEFAULT VALUES OF TEMPLATE PARAMETERS, DO NOT
SPECIFY THEM EXPLICITLY.

If you need to refer to the default values of template parameters, ask
Tpetra classes.  For example, 'Tpetra::Map<>::node_type' is the
default Node type.

Tpetra pays attention to Kokkos' build configuration when determining
the default Node type.  For example, it will not use a disabled
execution space.  If you do not like the default Node type, but you
only ever use one Node type in your application, you should change the
default Node type at Trilinos configure time.  You may do this by
setting the 'KokkosClassic_DefaultNode' CMake option.  Here is a list
of reasonable values:

  "Kokkos::Compat::KokkosSerialWrapperNode": use Kokkos::Serial
  execution space (execute in a single thread on the CPU)

  "Kokkos::Compat::KokkosOpenMPWrapperNode": use Kokkos::OpenMP
  execution space (use OpenMP for thread-level parallelism on the CPU)

  "Kokkos::Compat::KokkosThreadsWrapperNode": use Kokkos::Threads
  execution space (use Pthreads (the POSIX Threads library) for
  thread-level parallelism on the CPU)

  "Kokkos::Compat::KokkosCudaWrapperNode": use Kokkos::Cuda execution
  space (use NVIDIA's CUDA programming model for thread-level
  parallelism on the GPU)

You must use the above strings with the 'KokkosClassic_DefaultNode'
CMake option.  If you choose (unwisely, in many cases) to specify the
Node template parameter directly in your code, you may use those
names.  Alternately, you may let the Kokkos execution space determine
the Node type, by using the templated class
Kokkos::Compat::KokkosDeviceWrapperNode.  This class is templated on
the Kokkos execution space.  The above four types are typedefs to
their corresponding specializations of KokkosDeviceWrapperNode.  For
example, KokkosSerialWrapperNode is a typedef of
KokkosDeviceWrapperNode<Kokkos::Serial>.  This may be useful if your
code already makes use of Kokkos execution spaces.
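
The idiom of asking a class for its default template parameter,
together with the typedef relationship described above, can be shown
with stand-in types.  Nothing in this sketch is Tpetra or Kokkos code:
everything lives in a made-up namespace "sketch", and the default
template arguments are illustrative only.

```cpp
#include <type_traits>

namespace sketch { // stand-ins only

struct Serial {}; // plays the role of Kokkos::Serial
struct OpenMP {}; // plays the role of Kokkos::OpenMP

// Node type templated on the execution space.
template<class ExecSpace>
struct DeviceWrapperNode {
  typedef ExecSpace execution_space;
};

// Named Node types are just typedefs of specializations.
typedef DeviceWrapperNode<Serial> SerialWrapperNode;
typedef DeviceWrapperNode<OpenMP> OpenMPWrapperNode;

// The real default is chosen at configure time; it is fixed here.
typedef SerialWrapperNode DefaultNode;

template<class LO = int, class GO = long long, class Node = DefaultNode>
class Map {
public:
  typedef Node node_type;
};

// Preferred: ask the class for its default instead of naming it.
typedef Map<>::node_type node_type;

static_assert (std::is_same<node_type, DeviceWrapperNode<Serial> >::value,
               "default Node is the Serial wrapper in this sketch");

} // namespace sketch
```

Code written against Map<>::node_type keeps compiling unchanged if the
configured default changes.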

* Removed (deprecated classes) Tpetra::VbrMatrix, Tpetra::BlockMap,
  Tpetra::BlockCrsGraph, and Tpetra::BlockMultiVector.

All these classes relate to VBR (variable-block-size block sparse
matrix) functionality.  We may reimplement that at some point, but for
now it's going away.

* Removed (deprecated class) Tpetra::HybridPlatform

Trilinos 11.14
--------------

* Public release of "Kokkos refactor" version of Tpetra

The "Kokkos refactor" version of Tpetra is a new implementation of
Tpetra.  It is based on the new Kokkos programming model in the
KokkosCore subpackage.  It coexists with the "classic" version of
Tpetra, which has been DEPRECATED and will be removed entirely in the
12.0 major release of Trilinos.  Thus, the Kokkos refactor version
will become the /only/ version of Tpetra at that time.

The Kokkos refactor version of Tpetra mostly maintains backwards
compatibility [SEE NOTE BELOW] with the classic version's interface.
Its interface will continue to evolve.  For this first public release,
we have prioritized backwards compatibility over interface innovation.

The implementation of the Kokkos refactor version of Tpetra currently
lives in tpetra/core/src/kokkos_refactor.  It works by partial
specialization on the 'Node' template parameter, and by a final 'bool'
template parameter (which users must NEVER SPECIFY EXPLICITLY).  The
"classic" version of Tpetra uses the old ("classic") Node types that
live in the KokkosClassic namespace.  All of the classic Node types
have been DEPRECATED, which is how users can see that classic Tpetra
has been deprecated.

If you wish to disable the Kokkos refactor version of Tpetra, set the
Tpetra_ENABLE_Kokkos_Refactor CMake option to OFF.  Please note that
this will result in a large number of warnings about deprecated
classes.  This CMake option will go away in the 12.0 release.

* Note on backwards compatibility of Tpetra interface

In the new version of Tpetra, MultiVector and Vector implement /view
semantics/.  That is, the one-argument copy constructor and the
assignment operator (operator=) perform shallow copies.  (By default,
in the classic version of Tpetra, they did deep copies.)  For deep
copies, use one of the following:

  - Two-argument "copy constructor" with Teuchos::Copy as the second
    argument (to create a new MultiVector or Vector which is a deep
    copy of an existing one)
  - Tpetra::deep_copy (works like Kokkos::deep_copy)
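
Here is a sketch of the difference between shallow and deep copies,
using a stand-in class built on std::shared_ptr.  It is not
Tpetra::MultiVector; the names Vec, CopyTag, and Copy are hypothetical
(Copy stands in for Teuchos::Copy), and deep_copy here only mimics the
destination-first argument order of Kokkos::deep_copy.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

enum CopyTag { Copy }; // stands in for Teuchos::Copy

class Vec {
public:
  explicit Vec (std::size_t n)
    : data_ (std::make_shared<std::vector<double> > (n, 0.0)) {}

  // The compiler-generated copy constructor and operator= copy the
  // shared_ptr, so both objects view the same storage (shallow copy).

  // Two-argument "copy constructor": allocate new storage, copy values.
  Vec (const Vec& src, CopyTag)
    : data_ (std::make_shared<std::vector<double> > (*src.data_)) {}

  // View semantics: a const Vec still gives write access to the data.
  double& operator[] (std::size_t i) const { return (*data_)[i]; }
  std::size_t size () const { return data_->size (); }

private:
  std::shared_ptr<std::vector<double> > data_;
};

// Copy values from src into dst's existing storage.
inline void
deep_copy (Vec& dst, const Vec& src)
{
  assert (dst.size () == src.size ());
  for (std::size_t i = 0; i < src.size (); ++i) {
    dst[i] = src[i];
  }
}
```

After "Vec b = a;", writing to b[0] also changes a[0].  After
"Vec c (a, Copy);" or "deep_copy (d, a);", the two objects no longer
share storage.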

* What if I have trouble building with Scalar=std::complex<T>?

The new version of Tpetra should be able to build with Scalar =
std::complex<float> or std::complex<double>.  If you have trouble
building, you may disable explicit template instantiation (ETI) and
tests for those Scalar types, using the following CMake options:

  Tpetra_INST_COMPLEX_FLOAT:BOOL=OFF
  Tpetra_INST_COMPLEX_DOUBLE:BOOL=OFF

* Accessing and changing the default Node type

Tpetra classes have a template parameter, "Node", which determines
what thread-level parallel programming model Tpetra will use.  This
corresponds directly to the "execution space" concept in Kokkos.

Tpetra classes have a default Node type.  Users do NOT need to specify
this explicitly.  I cannot emphasize this enough:

IF YOU ONLY EVER USE THE DEFAULT VALUES OF TEMPLATE PARAMETERS, DO NOT
SPECIFY THEM EXPLICITLY.

If you need to refer to the default values of template parameters, ask
Tpetra classes.  For example, 'Tpetra::Map<>::node_type' is the
default Node type.

Tpetra pays attention to Kokkos' build configuration when determining
the default Node type.  For example, it will not use a disabled
execution space.  If you do not like the default Node type, but you
only ever use one Node type in your application, you should change the
default Node type at Trilinos configure time.  You may do this by
setting the 'KokkosClassic_DefaultNode' CMake option.  Here is a list
of reasonable values:

  "Kokkos::Compat::KokkosSerialWrapperNode": use Kokkos::Serial
  execution space (execute in a single thread on the CPU)

  "Kokkos::Compat::KokkosOpenMPWrapperNode": use Kokkos::OpenMP
  execution space (use OpenMP for thread-level parallelism on the CPU)

  "Kokkos::Compat::KokkosThreadsWrapperNode": use Kokkos::Threads
  execution space (use Pthreads (the POSIX Threads library) for
  thread-level parallelism on the CPU)

  "Kokkos::Compat::KokkosCudaWrapperNode": use Kokkos::Cuda execution
  space (use NVIDIA's CUDA programming model for thread-level
  parallelism on the GPU)

You must use the above strings with the 'KokkosClassic_DefaultNode'
CMake option.  If you choose (unwisely, in many cases) to specify the
Node template parameter directly in your code, you may use those
names.  Alternately, you may let the Kokkos execution space determine
the Node type, by using the templated class
Kokkos::Compat::KokkosDeviceWrapperNode.  This class is templated on
the Kokkos execution space.  The above four types are typedefs to
their corresponding specializations of KokkosDeviceWrapperNode.  For
example, KokkosSerialWrapperNode is a typedef of
KokkosDeviceWrapperNode<Kokkos::Serial>.  This may be useful if your
code already makes use of Kokkos execution spaces.

* Changes to subpackages

Tpetra is now divided into subpackages.  What was formerly just
"Tpetra" is now "TpetraCore".  Other subpackages of Kokkos have moved,
some into Teuchos and some into Tpetra.  Those subpackages have
changed from Experimental (EX) to Primary Tested (PT), so that they
build by default if Tpetra is enabled.

The most important change is that Tpetra now has a required dependency
on the Kokkos programming model.  See below.

If your application links against Trilinos using either the
Makefile.export.* system or the CMake FIND_PACKAGE(Trilinos ...)
system, you do not need to worry about this.  Just enable Tpetra and
let Trilinos' build system handle the rest.

* New required dependency on Kokkos

Tpetra now has a required dependency on the Kokkos programming model.
In particular, TpetraCore (see above) has required dependencies on the
KokkosCore, KokkosContainers, and KokkosAlgorithms subpackages of
Kokkos.

This means that Tpetra is now subject to Kokkos' build requirements.
C++11 support is still optional in this release, but future releases
will require C++11 support.  Please refer to Kokkos' documentation for
more details.

* Deprecated variable-block-size classes (like VbrMatrix).

We have deprecated the following classes in the Tpetra namespace:

  - BlockCrsGraph
  - BlockMap
  - BlockMultiVector (NOT Tpetra::Experimental::BlockMultiVector)
  - VbrMatrix

These classes relate to "variable-block-size" vectors and matrices.
Tpetra::BlockMultiVector (NOT the same as
Tpetra::Experimental::BlockMultiVector) implements a
variable-block-size block analogue of MultiVector.  Each row of a
MultiVector corresponds to a single degree of freedom; each block row
of a BlockMultiVector corresponds to any number of degrees of freedom.
"Variable block size" means that different block rows may have
different numbers of degrees of freedom.  An instance of
Tpetra::BlockMap represents the block (row) Map of a BlockMultiVector.
Tpetra::VbrMatrix implements a variable-block-size block sparse matrix
that corresponds to BlockMultiVector.  Each (block) entry of a
VbrMatrix is its own dense matrix.  These dense matrices are not
distributed; they are locally stored and generally "small" (think
"fits in cache").  An instance of Tpetra::BlockCrsGraph represents the
block graph of a VbrMatrix.
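
To illustrate what "variable block size" implies for indexing, here is
a sketch that finds the first point row (degree of freedom) of a given
block row.  It is not code from any of the deprecated classes, and the
function name firstPointRow is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// First point row covered by block row 'blockRow', when block row i
// holds blockSizes[i] degrees of freedom.  Passing blockSizes.size()
// returns the total number of point rows.
inline std::size_t
firstPointRow (const std::vector<std::size_t>& blockSizes,
               std::size_t blockRow)
{
  assert (blockRow <= blockSizes.size ());
  return std::accumulate (blockSizes.begin (),
                          blockSizes.begin () +
                            static_cast<std::ptrdiff_t> (blockRow),
                          std::size_t (0));
}
```

With a constant block size, the same quantity is just the block row
index times the block size, with no sum over earlier block rows.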

Here are the reasons why we are deprecating these classes:

  - Their interfaces as well as their implementations need a
    significant redesign for MPI+X, e.g., for efficient use of
    multiple levels of parallelism.
  - They are poorly exercised, even in comparison to their Epetra
    equivalents.
  - They have poor test coverage, and have outstanding known bugs: see
    e.g., Bug 6039.
  - Most users don't need a fully general VBR [1].
  - We would prefer to name the VBR classes consistently, both to
    emphasize the V (variable) part and to distinguish them from the
    new constant-block-size classes.

[1] Many users' block matrices have blocks which are all the same
    size.  They would get best performance by using the new
    constant-block-size classes that currently live in the
    Tpetra::Experimental namespace.  Others usually only have a small
    number of different block sizes per matrix (e.g., 3 degrees of
    freedom per interior mesh point; 2 for boundary mesh points).  The
    latter users could get much better performance by a data structure
    that represents the sparse matrix as a sum of constant-block-size
    matrices.


Development version (11.13)
---------------------------

* Deprecated "classic" Node types

* Update to instructions for building Kokkos Refactor version

See the Release Notes for the 11.12 release (below) for an explanation
of what the Kokkos Refactor version of Tpetra is.  Currently, you must
enable this option explicitly at Trilinos configuration time; it is
disabled by default.  Here is a summary of the required CMake options
(NOTE: many options are no longer necessary!).

  # Enable the Kokkos refactor version of Tpetra.
  -D Tpetra_ENABLE_Kokkos_Refactor:BOOL=ON

  # Set default Node type to one of the new Kokkos Nodes.
  # The one below uses the Kokkos::Serial execution space.
  # Replace "Serial" with "OpenMP" for the Kokkos::OpenMP
  # execution space, or replace it with "Threads" for the
  # Kokkos::Threads execution space.
  -D KokkosClassic_DefaultNode:STRING="Kokkos::Compat::KokkosSerialWrapperNode"

* Subpackages needed for the Kokkos refactor version of Tpetra
  are now Primary Tested (PT) by default

This means that if you enable Tpetra, you should no longer have to
enable the necessary subpackages of Kokkos and Tpetra (see below)
explicitly; they are now enabled by default.

* Changed CMake options for enabling / disabling "classic" Nodes

  TpetraClassic_ENABLE_SerialNode:    For KokkosClassic::SerialNode
  TpetraClassic_ENABLE_OpenMPNode:    For KokkosClassic::OpenMPNode
  TpetraClassic_ENABLE_TBBNode:       For KokkosClassic::TBBNode
  TpetraClassic_ENABLE_TPINode:       For KokkosClassic::TPINode
  TpetraClassic_ENABLE_ThrustGPUNode: For KokkosClassic::ThrustGPUNode

* Disabled "classic" Node types by default if Kokkos refactor version
  of Tpetra is enabled

* Moved KokkosCompat and KokkosMpiComm subpackages of Kokkos to
  KokkosCompat resp. KokkosComm subpackages of Teuchos

This affects configuration for the Kokkos refactor build of Tpetra.
See below.  I have updated release notes for the 11.13 version, but
not for previous releases (in particular 11.12).

* Moved KokkosLinAlg subpackage (from Kokkos) into Tpetra as the
  TpetraKernels subpackage

* Split Tpetra into subpackages; moved KokkosClassic and KokkosTSQR
  subpackages into Tpetra

The "Classic" and "TSQR" subpackages of Kokkos are now the "Classic"
resp. "TSQR" subpackages of Tpetra.  The original contents of Tpetra
are now the "Core" subpackage of Tpetra.

NOTE: Any package dependencies on "KokkosClassic" or "KokkosTSQR" must
now be changed to "TpetraClassic" resp. "TpetraTSQR".

* Update to instructions for building Kokkos Refactor version

See the Release Notes for the 11.12 release (below) for an explanation
of what the Kokkos Refactor version of Tpetra is.  Currently, you must
enable this option explicitly at Trilinos configuration time; it is
disabled by default.  Here is a summary of the required CMake options.

# Enable Kokkos subpackages explicitly.
#
# If Tpetra_ENABLE_Kokkos_Refactor is ON but any of those subpackages
# are not enabled, CMake will stop with an error message that tells you
# what subpackages to enable.
-D Trilinos_ENABLE_KokkosCore:BOOL=ON
-D Trilinos_ENABLE_KokkosContainers:BOOL=ON
-D Trilinos_ENABLE_KokkosAlgorithms:BOOL=ON
-D Trilinos_ENABLE_TeuchosKokkosCompat:BOOL=ON
-D Trilinos_ENABLE_TeuchosKokkosComm:BOOL=ON
-D Trilinos_ENABLE_TpetraKernels:BOOL=ON

# Set default Node type to one of the new Kokkos Nodes
-D KokkosClassic_DefaultNode:STRING="Kokkos::Compat::KokkosSerialWrapperNode"

# Enable the Kokkos refactor version of Tpetra.
-D Tpetra_ENABLE_Kokkos_Refactor:BOOL=ON

# Turn off ETI and tests for std::complex.
-D Tpetra_INST_COMPLEX_DOUBLE:BOOL=OFF
-D Tpetra_INST_COMPLEX_FLOAT:BOOL=OFF

Here are some OPTIONAL CMake options that may help reduce build times
if ETI (explicit template instantiation) is enabled:

# Disable KokkosClassic::OpenMPNode
-D TpetraClassic_ENABLE_OpenMP:BOOL=OFF
# Shut off Kokkos' Pthreads back-end.
-D Kokkos_ENABLE_PTHREAD:BOOL=OFF

* Deprecated variable-block-size classes (like VbrMatrix).

I have deprecated the following classes in the Tpetra namespace:

  - BlockMultiVector (NOT Tpetra::Experimental::BlockMultiVector)
  - BlockMap
  - VbrMatrix
  - BlockCrsGraph

Tpetra::BlockMultiVector (NOT the same as
Tpetra::Experimental::BlockMultiVector) implements a
variable-block-size block analogue of MultiVector.  Each row of a
MultiVector corresponds to a single degree of freedom; each block row
of a BlockMultiVector corresponds to any number of degrees of freedom.
"Variable block size" means that different block rows may have
different numbers of degrees of freedom.  An instance of
Tpetra::BlockMap represents the block (row) Map of a BlockMultiVector.
Tpetra::VbrMatrix implements a variable-block-size block sparse matrix
that corresponds to BlockMultiVector.  Each (block) entry of a
VbrMatrix is its own dense matrix.  These dense matrices are not
distributed; they are locally stored and generally "small" (think
"fits in cache").  An instance of Tpetra::BlockCrsGraph represents the
block graph of a VbrMatrix.
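
As a rough illustration of "variable block size," here is a minimal
C++ sketch of a layout in which each block row owns a different number
of degrees of freedom.  ToyBlockLayout and its members are
hypothetical names invented for this note; they are not the
Tpetra::BlockMap interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy illustration only; this is NOT the Tpetra::BlockMap interface.
// Block row i owns blockSizes[i] degrees of freedom ("point rows").
struct ToyBlockLayout {
  // firstPointRow has numBlockRows + 1 entries.  Block row i covers
  // point rows [firstPointRow[i], firstPointRow[i+1]).
  std::vector<std::size_t> firstPointRow;

  explicit ToyBlockLayout (const std::vector<std::size_t>& blockSizes)
    : firstPointRow (blockSizes.size () + 1, 0)
  {
    // Prefix sum of the block sizes.
    for (std::size_t i = 0; i < blockSizes.size (); ++i) {
      firstPointRow[i+1] = firstPointRow[i] + blockSizes[i];
    }
  }

  std::size_t numBlockRows () const { return firstPointRow.size () - 1; }
  std::size_t numPointRows () const { return firstPointRow.back (); }

  std::size_t blockSize (const std::size_t i) const {
    assert (i < numBlockRows ());
    return firstPointRow[i+1] - firstPointRow[i];
  }
};
```

With block sizes {3, 2, 3}, this layout has 3 block rows and 8 point
rows, and the last block row starts at point row 5.  A
constant-block-size layout would not need the offsets array at all,
which is one reason why that case is simpler to optimize.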

Here are the reasons why I am deprecating these classes:

  - Their interfaces as well as their implementations need a
    significant redesign for MPI+X [1].
  - They are poorly exercised, even in comparison to their Epetra
    equivalents.
  - They have poor test coverage, and have outstanding known bugs: see
    e.g., Bug 6039.
  - Most users don't need a fully general VBR [2].
  - I would prefer to name the VBR classes consistently, both to
    emphasize the V (variable) part and to distinguish them from the
    new constant-block-size classes.  This would let me move the
    latter out of the Tpetra::Experimental namespace into the Tpetra
    namespace.
  - I don't have time to support these classes _and_ finish the port
    of the rest of Tpetra to use new Kokkos.
  - I don't want to leave these classes in place and let users think
    that they work correctly.

[1] The interfaces will likely change so much that it might not be
    worth maintaining backwards compatibility, in contrast to the
    other Tpetra classes, whose interfaces can change gradually.  This
    is because current and future computer hardware requires
    exploiting multiple levels of parallelism in order to get best (or
    even reasonable) performance.  We will have to think hard about
    how to do this efficiently for sparse matrices with variable block
    sizes.  I would prefer to focus on the constant-block-size case
    first, get that right, and then generalize to handle the load
    balancing issues of the variable-block-size case.

[2] Many users' block matrices have blocks which are all the same
    size.  They would get best performance by using the new
    constant-block-size classes that currently live in the
    Tpetra::Experimental namespace.  Others usually only have a small
    number of different block sizes per matrix (e.g., 3 degrees of
    freedom per interior mesh point; 2 for boundary mesh points).  The
    latter users could get much better performance by a data structure
    that represents the sparse matrix as a sum of constant-block-size
    matrices.


Trilinos 11.12:
---------------

* Kokkos refactor version of Tpetra

The "Kokkos refactor" version of Tpetra is the new version of Tpetra,
based on the new Kokkos programming model in the KokkosCore
subpackage.  It coexists with the "classic" version of Tpetra, which
is currently the default version.  We plan to deprecate the "classic"
version of Tpetra in the 11.14 minor release in January, and to remove
it entirely in the 12.0 major release.  Thus, the "Kokkos refactor"
version of Tpetra will become the /only/ version of Tpetra at that
time.

The implementation of the Kokkos refactor version of Tpetra currently
lives in src/kokkos_refactor.  It works by partial specialization on
the Node template parameter.  If you would like to enable this version
of Tpetra, here is a suggested set of CMake options:

# Enable OpenMP, and enable Kokkos' OpenMP backend
-D Trilinos_ENABLE_OpenMP:BOOL=ON

# Set Tpetra's default Node type to use new Kokkos with OpenMP.
# You could also use KokkosThreadsWrapperNode or even 
# KokkosSerialWrapperNode here.  
-D KokkosClassic_DefaultNode:STRING="Kokkos::Compat::KokkosOpenMPWrapperNode"

# Enable the Kokkos refactor version of Tpetra.
-D Tpetra_ENABLE_Kokkos_Refactor:BOOL=ON

In a debug build, you might like to enable Kokkos' run-time bounds
checking.  Here's how you do that.  These are _optional_ parameters
and their default values are both OFF (not enabled).

-D Kokkos_ENABLE_BOUNDS_CHECK:BOOL=ON
-D Kokkos_ENABLE_DEBUG:BOOL=ON

The following options may reduce build times if ETI is enabled:

# Disable KokkosClassic::OpenMPNode
-D KokkosClassic_ENABLE_OpenMP:BOOL=OFF
# Shut off Kokkos' Pthreads back-end in favor of OpenMP
-D Kokkos_ENABLE_PTHREAD:BOOL=OFF

You must also enable the following subpackages explicitly, since they
are not Primary Tested at the moment:

  - KokkosCore
  - KokkosCompat
  - KokkosContainers
  - KokkosLinAlg
  - KokkosAlgorithms
  - KokkosMpiComm

If Tpetra_ENABLE_Kokkos_Refactor is ON but any of those subpackages
are not enabled, CMake will stop with an error message that tells you
what subpackages to enable.

If you would like to build with the above subpackages enabled, but
would /not/ like to build Tpetra with any of the new Kokkos Nodes, you
may try setting the CMake option KokkosClassic_ENABLE_KokkosCompat to
OFF.  This works for me as of 07 Oct 2014, but I do not recommend it,
and it is not supported.

Fun fact: there are three relevant combinations of (new Kokkos
enabled?, Kokkos refactor enabled?), and we test them all!  You can
use the new Kokkos Node types with "classic" Tpetra, or you can use
them with "Kokkos refactor" Tpetra.

Most Tpetra tests exercise all enabled Node types, or just use the
default Node type.  Ifpack2 tests only use the default Node type
currently.  That's why the above build configuration changes the
default Node type.  That way, all packages that depend on Tpetra will
use the Kokkos refactor version of Tpetra in /their/ tests by default.


* Full set of default values of template parameters

Usability improvement!  Most Tpetra classes now come with a full set
of default values of template parameters.  In many cases, you need no
longer specify _any_ template parameters' values, if you only intend
to use their defaults.  For example, you may now write the following:

  // All default template parameters!
  Tpetra::Map<> map (...);

  // No "typename" because Map<> is a concrete type.
  typedef Tpetra::Map<>::local_ordinal_type LO;
  typedef Tpetra::Map<>::global_ordinal_type GO;

  for (LO i_lcl = map.getMinLocalIndex (); 
       i_lcl <= map.getMaxLocalIndex (); ++i_lcl) {
    const GO i_gbl = map.getGlobalElement (i_lcl);
    // ...
  }

  // All default template parameters!
  // Scalar defaults to double.
  // LocalOrdinal, GlobalOrdinal, and Node default
  // to the same values as those of Map<> above.
  Tpetra::MultiVector<> X (...);

Also, if you need to specify (say) GlobalOrdinal explicitly, you don't
have to specify Node explicitly.  For example:

  // Don't need to specify Node; it takes its default value.
  Tpetra::Map<int, long long> map (...);
  Tpetra::MultiVector<double, int, long long> X (...);

You may specify the default value of Node at Trilinos configure time
(that is, when running CMake).  The current default is
KokkosClassic::SerialNode (no threads; MPI only).  This will change,
but it will always have a reasonable value for conventional multicore
processors.

Please, _please_ prefer default values of template parameters!  This
will make your code shorter, allow more flexibility at configure time,
and might even make builds a bit faster.  All Tpetra classes come with
public typedefs, so you can pick up scalar_type (if applicable),
local_ordinal_type, global_ordinal_type, and node_type from Tpetra
directly, rather than specifying them explicitly.
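
The following self-contained sketch mimics that pattern with stand-in
classes, so that it compiles without Trilinos.  The "toy" namespace
and the default types chosen in it are assumptions made for
illustration; only the typedef names (scalar_type,
local_ordinal_type, global_ordinal_type, node_type) are taken from
the text above.

```cpp
#include <type_traits>

// Stand-ins for illustration only; these are NOT the Tpetra classes.
namespace toy {
  struct SerialNode {};

  template<class LO = int, class GO = long long, class Node = SerialNode>
  class Map {
  public:
    typedef LO local_ordinal_type;
    typedef GO global_ordinal_type;
    typedef Node node_type;
  };

  template<class Scalar = double, class LO = int,
           class GO = long long, class Node = SerialNode>
  class MultiVector {
  public:
    typedef Scalar scalar_type;
    typedef LO local_ordinal_type;
    typedef GO global_ordinal_type;
    typedef Node node_type;
    typedef Map<LO, GO, Node> map_type;
  };
}

// Application code names the class once, with all defaults...
typedef toy::MultiVector<> MV;

// ...and picks up every other type from the class's public typedefs,
// instead of repeating the template arguments by hand.
typedef MV::scalar_type SC;
typedef MV::local_ordinal_type LO;
typedef MV::global_ordinal_type GO;
typedef MV::node_type NT;
typedef MV::map_type map_type;
```

If the defaults change at configure time, code written this way picks
up the new types without any edits.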

* Removed the LocalMatOps template parameter

CrsGraph, CrsMatrix, VbrMatrix, and other classes used to have a
LocalMatOps template parameter.  This was the fourth template
parameter of CrsGraph and the fifth template parameter of CrsMatrix.
It was always optional.  Chris Baker intended it as an extension point
for users or third-party vendors to insert their own sparse
matrix-vector multiply or triangular solve routines.  However, no one
ever used it for this purpose as far as we know.  When it started to
hinder the Kokkos refactor effort (see release notes for Trilinos
11.10 below), we removed it.  This should speed up compilation times.

Lesson: It's always easier to _add_ a template parameter (at the end,
if it's optional) than it is to remove one.

Getting rid of LocalMatOps does amount to a backwards incompatible
interface change.  However, we deemed it a harmless change, for the
following reasons:

  1. LocalMatOps has a reasonable default value.
  2. As far as I know, no one other than Chris Baker and myself ever
     wrote or used alternate implementations of LocalMatOps.
  3. Trilinos packages or applications which bothered to specify
     LocalMatOps never used anything other than the default value.

Thus, it never even crossed my mind that applications would bother to
specify this thing.  Unfortunately, some applications may still
specify LocalMatOps explicitly (for example, through a typedef).
Doing so is unnecessary.  You do not need to specify this template
parameter.  The default value was always perfectly fine and has been
for years.


Trilinos 11.10:
---------------

* Continued work on the Kokkos refactor version of Tpetra

We plan to replace the current "classic" version of Tpetra with a
"Kokkos refactor" version, that uses new Kokkos for thread-parallel
computational kernels and data structures.  The classic version
continues to be the default, but the Kokkos refactor version is
available via partial specialization on the Node type.  

You may try out the Kokkos refactor version of Tpetra by doing the
following:

  1. Enable the KokkosCore, KokkosCompat, KokkosContainers,
     KokkosLinAlg, and KokkosMpiComm (which does not require MPI)
     subpackages.

  2. Set the CMake option Tpetra_ENABLE_Kokkos_Refactor to ON.

  3. Include either Kokkos_DefaultNode.hpp or
     KokkosCompat_ClassicNodeAPI_Wrapper.hpp, if they are not already
     included by the relevant Tpetra header files.

  4. Use the appropriate Node type in the Kokkos::Compat namespace:
     KokkosCudaWrapperNode with the Kokkos::Cuda device,
     KokkosOpenMPWrapperNode with the Kokkos::OpenMP device,
     KokkosThreadsWrapperNode with the Kokkos::Threads device, or
     KokkosSerialWrapperNode with the Kokkos::Serial device.

We plan to deprecate the KokkosClassic namespace and its contents in
the next minor release (scheduled for October), with the goal of
removing it entirely by the next major release.

* CrsMatrix: replaceGlobalValues, sumIntoGlobalValues,
  replaceLocalValues, and sumIntoLocalValues now return error codes,
  instead of throwing on invalid row or column indices

This will facilitate thread parallelism, and porting Tpetra to use new
Kokkos.  It also partially addresses Bug 4918 (of which Bug 5806 is a
duplicate).  The error code tells users both whether the row index was
valid, and the number of valid column indices.  If the return value
equals the number of input column indices, the method succeeded.
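
The return-value convention can be sketched with a toy,
single-process matrix.  ToyLocalMatrix is a hypothetical stand-in,
not Tpetra::CrsMatrix, and its use of -1 to flag an invalid row is an
assumption of this sketch; please consult the CrsMatrix documentation
for the value that Tpetra actually returns in that case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy local sparse matrix; NOT Tpetra::CrsMatrix.
struct ToyLocalMatrix {
  std::vector<std::vector<int> > colInds;   // column indices, per row
  std::vector<std::vector<double> > values; // values, same shape

  // Add vals[k] to the entry in column cols[k] of the given row.
  // Returns -1 if the row is invalid (an assumption of this sketch);
  // otherwise returns the number of input column indices that were
  // found in the row.  Never throws on bad indices.
  int sumIntoLocalValues (const int row,
                          const std::vector<int>& cols,
                          const std::vector<double>& vals)
  {
    assert (cols.size () == vals.size ());
    if (row < 0 || static_cast<std::size_t> (row) >= colInds.size ()) {
      return -1;
    }
    const std::size_t r = static_cast<std::size_t> (row);
    int numValid = 0;
    for (std::size_t k = 0; k < cols.size (); ++k) {
      for (std::size_t j = 0; j < colInds[r].size (); ++j) {
        if (colInds[r][j] == cols[k]) {
          values[r][j] += vals[k];
          ++numValid;
          break;
        }
      }
    }
    return numValid;
  }
};
```

A caller would compare the return value against cols.size() to detect
that some column indices were skipped, rather than wrapping the call
in a try/catch block.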

* CrsGraph, CrsMatrix: getLocalRowCopy now does not throw if the input
  row index is invalid; instead, it sets numEntries=0 and returns.

This will facilitate thread parallelism, and porting Tpetra to use new
Kokkos.  It is also semantically consistent: if the calling process
doesn't own that row, then the calling process owns zero entries in
that row, so it's correct to set numEntries=0 and return without
throwing.

* New function: Tpetra::Details::makeOptimizedColMap

* MultiVector and Map now have full default template parameters

Now, if you write Tpetra::MultiVector<> (empty angle brackets
required), that sets Scalar=double.  If you write Tpetra::Map<>
(again, empty angle brackets required), that sets LocalOrdinal=int.
Tpetra includes a unit test for this feature.

* New classes in Tpetra::Experimental namespace: BlockCrsMatrix,
  BlockMultiVector, and BlockVector (constant-size small blocks, with
  block size determined at run time)

* Fixes for Bug 6139 and 6127

* Tpetra::MatrixMarket::{writeMap, writeMapFile} can now handle an
  overlapping Map.

* Import and Export now inherit from a common base class
  (useful for implementing communication methods)

* Refactored and improved examples

* CrsMatrix: fillComplete with a const graph now uses the graph's
  domain and range Maps (thus fixing an unnumbered bug, in which
  CrsMatrix was instead using the row Map for both)

* Map now implements view semantics

"View semantics" means that Map's copy constructor and assignment
operator (operator=) do a shallow copy, and that empty construction is
possible.  The new test for isOneToOne (see below) is the first Tpetra
test that assumes view semantics of Map.

* Map now has an isOneToOne predicate

isOneToOne is a collective which tests whether the Map is one to one
(that is, whether every global index is owned by at most one process
in the Map's communicator).
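
To make the definition concrete, here is a serial model of the test.
It assumes that the index lists of all processes are available in one
place, which is not how a distributed Map works (the real method is a
collective).  The name toyIsOneToOne and the data layout are
hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Serial model only.  ownedByProc[p] lists the global indices owned
// by process p; each process is assumed to list an index at most once.
// Returns true if no global index appears on more than one process.
bool
toyIsOneToOne (const std::vector<std::vector<long long> >& ownedByProc)
{
  assert (! ownedByProc.empty ()); // a communicator has >= 1 process
  std::set<long long> seen;
  for (std::size_t p = 0; p < ownedByProc.size (); ++p) {
    for (std::size_t k = 0; k < ownedByProc[p].size (); ++k) {
      if (! seen.insert (ownedByProc[p][k]).second) {
        return false; // an earlier process already owns this index
      }
    }
  }
  return true;
}
```

For example, the lists {0, 1, 2} and {3, 4} form a one-to-one
distribution, while {0, 1, 2} and {3, 4, 2} do not, because global
index 2 is owned by both processes.  An overlapping Map of the latter
kind is typical for the column Map of a sparse matrix.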

Trilinos 11.8
-------------

* BACKWARDS INCOMPATIBLE CHANGE: MultiVector and Vector now implement
  view semantics

This means that the copy constructor and assignment operator
(operator=) of both classes now do shallow copies.  This change will
support gradual porting to the new ("Kokkos Refactor") version of
Tpetra.

We have propagated this change to other Trilinos packages that use
Tpetra.  Please use the new createCopy nonmember function to get a new
instance of (Multi)Vector that is a deep copy of an existing
(Multi)Vector.  Also, please use the new nonmember function deep_copy
to do a deep copy between two existing compatible (Multi)Vector
instances.
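
The difference between the three kinds of copy can be sketched with a
toy class that shares its storage through std::shared_ptr.  ToyVector
is a hypothetical stand-in, not Tpetra::Vector; the free functions
below only borrow the names createCopy and deep_copy to show the
intended semantics.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy class with view semantics; NOT Tpetra::Vector.
class ToyVector {
public:
  explicit ToyVector (const std::size_t n)
    : data_ (std::make_shared<std::vector<double> > (n, 0.0)) {}

  // The compiler-generated copy constructor and operator= copy only
  // the shared_ptr, so both objects refer to the same storage.

  double& operator[] (const std::size_t i) { return (*data_)[i]; }
  const double& operator[] (const std::size_t i) const { return (*data_)[i]; }
  std::size_t size () const { return data_->size (); }

private:
  std::shared_ptr<std::vector<double> > data_;
};

// Return a new object with its own storage and the same contents.
ToyVector createCopy (const ToyVector& src) {
  ToyVector dst (src.size ());
  for (std::size_t i = 0; i < src.size (); ++i) {
    dst[i] = src[i];
  }
  return dst;
}

// Copy contents between two existing, compatible objects.
void deep_copy (ToyVector& dst, const ToyVector& src) {
  assert (dst.size () == src.size ());
  for (std::size_t i = 0; i < src.size (); ++i) {
    dst[i] = src[i];
  }
}
```

In this model, "ToyVector y = x;" makes y an alias of x, so a write
through y is visible through x.  Writes to the result of
createCopy(x), or to the target of deep_copy, leave x unchanged.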

* Kokkos Refactor updates

Development continues on the Kokkos Refactor version of Tpetra.  This
is a partial specialization of some Tpetra classes that uses the new
Kokkos programming model.  We plan eventually to switch to this
version of Tpetra and deprecate the old version.

This release adds a Kokkos Refactor version of Map.  Its GID->LID and
LID->GID conversion methods are now thread-safe and thread-scalable on
the host.  It also has a "device object" that you can use on CUDA
devices.

The Kokkos Refactor version of MultiVector now implements "dual view"
semantics.  This means that the Tpetra interface lets users mark
either host or device as modified, and synchronize between host and
device on demand, if necessary.
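
A minimal model of "mark as modified, then synchronize on demand"
might look as follows.  ToyDualView and its method names are
hypothetical, and both copies live in ordinary host memory here; this
is neither the Kokkos nor the Tpetra interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of dual view semantics.  The "device" copy is just a
// second host array in this sketch.
class ToyDualView {
public:
  explicit ToyDualView (const std::size_t n)
    : host_ (n, 0.0), device_ (n, 0.0),
      hostModified_ (false), deviceModified_ (false) {}

  std::vector<double>& hostView () { return host_; }
  std::vector<double>& deviceView () { return device_; }

  // The user declares which copy holds the newest data.  Modifying
  // both copies without a sync in between is an error in this model.
  void modifyHost () { assert (! deviceModified_); hostModified_ = true; }
  void modifyDevice () { assert (! hostModified_); deviceModified_ = true; }

  // Copy only if the other side was marked as modified.
  void syncToDevice () {
    if (hostModified_) { device_ = host_; hostModified_ = false; }
  }
  void syncToHost () {
    if (deviceModified_) { host_ = device_; deviceModified_ = false; }
  }

private:
  std::vector<double> host_;
  std::vector<double> device_;
  bool hostModified_;
  bool deviceModified_;
};
```

The point of the flags is that a sync is free when nothing changed,
so code may call it defensively before each use of the other copy.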

* Sparse matrix-matrix multiply performance improvements

This release includes many performance improvements to Tpetra's sparse
matrix-matrix multiply routine, and other supporting routines, such as
explicit transpose, and {im,ex}portAndFillComplete.  Tpetra now has a
sparse matrix-matrix multiply variant for implementing Jacobi
smoothing of matrices.  This is useful for algebraic multigrid.

* CrsMatrix: "Preserve Local Graph" defaults true (17 Mar 2014)
    
In CrsMatrix, the undocumented parameter "Preserve Local Graph" now
defaults to true.  This makes the following scenario work by default:
    
  1. Create a CrsMatrix A that creates and owns its graph (i.e., don't
      use the constructor that takes an RCP<const Tpetra::CrsGraph> or
      a local graph)
  2. Set an entry in the matrix A, and call fillComplete on it
  3. Create a CrsMatrix B using A's graph (obtained via
      A.getCrsGraph()), so that B has a const (a.k.a. "static") graph
  4. Change a value in B (you can't change its structure), and call
      fillComplete on B
    
Before this commit, the above scenario didn't work by default.  This
is because A's first fillComplete call would call
fillLocalGraphAndMatrix, which by default sets the local graph to
null.  As a result, from that point, A.getCrsGraph()->getLocalGraph()
returns null, which makes B's fillComplete throw an exception.  The
only way to make this scenario work was to set A's "Preserve Local
Graph" parameter to true.  (It defaulted to false.)
    
The idea behind this nonintuitive behavior was for the local sparse
ops object to own all the data.  This might make sense if it is a
third-party library that takes CSR's three arrays and copies them into
its own storage format.  In that case, it might be a good idea to free
the original three CSR arrays, in order to avoid duplicate storage.
However, resumeFill never had a way to get that data back out of the
local sparse ops object.  Rather than try to implement that, it's
easier just to make "Preserve Local Graph" default to true.
    
The possible data duplication mentioned in the previous paragraph can
never happen with the Kokkos Refactor version of CrsMatrix, since it
insists on controlling the matrix representation itself.  This makes
the code shorter and easier to read, and also ensures efficient fill.
That will in turn make the option unnecessary.

* Many bug fixes

The most important bug fixed is Bug 6069, an error in Distributor,
which would only manifest on MPICH.  This bug fix alone is enough
reason to upgrade to Trilinos 11.8.

Trilinos 11.6
-------------

* Gradual port to use (new) Kokkos

Tpetra will migrate to use the new Kokkos programming model.  The
tpetra/src/kokkos_refactor directory contains a preview of this
migration under development.  This will include backwards-incompatible
changes.  For example, MultiVector and Vector will have view
semantics, instead of their current container semantics.  This means
that their copy constructor and assignment operator (operator=) will
make shallow copies, instead of deep copies.  This will make Tpetra's
semantics more consistent with those of Kokkos.  In order to provide
deep copies, all Tpetra objects will get the following:

  - createCopy() method: returns a deep copy of its *this argument
  - deep_copy() nonmember function: copies the contents of one
    MultiVector into the contents of another existing MultiVector.
    This works like deep_copy() for Kokkos::View objects.

MultiVector already has both of these functions.  Thus, in order to
prepare for the backwards incompatible changes to Tpetra, users must
find all uses of the copy constructor and assignment operator, and
replace them with createCopy() and deep_copy(), respectively.  This
will affect at least the following packages, which have generic
adapters for Tpetra::MultiVector:

  - Amesos2 (MultiVecAdapter)
  - Anasazi (MultiVecTraits)
  - Belos (MultiVecTraits)
  - Xpetra (Xpetra::TpetraMultiVector)

* Accepted non-backwards compatible change to KokkosClassic, in which
  that subpackage changed its namespace from Kokkos to KokkosClassic.

Trilinos 11.4:
--------------

* Performance improvements to fillComplete (CrsGraph and CrsMatrix)

* Performance improvements to Map's global-to-local index conversions

* MPI performance optimizations

Methods that perform communication between (MPI) processes do less
communication than before.  This should improve performance,
especially for large process counts, of the following operations:

  - Creating a Map
  - Creating an Import or Export communication plan
  - Executing an Import or Export (e.g., in a distributed sparse
    matrix-vector multiply, or in global finite element assembly)
  - Calling fillComplete() on a CrsGraph or CrsMatrix

* Restrict a Map's communicator to processes with nonzero elements,
  and apply the result to a distributed object

Map now has two new methods.  The first, removeEmptyProcesses(),
returns a new Map with a new communicator, which contains only those
processes which have a nonzero number of entries in the original Map.
The second method, replaceCommWithSubset(), returns a new Map whose
communicator is an arbitrary subset of processes of the original Map's
communicator.  Distributed objects (subclasses of DistObject) also
have a new removeEmptyProcessesInPlace() method, for applying in place
the new Map created by calling removeEmptyProcesses() on the original
Map over which the object was distributed.

These methods are especially useful for algebraic multigrid.  At
coarser levels of the multigrid hierarchy, it is helpful for
performance to "rebalance" the matrices at those levels, so that a
subset of processes share the elements.  This leaves the remaining
processes without any elements.  Excluding them from the communicator
reduces the cost of all-reduces and other communication operations
necessary for creating the coarser levels of the hierarchy.
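
The effect on process ranks can be modeled serially.  The sketch
below assumes that the restricted communicator keeps the surviving
processes in their original relative order, and it uses -1 to mark an
excluded process; both are assumptions of this toy, and the function
name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial model only.  numLocalElts[p] is the number of Map entries
// owned by process p in the original communicator.  Returns, for each
// original rank, its rank in the restricted communicator, or -1 if
// the process owns no entries and is therefore excluded.
std::vector<int>
toyRanksAfterRemovingEmpty (const std::vector<std::size_t>& numLocalElts)
{
  assert (! numLocalElts.empty ());
  std::vector<int> newRank (numLocalElts.size (), -1);
  int next = 0;
  for (std::size_t p = 0; p < numLocalElts.size (); ++p) {
    if (numLocalElts[p] != 0) {
      newRank[p] = next++;
    }
  }
  return newRank;
}
```

With local counts {4, 0, 2, 0}, only original ranks 0 and 2 survive,
as ranks 0 and 1 of a two-process communicator.  An all-reduce over
the new communicator then involves two processes instead of four.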

* CrsMatrix: Native SOR and Gauss-Seidel kernels

These kernels improve the performance of Ifpack2 and MueLu.
Gauss-Seidel is a special case of SOR (Successive Over-Relaxation).
See the documentation of Ifpack2::Relaxation for details on the
algorithm, which is actually a "hybrid" of Jacobi between MPI
processes, and SOR (or Gauss-Seidel) within an MPI process.  The
kernels also include the "symmetric" variant (forward and backward
sweeps) of SOR and Gauss-Seidel.
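
For readers who want to see the local (within one process) part of
the algorithm, here is a textbook forward SOR sweep over a matrix in
compressed sparse row format.  It is a sketch, not Tpetra's kernel:
ToyCsr and toySorForwardSweep are hypothetical names, and the Jacobi
coupling between MPI processes is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy compressed sparse row (CSR) matrix; NOT Tpetra::CrsMatrix.
struct ToyCsr {
  std::vector<std::size_t> ptr; // row offsets, numRows + 1 entries
  std::vector<std::size_t> ind; // column indices
  std::vector<double> val;      // values
};

// One forward SOR sweep for A x = b, updating x in place.
// omega is the relaxation parameter; omega = 1 gives Gauss-Seidel.
void
toySorForwardSweep (const ToyCsr& A,
                    const std::vector<double>& b,
                    std::vector<double>& x,
                    const double omega)
{
  assert (! A.ptr.empty ());
  const std::size_t numRows = A.ptr.size () - 1;
  assert (b.size () == numRows && x.size () == numRows);

  for (std::size_t i = 0; i < numRows; ++i) {
    double diag = 0.0;
    double sum = b[i];
    for (std::size_t k = A.ptr[i]; k < A.ptr[i+1]; ++k) {
      const std::size_t j = A.ind[k];
      if (j == i) {
        diag += A.val[k];
      } else {
        sum -= A.val[k] * x[j]; // uses already-updated x[j] for j < i
      }
    }
    assert (diag != 0.0);
    x[i] = (1.0 - omega) * x[i] + omega * (sum / diag);
  }
}
```

The symmetric variant would follow this sweep with the same loop run
over the rows in reverse order.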

* CrsMatrix: Precompute and reuse offsets of diagonal entries

The (existing) one-argument version of CrsMatrix's getLocalDiagCopy()
method requires the following operations per row:

  1. Convert current local row index to global, using the row Map
  2. Convert global index to local column index, using the column Map
  3. Search the row for that local column index
    
Precomputing the offsets of diagonal entries and reusing them skips
all these steps.  CrsMatrix has a new method getLocalDiagOffsets() to
precompute the offsets, and a two-argument version of
getLocalDiagCopy() that uses the precomputed offsets.  The precomputed
offsets are not meant to be used in any way other than to be given to
the two-argument version of getLocalDiagCopy().  They must be
recomputed whenever the structure of the sparse matrix changes (by
calling insertGlobalValues() or insertLocalValues()) or is optimized
(e.g., by calling fillComplete() for the first time).
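
The idea of paying for the search once and reusing the result can be
sketched on a toy CSR matrix.  This model assumes that local row
index i and local column index i refer to the same global index, so
it skips the Map conversions; the names below are hypothetical and do
not reproduce the CrsMatrix interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy CSR matrix; NOT Tpetra::CrsMatrix.
struct ToyCsrMatrix {
  std::vector<std::size_t> ptr; // row offsets, numRows + 1 entries
  std::vector<std::size_t> ind; // local column indices
  std::vector<double> val;      // values
};

// Marker for a row that has no stored diagonal entry.
const std::size_t TOY_NO_DIAG = static_cast<std::size_t> (-1);

// Search each row once for its diagonal entry, and record the
// entry's offset relative to the start of the row.
std::vector<std::size_t> toyDiagOffsets (const ToyCsrMatrix& A) {
  assert (! A.ptr.empty ());
  const std::size_t numRows = A.ptr.size () - 1;
  std::vector<std::size_t> offsets (numRows, TOY_NO_DIAG);
  for (std::size_t i = 0; i < numRows; ++i) {
    for (std::size_t k = A.ptr[i]; k < A.ptr[i+1]; ++k) {
      if (A.ind[k] == i) {
        offsets[i] = k - A.ptr[i];
        break;
      }
    }
  }
  return offsets;
}

// Extract the diagonal with no search; rows without a stored
// diagonal entry get zero.
std::vector<double>
toyDiagCopy (const ToyCsrMatrix& A, const std::vector<std::size_t>& offsets)
{
  assert (offsets.size () + 1 == A.ptr.size ());
  std::vector<double> diag (offsets.size (), 0.0);
  for (std::size_t i = 0; i < offsets.size (); ++i) {
    if (offsets[i] != TOY_NO_DIAG) {
      diag[i] = A.val[A.ptr[i] + offsets[i]];
    }
  }
  return diag;
}
```

The offsets stay valid when only the values change, which is the
common case inside a nonlinear or time-stepping loop.  They become
stale as soon as entries are inserted or the storage is repacked.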

* CrsGraph,CrsMatrix: Added "No Nonlocal Changes" parameter to
  fillComplete()

The fillComplete() method accepts an optional ParameterList which
controls the behavior of fillComplete(), as opposed to behavior of the
object in general.  "No Nonlocal Changes" is a bool parameter which is
false by default.  Its value must be the same on all processes in the
graph or matrix's communicator.  If the parameter is true, the caller
asserts that no entries were inserted in nonowned rows.  This lets
fillComplete() skip the global communication that checks whether any
processes inserted any entries in nonowned rows.

* Default Kokkos/Tpetra Node type is now Kokkos::SerialNode

NOTE: This change breaks backwards compatibility.

Users expect that Tpetra by default uses "MPI only" for parallelism,
rather than "MPI plus threads."  These users were therefore
experiencing unexpected performance issues when the default Kokkos
Node type was threaded, as was the case if Trilinos' support for any
of the threading libraries (Pthreads, TBB, OpenMP) was enabled.
Trilinos detects and enables support for Pthreads automatically on
many platforms.  Therefore, after some discussion among Kokkos and
Tpetra developers, we decided to change the default Kokkos Node type
(and therefore, the default Node used by Tpetra objects) to
Kokkos::SerialNode.  This can be overridden at configure time by
specifying the following option to CMake when configuring Trilinos:

-D KokkosClassic_DefaultNode:STRING="<node-type>" 

where <node-type> is any of the official Kokkos Node types, such as
the following:

  - Kokkos::SerialNode (current default)
  - Kokkos::TBBNode
  - Kokkos::TPINode
  - Kokkos::OpenMPNode


Trilinos 11.0: 
--------------
* Significant performance improvements to local sparse matrix-vector multiply on CPU nodes. 
* Removed all deprecated methods.


Trilinos 10.12:
---------------

* Major (backwards-compatible, internal) refactor of the interaction between Tpetra::CrsGraph/CrsMatrix
  and their LocalSparseOps template parameter.
* Removed generic kernels for GPU nodes; GPU sparse kernel support now provided by CUSPARSE library; requires CUDA 4.1
* Additional methods in the Reduction/Transformation Interface (RTI); examples in tpetra/examples/MultiPrec
* Fixed major bugs in Tpetra Import/Export
* Minor bug fixes and documenting tests
* Numerous improvements to documentation
* Better MatrixMarket support in tpetra/util
* Added the ability to construct a Tpetra::Vector/MultiVector using user data (host-based nodes only)
- Deprecated: fillComplete(OptimizeStorageOption) on Tpetra::CrsGraph and Tpetra::CrsMatrix, in favor of a ParameterList.


Trilinos 10.7:
--------------

* Added (experimental) Reduction/Transformation Interface (RTI) to tpetra/rti; examples in tpetra/examples/RTInterface

Trilinos 10.6.4:
----------------

* Fixed some bugs in the build system
* Updates to support CUDA 4.0 and built-in Thrust

Trilinos 10.6.1:
----------------

* Added new HybridPlatform examples, under tpetra/examples/HybridPlatform. Anasazi and Belos examples are currently not built, though they are functional.
* Added new MultiVector GEMM tests, to evaluate potential interference of TPI/TBB threads and a threaded BLAS, to tpetra/test/MultiVector.
* Added Tpetra timers to Anasazi and Belos adaptors.
* Added test/documentation build of Tpetra::CrsMatrix against KokkosExamples::DummySparseKernelClass
* Fixed some bugs, added some bug verification tests, disabled by default.

Trilinos 10.6:
--------------

Significant internal changes in Tpetra for this release, mostly centered around
the CrsMatrix class. Lots of new features centering around multi-core/GPUs did
not make it in this release; look for more development in 10.6.1.

* Lots of additional documentation, testing and examples in Tpetra.
* Imported select Teuchos memory management classes/methods into the Tpetra namespace.
* Updates to the Anasazi/Tpetra adaptors for efficiency, node-awareness and debugging.
* Minor bug fixes, warnings addressed.

Changes breaking backwards compatibility:
* Tpetra CRS objects (i.e., CrsGraph and CrsMatrix) are required to be "fill-active" in order to be modified.
  Furthermore, they are required to be "fill-complete" in order to call multiply/solve.
  The transition between these states is mediated by the methods fillComplete() and resumeFill().
  This will only affect users that modify a matrix after calling fillComplete().
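
The two states can be pictured as a small state machine.  The sketch
below is a toy model, not the CrsMatrix implementation: ToyFillState,
insertValue, and multiply are hypothetical names, and the choices to
start in the fill-active state and to throw std::logic_error are
assumptions made for illustration.

```cpp
#include <stdexcept>

// Toy model of the fill-active / fill-complete states.
class ToyFillState {
public:
  // Assumption of this sketch: a new object starts out fill-active.
  ToyFillState () : fillActive_ (true), numInserts_ (0) {}

  bool isFillActive () const { return fillActive_; }
  bool isFillComplete () const { return ! fillActive_; }

  void fillComplete () { fillActive_ = false; }
  void resumeFill () { fillActive_ = true; }

  // Modification is only legal while fill-active.
  void insertValue () {
    if (! fillActive_) {
      throw std::logic_error ("call resumeFill() before modifying");
    }
    ++numInserts_;
  }

  // Multiply/solve is only legal while fill-complete.
  void multiply () const {
    if (fillActive_) {
      throw std::logic_error ("call fillComplete() before multiply");
    }
  }

  int numInserts () const { return numInserts_; }

private:
  bool fillActive_;
  int numInserts_;
};
```

Code that modifies a matrix after fillComplete() therefore needs a
resumeFill() call before the modification and another fillComplete()
call before the next multiply or solve.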

Newly deprecated functionality:
* CrsGraph/CrsMatrix persisting views of graph and matrix data are now
  deprecated. New, non-persisting versions of these are provided.


Trilinos 10.4:
--------------

The Trilinos release 10.4 came at an unfortunate time, as we were in the middle
of a medium refactor in Kokkos/Tpetra in order to better support GPU and
multicore nodes. Therefore, there has been some potential regression in performance
for GPU nodes; and some known issues regarding multi-core CPU performance (especially 
on NUMA platforms) have not been addressed. The rest of this refactor is likely to happen 
in the development branch, and will not be released until 10.6 (estimated for September 2010). 

Users that require access to this code should contact a Trilinos developer regarding access to the 
development branch repository. 

(*) Improvements to doxygen documentation.
- added ifdefs to support profiling/tracing of host-to-device memory transfers
  These are enabled via CMake options
  -D KokkosClassic_ENABLE_CUDA_NODE_MEMORY_PROFILING:BOOL=ON
  -D KokkosClassic_ENABLE_CUDA_NODE_MEMORY_TRACE:BOOL=ON

(*) VBR capability (experimental)
- added variable-block row matrix (VbrMatrix) and underlying support classes (BlockMap, BlockMultiVector)
- added power method example of VBR classes

(*) CrsMatrix:
- now implements DistObject, allowing import/export capability with CrsMatrix objects (experimental)
- combined LocalMatVec and LocalMatSolve objects into a single template parameter. (non BC)
  this required changes to CrsMatrixMultiplyOp and CrsMatrixSolveOp operators as well. (non BC)
- access default for this type via Kokkos::DefaultKernels
- removed cached views of object data. this should have no effect on CPU-based nodes, but will result in slower performance
  for GPU-based nodes. this regression is a result of the release happening mid-refactor. it will not be addressed in the 
  10.4.x sequence.
- bug fixes regarding complex cases involving user-specified column maps and graphs.

(*) DistObject interface:
- added createViews(), releaseViews() methods to allow host-based objects to temporarily cache views of host data during import/export procedure

(*) Map: 
- added new non-member constructors: createContigMap(), createWeightedContigMap(), createUniformContigMap()
- fixed some bugs regarding use of unsigned Ordinal types
- fixed MPI-stalling bug in getRemoteIndexList()

(*) MultiVector:
- added view methods offsetView() and offsetViewNonConst() to create a MultiVector view of a subset of rows
- added non-member constructor Tpetra::createMultiVector(map,numVecs)

(*) Vector:
- added non-member constructor Tpetra::createVector(map)

(*) Tpetra I/O:
- added Galeri-type methods for generating pedagogical matrices (currently, only 3D Laplacian)

(*) External adaptors (experimental)
- Efficiency improvements for Belos/Tpetra adaptors
- Brought Anasazi/Tpetra adaptors back online
