README file for dcmf-conduit
============================
Rajesh Nishtala <rajeshn@cs.berkeley.edu>
$Revision: 1.11 $


GASNet's DCMF conduit is the native port of GASNet to the BlueGene/P
compute nodes. It uses IBM's Deep Computing Messaging Framework for
the lower level communication between the nodes. DCMF also supports
running over other network layers such as UDP, however we currently
only support DCMF over BlueGene/P. To run GASNet over other network
APIs (such as UDP) we recommend compiling directly with the
appropriate network conduit listed in GASNet README. 

GASNet also requires at least version 0.2.0 of DCMF. DCMF can be
installed by the users without root access. The DCMF library along
with documentation and installation instructions can be found at:

http://dcmf.anl-external.org/wiki/index.php/Main_Page

Section: Building GASNet for BlueGene/P
 Since the BlueGene/P compute nodes run a different processor than the
 front end nodes GASNet needs to be cross-compiled from the front end
 nodes. To aid the process we provide a script:
 other/contrib/cross-configure-bgp 
 that will handle the cross compilation for BlueGene/P with default
 environment variables and parameters for the cross
 compilation. Unless there are any major problems, these defaults seem
 to be the best. To build GASNet in its default configuration:

 cd $GASNET_SRC_DIR
 ln -s other/contrib/cross-configure-bgp .
 ./Bootstrap
 cd $GASNET_BUILD_DIR
 $GASNET_SRC_DIR/cross-configure-bgp

By default GASNet will look in /bgsys/drivers/ppcfloor for the
required DCMF headers and libraries. This directory can be overridden
with the $DCMF_HOME environment variable or
with --with-dcmf-home=/path/to/dcmf passed to configure. It assumes the following
directory structure.

 DCMF headers:
  + $DCMF_HOME/comm/sys/include (can be overridden with $DCMF_INCLUDE)

 DCMF libraries:
  + $DCMF_HOME/comm/sys/lib (can be overridden with $DCMF_LIBDIR)

 BlueGene Runtime Headers:
  + $DCMF_HOME/arch/include (can be overridden with $DCMF_SYS_INCLUDE)

 BlueGene Runtime Libraries:
  + $DCMF_HOME/runtime/SPI (can be overridden with $DCMF_SYS_LIBDIR)


 DCMF can be built with either the BlueGene XLC or BlueGene GCC. The
 default is XLC because it yields higher performance than GCC. However
 to use GCC, you may set $USE_GCC=1 in the environment before configure to use the
 gcc compilers. By default we assume the gcc compilers are in 
 /bgsys/drivers/ppcfloor/gnu-linux/bin . If this is not the case then
 this can be modified by editing the cross-configure-bgp script in the
 appropriate places.

Section: Running on the BlueGene/P Systems
 We also provide a script $GASNET_BUILD_DIR/dcmf-conduit/contrib/gasnetrun_dcmf
 that is a wrapper around qsub or cobalt-mpirun depending on the
 context in which it is called. If it is called outside a batch script
 then the script will invoke qsub to submit the batch script on to the
 queue. If it is called within a batch script then it will invoke
 cobalt-mpirun with the appropriate arguments. Currently
 gasnetrun_dcmf only supports the cobalt batch system. However, for
 other job launch mechanisms spawning a GASNet job is very similar to
 spawning any MPI job thus we refer the reader to the appropriate
 documentation in case. gasnetrun_dcmf also defaults to using the
 default thread mapping (XYZT). The BG_MAPPING environment variable configures where the
 processes are placed in the torus (see IBM's BlueGene/P Documentation). 

 Due to a known cobalt issue, gasnetrun_dcmf can only support limited
 configurations inside the batch mode. If the number of GASNet processes is less than the
 size of the allocated BlueGene/P partition then one GASNet job will
 be spawned per ComputeNode (i.e. smp mode). If the number of GASNet
 processes is exactly twice the size of the allocated partition then
 the jobs will be spawned with two processes per node (i.e. dual
 mode). If the number of GASNet processes is exactly 4 times the
 partition size then the job will be spawned with 4 processes per node
 (i.e. VN node mode). At the time of writing this document, the Argonne
 BlueGene/P installation's minimum partition size was 64
 nodes.  Even if a user requests 4 nodes, an entire partition is
 allocated and therefore any node count under 64 will be run in
 smp mode. However if the job is invoked from outside the batch script
 then our wrapper around qsub will yield the correct layout. Run 
 
 $GASNET_BUILD_DIR/dcmf-conduit/contrib/gasnetrun_dcmf -h 
 
 for more information.
 
 Subsection: Recognized Environment Variables for DCMF Conduit
  The following are a list of environment variables that DCMF conduit
  recognizes to adjust parameters that could influence the parameters. 

  GASNET_BARRIER=DCMF_BARRIER (default)
    + Enables dcmf-specific implementation of GASNet barriers.
      If one does not wish to use the dcmf-specific barrier, then
      GASNET_BARRIER=DISSEM is the recommended alternative.
  GASNET_DCMF_PRINT_TORUS_LOCATION (yes/no): 
    + Each nodes prints it's X,Y,Z,T location in the 3D Torus (set
      to 1 to enable)
  GASNET_DCMF_MAX_SEG_SIZE (in MB or GB):
    + Set the largest size for the GASNet segment. Defaults to 3/4
      physical memory. (i.e. 384 MB in VirtualNode mode, 768MB in Dual
      mode and 1536MB in SMP mode)
  GASNET_DCMF_EAGER_LIMIT (in Bytes):
    + Set the maximum size at which the DCMF_EAGER protocol is used
      (defaults to 1024 bytes)
  GASNET_DCMF_MAX_REPLAY_BUFFERS (a count):
    + USE WITH CAUTION.  Sets how many unacknowledged active message 
      can be in flight. Increase the value to allow more aggressive
      network injection at the risk of increasing the overall
      congestion in the network. (defaults to 1024)
  GASNET_DCMF_MAX_INCOMING_BUFFERS (a count):
    + USE WITH CAUTION. Sets how many active messages a node may
      accept before sending out negative acknowledgment. Increase
      the value to allow more aggressive network traffic at the risk
      of swamping the target nodes with active message
      requests.(defaults to 1024)

  See the next section (Use of DCMF Native Collectives) for more.

  All the BG_* environment variables documented by IBM
  
  All the standard GASNet environment variables (see top-level README)

Section: Use of DCMF Native Collectives

DCMF provides an optimized implementation of collectives that can leverage
IBM BlueGene supercomputers' network hardware features for high performance
collective communication.  The GASNet collectives in DCMF conduit have two
implementations: a portable implementation based on DCMF point-to-point
communication primitives and a native implementation based on DCMF collectives.
The current GASNet collectives implementation based on native DCMF collectives
includes broadcast, exchange (alltoall) and barrier operations in GASNet SEQ
mode.  The environment variables below control whether to use the native
DCMF collectives for GASNet collectives.

Each DCMF collective operation may be implemented by one or more communication
protocols (algorithms) which can be individually enabled or disabled.  The
GASNet DCMF conduit would choose the best applicable protocol among those
enabled.  Please refer to IBM BlueGene documentation listed in the next section
for details about the applicability and restrictions of DCMF collectives.

  GASNET_USE_DCMF_COLL (yes/no):  Default=yes
    Enable/disable use of the native DCMF collectives for GASNet collectives.
    + Setting this environment variable to "1" (or "yes") will enable the
      use of native DCMF collectives, subject to the control of the per-
      operation environment variables described below.
    + Setting this environment variable to "0" (or "no") will disable the native
      DCMF collectives and use the default portable GASNet collectives based on
      point-to-point communication primitives instead.  In this case the
      environment variables in the remainder of this section are ignored.

  GASNET_USE_DCMF_BCAST (yes/no):  Default=yes
    Enable/disable use of the native DCMF collectives for GASNet broadcast.
    + Setting this environment variable to "1" (or "yes") will enable the native
      DCMF broadcast.
    + Setting this environment variable to "0" (or "no") disables the native
      DCMF broadcast.
    + There are three DCMF broadcast algorithms available, which are considered
      with the following precedence (most favored to least):
         TREE_BROADCAST
         TORUS_RECTANGLE_BROADCAST
         TORUS_BINOMIAL_BROADCAST
      All are enabled by default, but any of these may be disabled individually
      by setting to "0" (or "no") the appropriate environment variable(s) from
      the following list:
         GASNET_DCMF_TREE_BCAST
         GASNET_DCMF_TORUS_RECTANGLE_BCAST
         GASNET_DCMF_TORUS_BINOMIAL_BCAST
      Because the TORUS_BINOMIAL_BROADCAST protocol is applicable to all shapes
      of compute node allocations, it should not normally be disabled.

  GASNET_USE_DCMF_EXCHANGE (yes/no):  Default=yes
    Enable/disable use of the native DCMF collectives for GASNet exchange
   (alltoall).
    + Setting this environment variable to "1" (or "yes") will enable the use of
      native DCMF exchange (alltoall).
    + Setting this environment variable to "0" (or "no") disables use of native
      DCMF exchange (alltoall) and uses default GASNet portable exchange instead.


Section: Optimizing Your Code for the BlueGene/P
 IBM has a variety of resources that will help optimize parallel codes
 to best take advantage of the BlueGene/P network hardware. All the
 serial optimizations that will help applications using MPI also apply
 to applications using GASNet.

 We recommend the following resources:

 Overview of how to run on BG/P from Bob Walkup of IBM Research:
 + https://wiki.alcf.anl.gov/images/f/f3/Bgp.doc   
 
 IBM RedBooks:
 + IBM System Blue Gene Solution: Blue Gene/P Application Development
   http://www.redbooks.ibm.com/abstracts/sg247287.html
 + IBM System Blue Gene Solution: High Performance Computing Toolkit for Blue Gene/P 
   http://www.redbooks.ibm.com/abstracts/redp4256.html

 IBM Documentation:
 + http://www.nccs.gov/wp-content/training/2008_bluegene/ORNLCompOptimization.pdf
 + http://publib.boulder.ibm.com/infocenter/compbgpl/v9v111/index.jsp?topic=/com.ibm.bg9111.doc/ 

 BG/L has the same dhummer floating point unit so many of the BG/L
 optimizations still aply to BG/P.  BG/L uses a ppc440d processor and
 BG/P uses a ppc450d processor. 

 + http://www.cbrc.jp/symposium/bg2006/PDF/Martin.pdf
 + http://www-01.ibm.com/support/docview.wss?uid=swg27007511
 + http://www-01.ibm.com/support/docview.wss?uid=pub1sc10431000 


Section: Known Issues
+ The GASNet job spawning scripts only know about the cobalt job control system. This is the current
default on Surveyor and Intrepid at Argonne.


Section: Torus Information
Virtual Node Mode (4 processes per node)

Node/Processor Count  Dims:(X,Y,Z,T) isTorus?:(X,Y,Z,T)
64/256                (4,4,4,4)      (0,0,0,1)
128/512               (4,4,8,4)      (0,0,0,1)
256/1024              (8,4,8,4)      (0,0,0,1)
512/2048              (8,8,8,4)      (1,1,1,1)
1024/4096             (8,8,16,4)     (1,1,1,1)
2048/8192             (8,8,32,4)     (1,1,1,1)
4096/16384            (8,16,32,4)    (1,1,1,1)
8192/32768            (8,32,32,4)    (1,1,1,1)
16384/65536           (16,32,32,4)   (1,1,1,1)
24576/98304           (24,32,32,4)   (1,1,1,1)
32768/131072          (32,32,32,4)   (1,1,1,1)
40960/164840          (40,32,32,4)   (1,1,1,1)

