GASNet inter-Process SHared Memory (PSHM) design
---------------------------------------------
$Revision: 1.20.2.1 $

Document by:
    Dan Bonachea <bonachea@cs.berkeley.edu>
    Paul H. Hargrove <PHHargrove@lbl.gov>
    Filip Blagojevic <FBlagojevic@lbl.gov>
Implementation by:
    Jason Duell
    Filip Blagojevic <FBlagojevic@lbl.gov>
    Paul H. Hargrove <PHHargrove@lbl.gov>

Goal:
----
Provide GASNet with a mechanism to communicate through shared memory among
processes on the same compute node.  This is expected to be more robust than
pthreads (which greatly complicates the Berkeley UPC runtime, and prevents
linking to any numeric libraries that are not thread-safe).  It is also
expected to display lower latency than use of a network API's loopback
capabilities (though the network hardware might provide other benefits such
as asynchronous bulk memory copy w/o cache pollution).

We appreciate your feedback related to PSHM (both positive and
negative) and would be happy to work with you to improve PSHM.

To use:
------
In the current release, GASNet's PSHM support is enabled by default only on
Linux.  On all other platforms, one must pass --enable-pshm if PSHM support
is desired.  On Linux PSHM can be disabled by passing --disable-pshm at
configure time.

The PSHM support in GASNet can operate via three possible mechanisms: POSIX
shared memory, SystemV shared memory, or mmap()ed disk files.  When PSHM is
enabled, the default is for the configure step to probe only for support
via POSIX shared memory (except on MacOS, where it is known to be broken).
If no POSIX shared memory support is found, there is no automatic fallback
to any other mechanism.  So, if one wishes to use SystemV shared memory or
mmap()ed files, one should explicitly disable the POSIX support and enable
the desired mechanism:

	Usage Summary (flags to be passed to the configure script):
	----------------------------------------------------------
          OFF: --disable-pshm
        POSIX: --enable-pshm
         SYSV: --enable-pshm --disable-pshm-posix --enable-pshm-sysv
         FILE: --enable-pshm --disable-pshm-posix --enable-pshm-file

        On Linux "--enable-pshm" is the default.
        On all other platforms "--disable-pshm" is the default.

PSHM includes an environment variable for controlling the memory used for
Active Message (AM) traffic between processes on the same compute node:

  GASNET_PSHMNET_QUEUE_DEPTH
     Minimum number of PSHM-AMs a process must be capable of sending before
     it may stall (default 32).  The shared memory allocated on a compute
     node with P processes is roughly (2 * P * Depth * MaxMsgSz).
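
As a rough illustration of the estimate above, the following sketch computes
(2 * P * Depth * MaxMsgSz) in bytes.  The function name is hypothetical, and
the maximum message size is a caller-supplied assumption, since the actual
value is internal to GASNet and is not stated in this document.

```c
#include <assert.h>
#include <stdint.h>

/* Rough estimate, in bytes, of the shared memory used for PSHM-AM queues
 * on one compute node, following the approximation given above:
 *     2 * P * Depth * MaxMsgSz
 * max_msg_sz is an assumption supplied by the caller; the real value is
 * internal to GASNet.  Overflow is not checked in this sketch. */
uint64_t demo_pshmnet_estimate_bytes(uint64_t procs, uint64_t depth,
                                     uint64_t max_msg_sz)
{
    assert(procs > 0 && depth > 0 && max_msg_sz > 0);
    return 2u * procs * depth * max_msg_sz;
}
```

For example, 8 processes at the default depth of 32 with an assumed 64 KiB
maximum message size give an estimate of 32 MiB, and doubling
GASNET_PSHMNET_QUEUE_DEPTH doubles the estimate.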

Parameter Settings:
------------------
Although recommended as the first option, POSIX shared memory is not
available on all systems; even systems running Linux may not be configured
to support it.  In the absence of POSIX shared memory, users are advised to
use SystemV shared memory as the next-best option.

In the absence of both POSIX and SystemV shared memory, a user may try
using mmap()ed disk files. However, on some systems we see significant
performance degradation when using files (apparently due to committing the
changes from memory to disk).

On most operating systems the amount of available SystemV shared memory
and the number of shared memory segments are controlled by the kernel
parameters shmmax, shmall and shmmni:
   shmmax = largest size of a shared memory segment (in bytes)
   shmall = total amount of memory allocatable as shared (in pages)
   shmmni = maximum number of shared memory segments

An insufficient amount of SystemV shared memory will lead to failures at
start-up of any application using a runtime configured to use PSHM over
SystemV.  Setting these parameters is system-specific and requires
administrator privileges.

* Examples to set:

   - Linux:	
	sudo /sbin/sysctl -w kernel.shmmax=<large value>
	sudo /sbin/sysctl -w kernel.shmall=<large value>
	sudo /sbin/sysctl -w kernel.shmmni=<larger than number of
					    PSHM processes + 1>
   - MacOS:
	sudo /sbin/sysctl -w kern.sysv.shmmax=<large value>
	sudo /sbin/sysctl -w kern.sysv.shmall=<large value>
	sudo /sbin/sysctl -w kern.sysv.shmmni=<larger than number of
					 PSHM processes + 1>

   - Various BSD flavors:
        FreeBSD: follow MacOS example, replacing "sysv" with "ipc".
        NetBSD: follow MacOS example, replacing "sysv" with "ipc".
        OpenBSD: follow MacOS example, replacing "sysv" with "shminfo".

   - Solaris:
        Depends on version.  Please see the Sun/Oracle documentation.

   - Cygwin:
        See Cygwin under "change parameters permanently", below.

* To change parameters permanently:

   - Linux:
	Add the following lines to /etc/sysctl.conf (sudo required):
	kernel.shmmax=<large value>
	kernel.shmall=<large value>
	kernel.shmmni=<larger than number of
	               PSHM processes + 1>

	To reload the new settings:
	sudo /sbin/sysctl -p /etc/sysctl.conf

   - MacOS:
	Add the following lines to /etc/sysctl.conf (sudo required):
	kern.sysv.shmmax=<large value>
	kern.sysv.shmall=<large value>
	kern.sysv.shmmni=<larger than number of
	                  PSHM processes + 1>

	To reload the new settings: reboot the machine.

   - Various BSD flavors:
        FreeBSD: follow MacOS example, replacing "sysv" with "ipc".
        NetBSD: follow MacOS example, replacing "sysv" with "ipc".
        OpenBSD: follow MacOS example, replacing "sysv" with "shminfo".

   - Solaris:
        Depends on version.  Please see the Sun/Oracle documentation.

   - Cygwin:
        If you have not done so yet, please read the Cygwin documentation
        on "cygserver-config" to create an initial /etc/cygserver.conf
        and start the server as a Windows service (optional).
        You may then edit /etc/cygserver.conf to set
          kern.ipc.shmmaxpgs
          kern.ipc.shmmni
          kern.ipc.shmseg
        Except for shmmaxpgs, the defaults are often large enough.
        These configuration values are only read when cygserver starts.
        So, read the Cygwin documentation to determine if/how to restart
        the cygserver service.
        Under Cygwin-1.5 you may also need to add "server" to the value
        of the CYGWIN environment variable.  Again, you should see the
        Cygwin documentation for more information on this subject.

IMPORTANT, SYSTEM CLEANING:
--------------------------
If a GASNet application using PSHM is terminated before completing the
initialization phase, shared memory objects may remain in the system.
A large amount of memory or disk space can
remain allocated, preventing users from fully utilizing all available
hardware resources.

In the SystemV case, the allocated (but not released) shared memory
segments can be listed via the "ipcs" command, and can be removed via the
"ipcrm" command.  Note that on systems with a batch scheduler, the
"ipcs" and "ipcrm" commands need to be run on the compute nodes.

In the mmap()ed file case, the allocated but not released shared memory files
can be found in the directory pointed to by the TMPDIR environment variable
(default: /tmp). These files are named with the prefix GASNT (the lack of an
'E' is not a typo), and can be deleted using the "rm" command.

In the case of POSIX shared memory, the implementation is system-specific.
In the case of Linux and Solaris, POSIX shared memory objects are visible
in the file system.  For Linux the default location is /dev/shm and on
Solaris the default is /tmp.
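
The lifetime of a POSIX shared memory object can be illustrated with the
following minimal sketch.  The function names and the object name used with
them are made up for illustration; GASNet's own object names and cleanup
logic are not described in this document.  The point is that a named object
persists until it is explicitly unlinked, which is why an application that
dies early may leave objects behind.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create (or open) a named POSIX shared memory object of `len` bytes and
 * map it into this process.  Returns NULL on failure. */
void *demo_shm_map(const char *name, size_t len)
{
    assert(name != NULL && len > 0);
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)len) != 0) { close(fd); return NULL; }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor is closed */
    return (p == MAP_FAILED) ? NULL : p;
}

/* Remove the name.  The memory itself is released once the last mapping
 * of the object has been unmapped. */
int demo_shm_remove(const char *name)
{
    return shm_unlink(name);
}
```

On Linux, an object created this way with the name "/demo" would be visible
as /dev/shm/demo until demo_shm_remove("/demo") is called.  On older glibc
versions, linking with -lrt may be required.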

Scope:
-----
* GASNet segment via PSHM only supported for SEGMENT_FAST or SEGMENT_LARGE
  (not meaningful for SEGMENT_EVERYTHING mode)
* May eventually support AM-over-PSHM for SEGMENT_EVERYTHING (but not yet)
* Applicable both w/ and w/o pthreads

Terminology:
-----------
* node: each UNIX process running GASNet
* supernode: 1 or more nodes with cross-mapped segments using PSHM support
* supernode peers: nodes which share a supernode

Interface notes:
---------------
* All node processes call gasnet_init(), each is a separate GASNet node
* PSHM is enabled/disabled at configure time and GASNETI_PSHM_ENABLED is
  #defined to either 1 or 0.  Each conduit can then #define GASNET_PSHM
  to 1 if it implements PSHM support.
* gasnetc_init() performs supernode discovery, using OS-appropriate (or
  conduit-specific) mechanisms to figure out which nodes are capable of
  sharing memory with which other nodes:
   - unconditionally calls gasneti_nodemapInit() (to drive "discovery")
   - calls gasneti_pshm_init() only if PSHM support is enabled (to set up data)
* MaxLocal/Global return values reflect the amount of segment space divided
  evenly among the supernode peers, and each node passes a size to
  gasnet_attach reflecting the per-node segment size it wants.
* gasnet_attach takes care of mapping each node's own segment as usual, but
  also maps the segments of supernode peers into each node's VM space using
  OS-appropriate mechanisms (shm_open()+mmap(), shmget()+shmat(), etc.).
* Nodes on a supernode typically have different virtual address mappings of
  the segments on that supernode.  The mappings are typically not contiguous
  either.
* Client calls getSegmentInfo to get the location of its segment and those of
  other nodes (as always)
* seginfo_t for node X reflects the shared segment belonging to X, but also
  includes a supernode identifier (node_info) so nodes can see which nodes
  share their supernode
* Client may directly load/store into the segments of any node sharing their
  supernode (currently implemented in Berkeley UPC runtime library)
* remotely-addressable segment restrictions on gasnet_put/get/AMLong apply to
  the individual segments - i.e. gasnet_put() to an address in the segment of
  node X must give node X as the target node, not some other supernode peer

Restrictions:
------------
* gasnet_hsl_t's are node-local and while they might reside in the segment,
  they may not be accessed by more than one node in a supernode
  - we can/should add a debug-mode check for this (also applies to shmem-conduit)
* Use of GASNet atomics in the segment is allowed, but they must not be weak
  atomics (which means using the explicitly "strong" ones in client code).

Closed (previously "Open") questions:
------------------------------------
Q1) Do we need a separate build or separate configure of libgasnet and/or
    libupcr with PSHM enabled/disabled?
A1) Since the set of conduits supported by PSHM was initially a small subset
    of the total list, we chose not to complicate the UPC compiler with this.
    Thus we've chosen to configure everything (UPCR+gasnet) w/ --enable-pshm
    or w/o.  The number of conduits supporting PSHM is now irrelevant since in
    a PSHM-enabled build of GASNet any conduits not supporting PSHM are simply
    built w/o it (as opposed to not built at all as was once the case).

Q2) If we want to use the same build, then how should GASNET_ALIGNED_SEGMENTS
    definition behave?  Never true when any supernode contains more than one
    node, but don't know that until runtime.
A2) We assume that you don't use PSHM unless also using > 1 proc/node.
    May also revisit if we don't configure PSHM as a distinct build.

Q3) Can we get away with always connecting segments after all processes are
    created, or do we need to fork after setting up shared memory segments?
    Will drivers & spawners even allow that?
    If we decide that a fork is required after job launch, then it should
    definitely be done by the conduit, not the client code. But how would the
    interface look? (this would very likely break MPI interoperability)
A3) All supported conduits attach to segments in gasnet_attach().  We
    don't need to worry about fork() at all (except that smp-conduit now has a
    fork-based spawner inside gasnetc_init()).

Q4) Does the client code between init/attach need to know the supernode
    associations? (eg to make segsize decision)
A4) So far we have not seen a need for this (though internal to GASNet we do).

Q5) Can/do we still get allocate-on-first-write mapping for the segment?
    - If so, who's responsible for establishing processor/memory affinity
      with first touch? (probably the client)
A5) We have each node mmap() its own segment before any cross-mapping is done
    which should ensure locality if the OS does allocation at mmap() time.
    We currently have the client doing first-touch to deal with the case that
    the OS does page frame allocation on touch, rather than mmap().
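
The first-touch step described in A5 might look like the following sketch.
The function names are hypothetical, and an anonymous shared mapping stands
in for a node's own segment; the actual GASNet and client code are not
shown in this document.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of shared, zero-filled memory (a stand-in for a node
 * mapping its own segment).  Returns NULL on failure. */
void *demo_map_segment(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Read and rewrite one byte per page, so that an OS which allocates page
 * frames on touch does so while running on the calling process.  Existing
 * contents are preserved.  Returns the number of pages touched. */
size_t demo_first_touch(void *base, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(base != NULL && ps > 0);
    volatile unsigned char *p = base;
    size_t pages = 0;
    for (size_t off = 0; off < len; off += (size_t)ps) {
        p[off] = p[off];  /* volatile forces a real load and store */
        pages++;
    }
    return pages;
}
```

Each process would call demo_first_touch() on its own segment, ideally
after being bound to the CPU or NUMA domain where it will run.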

Open questions:
--------------
* How do we handle 8 or 16-way SMPs on 32-bit platforms where VM space is
  already tight, or OS's where the limit on sharable memory is small? This
  design would make our per-node segsizes rather small. Do we want a mode
  where segments are not cross-mapped, but the gasnet_put/get can bypass the
  NIC using a two-copy scheme through bounce buffers?
  - This bounce buffer mode could potentially also help for EVERYTHING mode
    (without pshm segments), although due to attentiveness issues, it may be
    slower than using loopback RDMA
  - Is this mode just the extended-ref using AM-over-PSHM?
* Do we ever want to allow supernodes to share a physical node?
  (eg to increase segment size or to leverage NUMA affinity)
  - if so, need an interface to specify this (probably environment variables)
* Will there be contention with MPI for resources (and should we care)?
  
Known Problems / To do:
----------------------
* The mechanism we are using to probe for maximum segment size works fine on a
  system with plenty of memory, but dies on systems with less.  The
  workaround is to set GASNET_MAX_SEGSIZE small enough for a given system.
* There are still error cases that will leak shared memory.

Status:
------
* The entire GASNet and Berkeley UPC test suites are run on the platforms
  which support PSHM, and it is considered stable on Linux, AIX, Solaris,
  {Free,Net,Open}BSD, BG/P and BG/Q.
* IBM BG/P platform-specific notes:
  - PSHM does not work at all in "SMP" mode, only "DUAL" or "VN"
  - In "VN" mode one cannot run a hybrid GASNet+MPI code (appears that some
    scarce resource is exhausted in this case, but we have no details).
  - Currently one must manually set up a few things to use PSHM on BG/P
    + Must set CROSS_HAVE_SHM_OPEN=1 on configure command line
    + Shared segment size must be limited to a lower-than-default value
    + BG_SHAREDMEMPOOLSIZE env var must be set to fit the shared segment plus
      extra for Active Message buffers
  - Example environment variable settings for 200MB shared heap in VN mode
    when using Berkeley UPC's upcrun command:
      UPC_SHARED_HEAP_SIZE=200M
      BG_SHAREDMEMPOOLSIZE=820
  - Example environment variable settings for 400MB shared heap in DUAL mode
    when using Berkeley UPC's upcrun command:
      UPC_SHARED_HEAP_SIZE=400M
      BG_SHAREDMEMPOOLSIZE=810
* IBM BG/Q platform-specific notes:
  - NOTE: With GASNet 1.22.0 there are known issues with leaking of shared
    memory using BG/Q driver V1R2M1.  It is unknown at the time of this
    release if the fault lies in GASNet or the BG/Q driver.
    The result is that BG_SHAREDMEMSIZE may need to be set much higher
    than expected unless GASNET_MAX_SEGSIZE has been set to a value very
    near to the true segment size to be requested by the client.
  - Currently one must manually set up a few things to use PSHM on BG/Q
    + Must set CROSS_HAVE_SHM_OPEN=1 on configure command line
    + Set env var BG_SHAREDMEMSIZE to fit the shared segment plus extra
      space for Active Message buffers.  The value is in units of MB and
      should be prefixed with '+' to ADD the value to the system default.
    + Set BG_MAPCOMMONHEAP=1 or the entire shared heap will be deducted
      from the heap of the first process on each node.
  - Example environment variable settings for 200MB shared heap in "c4"
    mode when using Berkeley UPC's upcrun command:
      UPC_SHARED_HEAP_SIZE=200M
      BG_SHAREDMEMSIZE=+820
      BG_MAPCOMMONHEAP=1
  - Example environment variable settings for 400MB shared heap in "c2"
    mode when using Berkeley UPC's upcrun command:
      UPC_SHARED_HEAP_SIZE=400M
      BG_SHAREDMEMSIZE=+810
      BG_MAPCOMMONHEAP=1
* Cray XE and XC platform-specific notes:
  - The Gemini and Aries conduits support PSHM, and the provided
    cross-configure scripts pass --enable-pshm-xpmem.
* MacOS X platform-specific notes:
  - MacOS X with POSIX shared memory is NOT supported because we appear to 
    trigger a kernel memory leak.
  - SystemV shared memory is a valid choice:
       --enable-pshm --enable-pshm-sysv
  - Use of mmap()ed files has been seen to cause VERY slow start-up.
* {Free,Open,Net}BSD platform-specific notes:
  - FreeBSD supports POSIX shared memory and has been well tested
  - OpenBSD and NetBSD do not support POSIX shared memory, but do
    support SystemV:
       --enable-pshm --enable-pshm-sysv

* GASNet conduits known NOT to work:
  - SHMEM conduit does not support PSHM, but there is no reason to think
    that doing so would be constructive.
  Keep in mind that if you use one of these conduits on a platform with the
  necessary support for PSHM, you may still configure with --enable-pshm to
  get PSHM support in other conduits (e.g. SMP and MPI), and these few
  conduits will still build (they will simply be missing PSHM support).
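
The Blue Gene example settings in the platform notes above are consistent
with a simple rule of thumb, sketched below.  This is an inference from
those examples and not a documented GASNet formula: it assumes 4 processes
per node in "VN" and "c4" modes, 2 processes per node in "DUAL" and "c2"
modes, and a margin of 5 MB per process for Active Message buffers.  The
function name is hypothetical.

```c
#include <assert.h>

/* Shared memory pool size in MB: the per-process shared heap plus an
 * assumed per-process margin, multiplied by the processes per node.
 * The 5 MB margin used in the examples is inferred, not documented. */
unsigned demo_bg_pool_mb(unsigned heap_mb, unsigned procs_per_node,
                         unsigned margin_mb_per_proc)
{
    assert(procs_per_node > 0);
    return procs_per_node * (heap_mb + margin_mb_per_proc);
}
```

With these assumptions, a 200 MB heap with 4 processes per node gives 820,
and a 400 MB heap with 2 processes per node gives 810, matching the values
shown for BG_SHAREDMEMPOOLSIZE and (after the '+' prefix) BG_SHAREDMEMSIZE.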
