High Availability of condor scheduler daemons using a Cluster Resource Agent
----------------------------------------------------------------------------

Overview
========
This contrib provides tools necessary to provide high availability
of condor daemons with a cluster resource agent.  This replaces condor's
native High Availability mechanism for the schedd and provides High
Availability for other daemons condor that do not support in a High
Availability configuration.  However, this will not replace condor's High
Availability of the Central Manager.  These tools are built on top of
wallaby and Cluster tools, so working Cluster and wallaby configurations
are required.

There are 2 parts to the tools that work to together to achieve High
Availability with Cluster.  First, there is the condor Resource Agent
which is used like other Cluster Resource Agents.  Once installed with
the other Resource Agents, the Cluster tools can be used to configure
and manage cluster based configuration for condor daemons.  It is worth
noting that Cluster manages the daemon not condor_master, which means
any daemon that Cluster is to manage should NOT be listed in condor's
DAEMON_LIST.

The second part of the tools are the wallaby shell commands.  These are
commands that wrap wallaby and Cluster tools, ensuring that a cluster
configuration is appropriately represented in wallaby and nodes in the
cluster have the proper condor configuration.  Currently, only the schedd,
job_server, and query_server are supported with these tools.

In order for the query_server to fail over cleanly, the Aviary Locator feature
needs to be enabled on the pool.  Query servers configured with the wallaby
shell commands will be configured with locator support enabled.

Dependencies
============
wallaby shell
ccs

Files
=====
cmd_cluster.rb - Wallaby shell commands for managing HA Schedd configurations
condor.sh      - Cluster Resource Agent for condor daemons

Cluster versus condor_master
==================================
Cluster is an alternative to condor_master for managing high
availability configurations of condor daemons.  The key differences are:
  * Cluster's rgmanager uses a Resource Agent to manage a daemon
    instance whereas the condor_master manages the daemon instance.
  * A daemon run by Cluster does not rely on a lock file to ensure
    only 1 daemon instance per configuration is running whereas
    the condor_master depends on a lock file on a shared file system.
  * Cluster can monitor and keep running just about any daemon in
    a High Availability configuration, whereas the condor_master only
    supports a few.
  * The wallaby shell commands configure NFS and ensure the exported directory
    is only mounted on the node running the daemon instance whereas
    the condor_master requires the shared resource be mounted on all nodes
    that could run the daemon instance.
  * The NFS mount point is managed (mounted/unmounted) by Cluster
    and the daemon will only run if the NFS file system mounts
    sucessfully whereas the condor_master has no control over the shared file
    system.
  * Cluster provides an optional graphical interface for managing
    numerous daemon configurations whereas condor_master is only
    configured through configuration files.

Installation
============
RPM
---
Install the condor-cluster-resource-agent package for Red Hat Enterprise
Linux/Fedora.  This will install the required software dependencies and
the condor Cluster integration files in the correct locations.

Source
------
The Resource Agent (condor.sh) should be installed in the directory
containing other Resource Agents used by Cluster
(usually /usr/share/cluster).  This file MUST be executable.

The wallaby shell command (cmd_cluster.rb) should be installed in the
directory containing other wallaby shell commands
(usually /usr/lib/ruby/site_ruby/<ver>/mrg/grid/config/shell).

Usage
=====
The cluster related wallaby shell commands wrap the command line Cluter Suite
configuration tool (ccs) and interaction with wallaby to simplify configuring
a High Availabilitly schedd, job_server, or query_server.

Common options
--------------
Almost all cluster related wallaby shell commands take the following arguments:
      --riccipassword : The ricci user password
  -n, --no-store      : Only configure the cluster, don't update the store
  -s, --store-only    : Don't configure the cluster, just update the store

cluster-add-node
----------------
This command adds a node to an existing HA Schedd cluster configuration.  It
takes the common options and requires the name of the schedd configuration
and a list of nodes to add.

example: wallaby cluster-add-node name=schedd1 node1 node2 node3


cluster-add-jobserver
---------------------
This command adds a JobServer to an existing HA Schedd cluster configuration.
This JobServer will work with the schedd it is associated with and failover
when the schedd fails over.  However, failure of the JobServer will not cause
the schedd to failover.  It takes the common options and requires the name of
the schedd configuration.

example: wallaby cluster-add-jobserver name=schedd1


cluster-add-queryserver
---------------------
This command adds a QueryServer to an existing HA Schedd cluster configuration.
This QueryServer will work with the schedd it is associated with and failover
when the schedd fails over.  However, failure of the QueryServer will not cause
the schedd to failover.  It takes the common options and requires the name
of the schedd configuration.

example: wallaby cluster-add-queryserver name=schedd1


cluster-create
--------------
This command creates an HA Schedd cluster configuration.  It takes the common
options and requires the following information about the configuration in
addition to a list of nodes:
NAME   : The unique name of the schedd configuration.  This is used to
         identify configuration parameters for the schedd instance in
         wallaby, and to tell Cluster the name that should be passed to
         the condor schedd on startup.
SPOOL  : The spool directory for the schedd.  This is also the mount
         point of the NFS export.
SERVER : The server offering the NFS export.
EXPORT : The directory on the SERVER that is exported over NFS.

example: wallaby cluster-create name=schedd1 spool=/path/to/mountpoint server=NFSserver export=/path/on/NFSServer node1 node2 node3


cluster-delete
--------------
This command deletes an HA Schedd cluster configuration.  It takes the common
options and requires the name of the schedd configuration to delete.

example: wallaby cluster-delete name=schedd1


cluster-remove-node
-------------------
This command removes a node from an existing HA Schedd cluster configuration.
It takes the common options and requires the name of the schedd configuration
and a list of nodes to remove.

example: wallaby cluster-remove-node name=schedd1 node1 node2 node3


cluster-remove-jobserver
------------------------
This command removes a JobServer from an existing HA Schedd cluster
configuration.  It takes the common options and requires the name of the
schedd configuration.

example: wallaby cluster-remove-jobserver name=schedd1


cluster-remove-queryserver
------------------------
This command removes a QueryServer from an existing HA Schedd cluster
configuration.  It takes the common options and requires the name of the
schedd configuration.

example: wallaby cluster-remove-queryserver name=schedd1


cluster-sync-to-store
---------------------
This command replaces HA Schedd configuration(s) in wallaby with the cluster
configuration.  It takes no options and requires no arguments.

example: wallaby cluster-sync-to-store
