



                                              Reliable  DUCC  -  Design



                                         Written and maintained by the Apache

                                             UIMA™ Development Community



Copyright © 2012 The Apache Software Foundation

Copyright © 2012 International Business Machines Corporation



        License and Disclaimer   The ASF licenses this documentation to you under the Apache License, Version
        2.0 (the "License"); you may not use this documentation except in compliance with the License.  You may
        obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

        Unless required by applicable law or agreed to in writing, this documentation and its contents are distributed
        under the License on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
        either express or implied. See the License for the specific language governing permissions and limitations under
        the License.



        Trademarks   All terms mentioned in the text that are known to be trademarks or service marks have been
        appropriately capitalized.  Use of such terms in this book should not be regarded as affecting the validity of
        the trademark or service mark.



Publication date: April 2019




Multiple DUCC head nodes



This first major section describes support for multiple active DUCC head nodes.



Introduction


DUCC can be configured to run reliably by having multiple head nodes, comprising one master and one or more
backup head nodes. DUCC exploits Linux keepalived virtual IP addressing to enable this capability.

The advantage is that if the master DUCC host becomes unusable, a backup DUCC head node can take over seamlessly,
so that active distributed Jobs, Reservations, Managed Reservations and Services continue uninterrupted. Takeover
also allows new submissions to be accepted, and new and existing submissions to be monitored, without
interruption.



Daemons


Each head node, whether master or backup, runs a Broker, Orchestrator, PM, RM, and SM.

The Cassandra database is expected to be located on a node(s) separate from the head nodes.

Likewise, the JD node(s) is separate from the head nodes.

The Agents are distributed, as before.



Configuring Host Machines


See Configuring Simple Virtual IP Address Failover Using Keepalived, which can be found at this web address:
https://docs.oracle.com/cd/E37670_01/E41138/html/section_uxg_lzh_nr.html

Sample MASTER /etc/keepalived/keepalived.conf

      ! Configuration File for keepalived

      vrrp_instance VI_1 {
          state MASTER
          interface eth0
          virtual_router_id 51
          priority 100
          advert_int 1
          authentication {
              auth_type PASS
              auth_pass 1111
          }
          virtual_ipaddress {
              192.168.6.253
          }
      }



Sample BACKUP /etc/keepalived/keepalived.conf

      ! Configuration File for keepalived

      vrrp_instance VI_1 {
          state BACKUP
          interface eth0
          virtual_router_id 51
          priority 100
          advert_int 1
          authentication {
              auth_type PASS
              auth_pass 1111
          }
          virtual_ipaddress {
              192.168.6.253
          }
      }
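
In the two samples, the master and backup files share the same virtual_router_id and virtual_ipaddress and differ only in state. The following Python sketch, which is not part of DUCC, extracts those two fields from a pair of keepalived.conf texts and compares them. The function names vrrp_settings and heads_consistent are hypothetical, and the regular-expression parsing is an assumption that covers only the single-instance layout of the samples (one address per entry, no netmask or device suffix).

```python
import re


def vrrp_settings(conf_text):
    """Extract the fields that should match on every head node.

    Assumes a single vrrp_instance with plain addresses, as in the samples.
    """
    router_id = re.search(r"virtual_router_id\s+(\d+)", conf_text)
    block = re.search(r"virtual_ipaddress\s*\{([^}]*)\}", conf_text)
    addresses = block.group(1).split() if block else []
    return {
        "virtual_router_id": router_id.group(1) if router_id else None,
        "virtual_ipaddress": sorted(addresses),
    }


def heads_consistent(master_conf, backup_conf):
    """Return True when both files name the same router id and virtual IPs."""
    return vrrp_settings(master_conf) == vrrp_settings(backup_conf)
```

A check of this kind could be run with the contents of both files before starting keepalived on a new backup head node.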



Linux Commands

Starting keepalived

      >  sudo  service  keepalived  start
      Starting  keepalived:                                                        [   OK   ]



Querying keepalived

      >  /sbin/ip  addr  show  dev  eth0
2:  eth0:  <BROADCAST,MULTICAST,UP,LOWER_UP>  mtu  1500  qdisc  mq  state  UP  qlen  1000
      link/ether  00:21:5e:20:02:84  brd  ff:ff:ff:ff:ff:ff
      inet  192.168.3.7/16  brd  192.168.255.255  scope  global  eth0
      inet  192.168.6.253/32  scope  global  eth0
      inet6  fe80::221:5eff:fe20:284/64  scope  link
          valid_lft  forever  preferred_lft  forever



Stopping keepalived

      >  sudo  service  keepalived  stop
      Stopping  keepalived:



Configuring DUCC


To configure DUCC to run reliably, one required property must be configured in the site.ducc.properties file. Example:

ducc.head  =  192.168.6.253



Use the virtual IP address configured for keepalived on your host machines. Use of the DNS name is also supported.
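
As an illustration of how a client might turn the ducc.head value into an address, here is a minimal Python sketch. It is not DUCC code: read_properties and head_address are hypothetical helpers, and the parser assumes simple "key = value" lines with "#" comments. socket.gethostbyname returns a dotted IPv4 address unchanged and resolves a DNS name through the system resolver.

```python
import socket


def read_properties(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def head_address(props):
    """Resolve ducc.head, given as an IPv4 address or a DNS name, to an IPv4 address."""
    return socket.gethostbyname(props["ducc.head"])
```

For example, head_address(read_properties(open("site.ducc.properties").read())) would yield the virtual IP address, assuming the file is in the current directory.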



Webserver


Webserver for Master

The master DUCC Webserver will display all pages normally, with additional information in the upper left of the heading:

reliable: master

Webserver for Backup

The backup DUCC Webserver will display some pages normally, with additional information in the upper left of the
heading:

reliable: backup




Hovering over "reliable" will yield the following information: Click to visit master

Several pages will display the following information (or similar):

no  data  -  not  master



Database



Configure the database to be on a separate machine from the reliable DUCC head nodes.  In site.ducc.properties
specify:

      # Database location
      ducc.database.host = dbhost123
      ducc.database.jmx.host = dbhost123
      ducc.database.automanage = false



The existing administrator commands start_ducc and stop_ducc will honor the value specified for ducc.database.automanage
in site.ducc.properties.
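
The effect of the flag on a start script can be sketched in Python as follows. This is an assumption about the general shape of the logic, not the actual start_ducc code: the helper names and the daemon labels are hypothetical, and a missing property is assumed to mean true, which is the value ducc.properties ships with.

```python
def database_is_automanaged(props):
    """Return True when DUCC itself should start and stop the database.

    Assumption: a missing property is treated as "true".
    """
    value = props.get("ducc.database.automanage", "true")
    return value.strip().lower() == "true"


def daemons_to_start(props, head_daemons):
    """Return the daemons a start script would launch, database first if managed."""
    daemons = list(head_daemons)
    if database_is_automanaged(props):
        daemons.insert(0, "database")
    return daemons
```

With ducc.database.automanage = false, as recommended for reliable DUCC, the database is left out and must be started separately on its own machine.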



Code changes


The key changes include a new script (see ducc_head_mode.py) to interact with Linux to determine virtual IP
address status and corresponding Java code (see common.head.ADuccHead.java) that interprets the status to make
transitions between master and backup states.
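
The transition logic can be pictured with a small state tracker. The sketch below is written in Python for brevity and does not reproduce ADuccHead: the class name HeadState, its update method and the initial state are assumptions. The event names SWITCH_TO_MASTER and SWITCH_TO_BACKUP are the ones this document lists for DuccEvent.

```python
class HeadState:
    """Remember the last observed mode and report master/backup transitions."""

    def __init__(self, initial="unspecified"):
        # Assumption: a daemon starts with no known mode until first observed.
        self.mode = initial

    def update(self, observed):
        """Record the observed mode; return an event name on a transition, else None."""
        previous, self.mode = self.mode, observed
        if observed == previous:
            return None
        if observed == "master":
            return "SWITCH_TO_MASTER"
        if observed == "backup":
            return "SWITCH_TO_BACKUP"
        return None
```

A daemon would call update with the latest status on each check and act only when an event name is returned, for example by resuming or quiescing its operations.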



new scripts


ducc_head_mode.py

This is a new script employed at runtime by the various daemons to determine the current mode of operation. Status
is determined through invocation of this script upon receipt of each Orchestrator publication.

      #  purpose:      determine  reliable  ducc  status
      #  input:         none
      #  output:       one  of  {  unspecified,  master,  backup  }
      #  operation:   look  in  ducc.properties  for  relevant  keywords
      #                   and  employ  linux  commands  to  determine  if  system
      #                   has  matching  configured  virtual  IP  address
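
The operation described in that header can be approximated as follows. This Python sketch is not the shipped ducc_head_mode.py: unlike the real script, which takes no input, it receives the ducc.head address, the keepalived.conf text and the output of /sbin/ip addr show as arguments so that the decision logic can be exercised in isolation. The rule that an address absent from keepalived.conf means "unspecified" is an assumption.

```python
import re


def local_ipv4_addresses(ip_addr_output):
    """Collect the IPv4 addresses on 'inet' lines of `ip addr show` output."""
    pattern = r"^\s*inet\s+(\d+\.\d+\.\d+\.\d+)"
    return set(re.findall(pattern, ip_addr_output, re.MULTILINE))


def head_mode(virtual_ip, keepalived_conf, ip_addr_output):
    """Return one of 'unspecified', 'master' or 'backup'."""
    block = re.search(r"virtual_ipaddress\s*\{([^}]*)\}", keepalived_conf or "")
    configured = [e.split("/")[0] for e in block.group(1).split()] if block else []
    if virtual_ip not in configured:
        # Assumption: this host is not set up for reliable DUCC.
        return "unspecified"
    if virtual_ip in local_ipv4_addresses(ip_addr_output):
        # The virtual IP is currently bound to a local interface.
        return "master"
    return "backup"
```

On a live system the third argument could be obtained with subprocess.check_output(["/sbin/ip", "addr", "show"], text=True).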



existing and new scripts


       ducc_post_install - no need to create webserver request log directory

       ducc_util.py - incorporate host name into cassandra.pid, cassandra.console path; broker host must be local
       host; head node must be eligible with respect to keepalived.conf; head node local components are all daemons
       except Database and Agents; fix remote db stop; honor ducc.database.automanage flag in site.ducc.properties

       ducc.py - incorporate host name into cassandra.pid, cassandra.console path

       start_ducc.py - head node local components must be on an eligible local host

       start_sim - (example) honor database autostart flag in ducc.properties

       stop_sim - (example) honor database autostart flag in ducc.properties




configuration files


ducc.properties

      #  The  name  of  the  node  where  DUCC  runs.
      #  This  property  declares  the  node  where  the  DUCC  administrative  processes  run  (Orchestrator,
      #  Resource  Manager,  Process  Manager,  Service  Manager).   This  property  is  required  and  MUST  be
      #  configured  in  new  installation.   The  installation  script  ducc_post_install  initializes  this
      #  property  to  the  node  the  script  is  executed  on.
      #  Reliable  DUCC:  if  running  reliably,  then  this  value  must  be  the  same  as  that  specified
      #  for  the  virtual_ipaddress  in  /etc/keepalived/keepalived.conf.   DUCC  CLI  and  Agents  employ
      #  this  value  to  connect  to  the  current  reliable  DUCC  head  node.
      ducc.head  =  <head-node>



Although not strictly required, the Orchestrator, RM, SM, PM, Webserver and Broker should all be configured on the
head node. Reliable DUCC may work with other configurations, but these have not been tested.

#  If  set  to  true,  DUCC  will  start  and  stop  the  Cassandra  database  as  part  of  its  normal
#  start/stop  scripting.
ducc.database.automanage  =  true

log4j.xml

      Add DUCC_NODENAME to log file name for OR, RM, PM, SM, and system-events.
      This allows reliable DUCC head nodes to share the same ducc_runtime directory
      in the filesystem without collisions.



agent


       DuccWorkHelper - use virtual IP address configured as ducc.head node

       AgentEventListener - ignore any incoming publications from backup producer

       CGroupsTest - employ changed DuccIdFactory signature

       ServiceTester - broker must be on ducc.head node



cli


       DuccMonitor - use WS node or virtual IP address configured as ducc.head node

       DuccUiUtilities - use virtual IP address configured as ducc.head node (to submit, cancel..)



common


       AbstractDuccComponent - remove commented-out code, remove print to console, head node local components
       are all daemons except Database and Agents

       ADuccHead - abstract class with reliable DUCC shared functionality

       IDuccHead - reliable DUCC interface

       IDuccEnv - remove DUCC_LOGS_WEBSERVER_DIR, not used

       IStateServices - database access control RW or RO

       NullStateServices - database access control RW or RO

       StateServices - database access control RW or RO




       DuccDaemonRuntimeProperties - incorporate hostname into logs directory location

       InetHelper - incorporate hostname into logs directory location

       DuccPropertiesHelper - fetch virtual IP address configured as ducc.head node

       DuccPropertiesResolver - Remove key ducc.broker.hostname, broker must be on ducc.head node

       IDuccLoggerComponents - Missing PM abbreviation

       DuccIdFactory - improved (generalized) to handle DB persisted sequence numbering



database


       IDuccHead - reliable DUCC interface

       DbOrchestratorProperties - support for OR properties table

       IDbOrchestratorProperties - interface of OR properties table

       IOrchestratorProperties - interface for OR properties

       IOrchestratorProperties - database access control RW or RO



orchestrator


       DuccHead - loggable wrapper around common.ADuccHead

       OrchestratorCommonArea - add restart capability for transition to master

       OrchestratorComponent - reject requests from CLI and JD and publications for Agents when not master,
       employ DB for Job and publication number persistence, use active workMap from common area, tag publication
       with node identity and producer state master or backup, make transitions between master and backup states

       OrchestratorRecovery - employ changed DuccIdFactory initialization requirements

       ReservationFactory - employ changed DuccIdFactory signature

       StateJobAccounting - log job state changes

       StateManager - use active workMap from common area

       WorkMapHelper - add logging

       AOrchestratorCheckpoint - refactor checkpointing, suspend when backup, resume when master

       IOrchestratorCheckpoint - refactor checkpointing

       OrchestratorCheckpoint - refactor checkpointing

       OrchestratorCheckpointDb - refactor checkpointing

       OrchestratorCheckpointFile - refactor checkpointing

       OrchestratorConfiguration - employ changed DuccIdFactory for publication sequence numbering

       OrDbDuccWorks - specification to DB only when master

       OrDbDuccWorks - orchestrator properties to DB only when master

       OrchestratorEventListener - record daemon switches between backup and master in the system events log

       ReservationFactory - employ changed DuccIdFactory for Job numbering

       ReservationFactory - employ changed DuccIdFactory signature

       JdScheduler - suspend JD host management when backup, resume when master

       HealthMonitor - use active workMap from common area




       MaintenanceThread - use active workMap from common area

       AOrchestratorState - refactor orchestrator state management from files to DB

       DuccWorkIdFactory - refactor orchestrator state management from files to DB

       IOrchestratorState - refactor orchestrator state management from files to DB

       OrchestratorState - refactor orchestrator state management from files to DB

       OrchestratorStateDb - refactor orchestrator state management from files to DB

       OrchestratorStateDbConversion - refactor orchestrator state management from files to DB

       OrchestratorStateFile - refactor orchestrator state management from files to DB

       AOrchestratorStateJson - refactor orchestrator state management from files to DB

       SystemEventsLogger - record all CLI interactions in system events log

       TestSuite - print whether backup or master



pm


       DuccHead - loggable wrapper around common.ADuccHead

       ProcessManagerComponent - make transitions between master and backup states



sm


       DuccHead - loggable wrapper around common.ADuccHead

       ServiceHandler - resume operations when state is master, quiesce operations when state is backup

       ServiceManagerComponent - make transitions between master and backup states, reject requests when in
       backup state, employ changed DuccIdFactory signature

       ServiceSet - handle new state Dispossessed



transport


       JobDriverStateExchanger - use virtual IP address configured as ducc.head node

       AbstractDuccEvent - tag publications with producer host identity and state master or backup

       DaemonDuccEvent - switch to master or backup state for recording to system event log

       DuccEvent - add events SWITCH_TO_MASTER and SWITCH_TO_BACKUP

       JdEvent - interrogate publications producer state master or backup

       IService - add service state Dispossessed, Service is not controlled by this Service Manager
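
The tagging of publications with producer identity and state, together with consumers such as AgentEventListener ignoring publications from a backup producer, can be sketched as follows. The sketch is in Python rather than Java, and the Publication class, its field names and the accept function are hypothetical; they do not mirror the fields of AbstractDuccEvent.

```python
from dataclasses import dataclass


@dataclass
class Publication:
    """A published event tagged with where it came from."""
    producer_host: str
    producer_state: str  # "master" or "backup"
    payload: object = None


def accept(publication):
    """Return True only for publications produced by the master head node."""
    return publication.producer_state == "master"
```

A listener would call accept on each incoming publication and drop those that return False, so that a backup head node that is still publishing cannot influence Agents or Job Drivers.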



webserver


       BrokerHelper - use local host name to find co-located broker

       DuccBoot - make boot reusable for switch to master

       DuccData - create reset function for switch to master

       DuccHead - loggable wrapper around common.ADuccHead

       WebServerComponent - make transitions between master and backup states; incorporate hostname into logs
       directory location




       WebServerConfiguration - make boot reusable for switch to master

       DuccHandler - servlet to produce reliable DUCC state master or backup

       DuccHandlerClassic - servlets to produce "no data - not master" when appropriate

       DuccHandlerJsonFormat - servlets to produce "no data - not master" when appropriate

       DuccWebServer - add method getPort; use host as part of request log directory path; incorporate hostname
       into logs directory location

       DuccWebServerHelper - incorporate hostname into logs directory location

       c4-ducc-mon.jsp - web page header location for reliable DUCC state

       ducc.js - web page header updating for reliable DUCC state



examples


       start_sim - broker must be on head node



Installing and Cloning



This second major section describes support for installation of head node master and backup(s).

TBD



Autostart



This third major section describes support for autostart of head node and agent daemons.

TBD



Monitoring and Switching



This fourth major section describes support for monitoring of multiple head nodes and switching to an alternate when
the primary is dysfunctional.

TBD



