High Availability Infrastructure
Armors are multithreaded processes internally
structured around objects, called elements, which
contain their own private data and provide elementary
functions or services. Every armor process
contains a basic set of elements that provide core
functionality, including reliable point-to-point
messaging between armors, response to heartbeat
messages (which indicate a given armor process’s
liveness), and the ability to checkpoint armor state.
Armor processes communicate via message passing:
the microkernel present in each armor distributes messages
between elements within an armor and
between the armors in a system. Every incoming
message causes the microkernel to spawn a new
thread to process the message, and the execution of
each thread invokes one or more elements within the
armor process. A thread terminates when the armor
finishes processing the message or when it sends an
outgoing message in response to the original.
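
The thread-per-message behavior described above can be illustrated with a minimal sketch. It is not the actual armor implementation: the `Message` and `Microkernel` classes, their fields, and the `run_demo` helper are all hypothetical names chosen for illustration, and the "processing" step only records which microoperations were seen.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Message:
    # Hypothetical message shape: an ordered list of microoperation names
    # plus a free-form payload dictionary.
    operations: list
    payload: dict = field(default_factory=dict)


class Microkernel:
    """Toy microkernel that starts one new thread per incoming message."""

    def __init__(self):
        self.handled = []              # operation names processed so far
        self._lock = threading.Lock()  # protects the shared 'handled' list

    def _process(self, message):
        # The thread ends once every operation in the message is processed.
        for op in message.operations:
            with self._lock:
                self.handled.append(op)

    def deliver(self, message):
        # Spawn a fresh thread for this message and hand it back to the caller.
        worker = threading.Thread(target=self._process, args=(message,))
        worker.start()
        return worker


def run_demo():
    kernel = Microkernel()
    workers = [
        kernel.deliver(Message(["heartbeat"])),
        kernel.deliver(Message(["checkpoint", "ack"])),
    ]
    for worker in workers:
        worker.join()
    # Sorted because the two threads may interleave in any order.
    return sorted(kernel.handled)
```

Calling `run_demo()` processes two messages on two separate threads. The lock is needed here only because the sketch shares one list across threads; per-message payloads need no such protection.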
Structurally, messages consist of two primary
parts: the microoperation sequence and the payload.
Every message carries a series of microoperations
to be executed by armors. The microkernel
delivers each microoperation in sequence to the elements
that have subscribed to that operation. During
system initialization, each armor
establishes a subscription list, which provides a mapping
between elements and the microoperations
each element can process. The microkernel is
responsible for maintaining the list at runtime.
Each message contains a general payload area
for storing data. Elements can read from and write
to the payload fields while processing the microoperations
in a message. Thus, elements can
exchange information with one another within a
single execution thread. This information exchange
doesn’t interfere with other execution threads
because each thread manipulates its own payload.
While processing a microoperation, an element
can update its local state or the state of the payload
fields; it can also change an armor process’s control
flow by adding new microoperations to the current
sequence. This modular, event-driven architecture
permits developers to customize an armor process’s
functionality and fault-tolerance services (detection
and recovery) according to the application’s needs.
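
As a rough illustration of the mechanism described above (a subscription list, a per-message payload, and elements that extend the microoperation sequence), consider the following sketch. The class names `Element`, `Counter`, `Logger`, and `Microkernel`, the `handles` attribute, and the `count` and `log` operation names are assumptions made for this example, not part of the armor API.

```python
from collections import deque


class Element:
    """Base class for a hypothetical element with private state."""
    handles = ()  # microoperation names this element subscribes to

    def process(self, op, payload, pending):
        raise NotImplementedError


class Counter(Element):
    handles = ("count",)

    def __init__(self):
        self.seen = 0  # private element state

    def process(self, op, payload, pending):
        self.seen += 1
        payload["count"] = self.seen   # write to the payload
        if payload.get("report"):
            pending.append("log")      # add a microoperation to the sequence


class Logger(Element):
    handles = ("log",)

    def __init__(self):
        self.lines = []

    def process(self, op, payload, pending):
        # Read a payload field written earlier by another element.
        self.lines.append(f"count={payload['count']}")


class Microkernel:
    def __init__(self, elements):
        # Subscription list: microoperation name -> subscribed elements.
        self.subscriptions = {}
        for element in elements:
            for op in element.handles:
                self.subscriptions.setdefault(op, []).append(element)

    def dispatch(self, operations, payload):
        # Deliver each microoperation, in order, to its subscribers.
        pending = deque(operations)
        while pending:
            op = pending.popleft()
            for element in self.subscriptions.get(op, []):
                element.process(op, payload, pending)
        return payload


counter, logger = Counter(), Logger()
kernel = Microkernel([counter, logger])
result = kernel.dispatch(["count"], {"report": True})
```

Here the message starts with a single `count` microoperation. The `Counter` element updates its own state, writes to the payload, and appends a `log` microoperation, which the microkernel then delivers to `Logger`. Because each call to `dispatch` receives its own payload dictionary, concurrent messages would not see one another's fields.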

Above is a simple armor configuration with the runtime environment scaled to two nodes. It has four basic components:
- The fault-tolerance manager (FTM) initializes an
armor-based environment’s working configuration,
maintains registration information on all
armor objects and application processes, and
initiates recovery from armor and node failures.
- The heartbeat armor runs on a node that is separate
from the FTM. It detects failures in the
FTM by periodically polling for “liveness,” and
then initiates FTM recovery.
- A daemon armor runs on each node in the network,
serving as a gateway for armor-to-armor
communication and detecting runtime failures
of local armors (those running on the same
node as the daemon).
- The execution armor launches local application
processes, detects their failures, and performs
recovery.
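
The liveness polling performed by the heartbeat armor can be sketched as follows. This is only an illustration under stated assumptions: the `HeartbeatMonitor` class, its `probe` and `recover` callables, and the `max_missed` threshold are hypothetical, and the source does not say how many missed replies trigger FTM recovery.

```python
class HeartbeatMonitor:
    """Polls a target for liveness and triggers recovery on missed replies."""

    def __init__(self, probe, recover, max_missed=2):
        self.probe = probe            # callable: True if the target replied
        self.recover = recover        # callable: restarts the target
        self.max_missed = max_missed  # assumed threshold, not from the source
        self.missed = 0               # consecutive polls without a reply

    def poll_once(self):
        if self.probe():
            self.missed = 0
            return "alive"
        self.missed += 1
        if self.missed >= self.max_missed:
            self.recover()
            self.missed = 0
            return "recovered"
        return "suspect"


def run_demo():
    # Simulated replies from the monitored process: one reply, two misses,
    # then a reply from the restarted process.
    replies = iter([True, False, False, True])
    recoveries = []
    monitor = HeartbeatMonitor(
        probe=lambda: next(replies),
        recover=lambda: recoveries.append("restart FTM"),
    )
    states = [monitor.poll_once() for _ in range(4)]
    return states, recoveries
```

In a real deployment `poll_once` would be driven by a timer and `probe` would send a heartbeat message over the network; the sketch keeps both as plain callables so the detection logic can be followed in isolation.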