|
The following summarizes the Armor applications and fault-tolerance techniques we’ve implemented. The
armor processes’ reconfigurability and the Armor runtime environment’s self-checking functionality
are the key factors that let us provide solutions for multiple application domains.
With regard to the first factor, all the example
applications show that level-1 techniques are
widely useful, regardless of application specifics
and requirements. We implement these generic
fault-tolerance techniques as elements that reside
in armor processes that are external (and, hence,
transparent) to the applications.
The applicability of level-2 and level-3 techniques
depends on a given application’s characteristics
and on the degree to which it is integrated
with the armor. For example, an embedded armor
solution works well as a framework for database
auditing and checkpointing because we can get
full access to client-side database APIs without
changes to the application code.
Implementing progress indicators or heartbeats
still requires only a few additions to the code. A level-2
implementation wraps standard library function calls
to augment them with progress-indicator functionality (for
example, the application might send a progress-indicator
message whenever it calls the write() function
on a socket). This approach maintains transparency
at the source-code level, although it requires
relinking if the augmented libraries are statically
linked to an application’s executable file.
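
The wrapping idea can be sketched in a few lines of Python. This is only an illustrative analogue, not the Armor implementation: it swaps os.write for a wrapper inside one process, whereas the approach described here augments libraries at link time. The names install_progress_wrapper, record_progress, and progress_log are hypothetical, the list stands in for messages that would be sent to an armor process, and the sketch assumes a POSIX system where os.write accepts a socket descriptor.

```python
import functools
import os
import socket
import time


def install_progress_wrapper(module, func_name, notify):
    """Swap module.func_name for a wrapper that reports progress.

    Returns the original function so the caller can restore it later.
    """
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        result = original(*args, **kwargs)
        # Report only after the wrapped call returns without raising.
        notify(func_name, time.monotonic())
        return result

    setattr(module, func_name, wrapper)
    return original


def demo():
    """Run an unmodified write on a socket and collect progress reports."""
    progress_log = []  # hypothetical stand-in for messages sent to an armor

    def record_progress(name, timestamp):
        progress_log.append((name, timestamp))

    left, right = socket.socketpair()
    original_write = install_progress_wrapper(os, "write", record_progress)
    try:
        # Application code: an ordinary write on a socket descriptor.
        os.write(left.fileno(), b"payload")
        data = right.recv(16)
    finally:
        os.write = original_write  # undo the interposition
        left.close()
        right.close()
    return data, progress_log


if __name__ == "__main__":
    received, log = demo()
    print(received, [name for name, _ in log])
```

The application-side call (os.write) is unchanged; only the binding of the name is replaced, which mirrors the source-level transparency described above. Restoring the original function in the finally block keeps the wrapper from leaking into unrelated code.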
In contrast, a level-3 implementation requires
the developer to alter application source code.

Our study shows that approximately
5 percent of errors propagate across the network,
and roughly 1.5 percent of these cause the
remote nodes that receive the erroneous packets to
crash. Armor processes can virtually eliminate such
scenarios. Moreover, Armor middleware can successfully
recover from correlated failures (or multiple
failures occurring in quick succession).
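
As a quick check on the propagation figures quoted above, and assuming the 1.5 percent applies only to the errors that propagate across the network, the share of all errors that crash a remote node works out to roughly:

```latex
\[
0.05 \times 0.015 = 0.00075 \approx 0.075\%
\]
```

That is, on the order of 7 or 8 remote-node crashes per 10,000 errors.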
|