CALIFORNIA STATE UNIVERSITY, NORTHRIDGE
MULTILEVEL SECURITY IN DISTRIBUTED DATA PROCESSING
ARCHITECTURE WITH BACKUP AND GRACEFUL DEGRADATION
A thesis submitted in partial satisfaction of the requirements for the degree
of Master of Science in
Computer Science
by
Javad Habibi
May 1986
The Thesis of Javad Habibi is approved:
Robert Wong
Prasanta Barkataki
Rein Turn, Chair
California State University, Northridge
ACKNOWLEDGEMENT
It is a pleasure to acknowledge the many people who have assisted me
throughout this thesis. I would like to point out three individuals without
whom this thesis could never have come to completion: My wife, Mastaneh,
whose incredible patience, support, and understanding are beyond
description; Dr. Rein Turn, the chairperson of my committee, who offered
me the thesis topic to begin with and went out of his way to assist me with the
smallest obstacles and problems; and my dear friend, Hossein Ghaemi, who
spent many hours designing the artwork. His understanding and support
through the most difficult times have always amazed me.
I am deeply indebted to my supervisor, Bill Schoene, whose extreme
flexibility will never be forgotten. A special note of thanks to Phil Rinehart,
my friend and colleague, who read this report and offered numerous helpful
suggestions. Lastly, I would like to thank Dr. Barkataki and Dr. Wong for
serving on my committee and for the time and effort they put into this thesis.
Table of Contents
ACKNOWLEDGEMENT                                              iii
ABSTRACT                                                     vii
Chapter 1  INTRODUCTION                                        1
Chapter 2  FAULT TOLERANCE IN DISTRIBUTED SYSTEMS              5
    2.1  Distributed System Architecture                       5
    2.2  Fault Tolerance                                       7
    2.3  Principles of Dynamic Fault Tolerance                 9
Chapter 3  DATA SECURITY                                      16
    3.1  Security Policy and Models                           16
    3.2  Secure Operating Modes                               23
    3.3  Security Evaluation Criteria                         25
    3.4  Architectural Features                               28
    3.5  Security in Distributed Systems                      30
Chapter 4  FAULT TOLERANT SECURITY                            37
    4.1  The Use of Redundancy                                38
    4.2  Necessary and Sufficient Conditions                  40
Chapter 5  GRACEFUL DEGRADATION OF SECURITY                   45
Chapter 6  SECURE FAULT TOLERANCE: THE MuTEAM CASE STUDY      59
    6.1  Introduction                                         59
    6.2  System Description                                   59
        6.2.1  Physical Level                                 59
        6.2.2  Process Level                                  61
        6.2.3  Kernel Level                                   62
    6.3  Fault Tolerance in MuTEAM                            63
        6.3.1  Error Detection
        6.3.2  Error Diagnosis                                66
        6.3.3  Reconfiguration                                68
        6.3.4  Recovery                                       68
    6.4  MLS Security Issues                                  69
        6.4.1  General Observations                           71
        6.4.2  Trusted Operating System                       73
        6.4.3  Untrusted Operating System                     76
Chapter 7  CONCLUDING REMARKS                                 78
REFERENCES
List of Figures
Figure  1   Distributed System Architecture                    6
Figure  2   Triple Modular Redundancy                         15
Figure  3   The Domino Effect                                 15
Figure  4   Clearance Levels and Compartments                 17
Figure  5   Reference Monitor                                 21
Figure  6   A Security Kernel                                 22
Figure  7   Physically Separated Subnetworks                  32
Figure  8   Trusted Interface Unit Structure                  33
Figure  9   LAN with Dedicated Hosts and Subnetworks          36
Figure 10   MuTEAM PU                                         60
Figure 11   Fault Tolerance Processing in MuTEAM              64
Figure 12   Error Detection Unit in MuTEAM                    65
Figure 13   Error Diagnosis in MuTEAM                         67
ABSTRACT
Multilevel Security in Distributed Data Processing Architecture with
Backup and Graceful Degradation
by
Javad Habibi
Master of Science in Computer Science
Data security and system reliability have long been major concerns in
computer system design. For distributed architectures, both have become
principal requirements, especially in systems supporting real-time
applications. This thesis examines the interaction of data security and fault
tolerance in distributed systems generically and illustrates the issues by
presenting a case study of the MuTEAM system.
A principal conclusion is that the use of protective redundancy
techniques (as implemented in hardware) is the preferred approach for the
design of secure fault tolerant systems. The feasibility of graceful
degradation of security is also examined and it is concluded that graceful
degradation is attainable under a particular interpretation. The MuTEAM
case study reaffirms the above conclusions and points out some of the
problems that may arise if the design and implementation of security and
fault tolerance are not coordinated from the start.
CHAPTER 1
INTRODUCTION
All glory comes from daring to begin.
Eugene F. Ware
Up until the early 1970s, most computer systems were designed to
operate in a centralized fashion: All the processing and data storage were
provided at a single geographical location, with users' terminals at other sites
connected to the central facility by communication lines. "Grosch's Law"
[35, 47, 48], which states that the cost of each instruction executed is
inversely proportional to the square of the machine "size" (performance)
and which justifies the use of centralized systems, was widely
acknowledged as the "Golden Rule".
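In one common algebraic rendering (an assumption of this sketch, not the thesis's own notation), with C a system's cost and P its performance, the law reads:

    P = k * C^2,   hence   C / P = 1 / (k * C)

That is, the cost per unit of computing delivered falls as the machine grows -- the economic argument for centralization that distributed processing later overturned.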
With the advent of inexpensive minicomputers the situation changed:
Individual departments within organizations started to utilize their own
"minis" to serve their own specific needs. Furthermore, in order to meet the
interaction and communication requirements imposed by the complex
structure of their organizations, each of these computers was "coupled" with
computers located in other departments.
Distributed Data Processing (DDP) thus evolved as the logical
consequence of the abovementioned situation. This new trend aims to
achieve the functional distribution of applications among a set of nodes or
processing units (PUs), each containing its own private memory and I/O
devices, interconnected by a network for the transfer of information.
Due to its suitable architecture, DDP is chosen for a variety of different
applications. Some of these applications, such as aircraft flight-control
systems [50, 96], the space shuttle [21, 28, 45], and power system control,
require high reliability. In some of these systems it may not be possible or
safe to shut down the system to repair faults that may have occurred in
hardware or software. Thus, the system must be designed to be "fault
tolerant". By taking full advantage of redundancy in hardware, software,
and data in the DDP architecture, designers will have a basis for achieving
the high degree of reliability and fault tolerance that is essential to the correct
functioning of critical applications.
However, maintaining "total" reliability of a system over its mission
lifetime is not always achievable. Even so, it is sometimes possible to
keep a system operational and available, but with reduced functional
capabilities. A system is said to be "Gracefully Degrading" [16, 71, 87]
when, due to faults, its resources and capabilities are diminished, but it
continues to be capable of performing the more critical of its original
functions.
In addition to the reliability requirement, many sensitive computer
applications mandate that the integrity of their data and programs be
protected against unauthorized access, deliberate tampering or misuse.
Thus, it is necessary to provide adequate information security -- the
protection of information against unauthorized disclosure, modification, or
destruction.
Because threats to the security of a computer system arise
from a wide variety of sources [95], different mechanisms and procedures
are implemented in order to maintain a secure environment:
Physical Security to prevent unauthorized access to the premises or to
the equipment. This normally includes the use of locked doors, fences,
and guards.
Administrative Security to establish a security policy and set up
procedures for protection, access authorization, and access control.
Security policy determines what information should be protected and
how protection should be implemented.
Personnel Security to verify the trustworthiness of system employees
and users. In the case of government security, this involves the granting of
"security clearances".
Computer Security which deals with mechanisms to enforce access
control within the computer system: Operating system, data base
management system, and application programs. This "logical security"
is usually enforced in software.
Communication Security which deals with the protection of computer
data and/or interprocess messages in communication systems. A
standard technique here is the use of encryption [31, 52, 70].
The degree of security needed depends on the sensitivity of information
processed, the trustworthiness of the users, and the capabilities that users
have in the system. For example, a system where users may invoke
preprogrammed transactions only has a lesser security problem than one
where users can do assembly language level programming.
From the security point of view, there are several modes of system
operation. In a "dedicated" system security mode, the system is used
exclusively for a given application and is accessed only by users who are
involved in this application and who are considered trustworthy. In another
mode of operation, the "system-high" mode, several applications may share
a system, but only those users considered trustworthy to handle the most
sensitive information in the system are allowed access. Physical security
techniques are the principal means for enforcing these modes of operation.
The main problem with "dedicated" or "system-high" modes of
operation is inefficiency in system use, and heavy restrictions in sharing of
the system resources. It would be much more desirable to operate the
system in a "multilevel secure" or MLS mode. Generally speaking, in an
MLS system, information of different sensitivity levels (security levels) may
concurrently reside in the system, and users with different degrees of
trustworthiness (clearance levels) may access the system, yet no
unauthorized accesses can take place. That is, everyone is confined to
access the information for which they have the appropriate clearance levels
(as determined by a "mandatory" security policy) and need-to-know
authorizations (as determined by a "discretionary" security policy). See
chapter 3 for further discussion.
With the abovementioned requirements (reliability and security) in
mind, a number of questions arise which are addressed in this thesis:
1. Are there any incompatibilities or discrepancies between the methods
of obtaining fault tolerance and the techniques for achieving data
security? If yes, is there a way to resolve these discrepancies?
What are some of the trade-offs involved?
2. Can a secure system be gracefully degraded? Can a gracefully
degrading system be secure? Is there any difference in achieving
each?
3. How does the architectural design for fault tolerance impact the
design of security, and vice versa? How can the designs be made
compatible?
In addition to addressing these questions in chapters 4-5, this thesis
illustrates them in chapter 6 by presenting a case study which is an
examination of an existing fault tolerant system called "MuTEAM" from the
viewpoint of incorporating data security. Chapters 2 and 3 present
background information on fault tolerance and data security, respectively, to
define and clarify the terms and concepts necessary for the analysis in the
subsequent chapters. The thesis ends with a summary of the findings in
chapter 7, and a discussion of the open issues that need to be addressed in
further research on this subject.
CHAPTER 2
FAULT TOLERANCE IN DISTRIBUTED SYSTEMS
Toleration is the best religion.
Victor Hugo
2.1 Distributed System Architecture
Before discussing in depth the topics of fault tolerance and multilevel
security and their interactions, the architecture under study and its
characteristics must be defined in sufficient detail.
The generic model of a distributed data processing system is shown in
Figure 1. The system consists of a set of processing units (PUs), database
units (DBs), terminal units (TUs), and a communication network; typically
a local area network (LAN). All units are connected to the LAN through the
Interface Units (IUs). Each IU receives (and transmits) messages from (to)
the LAN, and performs other interface functions. A LAN may be connected
to other LANs, or to one or more global communication networks.
Each PU has its own private memory, an I/O subsystem, and its copy
of the distributed operating system. Although the PUs operate
autonomously (no central controller exists), they cooperate in performing
various computing tasks and share various databases. This type of
architecture, which is commonly referred to as "loosely coupled", does not
require that the PUs be homogeneous (i.e., identical). Permitting
heterogeneous PUs leads to the use of special-purpose PUs as shared
resources in the system (e.g. an array processor).
The operating system programs in each PU permit multi-tasking (more
than one process can be active in a PU), control the resources, and manage
interprocessor communications which are achieved via message exchange
over the LAN.
Although local area networks can be structured in a number of different
ways (e.g. ring, star, bus, or mesh) [26], this thesis assumes operation in
the broadcast mode using a shared channel such as a bus or a ring. A
contention-based protocol would be employed for the bus structure. In the
case of a ring structure, a contention-free protocol (e.g. token passing)
would be used. However, the specific nature of the LAN is not a critical
consideration in this thesis.
[Figure omitted in this transcription: database units, terminals, I/O units, and a shared PU connected through interface units to a LAN, with a connection to a global network.]
Fig. 1 Distributed System Structure
A distributed system architecture has several features which have
impacts on fault tolerance [54] and security [30]. For example, the
processing power as well as functional utility of the system can be increased
(or decreased) easily by adding PUs, DBs, or TUs (or removing them);
reliability and availability are enhanced by the use of multiple units, such
that these can be used to provide backup to each other; and the cost tends to
be lower, as compared to a centralized (mainframe) computer system with
the same computational power and resources (if such a mainframe system is
available at all). On the other hand, it is more difficult to exercise
management control in a distributed system, especially when it covers a
large geographical area, provides services to a large, heterogeneous user
population, and/or involves numerous quasi-autonomous organizations.
2.2 Fault Tolerance
Achieving the required level of reliability in the performance of a
computer system has long been a principal concern in system design.
Initially the efforts were concentrated in choosing high quality hardware
components which had been tested thoroughly and "burned-in"-- operated
for a period of time to remove the units which would tend to fail early in
their operational lifetime. Subsequently the design efforts were focused on
designing systems for "fault tolerance" [7, 8, 9, 56] -- capability to operate
correctly even in the presence of certain faults in the system's hardware or
software.
In reliability terminology, a "fault" is an erroneous state of the system
due to a physical failure of some hardware component, a design or
implementation error in some software routine, a wrong data value, an
incorrect operator's action, or the like. Failures may be transient or
permanent. Fault tolerance techniques strive to "hide" failures and faults
and, thus, to prevent incorrect operation of the system.
All techniques for obtaining fault tolerance rely on the utilization of
some kind of redundancy in order to provide the needed information to
offset the effects of failure [20, 38, 76, 86]. Redundancy may be classified
as either "physical" (in which it is achieved by replication in the hardware
units or the software modules), or "temporal" (where the redundancy is
realized through repeated performance of the same set of computations).
From a technical viewpoint, redundancy may be further categorized as:
- Protective Redundancy: This method, also referred to as hardware or
static redundancy, employs replicated hardware units (e.g., extra
gates, bus lines, memory cells) in order to "mask" the effects of
faults. A classical example of protective redundancy is the use of
Triple Modular Redundancy or TMR [64, 79, 94], where all
subsystems are triplicated and voters (on the basis of
two-out-of-three) determine the correct outputs (Fig. 2; a minimal
voting sketch follows this list). Here, it is
assumed that the majority of the units will operate correctly and that
the probability of identical faults in a majority of the units is extremely
small (although it may occur in the case of software errors).
- Corrective Redundancy: This approach employs redundancy in a
dynamic fashion where, in response to the occurrence of a detected
failure, the system will try to identify and locate the fault. Following
the error detection phase and diagnosis of the fault's location, the
system will attempt to limit the scope of the fault's effect
(confinement). Dynamic reconfiguration techniques are then utilized
to replace (or isolate) the faulty unit(s). The erroneous state of the
system is then transformed into an error-free state, from which point
normal system operation is continued.
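The following minimal sketch (present-day Python, purely illustrative -- actual TMR voters are hardware circuits, and nothing here is drawn from the cited systems) shows the two-out-of-three decision at the heart of TMR:

```python
def tmr_vote(a, b, c):
    """Return the majority of three replicated module outputs.

    A single faulty module is masked; if all three disagree,
    the fault exceeds what TMR can mask.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: more than one module is faulty")

# One faulty replica (17) is masked by the two correct ones.
assert tmr_vote(42, 42, 17) == 42
```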
Protective redundancy is costly and inflexible, but within the design
specifications, it masks the faults and their effects without significant loss of
performance. Corrective redundancy usually permits the fault to occur and
then takes steps to remove its effects. This method involves less replication
of hardware or software modules, provides considerable design and
implementation flexibility, and is less costly than protective redundancy.
But it requires additional functions to be performed and, thus, may reduce
the nominal performance of the system. Corrective redundancy is dynamic
and more complex to implement than the "static" protective method. Thus,
it also requires additional explanation.
2.3 Principles of Dynamic Fault Tolerance
In order to prevent faults from spreading through the system and
causing catastrophic results, a corrective redundant system may go through
as many as six phases:
Fault Detection. The logical starting point for all corrective fault tolerant
systems is the detection of a faulty event. This is the most critical stage
because the remaining phases depend solely on its effectiveness. Various
alternatives are available for the implementation of fault detection. Some are
based on hardware mechanisms and some utilize software techniques. The
following is a list of some of the mechanisms adopted:
Replicating Checks. This measure involves the replication of some
physical components, which execute in parallel with the original unit. Errors
may be detected by comparing the sets of results obtained from each copy.
ESS No. 1A [27, 91] employs this technique by duplicating the entire CPU.
The replicating checks technique is a powerful and simple method for providing
error detection. High cost and an inability to detect errors caused by faults in
the design are two of its drawbacks.
Timing Checks. This method is used to reveal the presence of faults in
the system by raising a timeout exception. During normal system operation,
a "watchdog timer" is maintained which is reset periodically. If the resetting
does not occur within a certain specified time limit, an exception flag is
raised with regard to the "suspicious" process. Since timing checks indicate
the presence of faults and not their absence, they are normally employed as a
supplement to other detecting measures. Several systems, such as ESS
No. 1A [91], Tandem 16 [66], and PLURIBUS [51], make use of watchdog
timers to realize error detection for their systems.
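As an illustration of the timing-check idea, the sketch below (hypothetical Python, not the mechanism of any of the systems just named) flags a process as suspicious when it fails to reset its watchdog within the time limit:

```python
import threading

class Watchdog:
    """Raises an exception flag if reset() is not called in time."""

    def __init__(self, timeout_s, on_expire):
        self.timeout_s = timeout_s
        self.on_expire = on_expire          # handler that flags the process
        self._timer = None

    def reset(self):
        # The monitored process must call this periodically.
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.timeout_s, self.on_expire)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()

wd = Watchdog(2.0, lambda: print("timeout: process flagged as suspicious"))
wd.reset()
# ... the process performs work, calling wd.reset() at each iteration ...
wd.stop()
```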
Reversal Checks. Under this scheme, the outputs from a process are
used to calculate the inputs that are needed to produce these outputs. The
calculated result is compared with the original set of inputs to detect any
possible discrepancies. This method is limited to applications where the
inverse computation is easily performed. For example, the result of a square
root function may be checked by squaring it and comparing the product
with the original input.
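A sketch of the square-root example (illustrative Python; the tolerance value is an assumption needed for floating-point arithmetic):

```python
import math

def checked_sqrt(x, tol=1e-9):
    y = math.sqrt(x)            # the computation under check
    if abs(y * y - x) > tol:    # reverse it and compare with the input
        raise RuntimeError("reversal check failed: possible fault")
    return y

print(checked_sqrt(2.0))        # 1.4142..., reversal check passed
```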
Consistency Checks. This method is one of the simplest types of error
detection, where the check is performed by verifying that the
results obtained are "reasonable". Range checking is an example, where a
computed value is checked to be within a range (e.g., probability values
should lie between 0 and 1).
Several other methods are also available for detecting errors within a
system. A comprehensive discussion of these techniques is, however, beyond
the scope of this thesis and is thus left to other sources [4, 88].
Fault Diagnosis. Fault detection measures do not usually report the
location of the fault. A diagnosis process is thus needed to provide
information regarding the whereabouts of the fault(s) [1, 39, 57]. This
normally involves the execution of some "diagnostic test" by all the units
and the comparison of the obtained outcome with the expected results. If the
outputs produced by a PU differ from those expected, it can be deduced that
the fault lies within that PU.
The diagnosis process may be handled by a supervisory entity which is
employed to detect any possible discrepancies between the output produced
by each PU and the expected result. A second, more desirable method
involves the execution of a testing algorithm by dedicated processes
residing in each PU. At the end of this process, all fault-free units will have
reached a unified conclusion regarding the status of the system (i.e. they
will know which PUs are faulty) [73]. This is the preferred method due to
its decentralized mechanism and, thus, will be discussed in greater detail in
chapter 6.
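A highly simplified sketch of the comparison idea behind such diagnosis (illustrative Python only; the distributed algorithms cited above are far more involved and must themselves tolerate faulty testers):

```python
from collections import Counter

def diagnose(test_outputs):
    """test_outputs: PU id -> result of the common diagnostic test.

    Assuming a majority of PUs are fault-free, units whose output
    differs from the majority result are declared faulty.
    """
    expected, _ = Counter(test_outputs.values()).most_common(1)[0]
    return [pu for pu, out in test_outputs.items() if out != expected]

# PU 2 produced a wrong test result and is identified as faulty.
print(diagnose({0: 0xBEEF, 1: 0xBEEF, 2: 0xDEAD, 3: 0xBEEF}))   # [2]
```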
Fault Confinement. Fault detection techniques do not guarantee the
total identification of the undesired effects of a fault. The system must
therefore prevent the faulty component(s) from contaminating the
processing in fault-free units. One way to achieve fault confinement
consists of deleting the access rights of the faulty PU, and disabling its
logical channels from the rest of the system.
Reconfiguration. After an error has been properly located, the
associated PU is disconnected from the rest of the system either physically
(by removal) or logically (by isolation), and replaced by a back-up unit
which will resume execution of the interrupted process.
Although the reconfiguration step may be carried out in a manual
fashion (where the reinitialization is performed by the operator), this thesis
assumes automatic (dynamic) reconfiguration, employed by the system.
Dynamic reconfiguration [92] is costly and more difficult to achieve than the
manual method; however, it may be the only acceptable method in systems
that cannot tolerate a delay in their operation, such as SIFT [96] and FTMP
[50]. Thus the reliability as well as the availability of the system is of great
importance in systems that employ dynamic reconfiguration.
A standard method for reconfiguring the system involves the
replacement of a "suspected" PU with a stand-by spare PU. Normally the
back-up units are idle but available until they are called to take over the
functions that were performed by the unit(s) that are now faulty. For
example, ESS No. 1A [91] employs "rover stores" to replace erroneous
storage units.
Recovery. The removal or isolation of a faulty unit does not guarantee
the elimination of its effects. Recovery techniques [22] are needed to restore
the system to an error-free state [6, 68, 82] (i.e., to the state that existed
before the occurrence of the error) from which execution can be resumed
with little or no loss of information.
Error recovery techniques are classified as either "forward recovery" or
"backward recovery". Forward recovery involves continuation of operation
from the current state despite the fact that an error has occurred. This
technique is normally adopted by some highly specialized applications and is
not suitable as a means of recovery from unanticipated faults. Due to its
specialized nature, forward recovery is not discussed any further. More
detailed examination is left to other sources [4, 78].
Backward recovery, in contrast, makes little or no assumption about the
errors it has to deal with. Under this type of recovery, the execution is
"rolled back" to the state before the occurrence of an error. In order to
provide for roll back, the state of each process is saved periodically as the
process executes. Each saved state is termed a "checkpoint" [67]. As soon
as an error has been diagnosed, and the system has been reconfigured, all
the "involved" processes are backed up to their last checkpoint. Thus
backward recovery consists of two phases: A periodic recording of
process-state information (data, machine state, etc), and a roll back stage,
which occurs only after an error is detected. Clearly the computation of
results between the last checkpoint and the roll back are lost and must be
repeated.
Backward error recovery raises many issues: The frequency of
recovery point setting is one of the most important considerations, because
too much checkpointing increases the overhead due to state saving whereas
too little checkpointing implies a more substantial loss of computation time
between the checkpoint and the roll back. Another important issue is the
amount of information which must be saved in a checkpoint. This amount
should be minimized (consistent with effective recovery) so as to avoid
unnecessary time and space overhead [42, 43, 72, 99].
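The two phases can be sketched as follows (illustrative Python; a real checkpoint would capture machine state and be written to stable storage):

```python
import copy

class Process:
    """Backward recovery: periodic checkpointing and roll back."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = None

    def take_checkpoint(self):
        # Phase 1: periodically record process-state information.
        self._checkpoint = copy.deepcopy(self.state)

    def roll_back(self):
        # Phase 2: after detection and reconfiguration, restore the
        # last saved state; work done since then is lost and repeated.
        assert self._checkpoint is not None, "no checkpoint recorded"
        self.state = copy.deepcopy(self._checkpoint)

p = Process({"counter": 0})
p.take_checkpoint()
p.state["counter"] = 99     # computation after the checkpoint...
p.roll_back()               # ...is discarded on recovery
print(p.state)              # {'counter': 0}
```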
When multiple processes communicate, backward recovery may result
in what is known as the "Domino Effect" (Fig. 3). This phenomenon is the
consequence of uncoordinated establishment of recovery points. As shown
in figure 3, each process has established its recovery point frequency
independently of the others. An error at point "e" causes process C to roll
back to checkpoint 6. This will force process A to roll back to checkpoint 5
(due to the communication between process A and C after C's checkpoint
was recorded). But checkpoint 5 precedes another communication to C.
So, to be consistent, processes A and C must be rolled back once again.
Thus the recovery points are reestablished in a toppling "dominoes" fashion.
Requiring the setting of a recovery point at the time of an interprocess
communication is one way to avoid the "domino effect."
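The propagation can be made concrete with the following sketch (illustrative Python; checkpoints and message events are modelled as per-process local event indices, and the particular values below are invented, not taken from Fig. 3):

```python
def rollback(checkpoints, messages, start_proc, error_idx):
    """Compute, per process, the checkpoint each must restore.

    A message whose send was undone by a rollback is an "orphan";
    its receiver must then roll back past the receive, and so on
    until a consistent set of recovery points is reached.
    """
    def last_cp(proc, before):
        return max((t for t in checkpoints[proc] if t < before), default=0)

    restore = {p: None for p in checkpoints}        # None = not rolled back
    restore[start_proc] = last_cp(start_proc, error_idx)
    changed = True
    while changed:                                  # iterate to a fixpoint
        changed = False
        for snd, s_idx, rcv, r_idx in messages:
            if restore[snd] is not None and s_idx > restore[snd]:
                # the send was undone, so the receive must be undone too
                if restore[rcv] is None or restore[rcv] >= r_idx:
                    restore[rcv] = last_cp(rcv, r_idx)
                    changed = True
    return restore

cps  = {"A": [1, 5], "B": [2], "C": [3, 6]}
msgs = [("C", 7, "A", 8), ("A", 6, "C", 4)]   # (sender, send, receiver, recv)
print(rollback(cps, msgs, "C", 9))  # {'A': 5, 'B': None, 'C': 3}: C rolls twice
```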
Graceful degradation. The ability of the system to continue operation
despite the presence of a fault but with reduced capabilities is called
"graceful degradation" [63]. Degradation may take a variety of different
forms such as a reduction in the available memory capacity, a decrease in
the number of available CPUs, lower throughput, longer response time, etc.
Distributed system architectures are especially suitable for achieving
graceful degradation because of the multiplicity of each type of module
(e.g., processing units, database units). Failure of one unit will not disable
the system, provided that provisions have been made for reconfiguration
and prioritizing of tasks -- the more critical tasks should continue to be
performed using available resources.
Two views of fault tolerance and graceful degradation are examined in
this thesis:
- Computational Fault Tolerance: where correct execution of a
specified algorithm is preserved despite defects in the system. This
aspect of fault tolerance was discussed in this chapter in general
terms and will be re-examined using an example in chapter 6.
- Security Fault Tolerance: where correct operation of the security
mechanism is maintained in the presence of faults in the system.
Security fault tolerance raises new issues, because unlike
computational fault recovery, a security compromise may not be
recoverable. Other issues, such as the time span during which an
error remains latent, must be re-examined because, by the time the error
is detected, sensitive information might have been disclosed to
untrustworthy users. These and other security issues are explored in
more detail in chapters 3-5.
[Figure omitted in this transcription: triplicated modules with two-out-of-three voters.]
Fig. 2 Triple Modular Redundancy (TMR) [88]
[Figure omitted in this transcription: processes A, B, and C with checkpoints (dots), messages passed between them (arrows), and an error (X) at point e forcing cascading rollbacks.]
Fig. 3 The Domino Effect [88]
CHAPTER 3
DATA SECURITY
From whom I trust may God defend me;
From whom I trust not, I defend myself.
An Italian proverb.
Data Security denotes the employment of measures and techniques for
the protection of sensitive information in a computer system against
unauthorized access, alteration, dissemination and destruction. In order to
achieve data security, many installations apply physical methods. These
measures, however, are not suitable for systems which contain and process
information of different sensitivity levels and which are used concurrently by
users with different levels of trustworthiness. A discussion of security
requirements for such systems, known as multilevel security (MLS), and
the Department of Defense (DoD) criteria for achieving MLS are the focus
of this chapter.
3.1 Security Policy and Models
The first step in the development of any "secure" system is the
establishment of a specific security policy pertinent to the system. A
security policy is "the set of laws, rules, and practices that regulate how an
organization manages, protects, and distributes sensitive information" [32].
Since the security policy influences the structure of the system, its
objectives must be established at the very outset of the design. They define
the operating modes of the system, specify the rules regarding the flow of
information, and establish the "sensitivity levels". Once the security policy
is defined, the corresponding formal model is developed.
The model that corresponds to the DoD security policy is the
well-known "Bell-LaPadula" model [14, 15, 33, 90], which categorizes all
entities in the system into a set of passive "objects" (O) which are
information-holders, and a set of active "subjects" (S) which require access
to and manipulate the objects. Certain subjects may also be objects and thus
be accessed or manipulated by other subjects. Every object (files, I/O
devices, programs, directories, etc.) has a "Security Classification Level"
(SL) which defines the minimum authorization necessary for its access.
Furthermore, a "Security Clearance Level" (CL) is associated with each
subject to indicate its maximum trustworthiness level and authority to access
objects. Within each sensitivity level, subjects and objects may have
"Compartments" or "Categories" (CAT) to which they belong (objects) or
are permitted to access (subjects).
In DoD, the clearance and classification labels are assigned from the
following hierarchical set: Top Secret (TS), Secret (S), Confidential (C),
and Unclassified (U). The categories, on the other hand, are non-hierarchical
in nature. Figure 4 illustrates these concepts using Navy, Air Force, and Army
as compartments.
[Figure omitted in this transcription: overlapping Navy, Air Force, and Army compartments across the clearance hierarchy.]
Fig. 4 Clearance Levels and Compartments [37]
The fundamental access modes used in the DoD security policy are:
- Read-only: Subject may read information in the object, but
modification is not allowed.
- Write: Subject may write information into the object, but observation is
not permitted.
- Execute: Subject may execute the object but may not read or write it
(used in the case when the object is a process).
- Read/write: Subject may both read and write the object.
In order to maintain security in an information sharing environment, the
DoD security policy defines two types of access control:
1. Discretionary: Which is "a means of restricting access to objects
based on the identity of subjects and/or groups to which they belong. The
controls are discretionary in the sense that a subject with a certain access
permission is capable of passing that permission (perhaps indirectly) on to
any other subject" [32]. This requirement, also known as the
"need-to-know" principle, is similar to the access control mechanisms in
many commercial systems. A typical example is the restriction of access to
payroll information to the payroll department employees only.
2. Mandatory (non-discretionary): is defined as "a means of restricting
access to objects based on the sensitivity (as represented by a label) of the
information contained in the objects and the formal authorization (i.e.,
clearance) of subjects to access information of such sensitivity" [32].
It can be deduced from the above that the mandatory access control is
imposed by an establishment (e.g., the government) in order to protect the
interests of the whole establishment. Needless to say, the mandatory rule
has precedence over the discretionary principle. In the case of the DoD, the
mandatory security policy consists of two fundamental conditions:
1. Simple Security Condition: allows a subject to read or execute an
object only if the subject's clearance level is greater than or equal to the
security classification level of the object and if the subject's categories
contain those of the object. This condition may be formulated as:

    CL(S) ≥ SL(O)  and  CAT(S) ⊇ CAT(O)

or, simply, there must be "no reading up" -- no reading from an object
which needs more access authority than is available to the subject.
2. *-Property Condition: This rule, also called "the confinement
property", permits "write" access to a subject only if the subject's
clearance level is less than or equal to the security classification of the object
and the subject's categories are included in those of the object.
Expressed in a precise form, the *-property mandates that:

    CL(S) ≤ SL(O)  and  CAT(S) ⊆ CAT(O)

in other words, "no writing down" to a less protected object.
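Both conditions reduce to simple comparisons on levels and category sets, as the following sketch shows (illustrative Python; the level encoding and the example subject and object are assumptions, not part of the model's formal statement):

```python
LEVELS = {"U": 0, "C": 1, "S": 2, "TS": 3}

def can_read(subj, obj):
    # Simple security condition: "no reading up".
    return (LEVELS[subj["cl"]] >= LEVELS[obj["sl"]]
            and subj["cat"] >= obj["cat"])        # >= on sets: superset

def can_write(subj, obj):
    # *-property: "no writing down".
    return (LEVELS[subj["cl"]] <= LEVELS[obj["sl"]]
            and subj["cat"] <= obj["cat"])        # <= on sets: subset

analyst = {"cl": "S", "cat": {"NAVY"}}
memo    = {"sl": "C", "cat": {"NAVY"}}
print(can_read(analyst, memo))    # True:  S >= C and {NAVY} contains {NAVY}
print(can_write(analyst, memo))   # False: an S subject may not write down to C
```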
The purpose of the *-property is to prevent "Trojan Horse" programs
from leaking unauthorized data. A Trojan Horse in computer software is a
program that contains a clandestine task beyond its normal, and legitimate
function. For example, a compiler containing a Trojan horse could make
unauthorized copies of programs on behalf of a user with a lower security
clearance (e.g., the author of the Trojan Horse) and store these in files that
are accessible to this user.
The Trojan Horse illustrates only one type of the more general problem
called the "Confinement problem" [59, 62] which deals with the prevention
of sensitive information leakage via "covert" channels -- unintended
communication paths used for transferring unauthorized information. There
are two types of covert channels:
- Covert storage channel: an information path that allows the transfer
of information through some storage location in violation of the
security policy.
- Covert timing channel: the modulation of the utilization of some system
resource (e.g., paging rate) in order to send information
over a clandestine path.
Both types of covert channels are real and can be found in
contemporary computer systems. An important consideration is the
"bandwidth" of a covert channel -- the data rate that can be achieved in the
channel.
Channels with very small bandwidths (e.g., a few bits per second) may
be too slow to constitute a real threat of unauthorized disclosure. But even
these must be identified for systems that are being considered for
processing highly sensitive information.
The Bell-LaPadula model is based on the notion of the "reference
monitor". As its name suggests, a reference monitor is an abstract
mechanism that mediates every access request of a subject to an object.
This abstract concept (Fig. 5) may be realized in an operating system by
means of a "security kernel" [3, 34, 74, 75, 85], which is a hardware/
software mechanism that embodies all the specified security rules and
provides an interface to the rest of the operating system (Fig. 6). The
security kernel must be proven to be designed and implemented correctly.
Besides mediating the accesses made to objects, the security kernel
defines each object and produces the necessary access operations.
Moreover, the security kernel can allow some "privileged" subjects (e.g.,
system software packages) to violate the security rules, as figure 6 illustrates.
These "trusted subjects" are needed to control the access policy,
"downgrade" files, and set the scheduling policy.
[Figure omitted in this transcription: subjects (users, processes, job streams) are mediated by the reference monitor, backed by its database of user access, object sensitivity, and need-to-know data, before reaching objects (files, programs, terminals, tapes).]
Fig. 5 Reference Monitor [3]
[Figure omitted in this transcription: users and trusted users reach kernel-protected resources through the operating system, with trusted software permitted to bypass certain rules.]
Fig. 6 A Security Kernel [3]
3.2 Secure Operating Modes
A required criterion for any "multilevel secure" system (definition
follows), is the application of formal verification techniques to the operating
system to ensure the correct design and implementation of the mandatory
and discretionary policy [2, 23, 36, 69]. To date this requirement has been
difficult to meet. Thus, in order to process classified information in the
presence of possible deficiencies in the security mechanisms (e.g. covert
channels, trap doors, Trojan Horses, etc.) other modes of operation are
being used. Together with the multilevel secure (MLS) mode, they make up
the four fundamental modes of secure operation:
Dedicated Mode. The entire system is dedicated to function in a single
security class. Subjects with appropriate security clearances and
need-to-know are allowed to access all the information. All other subjects
are physically segregated from the system. For example, a system dedicated
to the "Secret" classification and "Navy" operations could only accept users
possessing the Secret-Navy clearance level.
The main advantage of this approach is that separation is achieved
without any need for hardware or software security mechanisms other than
physical security. This advantage does not come without a drawback,
however: Under this mode, system resources may become extremely
under-utilized. The employment of "period processing" techniques reduces
the severity of this weakness to some extent. Period processing allows the
use of a single system to be partitioned into different time intervals, with
each interval belonging exclusively to a single security partition. All leftover
information is "sanitized" before the system is reassigned to another
partition.
This method, however, cannot mask the second problem intrinsic to the
whole approach: Dedicated mode and period processing do not allow
controlled information-sharing among applications with different sensitivity
levels.
System High Mode. Requiring all users to be cleared to the level of the
highest classified information in the system, is referred to as "System High"
operation (the subjects are still constrained by the discretionary policy). The
basic premise behind the system high scheme is that a security failures
would not violate the mandatory policy (although the need-to-know
violation continues to be a major concern.)
Although this mode of operation is less restrictive than the dedicated
mode, it suffers from three major drawbacks:
- It is not always feasible to obtain the necessary clearance for
indispensable people.
- It is unnecessary and costly to obtain the highest level of clearance for
those who do not have such a need.
- Information availability may be reduced unnecessarily as a result
of overclassification.
Controlled Mode. Systems operating under this mode are permitted
information processing over a limited range of sensitivity levels.
Ordinarily, three adjacent levels may be present concurrently (e.g., U, C, S
or C, S, TS).
As opposed to the previous two operating modes, controlled mode
mandates a clear separation between subjects and objects of different
sensitivity levels. The security hazard, on the other hand, is a somewhat lesser
consideration as compared to the multilevel secure mode (definition
follows) in that the least trusted subjects (U-level) and the most sensitive
objects (TS-level) are not present in the system concurrently.
Multilevel Secure Mode. Under this mode, both the mandatory and the
discretionary policies are enforced by a system which is capable of
operating under all sensitivity levels. Architecturally speaking, the
controlled and the MLS modes are functionally equivalent. The difference
lies in the degree of trust as obtained by the use of formal design
specification and verification. It is higher for an MLS system.
3.3 Security Evaluation Criteria
In August 1983, the DoD Computer Security Center published the
Trusted Computer System Evaluation Criteria. This document provides a
tangible framework for assessing the suitability of different systems for
processing classified information. The criteria defined in this document are
concerned with four basic objectives:
- Security Policy: This requirement addresses the fundamental principle
of security. It mandates a control over the access and dissemination of
information. This policy includes detailed regulations regarding the
management of information in the system.
- Accountability: No security policy is meaningful in the absence of a
proper individual accountability. The DoD Computer System Criteria
defines accountability by three of its major functions: user identification,
authentication, and audit capabilities.
- Assurance: Correct and accurate implementation of the security
policy is a requirement for all systems that process classified information.
This objective is further categorized into life-cycle assurance, which is
concerned with assuring the proper interpretation, design, and development
of the security policy, and operational assurance, which guarantees the
correct operation of the policy throughout the system's life cycle.
- Documentation: This objective is concerned with providing users
with manuals that describe the security features and mechanisms of the
system. In addition, the documentation offers system administrators
guidelines for the maintenance and inspection of audit files, descriptions
of test plans, and the design philosophy.
With the above objectives in mind, the DoD Criteria defines four
hierarchical security divisions. Each division (with the exception of division
"D" ) is further subdivided into security classes as follows:
Division D: Minimal Protection: Systems belonging to this category do
not satisfy the requirements of any of the other divisions. In other words,
systems classified under this division have been evaluated and found "least
secure."
Division C: Discretionary Protection: is reserved for systems that
implement need-to-know protection. This type of discretionary control,
however, extends only over one sensitivity level. This means that the
system is capable of separating the subjects from the objects within a single
sensitivity level, but there is no segregation of the information in the DoD
sense (i.e., between several sensitivity levels).
- Class C1: Discretionary Security Protection: Under this category,
access control and auditing mechanisms are provided in order to
enforce the need-to-know principle.
- Class C2: Controlled Access Protection: Includes systems that
provide a finer discretionary control than systems classified as
C1: Users are made accountable for their actions by means of login
mechanisms and the auditing of security-related events.
Division B: Mandatory Protection: covers systems that are capable of
processing data of different classifications. Information in such systems is
properly marked with non-forgeable labels. Systems belonging to this
division must show enforcement of a security policy that is based
on the notion of a reference monitor (i.e., the developer must show that a
functional reference monitor has been implemented).
- Class B1: Labelled Security Protection: In addition to the requirements
of class C2, systems belonging to this class require data labels for all
information kept in and processed by the system.
- Class B2: Structured Protection: Under this class, a formal security
policy model is documented, storage and timing channels are
addressed, and critical functions are separated from non-critical ones.
- Class B3: Security Domains: is reserved for systems that mediate
all accesses of subjects to objects, provide support for the
security administrator, and exclude security-irrelevant code from the
security-related sections (i.e., from the security kernel).
Division A: Verified Protection: Systems categorized under this
division are functionally equivalent to class B3. The employment of formal
security verification techniques to assert the correct operation of mandatory
and discretionary controls distinguishes this division from B3.
- Class A1: Verified Design: Through the use of formal verification
methods, this class guarantees that the given specification
corresponds to the formal model.
- Class A2: Verified Implementation: Formal verification techniques
are extended to establish the correctness of the implementation in
accordance with the specification.
The described requirements for each division and class are to be
implemented via what is known as the "Trusted Computing Base" or
"TCB", which denotes all the protection mechanisms of a system responsible
for enforcing a security policy.
There is a one-to-one correspondence between the operation modes (as
discussed in the previous section) and the DoD security divisions:

    Dedicated mode     -----  Division D
    System high mode   -----  Division C
    Controlled mode    -----  Division B
    MLS mode           -----  Division A
3.4 Architectural Features
The implementation of an MLS security policy normally mandates the
presence of certain architectural features necessary for a proper secure
operation. The identification of these characteristics not only provides a
well-defined set of requirements, but also serves as a means for a concrete
analysis of the security and fault-tolerance interaction:
- Labels: All decisions regarding mandatory access control are
based on sensitivity labels. These labels are assigned to all resources
(subjects as well as objects) in the system. The association of labels with
resources must be done in both an internal (machine-readable) and an
external (human-readable) form. As discussed earlier in section 3.1, each
label is a combination of a classification/clearance level and categories. The
TCB is responsible for maintaining the integrity of all labels.
- Mandatory Access Control: Based on sensitivity labels, the TCB
must enforce the mandatory access control policy. Basically, this is the
implementation of the reference monitor concept.
- Discretionary Access Control: The implementation of this feature has
two general approaches:
- Capability-based: Subjects present "tickets" in order to gain
access rights to an object [44, 83]. Every subject possesses a set of
capabilities that defines the access rights that the particular subject
has to various objects of the system. In the computer system, a
capability is a protected pointer to an object.
- Access Control List: A list of subjects authorized to access a specific
object is maintained with each object. When an access is sought, this
list is checked for the proper authorization.
These two approaches -- capabilities and access control lists -- may also be
combined to provide a more secure discretionary control, as done in the
Multics system [84]; a small sketch of both mechanisms appears at the end
of this section.
- Object Reuse: The TCB must ensure that when a storage object (e.g.,
a memory segment) is allocated to a new subject, all information in the object
"left over" from previous processing is erased.
- Identification and Authentication: It is the TCB's responsibility to
require all users to identify themselves when they try to log in. The identity
of the individual is then authenticated. Once this function is performed,
other security-relevant information such as clearance level may be
determined. Provision must be made to protect the authentication data
against unauthorized access.
The TCB is also responsible for maintaining a trusted communication
path. This logical path, which is isolated from other paths in the system, is
to be used for login purposes and other TCB-to-user connections such as a
change in the subject's clearance level.
-Security Audit: An audit trail must be maintained to log (in real time)
any security-related event (e.g., initial logon, occurrences of security
violation). This log should be protected against unauthorized access and
must remain intact even if the system is compromised.
- Protected communications: Encryption techniques or physical security
mechanisms may be utilized to provide a secure environment for message
transmission over the LAN.
- Confinement: Covert timing and storage channels should be eliminated,
or at least curbed by minimizing their bandwidth.
- TCB Protection: The TCB's code and structure should be encapsulated in
a privileged mode so that its internal structure is not accessible to users.
Tagging each storage section and virtualizing the memory are techniques
that may be employed to protect the TCB.
The above architectural features do not all possess the same degree of
importance, nor do they all require similar effort for their
implementation; they do, however, constitute a set of tangible requirements
which can be used for a meaningful study of the impact of security on fault
tolerance and vice versa.
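To make the discretionary-control contrast referred to above concrete, the sketch below (illustrative Python; the names and the payroll example are invented) implements both mechanisms side by side:

```python
# Access control list: each object carries its list of authorized subjects.
ACL = {"payroll.dat": {"alice": {"read", "write"}, "bob": {"read"}}}

def acl_check(subject, obj, mode):
    return mode in ACL.get(obj, {}).get(subject, set())

# Capabilities: each subject carries protected "tickets" naming objects.
CAPS = {"alice": {("payroll.dat", "read"), ("payroll.dat", "write")},
        "bob":   {("payroll.dat", "read")}}

def cap_check(subject, obj, mode):
    return (obj, mode) in CAPS.get(subject, set())

print(acl_check("bob", "payroll.dat", "write"))   # False: not on the ACL
print(cap_check("bob", "payroll.dat", "read"))    # True: bob holds the ticket
```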
3.5 Security in Distributed Systems
As indicated previously, distributed systems are the preferred
architecture for the design of fault tolerant systems, due to their ability to
provide redundancy. This type of architecture, however, brings with it
numerous security issues, which are addressed below:
- LAN: From the time a message leaves an IU to the moment it arrives
at the destination IU, the data on the network medium must be protected
against clandestine interception. One method to achieve this requirement is
to enforce a "system high" operation mode on the LAN. This could be
achieved by utilizing appropriate encryption and physical techniques.
However, this may not be a feasible solution, especially in less protected
environments (e.g., systems with a majority of unclassified subjects and
objects). A more realistic scheme would be to divide the network into
separate subnetworks [40, 80] as illustrated in figure 7. Each subnetwork
would be protected to the highest level of that particular subnetwork (which
may not be the maximum level of the entire LAN). The subnetworks are
connected via "bridges" in order to form the LAN. Link encryption
techniques may be used to protect the data passing through unprotected
areas between two bridges.
- Interface Unit: The basic function of an IU is to route messages from
a host to the LAN and accept messages coming from the LAN.
Furthermore, in order to allow for a realistic operation, the interface unit
must incorporate many security-related tasks. In particular, it must be
"trustworthy" to mediate the message traffic from one host to another.
Thus the IU must be a "Trusted Interface Unit" or TIU (Fig. 8). Some of the
more important security functions of a TIU are:
- Identification: The operational security status of every device attached
to the network must be identified by the TIU (e.g., the TIU must be
able to properly identify the security level of a terminal).
- Labelling: Every transmitted message must be properly labelled by
the TIU. These labels must be checked by the receiving TIU so that a
lower classified host does not receive a higher classified message (a
sketch of this mediation appears at the end of this section).
- Protection: The TIU is responsible for protecting the security label of
a message while the message is in the TIU.
- Confinement: A TIU may accept messages transmitted only between
hosts connected to and identified by the TIU. For example, if the header
bytes of a message do not specify an authentic destination, the
remaining data is ignored by the TIU.
Needless to say, the TIU must be physically protected to the level of
"network-high". This will ensure reliable mediation for all messages sent
and received.
[Figure omitted in this transcription: Top Secret and lower-level subnetworks, each with trusted/untrusted LAN interface units, crypto units, hosts and user terminals (subscribers), joined by bridges and half-bridges across a classified-environment boundary.]
Fig. 7 Physically Separated Subnetworks [40]
[Figure omitted in this transcription: a CSMA/CD interface to the LAN medium, CPU and memory, a security processor, and an I/O port to the terminal or host, connected over a microprocessor bus and divided into "red" and "black" sections.]
Fig. 8 Trusted Interface Unit Structure [40]
- Bridge: The operation of a bridge is similar to that of a TIU. Where the
TIU connects hosts to hosts, the bridge connects subnetworks. Its
functions are also similar to those of a TIU: A bridge is responsible for
taking packets from one subnetwork and sending them to another. Security
and destination verification are handled identically to the security checks
performed by the TIU: If the header destination does not match any of the
entries in the routing table, the remainder of the packet is rejected; otherwise
the packet is transmitted to the receiving subnetwork.
- Host: The principal security issue involving the host lies in the area
of operating system security. The fundamental question is whether the
operating system is responsible for providing security in a LAN-based
distributed system.
Two cases are considered here:
1. Trusted Operating System: Under this scheme, the operating system
in each host is evaluated and rated at division A or B. The operating system
is therefore responsible for preserving a multilevel or a controlled mode of
secure operation. As a result of this rating, the operating system (and not
the TIU) is tasked to maintain the security of the labels and all the security
features discussed in the previous sections.
2. Untrusted Operating System: This approach deals with operating
systems rated at DoD security evaluation criteria division C or D. These
types of operating systems are not considered secure and cannot be trusted
to enforce the mandatory and discretionary security policy. Consequently,
trusted interface units are used as the primary vehicle for providing secure
operation. As illustrated in Figure 9, each PU is dedicated to a single
classification level. The attached TIUs are responsible for labelling the
outputs of a PU with the classification level of that particular PU. This
method sometimes leads to the overclassification of information. One way
to correct this problem is to employ a "guard" process to relabel
overclassified information with its correct security levels.
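The label mediation performed by a receiving TIU, referred to earlier in this section, can be sketched as follows (illustrative Python; the message format, level encoding, and host names are assumptions):

```python
LEVELS = {"U": 0, "C": 1, "S": 2, "TS": 3}

class TrustedInterfaceUnit:
    def __init__(self, host_level, known_hosts):
        self.host_level = host_level    # level of the attached, identified host
        self.known_hosts = known_hosts

    def accept(self, msg):
        # Confinement: ignore traffic for unidentified destinations.
        if msg["dst"] not in self.known_hosts:
            return False
        # Label check: a lower classified host must not receive
        # a higher classified message.
        return LEVELS[msg["label"]] <= LEVELS[self.host_level]

tiu = TrustedInterfaceUnit("C", {"host-7"})
print(tiu.accept({"dst": "host-7", "label": "S"}))   # False: S > C, rejected
print(tiu.accept({"dst": "host-7", "label": "U"}))   # True:  U <= C, delivered
```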
[Figure omitted in this transcription: dedicated hosts at various sensitivity levels connected to LAN subnetworks through trusted and untrusted interface units.]
Fig. 9 LAN with Dedicated Hosts and Subnetworks
CHAPTER 4
FAULT TOLERANT SECURITY
He who gives up the smallest part of a secret
has the rest no longer in his power.
Jean Paul Richter.
One of the requirements of the Trusted Computer System Evaluation
Criteria for classification of a system at class B3 or above (i.e., a
system authorized for use in controlled or multilevel secure mode) is the
presence of a "trusted recovery" mechanism. The criteria define trusted
recovery as: "Procedures and/or mechanisms (which) must be provided to
assure that, after an ADP system failure or other discontinuity, recovery
without a protection compromise is obtained" [32].
Trusted recovery can be implemented in several ways. The simplest
way to handle the "trusted" requirement is to stop the system in a secure
state. This prevents security compromise by preventing accesses or data
transfers until the failure condition is corrected. The "recovery" requirement
is then handled by using conventional techniques. However, in systems
where continued operation is essential even in the presence of faults, i.e.,
fault tolerance is required, trusted recovery means that security mechanisms
(the TCB), must be fault tolerant as well.
In chapter 2, different approaches for obtaining fault tolerance were
presented. Some of those techniques, however, may not be suitable for
providing "recovery without a protection compromise." This chapter
examines the generic problems associated with the interaction of security
and fault tolerance, using a set of assertions. Although these assertions are
not yet provable, they constitute a meaningful framework for further studies
of the problem.
4.1 The Use of Redundancy
Fault tolerance is predicated on the use of redundancy in hardware,
software, storage, or processing. From the security point of view,
however, redundancy may increase vulnerability by: (1) increasing the
complexity of the system which makes it more difficult to understand and
prove correct, and (2) by increasing the amount of sensitive data that must
be protected, the number of locations that must be protected, and the time
period in which sensitive information is exposed to attack. Thus, there exist
certain basic incompatibilities in achieving both fault tolerance and security
which depend on how redundancy is provided (protective redundancy or
corrective redundancy), and at what level of granularity it is provided (e.g.,
system or subsystem level, or logic circuit or individual instruction level).
If the granularity of redundant design is very fine, such that replicated
information or repeated processing is in very small units (e.g., one byte of
data, or repeated computation of each instruction one at a time), very little
may be exposed to security threats. This is usually the case when protective
redundancy is designed into logic circuits. On the other hand, coarse
granularity in redundancy involves replication of larger amounts of
information or repeated computations of entire programs. Now the exposure
is much greater and for longer periods of time. This is characteristic of
protective redundancy at subsystem level, and of corrective redundancy
implemented in software.
The analysis of the security compatibility of redundancy techniques could
begin with the following statement:
Assertion 1: Corrective redundancy techniques are not suitable for
providing fault tolerant security.
This conclusion may be drawn directly from the definition of corrective
redundancy and from the trusted recovery requirement: Corrective
redundancy does not prevent errors. It is a method which lets errors
happen, but depends on detecting them with high probability, and then
being able to recover. This creates an insecure environment because security
failures may not be recoverable. For example, a failure affecting the
reference monitor mechanism may result in releases of sensitive information
to unauthorized subjects which cannot be "recalled".
The degree of vulnerability created by the use of corrective redundancy
in making security enforcement fault tolerant depends on several factors,
especially on the time period in which the fault remains latent before it is
detected. If the latency time can be made so short that erroneous disclosure
would not be possible, then the use of corrective redundancy may become
acceptable. The operational environment, likewise, affects the decision of
whether or not to use corrective redundancy. If the environment is relatively
benign (e.g., system high operation) then accidental releases of sensitive
information would not be catastrophic. In a true MLS environment,
however, where highly sensitive objects and untrustworthy subjects are in
the system simultaneously, the use of corrective redundancy in security
enforcement can be very risky.
Assertion 2: Protective fault tolerant techniques implemented in
hardware appear suitable for providing fault tolerant security.
As described earlier, protective redundancy techniques (implemented
by replicating modules or by repeating operations) "mask" a certain set of
errors and generate correct results despite these errors. This approach is
especially effective against transient errors. Protective redundancy
implemented in hardware, such as the use of replicated logic circuitry, can
be very effective for rendering a hardware-supported reference monitor fault
tolerant (e.g., in comparing subject/object labels and applying the security
policy requirements in memory management hardware). Software
implemented protective redundancy, such as automatically repeated
computations, can also be effective although now software complexity has
been increased. In general, hardware implemented security mechanisms are
less vulnerable than those implemented in software since, by definition,
software can be altered easily while hardware cannot. When using repeated
computations in rendering a security function fault tolerant, different
software routines are used in each repeated computation.
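The masking idea can be made concrete with a minimal C sketch, given here under stated assumptions: the three replica functions, the integer encoding of clearance levels, and the voter interface are illustrative inventions, not part of any actual reference monitor design discussed in this thesis.

    #include <stdio.h>

    /* Hypothetical access decision: 1 = grant, 0 = deny. In a hardware-
     * supported reference monitor this label comparison would be done by
     * replicated circuitry; each "module" is modeled here as a function. */
    typedef int (*monitor_fn)(int subject_level, int object_level);

    static int check_a(int s, int o) { return s >= o; }   /* replica 1 */
    static int check_b(int s, int o) { return s >= o; }   /* replica 2 */
    static int check_c(int s, int o) { return s >= o; }   /* replica 3 */

    /* Majority vote over three replicated monitors: a single faulty
     * replica is masked; *disagree is set when the replicas are not
     * unanimous, so a latent fault is detected even though the vote
     * still yields a correct result. */
    static int voted_check(monitor_fn m[3], int s, int o, int *disagree)
    {
        int r0 = m[0](s, o), r1 = m[1](s, o), r2 = m[2](s, o);
        *disagree = !(r0 == r1 && r1 == r2);
        return (r0 + r1 + r2) >= 2;      /* majority of the three results */
    }

    int main(void)
    {
        monitor_fn m[3] = { check_a, check_b, check_c };
        int disagree;
        int grant = voted_check(m, 2 /* subject level */,
                                   1 /* object level  */, &disagree);
        printf("access %s%s\n", grant ? "granted" : "denied",
               disagree ? " (replica disagreement logged)" : "");
        return 0;
    }

The disagreement flag also anticipates the observation made below: even after full masking is lost, a vote can still detect that one replica has failed.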
Error correcting codes (redundant bits computed by using special
algorithms and included in data entities) are another technique which is
effective in maintaining fault tolerance of security-related data, such as the
subject and object labels or subjects' authentication data. Cryptographic
checksums are useful for detecting any changes in data, but cannot be used
to provide fault tolerance since they cannot correct errors.
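As an illustration of error correcting codes applied to security-related data, the following C sketch protects a 4-bit label with a Hamming (7,4) code; the meaning assigned to the label bits is a hypothetical example, but the code corrects any single bit error, as required of the fault tolerant labels discussed here.

    #include <stdio.h>
    #include <stdint.h>

    static uint8_t bit(uint8_t v, int i) { return (v >> i) & 1u; }

    /* Encode a 4-bit label into a 7-bit Hamming codeword. Codeword bit
     * i-1 holds Hamming position i: p1 p2 d1 p3 d2 d3 d4. */
    static uint8_t hamming_encode(uint8_t label)   /* label in low 4 bits */
    {
        uint8_t d1 = bit(label, 0), d2 = bit(label, 1),
                d3 = bit(label, 2), d4 = bit(label, 3);
        uint8_t p1 = d1 ^ d2 ^ d4;          /* covers positions 1,3,5,7 */
        uint8_t p2 = d1 ^ d3 ^ d4;          /* covers positions 2,3,6,7 */
        uint8_t p3 = d2 ^ d3 ^ d4;          /* covers positions 4,5,6,7 */
        return (uint8_t)(p1 | p2 << 1 | d1 << 2 | p3 << 3 |
                         d2 << 4 | d3 << 5 | d4 << 6);
    }

    /* Decode: the syndrome gives the position of a single flipped bit
     * (0 means no error), which is then corrected before extraction. */
    static uint8_t hamming_decode(uint8_t cw)
    {
        uint8_t s1 = bit(cw,0) ^ bit(cw,2) ^ bit(cw,4) ^ bit(cw,6);
        uint8_t s2 = bit(cw,1) ^ bit(cw,2) ^ bit(cw,5) ^ bit(cw,6);
        uint8_t s3 = bit(cw,3) ^ bit(cw,4) ^ bit(cw,5) ^ bit(cw,6);
        int pos = s1 | s2 << 1 | s3 << 2;
        if (pos) cw ^= (uint8_t)(1u << (pos - 1));
        return (uint8_t)(bit(cw,2) | bit(cw,4) << 1 |
                         bit(cw,5) << 2 | bit(cw,6) << 3);
    }

    int main(void)
    {
        uint8_t label = 0x5;                       /* arbitrary 4-bit label */
        uint8_t cw = hamming_encode(label) ^ (1u << 4); /* inject 1-bit error */
        printf("recovered label: 0x%x\n", hamming_decode(cw)); /* 0x5 */
        return 0;
    }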
Permanent faults in one or more modules of a set of redundant
hardware modules may destroy the fault masking property of the set, but the
fault detection capability remains since it is very unlikely that the majority of
the modules fail in a precisely identical way. Thus the system could
continue to operate in a gracefully degraded way -- the security mechanism
works, but is no longer totally fault tolerant.
4.2 Necessary and Sufficient Conditions
In developing the foundations for fault tolerant security it would be
useful to derive a set of criteria for assessing the degree of fault tolerant
security that might be achievable through using various fault tolerant
techniques. Based on the two assertions in the previous section, the
following may be deduced:
Assertion 3:
A sufficient condition for fault tolerant security is that
all security mechanisms in the system be fault tolerant
through the employment of protective redundancy
techniques.
If this assertion holds, all hardware failures in the system (within the
specified fault tolerant design criteria) which could affect the system's
security mechanisms are masked automatically and they cannot cause any
erroneous functioning of the security mechanisms. Similarly, software
errors are masked by software-implemented protective redundancy.
The amount of hardware and software that must be fault tolerant
through protective redundancy and must reside in the system's trusted
computing base (TCB) is specified in the DoD security classification criteria.
For example, a division C system would need to render its discretionary
access control mechanisms fault tolerant, while in a division B system, both
the mandatory and discretionary access control mechanisms must be fault
tolerant.
Sometimes, however, it may not be possible to implement all the
security requirements in a protective fault tolerant fashion. In that case, at
least the indispensable features of the security policy and its enforcement
(such as those dealing with the mandatory access control) must be made
fault tolerant through the employment of protective redundancy. From the
above discussion it follows that, for the DoD multilevel security policy:
Assertion 4:
A necessary condition for fault tolerant security is that
security mechanisms required for enforcing the
mandatory access control be fault tolerant through the
use of protective redundancy techniques.
This assertion has an important corollary which relates to the security
labels: No errors in the system should be allowed to convert a legitimate
security label to some other valid label. Error correction or, at least, error
detection is needed for the prevention of a possible illegitimate label
conversion. Thus:
Assertion 5:
A necessary condition for fault tolerant security is that
the security labels be fault tolerant.
Any absence of full error correction capability for the security labels
will undermine the desired fault tolerant capability of the security
mechanism. In any case, a very reliable error detection capability for the
security labels is an absolute necessity.
An important subsystem to be considered in distributed systems is that
of trusted interface units (TIUs), especially when the entire security system
is based on their correct functioning. In these systems, the fault tolerance of
the TIUs is of primary concern. Corrective redundancy techniques are out
of the question since the main function of a TIU is to release information
from its host system. Therefore:
Assertion 6:
In distributed systems with an untrusted operating
system, trusted interface units must be made fault
tolerant through the employment of protective
redundancy techniques.
In summary, it appears that fault tolerant security necessitates the use of
protective redundancy as implemented in hardware. However, there must
also be in place reliable and rapid means to detect and immediately repair or
replace the failed unit(s). After a permanent failure has occurred, the level
of fault tolerance has been reduced and is restored only after repairs or
replacement have been completed.
4.3 Secure Fault Tolerance
The interaction of data security and fault tolerance may be viewed from
a different perspective where computational fault tolerance is sought in a
system which must also be secure. The issue here is the possible effects that
the implementation of fault tolerance may have on the data security
mechanisms.
Implementing fault tolerance in a protective redundant manner normally
requires the employment of additional hardware mechanisms at the circuit
level, switching circuit level, register transfer sublevel, or even at the system
level, such as the replication of the entire CPU. Hardware redundancy,
especially at higher architectural levels, implies system complexity, which is
considered a generic vulnerability for the security of the system. The addition
of new data paths introduces new sources of potential covert channels.
Furthermore, due to the changes in the system configuration, the design and
the implementation must be proved correct once again. The following may
therefore be stated:
Assertion 7:
Computational fault tolerance through the use of
processor or system level protective redundancy
increases the difficulty of maintaining security.
Corrective redundant implementation of fault tolerance also increases the
difficulty of maintaining security. In order to obtain software fault tolerance,
a great deal of communication must take place among the processes for
diagnosis, reconfiguration, and recovery. Violation of the *-property,
therefore, becomes a major concern whenever two processes at different
clearance levels try to communicate. Moreover, the amount of intermediate
results, such as checkpoints, that must be kept may allow covert channels
which could be used by Trojan horses that may have infiltrated the fault
tolerant processes. From the above discussion, the following can be
asserted:
Assertion 8:
Software implemented corrective redundancy will have
serious impacts on system security.
In summary, there are serious interactions between security and fault
tolerance which must be considered and resolved before a trusted fault
tolerant system is designed and implemented.
CHAPTER 5
GRACEFUL DEGRADATION OF SECURITY
Grace is more beautiful than beauty
Emerson
As noted in chapter 2, gracefully degradable systems are those that are
able to continue correct operation, though degraded (with reduced
capabilities), in the event of a failure. The basic philosophy is that degraded
operation is preferable to no operation at all. This chapter deals with the
interplay of security and graceful degradation; that is, whether it is
meaningful to speak of graceful degradation of security. In other words, is
it possible to maintain a certain level of security in the presence of failures in
some security-related functions?
Answering the above question in a meaningful and tangible way
requires a reference point, a model. Using the DoD security evaluation
criteria as the model, graceful degradation of security may be defined as a
series of downward migrations of the system in the hierarchy of security
modes and classifications/divisions. This means that as faults occur in the
operation of the security modules, the system's security classification may
decrease from A1 to B3, to B2, and so forth, down to D. Consequently, the
system's operating modes may migrate from multilevel secure, to
controlled, to system high, and finally to dedicated mode (due to the
one-to-one correspondence between secure operating modes and the
evaluation criteria).
Just as other types of degradation impact the users and data in some
way, security degradation, as defined above, affects the users in its own
ways:
- Users who do not satisfy the newly imposed, stricter security
clearance requirements (due to the degradation) are disconnected from the
system.
- Some data may need to be purged from the system because of their
higher (or lower) classification relative to the newly imposed clearance
levels.
With reference to the above conceptual model of security degradation in
the DoD criteria system, the following may be asserted in principle:
Assertion 9: Graceful degradation of security is feasible.
The important issue is then the practical implementation of security
degradation. As noted before, the basis of any gracefully degrading system is
redundancy in all the system's resources (which ensures the absence of any
single point of failure).
Degradation manifests itself in the loss of computational capacity and/or
performance when one of a set of resource units fails. Degradation worsens
as additional resource units of the same type or of some other types are lost.
If security were implemented in the same fashion, with more than one of
each security resource module (e.g., implementations of the reference
monitor), then there would exist other phases of graceful degradation before
the downward migration in the DoD criteria classification system. For
example, if N reference monitor units were operating concurrently in a
loadsharing fashion to improve performance (a concept which has not been
proposed so far in the computer security community), the loss of one unit
would increase the work load on the others and, under heavy demand,
degrade performance. However, not until the last unit of a particular type
fails would there be a requirement to reduce the security classification of the
system. Distributed systems in general, and LAN-based architectures in
particular can, therefore, provide a suitable environment for obtaining
graceful degradation of security.
A possible implementation may call for the employment of multilevel
trusted interface units (TIU/ML) due to their capability to limit
communications to the authorized range of security levels for each host,
rather than enforcing a single security level. This means that graceful
degradation of security can be easily implemented in a LAN-based system
by reducing the authorized operating range of the TIUs associated with a
host which needs to undergo graceful degradation of its security. Therefore:
Assertion 10:
A distributed architecture provides a suitable
environment for obtaining graceful degradation of
security.
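The following C sketch illustrates this idea under stated assumptions: the TIU/ML record, the level names, and the narrowing operation are all hypothetical, standing in for whatever state an actual multilevel TIU would hold.

    #include <stdio.h>

    enum level { UNCLASSIFIED = 0, CONFIDENTIAL, SECRET, TOP_SECRET };

    /* Hypothetical TIU/ML state: the range of security levels the
     * attached host is currently authorized to send or receive. */
    struct tiu_ml {
        enum level low, high;
    };

    /* A message may pass only if its label lies within the range. */
    static int tiu_permit(const struct tiu_ml *t, enum level msg_level)
    {
        return msg_level >= t->low && msg_level <= t->high;
    }

    /* Graceful degradation of security: shrink the authorized range
     * instead of disconnecting the host outright. */
    static void tiu_degrade(struct tiu_ml *t, enum level new_high)
    {
        if (new_high < t->high)
            t->high = new_high;
    }

    int main(void)
    {
        struct tiu_ml t = { UNCLASSIFIED, TOP_SECRET };  /* MLS range */
        tiu_degrade(&t, SECRET);     /* a security function has failed */
        printf("TS message passes: %d\n", tiu_permit(&t, TOP_SECRET)); /* 0 */
        printf("S  message passes: %d\n", tiu_permit(&t, SECRET));     /* 1 */
        return 0;
    }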
Having established the conceptual feasibility of the design of gracefully
degrading secure systems, it is necessary to identify decision criteria for the
implementation. One such set of criteria is shown in Table 1. It is based on
the architectural requirements defined by the DoD criteria for the TCB, as
discussed in chapter 3. The interpretation of Table 1 is as follows:
- Each row represents a security class of the DoD criteria.
- Each column represents a security-related function as defined in the
DoD criteria.
- An "X" appearing in the cross-section of a row and a column
denotes that the presence of the security function (presented in the heading
of the column) is required for operating the system under the specified
division (row). For example, an "X" in row "C1" and column "TCB
protection" indicated that the protection of the TCB is a necessary condition
for a system to operate under class C1, and that the abscence of the TCB
protection will lower the system's security level to the division and class
specified under "X".
- If more than one of the required functions become inoperative, the
system's security level is reduced to the highest division and class where
these inoperative functions are not required. For example, if the system is
operating under class B2 and both the "mandatory access control" and the
"storage object reuse" fail to operate properly, the system's security level
will be reduced to C2 where these functions are not required.
- Classes B3, A1, and A2 are grouped together under one category
because, as discussed before, their differences lie only in the area of
assurance. All three require the same security functionality and architectural
features.
The above discussion can be summarized as follows:
Assertion 11:
Rendering a security function inoperative will lower the
system's security level to the highest division and class
where this function is not required.
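Assertion 11 suggests a simple, mechanizable rule. The C sketch below computes the class to which a system degrades when a set of functions fails; the required-function matrix is only illustratively populated (mandatory access control required at B1 and above) so that it reproduces the B2 example given above, and is not a transcription of Table 1.

    #include <stdio.h>

    /* Classes in ascending order of the DoD criteria hierarchy. */
    enum cls { D = 0, C1, C2, B1, B2, B3, NCLASSES };
    static const char *name[] = { "D", "C1", "C2", "B1", "B2", "B3/A1/A2" };

    #define NFUNCS 19   /* security functions (1)..(19) of Table 1 */

    /* required[c][f] is nonzero when function f+1 is required at class c,
     * i.e., where Table 1 carries an "X".  Illustrative entries only. */
    static int required[NCLASSES][NFUNCS];

    /* Assertion 11: the failure of a set of functions lowers the system
     * to the highest class at which none of the failed functions is
     * required. */
    static enum cls degrade(enum cls current, const int failed[NFUNCS])
    {
        for (int c = current; c > D; c--) {
            int ok = 1;
            for (int f = 0; f < NFUNCS; f++)
                if (failed[f] && required[c][f]) ok = 0;
            if (ok) return (enum cls)c;
        }
        return D;
    }

    int main(void)
    {
        /* Illustrative: mandatory access control (5) required at B1+. */
        for (int c = B1; c < NCLASSES; c++) required[c][5 - 1] = 1;

        int failed[NFUNCS] = { 0 };
        failed[5 - 1] = 1;                 /* MAC has become inoperative */
        printf("B2 system degrades to %s\n", name[degrade(B2, failed)]);
        return 0;
    }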
Security degradation should be viewed mainly in terms of individual
(faulty) PUs and not the whole system. This way, the reduced capabilities
affect only the failed PU and the rest of the system may continue operation
in the same security mode as before, given that the faulty PU is physically
or logically isolated from the rest of the system. To summarize:
Assertion 12: Graceful degradation of security results in reducing the
range of data classification levels that may be handled
by the faulty module or system, and the range of user
clearance levels that will be permitted access.
Graceful degradation of security in a PU or in the entire system is a
consequence of total loss of some security-related function. This happens
in systems without fault tolerant security at the first failure in this function.
In systems where this function is fault tolerant, at least two failures in the
function (or N+1 failures, if the function is replicated N-fold) must occur
before graceful degradation to a lower security level takes place. However,
there is also a graceful degradation in the fault tolerance of the function each
time one of its replicated versions fails. Likewise, if the replicated versions
were all operating in load sharing mode, there would also be a certain
degradation in the security system's performance.
Two important considerations which are beyond the scope of this thesis
are: (1) the design and implementation of a system, itself fault tolerant, for
the detection of failures in security functions, and (2) the design and
implementation of a system, also fault tolerant, for securely changing the
operating mode of the system when graceful degradation is required. The
first seems to require a security system design where each security function
is implemented modularly, with diagnostics available to test for its correct
operation. The second requires rapid containment of all subjects and objects
until the new operating mode has been established and decisions have been
made about the subjects' authorizations. For example, in the migration from
MLS mode to controlled mode, a decision has to be made about which three
(or two) contiguous security levels to retain out of the four security levels
which may have been in the system before the change. Further research is
needed to develop these designs and decision criteria.
Table 1 Graceful Degradation of Security

The rows of Table 1 are the security classes D, C1, C2, B1, B2, and
B3/A1/A2. The columns are the security-related functions of the DoD
criteria, grouped as follows: Security Policy - discretionary access control
(1)(2)(3), object reuse (4), mandatory access control (5), and labels (6)(7);
Accountability - identification and authentication (8)(9)(10) and audit
(11)(12)(13)(14); Operational Assurance - system architecture (15)(16)(17),
system integrity (18), and trusted recovery (19). An "X" marks a function
required at that class; beneath each "X" the table gives the division/class and
secure operating mode to which the system degrades when that function
becomes inoperative.
(1)
The TCB shall define and control access between named users
and named objects (e.g., files and programs) in the ADP system.
The enforcement mechanism (e.g., self/group/public controls,
access control lists) shall allow users to specify and control
sharing of those objects by named individuals or defined groups
or both (Section 2.1.1.1 ).
(2)
The discretionary access control mechanism shall, either by
explicit user action or by default, provide that objects are
protected from unauthorized access. These access controls shall
be capable of including or excluding access to the granularity of a
single user. Access permission to an object by users not already
possessing access permission shall only be assigned by
authorized users (Section 2.2.1.1 ).
(3)
The discretionary access control mechanism shall be capable of
specifying, for each named object, a list of named individuals
and a list of groups of named individuals with their respective
modes of access to that object. Furthermore, for each such
named object, it shall be possible to specify a list of named
individuals and a list of groups of named individuals for which
no access to the object is to be given (Section 3.3.1.1).
(4)
When a storage object is initially assigned, allocated, or
reallocated to a subject from the TCB's pool of unused storage
objects, the TCB shall assure that the object contains no data
for which the subject is not authorized (Section 2.2.1.2).
(5)
The TCB shall enforce a mandatory access control policy over all
subjects and storage objects under its control (e.g., processes,
files, segments, devices). These subjects and objects shall be
assigned sensitivity labels that are a combination of hierarchical
classification levels and non-hierarchical categories, and the
labels shall be used as the basis for mandatory access control
decisions. The TCB shall be able to support two or more such
sensitivity levels (Section 3.1.1.4).
(6)
Sensitivity labels associated with each subject and storage object
under its control (e.g., process, file, segment, device) shall be
maintained by the TCB. These labels shall be used as the basis
for mandatory access control decisions. In order to import
non-labeled data, the TCB shall request and receive from an
authorized user the security level of the data, and all such actions
shall be auditable by the TCB (Section 3.1.1.3).
Sensitivity labels shall accurately represent security levels of the
specific subjects or objects with which they are associated.
When exported by the TCB, sensitivity labels shall accurately
and unambiguously represent the internal labels and shall be
associated with the information being exported (Section
3.1.1.3.1).
The TCB shall designate each communication channel and I/O
device as either single level or multilevel. Any change in this
designation shall be done manually and shall be auditable by the
TCB. The TCB shall maintain and be able to audit any change in
the current security level associated with a single-level
communication channel or I/O device (Section 3.1.1.3.2).
When the TCB exports an object to a multilevel I/O device, the
sensitivity label associated with that object shall also be exported
and shall reside on the same physical medium as the exported
information and shall be in the same form (i.e., machine-readable
or human-readable form). When the TCB exports or imports an
object over a multilevel communication channel, the protocol
used on that channel shall provide for the unambiguous pairing
between the sensitivity labels and the associated information that
is sent or received (Section 3.1.1.3.2.1).
Single-level I/O devices and single-level communication channels
are not required to maintain the sensitivity labels of the
information they process. However, the TCB shall include a
mechanism by which the TCB and an authorized user reliably
communicate to designate the single security level of information
imported or exported via single-level communication channels or
I/O devices (Section 3.1.1.3.2.2).
The TCB shall mark the beginning and end of all
human-readable, paged, hardcopy output (e.g., line printer
output) with human-readable sensitivity labels that properly
represent the sensitivity of the output. The TCB shall, by default,
mark the top and bottom of each page of human-readable, paged,
hardcopy output (e.g., line printer output) with human-readable
sensitivity labels that properly represent the overall sensitivity of
the information on the page. The TCB shall, by default and in an
appropriate manner, mark other forms of human-readable output
(e.g., maps, graphics) with human-readable sensitivity labels that
properly represent the sensitivity of the output. Any override of
these marking defaults shall be auditable by the TCB (Section
3.1.1.3.2.3).
(7)
The TCB shall immediately notify a terminal user of each change
in the security level associated with that user during an interactive
session. A terminal user shall be able to query the TCB as
desired for a display of the subject's complete sensitivity label
(Section 3.2.1.3.3).
(8)
The TCB shall require users to identify themselves to it before
beginning to perform any other actions that the TCB is expected
to mediate. Furthermore, the TCB shall use a protected
mechanism (e.g., passwords) to authenticate the user's identity.
The TCB shall protect authentication data so that it cannot be
accessed by any unauthorized user (Section 2.1.2.1).
(9)
The TCB shall be able to enforce individual accountability by
providing the capability to uniquely identify each individual ADP
system user. The TCB shall also provide the capability of
associating this identity with all auditable actions taken by that
individual (Section 2.2.2.1 ).
(10) The TCB shall maintain authentication data that includes
information for verifying the identity of individual users (e.g.,
passwords) as well as information for determining the clearance
and authorizations of individual users. This data shall be used by
the TCB to authenticate the user's identity and to determine the
security level and authorizations of subjects that may be created to
act on behalf of the individual user (Section 3.1.2.1).
(11) The TCB shall be able to create, maintain, and protect from
modification or unauthorized access or destruction an audit trail
of accesses to the objects it protects. The audit data shall be
protected by the TCB so that read access to it is limited to those
who are authorized for audit data. The TCB shall be able to
record the following types of events: use of identification and
authentication mechanisms, introduction of objects into a user's
address space (e.g., file open, program initiation), deletion of
objects, and actions taken by computer operators and system
administrators and/or system security officers. For each recorded
event, the audit record shall identify: date and time of the event,
user, type of event, and success or failure of the event. For
identification/authentication events the origin of request (e.g.,
terminal ID) shall be included in the audit record. For events that
introduce an object into a user's address space and for object
deletion events the audit record shall include the name of the object.
The ADP system administrator shall be able to selectively audit the
actions of any one or more users based on individual identity
(Section 2.2.2.2).
(12) The TCB shall also be able to audit any override of
human-readable output markings (Section 3.1.2.2).
(13) The TCB shall be able to audit the identified events that may be
used in the exploitation of covert channels (Section 3.2.2.2).
(14) The TCB shall contain a mechanism that is able to monitor the
occurrence or accumulation of security auditable events that may
indicate an imminent violation of security policy. This
mechanism shall be able to immediately notify the security
administrator when thresholds are exceeded (Section 3.3.2.2).
(15) The TCB shall maintain a domain for its own execution that
protects it from external interference or tampering (e.g., by
modification of its code or data structures) (Section 2.1.3.1.1).
(16) The TCB shall isolate the resources to be protected so that they
are subject to the access control and auditing requirements
(Section 2.2.3.1.1).
(17) The TCB shall maintain process isolation through the provision
of distinct address spaces under its control (Section 3.1.3.1.1).
(18) Hardware and/or software features shall be provided that can be
used to periodically validate the correct operation of the on-site
hardware and firmware elements of the TCB (Section 2.1.2.1.2).
(19) Procedures and/or mechanisms shall be provided to assure that
after an ADP system failure or other discontinuity, recovery
without a protection compromise is obtained (Section 3.3.3.1.5).
CHAPTER 6
SECURE FAULT TOLERANCE: THE MuTEAM CASE STUDY
Precept begins, but example completes.
A French proverb.
6.1 Introduction
MuTEAM is a prototype multimicroprocessor distributed system which
aims at the development of design methodologies for real time distributed
process control applications [24, 29, 46]. Its goals - modularity, functional
distribution, and fault tolerance - are achieved by means of a decentralized
architecture in which there is no logical or physical single point of failure.
This type of architecture makes MuTEAM an appropriate vehicle for the
study of secure fault tolerance.
6.2 System Description
6.2.1 Physical Level
MuTEAM is composed of a set of clusters of processing units (PUs)
which are loosely connected via a local area network, the Cluster Bus.
Each PU consists of a CPU, an address translator, a shared memory
subsystem, a private memory, an I/O subsystem, and a communication
controller (Fig. 10). Through the Cluster Bus, the processor of a PU at a
node can access shared memory blocks of the PUs at other nodes.
Inter-PU communication is permitted only via mailboxes in the shared
memory blocks located at the nodes of the receiver processes. This method
allows logical insulation of a faulty PU from other PUs in the cluster (once a
fault is detected) by revoking its access rights to other PUs. In addition, the
PUs are also connected via another bus, the Signaling Bus, which is used
for transmitting interprocessor interrupts for fault tolerance and other
purposes.
Fig. 10 MuTEAM PU [24] (processor P, address translator AT, shared
memory subsystem with two-port facility, private memory and I/O
subsystem on a private bus, and communication controller CC, connected to
the Cluster Bus and the Signalling Bus)
6.2.2 Process Level
Both the resource management and the fault treatment mechanisms in
MuTEAM are based on a concurrent programming model, ECSP, an
extended version of CSP [49], which is a suitable language for
message-based concurrent programming.
MuTEAM's programs consist of a set of active entities called processes,
each running in a local protected environment. Processes communicate
among themselves by means of message passing. This structure allows an
easier control over the way processes interact (an area which has been shown
to be error-prone), provides modularity which in turn permits easy
modification, and allows natural error confinement, an important
consideration in the area of fault tolerance as well as security. The set of
processes executing at any one time may be envisioned to be a set of trees
whose "active" processes are represented by the leaves. Higher level
processes (those closer to the root) remain idle until their "children"
terminate.
The logical communication channel between two interacting processes
consists of the triple (source, destination, message pattern). A variety of
communication mechanisms are provided at the process level such as
synchronous/asynchronous, symmetric/asymmetric communications as well
as static/dynamic channels. The latter is an important consideration in system
reconfiguration where new communication paths must be established
dynamically whenever an error occurs.
In order to carry out the fault tolerance mechanism, a "twin process" is
allocated at sysgen time to run concurrently with each "primary process". A
twin process is an identical copy of its primary process, but it runs in a
different node. In normal operation, every time a primary process is
activated, the same is done for its twin. Any message sent to a primary
process is also sent to the twin. Twin processes are used in the
reconfiguration phase as replacements for their primary processes running in
the faulty node (i.e., they act as "hot standby" spares).
6.2.3 Kernel Level
The MuTEAM's fault tolerance mechanism is implemented in software
at the operating system kernel level [10, 19]. The kernel (which is replicated
in each node) consists of a set of activation, termination, input, and output
primitives. The activation process is initiated by a suitable kernel command
which signals the dispatcher and gives it the starting addresses of the
process to be activated and its twin process. When a process terminates, the
kernel also ends all the suspended I/O commands referring to the terminated
process. Processes that were activated in parallel with the terminated one are
also notified of this termination by the kernel.
Output primitives are the mechanism used to implement message
passing. These primitives inspect the receiver's data structure for an existing
logical channel. If the message pattern of the sender matches with what the
receiver expects then the message is recorded in the receiver's buffer called
INCTABLE, which is a table residing in the local memory of each PU. If
the message pattern does not match the receiver's expectation, then it is
recorded in another buffer and an "unsuccessful" return code is sent back.
Altogether the output primitive has three return codes: successful,
unsuccessful (e.g., a process tries to communicate with a terminated one),
and abnormal termination. The latter will cause the error detection unit to
send an interrupt to the kernel of the violator's processor which will start up
the fault tolerance process.
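The behavior of the output primitive may be paraphrased in a C sketch; the structure names, the INCTABLE size and fields, and the condition used to represent an "abnormal" request are illustrative assumptions rather than MuTEAM's actual data structures, and the recording of non-matching messages in a second buffer is omitted for brevity.

    #include <stdio.h>

    enum out_status { OUT_SUCCESS, OUT_UNSUCCESSFUL, OUT_ABNORMAL };

    struct message { int pattern; char body[64]; };

    /* Per-PU receive buffer, kept in local memory, standing in for
     * INCTABLE. */
    #define INCTABLE_SLOTS 32
    struct inctable {
        int expected_pattern;       /* what the receiver is waiting for */
        int terminated;             /* receiver no longer exists        */
        struct message slot[INCTABLE_SLOTS];
        int n;
    };

    /* Deliver msg to the receiver's buffer if the pattern matches,
     * otherwise report failure; an illegal request yields the abnormal
     * code, which in MuTEAM triggers an interrupt to the violator's
     * kernel and starts fault tolerance processing. */
    enum out_status output(struct inctable *rcv, const struct message *msg)
    {
        if (msg->pattern < 0)                   /* illegal request        */
            return OUT_ABNORMAL;
        if (rcv->terminated || msg->pattern != rcv->expected_pattern
            || rcv->n == INCTABLE_SLOTS)
            return OUT_UNSUCCESSFUL;            /* e.g., terminated peer  */
        rcv->slot[rcv->n++] = *msg;             /* record in buffer       */
        return OUT_SUCCESS;
    }

    int main(void)
    {
        struct inctable rcv = { .expected_pattern = 7 };
        struct message msg = { .pattern = 7, .body = "checkpoint data" };
        printf("send: %d\n", output(&rcv, &msg));  /* 0 = OUT_SUCCESS      */
        msg.pattern = 9;
        printf("send: %d\n", output(&rcv, &msg));  /* 1 = OUT_UNSUCCESSFUL */
        return 0;
    }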
6.3 Fault Tolerance in MuTEAM
In MuTEAM, a PU is considered faulty as soon as a fault is detected
within it. A faulty node is isolated from the rest of the system by the
cooperation of the non-faulty PUs. This is realized by revoking the access
rights of all the processes that reside in the faulty PU. After the faulty PU is
isolated in this manner, its twin processes are restarted and the system
continues its normal operation.
In the present implementation of MuTEAM, only one twin process is
created (statically) for each primary process. Such a design creates a major
problem when an error occurs in a PU running a twin process whose
primary process has already suffered a failure and is not operating. A
solution to this problem is the dynamic allocation of additional twin
processes in different PUs every time an error is detected within a PU.
To obtain the above, MuTEAM goes through four phases - error
detection, diagnosis, reconfiguration, and recovery (Fig. 11).
6.3.1 Error Detection
In MuTEAM, errors are detected by the Protection unit (Fig. 12), a
hardware entity which resides in each PU. Access right violation, memory
segment limit violation, and incorrect message type are among the possible
errors detected by the protection unit. The protection unit consists of several
registers each containing access rights for a segment belonging to the shared
block of a node. The desired access right may be specified through the
Status Line. The access right violation checker examines the requested
access type against the rights of the accessing processor on the specific
segment. In case of a violation, the hardware unit generates an interrupt to
the violator's processor and to the processor where the addressed segment
resides. Upon receiving the interrupt, the kernel of each PU invokes the
respective diagnosis routine.
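A software paraphrase of the right violation checker is sketched below in C; the register layout and right encodings are assumptions for illustration only. The revoke_all routine also shows how the consensus of the non-faulty PUs to revoke access rights (section 6.3) would isolate a faulty PU.

    #include <stdio.h>

    #define RIGHT_READ  1u
    #define RIGHT_WRITE 2u

    #define NSEGMENTS 16
    struct protection_unit {
        unsigned rights[NSEGMENTS];   /* one register per shared segment */
    };

    /* Compare the requested access type (from the status line) against
     * the accessor's rights on the segment. On a violation the hardware
     * would interrupt both the violator's processor and the processor
     * owning the segment; here we just return the result. */
    static int check_access(const struct protection_unit *pu,
                            int segment, unsigned requested)
    {
        if (segment < 0 || segment >= NSEGMENTS)
            return 0;                          /* segment limit violation */
        return (pu->rights[segment] & requested) == requested;
    }

    /* Isolating a faulty PU: the non-faulty PUs revoke all its rights. */
    static void revoke_all(struct protection_unit *pu)
    {
        for (int s = 0; s < NSEGMENTS; s++)
            pu->rights[s] = 0;
    }

    int main(void)
    {
        struct protection_unit pu = { { 0 } };
        pu.rights[3] = RIGHT_READ;
        printf("read seg 3:  %d\n", check_access(&pu, 3, RIGHT_READ));  /* 1 */
        printf("write seg 3: %d\n", check_access(&pu, 3, RIGHT_WRITE)); /* 0 */
        revoke_all(&pu);               /* PU diagnosed faulty: isolate it */
        printf("read seg 3:  %d\n", check_access(&pu, 3, RIGHT_READ));  /* 0 */
        return 0;
    }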
Fig. 11 Fault Tolerance Processing in MuTEAM [12]: Ni detects a
permanent error; Ni stops normal activity, informs the other nodes, and
performs the diagnostic algorithm. If no faulty node is found, Ni resumes
its suspended functioning. Otherwise (let Nj be the detected faulty node):
Ni logically disconnects itself from Nj; if needed, Ni redistributes processes
among nodes; Ni loads the processes into its own memory; Ni recovers in a
consistent state; Ni restores communication paths with non-faulty nodes and
properly updates tables; the system is restarted.
Fig. 12 Error Detection Unit in MuTEAM [24] (protection unit registers,
selected by processor name and segment virtual name through a dual 16-1
multiplexer, feed a right violation checker that raises the access violation bit)
6.3.2 Error Diagnosis
After being invoked, the Diagnostic Processes (DPs) residing in each
node start testing a preassigned set of other nodes. This is accomplished as
follows: Each DP checks its diagnostics matrix DM to see which set of
nodes it is supposed to test. DM is an N x N matrix (where N is the
number of active nodes in the system) holding boolean values which
indicate whether a node is to test another node. DM is initialized at sysgen
time with a degree of connectivity sufficient to diagnose a prespecified
number of faults in the system.
This method of error diagnosis [25, 89] is based on the graphical
model of Preparata et al. [77] in which the DPs are represented by a graph.
Each PU is represented by a node in the graph. An arc from PUi to PUj
indicates that PUi is able to carry out a test on PUj and judge its
status. In MuTEAM, however, the DPs do not actually run any test on other
PUs; they merely require other DPs to run test routines on their nodes and
report the results. Thus, the function of each node is twofold: (1) sending
test request messages to other DPs, and (2) running test routines on its own
node on behalf of other DPs. These functions are accomplished by a set of
concurrent processes which are activated when the DPs are invoked (Fig.
13). The Under Test Manager (UTM) is in charge of receiving and
serializing test requests from other DPs. These requests are sent by the
Tester Process (Ti). The Auto Test Process (ATi) is responsible for
running a self-test on its own PU.
In order to prevent DPs from being blocked when communicating with
a faulty node, a Timer Process (TIM) is employed to enforce time-outs in
communication. The test results (faulty, fault-free, not received) are
recorded into a syndrome matrix (SM), another N x N matrix. This matrix
is initialized to "not received" and is examined after each update to see if the
status of every node can be determined from the syndrome. This method
eliminates the problem of decoding an incomplete syndrome which occurs
when a faulty PU does not report its test results to other nodes.
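The DM/SM bookkeeping may be sketched as follows in C. The node count, the connectivity pattern, and the simulated test outcomes are illustrative assumptions; in particular, a timed-out or failed test is simply recorded as "faulty" here, whereas MuTEAM distinguishes a "not received" verdict (the zero-initialized state of SM below).

    #include <stdio.h>

    #define N 4                                /* active nodes */
    enum verdict { NOT_RECEIVED, FAULT_FREE, FAULTY };

    /* DM[i][j] != 0 means node i is assigned to test node j (set at
     * sysgen with enough connectivity for the prespecified fault count).
     * SM[i][j] records the result node i obtained for node j; its
     * zero-initialized entries correspond to "not received". */
    static int DM[N][N];
    static enum verdict SM[N][N];

    /* Simulated ground truth standing in for the self-test each node
     * runs on behalf of its testers. */
    static int actually_faulty[N] = { 0, 0, 1, 0 };

    static void run_diagnosis(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (DM[i][j])
                    SM[i][j] = actually_faulty[j] ? FAULTY : FAULT_FREE;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)            /* ring connectivity:     */
            DM[i][(i + 1) % N] = 1;            /* enough for one fault   */
        run_diagnosis();
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                if (SM[i][j] == FAULTY)
                    printf("node %d diagnosed faulty by node %d\n", j, i);
        return 0;
    }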
As a result of the diagnostic process, two problems may arise:
- Deadlocks: Since each PU takes on both a "tester" and an "under test"
status, the nodes may find themselves in the same status, or oscillating
between statuses for an indefinite period of time, and thus create a deadlock.
One way to tackle this problem is to employ a unique mutual exclusion
semaphore, as practiced in the design of many operating systems. But the
introduction of such a singular entity contradicts MuTEAM's philosophy.
A better method, which is employed in MuTEAM, is the introduction of a
random delay before a request or a busy message is sent (a sketch of this
scheme follows the list below).
- Status contamination: A major problem would arise if a PU could
invalidate another PU's test results. MuTEAM prevents this from
happening by means of protected communication channels among the PUs.
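A minimal sketch of the random-delay scheme mentioned under "Deadlocks" above, assuming an abstract tick-based delay in place of MuTEAM's actual timer process:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Before sending a test request (or a busy reply), each diagnostic
     * process waits a random delay, so two nodes that keep finding each
     * other in the same tester/under-test status eventually desynchronize
     * without any central semaphore. */
    static void random_backoff(int max_ticks)
    {
        int ticks = rand() % max_ticks + 1;    /* delay in abstract ticks */
        /* in the real system this would be a timer wait; just report it */
        printf("backing off %d ticks before next request\n", ticks);
    }

    int main(void)
    {
        srand((unsigned)time(NULL));           /* per-node seed */
        for (int attempt = 0; attempt < 3; attempt++)
            random_backoff(8);
        return 0;
    }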
Fig. 13 Error Diagnosis in MuTEAM [29] (each DP comprises an Under
Test Manager UTMi, an Auto Test process ATi, a Tester process Ti, and a
Timer process TIMi, exchanging test requests, replies, and test results that
update the syndrome matrix SM)
6.3.3 Reconfiguration
As a result of applying the distributed diagnosis algorithm to detect the
source of incorrect operation, all non-faulty PUs become active participants
in the reconfiguration phase. The first step is the logical disconnection of the
faulty PU(s) in a "non-aggressive" manner: no PU is permitted to disable
another PU. The logical disconnection is realized by the consensus of all
fault-free PUs to revoke all access rights owned by the faulty PU for
memory segments located in the non-faulty PUs.
This is accomplished as follows: The twin processes of those primary
processes running on a faulty node are notified by the diagnostic processes.
Primary processes running on fault-free PU(s) are also notified to modify
the destination field of the corresponding logical channel triplet (source,
destination, message type) in order to establish communication with the
twin processes.
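The rewriting of the destination field may be sketched in C as follows; the process-to-node mapping and the twin assignments are illustrative, not MuTEAM's actual allocation.

    #include <stdio.h>

    /* A logical channel is the triple (source, destination, pattern);
     * during reconfiguration the destination field is rewritten so that
     * traffic addressed to a process on the faulty node reaches its
     * twin instead. */
    struct channel { int source, destination, pattern; };

    #define NPROC 4
    static int node_of[NPROC] = { 0, 0, 1, 1 };  /* node running a process */
    static int twin_of[NPROC] = { 2, 3, 0, 1 };  /* statically set twins   */

    static void redirect(struct channel *ch, int nch, int faulty_node)
    {
        for (int i = 0; i < nch; i++)
            if (node_of[ch[i].destination] == faulty_node)
                ch[i].destination = twin_of[ch[i].destination];
    }

    int main(void)
    {
        struct channel ch[] = { { 0, 2, 7 }, { 1, 3, 9 } };
        redirect(ch, 2, 1);                    /* node 1 diagnosed faulty */
        printf("channel 0 now -> process %d\n", ch[0].destination); /* 0 */
        printf("channel 1 now -> process %d\n", ch[1].destination); /* 1 */
        return 0;
    }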
6.3.4 Recovery
After the system has been reconfigured, all processes are "rolled back"
to the most recent correct past state in order to cancel the effects of errors
and bring the system into an operational mode once again. This is
accomplished through the use of "recovery points" which are established
periodically for each process during the normal operation of the system.
The setting of recovery points in MuTEAM is performed in an "application
transparent" manner such that programmers will not be responsible for the
correct setting of the recovery points (which may be initiated via appropriate
OS calls within the program) [13]. To implement the above, a logical clock
is associated with each running process. Every time the clock reaches zero,
a new "state" is created for that process and the associated clock is reset to a
predetermined value. Once the recovery states are established locally at a
node, the primary process involved sends a copy of its "new" states to the
node where the twin process resides.
The advantages of using a logical clock for setting recovery points are
twofold:
- The recovery point setting is transparent to the programmer.
- The frequency of the update is controlled by the clock's value. For
example, if a highly sensitive process is involved, the clock's period
may be decreased by the system programmer so that recovery points
are established more often.
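The logical-clock mechanism may be sketched in C as follows; the state contents, the clock period, and the direct structure copy standing in for the message to the twin's node are illustrative assumptions.

    #include <stdio.h>

    /* Each time the per-process logical clock counts down to zero, a new
     * state (checkpoint) is captured and a copy is sent to the twin. */
    struct process_state { int pc; int data[4]; };

    struct proc {
        struct process_state current;
        struct process_state recovery_point;  /* kept locally ...           */
        struct process_state twin_copy;       /* ... and at the twin's node */
        int clock, period;                    /* period set per sensitivity */
    };

    static void tick(struct proc *p)
    {
        if (--p->clock > 0) return;
        p->clock = p->period;                 /* reset the logical clock  */
        p->recovery_point = p->current;       /* establish a new state    */
        p->twin_copy = p->recovery_point;     /* "send" it to the twin    */
    }

    int main(void)
    {
        struct proc p = { .clock = 3, .period = 3 };
        for (int step = 0; step < 7; step++) {
            p.current.pc = step;              /* process makes progress   */
            tick(&p);
        }
        printf("last recovery point at pc=%d\n", p.recovery_point.pc); /* 5 */
        return 0;
    }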
The "rolling back" is based on a dynamic coordination among the
recovery points of the processes: Once an error has occurred and processe~
have been reconfigured successfully, a recovery point is chosen by one of
the processes which will present its choice to the other processes involved.
A consistent system recovery line Uoined recovery points of the processes
that constitutes a consistent state in the system) is determined when all
processes accept the given proposal. In case of a rejection, the opposing
processes communicate their own proposals for a recovery line.
It should be noted that during the course of any fault treatment, the
error detection mechanism continues its normal operation. In this way any
additional errors occurring during the fault treatment will be detected and the
whole fault treatment process is invoked recursively.
6.4 MLS Security Issues
6.4.1 General Observations
MuTEAM's overall philosophy regarding fault treatment has a positive
as well as a negative aspect: The detection of a fault in a PU will render the
entire PU faulty, resulting in its logical disconnection from the rest of the
system. This advantage, however, does not come without a security-related
drawback: The faulty unit is not "disconnected" in a physical sense; it is
simply being "isolated". As a result, a PU may continue its operation
despite the fact that it is being "ignored" by the rest of the system. This
passive approach may cause serious problems for security: (1) the PU can
transmit sensitive information without precautions, (2) it can be employed as
a covert channel, and (3) it can cause a denial of use situation by preempting
the available resources (e.g., the LAN).
A related security problem arises in the fault detection process when
two communicating PUs are both faulty, in which case the fault will remain
latent until another fault-free PU attempts to communicate with either one of
the faulty PUs.
A genuine security-related weakness of the MuTEAM lies in the
employment of corrective redundancy techniques to achieve fault tolerance.
As discussed earlier in chapter 4, the corrective redundancy method appears
inappropriate for secure fault tolerance, since security compromises may
occur during the error latency time.
MuTEAM's fault tolerance is based on a significant amount of
internode communication: requesting diagnosis tests, generating recovery
points, transmitting these points to the twin processes, determining a
recovery line, and reconfiguring. In an MLS environment, these
communications must also meet the mandatory security policy
requirements.
Two cases should be considered:
- Trusted operating system in each PU capable of enforcing a
multilevel secure mode of operation.
- Untrusted operating system in PUs, in which case the TIU is
responsible for enforcing the security policy.
6.4.2 Trusted Operating System
Presently, the only security-related feature in MuTEAM is the access
control list (with possible read and write access rights) which may be
categorized as some sort of discretionary access control. The introduction of
a trusted operating system in MuTEAM would imply a considerable amount
of new features which themselves must be subject to fault tolerance
requirements. They include the reference monitor, an audit capability,
labels, and mandatory access control. A storage object reuse feature may then
be applied to purge classified information from an isolated faulty unit which
may still try to broadcast its memory contents.
A trusted operating system, however, does not solve all of
MuTEAM's security deficiencies. As mentioned before, there is a
substantial amount of interprocess communication in order to obtain fault
tolerance. Under an operating system that enforces the DoD security policy,
many of these communications will have to violate the *-property in order to
carry out their fault treatment functions properly. One solution may be the
designation of these modules as "privileged" so that they will be permitted
to violate the security policy; another is to partition the system on the basis
of security levels of PUs, such that all fault tolerance functions are done
within the partition, and then communicated to appropriate other partitions.
In the following discussion, the error detection phase of the system will not
be examined from an MLS viewpoint, because this phase is performed by a
hardware unit which is not likely to be affected by the security mechanism.
Further research is, however, needed on this subject.
Diagnosis
This phase entails a great deal of message and information exchange.
The violation of the *-property is thus an important issue here. As
mentioned earlier in this section, the designation of the diagnosis processes
as "trusted" may solve the problem. This solution however, may introduce
72
covert channels which could be utilized by Trojan horse software that may
have been planted in the diagnosis processes. For example, the syndrome
matrix may be used by a Trojan horse as a covert channel to send sensitive
information to other units.
Reconfiguration
As discussed earlier, the reconfiguration stage starts by the logical
disconnection of the faulty PU. This process, however, may create a major
disaster in the security mechanism of the system, because, even though the
faulty PU is isolated, it may continue to broadcast sensitive information in
violation of the security policy. The faulty PU must therefore be isolated
from the LAN. A TIU may be employed for this purpose (even though the
PUs are running trusted OS). Moreover, additional steps may be required
to purge all the classified information from the faulty PU in order to prevent
the intermingling of higher-classified information with lower-classified
information.
Recovery
Once the faulty PU is isolated and the system successfully
reconfigured, the recovery/restart process begins its activities. Care must
be taken to adequately preserve the recovery point information and to
prevent Trojan horse software from utilizing this information path as a
covert channel. Furthermore, it is essential to establish that the security
policy requirements (both mandatory and discretionary) are observed
throughout the process of determining the recovery path.
Overall, the fault tolerance functions in MuTEAM appear to be security
sensitive. It may therefore be required to include them in the TCB of the
operating system, which, however, necessitates a revalidation of the TCB's
correctness.
From the above the following may be stated:
Assertion 13:
The mechanisms supporting each phase of the fault
treatment in MuTEAM are based on a substantial
amount of interprocess communication which makes it
difficult to obtain multilevel security.
6.4.3 Untrusted Operating System
Incorporating into an untrusted operating system, like MuTEAM's,
features such as software-controlled mandatory access control, labels, and
full discretionary access control is not an easy task. A different approach
involves the employment of trusted interface units (TIUs) in order to achieve
a multilevel secure environment (see chapter 3). Using this concept, the
network will be divided into a set of subnetworks, each with its own
sensitivity level (Figure 9).
Diagnosis
Using the untrusted OS scheme, the violation of the *-property by the
interprocess communications required for fault diagnosis may be avoided by
utilizing extra nodes for each security level, and thus creating an
independent MuTEAM system for each such subnetwork. This approach,
however, requires additional PUs beyond what is required for computation.
Reconfiguration
With an untrusted OS and the use of TIUs, there must be a sufficient
number of PUs for backup at each security level. Besides being expensive
(requiring extra PUs beyond the normal requirement of the reconfiguration
phase), this method takes away the ability to distribute the twin processes
equally among all the PUs. One can foresee a situation where one security
subsystem is saturated, while other levels are all under-utilized.
In the present implementation of MuTEAM, both the primary and twin
processes are allocated statically at the system generation time. A study
[12] by the MuTEAM designers, however, draws attention to the possibility
of dynamic allocation of substitute processes (after the diagnosis phase,
rather than at sysgen time). This dynamic approach would create additional
security problems: The failure of a PU with no available PU (with the same
security level) would require either an upgrade or a downgrade in the
present security level of other PUs which would need to assume the
processes and data of the failed PU.
Downgrading a node implies "sanitizing" the node of any
higher-classified processes or data before it can be used as a backup, clearly
a time-consuming task. Furthermore, those "removed" processes that are
operationally critical may face a similar problem (the unavailability of a
suitable node for their transfer), thus creating a vicious circle.
Upgrading of a node amounts to the elevation of all the labels of the
node's processes and messages when transmitted out through the TIU,
hence lowering the availability of these processes.
Recovery
This phase is based on the periodic setting of "recovery points" of each
primary process. This information is sent to the PU containing the
corresponding twin process. With an untrusted OS, the TIUs will block
any information sent to a lower-classified PU, which interferes with the
normal activity of this phase.
To summarize:
Assertion 14:
The availability of a sufficient number of backup nodes
at each security level is a necessary condition for
obtaining multilevel security and fault tolerance under
an untrusted OS such as MuTEAM.
In summary, obtaining a multilevel secure environment in a corrective
fault tolerant system such as MuTEAM seems difficult to achieve. The
main reason is the great deal of interprocess communication needed for
obtaining fault tolerance.
CHAPTER 7
CONCLUDING REMARKS
From the end spring new beginnings.
Pliny the Elder,
Historia Naturalis.
This thesis has strived to touch upon the problems that arise from the
interaction of security and fault tolerance in a computer system. A
framework was established for further research on the issues and tradeoffs
involved.
This study has demonstrated that the use of protective redundancy
techniques, rather than corrective redundancy methods, is the suitable
approach for obtaining a fault tolerant, secure system. Corrective
redundancy creates a risky and unreliable environment as far as the security
is concerned because an unauthorized disclosure of sensitive information
may not be recoverable.
The granularity of redundancy in design was also discussed. It was
shown that the finer the level of redundancy (e.g., at circuit logic level), the
lesser the chance of security threats, since the amount of information that
may be exposed (due to a failure) is smaller.
Graceful degradation of security was examined from a feasibility
viewpoint. "Security degradation" could be interpreted as the downward
migration of the system in the hierarchy of security modes and divisions. In
this context, degradation is the restriction imposed upon the users due to
newly established security classifications.
The consequences of adding system-wide fault tolerance to an existing
secure system were examined next. It was concluded that this not only
increases the complexity of the system, but also requires a new security
proof of correctness for the entire system design and implementation as
required for an MLS system.
This study is a starting point in the important topic of the interaction of
fault tolerance and security. To obtain more detailed results, further research
is needed. The following is a proposed plan for future study on this subject:
- Refinement of the key issues involved: Secure fault tolerance, fault
tolerant security, and graceful degradation of security.
- Identification of a generalized design procedure for the
implementation of the above.
- Identification of possible tradeoffs when different approaches are
taken.
- Formulation of a computer simulation of a hardware system which
demonstrates the desired interaction.
- Development of a prototype system featuring the interplay of
security and fault tolerance.
Finally, the analyses in this thesis and the MuTEAM case study
demonstrate that security and fault tolerance interactions are significant
indeed. If both are required in a distributed system, the selection, design,
and implementation of appropriate mechanisms must be coordinated from
the beginning, rather than each being developed in isolation. Retrofitting
security or fault tolerance features, and especially both, into a system is not
likely to succeed.
REFERENCES
1.
M. Adham and A.D. Friedman, "Digital System Fault Diagnosis,"
Journal of Design Automation and Fault Tolerant Computing, Vol. I,
pp. 115-132 (February 1977).
2.
3.
S. R. Ames, Jr., "Security Kernel Design and Implementation: An
Introduction," Computer 16, pp.14-22 (July 1983).
4.
T. Anderson and P. A. Lee, Fault Tolerant Principles and Practice,
Prentice-Hall (1981).
5.
T. Anderson and B. Randell (eds.), Computing Systems Reliability,
Cambridge University Press, Cambridge (1979).
6.
J. A. Arulpragasam and R. S. Swarz, "A Design for Process State
Preservation on Storage Unit Failure," Proceedings FTCS-10: Tenth
Annual International Conference on Fault Tolerant Computing,
Kyoto, Japan, pp. 47-5~ (October 1980).
7.
A. Avizienis et al., "The STAR (Self-Testing and Repairing)
Computer: An Investigation of the Theory and Practice of Fault
Tolerance Computer Design," IEEE Transactions on Computers
C-20, pp. 1312-1321 (November 1971).
8.
A. Avizienis, "Fault Tolerant Systems," IEEE Transactions on
Computers C-25, pp. 1304-1312 (December 1976).
9.
A. A vizienis, "Fault Tolerance: The survival Attribute of Digital
Systems, " Proceedings of the IEEE 66, pp. 1109-1125 (October
1978).
10.
F. Bairardi, et al., "Mechanisms for a Robust Multiprocessing
Environment in the MuTEAM Kernel," Proceedings FTCS-11:
Eleventh International Conference on Fault Tolerant Computing,
Portland (OR), pp. 20-24 (June 1981).
11.
78
79
12.
G. Barigazzi, A. Ciuffoletti, and L. Strigini, "Reconfiguration
Procedure in a Distributed Multiprocessor System," Proceedings
FfCS-12: 12th Annual International Conference on Fault Tolerant
Computing, Santa Monica (CA), pp. 73-80 (June 1982).
13.
G. Barigazzi, L. Strigini, "Application Transparent Setting of
Recovery Points," Proceedings FfCS-13: 13th Annual International
Conference on Fault Tolerant Computing, pp. 48-55 (1983).
14.
D. E. Bell and L. J. LaPadula, "Secure Computer Systems," ESDTR - 73 - 278, Vols. I-III, Mitre Corporation, Bedford (MA)
(November 1973- June 1974).
15.
D. E. Bell and L. J. LaPadula, "Secure Computer Systems:
Mathematical Foundations and Model," M 74-244, The Mitre
Corporation, Bedford (MA) (October 1974).
16.
B. R. Bergerson and R. F. Freitas, "A Reliability Model for
Gracefully Degrading and Standby-Sparing Systems," IEEE
Transactions on Computers, pp. 517-525 (May 1975).
17.
D. Briatico, A. Ciuffoletti, and L. Simoncini, "A Domino Effect
Free Recovery Algorithm: Formal Specification," Unpublished.
18.
D. Briatico, et al, "Error Detection/Fault Treatment in the MuTEAM
System," unpublished.
19.
D. Briatico, et al, "The Operatin~ System Kernel of a Message
passing Distributed Multiprocessor,' unpublished.
20.
W. G. Brown, J. Tietney and R. Wasserman, "Improvement of
Electronic Computer Reliability Through the Use ofRedundancy,"
IRE Transactions on Elec. Computers EC-10, pp. 407-416
(September 1961).
21.
D. D. Burchby, L.W. Kern, and W. A. Sturm, "Specification of
the Fault Tolerant Spaceborne Computer," Proceedmgs FTCS-6:
Sixth International Symposium on Fault Tolerant Computing.
Pittsburgh (PA), pp. 129-133 (June 1976).
22.
K. M. Chandy and C. V. Ramamoorthy, "Rollback and Recovery
Strategies for Computer Programs,' IEEE Transactions on
Computers C-21, pp. 546-556 (June 1972).
23.
M. H. Cheheyl, et al., "Verifying Security," ACM Computing
Surverys 13, pp. 279-340 (September 1981).
24.
G. Cioffi, et al., "MuTEAM: Architectural Insights of a Distributed
80
Multimicroprocessor System," Proceedings FTCS-11: Eleventh
International Conference on Fault Tolerant Computing, Portland
(OR), pp. 17-19 (June 1981).
25.
P. Ciampi, F. Grandoni, L. Simoncini, "Distributed Diagnosis in
Multiprocessor Systems: The MuTEAM Approach," Proceedings
FTCS~ 11: Eleventh International Conference on Fault Tolerant
Computing, Portland (OR), pp. 25-29 (June 1981).
26.
D. D. Clark, K. T. Po&ram, and D. R. Reed, "An Introduction to
Local Area Networks,' Proceedin~s IEEE 66, pp. 1497-1517
(November 1978).
27.
G. F. Clement and R. D. Royer, "Recovery from faults in the No.
1A Processor," Proceedings FTCS-4: Fourth International
Conference on Fault Tolerant Computing, Urbana (IL), pp. 5.2-5.7
(January 1974).
28.
A. E. Cooper and W. T. Chow, "Development of On-Board Space
Computers Systems," IBM Journal of Research and Development
20, pp. 5-19 (January 1976).
29.
P. Corsini, L. Simoncini, and L. Strigini, "MuTEAM: A
Multiprocessor Architecture with Decentralized Fault Treatment,"
Submitted for Publication to IEEE Transactions on Computers.
30.
G. I. Davida, R. A. DeMilio, and R. J. Lipton, "Multilevel Secure
Distributed Systems," Proceedings 2nd International Conference on
Distributed Computin.& Systems, Paris, France, pp. 8-10 (April
1981 ).
31.
D. E. Denning, Cryptography and Data Security, Addison-Wesley
(1982).
32.
Department of Defence Trusted Computer Evaluation Criteria, CSC STD- 001 - 83, DoD Computer Security Center, FT. Meade (MD),
(August 1983 ).
33.
L. C. Dion, "A Complete Protection Model," Proceedings IEEE.
Symposium on Secunty and Privacy, pp. 49-55 ( 1981 ).
34.
D. Downs and G. J. Popek, "A Kernel Design for a Secure Data
Base Management System," Proceedings 3rd International
Conference on Vezy Large Data Bases, Tokyo, Japan, pp. 507-514
(October 1977).
35.
P. Ein-Dor, "Grosch's Law Re-Revisited: CPU Power and the Cost
81
of Computation," Communications of the ACM, pp. 142-151
(February 1985).
.
36.
R. J. Feiertag, K. N. Levitt, and L. Robinson, "Proving Multilevel
Security of a System Design," Proceedings 6th ACM Symposium on
Operatm~ System Principles, pp. 57-65 (1977).
37.
E. B. Fernandez, R. C. Summers, and C.
Security and Integrity, Addison-Wesley (1981).
38.
B. J. Flehinger, "Reliability Improvement Through Redundancy at
Various Systems Levels," 1BM Journal of Research and
Development 2, pp. 148-158 (April1958).
39.
A. D. Friedman, L. Simoncini, "System Level Fault Diagnosis,"
Computer 13, pp. 47-53 (March 1980).
40.
M. Gasser and D. P. Sidhu, "Multilevel Secure Local Area
Network," Proceedings IEEE. Symposium on Security and Privacy,
pp. 137-143 (1982).
Wood, Database
41.
42.
E. Gelenbe, "On the Optimum Checkpoint Interval," Journal of the
ACM 26, pp. 259-270 (April 1979).
43.
E. Gelenbe and D. Derochette, "Perfomance of Rollback Recovery
Systems under Intermittent Failures," Communications of the ACM
21, pp. 493-499 (June 1978).
44.
V. D. Gligor, "Review and Revocation of Access Priviledges
Distributed Through Capabilities," IEEE Transactions on Software
Engineering SE-5, pp. 575-586 (November 1979).
45.
J. Goldberg, K. N. Levitt, and R. A. Short, "Techniques for the
Realization of Ultra-Reliable Spaceborn Computers," Menlo Park
(CA): Stanford Research Institute (1966).
46.
F. Grandoni, et al., "The MuTEAM System: General Guidelines,"
Proceedings FTCS-11: Eleventh International Conference on Fault
Tolerant Computing, Portland (OR), pp. 15-16 (June 1981).
47.
H. A. Grosch, "High Speed Arithmetic: The Digital Computer as a
Research Tool," J. Opt. Soc. Am. 43 (April1953).
82
48.
H. A. Grosch, "Grosch's Law Revisited," Computerworld 8 (April
1975).
49.
C. A. R. Hoare, "Communicating Sequential Processes,"
Communications of the ACM, pp. 668-679 (August 1978).
50.
A. L. Hopkins, T. B. Smith, and J. H. Lala, "FTMP- A Highly
Reliable Fault Tolerant Multiprocessor for Aircraft," Proceedings of
the IEEE 66, pp. 1221-1240 (October 1978).
51.
D. Katsuki et al., "Pluribus - An Operational Fault Tolerant
Multiprocessor," Proceedings of the IEEE 66, pp. 1146-1159
(October 1978).
52.
S. T. Kent, "Encryption-Based protection for Interactive
User/Computer Communications," 5th Data Communications
Symposium, Snowbird (UT), pp. 7-13 (September 1977).
53.
K. H. Kim, "An Approach to Programmer Transparent coordination
of Recovering Parallel Processes and its Efficient Implementation
Rules," Proceedings of International Conference on Parallel
Processin~, Detroit (MI), pp. 58-68 (August 1978).
54.
K. H. Kim, "Error Detection, reconfiguration and Testing in
Distributed Processing Systems," Proceedmgs FTCS-1: First Annual
International Conference on Fault Tolerant Computing, pp. 284-295
(October 1979).
55.
K. H. Kim, "An Implementation Model for the Programmer
Transparent Scheme for Coordinating Concurrent Processing
Recovery," Proceedings IEEE 4th International COMPSAC (October
1980).
56.
K. H. Kim, "Software Fault Tolerance," Software Engineering, pp.
437-455, Van Nostrand (1984).
57.
C. Kime, "Fault Diagnosis of Distributed Systems," Proceedings
COMSAC, pp. 355-364 (1980).
58.
J. Kuhl and S. M. Reddy, "Distributed Fault Tolerance for Large
Multiprocessor Systems, Proceedings FTCS-7 Seventh Annual
International Symposium on Computer Architecture, pp. 23-30 (June
1980).
59.
B. W. Lampson, "A Note On the Confinement Problem,"
Communications of the ACM 16, pp. 613-615 (October 1973).
60.
C. E.
Landwehr, "A Survey of Formal Models for Computer
83
Security," ACM Computing Surveys 13, pp. 247-278 (September
1981).
61.
C. E. Landwehr, "The Best Available Technologies for Computer
Security," Computer 16, pp. 86-100 (July 1983).
62.
S. B. Lipner, "Comment on the Confinement Problem," ACM
Operating Systems Review 9, pp. 192-196 (May 1975).
63.
J. Losq, "Effects of Failures on Gracefully Degrading Systems,"
Proceedings FTCS-7: Seventh Annual International Conference on
Fault Tolerant Computing, Los Angeles (CA), pp. 29-34 (June
1977).
64.
R. E. Lyons and W. Vandekull, "The Use of Triple Modular
Redundancy to Improve Computer ReliabilitY.," IBM Journal of
Research and Development 6, pp. 200-209 (Apnl1962).
65.
G. H. MacEwen, B. Burwell, and Z. J. Lu, "Multilevel Security
Based on Physical Distribution," nroceedings IEEE. Symposium on
Security and Privacy, pp. 167-179 (1984).
66.
D. Mackie, "The Tandem 16 NonStop System," State of the Art
Report on System Reliability and Integrity, pp. 145-161, Infotech.
Maidenhead (1978).
67.
J. A. Me Dermid, "Checkpointing and Error Recovery in Distributed
Systems," · Proceedings. 2nd International Conference on Distributed
Systems, pp. 271-282 (1971).
68.
P. M. Merlin and B. Randell, "State Restoration in Distributed
Systems," Proceedin¥ FTCS-8: Eighth Annual International
Conference on Fault olerant Computing, Toulouse, France, pp.
1 29-134 (June 1978).
69.
J.K. Millen, "Security Kernel Validation in Practice,"
Communications of the ACM 19, pp. 243-250 (May 1976).
70.
R. M. Needham and M. Schroeder, "Usin~ Encryption for
Authentication in Large Networks of Computers,' Communications
of the ACM 21, pp. 993-999 (December 1978).
71.
Y. W. Ng and A. Avizienis, "A Reliability Model for Gracefully
Degrading and Repairable Fault Tolerant Systems," Proceedings
FTCS-7: Seventh Annual International Conference on Fault Tolerant
Computing, Los Angeles (CA), pp. 22-28 (June 1977).
72.
F. J. O'Brien, "Rollback Point Insertion Strategies," Proceedings
,, .
84
FfCS-6: Sixth Annual International Symposium on Fault Tolerant
Computing, Pittsburgh (PA), pp. 138-142 (June 1976).
73.
M. Pease, R. Shostak, and L. Lamport, "Reaching Agreement in the
Presence of Faults, " Journal of the Association for Computing
Machinery 27, pp. 228-234 (April1980).
74.
G. J. Popek, "A Principle of Kernel Design," AFIPS Conference
Proceedings 43, Montvale (NJ), pp. 977-978 (1974).
75.
G. J. Popek and C. S. Kline, "Issues in Kernel Design," AFIPS
Conference Proceedings 47, Montvale (NJ), pp. 1079-1086 (1978).
76.
J. G. Posa, "Memory Makers Tum to Redundancy," Electronics 53,
Me Graw-Hill, pp. 108-110 (December 1980).
·
77.
F. P. Preparata, G. Metze, and R. T. Chien, "On the Connection
Assignment Problem of Diagnosible Systems," IEEE Transactions on
Elec. Computers EC-16, pp. 848-854 (December 1967).
78.
B. Randell, P. A. Lee, and P. C. Treleaven, "Reliability Issues in
Computing System Design," Computing Surveys 10, pp. 123-165
(June 1978).
79.
D. K. Rubin, "The Approximate Reliability of Triply Redundant
Majority-Voted Systems," D~est. First Annual IEEE Comp.
Conference, Chicago (IL), pp. 4 -49 (1967).
80.
J. Rushby and B. Randell, "A Distributed Secure System,"
Computer 16, pp. 55-58 (July 1983).
81.
J. Rushby, "The design and Verification of Secure System,"
Proceedings eighth ACM Symf:osium on Operating System
Principles, Asilomar (CA), pp. 12- 1 (December 1981).
82.
D. L. Russell, "State Restoration in Systems of Communicating
Processes," IEEE Transactions on Software Engineering SE-6, pp.
183-194 ( (March 1980).
83.
J. H. Saltzer and M.D. Schroeder, "The Protection of Information
in Computer Systems," Proceedings IEEE 63, pp. 1278-1308
(September 1975).
84.
J. H. Saltzer, "Protection and the Control of Information Sharing in
Multics," Communications of the ACM, pp. 388-402 (July 1974 ).
85.
R. R. Schell, "Security Kernels: A Methodical Design of System
Security," USE Inc., Spring Conference (March 1979).
85
86.
R. A. Short, "The attainement of Reliable Digital Systems Through
the Use of Redundancy: A Survey," IEEE Computer Group News
2, pp. 2-17 (March 1968).
87.
D. P. Siewiorek, "Multiprocessors: Reliability Modeling and
Graceful Degradation," Infotech State of Art Conference on System
Reliability, London, England, pp. 48-73 (1977).
88.
D. P. Siewiorek and R. S. Swarz, The Theory and Practice of
Reliable System Design, Digital Press (1982).
89.
L. Simoncini, F. Saheban, and A. D. Friedman, "Design of
Self-Diagnosable Multiprocessor Systems with Concurrent
ComputatiOn and Diagnosis," IEEE Transactions on Computer C-29,
pp. 540-546 (June 1980).
90.
T. Taylor, "Comparison Between the Bell and LaPadula Model and
the SRI Model," Proceedings IEEE. Symposium on Security and
Privacy, pp. 195-202 (1984).
91.
W. N. Toy, "Fault Tolerant Design of Local ESS Processors,"
Proceedings of the IEEE 66, pp. 1126-1145 (October 1978).
92.
R. Troy, "Dynamic Reconfiguration: An Algorithm and Its
Efficiency Evaluation," Proceedings FTCS-7: Seventh Annual
International Conference on Fault Tolerant Computing, Los Angeles
(CA), pp. 44-49 (June 1977).
93.
R. Troy, "Rollback Model for Interactive Processes," Proceedings
FTCS-8: Eight International Conference on Fault Tolerant
Computing, Toulouse, France (June 1978).
94.
J. F. Wakerly, "Microcomputer Reliability Improvement Using
Triple Modular Redundancy," Proceedings of the IEEE 64, pp.
889-895 (June 1976).
95.
W. H. Ware (Ed.). "Security Controls for Computer Systems,"
Report R-609-1, Rand Corporation, Santa Monica (CA) (October
1979).
96.
J. H. Wensley et al., "SIFT: Design and Analysis of a Fault Toleran
Computer for Aircraft Control," Proceedings of the IEEE 66, pp.
1240-1255 (October 1978).
97.
W. G. Wood, "A Decentralized Recovery Control Protocol,"
Proceedings FTCS-11: Eleventh Annual International Conference on
Fault Tolerant Computing, Portland (OR), pp. 159-164 (June 1981).
86
98.
J. P. L. Woodward, "Applications for Multilevel Secure Operating
Systems," AFIPS Conference Proceedings 48, Montvale (NJ), pp.
319-3288 (1979).
99.
J. W. Young, "A First Order Approximation to the Optimum
Checkpoint Interval," Communications of the ACM 17, pp. 530-531
(September 1974).
© Copyright 2026 Paperzz