intro-eng - the Distributed Systems Laboratory (LSD)

Introduction
Ricardo Jiménez-Peris, Marta Patiño-Martínez
Distributed Systems Laboratory (LSD)
Universidad Politécnica de Madrid (UPM)
http://lsd.ls.fi.upm.es/lsd/lsd.htm
Contents
• Introduction to dependable distributed systems.
• Coordination and agreement.
• Transactions.
• Replication.
• Security.
Laboratorio de Sistemas Distribuidos, Universidad Politécnica de Madrid
Bibliography
• Distributed Systems Books:
– Distributed Systems: Concepts and Design. G. Coulouris, J. Dollimore, T.
Kindberg. 3rd edition, Addison-Wesley, 2000.
– Distributed Systems: Principles and Paradigms. A. Tanenbaum & M.
van Steen. Prentice Hall, 2002.
– Distributed Systems. S. Mullender, ed. ACM Press. 2nd edition, Addison-Wesley, 1993.
– Building Secure and Reliable Network Applications. K. Birman.
Manning, 1996.
– Distributed Systems for System Architects. P. Veríssimo, L.
Rodrigues. Kluwer, 2001.
– Distributed Algorithms. N. Lynch. Morgan Kaufmann, 1996.
– Distributed Computing. H. Attiya and J. Welch. McGraw Hill, 1998.
– Fault Tolerance in Distributed Systems. P. Jalote. Prentice Hall, 1996.
– Transaction Processing: Concepts and Techniques. J. Gray and A. Reuter.
Morgan Kaufmann, 1993.
Bibliography
• T. D. Chandra and S. Toueg, Unreliable Failure Detectors for
Reliable Distributed Systems, Journal of the ACM, pp. 225-267,
v. 43, n. 2, mar., 1996.
• J. C. Laprie, Dependable Computing and Fault Tolerance:
Concepts and Terminology, Proc. of 15th Int. Symp. on Fault
Tolerant Computing Systems, jun. 1985.
• J. Laprie, J. Arlat, C. Béounes and K. Kanoun, Definition and
Analysis of Hardware- and Software-Fault-Tolerant Architectures,
IEEE Computer, v. 23, n. 7, pp. 39-51, 1990.
• Herlihy, M.P. and J. M. Wing, Linearizability: A Correctness
Condition for Concurrent Objects, ACM Trans. on
Programming Languages and Systems. v. 12, n. 3, pp. 463-492,
jul. 1990.
Definition
• A distributed system consists of a set of independent
hosts interconnected by means of a network.
• Inherent features of a distributed system:
– Its components compute concurrently.
– There is no global clock.
– Components do not share memory.
– Components fail independently (ideally).
Features
• Desirable features of distributed systems:
– Transparency.
– Scalability.
– Concurrency control.
– Fault tolerance.
Features: Transparency
• Transparency can be applied to different aspects:
– Heterogeneity: The resources are accessed in the same way
independently of their architecture, operating system,
programming language and software vendors.
– Access: Local and remote resources are accessed in the same
way.
– Location: A resource can be accessed without knowing its
physical location (e.g. through naming and directory services
or discovery services).
Features: Transparency
– Replication: A replicated resource can be accessed in the
same way as a non-replicated resource.
– Failures: Resources can be accessed in the same way despite
failures.
– Mobility: Resources can be moved without affecting their
operation.
– Performance: The system balances the load without affecting
the resource access.
Features: Scalability
• Metrics:
– Throughput: Number of operations per unit of time.
– Response time: Elapsed time between the client request and
the reception of the response.
– Reliability: Global system reliability with respect to the
reliability of its components.
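The first two metrics can be computed from a request trace. The following is a minimal sketch, assuming a hypothetical trace of (request time, response time) pairs in seconds; the function names and the trace values are illustrative only.

```python
from statistics import mean

def throughput(requests, window_seconds):
    """Completed operations per second over an observation window."""
    return len(requests) / window_seconds

def mean_response_time(requests):
    """Average elapsed time between each request and its response."""
    return mean(end - start for start, end in requests)

# Hypothetical trace: (request sent, response received), in seconds.
trace = [(0.0, 0.2), (0.1, 0.4), (0.5, 0.6), (0.8, 1.0)]

print(throughput(trace, window_seconds=1.0))  # operations per second
print(mean_response_time(trace))              # seconds
```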
Features: Scalability
• A distributed system is said to be scalable if one or
more of its performance/dependability metrics improve
when sites are added:
– The throughput increases with a growing number of sites
(ideally linearly).
– The response time decreases (or at least stays constant or
grows very slowly) with a growing number of sites.
– The system reliability increases logarithmically with the
number of sites.
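How reliability changes with the number of sites depends on the failure model. As a minimal sketch, assume (this is an assumption, not part of the slides) that sites fail independently, that each site has reliability r, and that the system works as long as at least one site works. The system reliability is then 1 - (1 - r)^n.

```python
def replicated_reliability(r, n):
    """Reliability of n independent replicas, each with reliability r,
    assuming the system works while at least one replica works."""
    return 1.0 - (1.0 - r) ** n

# Hypothetical per-site reliability of 0.9.
for n in range(1, 6):
    print(n, replicated_reliability(0.9, n))
```

Under these assumptions each additional site improves reliability by less than the previous one.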
Features: Concurrency Control
• The resources can be accessed concurrently by
different users without losing their coherence.
• Two important coherence definitions:
– Linearizability: The result of concurrent invocations on a
resource should be equivalent to some sequential execution of
them that respects their real-time order.
– Serializability: The result of a set of transactions (each a
sequence of operations) executed concurrently should be
equivalent to some sequential execution of those transactions.
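Linearizability can be illustrated with a brute-force check on a small history of a read/write register. This is a sketch under stated assumptions: the `Op` record, the function names, and the initial value 0 are hypothetical, and trying every permutation is only practical for tiny histories.

```python
from itertools import permutations
from typing import NamedTuple

class Op(NamedTuple):
    kind: str     # "write" or "read"
    value: int    # value written, or value returned by the read
    start: float  # invocation time
    end: float    # response time

def respects_real_time(order):
    """An operation that finished before another started must precede it."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.end < earlier.start:
                return False
    return True

def legal_register(order, initial=0):
    """Every read must return the most recently written value."""
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True

def is_linearizable(history, initial=0):
    """True if some sequential order is legal and respects real time."""
    return any(respects_real_time(p) and legal_register(p, initial)
               for p in permutations(history))

# The read overlaps the write, so it may return the new value.
overlapping = [Op("write", 1, 0, 3), Op("read", 1, 1, 2)]
# The read starts after the write finished but returns the old value.
stale = [Op("write", 1, 0, 1), Op("read", 0, 2, 3)]

print(is_linearizable(overlapping))
print(is_linearizable(stale))
```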
Features: Fault Tolerance
• Failures in a distributed system are inherently partial.
• Partial failures should not affect the rest of the system, which
should continue offering service, possibly in a degraded mode
(graceful degradation).
• Two important properties:
– Availability: A resource remains available despite failures.
– Atomicity: A resource remains consistent despite failures.
Features: Fault Tolerance
• Levels of Fault-Tolerance:
– Detection (e.g. checksum).
– Recovery.
• Backward.
• Forward.
– Masking.
• Fault treatment:
– Redundancy:
• temporal (e.g. message retransmission),
• spatial (replication),
• design/value (diversity).
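Detection by checksum and temporal redundancy by retransmission can be combined as follows. This is a minimal sketch: the frame layout (payload followed by a 4-byte CRC-32) and the simulated channel that flips a single bit are assumptions made for illustration.

```python
import random
import zlib

def make_frame(payload: bytes) -> bytes:
    """Append a CRC-32 checksum (4 bytes, big-endian) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_frame(frame: bytes):
    """Return the payload if the checksum matches, else None (error detected)."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received else None

def lossy_channel(frame: bytes, rng, corrupt_prob=0.5) -> bytes:
    """Hypothetical channel that flips one bit with some probability."""
    if rng.random() < corrupt_prob:
        data = bytearray(frame)
        data[rng.randrange(len(data))] ^= 0x01
        return bytes(data)
    return frame

def send_with_retransmission(payload: bytes, rng, max_attempts=20):
    """Temporal redundancy: resend until a frame passes the checksum."""
    for attempt in range(1, max_attempts + 1):
        received = check_frame(lossy_channel(make_frame(payload), rng))
        if received is not None:
            return received, attempt
    raise TimeoutError("no intact frame received")

payload, attempts = send_with_retransmission(b"hello", random.Random(0))
print(payload, attempts)
```

The checksum only detects the error; it is the retransmission that recovers from it.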
Architectural Models
• Remote access: A central system is accessed from a
terminal through the network.
• Client-server: The application executes at the clients
whilst the resources are stored at the servers.
Architectural Models
• Multi-tier architectures:
– Clients only contain software to interact with the system
(thin clients).
– Client requests are handled by a front end (e.g. a web server)
that forwards them to the applications that will process them.
– Applications reside on application servers (stateless or
stateful) and are invoked by clients.
– Resources are kept on data servers that are accessed by
application servers.
Architectural Models
• Mobile systems:
– Mobile code: Components are fixed, but application code
moves (e.g. mobile agents).
– Mobile sites: Systems whose sites can move (wireless
networks, ad-hoc networks, etc.), disconnecting and connecting
again at a different point of the network.
Architectural Models
• Event-based architectures: The interaction is asynchronous.
Some components produce messages and other components
consume them, but they are not necessarily connected at the
same time (e.g. asynchronous messaging services).
• Peer-to-peer systems: Totally decentralized massive systems
(e.g. eDonkey, eMule, Gnutella, Freenet).
• Service oriented architectures, SOA (e.g. web services).
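The event-based style in the list above can be sketched with a tiny in-memory broker. The `MessageBroker` class and the topic name are hypothetical; a real messaging service would also persist messages and run on separate hosts. The point shown is only that producer and consumer need not be connected at the same time.

```python
from collections import defaultdict, deque

class MessageBroker:
    """Hypothetical in-memory broker that stores messages per topic."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def publish(self, topic, message):
        """Store the message; the producer does not wait for a consumer."""
        self._queues[topic].append(message)

    def consume(self, topic):
        """Return and remove all messages stored for the topic, in order."""
        pending = self._queues[topic]
        messages = list(pending)
        pending.clear()
        return messages

broker = MessageBroker()

# The producer publishes and then goes away.
broker.publish("orders", "order-1")
broker.publish("orders", "order-2")

# The consumer connects later and still receives both messages.
print(broker.consume("orders"))
```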
System Models
• Synchronous systems:
– The execution time of a process step has a known bound.
– The time for transmitting a message through a
communication channel is bounded and known.
– The local clock drift has a known bound.
– Another typical definition is that sites execute computation
steps in lock-step.
System Models
• Asynchronous Systems:
– Computational steps are eventually executed.
– Messages sent through a channel are eventually received.
– The drift of local clocks is unbounded.
System Models
• Partially synchronous systems.
– Def. 1: The message transmission time and the time taken by a
computational step are bounded, but the bounds are unknown.
– Def. 2: The system behavior is divided into two periods. In
the first period the system is totally asynchronous. In the
second period, the time taken by a message transmission
and by a computational step has a known bound. The instant at
which the system changes from asynchronous behavior to
synchronous behavior is bounded, but unknown.
System Models
• Failure Detectors:
– The system is extended with an unreliable failure detector
(that can make wrong failure detections).
– The degree of synchrony of the system is modeled by the
guarantees provided by the failure detector.
– Failure detector properties:
• Accuracy: Correct processes are not suspected.
– The accuracy can be permanent or eventual.
– Strong: no correct process is suspected.
– Weak: at least one correct process is never suspected.
• Completeness: Faulty (crashed) processes are eventually suspected.
– Strong: By all correct processes.
– Weak: By at least one correct process.
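An unreliable failure detector is often built from heartbeats and timeouts. The following is a minimal sketch, not a definitive implementation: the class name, the simulated clock passed in as `now`, and the policy of doubling the timeout after a wrong suspicion are assumptions. Doubling the timeout is one common way to aim for eventual accuracy, since a slow but correct process is suspected less and less often.

```python
class HeartbeatFailureDetector:
    """Suspects a process when no heartbeat arrives within its timeout."""

    def __init__(self, processes, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in processes}
        self.last_heartbeat = {p: 0.0 for p in processes}
        self.suspected = set()

    def on_heartbeat(self, process, now):
        """Record a heartbeat; undo a wrong suspicion and relax the timeout."""
        self.last_heartbeat[process] = now
        if process in self.suspected:
            self.suspected.discard(process)
            self.timeout[process] *= 2

    def check(self, now):
        """Suspect every process whose last heartbeat is older than its timeout."""
        for p, last in self.last_heartbeat.items():
            if now - last > self.timeout[p]:
                self.suspected.add(p)
        return set(self.suspected)

fd = HeartbeatFailureDetector(["p", "q"])
fd.on_heartbeat("p", now=0.5)
fd.on_heartbeat("q", now=0.5)
print(fd.check(now=2.0))       # both are late, so both are suspected
fd.on_heartbeat("p", now=2.1)  # p was only slow: the suspicion was wrong
print(fd.check(now=3.0))       # q sent nothing more and stays suspected
```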
System Models
Failure detector classes, by completeness (rows) and accuracy (columns):

Completeness   Strong        Weak         Eventually Strong         Eventually Weak
Strong         Perfect (P)   Strong (S)   Eventually Perfect (◊P)   Eventually Strong (◊S)
Weak           Q             Weak (W)     ◊Q                        Eventually Weak (◊W)
• The weakest failure detector that can solve consensus is ◊W (in an
asynchronous system with a majority of correct processes).
• ◊W can be extended with a small distributed algorithm to obtain ◊S.
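The extension from weak to strong completeness can be sketched as a gossip of suspicions, in the spirit of the reduction described by Chandra and Toueg: every process broadcasts the set it suspects, and a receiver merges that set into its own output and then removes the sender, since a message from the sender shows it is alive. The simulation below is a simplified, single-threaded illustration; the function name, the dictionary layout, and the fixed delivery order are assumptions.

```python
def strengthen_completeness(local_suspects, crashed, rounds=1):
    """Simulate suspicion gossip among the correct processes.

    local_suspects: process -> set suspected by its weakly complete detector
    crashed: processes that have crashed and therefore send nothing
    Returns: correct process -> output set after the given number of rounds
    """
    correct = [p for p in local_suspects if p not in crashed]
    output = {p: set() for p in correct}
    for _ in range(rounds):
        for sender in correct:        # each correct process broadcasts
            for receiver in correct:  # every correct process receives
                merged = output[receiver] | local_suspects[sender]
                output[receiver] = merged - {sender}
    return output

# Process d has crashed, but only a suspects it (weak completeness).
local = {"a": {"d"}, "b": set(), "c": set(), "d": set()}
print(strengthen_completeness(local, crashed={"d"}))
```

After one round every correct process suspects d, which is what strong completeness requires.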
Fault Models
• Terminology:
– Failure: Deviation of the system behavior from its
specification.
– Error: Part of the state that provokes the failure.
– Fault: The cause of the error.
• Faults can be:
– Transitory: Limited duration (e.g. a network failure).
– Intermittent: A transitory fault that recurs over time.
– Permanent: Once the fault happens, it remains until the
system is repaired.
Fault Models
• Faults can be classified according to their nature:
– Design: They originate during the conception of the
system or during its upgrade (e.g. the design fault in the
Pentium mathematical co-processor).
– Operational: They originate from physical causes (e.g. a fault
in the CPU fan that causes the CPU to overheat, which in
turn makes the CPU deviate from its specification).
• A system tolerates failures if, despite their occurrence in one or
more of its components, it keeps fulfilling its specification
with respect to its users. That is, if it is able to mask the
failures of its components.
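One common way to mask component failures is to replicate a component and vote on the replies. The following is a minimal sketch, assuming replies are plain comparable values and that a strict majority of replicas is correct; the function name is hypothetical.

```python
from collections import Counter

def majority_vote(replies):
    """Return the value reported by a strict majority of replicas, or None."""
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count > len(replies) // 2 else None

# Two correct replicas outvote one faulty replica: the failure is masked.
print(majority_vote([42, 42, 7]))
# No value reaches a majority, so the failure cannot be masked.
print(majority_vote([1, 2, 3]))
```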
Fault Models
• Fault Hierarchy:
– Crash (fail-silent sites). The system crashes and stops
working. Before crashing it fulfills its specification.
– Omission. It might omit some action(s).
– Timing. A component might execute an action before or
after the interval in which it was specified.
– Byzantine. Any kind of faults (e.g. arbitrary behavior
resulting from memory corruption or CPU overheating)
including malicious faults (e.g. those provoked by a hacker).