Shared-Memory Simulations on a Faulty-Memory DMM

Bogdan S. Chlebus (1), Anna Gambin (1,2), and Piotr Indyk (3)

(1) Instytut Informatyki, Uniwersytet Warszawski, Banacha 2, Warszawa 02-097, Poland.
(2) Fachbereich Informatik LS-2, Universität Dortmund, D-44221 Dortmund 50, Germany.
(3) Department of Computer Science, Stanford University, Stanford, CA 94305, USA.
1 Introduction
There is a variety of existing parallel architectures, which is in contrast with sequential computing, where the RAM model is dominant. The RAM can be generalized directly to a parallel computer by combining a number of processors with a global memory, each processor having random access to any memory word. The resulting parallel model, the PRAM, is theoretical but powerful and convenient to work with. The Distributed Memory Machine (DMM) is a more realistic model. It is weaker than the PRAM; the difference is that the memory words are organized in memory units (MUs), and only one processor may access a MU at a time. Any processor can contact an arbitrary MU in a single step. This could be realized by optical technology, and the DMM is essentially equivalent to the Optical Communication Parallel Computer (OCPC).
Both the PRAM and the DMM considered in this paper are synchronous, and time performance is our major efficiency criterion. We consider a DMM with faulty memory words; everything else is assumed to be operational. In particular, the communication between the processors and the MUs is reliable, and a processor may always attempt to obtain access to any MU and, having been granted it, may access any memory word in it, even if all of them are faulty. The only restriction on the distribution of faults among memory words is that their total number is bounded from above by a fraction of the total number of memory words in all the MUs. In particular, some MUs may contain only operational cells, some only faulty cells, and some a mixture of both. This report presents fast simulations of the PRAM on a DMM with faulty memory.
Models of computation. A Parallel Random Access Machine (PRAM) consists of a number of synchronized processors and a global shared memory. Each processor has random access to any memory word for both reads and writes. We work with the EREW PRAM, which does not allow concurrent access to a memory word by more than one processor.
A Distributed Memory Machine consists of a set of $n$ synchronized processors, denoted $P_i$, each equipped with its local memory of $O(1)$ words of size $O(\log n)$, and $n$ Memory Units (MUs), denoted $M_i$, for $1 \le i \le n$. The size $m$ of a MU, that is, the number of memory words it stores, is a parameter of the model. Throughout the paper we assume that $m = n^\varepsilon$, for some constant $\varepsilon > 0$. A memory unit can be accessed by any processor, but only by one at a time. On the low level, the communication of a processor with a MU is in two stages. The first is reserving access: there is a special connection register of the MU which shows the result of an attempt to be connected. If only one processor makes such an attempt, then the connection register shows a confirmation of access. Once a processor has been granted access, it uses the MU as a RAM, for one memory access. If more than one processor attempts to access a MU, then a special collision symbol appears in the connection register and is read by all the involved processors; in such a case no processor is granted access. This variant of the DMM is usually referred to as the 1-Collision DMM (see [14]); we write simply Collision DMM.
We consider a DMM with possibly faulty memory words. This model has an additional feature, not present in the ideal fully operational model: it is able to recognize memory errors. Namely, each processor of the faulty DMM is equipped with a special fault-detection register, and after each attempt to access a memory word, the register stores a bit whose value depends on whether the accessed memory word was operational or not (this is analogous to the faulty-memory PRAM of [2, 3, 6]). On the high level, the processors access memory as follows.
follows. If the operation is a read, then the processor species the MU M and
an address of a word x in M; it receives back one of the three possible kinds of
messages:
1) conict for access to M;
2) access to M granted, word x faulty;
3) access to M granted, x operational, x stores value v.
If the operation is a write, then the processor species additionally a value v0 to
store in x.
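To make the access protocol concrete, the following minimal Python sketch (ours, not from the paper; all names and the shape of the mu object are illustrative assumptions) models the three possible outcomes of a read:

    from dataclasses import dataclass
    from enum import Enum, auto

    class Outcome(Enum):
        CONFLICT = auto()        # more than one processor contacted the MU
        GRANTED_FAULTY = auto()  # access granted, but word x is faulty
        GRANTED_OK = auto()      # access granted, word x operational

    @dataclass
    class ReadResult:
        outcome: Outcome
        value: object = None     # the stored value v, only for GRANTED_OK

    def read(mu, x, contenders):
        """Model one read of word x in memory unit mu (a hypothetical object
        with a boolean list mu.operational and a value list mu.words)."""
        if contenders > 1:                 # collision symbol in the register
            return ReadResult(Outcome.CONFLICT)
        if not mu.operational[x]:          # fault-detection register: faulty
            return ReadResult(Outcome.GRANTED_FAULTY)
        return ReadResult(Outcome.GRANTED_OK, mu.words[x])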
The errors are static, which means that the status of a memory word (operational or faulty) does not change in the course of a computation. It is assumed that there is a constant $0 < q < 1$ such that the total number of faulty memory words does not exceed $q \cdot nm$. The distribution of errors across MUs can be arbitrary; it is possible that a particular MU does not contain any operational memory words at all. The distribution of errors is not known in advance, hence the processors begin a computation by preprocessing the MUs, to create a mechanism of access to the (possibly) faulty memory. We refer to such preprocessing as memory formatting. Each processor has a private local memory consisting of $O(1)$ registers, which is assumed to be fully operational. Since the memory is faulty, the input data cannot be stored there in advance; they are either in the processors' local memories or are provided to the processors after formatting.
An Optical Communication Parallel Computer consists of a completely connected set of processors. A processor may send a message to any other processor in one step. The message is received successfully provided it was the only one sent during the step to the receiver. The operational Collision DMM and OCPC are equivalent. If these machines may have memory faults, then the DMM is weaker, because its processors cannot exchange messages directly; the only way to communicate is via the memory.
Related research. Simulations of the PRAM on a DMM were given by Czumaj, Meyer auf der Heide and Stemann [4], Dietzfelbinger and Meyer auf der Heide [5], Karp, Luby and Meyer auf der Heide [8], Mehlhorn and Vishkin [13], and Meyer auf der Heide, Scheideler and Stemann [15]; see also the survey article [14] and the references therein. All these simulations assumed a fully operational DMM.
There has been a lot of research done recently with the aim of designing either specific fault-resilient PRAM algorithms or PRAM simulations on machines prone to errors. The approaches applied differed in the nature of faults (static versus dynamic, deterministic versus random, fail-stop versus restartable), the properties of the underlying model (synchronous versus asynchronous), and the efficiency criteria (time versus work). Kanellakis and Shvartsman [7] introduced the fail-stop PRAM and developed many deterministic and robust (that is, work-efficient) algorithms. Kedem, Palem and Spirakis [10] and Shvartsman [17] designed robust general PRAM simulations for dynamic fail-stop errors. Kedem et al. [9] developed robust randomized tentative simulations with a constant expected slowdown.
Randomized fast simulations of the operational PRAM on a synchronous PRAM with memory faults were designed by Chlebus, Gambin and Indyk [2]; the models of faults considered were both static and dynamic, and for each case two simulations were developed: one operating with a constant expected slowdown and the other with a logarithmic expected slowdown, depending on the number of processors available. Deterministic logarithmic-time-slowdown simulations of a PRAM on a synchronous PRAM with both processor and memory failures were developed by Chlebus, Gasieniec and Pelc [3]. Indyk [6] studied computations exploiting bit operations and resilient to memory faults.
Overview of the simulations. The simulations are randomized and Monte Carlo; this means that they always operate within the stated time bounds, and are correct with a large probability. A simulation consists of two phases: preprocessing and the proper part. The preprocessing is like disk formatting, so we refer to it as "memory formatting"; its goal is to provide an access mechanism to the operational memory cells. The simulation proper is done in a step-by-step way; the time of simulating one step is referred to as the slowdown of the simulation. The DMM has $n$ processors and $n$ MUs, and each MU has a capacity of $m$ memory words, where it is assumed that $m = n^\varepsilon$, for a constant $\varepsilon > 0$. Two simulations are presented. One is of an $n \log n$-processor PRAM; it has the optimal (expected) slowdown $O(\log n)$. The other simulation is of an $n/\sqrt{\log n}$-processor PRAM, and the slowdown is $O(\log\log n)$. There is one MU of capacity $m$ words per processor, and preprocessing all the words would require time $O(m)$. We develop formatting algorithms operating in time sublinear in $m$, more precisely in time $O(\sqrt{m}\log n)$. All the above resource bounds hold with high probability.
Probability. A property depending on a natural number $k$ is said to hold with high probability (abbreviated w.h.p.) if, for some constant $\alpha > 0$, it holds with probability at least $1 - k^{-\alpha}$, as long as $k$ is sufficiently large. Whenever this phrase is used, it is understood that $\alpha$ can be made arbitrarily large by manipulating other constants, often assumed only tacitly.
Let $H$ be a family of hash functions $h : U \to [1..n]$, where $|U| \le n^c$, for some constant $c$. $H$ is said to be $k$-universal if, for any fixed sequence of $k$ distinct keys $\langle x_1, \ldots, x_k \rangle$ and a function $h$ selected from $H$ at random, $\Pr[h(x_1) = y_1, \ldots, h(x_k) = y_k] = n^{-k}$, for any $y_1, \ldots, y_k$. We will have the memory of a DMM hashed by random $O(\log n)$-universal hash functions $h : [1..nm] \to [1..n]$ selected from the family of functions $H_S$ defined by Siegel [16]. Each such function needs space $O(n^\varepsilon)$ to be stored, for $\varepsilon < 1$, and can be evaluated in time $O(1)$. If the memory is operational, then a representation of such a function is stored in an array; otherwise the function is distributed throughout the memory.
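As a toy illustration of hashing addresses to MUs (a stand-in only: Siegel's construction of $H_S$ is not reproduced here; a full random table plays its role, which is $k$-universal for every $k$ but uses linear space):

    import random

    def make_hash(universe_size, n):
        """A full random table h : [0..universe_size-1] -> [0..n-1]."""
        table = [random.randrange(n) for _ in range(universe_size)]
        return lambda x: table[x]

    n, m = 8, 16
    h = make_hash(n * m, n)   # hash virtual addresses to the n MUs
    print(h(42))              # MU responsible for virtual address 42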
Paper organization. Section 2 includes some low-level techniques useful in a faulty-memory environment: testing blocks of memory, broadcasting, and graph processes. The high-level techniques, which are the major building blocks of the simulations, are presented in the next two sections. In Section 3 we discuss communication issues between the processors and MUs, and present an adaptation and extension of the algorithm originally developed by Anderson and Miller [1] and Valiant [18]. Section 4 includes the description of an implementation of a dictionary in a faulty-memory environment. Having developed the tools, we present the simulations in Section 5. The proofs will be described in the final version.
2 Basic algorithms
It is convenient to have uniform addressing across all the MUs of the DMM. We simply refer to the $j$th word of the $i$th MU as the (global) $x$th word, where $x = i \cdot m + j$, and denote it by $M[x]$. Let $[a..b]$ be the set of integers $i$ such that $a \le i \le b$. Suppose that we are given a sequence $\langle d_i \rangle$, for $1 \le i \le k$, of numbers in $[1..mn]$. We associate with it the sequence $\langle D_i \rangle$ of address functions, for $1 \le i \le k$, where $D_i(y) = (my + d_i) \bmod (mn)$, for $1 \le y \le n$. Throughout the paper, the number $k$ is equal to $d \log n$, where $d$ is a constant (usually) depending on $q$.
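A minimal sketch of this addressing scheme in Python, using 0-based indices for convenience (our illustration, directly mirroring the definitions above):

    def global_address(i, j, m):
        """The j-th word of the i-th MU as the global address x = i*m + j."""
        return i * m + j

    def make_address_function(d, m, n):
        """The address function D(y) = (m*y + d) mod (m*n)."""
        return lambda y: (m * y + d) % (m * n)

    m, n = 16, 8
    D = make_address_function(d=21, m=m, n=n)
    x = D(3)                  # global word probed for virtual index 3
    mu, word = divmod(x, m)   # which MU, and which word within it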
The algorithms are presented in framed boxes. Whenever they consist of a
list of steps, this always means that there is a global synchronization between
the steps.
Finding useful memory. The total number of operational memory words is at least $(1-q) \cdot mn$, where the constant $q$ is known by the algorithm. Let us define a MU to be useful if at least $\frac{1-q}{2} m$ of its $m$ memory words are operational. There are at least $\frac{1-q}{1+q} n$ such MUs. Each processor $P_i$ can determine in time $O(\log n)$ w.h.p. whether the MU $M_i$ is useful by checking whether the cells in a random sample of size $O(\log n)$ are operational.
algorithm: sample module
Each processor $P_i$ selects at random, and attempts to read, $a \log n$ memory words in the MU $M_i$. A MU is accepted (as useful) if there are at least $\frac{5}{8}(1-q) a \log n$ operational words in the sample.
Lemma 1. For any constant $q > 0$ there is a suitable parameter $a > 0$ in algorithm sample module such that w.h.p. $\Omega(n)$ MUs are accepted and all the accepted MUs are useful.
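A sketch of sample module under the stated assumptions (the predicate is_operational stands for reading the fault-detection register; a and q are as in the text):

    import math
    import random

    def sample_module(is_operational, m, n, q, a):
        """Accept the MU as useful based on a random sample of a*log n words."""
        size = max(1, int(a * math.log2(n)))
        sample = [random.randrange(m) for _ in range(size)]
        operational = sum(1 for j in sample if is_operational(j))
        return operational >= (5 / 8) * (1 - q) * size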
To make all the processors participate in computations, a useful MU is assigned to each processor. This is done in such a way that each MU is assigned to $O(1)$ processors, hence there is an $O(1)$-time delay for a processor to access its MU. A description of the algorithm assign modules is given below.
algorithm: assign modules
S-1. Each processor runs sample module. If it accepts the MU (as useful) then the processor is referred to as good.
S-2. Each good processor $P_i$ writes the address $b_i$ of one of the operational cells of $M_i$ to $c\sqrt{m}\log n$ different operational memory cells of $M_i$.
S-3. The processors which are not good try to find a useful MU. They are partitioned into $1/\delta$ groups, each of size at most $\delta n$, where $\delta$ is chosen such that $\delta n$ is smaller than half of the number of useful MUs. Each group performs the computation in a different phase. During the $k$th phase, each processor $P_i$ from the $k$th group repeats the following steps, until it finds some value $b_j$:
1. $P_i$ attempts to connect to a random MU until granted access to some $M_j$.
2. $P_i$ reads $\sqrt{m}$ random memory cells from $M_j$, in order to find $b_j$. If it is successful, $P_i$ keeps attempting to connect to this MU till the end of the phase, to block other processors.
Lemma 2. The algorithm assign modules assigns useful modules to all the processors in time $O(\sqrt{m}\log n)$ w.h.p.
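The core primitive behind Step S-2 of assign modules (and of broadcast below) is replicate-then-probe: a value written to about $c\sqrt{m}\log n$ operational cells is then found w.h.p. by reading $\sqrt{m}$ random cells, by a birthday-paradox argument. A sketch, assuming a useful MU and hypothetical helper names:

    import math
    import random

    def replicate(mu, value, c, n):
        """Write value into c*sqrt(m)*log n randomly located operational cells."""
        m = len(mu.words)
        copies = int(c * math.sqrt(m) * math.log2(n))
        written = 0
        while written < copies:
            j = random.randrange(m)
            if mu.operational[j] and mu.words[j] is None:
                mu.words[j] = value
                written += 1

    def probe(mu):
        """Read sqrt(m) random cells; return the replicated value on a hit."""
        m = len(mu.words)
        for _ in range(int(math.sqrt(m))):
            j = random.randrange(m)
            if mu.operational[j] and mu.words[j] is not None:
                return mu.words[j]
        return None            # missed; the caller retries in the next round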
Some $\ell$ items may be stored in a useful module in a list in time $O(\ell + \log n)$ w.h.p.: each processor $P_i$ keeps reading randomly chosen memory words from $M_i$ until operational ones are found, adding every new word to the list.
We assume from now on that each MU is useful (in the applications, the preprocessing will guarantee w.h.p. that each processor is assigned to a useful MU, with $O(1)$ processors per MU).
Broadcasting. The operation of broadcasting propagates messages initially known by a few processors to all the other processors. A related broadcasting algorithm for one message was presented in [2]. Here we develop an algorithm to propagate $O(\log n)$ messages. In the beginning, each of the processors $P_i$, for $1 \le i \le d \log n$ and some $d > 0$, knows one such message.
Lemma 3. The algorithm broadcast propagates successfully $d \log n$ messages among all the processors in time $O(\sqrt{m}\log n)$ w.h.p.
algorithm: broadcast
S-1. Each processor $P_i$, for $i = 1, \ldots, d\log n$, writes its value to $a\sqrt{m}$ different operational cells of $M_i$.
S-2. Each processor $P_i$ repeats the following action, called a round, until a message is found, but no more than $b \log n$ times:
1. Select and contact some memory unit $M_k$ at random.
2. If granted access, then read $\sqrt{m}$ randomly chosen memory cells from $M_k$.
S-3. Each processor $P_i$ that found a message writes it to $a\sqrt{m}$ different operational cells of $M_i$.
S-4. Each processor $P_i$ repeats a round until finding a message, but at most $c \log n$ times. If a message is found, $P_i$ performs Step S-3 and stays idle during the following rounds.
S-5. Each processor $P_i$ writes the message it knows into $c\sqrt{m}\log n$ randomly chosen memory cells of $M_i$.
S-6. Each processor performs the following procedure $f \log^2 n$ times:
1. Choose some module $M_k$ at random.
2. Read $\sqrt{m}/\log n$ random memory cells of $M_k$.
3. If a new message has been found, then add it to the list.
Graph processes. A directed acyclic graph $G = (V, E)$ is given. A node process is associated with each node: it is a sequence of trials, each with the constant probability $p$ of success (which causes termination of the process); the
probability $p$ is the same for all the nodes. The graph process proceeds in steps; at each step there is one trial for each (operating) node process. The process starts at the nodes with in-degree 0 (input nodes) by initializing the associated node processes. In general, the process of a node is initialized once the processes of all its predecessors have terminated. The graph process terminates when all the node processes have terminated. The trials in a step need not be independent, but any collection of trials from distinct steps must be independent. Graph processes were investigated in [12].
Let us assume that each node of $G$ has in-degree bounded by $t$ and out-degree bounded by $s$. $G$ is layered if the set $V$ can be partitioned into disjoint layers $V_1, \ldots, V_d$, such that for any edge $(v, w) \in E$ there is $i$ such that $v \in V_i$ and $w \in V_{i+1}$. Let $n = \max_i |V_i|$. A sequence $w_1, \ldots, w_c$ of nodes is a column if its elements belong to distinct layers, such that the layer of $w_i$ has a greater index than that of $w_j$ provided $i > j$. There exists a partition of $V$ into at most $n$ columns.
The graph represents a computation in the following sense. A node corresponds to computing a value, and an edge from $v_1$ to $v_2$ means that the value computed at $v_1$ is needed at $v_2$ to compute its value. Also, at a given step, the value at node $v$ can be computed with at least some constant probability $p$, provided that the nodes connected with $v$ by incoming edges have completed their evaluations. We consider such a graph of computation of an $n$-processor EREW PRAM algorithm, where the processors are associated with nodes in a dynamic way, and a processor can compute the incoming and outgoing edges in its local registers, knowing the current node. The input and output nodes are distinguished. The input nodes are set to some values, and the goal is to evaluate the output nodes in parallel. Each processor is assigned a column and evaluates its consecutive nodes.
Lemma 4. A graph of computations can be evaluated on an $n$-processor DMM with faulty memory, by referencing only $2n$ memory addresses, in time $(s+t)^2 (5c/p)(d + \log P)$ with probability $1 - P^{-c}$, where $P = n(t+s)d$.
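A sequential sketch of a graph process in Python (a simplification of the model above: here every trial is independent with the same probability p, whereas the paper only requires independence across steps):

    import random

    def run_graph_process(preds, p):
        """preds maps each node to the list of its predecessors; returns the
        number of steps until every node process has terminated."""
        waiting = {v: set(ps) for v, ps in preds.items()}
        done, steps = set(), 0
        while len(done) < len(preds):
            steps += 1
            # one trial for each node whose predecessors have all terminated
            newly = [v for v, ps in waiting.items()
                     if v not in done and not ps and random.random() < p]
            done.update(newly)
            for ps in waiting.values():
                ps.difference_update(newly)
        return steps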
3 Realizing h-relations in faulty memory
A set of access requests among processors and MUs is an $h$-relation if each processor sends at most $h$ requests and each MU is to receive at most $h$ requests. The task of realizing a relation is that of satisfying all its requests. Anderson and Miller [1] and Valiant [18] proposed an algorithm to realize $O(\log n)$-relations. We refer to it as the AMV algorithm. The algorithm is in two phases: the queue stage and the cleanup stage. We present an adaptation of the AMV algorithm to a faulty-memory environment. The algorithm will be used in a PRAM-step simulation after some preprocessing has already been done. During the preprocessing, some $O(\log n)$ address functions will have been generated randomly and made known to all the processors.
There are two kinds of memory requests that we will need to be able to realize. The simple one is when a processor knows the MU it needs to access and the address of an operational memory cell in the MU. The probing request is when a processor knows some virtual address $x$, which is an integer $1 \le x \le n$, and it needs to access some operational memory cell of the form $M[D_t(x)]$. In the simulations, all the processed requests are either simple or probing. We describe and analyze AMV for the case of probing requests; this subsumes the case of simple requests. A code for the queue stage is given below.
algorithm: queue stage
t := 0 ; { t is the index for address functions }
for k := log log n downto 1 do
  for j := 1 to $a \cdot 2^k$ do
    t := t + 1 ;
    set i to a random number in $[1..2^k]$ ;
    if A[i] is not blank then
      attempt to read $M[D_t(A[i])]$ ;
      if communication with the MU was successful and the cell is operational then
        read and store the information ;
        set A[i] to blank ;
  if there are still more than $2^{k-1}$ requests
    then become idle
    else compress array A to the size $2^{k-1}$
The queue of a processor or a MU is the set of its requests not yet realized at the moment. The requests are stored in the array $A$, with $\log n$ entries. After satisfying the request from $A[i]$, this position is left blank, and the array is compressed at the end of each round. In the original AMV algorithm, the rightmost request was put in the place of the just-realized one, hence the array stored the still-unrealized requests in a contiguous part. Our modification simplifies the analysis and is crucial in proving Lemma 5. The requests with the virtual address $x$ are said to be in queue to the virtual unit $G_x$. For a selected function $D_t$, the queue of $G_i$ belongs to the queue to the MU with number $(i + \lfloor d_t/m \rfloor) \bmod n$. Notice that if two processors attempt to realize two distinct virtual addresses, then there is no conflict between them for access to a MU. Anderson and Miller [1] showed that the expected number of requests still not satisfied when the queue stage algorithm (with simple requests) terminates is $O(n)$. The algorithm queue stage for probing requests has the same property. We strengthen this fact and show that the queue stage algorithm for probing requests satisfies all but $O(n)$ requests w.h.p.
Lemma 5. The number of requests still not realized after the algorithm queue stage is finished is $O(n)$ w.h.p.
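A simplified, sequential Python rendering of queue stage for a single processor's queue (memory contention and faults are abstracted into the caller-supplied try_access(t, x), which reports whether the probe of $M[D_t(x)]$ succeeded):

    import math
    import random

    def queue_stage(requests, a, try_access):
        """Returns the requests left unsatisfied (handled later by cleanup)."""
        A = list(requests)                   # the array A of ~log n entries
        if not A:
            return []
        K = max(1, math.ceil(math.log2(len(A))))   # 2^K is the initial size
        t = 0                                # index of the address function
        for k in range(K, 0, -1):
            for _ in range(a * 2 ** k):
                t += 1
                i = random.randrange(len(A))
                if A[i] is not None and try_access(t, A[i]):
                    A[i] = None              # satisfied; position left blank
            A = [x for x in A if x is not None]    # compress the array
            if len(A) > 2 ** (k - 1):
                return A                     # too many left: become idle
            if not A:
                return []
        return A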
The cleanup stage starts by redistributing the memory references still not satisfied among all the processors, in such a way that each processor receives $O(1)$ of them. Next the requests are sorted on their MU addresses. To this end, we simulate the randomized hypercubic sorting developed by Leighton and Plaxton [11], which sorts $n$ elements in time $O(\log n)$ w.h.p. The algorithm can be represented as a graph with a simple hypercubic structure, hence the connections between the processors may be generated on-line and Lemma 4 applies. This shows:
Theorem 6. A DMM with $n$ processors can realize an $O(\log n)$-relation of probing memory requests in time $O(\log n)$ w.h.p.
4 Dictionary in faulty memory
A dictionary supports the operations of insertion and lookup. We develop a sequential implementation of a dictionary, to be run in each MU. There is a memory of size $m$ available, having at most $qm$ faulty cells w.h.p. The keys are from the universe $U = [1..nm]$. Let $MU[x]$ be the $x$th word of the MU in which the dictionary is implemented. A description of the dictionary preprocessing is given below.
algorithm: dictionary preprocessing
S-1. Build a list of length $d \log n$.
S-2. Select $d \log n$ random numbers $d_1, d_2, \ldots$ from the interval $[0..m-1]$ and store them in the list. These numbers define a family $D$ of local address functions as follows: $D_i(x) = (x + d_i) \bmod m$.
S-3. Generate two random functions $f$ and $g$ from $H_S$, where $f : U \to [1..m]$ and $g : [1..m] \to [1..m]$. Store the functions as s-entries $\langle i, s_i \rangle$, each such pair in the operational cells of the form $MU[D_k(i)]$, for all indexes $k$ of address functions.
Lemma 7. The dictionary preprocessing can be accomplished in time $O(m^\delta)$ w.h.p., for a fixed $0 < \delta < 1$.
A code for the dictionary operation is given below.
algorithm: dictionary operation
$x \in U$ is the key in a dictionary operation.
$y := f(x)$ ;
while $MU[g(y)]$ is faulty do $y := y + 1$ ;
{ now $MU[g(y)]$ is operational and is the head of a list, maybe empty }
if the operation is a lookup then
  traverse the list until the key is found or the end encountered
else { insertion }
  repeat
    select a memory address $v$ at random
  until $MU[v]$ is operational and not occupied ;
  store $x$ in $MU[v]$ and add $MU[v]$ to the beginning of the list
Let $R$ be a sequence of dictionary operations, where the number of keys is at most $m/\beta$, for a constant $\beta > 1$. The requests from $R$ come in packets of length $O(\log n)$; all the keys in one packet are distinct. To evaluate $f$ or $g$ at a point, $O(1)$ s-entries are needed. They have been distributed throughout the memory during preprocessing. If, at the moment $t$ of processing a packet, we need an s-entry with index $i$, then the cells $MU[D_t(i)], MU[D_{t+1}(i)], \ldots$ are searched until it is found. Each address function is used at most once while processing the packet.
Theorem 8. The time to process a packet $L$ of dictionary requests from $R$ is $O(\log n)$ w.h.p.
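A compact Python sketch of the dictionary, under the stated assumptions (f and g stand for the functions drawn from $H_S$, here with 0-based indices; the bookkeeping that keeps empty list heads distinguishable from free cells is simplified away):

    import random

    class FaultyDict:
        """Lists of keys kept in the operational cells of one MU."""
        def __init__(self, m, operational, f, g):
            self.m, self.ok, self.f, self.g = m, operational, f, g
            self.cell = [None] * m            # None marks an unoccupied cell

        def _head(self, x):
            """Probe g(y), g(y+1), ... until an operational head cell appears."""
            y = self.f(x)
            while not self.ok[self.g(y % self.m)]:
                y += 1
            return self.g(y % self.m)

        def lookup(self, x):
            node = self.cell[self._head(x)]   # index of the first list node
            while node is not None:
                key, nxt = self.cell[node]
                if key == x:
                    return True
                node = nxt
            return False

        def insert(self, x):
            h = self._head(x)
            while True:                       # random operational free cell
                v = random.randrange(self.m)
                if self.ok[v] and self.cell[v] is None and v != h:
                    break
            self.cell[v] = (x, self.cell[h])  # link in front of the list
            self.cell[h] = v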
5 Simulations
Optimal simulation. A simulation of a PRAM with $n \log n$ processors and $O(nm)$ memory words is developed. The processors of the PRAM are partitioned into $n$ groups of $\log n$ elements; a group is simulated by one processor of the DMM. Each step of the simulated PRAM is executed in time $O(\log n)$ w.h.p. on the DMM, which is optimal. A description of formatting is given below.
algorithm: optimal format
S-1. Each processor $P_i$ is assigned to a useful MU by the procedure assign modules, together with an operational cell $b_i$. The cell $b_i$ is to be used by other processors to communicate with $P_i$.
S-2. The processors $P_1, \ldots, P_{d \log n}$ generate $d \log n$ random numbers $d_1, d_2, \ldots$, determining the address functions $D_1, D_2, \ldots$, and broadcast them to all the processors. The processors store them in lists.
S-3. Each processor $P_i$ places records consisting of the number $i$, the number of the assigned MU, and $b_i$ in the words $M[D_k(i)]$, for all the address functions $D_k$.
S-4. All the processors are organized as a full binary tree. A processor gets to know about its neighbors in the tree, say $i_1$ and $i_2$, by reading the information stored in the words $M[D_k(i_1)]$ and $M[D_k(i_2)]$, for all indexes $k$.
S-5. The root selects a random hash function $h : [1..mn] \to [1..n]$ from $H_S$ and propagates its description along the tree to all the processors. Each processor stores the s-entries in a dictionary.
Theorem 9. The algorithm optimal format, run on a DMM with $n$ processors and the capacity of each MU equal to $m = n^\varepsilon$, preprocesses all the MUs in time $O(\sqrt{m}\log n)$ w.h.p.
During the simulation proper, the memory is hashed by a random function $h$ in $H_S$. Two kinds of access requests to MUs are generated: probing and simple. The former are used to find the number of the useful MU storing address $h(x)$; the latter are sent to the useful MUs, and then back, if the request is a read operation. A description of a simulation of a PRAM step is given below.
algorithm: optimal step
S-1. Each processor finds the physical number of the MU of the $h(x)$-th processor, for all the PRAM cells $x$ that it needs to access.
S-2. Each processor sends the access request to the MU of the $h(x)$-th processor, for all the cells $x$ that it needs to access. Simultaneously, the processor receives requests sent to it from other processors.
S-3. Each processor performs dictionary operations on its assigned MU, corresponding to the received read and write requests.
S-4. Each processor sends back the answers to the read requests.
Step S-1 of the algorithm optimal step is accomplished by probing requests, and Step S-2 by simple ones. The following are the key facts:
(1) The number of PRAM addresses that hash to the same MU by $h$ is $O(m)$ w.h.p.
(2) For each step of the simulated PRAM, the number of requests directed to each MU is $O(\log n)$ w.h.p.
They provide the assumptions under which the efficiency and correctness of realizing $h$-relations and supporting dictionary requests are proved in Sections 3 and 4. This shows:
Theorem 10. A DMM with $n$ processors, and the capacity of each MU equal to $m = n^\varepsilon$ possibly faulty memory cells, can simulate, after preprocessing, a single step of a PRAM with $n \log n$ processors and $O(nm)$ memory cells with delay $O(\log n)$ w.h.p.
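A condensed, sequential sketch of one simulated read (in the algorithm proper, Steps S-1 and S-2 are realized in parallel as $O(\log n)$-relations; read_global and read_cell are hypothetical accessors returning None on a fault or collision):

    def simulate_read(x, h, D, read_global, read_cell, k):
        """Read PRAM cell x; k is the number of address functions D(t, .)."""
        target = h(x)                    # the processor whose MU keeps h(x)
        for t in range(1, k + 1):        # probing requests
            record = read_global(D(t, target))
            if record is not None:       # written during optimal format
                owner, mu, b = record    # (i, assigned MU, operational cell)
                return read_cell(mu, b)  # simple request to the dictionary
        return None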
Faster simulation. A simulation of a PRAM with $n/\sqrt{\log n}$ processors and $mn/\sqrt{\log n}$ memory words is developed. Each step of the simulated PRAM is executed in time $O(\log\log n)$ w.h.p. on the DMM. A description of formatting is given below.
algorithm: fast format
S-1. The processors perform Steps S-1 through S-4 of optimal format.
S-2. The processors in each group form a tree.
S-3. $\sqrt{\log n}$ hash functions $h_i : [1..mn] \to [1..n]$ are selected randomly from $H_S$. The $k$th function is stored by every $k$th processor in a group of processors, in a dictionary.
Theorem 11. The time of the fast format, run on a DMM with $n$ processors and the capacity of each MU equal to $m = n^\varepsilon$, is $O(\sqrt{m}\log n)$ w.h.p.
Let us denote $\lambda = \sqrt{\log n}$. Let the simulated PRAM have $n/\lambda$ processors and $c_m \cdot nm/\lambda$ memory cells. The processors are partitioned into $n/\lambda$ groups $P_i$ of $\lambda$ elements. The $i$th group simulates the $i$th processor of the PRAM. We assume that all the processors in group $P_i$ know a PRAM memory address $x$ to access. To simulate the PRAM step, the processors in $P_i$ reference the memory words $M[h_k(x)]$, for $1 \le k \le \lambda$, and then identify the one with the latest update. In order to reduce the number of collisions in each step, only a fraction of a group is activated, say $c_p \lambda$ processors, and after $1/c_p$ steps all the tasks are completed. A description of the simulation of a step is given below. All the processors in a group know the next PRAM address to access after the last step of the algorithm fast step.
algorithm: fast step
Access by the $i$th PRAM processor to address $x$.
S-1. The processors in $P_i$ make read attempts: the $k$th processor attempts to read $M[h_k(x)]$.
S-2. All the successful processors identify the value $v$ with the maximum time stamp among the retrieved ones. This is accomplished by running a maximum-finding algorithm on the tree, which finds the maximum of $\lambda$ keys on $\lambda$ processors in time $O(\log \lambda)$.
S-3. The value $v$ computed is communicated to all the processors in the group via the tree. Then all the processors perform the required local computations of the PRAM. The memory cells $M[h_k(x)]$ are attempted to be accessed, and the successful processors perform the writes.
Let $S = \{x_1, \ldots, x_{n/\lambda}\}$ be the set of cells to access in the current step. Let $h'_k(x) = \lfloor h_k(x)/m \rfloor$. A memory cell $M[z]$ of the DMM is meaningful, with respect to the hash functions $(h_i)$, if there is at most one pair $k$ and $x$ such that $z = h_k(x)$. A memory cell $M[z]$ is accessible, with respect to the family of hash functions $(h_i)$ and $S$, if there is at most one pair $k$ and $x \in S$ such that $M[z]$ is in the MU with number $h'_k(x)$.
Lemma 12. For each $x \in S$, w.h.p. there exists some $k$ such that $h_k(x)$ is meaningful and accessible.
Theorem 13. The algorithm fast step simulates a single step of a PRAM in time $O(\log\log n)$ w.h.p.
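A sketch of the replicated read in fast step (read_global is a hypothetical accessor returning a (timestamp, value) pair, or None on a fault or collision; the loop stands for the parallel attempts of the $\lambda$ group members and the max-finding on the tree):

    def fast_read(x, hashes, read_global):
        """Return the value of PRAM cell x with the newest time stamp."""
        best = None                        # (timestamp, value)
        for h_k in hashes:                 # one copy per group member
            item = read_global(h_k(x))
            if item is not None and (best is None or item[0] > best[0]):
                best = item
        return None if best is None else best[1]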
Acknowledgement: Thanks are due to Artur Czumaj for his comments.
References
1. R.J. Anderson and G.L. Miller, Optical Communication for Pointer Based Algorithms, Tech. Rep. CRI 88-14, Computer Science Dept., University of Southern California, Los Angeles, 1988.
2. B.S. Chlebus, A. Gambin, and P. Indyk, PRAM Computations Resilient to Memory Faults, in Proceedings of the 2nd Annual European Symposium on Algorithms, 1994, Springer LNCS 855, pp. 401-412.
3. B.S. Chlebus, L. Gasieniec, and A. Pelc, Fast Deterministic Simulation of Computations on Faulty Parallel Machines, in Proceedings of the 3rd Annual European Symposium on Algorithms, 1995, Springer LNCS 979, pp. 89-101.
4. A. Czumaj, F. Meyer auf der Heide, and V. Stemann, Shared Memory Simulations with Triple-Logarithmic Delay, in Proceedings of the 3rd Annual European Symposium on Algorithms, 1995, Springer LNCS 979, pp. 46-59.
5. M. Dietzfelbinger and F. Meyer auf der Heide, Simple, Efficient Shared Memory Simulations, in Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures, 1993, pp. 110-119.
6. P. Indyk, On Word-Level Parallelism in Fault-Tolerant Computing, in Proceedings of the 13th Annual Symposium on Theoretical Aspects of Computer Science, 1996.
7. P.C. Kanellakis and A.A. Shvartsman, Efficient Parallel Algorithms Can Be Made Robust, Distributed Computing, 5 (1992) 201-217.
8. R. Karp, M. Luby, and F. Meyer auf der Heide, Efficient PRAM Simulations on a Distributed Memory Machine, in Proceedings of the 24th Annual ACM Symposium on Theory of Computing, 1992, pp. 318-326.
9. Z.M. Kedem, K.V. Palem, A. Raghunathan, and P. Spirakis, Combining Tentative and Definite Executions for Very Fast Dependable Parallel Computing, in Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, 1991, pp. 381-390.
10. Z.M. Kedem, K.V. Palem, and P. Spirakis, Efficient Robust Parallel Computations, in Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, 1990, pp. 138-148.
11. T. Leighton and G. Plaxton, A (Fairly) Simple Circuit That (Usually) Sorts, in Proceedings of the 31st Annual Symposium on Foundations of Computer Science, 1990, pp. 264-274.
12. C. Martel, A. Park, and R. Subramonian, Work-Optimal Asynchronous Algorithms for Shared Memory Parallel Computers, SIAM Journal on Computing (1992) 1070-1099.
13. K. Mehlhorn and U. Vishkin, Randomized and Deterministic Simulations of PRAMs by Parallel Machines with Restricted Granularity of Parallel Memories, Acta Informatica 21 (1984) 339-374.
14. F. Meyer auf der Heide, Hashing Strategies for Simulating Shared Memory on Distributed Memory Machines, in Proceedings of the 1st Heinz Nixdorf Symposium "Parallel Architectures and their Efficient Use," F. Meyer auf der Heide, B. Monien, A.L. Rosenberg (Eds.), Paderborn, Germany, 1992, Springer LNCS 678, pp. 20-29.
15. F. Meyer auf der Heide, C. Scheideler, and V. Stemann, Exploiting Storage Redundancy to Speed Up Randomized Shared Memory Simulations, in Proceedings of the 12th Annual Symposium on Theoretical Aspects of Computer Science, Munich, Germany, 1995, Springer LNCS 900, pp. 267-278.
16. A. Siegel, On Universal Classes of Fast High Performance Hash Functions, Their Time-Space Tradeoff, and Their Applications, in Proceedings of the 30th Annual Symposium on Foundations of Computer Science, 1989, pp. 20-25.
17. A.A. Shvartsman, Achieving Optimal CRCW PRAM Fault-Tolerance, Information Processing Letters, 39 (1991) 59-66.
18. L.G. Valiant, General Purpose Parallel Architectures, in "Handbook of Theoretical Computer Science," J. van Leeuwen (Ed.), Elsevier, 1990, vol. A, pp. 869-941.