Lightweight Task Graph Inference for Distributed Applications
Bin Xin, Patrick Eugster, Xiangyu Zhang
Jinlin Yang
Dept. of Computer Science
Purdue University
Center for Software Excellence
Microsoft Corp.
{xinb, peugster, xyzhang}@cs.purdue.edu
[email protected]
2010 29th IEEE International Symposium on Reliable Distributed Systems
Introduction
• New challenges to reliability as applications move to the cloud
  • Distinct corporate entities manage the infrastructure and own the applications deployed on it.
  • Application developers do not have access to lower-level debugging information in case of failures/faults.
    • They depend on application output or custom application-level logs for diagnosis.
• Goal: Describe the high-level structural view of a distributed program execution to facilitate easy "after the fact" diagnosis.
Contributions
• Define an abstraction for representing distributed executions: "Tasks"
• A lightweight approach to generate "Task Graphs" from the application event logs
• A declarative formulation of the rules to generate Task Graphs using Prolog
• Demonstrate the use of Task Graphs to help understand the distributed execution, including anomaly detection
Relevance to SmartGrid and CiC
• Extensions
  • Fault detection by real-time log processing (CEP?)
    • The patterns for CEP can be defined by the application developer,
    • or they can be auto-generated using code augmentation and static code analysis.
  • On fault detection, the task graph can be used to decide "recovery" mechanisms (other than a naïve restart-the-process strategy).
• Shortcomings
  • Does not explicitly consider the "Data Repository"
    • It is considered only as one of the 'tasks'.
  • Not sure how it handles transactions
Definitions
Event: the execution of an operation that sends (or receives) data or a signal to (or from) a different thread/process. Events are the smallest building blocks.
Signaling Event: the operation of sending
Acting Event: the operation of receiving
Happens Before (a → b): a partial ordering of events, where a is the signaling event of the sender and b is the acting event of the receiver that acts on that signal.
Task: autonomous computation within a thread between two "acting" events, [Astart, Aend)
A task contains exactly one acting event
and zero or more signaling events
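The task definition above can be illustrated with a minimal Python sketch. It assumes events arrive in log order as (event_id, thread_id, kind) tuples, where the labels "act" and "signal" are hypothetical stand-ins for acting and signaling events. It is not the formulation used by the authors, who express this in Prolog.

```python
def split_into_tasks(events):
    """Group log-ordered events into tasks.

    `events` is a list of (event_id, thread_id, kind) tuples; kind is "act"
    or "signal" (hypothetical labels). Within a thread, a task starts at an
    acting event and runs up to, but not including, the next acting event.
    """
    tasks = []       # each task is a list of event ids, acting event first
    current = {}     # thread_id -> index of that thread's open task
    for event_id, thread_id, kind in events:
        if kind == "act":
            # An acting event closes the thread's previous task and opens a new one.
            tasks.append([event_id])
            current[thread_id] = len(tasks) - 1
        elif thread_id in current:
            tasks[current[thread_id]].append(event_id)
        # Signals seen before a thread's first acting event have no enclosing
        # task in this sketch and are skipped.
    return tasks
```

Each returned task holds exactly one acting event followed by zero or more signaling events from the same thread.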
Task Graph: a DAG whose nodes are tasks and whose edges represent Happens Before relations
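As a rough sketch of this definition, the Python below lifts happens-before pairs between events to edges between tasks and checks that the result is acyclic. The input shapes (`task_of`, `hb_pairs`) are assumptions made for illustration, and the acyclicity check uses Kahn's algorithm, which is a choice of this sketch rather than something the slides specify.

```python
from collections import defaultdict, deque

def build_task_graph(task_of, hb_pairs):
    """Lift event-level happens-before pairs to task-level edges.

    task_of:  dict mapping event id -> task id (assumed input shape)
    hb_pairs: iterable of (signaling_event_id, acting_event_id)
    """
    edges = set()
    for signal_id, act_id in hb_pairs:
        src, dst = task_of[signal_id], task_of[act_id]
        if src != dst:
            edges.add((src, dst))
    return edges

def is_acyclic(nodes, edges):
    """Return True if the graph is a DAG (Kahn's algorithm)."""
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    # Every node is visited exactly when no cycle exists.
    return visited == len(indegree)
```

A cycle in the result would indicate a wrongly inferred happens-before pair, so the check doubles as a sanity test on the inference.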
Request: a pair of signaling and acting events where the signaling event originates outside the system.
Reply: a pair of signaling and acting events where the acting event is triggered outside the system.
E2E service Graph:
Example
System Setup
• Uses HDFS as the example cloud application
• HDFS logs are not sufficient or standardized
• Adds instrumentation using a tool called "AspectJ"
  • AspectJ lets the developer insert code based on specific "rules" during compilation
  • Each event is logged as a 7-field tuple
    • (EventID, ProcID, ThreadID, SourceLocation, Type, Tag, Value)
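A minimal Python sketch of the 7-field event record follows. The field names come from the tuple above; the comma-separated line format, the field types, and the meanings given in the comments are assumptions, since the slides do not show the on-disk log format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    event_id: int
    proc_id: str
    thread_id: str
    source_location: str
    type: str    # assumed values such as "signal" or "act"
    tag: str     # assumed to identify the sync object or channel involved
    value: str   # assumed to carry extra detail, e.g. a byte count

def parse_event(line: str) -> Event:
    """Parse one comma-separated log line into an Event (format is assumed)."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    event_id, proc_id, thread_id, location, etype, tag, value = fields
    return Event(int(event_id), proc_id, thread_id, location, etype, tag, value)
```

For example, `parse_event("17, namenode, t3, Server.java:120, act, sock42, 512")` yields an `Event` whose `event_id` is 17 and whose `tag` is `"sock42"`; the sample values are made up.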
Constructing Task Graphs (Prolog formulation) I
Events: a "fact" is used to parse and store all events.
An entry for hb is made only if the rules on the right are true for events X and Y.
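The shape of such an hb rule can be sketched in Python. The rule used here (a signaling event X happens before an acting event Y when both carry the same tag and belong to different threads) is a hypothetical simplification for illustration; it is not the authors' Prolog rule set, and the dictionary keys are assumed names.

```python
def infer_hb(events):
    """Return (signal_id, act_id) pairs satisfying a simplified hb rule.

    events: list of dicts with keys "id", "thread", "type", "tag"
            (assumed names; "type" is "signal" or "act").
    """
    signals = [e for e in events if e["type"] == "signal"]
    acts = [e for e in events if e["type"] == "act"]
    return [
        (x["id"], y["id"])
        for x in signals
        for y in acts
        # Hypothetical rule: same tag, different threads.
        if x["tag"] == y["tag"] and x["thread"] != y["thread"]
    ]
```

A rule this permissive pairs every signal with every acting event on the same tag, which is one way the false positives discussed under Issues & Solutions can arise.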
Constructing Task Graphs (Prolog formulation) II
Tasks
Issues & Solutions - I
Problem:
False positives caused by common synchronization objects.
A notion of "time" is required, but global clocks or vector clocks are expensive and complex.
Proposed solution:
Heuristic: use the order of events in the event logs.
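One possible reading of this heuristic, sketched in Python: pair each acting event with the most recent signaling event on the same tag that precedes it in the log. The slide states only that log order is used, so the "most recent preceding signal" rule and the tuple layout are assumptions.

```python
def pair_by_log_order(events):
    """Pair signals with acting events using log position as a stand-in for time.

    events: log-ordered list of (event_id, kind, tag); kind is "signal" or
            "act" (hypothetical labels).
    """
    last_signal = {}   # tag -> id of the most recent signal seen so far
    pairs = []
    for event_id, kind, tag in events:
        if kind == "signal":
            last_signal[tag] = event_id
        elif kind == "act" and tag in last_signal:
            # Only a signal that appears earlier in the log can be the cause.
            pairs.append((last_signal[tag], event_id))
    return pairs
```

With two signal/act rounds on one shared lock, this yields one pair per round instead of all four combinations a tag-only match would produce.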
Issues & Solutions - II
Problem:
False positives caused by communication: multiple writes on the same socket.
Proposed solution:
Heuristic: use the "packet size" and the total received so far to decide which write to associate with which reads.
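The byte-counting idea can be sketched as follows in Python. The sketch assumes a reliable, ordered byte stream per socket and takes each event's size from the log; the function name and input layout are hypothetical, and the authors' actual matching rule may differ.

```python
def match_writes_to_reads(writes, reads):
    """Associate writes with reads on one socket by cumulative byte ranges.

    writes, reads: log-ordered lists of (event_id, nbytes) for one socket.
    A write is paired with a read when their byte ranges in the stream overlap.
    """
    # Byte range [lo, hi) covered by each write.
    spans = []
    sent = 0
    for write_id, nbytes in writes:
        spans.append((write_id, sent, sent + nbytes))
        sent += nbytes

    pairs = []
    received = 0   # total bytes received so far
    for read_id, nbytes in reads:
        r_lo, r_hi = received, received + nbytes
        for write_id, w_lo, w_hi in spans:
            if w_lo < r_hi and r_lo < w_hi:   # half-open ranges overlap
                pairs.append((write_id, read_id))
        received = r_hi
    return pairs
```

When a read straddles two writes, it is paired with both, so each write reaches every read that consumed part of its data and no others.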
Issues & Solutions - III
Problem:
False negatives caused by guarded waits.
Multiple waiting threads are notified, and the lock condition may be updated before the current thread executes. Hence a condition check is required after waking up.
Proposed solution:
Manually update such cases: remove the augmented code within the loop and add a marker just after the loop.
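The guarded-wait pattern and the marker placement can be illustrated with a small Python analogue. The original system instruments Java code with AspectJ; `threading.Condition`, the in-memory `log` list, and the "signal"/"act" labels here are stand-ins chosen for this sketch.

```python
import threading

def run_demo():
    log = []                         # stand-in for the instrumented event log
    cond = threading.Condition()
    items = []

    def consumer():
        with cond:
            while not items:         # guard: re-check the condition after waking
                cond.wait()          # no logging inside the loop
            log.append("act")        # marker placed just after the loop
            items.pop()

    def producer():
        with cond:
            items.append("work")
            log.append("signal")     # signaling event
            cond.notify_all()

    worker = threading.Thread(target=consumer)
    worker.start()
    producer()
    worker.join()
    return log
```

Logging only after the loop records a single acting event once the guard actually holds, regardless of how many times the thread woke up inside the loop.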
Evaluation - I
Performance Impact
Runtime:
22.2% increase in binary size
38% increase in execution time
TaskGraph building using Prolog:
Evaluation – II (Demo)
To help a new HDFS developer analyze an HDFS execution