BlueOx: A Java Framework for Distributed Data Analysis

Jeremiah Mans
Princeton University
CHEP 2003
San Diego, CA
Outline

Overview of BlueOx and its goals

Design and Structure of BlueOx

Current Status

Future Directions
3/24/2003
CHEP 2003
Jeremiah Mans
Princeton University
BlueOx: A Framework for Distributed Data Analysis
Code to Data
User’s Analysis Code
Goals of BlueOx

Generic code-to-data analysis framework
    User writes analysis code on his or her desktop or uses an analysis code generator program
    Framework is responsible for distributing and executing the analysis
    Provide support for debugging of code
Expandable and adaptable framework
    Allow addition of new data formats, communication protocols, authentication systems
Where BlueOx Fits
[Figure: response-time scale running from "Immediately" through 1 sec, 100 sec, 10000 sec, and 1000000 sec. Labels on the scale: Interactive, Remote, Batch Processing, Monte Carlo Production. Example tasks shown: ROOT, Histogram Browsing, Database Queries, Arbitrary Code, Arbitrary Executables.]
Actors in BlueOx
User: the physicist at a local institute
Agent: class which represents the User and coordinates the analysis process
Servers: data analysis servers located locally and around the world
Job Lifecycle: Discovery

The Agent uses the Discovery interface to obtain a list
of the available datasets, possibly based on a query
provided by the User.
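
The discovery step described above can be sketched as a small interface plus a toy implementation. This is only an illustration: `DatasetInfo`, `DatasetDiscovery`, and `InMemoryDiscovery` are hypothetical names, not the BlueOx API, and a real implementation would contact Servers directly or query a directory service rather than an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical description of one dataset offered by a Server.
final class DatasetInfo {
    final String name;
    final String server;
    final long events;

    DatasetInfo(String name, String server, long events) {
        this.name = name;
        this.server = server;
        this.events = events;
    }
}

// Hypothetical discovery interface: the Agent asks for datasets matching a query.
interface DatasetDiscovery {
    List<DatasetInfo> discover(Predicate<DatasetInfo> query);
}

// Toy implementation backed by an in-memory catalog (an assumption for this sketch).
class InMemoryDiscovery implements DatasetDiscovery {
    private final List<DatasetInfo> catalog = new ArrayList<>();

    void register(DatasetInfo info) {
        catalog.add(info);
    }

    @Override
    public List<DatasetInfo> discover(Predicate<DatasetInfo> query) {
        List<DatasetInfo> matches = new ArrayList<>();
        for (DatasetInfo info : catalog) {
            if (query.test(info)) {
                matches.add(info);
            }
        }
        return matches;
    }
}
```

A User-supplied query then becomes a predicate, for example `discovery.discover(d -> d.name.startsWith("zh"))`.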
Job Lifecycle: Brokering
[Diagram labels: Analysis Job, Agent, Brokerage]
The User supplies a Job class and a list of datasets. The
Agent uses an instance of the Brokerage interface to obtain
service contracts assigning the analysis of each dataset to a
Server.
In general, the Brokerage will either contact a subset of
Servers directly, or will contact Proxies which handle the job
assignments for a group of Servers (such as a cluster).
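
As a rough illustration of brokering, the sketch below assigns each dataset to one host and records the assignment as a contract. All names (`JobHost`, `ServiceContract`, `Brokerage`, `DirectContactBrokerage`, `StaticHost`) are hypothetical, and the "least-loaded host wins" policy is an assumption made for the example, not a description of how BlueOx chooses Servers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical view of a Server, or of a Proxy standing in for a cluster.
interface JobHost {
    String name();
    boolean hosts(String dataset);
    int pendingJobs();
}

// Hypothetical service contract: one dataset assigned to one host.
final class ServiceContract {
    final String dataset;
    final String hostName;

    ServiceContract(String dataset, String hostName) {
        this.dataset = dataset;
        this.hostName = hostName;
    }
}

interface Brokerage {
    List<ServiceContract> broker(List<String> datasets);
}

// Fixed-answer host used only to exercise the sketch.
class StaticHost implements JobHost {
    private final String name;
    private final Set<String> datasets;
    private final int pending;

    StaticHost(String name, Set<String> datasets, int pending) {
        this.name = name;
        this.datasets = datasets;
        this.pending = pending;
    }

    public String name() { return name; }
    public boolean hosts(String dataset) { return datasets.contains(dataset); }
    public int pendingJobs() { return pending; }
}

// Contacts every host directly and picks the least-loaded one (assumed policy).
class DirectContactBrokerage implements Brokerage {
    private final List<JobHost> hosts;

    DirectContactBrokerage(List<JobHost> hosts) {
        this.hosts = hosts;
    }

    @Override
    public List<ServiceContract> broker(List<String> datasets) {
        Map<String, Integer> issued = new HashMap<>(); // contracts issued in this call
        List<ServiceContract> contracts = new ArrayList<>();
        for (String dataset : datasets) {
            JobHost best = null;
            int bestLoad = Integer.MAX_VALUE;
            for (JobHost host : hosts) {
                if (!host.hosts(dataset)) {
                    continue;
                }
                int load = host.pendingJobs() + issued.getOrDefault(host.name(), 0);
                if (load < bestLoad) {
                    best = host;
                    bestLoad = load;
                }
            }
            if (best == null) {
                throw new IllegalStateException("No host offers dataset " + dataset);
            }
            issued.merge(best.name(), 1, Integer::sum);
            contracts.add(new ServiceContract(dataset, best.name()));
        }
        return contracts;
    }
}
```

A Proxy for a cluster could implement the same `JobHost` interface and delegate to its own Servers, which is why the sketch does not distinguish the two cases.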
Job Lifecycle: Execution
The Agent distributes split copies of the Job to the various Servers, and monitors the progress of execution.
[Diagram labels: Agent, Communications Schemes, Analysis Job (several copies), DataSource (several)]
Job Lifecycle: Merging
[Diagram labels: Agent, Analysis Job]
When the split Jobs complete, the Agent gathers them from the
remote Servers and merges them into a single Job, which it
returns to the User.
If the job crashes on one or more Servers, the Agent reports the
details of the exception which terminated the Job to the User.
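
The split, execute, and merge cycle can be imitated in a single process, with a thread pool standing in for the remote Servers. This is a sketch under that assumption: `CountingJob` and `MiniAgent` are hypothetical names, and the real framework ships Job copies over a Communications Scheme rather than running them on local threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical job: counts events whose value exceeds 50.
class CountingJob {
    long selected = 0;

    // Fresh, empty copy to be sent to one Server.
    CountingJob split() {
        return new CountingJob();
    }

    void analyze(int[] dataset) {
        for (int value : dataset) {
            if (value < 0) {
                throw new IllegalArgumentException("corrupt event: " + value);
            }
            if (value > 50) {
                selected++;
            }
        }
    }

    void merge(CountingJob other) {
        selected += other.selected;
    }
}

// Stand-in for the Agent: one task per dataset, then gather and merge.
class MiniAgent {
    final List<Throwable> failures = new ArrayList<>();

    CountingJob run(CountingJob job, List<int[]> datasets) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, datasets.size()));
        try {
            List<Future<CountingJob>> pending = new ArrayList<>();
            for (int[] dataset : datasets) {
                CountingJob copy = job.split();
                pending.add(pool.submit(() -> {
                    copy.analyze(dataset);
                    return copy;
                }));
            }
            for (Future<CountingJob> future : pending) {
                try {
                    job.merge(future.get()); // gather a finished copy and merge it
                } catch (ExecutionException e) {
                    failures.add(e.getCause()); // report the exception that ended the copy
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    failures.add(e);
                }
            }
        } finally {
            pool.shutdown();
        }
        return job;
    }
}
```

In this sketch a failed copy contributes nothing to the merged result, and its exception is kept in `failures` so that it can be shown to the User.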
Abstract Interfaces



The main actors in BlueOx are the Agent, which represents the User, and the Servers.
The interaction between the main actors is defined by abstract interfaces which support multiple implementations.
Interfaces in BlueOx:
    User authentication: Password-based, certificate-based, …
    Dataset discovery: Direct server contact, LDAP, …
    Data Access/Storage: ROOT, HBOOK, custom, …
    Communications Scheme
Abstract Interfaces: Communications Scheme
Allows communication between the Agent and Server during job execution
Responsible for transporting analysis objects and classes (code)
Transports debugging and monitoring information
    Arbitrary textual messages from user’s code
    Exceptions which occur within the user’s code
Implemented with several different technologies
    Packet-based two-way communication
    Client-connection-only communication (polling)
    Two-way SOAP
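
The client-connection-only (polling) variant can be pictured as a Server-side outbox that the Agent drains whenever it connects. The sketch below is illustrative only: `JobMessage` and `PollingOutbox` are hypothetical names, and the network transport is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical monitoring message: either free text or a reported exception.
final class JobMessage {
    enum Kind { TEXT, EXCEPTION }

    final Kind kind;
    final String body;

    private JobMessage(Kind kind, String body) {
        this.kind = kind;
        this.body = body;
    }

    static JobMessage text(String message) {
        return new JobMessage(Kind.TEXT, message);
    }

    static JobMessage exception(Throwable error) {
        return new JobMessage(Kind.EXCEPTION, error.toString());
    }
}

// Server-side queue: the Server never opens a connection to the Agent.
// It stores messages until the Agent polls for them.
class PollingOutbox {
    private final List<JobMessage> queue = new ArrayList<>();

    synchronized void post(JobMessage message) {
        queue.add(message);
    }

    // Returns everything posted since the previous poll and empties the queue.
    synchronized List<JobMessage> poll() {
        List<JobMessage> drained = new ArrayList<>(queue);
        queue.clear();
        return drained;
    }
}
```

A two-way scheme would instead push each message to the Agent as soon as it is posted; hiding that difference behind one interface is what lets the rest of the framework ignore which technology is in use.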
Mergeability


For BlueOx to be able to distribute an analysis and combine the results afterwards, the analysis must support mergeability: the ability to merge or add one object of a given class to another of the same class.
Simple operation for many objects
    Counters and histograms add arithmetically
    Lists concatenate (and possibly re-sort)
BlueOx contains several utility classes which support Mergeability
An object which contains only mergeable, transient, or static member variables can be merged automatically: Automergeable
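
A minimal sketch of these ideas follows. The names `Mergeable`, `Counter`, `Histogram`, `SortedList`, `AutoMerge`, and `ExampleAnalysis` are hypothetical stand-ins rather than the BlueOx utility classes, and using reflection for the automatic merge is an assumption about one plausible way to implement it.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical interface: merge another object of the same class into this one.
interface Mergeable<T> {
    void merge(T other);
}

class Counter implements Mergeable<Counter> {
    long count;
    void increment() { count++; }
    public void merge(Counter other) { count += other.count; } // counters add
}

class Histogram implements Mergeable<Histogram> {
    final double low, high;
    final long[] bins;

    Histogram(int nBins, double low, double high) {
        this.low = low;
        this.high = high;
        this.bins = new long[nBins];
    }

    void fill(double x) {
        if (x < low || x >= high) return; // out-of-range entries are dropped in this sketch
        int bin = (int) ((x - low) / (high - low) * bins.length);
        bins[Math.min(bin, bins.length - 1)]++;
    }

    public void merge(Histogram other) { // histograms add bin by bin
        if (other.bins.length != bins.length || other.low != low || other.high != high) {
            throw new IllegalArgumentException("incompatible binning");
        }
        for (int i = 0; i < bins.length; i++) bins[i] += other.bins[i];
    }
}

class SortedList implements Mergeable<SortedList> {
    final List<Double> values = new ArrayList<>();
    void add(double v) { values.add(v); Collections.sort(values); }
    public void merge(SortedList other) { // concatenate, then re-sort
        values.addAll(other.values);
        Collections.sort(values);
    }
}

// Automatic merge: every field that is neither static nor transient must be Mergeable.
final class AutoMerge {
    @SuppressWarnings("unchecked")
    static void merge(Object target, Object source) {
        if (target.getClass() != source.getClass()) {
            throw new IllegalArgumentException("objects must be of the same class");
        }
        for (Field field : target.getClass().getDeclaredFields()) {
            int mods = field.getModifiers();
            if (Modifier.isStatic(mods) || Modifier.isTransient(mods)) continue;
            field.setAccessible(true);
            try {
                Object mine = field.get(target);
                if (!(mine instanceof Mergeable)) {
                    throw new IllegalStateException("field " + field.getName() + " is not mergeable");
                }
                ((Mergeable<Object>) mine).merge(field.get(source));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
    }
}

// Example analysis state: two mergeable fields and one transient scratch field.
class ExampleAnalysis {
    final Counter events = new Counter();
    final Histogram mass = new Histogram(10, 0.0, 200.0);
    transient String scratch;
}
```

Calling `AutoMerge.merge(a, b)` on two `ExampleAnalysis` objects adds the counter and the histogram from `b` into `a` and leaves the transient field alone.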
BlueOx and AIDA


BlueOx supports the use of the Abstract Interfaces for Data Analysis (AIDA), particularly for histograms and similar objects.
BlueOx provides the implementation of Mergeable for AIDA and manages the serialization of AIDA objects.
    BlueOx does not provide a full implementation of AIDA, but rather provides a “wrapper” implementation, which aims to provide the merge and serialization functionality to any AIDA implementation.
    Currently, the use of JAIDA 3.0 is supported in BlueOx.
Doing Analysis in BlueOx


User writes an Analyzer class, which may employ various extensions to enable configuration, startup and completion tasks, and interfacing with a GUI.
Experiment-specific information (such as the data format used for the experiment’s data) must be provided separately to the user.
    Future revisions of BlueOx may support XML schemas.
The user employs either a command-line or graphical interface to select data sets, start the job, and follow it to its completion.
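
To make the shape of such an Analyzer concrete, here is a hedged sketch with optional startup and completion hooks. `ToyEvent`, `ToyAnalyzer`, and `MassWindowAnalyzer` are hypothetical names chosen for the example; the actual BlueOx Analyzer class and its extension interfaces may look quite different, and the event format is experiment-specific.

```java
import java.util.List;

// Hypothetical event type; the real data format is supplied per experiment.
final class ToyEvent {
    final double visibleMass;

    ToyEvent(double visibleMass) {
        this.visibleMass = visibleMass;
    }
}

// Hypothetical base class: per-event work is mandatory, the hooks are optional.
abstract class ToyAnalyzer {
    void startup() {}

    abstract void analyze(ToyEvent event);

    void completion() {}

    final void run(List<ToyEvent> events) {
        startup();
        for (ToyEvent event : events) {
            analyze(event);
        }
        completion();
    }
}

// Example user analysis: fraction of events inside a configurable mass window.
class MassWindowAnalyzer extends ToyAnalyzer {
    double low = 100.0;  // configuration values the User could change
    double high = 140.0;
    long seen;
    long selected;
    double fraction;

    @Override
    void startup() {
        seen = 0;
        selected = 0;
    }

    @Override
    void analyze(ToyEvent event) {
        seen++;
        if (event.visibleMass >= low && event.visibleMass < high) {
            selected++;
        }
    }

    @Override
    void completion() {
        fraction = seen == 0 ? 0.0 : (double) selected / seen;
    }
}
```

In a distributed run the framework, not the User, would call these hooks on each Server; `run` exists here only so the sketch can be exercised locally.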
Demonstration Analysis
(Quick overview here – come to the demonstration for more details!)





Data Source: Pythia-generated events for a 500 GeV e+e- collider, followed by a smearing+reconstruction program
Dataset discovery: LDAP database
Brokering: Direct-contact brokering
Authentication: SSH-like key files
Analysis focus: e+e- → ZH → ννbb
Choosing Datasets

Plots from HZ → ννbb Analysis
Testing BlueOx
40 Servers, running on four machines
100 Clients, running on one machine
100 Jobs/Client (10,000 total jobs executed)
[Plot: Normalized Load (axis ticks 0.1, 1.0, 10.0, 100.0) versus Time (min, 0 to 60); traces labeled A1-A10, B1-B10, C1-C10, D1-D10]
Future Development

Increase sophistication of brokering
    Current implementation is not scalable, but we have some ideas on how to improve it.
Test analysis framework with more real analysis
    We are developing a DataSource which would allow analysis of DØ data by our undergraduates
Enhance user tools
    Integration with JAS v3.0
    Schema-based analysis-creation wizard
Further Information

Web site including “BlueOx Companion” and
JavaDoc documentation:
http://flywheel.princeton.edu/BlueOx/