BlueOx: A Java Framework for Distributed Data Analysis

Jeremiah Mans
Princeton University
CHEP 2003, San Diego, CA

Outline

- Overview of BlueOx and its goals
- Design and Structure of BlueOx
- Current Status
- Future Directions

3/24/2003 CHEP 2003 Jeremiah Mans Princeton University BlueOx: A Framework for Distributed Data Analysis

Code to Data

[Diagram: User's Analysis Code]

Goals of BlueOx

- Generic code-to-data analysis framework
  - User writes analysis code on his or her desktop or uses an analysis code generator program
  - Framework is responsible for distributing and executing the analysis
  - Provide support for debugging of code
- Expandable and adaptable framework
  - Allow addition of new data formats, communication protocols, authentication systems

Where BlueOx Fits

[Figure: timescale axis with ticks at 1 sec, 100 sec, 10000 sec, and 1000000 sec. Labels: Immediately Interactive, Remote Batch Processing, Monte Carlo Production, ROOT Histogram Browsing, Database Queries, Arbitrary Code, Arbitrary Executables]

Actors in BlueOx

- User: the physicist at a local institute
- Agent: class which represents the User and coordinates the analysis process
- Servers: data analysis servers located locally and around the world

Job Lifecycle: Discovery

The Agent uses the Discovery interface to obtain a list of the available datasets, possibly based on a query provided by the User.

Job Lifecycle: Brokering

[Diagram: Analysis Job, Agent, Brokerage]

The User supplies a Job class and a list of datasets. The Agent uses an instance of the Brokerage interface to obtain service contracts assigning the analysis of each dataset to a Server.
In general, the Brokerage will either contact a subset of Servers directly, or will contact Proxies which handle the job assignments for a group of Servers (such as a cluster).

Job Lifecycle: Execution

[Diagram: Agent, Communications Schemes, Analysis Job copies, DataSources]

The Agent distributes split copies of the Job to the various Servers and monitors the progress of execution.

Job Lifecycle: Merging

[Diagram: Agent, Analysis Job]

When the split Jobs complete, the Agent gathers them from the remote Servers and merges them into a single Job, which it returns to the User. If the Job crashes on one or more Servers, the Agent reports to the User the details of the exception which terminated the Job.

Abstract Interfaces

The main actors in BlueOx are the Agent, which represents the User, and the Servers.
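The execution and merging steps of the job lifecycle, with the Agent and Server roles named above, can be sketched in a few lines of Java. This is only an in-process illustration under stated assumptions: `CountJob`, `RunResult`, and `LocalAgent` are hypothetical names that do not come from BlueOx, a thread pool stands in for the remote Servers, and plain `double[]` arrays stand in for datasets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical job (not a BlueOx class): counts events above a threshold.
final class CountJob {
    private long passed;

    // A split copy starts empty so that partial results can be added back later.
    CountJob split() { return new CountJob(); }

    // Analyze one dataset; a NaN value stands in for a corrupt event.
    void process(double[] dataset) {
        for (double energy : dataset) {
            if (Double.isNaN(energy)) throw new IllegalStateException("corrupt event");
            if (energy > 50.0) passed++;
        }
    }

    void merge(CountJob other) { passed += other.passed; }

    long passed() { return passed; }
}

// What the user gets back: the merged job plus the exceptions of failed splits.
final class RunResult {
    final CountJob merged;
    final List<Throwable> failures;

    RunResult(CountJob merged, List<Throwable> failures) {
        this.merged = merged;
        this.failures = failures;
    }
}

// Stand-in for the Agent; each pool thread plays the role of one Server.
final class LocalAgent {
    static RunResult run(CountJob job, List<double[]> datasets) {
        ExecutorService servers = Executors.newFixedThreadPool(Math.max(1, datasets.size()));
        List<Throwable> failures = new ArrayList<>();
        try {
            // Execution: one split copy of the job per dataset.
            List<Future<CountJob>> pending = new ArrayList<>();
            for (double[] dataset : datasets) {
                CountJob part = job.split();
                pending.add(servers.submit(() -> { part.process(dataset); return part; }));
            }
            // Merging: gather each split job, or record the exception that ended it.
            for (Future<CountJob> f : pending) {
                try {
                    job.merge(f.get());
                } catch (ExecutionException e) {
                    failures.add(e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    failures.add(e);
                }
            }
        } finally {
            servers.shutdown();
        }
        return new RunResult(job, failures);
    }
}

final class LifecycleDemo {
    public static void main(String[] args) {
        List<double[]> datasets = new ArrayList<>();
        datasets.add(new double[] {10.0, 60.0, 75.0});
        datasets.add(new double[] {55.0, Double.NaN});   // this split fails
        datasets.add(new double[] {51.0});
        RunResult r = LocalAgent.run(new CountJob(), datasets);
        System.out.println("passed = " + r.merged.passed());
        System.out.println("failed splits = " + r.failures.size());
    }
}
```

In this sketch a failed split contributes nothing to the merged result, and its exception is handed back alongside the merged job. Whether BlueOx itself discards or keeps partial results from a crashed Server is not stated in the slides.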
The interaction between the main actors is defined by abstract interfaces which support multiple implementations.

Interfaces in BlueOx:

- User authentication: password-based, certificate-based, …
- Dataset discovery: direct server contact, LDAP, …
- Data Access/Storage: ROOT, HBOOK, custom, …
- Communications Scheme

Abstract Interfaces: Communications Scheme

- Allows communication between the Agent and Server during job execution
- Responsible for transporting analysis objects and classes (code)
- Transports debugging and monitoring information
  - Arbitrary textual messages from user’s code
  - Exceptions which occur within the user’s code
- Implemented with several different technologies
  - Packet-based two-way communication
  - Client-connection-only communication (polling)
  - Two-way SOAP

Mergeability

For BlueOx to be able to distribute an analysis and combine the results afterwards, the analysis must support mergeability: the ability to merge or add one object of a given class to another of the same class.

- Simple operation for many objects
  - Counters and histograms add arithmetically
  - Lists concatenate (and possibly re-sort)
- BlueOx contains several utility classes which support Mergeability
- An object which contains only mergeable, transient, or static member variables can be merged automatically: Automergeable

BlueOx and AIDA

BlueOx supports the use of the Abstract Interfaces for Data Analysis (AIDA), particularly for histograms and similar objects. BlueOx provides the implementation of Mergeable for AIDA and manages the serialization of AIDA objects.
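The rules on the Mergeability slide (counters and histograms add, and an object built only from mergeable, transient, or static fields can be merged automatically) can be illustrated with a minimal Java sketch. The names and signatures here (`Mergeable.mergeFrom`, `AutoMergeable`, `Counter`, `Histogram1D`, `ToyAnalysis`) are assumptions made for illustration; the actual BlueOx and AIDA interfaces may look quite different, and the reflection-based merge is just one plausible way to get the automatic behavior the slide describes.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical interface; the real BlueOx Mergeable API may differ.
interface Mergeable<T> {
    void mergeFrom(T other);
}

// Counters add arithmetically.
final class Counter implements Mergeable<Counter> {
    private long count;
    void increment() { count++; }
    long value() { return count; }
    @Override public void mergeFrom(Counter other) { count += other.count; }
}

// Fixed-binning histogram; merging adds the bin contents.
final class Histogram1D implements Mergeable<Histogram1D> {
    private final double min;
    private final double max;
    private final long[] bins;

    Histogram1D(int nBins, double min, double max) {
        this.min = min;
        this.max = max;
        this.bins = new long[nBins];
    }

    void fill(double x) {
        if (!(x >= min && x < max)) return;            // drop out-of-range and NaN entries
        int i = (int) ((x - min) / (max - min) * bins.length);
        bins[Math.min(i, bins.length - 1)]++;          // guard against rounding at the upper edge
    }

    long binContent(int i) { return bins[i]; }

    @Override public void mergeFrom(Histogram1D other) {
        if (other.bins.length != bins.length || other.min != min || other.max != max) {
            throw new IllegalArgumentException("incompatible binning");
        }
        for (int i = 0; i < bins.length; i++) bins[i] += other.bins[i];
    }
}

// Assumed base class: merges every non-static, non-transient field via reflection.
abstract class AutoMergeable implements Mergeable<AutoMergeable> {
    @Override
    @SuppressWarnings({"unchecked", "rawtypes"})
    public void mergeFrom(AutoMergeable other) {
        if (other.getClass() != getClass()) throw new IllegalArgumentException("class mismatch");
        for (Field f : getClass().getDeclaredFields()) {
            int mods = f.getModifiers();
            if (Modifier.isStatic(mods) || Modifier.isTransient(mods) || f.isSynthetic()) continue;
            f.setAccessible(true);
            try {
                Object mine = f.get(this);
                if (!(mine instanceof Mergeable)) {
                    throw new IllegalStateException("field " + f.getName() + " is not mergeable");
                }
                ((Mergeable) mine).mergeFrom(f.get(other));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
    }
}

// User analysis with only mergeable and transient fields, so no merge code is needed.
final class ToyAnalysis extends AutoMergeable {
    final Counter events = new Counter();
    final Histogram1D mass = new Histogram1D(10, 0.0, 200.0);
    transient String scratch = "";                     // transient: skipped by the merge

    void analyze(double m) { events.increment(); mass.fill(m); }
}

final class MergeDemo {
    public static void main(String[] args) {
        ToyAnalysis a = new ToyAnalysis();
        a.analyze(115.0);
        a.analyze(91.0);
        ToyAnalysis b = new ToyAnalysis();
        b.analyze(118.0);
        b.analyze(125.0);
        a.mergeFrom(b);                                // a now holds the combined result
        System.out.println("events = " + a.events.value());
        System.out.println("bin 5  = " + a.mass.binContent(5));
    }
}
```

The point of the design is that `ToyAnalysis` contains no merge logic of its own: adding another mergeable field extends the merge automatically, while a field of a non-mergeable type fails loudly instead of being silently dropped.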
BlueOx does not provide a full implementation of AIDA, but rather provides a “wrapper” implementation, which aims to provide the merge and serialization functionality to any AIDA implementation. Currently, the use of JAIDA 3.0 is supported in BlueOx.

Doing Analysis in BlueOx

- User writes an Analyzer class, which may employ various extensions to enable configuration, startup and completion tasks, and interfacing with a GUI.
- Experiment-specific information (such as the data format used for the experiment’s data) must be provided separately to the user. Future revisions of BlueOx may support XML schemas.
- The user employs either a command-line or graphical interface to select data sets, start the job, and follow it to its completion.

Demonstration Analysis

(Quick overview here; come to the demonstration for more details!)
- Data Source: Pythia-generated events for a 500 GeV e+/e- collider, followed by a smearing+reconstruction program
- Dataset discovery: LDAP database
- Brokering: direct-contact brokering
- Authentication: SSH-like key files
- Analysis focus: e+e- → ZH → ννbb

Choosing Datasets

[Figure]

Plots from HZ → ννbb Analysis

[Figure]

Testing BlueOx

- 40 Servers, running on four machines
- 100 Clients, running on one machine
- 100 Jobs/Client (10,000 total jobs executed)

[Figure: Normalized Load (axis from 0.1 to 100.0) versus Time (min) (axis from 0 to 60); one series per Server, labeled A1 to A10, B1 to B10, C1 to C10, D1 to D10]

Future Development

- Increase sophistication of brokering
- Test analysis framework with more real analysis
- Current implementation is not scalable, but we have some ideas on how to improve it.
- We are developing a DataSource which would allow analysis of DØ data by our undergraduates
- Enhance user tools
  - Integration with JAS v3.0
  - Schema-based analysis-creation wizard

Further Information

Web site including “BlueOx Companion” and JavaDoc documentation: http://flywheel.princeton.edu/BlueOx/