
Distributed Computing
and Analysis
Lamberto Luminari
Italo – Hellenic School of Physics 2004
Martignano - May 20, 2004
Outline

Introduction
– General remarks

Distributed computing
– Principles
– Projects
– Computing facilities: testbeds and production infrastructures

Database Systems
– Principles

Distributed analysis
– Requirements and issues
General remarks

Schematic approach
– For the purpose of clarity, differences among possible
alternatives are stressed: in reality, solutions are often a mix
or a compromise
– Only the main features of relevant items are described: no claim of exhaustiveness

HEP (LHC) oriented presentation
– Examples are mainly taken from the HEP world
– Projects with HEP community involvement are preferred
– Options chosen by LHC
Distributed Computing
Distributed computing

What is it:
– processing of data and objects across a network of connected systems;
– hardware and software infrastructure that provides pervasive (and
inexpensive) access to computational capabilities.

A long story:
– mainframes more and more expensive;
– cluster technology;
– RISC machines very powerful.

What makes it appealing now:
– CPU power!
– Storage capacity!!
– Network bandwidth!!!

... but Distr. Comp. is not a choice,
rather a necessity or an opportunity.
Network performance
Advantages of distributed computing

Scalability and flexibility:
– in principle, distributed computing systems are infinitely scalable: simply
add more units and get more computing power. Moreover you can add or
remove specific resources and adapt the system to your needs.

Efficiency:
– private resources are usually poorly used: pooling them greatly increases
their exploitation.

Reliability:
– the failure of a single component has little effect on the overall performance.

Load balancing and averaging:
– distributing tasks according to the availability of resources optimizes the
behavior of the whole system and minimizes execution time;
– load peaks arising from different user communities rarely coincide, so the
use of resources is averaged (and optimized) over long periods.
Disadvantages of distributed computing

Difficult integration and coordination:
– many heterogeneous computing systems have to be integrated;
– data sets are split across different storage systems;
– many users have to cooperate and share resources.

Unpredictability:
– the amount of available resources may fluctuate widely;
– computing units may suddenly become unavailable or unreachable for long
periods, making the completion time of the tasks running there unpredictable.

Security problems:
– distributed systems are prone to intrusion.
Applications and distributed computing

Suitable:
– high compute to data ratio;
– batch processes;
– loosely coupled tasks;
– statistical evaluations dependent on random trials;
– data mining through distributed filesystems or databases.
(A small sketch of such an embarrassingly parallel workload follows below.)

Unsuitable:
– real time;
– interactive processes;
– strongly coupled;
– sequential.
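As a minimal illustration of the "suitable" category, here is a hedged Python sketch (names and numbers invented) of an embarrassingly parallel Monte Carlo estimate of pi: independent random trials, a high compute-to-data ratio and no inter-task communication, which is exactly why such jobs distribute well.

import random
from multiprocessing import Pool

def count_hits(n_trials):
    # Count random points falling inside the unit quarter-circle.
    hits = 0
    for _ in range(n_trials):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    n_workers, trials_per_worker = 4, 1_000_000
    with Pool(n_workers) as pool:
        hits = sum(pool.map(count_hits, [trials_per_worker] * n_workers))
    print("pi is roughly", 4.0 * hits / (n_workers * trials_per_worker))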
Distributed computing architectures

Peer-to-peer:
– flat organization of components, with similar functionalities, talking to each other;
– suitable for:
 independent tasks or poor inter-task communication;
 access to sparse data organized in a non hierarchical way.

Client-server:
– components with different functionalities and roles:
• processing unit (client), provided with a lightweight agent able to perform simple operations: detect the system status and notify it to the server, ask (or wait) for tasks, accept and send data, execute processes according to priorities or in spare cycles, ...;
• dedicated unit (server), provided with complex software able to take or send computing requests, monitor the status of the jobs sent to the clients, receive the results and assemble them, possibly in a database. It also takes care of security and access policy, and stores statistics and accounting data;
– suitable for:
• complex architectures and tasks (a minimal sketch of the two roles follows below).
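A minimal, in-process Python sketch of the two roles just described (class and task names are invented; a real system would of course communicate over a network): the server keeps the queue, tracks job status and assembles results, while the lightweight client agent asks for tasks and executes them.

from queue import Queue

class Server:
    # Dedicated unit: hands out tasks, tracks their status, assembles results.
    def __init__(self, tasks):
        self.pending = Queue()
        self.status = {}
        self.results = {}
        for t in tasks:
            self.pending.put(t)
            self.status[t] = "queued"

    def next_task(self):
        if self.pending.empty():
            return None
        task = self.pending.get()
        self.status[task] = "running"
        return task

    def accept_result(self, task, result):
        self.status[task] = "done"
        self.results[task] = result

class Client:
    # Lightweight agent: asks the server for work and executes it.
    def __init__(self, server):
        self.server = server

    def run(self):
        while (task := self.server.next_task()) is not None:
            self.server.accept_result(task, task ** 2)  # stand-in for real work

if __name__ == "__main__":
    server = Server(tasks=list(range(10)))
    Client(server).run()
    print(server.results)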
Multi-tier computing systems

Components with different levels of service, arranged
in tiers:
– computing centers (multi-processors, PC farms, data storage
systems);
– clusters of dedicated machines;
– individual, general use PCs.

Different functionalities for each tier:
– amount of CPU power installed and data stored;
– quality and schedule of user support;
– level of reliability and security.
Distributed computing models

Clusters:
– groups of homogeneous, tightly coupled components, sharing file
systems and peripheral devices (e.g., Beowulf);

Pools of desktop PCs:
– loosely interconnected private machines (e.g., Condor);

Grids:
– heterogeneous systems of (mainly dedicated) resources (e.g., LCG).
Comparison of computing models

Condor is a specialized workload management system for compute-intensive jobs. It provides:
– a job queueing mechanism;
– scheduling policy;
– priority scheme;
– resource monitoring;
– resource management.
Users submit their serial or parallel jobs to Condor, which places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.
Unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations. Condor is able to transparently produce a checkpoint and migrate a job to a different machine.
Condor does not require a shared file system across machines: if no shared file system is available, Condor can transfer the job's data files on behalf of the user, or Condor may be able to transparently redirect all the job's I/O requests back to the submit machine.
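A hedged sketch of how a user might drive Condor from Python: it writes a classic submit description and hands it to condor_submit. The executable and data file names are placeholders, and the exact keywords accepted depend on the local Condor installation.

import subprocess

# Classic submit-description keywords; "my_analysis" and the data files are
# placeholders for a real application.
submit_description = """\
universe   = vanilla
executable = my_analysis
arguments  = run_$(Process).dat
output     = job_$(Process).out
error      = job_$(Process).err
log        = jobs.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue 10
"""

with open("my_analysis.sub", "w") as f:
    f.write(submit_description)

subprocess.run(["condor_submit", "my_analysis.sub"], check=True)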
(Figure: resources, data, network)
Distributed computing environment

DCE standards:
– A distributed computing network may include many different systems. The
Distributed Computing Environment (DCE) — formulated by The Open Group —
formalizes the technologies needed to make the components communicate with
each other, such as remote procedure calls and middleware. DCE runs on all
major computing platforms and is designed to support distributed applications in
heterogeneous hardware and software environments.

DCE provides a complete infrastructure, with services, interfaces,
protocols, encoding rules for:
– authentication and security (Kerberos, Public Key certificate);
– object interoperability across different platforms (CORBA: Common Object
Request Broker Architecture);
– directories (with global name and cell name) for distributed resources;
– time services (including synchronization);
– distributed file systems;
– Remote Procedure Call;
– Internet/Intranet communications.
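DCE defines its own RPC machinery; purely as an illustration of the remote procedure call idea, here is a generic Python XML-RPC sketch (not DCE RPC): the client calls a function that actually runs in a server process.

import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

def triangular_number(n):
    # Stand-in for a computation performed on the server side.
    return n * (n + 1) // 2

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(triangular_number)
threading.Thread(target=server.serve_forever, daemon=True).start()

proxy = xmlrpc.client.ServerProxy("http://localhost:8000")
print(proxy.triangular_number(100))   # the call is executed by the server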
Grid computing specifications

The Global Grid Forum (GGF) is the primary organization whose purpose is to define specifications about Grid Computing. It is a forum for information exchange and collaboration among people who are:
– doing Grid research,
– designing and building Grid software,
– deploying Grids,
– using Grids,
spanning technology areas: scheduling, data handling, security, ...

The Globus Toolkit (developed at Argonne National Laboratory and the University of Southern California) is an implementation of these standards, and has become a de facto standard for grid middleware because of some attractive features:
– an object-oriented approach, which allows developers of specific applications to take just what meets their needs, to introduce tools one at a time and to make programs increasingly "Grid-enabled";
– the toolkit software is "open-source": this allows developers to freely make and add improvements.
Globus toolkit



Practically all major Grid projects are being built on protocols and services provided by the Globus Toolkit, a software "work-in-progress" developed by the Globus Alliance, which involves primarily Ian Foster's team at Argonne National Laboratory and Carl Kesselman's team at the University of Southern California in Los Angeles.
The toolkit provides a set of software tools to implement the basic
services and capabilities required to construct a computational Grid,
such as security, resource location, resource management, and
communications.
Globus Toolkit

The Globus toolkit provides a set of software tools to implement the basic
services and capabilities required to construct a computational Grid, such as
security, resource location, resource management, and communications:
– GRAM (Globus Resource Allocation Manager), to convert a request for resources into
commands that local computers can understand;
– GSI (Grid Security Infrastructure), to provide authentication of the user and work
out that person's access rights;
– MDS (Monitoring and Discovery Service), to collect information about resources
(processing capacity, bandwidth capacity, type of storage, etc.);
– GRIS (Grid Resource Information Service), to query resources for their current
configuration, capabilities, and status;
– GIIS (Grid Index Information Service), to coordinate arbitrary GRIS services;
– GridFTP, to provide a high-performance, secure and robust data transfer mechanism;
– Replica Catalog, a catalog that allows other Globus tools to look up where replicas of a given dataset can be found on the Grid;
– Replica Management system, which ties together the Replica Catalog and GridFTP technologies, allowing applications to create and manage replicas of large datasets.
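As a hedged illustration, the classic GT2-era command-line tools can be scripted, for example from Python. The host names below are placeholders, and the commands assume a valid grid proxy and a Globus installation that provides them.

import subprocess

# Run a command on a remote gatekeeper through GRAM (host name is a placeholder).
subprocess.run(["globus-job-run", "ce.example.org", "/bin/hostname"], check=True)

# Transfer a file with GridFTP (again, host and paths are placeholders).
subprocess.run([
    "globus-url-copy",
    "gsiftp://se.example.org/data/run123.root",
    "file:///tmp/run123.root",
], check=True)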
OGSA: the future?
Grid projects
… and many others!
Grid projects
•UK – GRIPP
•Netherlands – DutchGrid
•Germany – UNICORE, Grid project
•France – Grid funding approved
•Italy – INFN Grid
•Eire – Grid project
•Switzerland - Network/Grid project
•Hungary – DemoGrid
•Norway, Sweden – NorduGrid
•………
•NASA Information Power Grid
•DOE Science Grid
•NSF National Virtual Observatory
•NSF GriPhyN
•DOE Particle Physics Data Grid
•NSF TeraGrid
•DOE ASCI Grid
•DataGrid (CERN, ...)
•DOE Earth Systems Grid
•EuroGrid (Unicore)
•DARPA CoABS Grid
•DataTag (CERN,…)
•NEESGrid
•Astrophysical Virtual Observatory
•DOH BIRN
•GRIP (Globus/Unicore)
•NSF iVDGL
•GRIA (Industrial applications)
•GridLab (Cactus Toolkit)
•Grid2003
•CrossGrid (Infrastructure Components)
•…….
•EGSO (Solar Physics)
•EGEE
•………
Middleware projects relevant for HEP

EDG
– European Data Grid (EU project)

EGEE
– Enabling Grids for E-science in Europe (EU project)

Grid2003
– joint project of the U.S. Grid projects iVDGL, GriPhyN and PPDG, and the
U.S. participants in the LHC experiments ATLAS and CMS.
LCG hierarchical information service
Replica management
A Job Submission Example
(Diagram: from the User Interface (UI) a job described in JDL is submitted, together with its input "sandbox", to the Resource Broker; after authorization and authentication, the Broker queries the Information Service and the Data Management Services (LFN->PFN), then passes the job to the Job Submission Service, which runs it on a Computing Element; the job reads its input from a Storage Element located via the brokerinfo file, its status is recorded by the Logging & Book-keeping service, and the output "sandbox" is finally retrieved by the UI.)
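For concreteness, a hypothetical sketch of the kind of JDL file that enters the diagram above and of its submission from the User Interface with the EDG command-line tools. Attribute values and file names are invented; the exact syntax should be checked against the middleware documentation.

import subprocess

# Minimal JDL description (attribute values and file names are invented).
jdl = """\
Executable    = "analysis.sh";
StdOutput     = "analysis.out";
StdError      = "analysis.err";
InputSandbox  = {"analysis.sh", "cuts.conf"};
OutputSandbox = {"analysis.out", "analysis.err", "histos.root"};
"""

with open("analysis.jdl", "w") as f:
    f.write(jdl)

# Submit from the UI; the command prints a job identifier that is later used
# with edg-job-status and edg-job-get-output.
subprocess.run(["edg-job-submit", "analysis.jdl"], check=True)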
Job submission steps (1)
Job submission steps (2)
Portals
Why a portal?
• It can be accessed from everywhere and by "everything" (desktop, laptop, PDA, phone).
• It can keep the same user interface independently of the underlying middleware.
• It must be redundantly "secure" at all levels:
– secure for web transactions,
– secure for user credentials,
– secure for user authentication,
– secure at VO level.
• All available grid services must be incorporated in a logical way, just "one mouse click away".
• Its layout must be easily understandable and user friendly.
Computing facilities (1)

Computing facilities (testbeds or production infrastructures) are
made up of one or more nodes. Each node (computer center or cluster
of resources) contains a certain number of components, which may be
playing different roles. Some are site specific:
– Computing Element: receives job requests and delivers them to the
Worker Nodes, which will perform the real work. The Computing Element
provides an interface to the local batch queuing systems. A Computing
Element can manage one or more Worker Nodes:

Worker Node: the machine that will actually process data. Typically managed via
a local batch system. A Worker Node can also be installed on the same machine
as the Computing Element.
– Storage Element: provides storage space to the facility. The storage
element may control large disk arrays, mass storage systems and the like;
however, the SE interface hides the differences between these systems
allowing uniform user access.
– User Interface: the machine that allows users to access the facility. This
is typically the machine the end-user logs into to submit jobs to the grid
and to retrieve the output from those jobs.
Computing facilities (2)

Some other roles are shared by groups of users or by the whole
grid:
– Resource Broker: receives users' requests and queries the
Information Index to find suitable resources.
– Information Index: resides on the same machine as the Resource
Broker, keeps information about the available resources.
– Replica Manager: coordinates file replication from one Storage
Element to another. Useful for data redundancy but also to move data
closer to the machines which will perform computation.
– Replica Catalog: can reside on the same machine as the Replica
Manager, keeps information about file replicas. A logical file can be
associated to one or more physical files which are replicas of the same
data. Thus a logical file name can refer to one or more physical file
names.
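A toy sketch of the logical-to-physical mapping a Replica Catalog maintains (file and site names invented; a real catalog is a grid service, not an in-memory dictionary).

# One logical file name (LFN) can map to several physical replicas (PFNs);
# all names below are invented.
replica_catalog = {
    "lfn:higgs-candidates-2004.root": [
        "gsiftp://se01.cern.ch/data/higgs-candidates-2004.root",
        "gsiftp://se.mi.infn.it/storage/higgs-candidates-2004.root",
    ],
}

def resolve(lfn):
    # Return every known physical replica of a logical file.
    return replica_catalog.get(lfn, [])

print(resolve("lfn:higgs-candidates-2004.root"))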
Computing facilities relevant for HEP

EDG
– Testbed

LCG
– Production infrastructure

EGEE
– Production infrastructure

Grid3
– Production infrastructure operated jointly by the U.S. Grid projects
iVDGL, GriPhyN and PPDG, and the U.S. participants in the LHC
experiments ATLAS and CMS.
LCG hybrid architecture
Multi-tier hierarchy + Grids
EGEE Timeline



– May 2003: proposal submitted
– July 2003: proposal accepted
– April 2004: project start
Grid3 infrastructure
Virtual Organizations (User Communities)
Multi-VO and one Grid
Grid (shared resources and services)
One VO and multi-Grid
ATLAS Production System
Multi-VO and multi-Grid
(Diagram: several Virtual Organizations, each with its own VO services and private resources, operating across multiple Grids that offer shared resources and services.)
HEP Requirements

User requirements:
– Concerning services, the HEP community has already done a lot of work within EDG and LCG. The basic requirements have already been specified as use cases for HEP data processing (HEPCAL report, May 2002). Using the HEPCAL document to provide templates for requirements analysis, the EDG/AWG (Application Working Group) aims at defining requirements for a high-level common application layer based on the needs of HEP, Bio-medicine and Earth Sciences. High-level APIs for Grid Services have also been defined by the EU-funded project GridLab.
– Concerning resources, the production service must provide a continuous,
stable, robust environment and a controlled, reliable access to the resources.
The agreed sharing policies must be fully implemented and easily changeable.

Besides implementing the user requirements, practical help should be given in interfacing the experiment applications to grid services and in evaluating the performance of the software deployed within the production environment, as well as in pre-production testbeds.
Security

Security Policy
– The security organizational model, so far often tailored to the needs and characteristics of homogeneous communities, should in the future be based on the service needs of many heterogeneous VOs, introducing a new complexity into the Grid organizational and security model.

CA Policy
– A European Grid Policy Management Authority is a prerequisite for
running a Grid infrastructure both in Europe and worldwide. The Grid
Security Infrastructure relies on trusted Certification Authorities
(CA). It is therefore essential that a network of CA’s, based on a
commonly agreed set of requirements, is established and maintained in
Europe.
VO management



As more and more communities join common production infrastructures, VO management is becoming crucial.
Current technology offers support for rather static and large communities. The assignment of access rights is separated into two parts: local resource administrators grant rights to the VO as a whole, while VO administrators grant them to individual members of the community (a small sketch of this follows below).
In the future there will be a need for small (even only two people), short-lived (of the order of a few days) and unforeseen (dynamically discovered) VOs. The goal would be to provide a very fine-grained authorization and access control mechanism, based where applicable on global standards.
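A small Python sketch of that two-part rights assignment (all names invented): sites grant resources to a VO as a whole, the VO registers its members, and a user may act only where both grants hold.

site_grants = {                      # resource -> VOs admitted by the site
    "cnaf-farm": {"atlas", "cms"},
    "lyon-farm": {"atlas"},
}
vo_members = {                       # VO -> registered members
    "atlas": {"maria", "giorgos"},
    "cms": {"anna"},
}

def may_use(user, vo, resource):
    return vo in site_grants.get(resource, set()) and user in vo_members.get(vo, set())

print(may_use("giorgos", "atlas", "lyon-farm"))   # True
print(may_use("anna", "cms", "lyon-farm"))        # False: the site admits no CMS jobs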
Resource allocation and usage

Resource allocation and reservation

Resource usage and accounting
– In order to meet the needs of all the different Grid users, mechanisms will be required
to control and balance usage of the resources (including networks) by highly demanding
applications, and to categorise and prioritise jobs so that they can receive the required
level of service.
– In particular, users should be able to allocate resources both immediately and in
advance. Allocations must be restricted to authenticated users acting within authorized
roles, the services available must be determined by policies agreed with the user
organisations, and the aggregate services made available to VOs must be monitored to
ensure adherence to the agreements.
– A major issue is the control of usage of resources, once access to them has been
established. This includes interfaces to traditional Usage Control mechanisms such as
quotas and limits, and also the extraction and recording of usage for Budgeting,
Accounting and Auditing purposes.
– The usage quotas may be owned either by individuals or by VO's, and specified both in
site-specific or Grid-wide protocols. This will include the ability to allow enforcement of
quotas across a set of distributed resources.
Organizational issues

The need for resource sharing gives rise to a set of organisational issues
to be faced, analysed and solved. Indeed, when a given organisation makes
its own resources available on line:
– Each organisation has its own decision and management independence: the
resources to be shared with other organisations should not jeopardize such
independence.
– Each organisation has its own access policies. It is not true that everybody in the Grid can use everything; rather, new generations of network and grid technologies make it possible to define new sharing models. Each organisation should be able to decide, for each individual data item and each individual resource, which organisations have access/usage rights.
– Each organisation has its own security policies: university security policies are usually completely different from those of a physics laboratory that works in close co-operation with government and the army. In order to guarantee real resource sharing among different kinds of organisations, it is necessary to ensure the maximum level of flexibility in the management of the above mentioned issues.
Requirements in LCG
Requirements are set by the experiments in the SC2 + Requirements and Technical Assessment Groups (RTAGs):

On applications:
– data persistency
– software support process
– mathematical libraries
– detector geometry description
– Monte Carlo generators
– applications architectural blueprint
– detector simulation

On Fabrics:
– mass storage requirements

On Grid technology and deployment area:
– Grid technology use cases
– Regional Center categorization
HEPCAL
LCG RTAG: Common Use Cases for a
HEP Common Application Layer
Requirements are given as a set of use cases free of
implementation details

GENERAL USE CASES:
– Obtain Grid Authorization
– Revoke Grid Authorization
– Grid Login
– Browse Grid Resources
HEPCAL

DATA MANAGEMENT USE CASES:
– Data Set (DS) Metadata Update
– DS Metadata Access
– DS Registration
– Virtual DS Declaration
– Virtual DS Materialization
– DS Upload
– Catalogue Creation
– DS Access
– DS transfer to non-Grid storage
– DS Replica Upload
– DS Access Cost Evaluation
– DS Replication
– Physical DS Instance Deletion
– DS Deletion
– Catalogue Deletion
– Read from Remote DS
– DS Verification
– DS Browsing
– Browse Expt Database
HEPCAL

JOB MANAGEMENT USE CASES:
– Job Catalogue Update
– Job Catalogue Query
– Job Submission
– Job Output Access or Retrieval
– Job Error Recovery
– Job Control
– Steer Job Submission
– Job Resource Estimation
– Job Environment Modification
– Job Splitting
– Production Job
– Analysis
– DS Transformation
– Job Monitoring
– Conditions Publishing
– Software Publishing
– Simulation Job
– Exp't Software Dev for Grid
HEPCAL

VO MANAGEMENT USE CASES:
– Configuring the VO:
 Configuring the DS metadata catalogue (either initially or reconfiguring).
 Configuring the job catalogue (either initially or reconfiguring).
 Configuring the user profile (if this is possible at all on a VO basis).
 Adding or removing VO elements, e.g. computing elements, storage elements, etc…
 Configuring VO elements, including quotas, privileges etc.
– Managing the Users:
 Add and remove users to/from the VO.
 Modify the user information ( privileges, quotas, priorities…) either for single users
or for subgroups of users within a VO.
– VO wide resource reservation
 The Grid should provide a tool to estimate the time-to-completion given as input an
estimate of the resources needed by the job. This is needed in particular to estimate
the access cost.
 There should be use cases for releasing reserved resources, and system use cases
for what to do in case a user does not submit a job for which resources are reserved.
– VO wide resource allocation to users or groups/users of a VO
– Software (or condition set) publishing, i.e. making it available on the Grid
Database Systems
Database Systems

Database :
– one or more, large structured sets of persistent data. Usually associated
with software to update and query the data. A simple database might be a
single file containing many records, each of which contains the same set of
fields, where each field is a certain fixed width. A database is one
component of a database management system.

Database Management System (DBMS):
– a set of programs (functions) that allows one to manage the large, structured sets of persistent data which make up the database, and provides access to the data for multiple, concurrent users whilst maintaining the integrity of the data. The DBMS is in charge of all the functionalities related to the database: access, security, storage…
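As a toy illustration of the "single file of fixed-width records" mentioned in the database definition above, the Python sketch below packs and reads back fixed-width records with the standard-library struct module (the field layout is invented).

import struct

# Fixed-width record: 10-byte name, 4-byte run number, 4-byte energy, 6-byte tag.
RECORD = struct.Struct("<10sif6s")

with open("runs.db", "wb") as f:
    f.write(RECORD.pack(b"run-00042 ", 42, 13.6, b"barrel"))
    f.write(RECORD.pack(b"run-00043 ", 43, 13.8, b"endcap"))

with open("runs.db", "rb") as f:
    while chunk := f.read(RECORD.size):
        name, run, energy, tag = RECORD.unpack(chunk)
        print(name.decode().strip(), run, round(energy, 1), tag.decode())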
Database Management Systems

DBMS provides:
– security facilities to prevent unauthorized users from accessing the system, using
names and passwords to identify operators, programs and individual machines and
sets of privileges assigned to them; these privileges can include the ability to read,
write and update data in the database;
– lock facilities to maintain data integrity; locks are used for reads and writes to chunks of data: by doing this, only one user at a time can alter data, or users can be prevented from accessing data being changed. These requirements are referred to as ACID (Atomicity, Consistency, Isolation and Durability):
• Atomicity: all the parts of a transaction's execution are either all committed or all rolled back. All changes take effect, or none do. This ensures that there is no erroneous data in the system, or data which does not correspond to other data as it should.
• Consistency: the database is transformed from one valid state to another valid state. A transaction is legal only if it obeys user-defined integrity constraints. Illegal transactions aren't allowed and, if an integrity constraint can't be satisfied, the transaction is rolled back to its previously valid state and the user is informed that the transaction has failed.
• Isolation: the results of a transaction are invisible to other transactions until the transaction is complete.
• Durability: once a transaction has been committed (completed), its results are permanent and can survive future system and media failures.
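A short sketch of atomicity and consistency in practice, using SQLite from Python (any transactional DBMS behaves similarly; the table and values are invented): one of the updates violates a constraint, so the whole transaction is rolled back and the table is left unchanged.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE quota (site TEXT PRIMARY KEY,"
            " cpu_hours INTEGER CHECK (cpu_hours >= 0))")
con.executemany("INSERT INTO quota VALUES (?, ?)", [("cern", 100), ("cnaf", 50)])
con.commit()

try:
    with con:  # one transaction: both updates take effect, or neither does
        con.execute("UPDATE quota SET cpu_hours = cpu_hours - 80 WHERE site = 'cnaf'")
        con.execute("UPDATE quota SET cpu_hours = cpu_hours + 80 WHERE site = 'cern'")
except sqlite3.IntegrityError:
    print("constraint violated, transaction rolled back")

print(dict(con.execute("SELECT site, cpu_hours FROM quota")))  # table is unchanged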
Database Systems

Databases are based on many different models, each of which is
designed with a specific problem, industry or set of functions in mind.
Here we attempt to look at the main types in some depth:
– Relational Databases: data are structured in a series of tables, which have columns
representing the variables and rows that contain specific instances of data.
Currently the most widespread model.
– Object Oriented Databases: information is stored as a persistent object, and not as
a row in a table. User defines objects and operations which can be executed on them.
– Object Relational Databases: relational systems to which object oriented functions
are added. They allow data to be manipulated in the form of objects, as well as
providing the traditional relational interface.
– Distributed Databases: data are stored on two or more computers, called nodes,
connected over a network across a country, continent or planet.
– Multimedia Databases: a model for storing several different types of files (e.g. text,
audio, video and images) in a single database.
– Network Databases: organizes data in a network of linked records. A very early form
of database, fast but not very adaptable, which is little used at present.
– Hierarchical Databases: data are stored as records, linked with Parent-Child
Relationships. Mostly used in the past on mainframes.
Relational Database Systems

The Relational Model is one of the oldest models used for creating a database,
and the one that is used by the majority of businesses today. It was first
outlined in a paper published by Ted Codd in 1970. The relational model is based
on Set Theory and Predicate Logic:
– set theory allows data to be structured in a series of tables, which have columns
representing the variables and rows that contain specific instances of data. These
tables are organized using normalization, which is a process (derived from Normal
Forms theory) of reducing the occurrences of repeated data by breaking it into
smaller pieces and creating new tables (e.g., personal data of a customer).
– predicate logic is the basis of the query language, i.e. the set of commands that allows one to insert, retrieve, modify or delete data according to some specified criteria. Data can also be virtually or effectively joined in new tables.

The current standard for relational databases is set out in the Structured Query Language (SQL). Version 2 of the language is currently in use, with Version 3 expected to be released in the near future by the International Organization for Standardization (ISO) and the American National Standards Institute (ANSI).
– The most widely used relational database systems are produced by Oracle Corporation, Microsoft, Sybase and IBM, but there is a large number of other RDBMSs, designed either as general systems or for specific applications; those used in HEP include MySQL and PostgreSQL.
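To make the table-and-query picture concrete, here is a small self-contained sketch using Python's built-in SQLite module (table and column names are invented): customer data kept in its own normalized table and joined back at query time.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE purchase (id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(id),
                           item TEXT, price REAL);
    INSERT INTO customer VALUES (1, 'Rossi', 'Roma'), (2, 'Bianchi', 'Milano');
    INSERT INTO purchase VALUES (10, 1, 'cable', 4.5), (11, 1, 'crate', 120.0),
                                (12, 2, 'disk', 80.0);
""")

# Query: total spending per city, joining the two tables back together.
for city, total in con.execute("""
        SELECT c.city, SUM(p.price)
        FROM customer c JOIN purchase p ON p.customer_id = c.id
        GROUP BY c.city"""):
    print(city, total)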
Object Oriented Database Systems

The ODBMS’s were introduced to overcome many restrictions imposed by the
relational model on certain types of data (mainly in case of huge amounts or
complex structures). Its main advantage is the degree of low level control of
the system it allows the programmer. This gives the programmer control of
how the data is to be store and manipulated:
– information is stored as a persistent object (and not as a row in a table). This makes
it more efficient in terms of storage space requirements and ensures that users can
only manipulate data in the ways the programmer has specified. It also saves on the
disk space needed for queries, as instead of having to allocate resources for the
results, the space required is already there in the objects themselves.

Because of the specific low-level methods used in an ODBMS, it is very difficult for third parties to produce add-on products. Whilst relational databases can benefit from software produced by other vendors, users of ODBMS's either have to produce additional software in house, by contracting other firms, or in collaboration with other organizations using the same system.
– The first commercially available object oriented DBMS became available in the mid-1980's. By the early 1990's a range of ODBMS's was available from a variety of vendors. Objectivity/DB is the most widely used in the HEP community.
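Objectivity/DB's own API is not shown here; as a loose stand-in for the "persistent object" idea, the Python sketch below stores and retrieves user-defined objects by key with the standard-library shelve module (class and keys are invented).

import shelve

class Track:
    # User-defined object with an operation defined on it.
    def __init__(self, pt, eta, phi):
        self.pt, self.eta, self.phi = pt, eta, phi

    def momentum_ok(self):
        return self.pt > 1.0

with shelve.open("event_store") as db:
    db["event-001/track-0"] = Track(2.3, 0.4, 1.1)   # stored as an object, not a row

with shelve.open("event_store") as db:
    track = db["event-001/track-0"]                  # comes back as a Track object
    print(track.pt, track.momentum_ok())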
Distributed Database Systems

Distributed databases have the common characteristics that they are
stored on two or more computers, called nodes, connected over a
network. They are classified as homogeneous and heterogeneous :
– homogeneous databases: use the same DBMS software and have the same
applications on each node. They have a common schema (a file specifying the
structure of the database), and can have varying degrees of local autonomy. They
can be based on any DBMS which supports this function, but it is not possible to
have more than one DBMS type in the system. To be efficient, they have to have
very large network connections and a lot of processing power.
– heterogeneous databases: have a very high degree of local autonomy. Each node in the system has its own local users, applications and data, deals with them itself, and only connects to other nodes for information it does not have. This type of distributed database is often just called a federated system or a federation. It is becoming more popular with organizations, both for its scalability, the reduced cost of adding extra nodes when necessary, and the ability to mix software packages. Unlike homogeneous systems, heterogeneous systems can include different database management systems. This makes them appealing to organizations, since they can incorporate legacy systems and data into new systems.
Beyond standard Database Systems
Distributed Analysis
Distributed Analysis


Within LCG a working group with representatives from all LHC experiments is working on a blueprint architecture for grid services: ARDA (A Roadmap to Distributed Analysis). This will serve as a first input to the EGEE Architecture team. The HEPCAL work is continuing in the framework of the LCG/GAG (Grid Applications Group), developing use cases and requirements for the analysis of physics data. This will also give important input to architecture and design work.
GAG reports:
– HEPCAL: Systematic descriptions of HEP Grid Use Cases, CERN-LCG-2002-020 (29 May 2002), lcg.web.cern.ch/LCG/sc2/RTAG4/finalreport.doc; HEPCAL-prime: cern.ch/fca/HEPCAL-prime.doc
– HEPCAL II: Analysis Use Cases, CERN-LCG-2003-032 (29 October 2003), lcg.web.cern.ch/LCG/SC2/GAG/HEPCAL-II.doc
ARDA working group mandate



• To review the current Distributed Analysis activities and to capture their architectures in a consistent way.
• To confront these existing projects with the HEPCAL II use cases and the users' potential work environments, in order to explore potential shortcomings.
• To consider the interfaces between Grid, LCG and experiment-specific services:
– review the functionality of experiment-specific packages, their state of advancement and role in the experiment;
– identify similar functionalities in the different packages;
– identify functionalities and components that could be integrated in the generic GRID middleware.
• To confront the current projects with critical GRID areas.
• To develop a roadmap specifying wherever possible the architecture, the components and potential sources of deliverables, to guide the medium-term (2 year) work of the LCG and the DA planning in the experiments.
ARDA Architecture
SEAL Overview
Shared Environment
for Applications at LHC
SEAL aims to



– provide the software infrastructure, basic frameworks, libraries and tools that are common among the LHC experiments;
– select, integrate, develop and support foundation and utility class libraries;
– develop a coherent set of basic framework services to facilitate the integration of LCG and non-LCG software.
The scope of the SEAL project is basically the scope of the LCG Applications Area.
PROOF (Parallel ROOT Facility)


• Collaboration between the core ROOT group at CERN and the MIT Heavy Ion Group.
• Part of, and based on, the ROOT framework:
– makes heavy use of ROOT networking and other infrastructure classes.
• Currently no external technologies.
• The PROOF system allows:
– parallel analysis of trees in a set of files,
– parallel analysis of objects in a set of files,
– parallel execution of scripts,
on a cluster of heterogeneous machines.
Useful links

Projects:
– EDG (European Data Grid): http://eu-datagrid.web.cern.ch/eu-datagrid/
– GGF (Global Grid Forum): http://www.gridforum.org/
– Globus: http://www.globus.org/
– LCG (LHC Computing Grid): http://lcg.web.cern.ch/LCG/
– Pool (Pool Of persistent Objects for LHC): http://pool.cern.ch/