DF IG Meeting Notes 20140924

Notes of BoF Meeting 20140924
Conf/rda/data fabric
Keith G Jeffery 20140924
Chairs: Peter Wittenburg & Gary Berg-Cross
Introduction
This is a BoF. Email exchange leading to ideas on a data fabric. Not an IG (although “approved” and
listed as such on the RDA site) since not yet a clear definition of data fabric. Some history, 4 groups
(DTR, DFT, PP, PIT) started off in RDA with MD (Metadata) a bit later. These groups just started to do
urgent work without considering a wider picture. More groups started, such as provenance and
brokering, along with the whole area of publishing and citation, and also scientific culture/ethics and legal aspects.
Also a view from the community through the interviews done by DFT: data management and
processing is time consuming, costly. Federating data is costly and time-consuming. Lack of
software systems to manage data. Clear to all need to change organisation and procedures but
risky, lacking data professionals, needs high flexibility. A diagram of data fabric context generated –
not an architecture but a figurative view of what most people do working in data science – a sort of
generalised workflow. The diagram led to some discussion: importance of policy, what can be
automated.
RDA: how to maximally support the machinery of this diagram for the scientist. Make science
reproducible. Can we learn from internet – are PIDs equivalent of IP numbers; are objects
equivalent to packets. Also access via metadata, access via PID, access via paths.
Questions arise:
1. Do we need a data fabric IG? There is a reproducibility group.
2. What is the scope of the RDA data fabric?
3. What are the characteristics of the RDA data fabric?
4. What should the DF IG do within RDA, and what not?
DF is about (slide) and not about (slide); main point: NOT an overarching architecture, nor an
implementation, or specific technologies and tools and not everything that is RDA! Email discussion
indicated DF is domain of registered data objects, repositories, policy principles….
Gary provided some slides:
The idea for this IG emerged from the discussions amongst the chairs of various RDA WGs.
DF is disparate components being made to work together.
Characteristics of a Data Fabric:
We are just beginning to scout out the landscape of data fabric. In one view it is a minimalistic
set of infrastructure and service requirements by which services can plug into (belong to) the
defined fabric. In a data fabric we ask how the separate components, developed separately, can
be made to work together; this means that for different sets of components the data fabric will be
different. We note, strongly, that it is meant as a descriptive/conceptual way to deal with the
interrelation between many components, rather than prescriptive (like you would have with an
architecture).
Need to identify components and their positioning in the landscape. Move from raw data to useful
and value-added data. DF BoF is a forum to discuss the various and alternative views.
Reagan Moore provided an Infrastructure view of a Data Fabric and its scope:

A data fabric is the set of software and hardware infrastructure components that are used to
manage data, information, and knowledge.

When an enterprise implements a data management solution, one of multiple types of DF
infrastructure is typically chosen to enable the processes:

Data management – enterprise to build a data repository, manage an information
catalog, & enforce management policy.

Data analysis – enterprise to process a data collection, apply analysis tools, and
automate a processing pipeline.

Data preservation – enterprise to build reference collections and knowledge bases
that comprise the intellectual capital, while managing technology evolution.

Data publication – discovery and access of data collections.

Data sharing – controlled sharing of a data collection, shared analysis workflows,
and information catalogs.
Beth Plale added her view in discussions preceding the BoF meeting, with some desired
characteristics of a DF (some of them at least a conceptually prescriptive view):

Be self-documenting – a service contributes to the lifecycle of data objects it handles and
must keep track of the scientifically relevant actions it performs on those data objects.

The resulting log files are periodically sent to a provenance consolidator.

Track data objects through its service processing using one of the well-known object
identifier schemes.

Identify itself as one type of service as drawn from an RDA-agreed list of service
types.

Implement an interface to a publish-subscribe system which serves as
the Data Fabric Control mechanism.
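The characteristics listed above can be illustrated with a minimal Python sketch. It is not part of the meeting material and not an RDA product. The names ControlBus, FabricService, SERVICE_TYPES and new_pid are hypothetical, the list of service types is a stand-in for the RDA-agreed list (which the notes do not enumerate), UUID URNs stand in for "one of the well-known object identifier schemes", and an in-process callback list stands in for the publish-subscribe control mechanism.

```python
import uuid
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical stand-in for an RDA-agreed list of service types.
SERVICE_TYPES = {"repository", "analysis", "preservation", "publication"}


class ControlBus:
    """Toy in-process publish-subscribe channel (assumed control mechanism)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)


def new_pid():
    """Mint an identifier; a UUID URN is used here purely as an example scheme."""
    return f"urn:uuid:{uuid.uuid4()}"


class FabricService:
    """A service that documents the actions it performs on data objects."""

    def __init__(self, name, service_type, bus):
        # Identify itself as one type of service from the agreed list.
        if service_type not in SERVICE_TYPES:
            raise ValueError(f"unknown service type: {service_type}")
        self.name = name
        self.service_type = service_type
        self.bus = bus
        self.log = []

    def process(self, pid, action):
        # Track the object by its identifier and record the action taken.
        record = {
            "pid": pid,
            "service": self.name,
            "service_type": self.service_type,
            "action": action,
            "time": datetime.now(timezone.utc).isoformat(),
        }
        self.log.append(record)
        return record

    def flush_log(self):
        # Send accumulated records to subscribers of the "provenance" topic
        # (e.g. a provenance consolidator) and return how many were sent.
        sent = len(self.log)
        for record in self.log:
            self.bus.publish("provenance", record)
        self.log = []
        return sent
```

A consolidator would subscribe with bus.subscribe("provenance", handler), and a service would call flush_log() on whatever schedule is agreed; the sketch leaves scheduling, persistence and network transport out on purpose.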
Group Discussion
This is incredibly important; flexibility, no lock-in. Danger – quickly into architecture and getting into
the architectural weeds which are too complex.
Ruth noted that there are “so many overlapping interest groups”, all overlapping with this topic
(such as Domain Repository). Is there a better way to get an overview & oversight? Is that the job of
the co-chairs? Can we combine the IGs? Response from Beth: TAB looking at this.
This is an organisational as well as a conceptual question (it can be worked through). The conceptual aspect needs to
go forward, weaving together the chairs of the interest groups. We need some language and
visualisations for ‘fabric’ to avoid architectures and implementation of components, interfaces.
Then need to compare data fabrics – texture, weave, colours.
Peter noted that we have some idea of a fabric from the interviews of data management efforts.
We will make this report available.
Need to agree scope before deciding if we wish to have an interest group.
One view – how to use the data out of the fabric. Our member from Japan noted that it is hard to sell
the data fabric idea: there is no language to ‘sell’ it. We need a language of data sharing for domain-oriented
scientists. Gary also suggested that we need to understand why domain scientists don’t buy into the
data sharing vision that has been around for a while. Is it that they are threatened by their data
management being disrupted as “infrastructure” is improved and new standards applied? Is it
because they need training time on new standards etc.?
Importance to add value to the data through services. Need to federate.
Why doing this - researchers (research managers, innovators, citizen scientists). Stress benefits not
features. There are e-research papers - UK, Europe and subsequently Tony Hey 4th paradigm
(dedicated to Jim Gray). Whole effort through GRIDs ==> CLOUDs. Top level: ‘user neither knows
nor cares where computing is done as long as request is satisfied’.
 Action: supply papers
Need to take components and weave them together; moving data between components smoothly.
Gary: Use cases are important; these allow us to handle some of the complexity that Alan warned us
about. We can walk through the fabric/architecture such as drawn by Beth (see further below) to
understand how particular components work for some activity like “data reuse” or “search using
metadata”.
Suggestion: use systems engineering principles as the underlying mechanism; Peter objected that we say
everything has already been done, yet data practitioners are struggling.
 Action: supply relevant papers which go back to 98
Need to say how RDA data fabric would address reproducibility.
Some researchers not happy with not knowing how done – need detail of processes on data –
provenance.
Problem of scale – things get so large cannot manage as a human - need machine assistance.
Part of the problem – missing from the picture are all the tools scientists use to get the data in and
reasonably handled.
Need to describe the components together with benefits and drawbacks understandable by the end users (researchers) so have an open marketplace of interoperating components.
Not everyone agreed that computer scientists have solved problems – computer scientists ignore
users’ concern. Companies survey user requirements. Users employ tools – they like the tools they
use.
In RDA some WGs successful and made products. Have data and reference to data and on other side
workflows and functions. Need to bring together at this moment. Main components of DF could
thus be defined.
What the fabric is meant to do for users and how that results in tools. An example described was the
Virtual observatory – multiple projects – framework and tools produced. Done by stating ‘this is
how you get data from an archive’ or more generally defined as services.
Mark: former practitioner. All discussion on software and tools; problem not software and tools but
the data. Let us get the data right.
Beth: Drew diagram - end user use case, uses two distributed repositories, using PIDs and data
types/terminology foundations. There is a propagation of provenance implied here. We can answer
the question of what data repositories are involved. This helps with the stress of Big Data.
Hans noted that this is important for data intensive science and that there will be a synergy across
many dimensions. Some principles may come out of this effort.
Tim noted that if we have complex paths we may not get reproducible science unless we have good
provenance.
This seemed so clear that the question arose – “Should RDA do a demo of this?”
Discussion on different ways of demonstrating the DF as depicted by Beth.
Chuck Vardeman added an example from his work with CERN data. Need to capture workflow as, and
with, metadata; upload to a Fedora repository; then another system can download it and use it. Get
provenance along the way.
Peter thought that the RENCI work on genomics might also be relevant.
Rainer suggested that we gather use cases from RDA members and groups and make our own
people happy.
Rainer: how about use case from domain-specific group e.g. photon and neutron science.
Beth: if interest group can fit and demonstrate using RDA products that would be great.
“The diagram is useful.” Based on this conversation, people thought the group important: data
intensive science: reproducibility, data re-use & sharing. Stick to those issues. Need to bring
together all RDA groups and with use cases – all feed in to this. This would give a view of
components needed for reproducibility, data sharing etc. Also principles for action.
Peter & Gary: recapitulate. Lot of support, some scepticism of DF IG. Services, data repositories,
provenance, shared workflows seem important elements of the DF view. Noted existing languages
(systems engineering) and the need not to re-invent; various diagrams representing views but need better
diagrams. Need more use cases and literature review. There is a diversity of initiatives and note
overlap with other efforts and WGs. The WDS mentioned in Science Stream has some overlap
looking at a Knowledge Network.
Have to do a lot of interaction. Need key persons to read all the stuff. Rob Pennington has offered
to help but could not be here today. Peter motivated to write ‘white paper’ by Christmas. Peter
and Gary continue as organising chairs. Agreed editors: Peter, Gary, Beth, Ray, Chuck.
 Action: produce white paper
Last Comments:
Key idea: maximising value of outputs of research. Need roadmap/guide. DF as an IG can propose
WGs to get things done.