Organising Data Access for Diverse Communities: GEOSS and beyond
Massimo Craglia, Elena Roglia
European Commission Joint Research Centre
http://www.geowow.eu/
A lot of data is globally available
• But is it really available?
• Do you know it exists? Can you access the level of
data relevant to your needs?
• Can you understand it and use it to address the
question you have?
• Do you have access to the tools, methods, and
above all a community of users with whom to share
experiences and add to cumulative learning?
Lessons from GEOSS
• The GEOSS Data Collection of Open Resources for
Everyone (GEOSS Data CORE) is a distributed
pool of documented datasets with full, open and
unrestricted access at no more than the cost of
reproduction and distribution.
• Established in 2010; implementation started in
2011 and gathered pace in 2013.
• Survey of the GEOSS community in 2013, as part of
the GEOWOW project, to understand awareness of
the Data CORE
GEOSS Data CORE Survey:
Awareness, Involvement, and Challenges
• 70 respondents from 31 countries, belonging to
different types of organizations involved in GEOSS.
• 24% of respondents were NOT aware of the concept of
GEOSS Data CORE;
• 17% were using the GEOSS Data CORE, 24% were
contributing to it.
• Key barriers: difficulty in finding and discovering GEOSS
Data CORE resources; some thematic areas are poorly
represented;
• Key advantages: the possibility to reduce data costs
and to facilitate the advancement of disciplinary and
interdisciplinary research;
• Key limitations: the spatial extent and temporal
resolution of the data do not fit users’ needs.
• Improve awareness and participation by providing
technical support and disseminating success stories.
Accessibility Analysis of GCI
• Assessed using the 50 GCOS Climate Variables as
keywords to perform a search;
• 126,000 records returned (60% GEOSS Data CORE);
• 8% not providing Distribution Information;
• 3% accessible via OGC protocols;
• 29% mostly accessible via HTTP and FTP protocols;
• 60% do not specify protocols (but with working links).
• Information is lost between the metadata of the raw data
(where it exists) and the metadata returned by a catalogue search.
• Unclear to users whether results represent raw data, processed
products, or outcomes of analyses based on the data.
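To give a sense of scale, the distribution figures above can be converted into approximate record counts. This is only a sketch: it assumes the rounded total of 126,000 can be treated as exact and that the four percentages are mutually exclusive shares of that total (they sum to 100%). The function name and category labels are illustrative.

```python
TOTAL_RECORDS = 126_000  # rounded total from the slide, treated as exact (assumption)

# Shares of returned records by access information, in percent, as reported.
SHARES = {
    "no distribution information": 8,
    "OGC protocols": 3,
    "HTTP and FTP": 29,
    "protocol not specified (working links)": 60,
}


def approximate_counts(total, shares_percent):
    """Convert percentage shares into approximate record counts."""
    return {label: total * pct // 100 for label, pct in shares_percent.items()}


counts = approximate_counts(TOTAL_RECORDS, SHARES)
for label, n in counts.items():
    print(f"{label}: ~{n:,} records")
```

Under these assumptions, only a few thousand of the returned records advertise an OGC protocol, while roughly three in five give a working link without naming a protocol.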
An Australian Geoscience Data Cube
Aaron Sedgmen
Geoscience Australia
GA’s Traditional EO product process
EO products have traditionally been produced on demand for areas of interest
from tape archives of scene-based raw data
• Client requests product
• Identify footprint of product in space or time
• Search catalogue, order scenes
• 1 Petabyte hierarchical archive: millions of individual scenes; tape store accessed by robot
• Orthorectification, calibration, cloud masking, atmospheric correction, mosaicing
• Feature extraction, algorithm application, spectral unmixing
• Product packaging and delivery
“Cubing” Landsat images
[Figure: Landsat images along a time axis are diced into tile squares and stacked ("Dice & Stack").]
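The "dice and stack" idea can be sketched in a few lines. This is a toy illustration, not the Datacube implementation: it assumes every scene is already co-registered on the same grid, represents each scene as a plain list of rows, and uses hypothetical names (`dice`, `build_cube`) and date labels.

```python
from collections import defaultdict


def dice(scene, tile_size):
    """Split a 2-D grid (list of rows) into square tiles keyed by (tile_row, tile_col)."""
    tiles = {}
    n_rows, n_cols = len(scene), len(scene[0])
    for r0 in range(0, n_rows, tile_size):
        for c0 in range(0, n_cols, tile_size):
            tiles[(r0 // tile_size, c0 // tile_size)] = [
                row[c0:c0 + tile_size] for row in scene[r0:r0 + tile_size]
            ]
    return tiles


def build_cube(scenes_by_time, tile_size):
    """Dice every scene, then stack tiles sharing an index in time order."""
    cube = defaultdict(list)
    for timestamp in sorted(scenes_by_time):
        for index, tile in dice(scenes_by_time[timestamp], tile_size).items():
            cube[index].append((timestamp, tile))
    return dict(cube)


# Two hypothetical 4x4 "scenes" acquired at different times.
scene_1 = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
scene_2 = [[value * 10 for value in row] for row in scene_1]

cube = build_cube({"2013-02": scene_2, "2013-01": scene_1}, tile_size=2)
print(sorted(cube))          # tile indices
print(cube[(0, 1)])          # time stack for one tile
```

Each tile index then holds a time-ordered stack, so an analysis over one area reads a single stack instead of searching many whole scenes.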
A paradigm shift from traditional methods
• The data cube holds multiple Landsat products for the entire
archive – removes the need to generate products at time of
request
• Hosting the data cube at NCI co-locates “big data” with
high-performance computing – enables in-situ analysis of the
whole archive
• Computational analysis is moved from the scientist’s local
environment to a central HPC facility
• Removes the need to download and replicate the data
• Provides computing power not otherwise available to many
scientists
• Opens up possibilities to integrate the Landsat archive with
other “big data” datasets hosted at the HPC facility
[Figure: pyramid of data levels. Axis labels: Data Complexity; Potential Number of Users; Difficulty to Understand & Use.]
• Knowledge: Summary Information for Policy Advice
• Information: Calibrated “Cubed” Data for Analysis
• Data: “Raw” Sensor Data
GA Wednesday Seminar 30/10/13 - Datacube
Use Only the Best Ingredients:
Data Provenance in the Datacube
• Tiles link to their source dataset (scene) records in the DB for provenance.
Tiles have no metadata per se.
• Data provenance must be provided by lookups to authoritative metadata.
• Composite data outputs can contain pixel-based provenance,
e.g. four-month non-interpolated median NDVI for the entire Murray Darling
Basin
• Initial Datacube test area
• 2,112,000,000 pixels (i.e. 2.1 billion).
• Each and every pixel can be traced
back to its source observation through
provenance information layers
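A pixel-based provenance layer can be illustrated with a small composite. The sketch below is not the Datacube code: it assumes "non-interpolated" means the median is always an actual observation (for an even number of valid observations the lower middle value is taken rather than an average), marks missing observations such as cloud with `None`, and uses hypothetical scene identifiers and function names.

```python
def median_with_provenance(observations):
    """Pick the median valid observation for one pixel.

    observations: list of (scene_id, value); value is None where there is no data.
    Returns (value, scene_id), both None if no valid observation exists.
    """
    valid = sorted((value, scene_id) for scene_id, value in observations
                   if value is not None)
    if not valid:
        return None, None
    # Lower middle element: always a real observation, never an interpolated mean.
    return valid[(len(valid) - 1) // 2]


def composite(stack):
    """Build a median composite plus a provenance layer of source scene ids.

    stack: dict mapping scene_id -> 2-D grid (list of rows) on a common grid.
    """
    scene_ids = sorted(stack)
    first = stack[scene_ids[0]]
    values, provenance = [], []
    for r in range(len(first)):
        value_row, prov_row = [], []
        for c in range(len(first[0])):
            value, scene_id = median_with_provenance(
                [(s, stack[s][r][c]) for s in scene_ids])
            value_row.append(value)
            prov_row.append(scene_id)
        values.append(value_row)
        provenance.append(prov_row)
    return values, provenance


# Three hypothetical NDVI scenes covering the same 1x2 pixel area.
stack = {"A": [[0.2, None]], "B": [[0.6, 0.4]], "C": [[0.4, 0.8]]}
values, provenance = composite(stack)
print(values)      # composite NDVI values
print(provenance)  # which scene supplied each pixel
```

The provenance grid has the same shape as the composite, so any output pixel can be looked up against the authoritative metadata of the scene that supplied it.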
Layering information access
• Using the data provenance as the link between raw
data, processed data, and analytical products
based on the data
• Metadata linking inputs, workflows and models,
and outputs
• “Drill down” when needed: ensures traceability
and reproducibility, plus access to the relevant level
of information
• Contribution to a new model of Open Science
• Requires collective efforts from both producers
and users of data, models, and products.
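The "drill down" idea can be sketched as a walk over linked metadata records. Everything in this example is hypothetical (record identifiers, workflow names, the `inputs` field); it only illustrates how links from outputs to workflows and inputs let a user trace an analytical product back to raw data.

```python
# Hypothetical provenance records: each item names its level, the workflow
# that produced it, and the inputs it was derived from.
RECORDS = {
    "scene-A": {"level": "raw data", "workflow": None, "inputs": []},
    "scene-B": {"level": "raw data", "workflow": None, "inputs": []},
    "tile-1": {"level": "processed data", "workflow": "calibrate-and-tile",
               "inputs": ["scene-A", "scene-B"]},
    "ndvi-summary": {"level": "analytical product", "workflow": "median-ndvi",
                     "inputs": ["tile-1"]},
}


def drill_down(record_id, records, depth=0, seen=None):
    """Return (depth, record_id, level) tuples from a product back to its raw inputs."""
    seen = set() if seen is None else seen
    if record_id in seen:  # guard against cycles and repeated inputs
        return []
    seen.add(record_id)
    record = records[record_id]
    lineage = [(depth, record_id, record["level"])]
    for parent in record["inputs"]:
        lineage.extend(drill_down(parent, records, depth + 1, seen))
    return lineage


lineage = drill_down("ndvi-summary", RECORDS)
for depth, record_id, level in lineage:
    print("  " * depth + f"{record_id} ({level})")
```

A policy user can stop at the top record, while an analyst can follow the same links down to the processed tiles or the raw scenes.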
Managing Expectations!
Thank you for your attention.