Data on the Web life cycle + best practices

Data on the Web Life Cycle
Bernadette Farias Lóscio
March, 2014
Outline
• Definition of data on the Web
• Data on the Web life cycle
– Spiral model
– Overview
• Data collection
• Data generation
• Data distribution
• Data usage
• Data on the Web life cycle + best practices
– Examples of best practices
Data on the Web
Data from diverse domains (ex: governmental
data, cultural heritage, scientific data, cross-domain
data) available on the Web in a machine-processable
format.
Data on the Web Life Cycle
A set of tasks or activities that take place during the
process of publishing and using data on the Web.
The process may pass through some number of iterations
and may be represented using a spiral model.
Data on the Web Life Cycle
[Figure: life cycle diagram. Author: Bernadette Lóscio]
An overview of the
DATA ON THE WEB LIFE CYCLE
Data on the Web Life Cycle
• Data collection
– Source selection: identification of data sources
that may offer relevant data (ex: relational
databases, XML files, Excel documents)
Data on the Web Life Cycle
• Data Generation (1st iteration)
– Dataset project
• Define the schema of the target dataset (structural
metadata)
• Choose standard vocabularies
– Data (ex: FOAF, DC, SKOS, Data Cube)
– Dataset (ex: DCAT, PROV, VoID, Data Quality Vocab)
– Data Catalog (ex: DCAT)
• Choose data formats (machine-processable data)
• Create new vocabularies
•…
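The vocabulary choices above can be illustrated with a minimal sketch of a dataset description that uses DCAT and Dublin Core terms, written as a JSON-LD dictionary. The `describe_dataset` helper and all `example.org` URLs are hypothetical and exist only for illustration; the namespace URIs and the term names (`dcat:Dataset`, `dcat:distribution`, `dcat:downloadURL`, `dcat:mediaType`, `dct:title`, `dct:description`) come from the published vocabularies.

```python
import json


def describe_dataset(dataset_uri, title, description, download_url, media_type):
    """Build a minimal DCAT dataset description as a JSON-LD dictionary."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": dataset_uri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:downloadURL": {"@id": download_url},
                "dcat:mediaType": media_type,
            }
        ],
    }


# All values below are made up for the example.
record = describe_dataset(
    dataset_uri="http://data.example.org/dataset/sample",
    title="Sample dataset",
    description="An illustrative dataset description.",
    download_url="http://data.example.org/dataset/sample.csv",
    media_type="text/csv",
)
print(json.dumps(record, indent=2))
```

Reusing existing terms keeps the description readable by any consumer that already understands DCAT; a new vocabulary is only needed for what the standard ones do not cover.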
Data on the Web Life Cycle
• Data Generation (2nd iteration)
– ETL process (Extract, Transform and Load)
• Extract data from the selected data sources, transform
the data according to the decisions made during the
dataset project, and load the data into the target
dataset
– Metadata generation
• Produce (manually or automatically) structured
metadata according to the metadata standards defined
during the dataset project
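The two steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the source is an inline CSV string with made-up values, the `FIELD_MAP` column mapping stands in for the decisions of the dataset project, and the function names are hypothetical.

```python
import csv
import io
import json

# Made-up source data standing in for a selected data source.
SOURCE_CSV = """City,Pop
Alpha,100
Beta,250
"""

# Hypothetical mapping from source columns to the target schema.
FIELD_MAP = {"City": "name", "Pop": "population"}


def extract(text):
    """Read the source rows as dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Rename columns to the target schema and cast population to int."""
    records = []
    for row in rows:
        record = {FIELD_MAP[key]: value.strip() for key, value in row.items()}
        record["population"] = int(record["population"])
        records.append(record)
    return records


def load(records):
    """Serialize the target dataset in a machine-processable format (JSON)."""
    return json.dumps(records, indent=2)


def generate_metadata(records):
    """Produce simple structured metadata automatically from the records."""
    return {"record_count": len(records), "fields": sorted(records[0].keys())}


records = transform(extract(SOURCE_CSV))
target_dataset = load(records)
metadata = generate_metadata(records)
```

In a real setting the load step would write to a file or a store, and the metadata would follow the standards chosen during the dataset project.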
Data on the Web Life Cycle
• Data Distribution (1st iteration)
– URIs project
• Design URIs that will persist and will continue to mean
the same thing in the long term
– Choose one or more solutions for data publishing
• data catalogue, API, SPARQL endpoint, dataset dump, …
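One possible way to keep URIs stable is to build them only from names that are unlikely to change, leaving out file extensions, server technology and similar details. The sketch below assumes a hypothetical base address (`http://data.example.org`) and a hypothetical `/dataset/<name>/<id>` pattern; neither is prescribed by the slides.

```python
import re

# Hypothetical base address, used only for this example.
BASE = "http://data.example.org"


def slugify(text):
    """Lowercase the text and replace runs of other characters with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower())
    return slug.strip("-")


def mint_uri(dataset, identifier):
    """Build a URI from the dataset name and a record identifier only."""
    return f"{BASE}/dataset/{slugify(dataset)}/{slugify(str(identifier))}"


uri = mint_uri("City Population", "Alpha 01")
print(uri)  # http://data.example.org/dataset/city-population/alpha-01
```

Because the URI carries no implementation detail, the publishing solution behind it (catalogue, API, dump) can change without breaking links.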
Data on the Web Life Cycle
• Data Distribution (2nd iteration)
– Publish data and metadata
• Make data and metadata available on the Web
• Data Distribution (3rd iteration)
– Update data
• Make a new version of the dataset available on the
Web
– Update metadata
• Make a new version of the metadata available on the
Web
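Publishing updates can be sketched as appending entries to a version history, so that each new version of the data comes with updated metadata. The `publish_version` function and the entry fields (`version`, `issued`, `record_count`) are assumptions made for this example, not part of any standard.

```python
from datetime import date


def publish_version(history, records, issued):
    """Return a new history list with one more version entry appended."""
    entry = {
        "version": len(history) + 1,
        "issued": issued.isoformat(),
        "record_count": len(records),
    }
    return history + [entry]


# Made-up records and dates for the example.
history = []
history = publish_version(history, ["a", "b"], date(2014, 3, 1))
history = publish_version(history, ["a", "b", "c"], date(2014, 4, 1))
latest = history[-1]
```

Earlier entries are never modified, so consumers can still see what each previous version looked like.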
Data on the Web Life Cycle
• Data usage
– Explore data
• Identify important aspects of the data and bring them
into focus for further analysis
– Analyze data
• Develop applications, build visualizations, …
– Give feedback
• Provide useful information about the dataset (ex:
dataset relevance, data quality,…)
• Provide data usage descriptions
An overview of the
DATA ON THE WEB LIFE CYCLE +
BEST PRACTICES
Data on the Web Best Practices
• Best practices may be applied during the
whole process of publishing and using data on
the Web.
• Best practices may be defined according to
the activities performed in each one of the
quadrants (or tasks).
Data on the Web Life Cycle + Best Practices
[Figure: life cycle diagram with best practices. Author: Bernadette Lóscio]
Examples of Best Practices
• Data collection
– Best practices:
• Have a catalogue to describe potential data sources, i.e., data sources that
could provide data to be published on the Web
• …
• Data Generation
– Best practices
• Document the process of data generation
• Use standard vocabularies to describe data
• Use standard vocabularies to describe datasets and data catalogues (ex:
DCAT)
• Provide stable URIs
• Provide data in machine-processable formats
• Provide metadata to describe data
• …
Examples of Best Practices
• Data Distribution
– Use standard ways to distribute data (ex: data catalogues and APIs)
– Provide details about data access
– Provide details about data licence
– Provide details about dataset provenance and quality
– Provide a schedule of dataset updates
– Keep a dataset history
– Provide ways to collect feedback from data consumers
– Announce the publication of new datasets or new versions of existing
datasets
– …
• Data usage
– Provide feedback about datasets
– Provide descriptions about the usage of the dataset
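Some of the distribution practices above can be checked automatically before a dataset is published. The sketch below assumes a hypothetical metadata dictionary with the keys `access_url`, `licence`, `provenance` and `update_schedule`; these names are illustrative and do not come from any vocabulary.

```python
# Hypothetical metadata keys, one per distribution best practice checked.
REQUIRED_FIELDS = ("access_url", "licence", "provenance", "update_schedule")


def missing_fields(metadata):
    """List the required fields that are absent or empty in the metadata."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


# Made-up metadata for the example.
draft = {"access_url": "http://data.example.org/dataset/sample", "licence": ""}
problems = missing_fields(draft)
print(problems)  # ['licence', 'provenance', 'update_schedule']
```

A check like this only confirms that the details were supplied; whether they are accurate still depends on the publisher and on feedback from data consumers.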
Data on the Web Best Practices
• For each best practice, guidance on how to
implement it must be provided
• Some best practices may be implemented in
more than one way
(to be continued)