eScience Needs in the Big Data Era
e-IRG Workshop
Athens, Greece, 9-10 June 2014
Peter Baumann, Dimitar Misev
Jacobs University | rasdaman GmbH

eScience Needs :: Brussels :: P. Baumann

Array DB Research @ Jacobs University
Large-Scale Scientific Information Systems group
  • massive n-D array services & beyond
  • www.jacobs-university.de/lsis
Main results:
  • Pioneer Array DBMS, rasdaman
  • Standardization: editor of "Big Geo Data" standards, ISO Array SQL candidate standard
[Figure: rasdaman visitors 2013+]
ISO: member, SC32 / WG3 SQL; SC32 Study Group on Big Data; OGC liaison, TC211
Open Geospatial Consortium: co-chair, BigData.DWG, WCS.SWG; Coverages.DWG; Temporal.DWG
Research Data Alliance: co-chair, Big Data Interest Group and Geospatial Interest Group
member, ERCIM Expert Group Big Data
member, Belmont Forum, WP 3 Harmonization of global environmental data infrastructure
Charter Member, OSGeo
council member, CGI / IUGS
founding member and secretary, CODATA Germany
...

Sample User Queries
"Give me all of the images in this geographic area in this time span that are at least 80% cloud free have been radiometrically corrected and are from these satellites and then pass those images into a workflow to perform functions x,y,z"
  • Carl Reed, CTO, Open Geospatial Consortium (OGC)
"Find images taken by the SEVIRI satellite on August 25, 2007 which contain fire hotspots in areas which have been classified as forests according to CORINE Land Cover, and are located within 2km from an archaeological site in the Peloponnese."
  • INSPIRE related

Core Requirements
User-oriented
  • Visual interfaces + powerful expert interfaces (R, Matlab, WMS, WCPS, ...)
Flexible
  • new apps, new research questions
Scalable
  • Allows for scalable implementations (auto-parallelization, orchestration)
  • high-level service defs, not micromanagement
Experience shows: high-level query language (QL) advantageous

Tackling Variety
Stock trading: 1-D sequences (i.e., arrays)
Social networks: large, homogeneous graphs
Ontologies: small, heterogeneous graphs
Climate modelling: 4D/5D arrays
Satellite imagery: 2D/3D arrays (+irregularity)
Genome: long string arrays
Particle physics: sets of events
Bio taxonomies: hierarchies (such as XML)
Documents: key/value stores: sets of unique identifiers + whatever
etc.

Managed Variety in Big Geo Data
[OGC 09-146r2]
OGC Coverage = regular & irregular grids, point clouds, meshes
  • Fully n-D, spatio-temporal & beyond
Unifying service: Web Coverage Service (WCS)

Hadoop: Not the Answer to All
MapReduce built for unstructured data
...no built-in knowledge about structured data types
  • Ex: Array Analytics: n-D Euclidean neighborhood
  • "Since it was not originally designed to leverage the structure […] its performance […] is therefore suboptimal."
      - Daniel Abadi
  • M. Stonebraker (XLDB 2012): "will hit a scalability wall"

OGC WCPS
OGC Web Coverage Processing Service (WCPS) = high-level geo raster query language; adopted 2008
"From MODIS scenes M1, M2, M3: difference between red & nir, as TIFF"
  • …but only those where nir exceeds 127 somewhere

    for $c in ( M1, M2, M3 )
    where some( $c.nir > 127 )
    return encode( $c.red - $c.nir, "image/tiff" )

    (tiffA, tiffC)

Database Visualization

    for $s in (SatImage)
    for $d in (DEM)
    return encode(
        struct {
            red:   (char) s.img.b7[x0:x1,x0:x1],
            green: (char) s.img.b5[x0:x1,x0:x1],
            blue:  (char) s.img.b0[x0:x1,x0:x1],
            alpha: (char) scale( d.elev, 20 )
        },
        "image/png" )

[JacobsU, Fraunhofer 2012; data courtesy BGS, ESA]

Use Case: Plymouth Marine Laboratory
[Oliver Clements, EGU 2014]
"Avg chlorophyll concentration for area & time period, from x/y/t cube"
  • 10, 60, 120, 240 days
Conclusions:
  • "we must minimise data transfer as well as [client] processing"
  • "standards such as WCPS provide the greatest benefit"

From Clouds to Federations
Automatic, ad-hoc federation between data centers, intelligent sensors, ...
  • autonomous
  • heterogeneous
Open standards!
[Figure: federation of Dataset A, Dataset B, Dataset C, Dataset D]

Summary
Rec 1: Evaluate domain standards
Rec 2: Geo domain as priority
  • "80% of all data are location connected"
Rec 3: tie in database, data mining experts
  • Leverage long-standing experience in flexible, scalable information systems
  • Trend: high-level query languages
  • New data type support
[rasdaman screenshots]
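
Backup sketch: semantics of the WCPS query on the OGC WCPS slide
The query on the OGC WCPS slide filters scenes with some( $c.nir > 127 ) and returns the pixel-wise difference red - nir for those that pass. The plain-Python sketch below restates that logic to make the semantics concrete. It is an illustration only, not how rasdaman or any WCPS server evaluates the query: the scenes are hypothetical 2x2 nested lists rather than MODIS data, the helper names (some_exceeds, band_difference, query) are invented here, and the TIFF encoding step is left out.

```python
# Toy restatement of:
#   for $c in ( M1, M2, M3 )
#   where some( $c.nir > 127 )
#   return encode( $c.red - $c.nir, "image/tiff" )
# All names and values below are hypothetical; encoding to TIFF is omitted.


def some_exceeds(band, threshold):
    """True if at least one pixel of the 2-D band is above the threshold."""
    return any(pixel > threshold for row in band for pixel in row)


def band_difference(a, b):
    """Pixel-wise a - b for two 2-D bands of equal shape."""
    return [
        [pa - pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def query(scenes, threshold=127):
    """Map scene name -> red - nir, keeping only scenes that pass the filter."""
    return {
        name: band_difference(scene["red"], scene["nir"])
        for name, scene in scenes.items()
        if some_exceeds(scene["nir"], threshold)
    }


# Hypothetical 2x2 scenes standing in for M1, M2, M3.
scenes = {
    "M1": {"red": [[200, 150], [100, 50]], "nir": [[130, 20], [10, 5]]},
    "M2": {"red": [[10, 20], [30, 40]], "nir": [[100, 127], [90, 80]]},
    "M3": {"red": [[5, 5], [5, 5]], "nir": [[0, 0], [0, 255]]},
}

result = query(scenes)
print(sorted(result))
```

With these toy values, M2 is dropped because its nir band never goes above 127, which mirrors the two-result output (tiffA, tiffC) shown on the slide. In a real deployment the filter and the subtraction would run server-side, so only the results leave the data center.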