D4M – Signal Processing On Databases
42 Sydney St Artarmon NSW 2064 Australia
Virtualnation

Starting with Big Data
• Why care?
• In your reach: big data and big compute on a budget
• Start with data and apply math
• D4M with Accumulo: new technology from MIT and NSA that claims to
    • require 100x less code, and
    • be 100x faster than other approaches
• Fundamentally mathematical analysis for big data
• Lift the lid.

Understand the world through data and math
• How do you want to understand the world?
• IT approaches have evolved from a past where IT was expensive and controlled by the few
• Problems were modeled and constrained not only to fit onto limited computers but also to fit in with the politics of the enterprise
• If you could observe without built-in constraints and preconceived bias, how would you approach computing?
• Understand through scientific method: data and math

The Primordial Web (92)
[Diagram labels: Client: Browser (html); Server: Server (http); Database: Database (sql); links: http put, http get, SQL, data; Language: Gopher]
• Browser GUI? HTTP for files? Perl for analysis? SQL for data?
• A lot of work just to view data.

The Modern Web
[Diagram labels: Client: Game (data); Server: Server (http); Database: Database (triples); links: http put, http get, java, data]
• Game GUI! HTTP for files? Perl for analysis? Triples for data!
• A lot of work to view a lot of data.
• Great view. Massive data.

Future Web?
[Diagram labels: Client: Game (data); Server: Server (http); Database: Database (triples); links: http put, http get, java, data]
• Game GUI! Fileserver for files! D4M for analysis! Triples for data!
• A little work to view a lot of data. Securely.
• Great view. Massive data.

Big Data and Big Compute on a budget
• ~$9K: server with 256 GB RAM, 32 CPU cores and 1.7 TB SSD
• ~$26K: 270 TB storage server
• $199: 4 TB USB drive
• ZFS / SmartOS as a free virtualization technology
• ~68 TB: entire transactional corpus of a $45B Australian retailer
• How big are your possible data sets?
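As a back-of-the-envelope check on the budget figures above, the following Python sketch works out the cost per terabyte of the two storage options and how many units would be needed to hold the ~68TB corpus. The prices and capacities are the ones quoted on the slide; the names `STORAGE_OPTIONS`, `CORPUS_TB` and `summarize` are illustrative only, and the sketch assumes no replication, RAID, or filesystem overhead.

```python
import math

# Prices (dollars) and capacities (TB) as quoted on the slide.
STORAGE_OPTIONS = {
    "270TB storage server": {"price": 26_000, "capacity_tb": 270},
    "4TB USB drive": {"price": 199, "capacity_tb": 4},
}
CORPUS_TB = 68  # the ~68TB retail transactional corpus


def summarize(options, corpus_tb):
    """Return cost per TB, units needed, and total cost for each option."""
    rows = {}
    for name, spec in options.items():
        # Whole units only: round the unit count up.
        units = math.ceil(corpus_tb / spec["capacity_tb"])
        rows[name] = {
            "per_tb": round(spec["price"] / spec["capacity_tb"], 2),
            "units": units,
            "total": units * spec["price"],
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(STORAGE_OPTIONS, CORPUS_TB).items():
        print(name, row)
```

With the slide's figures this comes to roughly $96 per TB for the storage server and about $50 per TB for USB drives; 17 drives (about $3,383) would hold the corpus, ignoring any redundancy.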
Apache Accumulo
• NSA’s Big Table implementation and now top level Apache project
• Cell level security to support privacy and need to know
• Supports large scale processing of sparse matrices…

Packaged into a secure production configuration

Parallel Warehouse Scale Computer
[Figure: a parallel architecture of CPU / RAM / disk nodes joined by a network switch, shown beside the memory hierarchy. Memory hierarchy: Registers, Cache, Local Memory, Remote Memory, SSD, Disk. Unit of memory: instruction operands, blocks, messages, pages. Implications: bandwidth, latency, programmability, capacity.]
See http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf

Starting with Big Data
• Now cheap to collect all data forever.
• Unconstrained approach to data acquisition
• No analysis up front or modeling
• Much of it involves Graph Analytics
ISR
    • GOAL: Identify anomalous patterns of life
Social
    • GOAL: Identify hidden social networks
Cyber
    • GOAL: Detect cyber attacks or malicious software

D4M - Signal Processing on Database
Weak Signatures, Noisy Data, Dynamics
Novel Analytics for: Text, Cyber, Bio
High Level Composable API: D4M (“Databases for Matlab”)
Distributed Database: Accumulo/HBase (triple store)
Distributed Database / Distributed File System
Interactive Supercomputing
High Performance Computing: Cluster + Hadoop

Detection Theory

Matlab Demo - Reuters Corpus V1 (NIST)
• 810,000 Reuters news items
• Demonstration picked 70,000 and found 13,000 entities
• A is a 70K x 13K associative array with 500K entries.
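To make the associative array in the Reuters demo more concrete, here is a minimal Python sketch. It is not D4M itself (D4M is a Matlab library); it only illustrates the idea of a sparse two-dimensional array addressed by string keys, with rows standing for news items and columns for entities. The class `AssocArray`, its methods, the helper `density`, and the sample document and entity keys are all hypothetical.

```python
class AssocArray:
    """Toy sparse associative array: string row/column keys, numeric values."""

    def __init__(self, triples=()):
        # Only non-empty cells are stored, keyed by (row, column).
        self._cells = {}
        for row, col, val in triples:
            key = (row, col)
            # Repeated (row, col) pairs accumulate their values.
            self._cells[key] = self._cells.get(key, 0) + val

    def row_keys(self):
        return sorted({r for r, _ in self._cells})

    def col_keys(self):
        return sorted({c for _, c in self._cells})

    def shape(self):
        return (len(self.row_keys()), len(self.col_keys()))

    def nnz(self):
        """Number of stored (non-empty) entries."""
        return len(self._cells)

    def row(self, row_key):
        return {c: v for (r, c), v in self._cells.items() if r == row_key}

    def col(self, col_key):
        return {r: v for (r, c), v in self._cells.items() if c == col_key}


def density(n_rows, n_cols, n_entries):
    """Fraction of cells that actually hold a value."""
    return n_entries / (n_rows * n_cols)


def demo_array():
    # Made-up documents and entities, for illustration only.
    return AssocArray([
        ("doc1", "person/alice", 1),
        ("doc1", "place/sydney", 1),
        ("doc2", "person/alice", 1),
        ("doc3", "org/acme", 1),
    ])


if __name__ == "__main__":
    A = demo_array()
    print(A.shape(), A.nnz(), A.col("person/alice"))
    print(density(70_000, 13_000, 500_000))
```

Applied to the figures quoted for A, `density(70_000, 13_000, 500_000)` is about 0.00055: only around 0.055% of the 910 million possible cells are filled, or roughly 7 entities per news item, which is why a sparse representation is the natural fit.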
D4M demonstrations

7 Universal Constructs for Analytics

Multi-Dimensional Associative Array

Universal Exploded Schema

D4M Stores Giant Sparse Matrices in the Accumulo Triple Store Database
[Diagram labels: Triple Store (Distributed Database); D4M (Dynamic Distributed Dimensional Data Model); Associative Arrays (Numerical Computing Environment); graph with nodes A, B, C, D, E; Query: T(:,ggaatctgcc)]
• Triple stores are high-performance distributed databases for heterogeneous data
• A D4M query returns a sparse matrix or graph from a triple store for statistical signal processing or graph analysis in Matlab

Big Data for High Speed Sequence Matching
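The column query shown on the slide, T(:,ggaatctgcc), can be illustrated with a small Python sketch. This is not the D4M or Accumulo API. It assumes, purely for illustration, that each sequence is split into overlapping 10-letter words (matching the length of the key in the slide's query), that each word becomes its own column key in the spirit of the exploded schema, and that the store keeps a second, column-keyed index so that column lookups are as cheap as row lookups. The names `TripleStore`, `kmers`, `ingest`, `match` and the sample sequences are hypothetical.

```python
from collections import Counter, defaultdict

K = 10  # word length; assumed from the 10-letter key in the slide's query


def kmers(seq, k=K):
    """All overlapping length-k words in a sequence."""
    seq = seq.lower()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class TripleStore:
    """Toy in-memory store of (row, column, value) triples."""

    def __init__(self):
        self._by_row = defaultdict(dict)
        self._by_col = defaultdict(dict)  # transposed copy for column queries

    def put(self, row, col, val):
        self._by_row[row][col] = val
        self._by_col[col][row] = val

    def row(self, row_key):
        """Rough analogue of T(row_key, :)."""
        return dict(self._by_row.get(row_key, {}))

    def column(self, col_key):
        """Rough analogue of T(:, col_key)."""
        return dict(self._by_col.get(col_key, {}))


def ingest(store, seq_id, seq):
    # One triple per distinct word: (sequence id, word, occurrence count).
    for word, count in Counter(kmers(seq)).items():
        store.put(seq_id, word, count)


def match(store, query_seq):
    """Count distinct words each stored sequence shares with the query."""
    hits = Counter()
    for word in set(kmers(query_seq)):
        for seq_id in store.column(word):
            hits[seq_id] += 1
    return hits


def demo_store():
    # Made-up reference sequences, for illustration only.
    store = TripleStore()
    ingest(store, "seqA", "ttggaatctgccaa")
    ingest(store, "seqB", "ggaatctgccgt")
    ingest(store, "seqC", "acgtacgtacgtacgt")
    return store


if __name__ == "__main__":
    T = demo_store()
    print(T.column("ggaatctgcc"))
    print(match(T, "ggaatctgccg").most_common())
```

In this toy data the column lookup returns only the two sequences that contain the word, and `match` ranks stored sequences by how many words they share with the query. Keeping both a row index and a column index doubles the storage but avoids scanning every row to answer a column query.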