Innovation in a Complex World:
Examples and Challenges
www.microsoft.com/science
Dr Daron Green
Senior Director, Microsoft Research
Overview
• Context
• Innovation in action
– Data deluge
– Data visualization
– Data sharing
• Challenges/impediments
– Things we haven’t worked out
– What’s stopping us making progress
– Areas of concern
Microsoft Research At A Glance
• Redmond, Washington: Sep 1991
• San Francisco, California: Jun 1995
• Cambridge, United Kingdom: July 1997
• Beijing, China: Nov 1998
• Silicon Valley, California: July 2001
• Bangalore, India: Jan 2005
• Cambridge, Massachusetts: July 2008
Microsoft Research Mission Statement
• Expand the state of the art in each of the areas
in which we do research
• Rapidly transfer innovative technologies into
Microsoft products
• Ensure that Microsoft products have a future
Context: Science @ Microsoft
• Multidisciplinary Research
• Earth Sciences
• Computer & Information Sciences
• Life Sciences
• Social Sciences
• New Materials, Technologies & Processes
• Math and Physical Science
A Data Deluge in Science
• Data collection
– Sensor networks, satellite
surveys, high throughput
laboratory instruments,
astronomical telescopes,
supercomputers, LHC …
• Data processing, analysis, visualization
– Legacy codes, workflows, data mining, indexing, searching, graphics …
• Archiving
– Digital repositories, libraries, preservation, …
[Figure: SensorMap. Functionality: map navigation. Data: sensor-generated temperature, video camera feed, traffic feeds, etc.]
Scientific visualizations
NSF Cyberinfrastructure report, March 2007
Emergence of a New Research Paradigm?
• A thousand years ago – Experimental Science
– Description of natural phenomena
• Last few hundred years – Theoretical Science
– Newton’s Laws, Maxwell’s Equations…
• Last few decades – Computational Science
– Simulation of complex phenomena
• Today – eScience or Data-centric Science
– Unify theory, experiment, and simulation
– Using data exploration and data mining
• Data captured by instruments
• Data generated by simulations
• Data generated by sensor networks
Scientists are overwhelmed with data…
Computer scientists and IT companies have technologies that will help innovate
[Figure: equation graphic illustrating theoretical science; symbols lost in extraction]
Implications
• Data management along research pipeline:
– Capture (including metadata)
– Processing
– Visualization
– Storage
– Retrieval
– Sharing
– Publication
– Archival
Handling the data deluge…
Three examples:
• Machine Learning and HIV/AIDS research
• Advanced Database technologies and
Environmental Science
• Oceanographic Workflows
Fighting HIV with Computer Science
• A major problem: Over 40 million infected
– Drug treatments are effective but are an expensive life
commitment
• Vaccine needed for third world countries
– Effective vaccine could eradicate disease
• Methods from computer science are helping with the design
of vaccine
– Machine learning: Finding biological patterns that may
stimulate the immune system to fight the HIV virus
– Optimization methods: Compressing these patterns into
a small, effective vaccine
Computational Biology Web Tools
Better vaccine design through improved
understanding of HIV evolution
Goals
• Use machine learning and
visualization tools developed at
Microsoft, which require HPC, to
build maps of within-individual
evolution of the HIV virus
Progress so far
• Discovered ‘decoy epitopes’ that could have predicted recent failure of Merck vaccine
• Algorithms and medical results published in Science and Nature Medicine
• MSR Computational Biology Tools published (Open Source on CodePlex)
Handling the data deluge…
Three examples:
• Machine Learning and HIV/AIDS research
• Advanced Database technologies and
Environmental Science
• Oceanographic Workflows
Carbon-Climate Data
• What is the role of
photosynthesis in global
warming?
– Measurements of CO2 in the
atmosphere show 16-20% less
than emissions estimates predict
– The difference is either due to
plants or ocean absorption.
• Communal field science – each investigator acts independently.
• Cross site studies and integration with modeling increasingly important
[Figure: LaThuile_NEE (gC m-2 yr-1) plotted against Pub_NEE (gC m-2 yr-1), both axes from -1500 to 1500]
Ameriflux Data
In collaboration with Berkeley
Water Center
• 149 Ameriflux sites across the
Americas reporting minimum of
22 common measurements
• Carbon-Climate Data published to
and archived at Oak Ridge
• Total data reported to date on the
order of 192M half-hourly
measurements since 1994
• SharePoint site www.fluxnet.org
– 921 site-years of data from 240 sites around the world; 80+ site-years now being added
– 60+ paper writing teams
– American data subset is public and
served more widely
– Summary data products greatly
simplify initial data discovery
• Used modern Relational
Database technologies
– Scientists can access data through
Data Cubes
– Allows simple data viewing
without need for knowledge of
SQL language
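The data-cube idea above (browsing pre-aggregated measurements by dimension instead of writing SQL) can be sketched roughly as follows. This is only an illustration, not the FLUXNET implementation: the records are synthetic, and the helper names `build_cube`, `slice_cube` and `roll_up`, as well as the choice of dimensions (site, year, variable), are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Synthetic records: (site, year, variable, value).
# All values are made up for illustration only.
RECORDS = [
    ("Site A", 2004, "NEE", -1.0),
    ("Site A", 2004, "NEE", -3.0),
    ("Site A", 2005, "NEE", -2.0),
    ("Site B", 2004, "NEE", 1.0),
    ("Site B", 2005, "NEE", 3.0),
    ("Site B", 2005, "NEE", 5.0),
]

def build_cube(records):
    """Pre-aggregate records into mean values keyed by (site, year, variable)."""
    cells = defaultdict(list)
    for site, year, variable, value in records:
        cells[(site, year, variable)].append(value)
    return {key: mean(values) for key, values in cells.items()}

def slice_cube(cube, site=None, year=None, variable=None):
    """Return the cells matching the given dimension values (None means any)."""
    return {
        key: value
        for key, value in cube.items()
        if (site is None or key[0] == site)
        and (year is None or key[1] == year)
        and (variable is None or key[2] == variable)
    }

def roll_up(cube, axis):
    """Average the cell values along one axis: 0 = site, 1 = year, 2 = variable."""
    groups = defaultdict(list)
    for key, value in cube.items():
        groups[key[axis]].append(value)
    return {group: mean(values) for group, values in groups.items()}

cube = build_cube(RECORDS)
print(slice_cube(cube, year=2005))  # one cell per site for 2005
print(roll_up(cube, axis=0))        # one summary value per site
```

The user picks dimension values rather than composing a query, which is the property the slide highlights. A production cube would be built by the database engine rather than in application code.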
Scientific Data Servers for Hydrology
[Figure: Ameriflux Data Availability, All Data. Availability by year (1991 to 2006) for each reporting site in Brazil, Canada and the USA]
Mashup of Ameriflux Sites
Handling the data deluge…
Three examples:
• Machine Learning and HIV/AIDS research
• Advanced Database technologies and
Environmental Science
• Oceanographic Workflows
Trident – Scientific Workbench
Trident Scientific Workflow Workbench
What it provides to the scientists
• Visually program workflows, through a web browser.
• Libraries of activities and workflows, to save and reuse workflows.
• Abstract parallelism for HPC, to test on desktop and then run on
cluster.
• Adaptive workflows, to detect and respond to events in real-time.
• Automatic provenance capture, for all workflows and data
products.
• Costing model, estimating resources required to run a workflow.
• Integrated data storage and access, allowing researchers to store
data in a SQL database, in local files or in the cloud (Microsoft SDS,
Amazon S3).
• Fault tolerance, to facilitate smart reruns and what-if analysis.
• Reproducible research
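Automatic provenance capture, one of the features listed above, can be illustrated with a small sketch. This is not Trident's API: the `Workflow` class, the `fingerprint` helper, the two activities and the sensor readings are all hypothetical, chosen only to show how each step's input and output can be recorded as the workflow runs.

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(value):
    """Short, stable hash of a JSON-serialisable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

class Workflow:
    """Runs activities in order and records a provenance entry per step."""

    def __init__(self, name):
        self.name = name
        self.activities = []  # list of (activity name, function)
        self.provenance = []  # one record per executed activity

    def add(self, name, func):
        self.activities.append((name, func))
        return self  # allow chaining

    def run(self, data):
        for name, func in self.activities:
            result = func(data)
            self.provenance.append({
                "workflow": self.name,
                "activity": name,
                "input": fingerprint(data),
                "output": fingerprint(result),
                "finished_at": datetime.now(timezone.utc).isoformat(),
            })
            data = result
        return data

# Hypothetical activities operating on synthetic Fahrenheit readings.
def drop_missing(readings):
    return [r for r in readings if r is not None]

def to_celsius(readings):
    return [round((f - 32) * 5 / 9, 1) for f in readings]

workflow = (
    Workflow("sensor-cleanup")
    .add("drop_missing", drop_missing)
    .add("to_celsius", to_celsius)
)
result = workflow.run([32.0, None, 50.0, 212.0])
print(result)
for record in workflow.provenance:
    print(record["activity"], record["input"], "->", record["output"])
```

Because each record stores a hash of the step's input and output, the output of one step can be matched to the input of the next, which is what makes a later rerun or audit possible.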
However…Challenges/Impediments
• Three dominant issues:
– People: lack of alignment in benefits, incentives and
budget…or, put another way, the way we respond to
money, process, metrics, measurement and
recognition…
– Technology: Transition to many/multi-core
– Privacy: risk of exposing personal information
Remote management of long-term conditions
The underlying challenge…
• Thousands of successful(?) pilots but none ‘make it big’
• Many, many papers published
• It has been shown† that:
– Largely no motivation for adoption by health practitioners
because there is…
– no alignment of benefits, incentives and budgets
• Or, stated another way, it is dangerous to assume people will adopt an
innovation just because it is ‘obviously’ the right thing to do.
• Consider the whole context for the innovation (people, money, metrics,
reward structures, process, skills etc.): it’s not just the technology.
• Sometimes the key innovation is in the business design
†Dr Daron G Green and Prof Terry Young, Value Propositions for Information Systems in Healthcare, Proceedings of the 41st Annual
Hawaii International Conference on System Sciences (HICSS), p257, 2008
Challenges/Impediments
• Three dominant issues:
– People: lack of alignment in benefits, incentives and
budget…or, put another way, the way we respond to
money, process, metrics, measurement and
recognition…
– Technology: Multi-Core Transition
– Privacy: inadvertently exposing personal information
CPU Architecture
• Heat becoming an unmanageable problem
[Figure: power density (W/cm2, log scale from 1 to 10,000) of Intel processors (4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®) from 1970 to 2010, compared with a hot plate, a nuclear reactor, a rocket nozzle and the Sun’s surface. Source: Pat Gelsinger, Intel Developer Forum, Spring 2004]
The End of Moore’s Law as We Know It
• Future of silicon chips
– “100’s of cores on a chip in 2015”
(Justin Rattner, Intel)
• Challenge for IT industry and Computer
Science community
– How can we make parallel computing on a chip
easy for developers of consumer applications?
• Challenge for the Scientific Community
– How will the Multi-Core transition affect
scientific computing?
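One common way to use many cores is to split the data into independent chunks, process each chunk on its own worker, and combine the partial results. The sketch below shows that pattern with Python's standard `concurrent.futures` module. The function names (`chunked`, `partial_sum`, `sum_of_squares`) and the workload are assumptions made for illustration, not something taken from the talk.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def chunked(values, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal pieces."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]

def partial_sum(chunk):
    """Work done independently by one worker: no shared state."""
    return sum(x * x for x in chunk)

def sum_of_squares(values, workers=4, executor_cls=ProcessPoolExecutor):
    """Map chunks onto workers, then reduce the partial results."""
    chunks = chunked(list(values), workers)
    with executor_cls(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    # Processes give real multi-core parallelism for CPU-bound work.
    print(sum_of_squares(range(1_000_000), workers=4))
    # The same decomposition can be exercised on threads, e.g. for quick tests.
    print(sum_of_squares(range(10), workers=2, executor_cls=ThreadPoolExecutor))
```

The hard part the slide points at is not this pattern itself but the many computations that do not split into independent chunks so cleanly.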
Challenges/Impediments
• Three dominant issues:
– People: lack of alignment in benefits, incentives and
budget…or, put another way, the way we respond to
money, process, metrics, measurement and
recognition…
– Technology: Multi-Core Transition
– Privacy: inadvertently exposing personal information
Challenge: Data for Open Innovation
• With web users becoming
producers of information…
• We leave the footprint of our
lives in digital trails…
• It is becoming easier for “data
snoopers” to reconstruct the
identity of an individual or an
organization by cross-linking
information from different
sources.
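The cross-linking risk described above can be sketched as a join on attributes that two data sets share. Everything in the example is synthetic: the records, the field names and the choice of quasi-identifiers (postcode, birth year, sex) are assumptions used only to show the mechanism.

```python
# All records below are synthetic and the field names are hypothetical.
ANONYMISED_RELEASE = [
    {"user_id": 101, "postcode": "AB1", "birth_year": 1950, "sex": "F", "query": "knee pain"},
    {"user_id": 102, "postcode": "AB1", "birth_year": 1982, "sex": "M", "query": "car loans"},
    {"user_id": 103, "postcode": "CD2", "birth_year": 1982, "sex": "M", "query": "job search"},
]
PUBLIC_DIRECTORY = [
    {"name": "Person A", "postcode": "AB1", "birth_year": 1950, "sex": "F"},
    {"name": "Person B", "postcode": "CD2", "birth_year": 1982, "sex": "M"},
    {"name": "Person C", "postcode": "CD2", "birth_year": 1982, "sex": "M"},
]
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")

def key(record):
    """Tuple of the attributes that both data sets have in common."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)

def link(release, directory):
    """Map each released user_id to the directory names sharing its key."""
    by_key = {}
    for person in directory:
        by_key.setdefault(key(person), []).append(person["name"])
    return {row["user_id"]: by_key.get(key(row), []) for row in release}

def reidentified(matches):
    """A user is re-identified when exactly one candidate remains."""
    return {uid: names[0] for uid, names in matches.items() if len(names) == 1}

matches = link(ANONYMISED_RELEASE, PUBLIC_DIRECTORY)
print(matches)
print(reidentified(matches))
```

Removing names from the release does not help user 101, whose combination of attributes is unique in the directory. User 103 is protected only because two people share the same combination.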
A face is exposed for searcher no. 4417749
• “Search query data can contain
the sum total of our work,
interests, associations, desires,
dreams, fantasies, and even
darkest fears.”
The New York Times, Aug 2006:
Thelma Arnold's identity was
betrayed by the records of her Web
searches
Online Privacy
• We leave our traces online at multiple sites such as social
networks, blogs, forums etc.
– Users can be re-identified by linking movie mentions in forums to
user ratings of movies [Frankowski ’06]
• However, researchers seek to gain insights and undertake
experiments with real-world data, and businesses need
tools and analysis to understand market trends and needs…
In need of a framework for open innovation
• Research and innovation are inhibited by the lack
of a framework for disseminating information in a safe
way
• Open innovation roadblocks due to shortcomings in
– Data confidentiality/privacy
– Different data regulations per country
• More research needed on technical (semantics),
legal, societal solutions and processes to enable
open innovation in an information-based society
Challenges/Impediments
• Three dominant issues:
– People: lack of alignment in benefits, incentives and
budget…what is the business design that underpins
your innovation?
– Technology: Multi-Core Transition…just how will this
work out?
– Privacy: inadvertently exposing personal
information…what personal/business risks are we
prepared to accept?
Starting point
Comprehensive analysis of:
- NHS Stakeholder vs benefit
- NHS Stakeholder vs incentives
- NHS Stakeholder vs budget availability
- defining the scope of the service
1) BT originally tried to sell here
2) …then we aspired to be here…
3) …and needed to understand what functionality/value was required
Simplified benefits
Plays into political agenda:
- Access
- Choice
- Increased private sector
involvement in patient care
- New role of pharmacies
No significant
benefit to
these care
providers
PCT sees benefit and dis-benefit:
- Benefits of service are extremely diffuse
- Medication and strips costs ↑
- GP visits and A&E admissions ↓ over time
- Compliance increases: Yr 1 <£10k benefit
growing to £225k by Yr 10 (payback over very long
timescales)
- Near term: BT CDM solution roughly cash neutral
to PCT
Patients clearly benefit
provided they are motivated to
use service
Incentives summary
Incentives dominated by financial imperatives
Current incentives operate against adoption of service
Implementation of service largely irrelevant given current incentives
Requires regular updates to ensure personal motivation
Budget availability summary
Lack of incentives and appropriate metrics leads to no
real acknowledgement of the problem and no defined budget
Patients see costs for diabetes (and other LTCs) as being
responsibility of NHS
Summary overlay [benefits/incentives/budget]
Accrual of benefits at upper levels in NHS/DoH encourages a national-scale service; however, all management of long-term conditions is
devolved to ‘lower’ levels of the NHS
Alignment of benefits, incentives and
budget availability does not appear at
lower levels of stakeholder stack.
Explains why many hospital/PCT/SHA
pilots and other initiatives in this area
have failed. This is a ‘no profit zone’
for a CDM service in UK.