The Code - Microsoft Research

Promised Abstract
Can database technology help manage and mine
scientific data? That is the question I have been trying to
answer with my astronomy colleagues (especially Alex
Szalay.) We have had some success but still face many
problems. I will start by describing the odyssey of putting
the Sloan Digital Sky Survey online and give some
statistics about how it is used and what we are doing now.
That segment will end with a discussion of how the
integration of SQL with the CLR (common language run
time) makes it much easier for us to handle scientific
datatypes and spatial access methods. The World-Wide
Telescope is an attempt to federate all the astronomy
archives of the world. I will briefly describe the architecture
of SkyQuery a prototype portal to several archives each of
which is a web service. Urls:
http://research.microsoft.com/~gray
http://skyserver.pha.jhu.edu/dr1/ http://skyQuery.net
Session Code
Yukon Features For
SkyServer Database
Jim Gray: Microsoft
Alex Szalay (and friends): Johns Hopkins
Help from:
Cathan Cook (personal SkyServer),
Maria A. Nieto-Santisteban (image cutout service)
SkyServer Overview (10 min)
10 minute SkyServer tour
Pixel space
Record space:
Doc space: Ned
Set space:
Web & Query Logs
Dr1 WebService
http://skyserver.sdss.org/en/
select top 10 * from weblog..weblog
where yy = 2003 and mm=7 and dd =25
order by seq desc
select top 10 * from weblog..sqlLog
order by theTime Desc
You can download (thanks to Cathan Cook )
Data + Database code:
http://research.microsoft.com/~gray/SDSS
Website:
Data Mining the SDSS SkyServer Database
MSR-TR-2002-01
Cutout Service (10 min)
A typical web service
Show it
Show WSDL
Show fixing a bug
Rush through code.
You can download it.
Maria A. Nieto-Santisteban did most of this (Alex and I started it)
http://research.microsoft.com/~gray/SDSS/personal_skyserver.htm
SkyQuery: http://skyquery.net/
Distributed Query tool using a set of web services
Fifteen astronomy archives from
Pasadena, Chicago, Baltimore, Cambridge (England)…
Feasibility study, built in 6 weeks
Tanu Malik (JHU CS grad student)
Tamas Budavari (JHU astro postdoc)
With help from Szalay, Thakar, Gray
Implemented in C# and .NET
Allows queries like:
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o,
TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t)<3.5
AND AREA(181.3,-0.76,6.5)
AND o.type=3 and (o.I - t.m_j)>2
SkyQuery Structure
Each SkyNode publishes
Schema Web Service
Database Web Service
Portal is
Plans Query (2 phase)
Integrates answers
Is itself a web service
Image
Cutout
SDSS
SkyQuery
Portal
FIRST
2MASS
INT
Four Database Topics
Sparse tables:
column vs row store
tag and index tables
pivot
Maplist (cross apply)
Bookmark bug
Object Relational has arrived.
Column Store Pyramid
Users see fat base tables
(universal relation)
BASE
Define popular columns index
tag table 10% ~ 100 columns
Make many skinny indices
1% ~ 10 columns
Query optimizer picks right plan
TAG
Obese query
Automate definition & use
Fast read, slow insert/update
Data warehouse
Note: prior to Yukon,
index had 16 column limit.
Fat
quer
y
Typical
Semijoin
INDICIES
A bane of my existence.
Simpl
Examples
create table base (
id bigint, f1 int primary key,
f2 int, …,f1000 int)
BASE
TAG
Obese query
create index tag on base (id)
include (f1, …, f100)
Typical Semi-join
Fat
quer
y
create index skinny on
base(f2,…f17)
INDICIES
Simpl
A Semi-Join Example
create table fat(a int primary key, b int, c int, fat char (988))
declare @i int, @j int; set @i = 0
again:
insert fat values(@i, cast(100*rand() as int), cast (100*rand() as int), ' ')
set @i = @i + 1; if (@i < 1000000) goto again
b=c
137 K IO
31 sec
create index ab on fat(a,b)
create index ac on fat(a,c)
dbcc dropcleanbuffers with no_infomsgs
select count(*) from fat with(index (0)) where c = b
-- Table 'fat'. Scan 3, reads 137,230, CPU : 1.3 s, elapsed 31.1s.
dbcc dropcleanbuffers with no_infomsgs
select count(*) from fat where b=c
-- Table 'fat'. Scan 2, reads: 3,482 CPU 1.1 s,
1GB
elapsed: 1.4 s.
ab
ac
8MB 8MB
b=c
3.4K IO 1.4 sec
Moving From Rows to Columns
Pivot & UnPivot
What if the table is sparse?
LDAP has 7 mandatory
and 1,000 optional attributes
Store row, col, value
Features
object attribute value
●●●●
create table Features (
select *
from (features
object varchar ,
attribute varchar,
value
varchar,
primary key (
object,
attribute))
pivot value
on
attribute
in
(year, color) ) as T
where object = ‘4PNC450’
4PNC450
4PNC450
4PNC450
4PNC450
year
color
make
model
2000
white
Ford
Taurus
●●●●
T
Object year
4PNC450 2000
color
white
Maplist Meets SQL – cross apply
select p.*, q.*
from parent as p cross apply f(p.a, p.b, p.c) as q
where p.type = 1
p1
f(p1)
Your table-valued function F(a,b,c)
returns all objects related to a,b,c.
spatial neighbors,
p2
f(p2)
sub-assemblies,
members of a group,
items in a folder,…
Apply this function to each row
pn
f(pn)
Classic drill-down
use outer apply if f() may be null
The Bookmark Bug
SQL is a non-procedural language.
The compiler/optimizer picks the procedure
based on statistics.
If the stats are wrong or missing….
Bad things happen.
Queries can run VERY slowly.
Strategy 1: allow users to specify plan.
Strategy 2: make the optimizer smarter
(and accept hints from the user.)
An Example of the Problem
Bookmark plan:
look in index for a subset.
Index
A query selects some fields
of an index
and of huge table.
Lookup subset in Fat table.
This is
great if subset << table.
terrible if subset ~ table.
If statistics are wrong,
or if predicates not independent,
you get the wrong plan.
How to fix the statistics?
Huge table
A Fix: Let user ask for stats
Create Statistics on View(f1,..,fn)
Then the optimizer has the right data
Picks the right plan.
Statistics on Views,
C. Galindo-Legaria, M. Josi, F. Waas, M. Wu, VLDB 2003,
Q3:
Select count(*) from Galaxy
where r < 22 and r_extinction > 0.120
Bookmark: 34 M
random IO,
520 minutes
Create Statistics on Galaxy(objID )
Scan:
5 M sequential IO
18 minutes
Ultimately this should be automated,
but for now,… it’s a step in the right direction.
Object Relational Has Arrived
VMs are moving inside the DB
Yukon includes Common Language Runtime
(Oracle & DB2 have similar mechanisms).
So, C++, VB, C# and Java are
co-equal with TransactSQL.
code
You can define classes and methods
SQL will store the instances
Access them via methods
code
You can put your analysis code
INSIDE the database.
Minimizes data movement.
You can’t move petabytes to the client
But we will soon have petabyte databases.
data
data
+code
And..
Fully-async and synchronous (blocking) calls
and multi-concurrent-result sets
per connection (transaction)
Queues built in (service broker):
Fire-and forget asynchronous processing
It listens to Port 80 for SOAP calls :
TP-lite is back
It’s a web service
Notification service and
data mining and
olap and
reporting and
xml and
xquery and
.... )
But, back to OR.
Some Background
Table valued functions
SQL operates on tables.
If you can make tables, you can extend SQL
This is the idea behind OLE/DB
create function Evens(@maxVal int)
returns @T table (a int)
begin
while (@maxVal > 0)
begin
if (@maxVal % 2 = 0)
insert @T values(@maxVal)
set @maxVal = @maxVal -1
end
return
end
select *
from Evens(10)
a
----------10
8
6
4
2
Using table Valued Functions
For Spatial Search
Use function to return likely key ranges.
Use filter predicate to eliminate objects
outside the query box.
Table valued function
returns candidate ranges
of some space-filling curve.
Select objID
From Objects O join fGetRanges( @latitude, @longitude, @radius) R
on O.htmID between R.begin and R.end
where abs(o.Lat - @latitude) + abs(o.Lon – @longitude) < @radius
Filter discards
false positives.
The Pre CLR design
Transact SQL
sp_HTM
(20 lines)
469 lines of
“glue”
looking like:
// Get Coordinates param datatype, and param length information of
if (srv_paraminfo(pSrvProc, 1, &bType1, &cbMaxLen1, &cbActualLen1, NULL, &fNull1) == FAIL)
ErrorExit("srv_paraminfo failed...");
// Is Coordinate param a character string
if (bType1 != SRVBIGVARCHAR && bType1 != SRVBIGCHAR &&
bType1 != SRVVARCHAR
&& bType1 != SRVCHAR)
ErrorExit("Coordinate param should be a string.");
// Is Coordinate param non-null
if (fNull1 || cbActualLen1 < 1 || cbMaxLen1 <= cbActualLen1)
ErrorExit("Coordinate param is null.");
// Get pointer to Coordinate param
pzCoordinateSpec = (char *) srv_paramdata (pSrvProc, 1);
if (pzCoordinateSpec == NULL)
ErrorExit("Coordinate param is null.");
pzCoordinateSpec[cbActualLen1] = 0;
// Get OutputVector datatype, and param length information
if (srv_paraminfo(pSrvProc, 2, &bType2, &cbMaxLen2, &cbActualLen2, NULL, &fNull2) == FAIL)
ErrorExit("Failed to get type info on HTM Vector param...");
The HTM code body
The “glue” CLR design
Discard 450 lines of UGLY code
C# SQL
sp_HTM
(50 lines)
Thanks!!! To Peter Kukol
(who wrote this)
using System;
using System.Data;
using System.Data.SqlServer;
using System.Data.SqlTypes;
using System.Runtime.InteropServices;
namespace HTM {
public class HTM_wrapper {
[DllImport("SQL_HTM.dll")] static extern unsafe void * xp_HTM_Cover_get (byte *str);
public static unsafe void HTM_cover_RS(string input) {
// convert the input from Unicode (array of 2 bytes) to an array of bytes (not shown)
byte * input; byte * output;
// invoke the HTM routine
output = (byte *)xp_HTM_Cover_get(input);
// Convert the array to a table
SqlResultSet outputTable = SqlContext.GetReturnResultSet();
if (output[0] == 'O') { // if Output is “OK”
uint
c = *(UInt32 *)(s + 4); // cast results as dataset
Int64 * r = ( Int64 *)(s + 8); // Int64 r[c-1,2]
for (int i = 0; i < c; ++i) {
SqlDataRecord newRecord = outputTable.CreateRecord();
newRecord.SetSqlInt64(0, r[0]);
newRecord.SetSqlInt64(1, r[1]); r++;r++;
outputTable.Insert(newRecord);
} } // return outputTable;
}
}
}
The HTM
code body
The Clean CLR design
Discard all glue code
return array cast as table
CREATE ASSEMBLY HTM_A
FROM '\\localhost\HTM\HTM.dll'
CREATE FUNCTION HTM_cover( @input NVARCHAR(100) )
RETURNS @t TABLE (
HTM_ID_START BIGINT NOT NULL PRIMARY KEY,
HTM_ID_END BIGINT NOT NULL
)
AS
EXTERNAL NAME HTM_A:HTM_NS.HTM_C::HTM_cover
using System;
using System.Data;
using System.Data.Sql;
using System.Data.SqlServer;
using System.Data.SqlTypes;
using System.Runtime.InteropServices;
namespace HTM_NS {
public class HTM_C {
public static Int64[,2] HTM_cover(string input) {
// invoke the HTM routine
return (Int64[,2]) xp_HTM_Cover(input); // the actual HTM C# or C++ or Java or VB code goes here.
Your/My
code
goes here
}
} }
Performance (Beta1)
On a 2.2 Ghz Xeon
Call a Transact SQL function
33μs
Call a C# function
50μs
Table valued function
Array (== table) valued function
1,580 μs
+ per row 42 μs
200 μs
+ per row 27 μs
The Code
CREATE ASSEMBLY ReturnOneA
FROM '\\localhost\C:\ReturnOne.dll'
GO
CREATE FUNCTION ReturnOne_Int( @input INT) RETURNS INT
AS EXTERNAL NAME ReturnOneA:ReturnOneNS.ReturnOneC::ReturnOne_Int
GO
---------------------------------------------- time echo an integer
declare @i int, @j int, @cpu_seconds float, @null_loop float
declare @start datetime, @end datetime
set @j = 0
set @i = 10000
set @start = current_Timestamp
while(@i > 0)
begin
set @j = @j + 1
set @i = @i -1
end
set @end = current_Timestamp
set @null_loop = datediff(ms, @start,@end) / 10.0
set @i = 10000
set @start = current_Timestamp
while(@i > 0)
begin
select @j = dbo.ReturnOne_Int(@i)
set @j = @j + 1
set @i = @i -1
end
set @end = current_Timestamp
set @cpu_seconds = datediff(ms, @start,@end) / 10.0 - @null_loop
print 'average cpu time for 1,000 calls to ReturnOne_Int was ' +
str(@cpu_seconds,8,2)+ ' micro seconds'
Program in DB
in different
language (Tsql)
calling function
Function
written in C#
inside the DB
using System;
using System.Data;
using System.Data.SqlServer;
using System.Data.SqlTypes;
using System.Runtime.InteropServices;
namespace ReturnOneNS {
public class ReturnOneC {
public static int ReturnOne_Int(int input) {
return input;
}
}
}
What Is the Significance?
No more inside/outside DB dichotomy.
You can put your code near the data.
Indeed, we are letting users put personal
databases near the data archive.
This avoids moving large datasets.
Just move questions and answers.
Meta-Message
Trying to fit science data into databases
When it does not fit, something is wrong.
Look for solutions
Many solutions come from OR extensions
Some are fundamental engine changes
More structure in DB
Richer operator sets
Better statistics
© 2002 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
evaluations
© 2003 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only.
MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.