I recently had to watch my mom die of cancer. Like many who have seen the same thing -- had the same experiences -- I find it difficult to put into words the confusion, anger and frustration with a disease so unrelenting. I also watched, at the same time, the 'waste' and ridiculous near-fraudulent behavior surrounding the implementation of Microsoft Amalga at the University of WA Medical Center. During the last months of her life I was an employee of the University of Washington Medical Center working at Harborview (the primary trauma center serving the Pacific Northwest and the 'county hospital' for King County). As an employee at the UW, from July 2010 to May 2011, I worked directly and indirectly with a system called Microsoft Amalga. You might not know what Microsoft Amalga is. Frankly, given the number of re-brandings of this system in the last few years, I doubt Microsoft knows what it is. While interviewing at Microsoft recently, I received puzzled stares when mentioning it. It seems, other than the marketing pabulum available on the web, no one really knows much about this system.

What is Microsoft Amalga? Microsoft would describe it as a clinical data repository which supports the querying and analysis of patient data. Basically, it is a unified data warehouse for the health care system. What is Microsoft Amalga? It is a disaster and a waste of money! To say it has flaws is to seem trite. What are its main features?

1. A parser/ETL system that is more complex (complex being bad in this case) than any Rube Goldberg device.
2. A set of sub-standard HL-7 APIs.
3. A group of 'admin tools' which are badly designed from a user interface perspective and badly performing from an engineering perspective.
4. A serialization model that is essentially like calling a group of spreadsheets a database.
There is an assumption, amongst the political class in this country, that someone who self-identifies as 'liberal', and is in public service, must have a 'sense of right and wrong' -- a higher calling of public service and stewardship of the people's money. My experiences at the University of Washington since July of last year (2010) have gone a long way toward showing this to be FALSE. This would be 'expected' by many, but this was health care; this was work where people's lives depend upon the UW making wise decisions -- not just about care, but also about the way in which money is spent. Having watched my mom die recently of metastatic cancer and having seen how few financial resources she had towards the end, it made me sick to know that even ONE DIME of my money went to Microsoft for the abomination called Amalga. I would like to believe that this was an exception. In February of 2011 I switched jobs to a grant-based role working for the Dept. of Biomedical and Healthcare Informatics on a project called CBR. A central tool that we 'must use' was i2b2 -- a clinical informatics tool developed by Harvard University to allow researchers to do 'de-identified' queries (sometimes referred to as 'noodling'). There are many features of i2b2 that are flawed, few that make much sense, and a general model of construction that aligns with NOTHING I learned in any of my computer science classes -- what's worse, it does not even look like a system that could survive outside of publicly funded software. The following essay is a remembrance, a call to arms and a 'hope' that someone might make the right decision with respect to one or both of these systems before they spend precious time and money on them -- I am assuming that most are aware of the constraints of current budgets and the tough decisions ordinary Americans (like my Mom, for instance) are having to make every day. AS WITH MEDICINE, SEEK A SECOND OPINION!
If you think I am wrong, review both the Amalga and i2b2 systems with expert computer scientists and engineers in the fields of data warehousing and data storage. These are two very different systems (i2b2 and Amalga), but they both have one thing in common -- it is doubtful either would pass a consistent review or set of tests for scalability, usability, efficiency, TCO (total cost of ownership) or accuracy/audit worthiness. While it is true that i2b2 is 'free', this means less when you consider the labor cost of fixing and supporting its equally bad architecture.

Disclaimers:

1. Most of the men and women I have worked with here at the UW (with the exception of some management) have been hard working, intelligent and critical assets to the UW. In my view, the UW is lucky to have them and it is sad that they spend so much time keeping Amalga from crashing (i2b2 isn't really in the same boat as far as criticality -- if it were expected to be 'up' all the time then I suspect the UW would need to invest as much as Harvard has had to in order to keep it from blowing up).
2. I don't claim to know everything. In fact, I think what is troubling here is not simply my 'opinions' but the obvious absence of any structured, consistent review of either system -- I don't know WHAT the CDROC committee does, but clearly it is having no impact upon Amalga's quality and delivery. I was told, when I was hired, that Amalga had been tested/evaluated. This seems disingenuous. The best part is you DON'T need to take my word for it; go beyond the Amalga marketing materials and do your own discovery -- something I wish the UWMC had done.
3. Though I can draw a distinction between the direct payment of money to Microsoft versus the indirect costs of working with i2b2, I don't think i2b2 should be off the hook simply because it seems 'free' -- nothing is free if you must expend unreasonable resources in order to support it.

Section 1: Microsoft Amalga

High Level: Amalga Design Flaws

1.
Amalga comes with a few data models built in. Some of these models are rather naively based on HL-7. HL-7 is an EDI messaging format. Message formats are supposed to be redundant -- data warehouses (or clinical repositories or data aggregators or whatever the Microsoft marketing folks are calling this terrible system now) ought to store data accurately, efficiently and with as little redundancy as possible. Because of the way azADT is designed, there are MANY fields which NEVER get populated.
2. All Amalga primary keys (clustered indexes) are NONSEQUENTIAL random strings. I say random; really they are a 'hash' of actually meaningful information like MRN (Medical Record Number), date of service, and proprietary system episode or visit keys. This use of a non-sequential string key in a high throughput system hammers SANs and is very expensive. One estimate I performed before resigning showed that 50% of all unique string tokens in the UWMC data warehouse were composed of Amalga IDs.
3. The 'flexible' part of Amalga's data model amounts to database building by spreadsheet. Because there is very little or no sound data design in Amalga, the expansion of databases usually amounts to importing large, unwieldy and non-dimensionalized data.
4. Amalga, because of the way the azAEID database is designed, is very difficult if not IMPOSSIBLE to federate. What does this mean? It means that Amalga, to quote a former 'manager', really is just a 'giant garbage can for data'. The core tables get bigger and bigger and there is no rational or feasible way to archive data.
5. The HL-7 parsers and parser development are a joke. The tools which support HL-7 interface work are buggy and in general do not live up to expectations. Someone told me that Microsoft marketers sell Amalga as a system where the HL-7 interfaces can be developed in 4 hours.
I don't know in what bizarre universe this is true, but the estimate I was given initially when I started on the team was 80-100 hours -- not 4. I am very productive and I was working on a SIMPLE referral interface. This interface took at least 80 hours to complete (not including testing). When you have a problem building an interface in Amalga, you pretty much have to start from scratch. There are bugs in the SEE (Script Engine Explorer: the tool used to help build these interfaces) which cause it to crash often.
6. Amalga was almost ALWAYS in a failure state when I was there. During the winter we had serious problems with replication and missing data -- Microsoft 'support' was difficult to find. I suppose if you want to buy an enterprise system and get very little help, then buy Amalga; you can feel left out in the cold.
7. The Amalga 'ADO.NET connector' does not work as advertised. Their materials state that this connector can work with LDAP/AD; nothing of this is true.
8. I don't believe Amalga was ever properly tested prior to being deployed at the UWMC. I was told they had 'vetted' the system in the year and a half before my arrival -- I don't know what was evaluated, but it WAS NOT Amalga's ability to reliably ingest and store large volumes of data.
9. I proved that IF you moved from the 'post relational' Amalga way to a Kimball model, you would reduce your data footprint by roughly 50%. If you further broke free of the 'Amalga ID' this reduction MIGHT be as big as 70%. What does this mean? One estimate for data warehousing drive needs for the next year or so at UWMC for Amalga storage was 258 TB (to provide the uninitiated with a frame of reference, in 2005 I worked for a company that had HALF this much data and at the time it was one of the largest SQL Server databases in the USA).
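The trouble with non-sequential keys (point 2 above) is easy to demonstrate: a hash-based key lands at an arbitrary position in the sorted clustered index, so nearly every insert forces the storage engine to shuffle existing pages, while a sequential key simply appends at the end. A minimal sketch -- the SHA-1 hash here is hypothetical, standing in for the proprietary 'Amalga ID' generator:

```python
import bisect
import hashlib

sequential, random_keyed = [], []
mid_inserts = 0

for i in range(1000):
    # Sequential surrogate key: always lands at the end of the sorted index.
    bisect.insort(sequential, i)

    # Hash-based ("guid-like") key: lands at an arbitrary position.
    k = hashlib.sha1(str(i).encode()).hexdigest()
    pos = bisect.bisect_left(random_keyed, k)
    if pos < len(random_keyed):
        mid_inserts += 1  # this insert lands in the middle of the index
    random_keyed.insert(pos, k)

# Nearly all of the 1,000 hash-keyed inserts land mid-index; in a clustered
# B-tree each of those is a candidate page split.
print(mid_inserts)
```

On a real clustered index, every one of those mid-index landings risks a page split and random I/O against the SAN, which is exactly the expense described above.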
Drives, energy and computer equipment are NOT getting cheaper -- they are actually beginning to respond to the same inflationary pressures that exist in the wider economy. To shrug off a 50-70% reduction in hardware costs seems reckless and not in keeping with the stewardship of the people's money. Microsoft Amalga conforms to NONE of the standard features you would expect to find in a contemporary, high-volume, large-scale data warehouse. It does NOT have dimensions. Data in Amalga (the exceptions, of course, are the core azyxxi HL-7 databases and tables, which are BAD for their own reasons) is stored as a collection of spreadsheets. The data is redundant and could be improved by adopting standard approaches to data storage. As such, Amalga has VERY prohibitive features from a TCO (total cost of ownership) perspective. See below an estimate for JUST the labor costs of managing 7500 patients in Amalga for 1 year. The ADO.NET connector does not work, see below. Let's say you have 1,000,000 patients per year. Assume a roughly $9K cost per 7500 patients (not including servers and Amalga licenses); this is a variable cost of 1.2 million dollars per year for just the maintenance. Add in the fixed costs of licenses and servers and the REAL cost of Amalga for a 1-million-patient-a-year enterprise is closer to 10-15 million dollars for the first 5 years... This may not be the MOST expensive solution, but given how unwieldy and under-performing the system is, it also does not seem like a bargain. The HL-7 parsing is NO substitute for Cloverleaf. Cloverleaf is not perfect, but at least it has the right to call itself an HL-7 integration engine. I estimated, several months ago, that the true cost of building parsers in Amalga IS NOT the 4 hours they tell future customers. In fact, it is non-deterministic. The SEE (Script Engine Explorer) is SO buggy, you are often faced with doing the same tasks OVER and OVER again.
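The maintenance arithmetic above is linear, so it is easy to check. A back-of-envelope sketch -- the helper function is mine; only the $9K-per-7,500-patients figure comes from the estimate above:

```python
def annual_maintenance_cost(patients_per_year: int,
                            cost_per_block: int = 9_000,
                            block_size: int = 7_500) -> float:
    """Scale the per-block labor estimate linearly to a patient volume."""
    return patients_per_year * cost_per_block / block_size

# 1,000,000 patients/year at roughly $9K per 7,500 patients:
print(annual_maintenance_cost(1_000_000))  # -> 1200000.0
```

That is the $1.2 million variable cost per year quoted above, before any fixed costs for licenses and servers are added in.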
God forbid you need to update a parser; you are better off starting from scratch. We were told in October (by Microsoft) that with the 'next version' building HL-7 interfaces would be 'easier' -- see comments below from someone who had to work with the new system. Amalga is NOT federated. What does this mean? It means that there is no way to archive off or split the clinical repository into smaller sub-groups. Normally, in large scale data warehouses, people choose MEANINGFUL partition points in the data space. Logically, in the hospital context, Facility, Location and Date of Service are LOGICAL and semantically MEANINGFUL ways of breaking very large monolithic databases into smaller units. Microsoft's advice to our management was 'buy another license'... This is a really great solution (sarcasm). Amalga, because of its design, is almost impossible to audit. This may seem like a slight concern to some, but to ANYONE familiar with the risks and opportunities of healthcare informatics, this is not a good feature. In a system with conforming dimensions, it is possible to ask existential questions LONG before you need to query core facts. For instance, you can ask what procedure codes exist or what clinics exist without writing a very inefficient query against a 'post relational' table. Because of its (Amalga's) design, there is NO easy auditing query. One of my reasons for leaving the UW Amalga team was that I felt Microsoft had a responsibility to assist us in ensuring data quality by designing a system which allowed for these checks. Amalga IS a VENDOR LOCK-IN system. The likelihood, given the design of the 'amalga id' and azAEID structures, is VERY low that any hospital system could easily switch to something else. If Microsoft gets this terrible system into a hospital system, it is UNLIKELY that the hospital would be able to disentangle itself.
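Beyond lock-in, the 'amalga id' design raises a uniqueness problem: if the visit-level key is a deterministic function of MRN and date of service, two unrelated hospitals that happen to hold the same MRN will generate identical keys. A minimal sketch -- the SHA-1 function below is hypothetical, standing in for Amalga's proprietary key generator; the hospital names and MRN match the example discussed in this essay:

```python
import hashlib

def make_eid(mrn: str, date_of_service: str) -> str:
    """Hypothetical deterministic visit-level key: a hash of MRN + service date."""
    return hashlib.sha1(f"{mrn}|{date_of_service}".encode()).hexdigest()

# Same MRN, same date of service, three different hospitals:
eid_a = make_eid("U2345445", "2010-09-02")  # University of Wisconsin Medical Center
eid_b = make_eid("U2345445", "2010-09-02")  # University of Washington Medical Center
eid_c = make_eid("U2345445", "2010-09-02")  # Union-Wilmington Medical Center

# Deterministic inputs -> identical keys: NOT unique across facilities.
assert eid_a == eid_b == eid_c
```

Any deterministic function of these inputs behaves the same way, no matter how 'guid-like' its output looks.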
It doesn't take a math genius or Warren Buffett to figure out that IF Microsoft is successful in 'pushing' this bad system it could become a 1 billion dollar product line in 10 years. If they get into the DOD or VA systems, it is unlikely that Amalga could be stopped. Amalga's index-to-data space ratios are atrocious in most cases. And yet, despite the TERRIBLE indexing schemes (or lack of any sense of designing proper indexes), the queries are still more often than not slower than you would expect. Here is a snippet of data/index space analysis performed on the live system. The Amalga ID is a nightmare. It makes for a TERRIBLE clustered index, because no matter how you manipulate the 'guid-like' key, it will NEVER be an efficient sequential key. The excuse (and a weak one at that) is that it will be 'unique' for external data sharing. I would like to believe this, but since it is only 'guid-like' and since I can ONLY assume it is deterministic, it seems UNLIKELY that this argument is true. Let's say I have a deterministic hash function:
Hospital (A) is the University of Wisconsin Medical Center --> UWMC
Hospital (B) is the University of Washington Medical Center --> UWMC (I don't know if such a hospital exists)
Hospital (C) is the Union-Wilmington Medical Center --> UWMC (this one is made up)
MRNs are not unique across organizations, and they are OFTEN, via roll-over, not unique even within a hospital EMR. If each hospital has the MRN U2345445 for a visit on Sept 2, 2010, the 'amalga key generator' would be using precisely the same inputs at each hospital -- for the same patient -- to generate the EID (visit level key). If it is not deterministic, then you have a whole slew of other problems. If it is deterministic, then by definition the same EID would be generated for each institution. How is this cross-facility/system unique? See question/response below with respect to the Amalga ID.

Section 2: i2b2

High level design flaws:

A).
Ontology Storage: The 'ontology' or structured classification scheme is stored primarily in 2 tables (in 2 different databases) in i2b2. The 'tree' itself is stored as all possible paths in the tree. For instance, take the following simple ontology. The 'i2b2' way of storing it is: for each terminal node, store the complete path from root node to terminal node. For formatting reasons, store it 3 times (yes, in both the metadata and concept_dimension, this same path is stored multiple times). For this, it stores (and more than once!) the following:
/Automobiles/Trucks/Dodge/
/Automobiles/Trucks/Chevy/
/Automobiles/Cars/Color/Black/Speed/
/Automobiles/Cars/Color/Black/Steering/
/Automobiles/Cars/Color/Red/Mustang/
/Automobiles/Cars/Color/Red/Corvette/
/Automobiles/Cars/Cost/>30K/BMW/
/Automobiles/Cars/Cost/5-30K/Used/
/Automobiles/Cars/Cost/<5K/Used/
/Automobiles/Cars/Cost/<5K/Wreck/
The standard way of storing topologies (of which an ontology is a structural sub-class) is as one of the following: 1. Edge list (my preferred technique); 2. Jagged array / linked list (also workable). These two techniques strike a balance between mathematical complexity and usability. As an edge list, the above would look like this:
Automobiles --> Trucks
Trucks --> Chevy
Trucks --> Dodge
Automobiles --> Cars
Cars --> Color
Cars --> Cost
Color --> Red
Color --> Black
Red --> Mustang
Red --> Corvette
Black --> Speed
Black --> Steering
Cost --> >30K
Cost --> 5-30K
Cost --> <5K
>30K --> BMW
5-30K --> Used
<5K --> Used
<5K --> Wreck
My edge list cost: 19 x 2 ==> 38 memory units (an abstraction; ceteris paribus, assume node size is equivalent). i2b2 cost: 46 units! The difference between these two seems small. Please understand -- THIS IS NOT A LINEAR RELATIONSHIP! With healthcare data being MUCH more complex than my example, this difference becomes much worse.
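The two costs above can be computed directly. A small sketch, counting one 'memory unit' per stored node reference, as in the abstraction above:

```python
# i2b2-style path enumeration vs. an edge list, on the toy ontology above.

paths = [
    "/Automobiles/Trucks/Dodge/",
    "/Automobiles/Trucks/Chevy/",
    "/Automobiles/Cars/Color/Black/Speed/",
    "/Automobiles/Cars/Color/Black/Steering/",
    "/Automobiles/Cars/Color/Red/Mustang/",
    "/Automobiles/Cars/Color/Red/Corvette/",
    "/Automobiles/Cars/Cost/>30K/BMW/",
    "/Automobiles/Cars/Cost/5-30K/Used/",
    "/Automobiles/Cars/Cost/<5K/Used/",
    "/Automobiles/Cars/Cost/<5K/Wreck/",
]

# Path enumeration: every root-to-leaf path stores every node on it.
path_cost = sum(len(p.strip("/").split("/")) for p in paths)

# Edge list: one (parent, child) pair per distinct edge.
edges = set()
for p in paths:
    nodes = p.strip("/").split("/")
    edges.update(zip(nodes, nodes[1:]))
edge_cost = 2 * len(edges)

print(path_cost, len(edges), edge_cost)  # -> 46 19 38
```

And note this counts each path only once; store the same path in metadata and concept_dimension (and again for formatting) and the path-enumeration cost multiplies, while the edge list cost stays fixed.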
My edge list version of the 'i2b2 ontology' took SIGNIFICANTLY less space, did not have an asinine 'only 700 characters in length' requirement and is orthogonal/generic (good in systems which live in the real world and not fantasy land). As stated above, the maximum string length of the 'ontology' paths is 700 characters. It seems ridiculous to have to explain why this is bad. What's worse is the SAME ontology is stored in MULTIPLE fields in different tables -- wanna say update inconsistency? This ontology model has the following features: a) it is inefficient, b) it is difficult to manage, and c) it ONLY WORKS if the semantic ontology path depth never exceeds 700 characters (not a great feature of a system designed for the complex sciences of medicine and biology).
B). Over-design / difficult to disentangle: The documentation and install bits are overbearing. In reality, the primary use case for this at institutions other than Harvard is essentially as a set theory engine. As such, there are very few tables that are 'needed'. An analysis of LIVE data from our i2b2 system showed MOSTLY empty tables and empty structures. This does not create much of a data footprint cost, but from a design perspective it is the equivalent of the 8 headlights on the Family Truckster -- not really necessary and probably counterproductive. Over-design has costs in documentation and the risk of engendering FUTURE design flaws. i2b2 reminds me of the freeways in Seattle during the late 70's -- many of them went NOWHERE. i2b2 is 'sold' as a cellular model, which would imply an a la carte architecture. This is FAR from the truth. While a person COULD re-program and re-design portions of it to make it more flexible (one reason for my resignation was the desire -- in order to meet deadlines -- to fix and remedy some of the worst aspects of this), it is frowned upon. I think EGO rules i2b2.
C). Indexing is amateurish: Load factor of 7-10 times input data footprint.
I was told by a PhD at the UW not to be concerned by this. I'm glad a PhD in informatics allows someone to discount a memory leak. Please, download and look (you can download i2b2, although it has a strange version of the OFT MENTIONED 'open source' licensing). Observation_Fact, for i2b2 1.6, has so many indexes that SQL SERVER rejected the DDL. There were indexes on almost every field individually (though given how empty some of these fields were, it made little or no sense), but on top of this there is a covering index on EVERY field! A DB tuning class is needed -- stat.
D). UI is uninspiring and derivative: Set theory tabs; drag a node of the tree; it allows basic AND, OR and NOT operations. It's not bad, but not that impressive either. Plus, if you DO NOT update your ontology to reflect new concepts you will suffer from ORPHAN CONCEPTS... I do not need to explain how difficult it can be to maintain a system with this feature. What it is missing is dynamic connectedness. Tableau and systems like Tableau support a much more dynamic and less labor intensive means of noodling. A person could BUILD a set theory engine that would outperform i2b2 for large scale systems, and this alternative could be completed in 2-3 weeks. You could then call it a 'cell' and everyone could save face. This correct approach was not acceptable.
E). Method for de-id is limited and buggy: a) i2b2 randomizes the counts for results below a certain threshold, and b) if you query multiple times you are 'locked out'. The theory: use a randomizing function and then block users from using statistical narrowing. The problem: this doesn't really help when the populations you are dealing with drop below a certain level. At best it then becomes a noise generator.
F).
Data is stored in an unjustifiably redundant way (especially given i2b2's nature as a noodling/de-id tool): At first you might think, "Hey, it's as simple as storing concepts in Observation_Fact and keeping the metadata and concept_dimension consistent (remember, the ontology is stored and used in multiple places -- can I say update anomaly again?)". However, take a look at the hard-coded and stored SQL in the I2B2 table -- very revealing. I think the first thing a person should do once they download the i2b2 system is examine the following table -- i2b2metadata.I2B2 -- and pay attention to the generated SQL and the closure anomalies that exist. This kind of code undermines a modern database, creates 'false' sub-queries and is frankly VERY BAD PRACTICE. It is also unnecessary when you consider HOW SQL could be generated in the middle tier. There are Patient, Provider and Visit dimensions. I am wary of using the word 'dimension', because even though they are 'dimensions' of the data, the actual concept codes that represent the stored de-id values are ALSO stored there. So, if I want to use i2b2, I have to store the concept code in Observation_Fact, but I must ALSO store the same data in one of the core dimensions.

Conclusions: How does this happen?

1. Dr. Harvard (or docs PRACTICING computer science without a license): If I decided to begin practicing 'neighborhood cardiology' I would be arrested. I have no MD degree. I have no license to practice. And yet, in the healthcare system today, there are doctors selling themselves as 'experts' who know little or nothing of the discipline of computer science. Amalga was not designed by folks who understand complexity and data structures -- it was designed by docs, with some software developers, and it is the outcome of malpractice. I will make the docs who invented this a promise: if they stop practicing computer science, I will not pretend to be an ER doc.
We need to ask better questions of ANYONE selling an idea and not simply assume that because they are docs they must be good at everything -- sorry, but this is not generally the case. I joked once that given the current culture of hospitals, a doc from 'Harvard' could sell electronic mail as his or her own invention -- it might even make it to contract review before someone calls bullshit.
2. Bureaucracy -- good systems and good ideas don't make it: The circuitous route by which systems get built in a large healthcare system is both byzantine and painful. Instead of taking advantage of 'crowd' intelligence, the result is a gray, mediocre half-measure.
3. No cost accountability: I don't think there is currently any solid cost review or cost accounting with respect to these large healthcare systems. In fairness, this is common with MANY enterprise systems in and outside of healthcare. We must, as professionals, develop better and more objective metrics for measuring the performance and benefits of systems like Amalga and i2b2.
4. No transparency in WHY a system is selected: Amalga is one of the least transparent systems I have come across. It's ironic, because it is a terribly simplistic system with only 3 legs to the stool. Leg 1: the HL-7 parsers (crap). Leg 2: the Amalga ID and azAEID (super expensive and terribly convoluted crap). Leg 3: the 'console' or UI tools -- I used to beat myself up for cutting corners on UI development and design; now that I've seen Amalga and considered the raw dollars it has consumed, I am much more forgiving of myself. I doubt ONE of the UI tools (including the console) would pass a thorough review by an HCI (Human Computer Interaction) specialist. The UI tools which support development are buggy and have memory leaks. I figured out months ago that you MUST shut down your development environment periodically in order to avoid catastrophic memory leaks which can (and do) lead to lost parser work.
I don't really know why or how the UWMC picked Amalga or i2b2, but what I am fairly certain of is that NO competent computer scientist dug very deeply into either system. If they had, I doubt these systems would ever have been acquired.
5. Informatics IS NOT computer science, at least not yet: I have, in the last year, met MANY PhDs in Informatics. I remember my brother-in-law telling me about the very difficult tests he had to take (in the Computer Science Department) at Indiana University in Bloomington in order to move from the Masters program to the PhD portion of the program. The tests he described seemed extremely difficult -- I don't know if I would have passed (my bro-in-law was a graduate of Rose-Hulman, so not a light-weight when it comes to the science of information). I have no reason to believe -- given all the blank stares, given the fact that these 'experts' can look at Amalga and i2b2 and not immediately be concerned -- that the PhD in Informatics provides any real background in the foundational skills of being an information scientist in the biological and medical realm. Maybe Informatics has a place, but only if the stewards of the profession take seriously the basic skills they need to do their job. This is too bad. No one needs to take my word for it. If you are a hospital CIO, do yourself a favor -- evaluate Amalga or i2b2 yourself and do a good job of it. Microsoft sales folks will tell you about all the 'success' stories -- if they tell you the UWMC is a success they are being dishonest. Before you 'buy' Amalga, load it with 5-10 TB of test data. If Microsoft says 'we can't let you do that', then you should RUN from this disaster as far as you can. If they do consent, do your own analysis. Look at ingestion speed and replication, and run ingestion while you are writing queries.
In all likelihood, you will have to purchase servers at this scale, or you may have them already. Bottom line: Amalga doesn't get really ugly until it's almost too late to turn back. But it is never too late to do the right thing! So, in recap: at small scales of data Amalga is unnecessary and a burden; at large scales of data Amalga (and this applies to i2b2 as well) is unwieldy, buggy, provably inefficient and just a waste of money, time and human resources.

Miscellaneous

The joy of querying Amalga... (JOY is SARCASM here). The confusion over querying Amalga: our expert on our team gave me 3 explanations in 3 weeks during September of last year. The funny thing is, Microsoft originally recommended that MRN be thrown away. I'm glad that is one recommendation that was NOT followed (despite the most recent statement of what is 'correct' when querying Amalga). Converting from Amalga to Kimball.
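The closing note -- converting from Amalga to Kimball -- can be made concrete with toy data. A wide, spreadsheet-style table repeats every descriptive string on every row; a Kimball star schema stores each distinct dimension member once and keeps only small integer keys in the fact table. The rows, names and byte accounting below are entirely made up for illustration; the roughly-50% reduction quoted earlier in this essay came from live UWMC data, not from this sketch:

```python
# Denormalized, 'post relational' rows: (patient MRN, clinic name, procedure).
rows = [
    ("U2345445", "Harborview Cardiology", "ECHOCARDIOGRAM, COMPLETE"),
    ("U2345445", "Harborview Cardiology", "EKG, ROUTINE"),
    ("U7788123", "Harborview Cardiology", "EKG, ROUTINE"),
    ("U7788123", "UWMC Oncology", "CHEMOTHERAPY ADMIN, IV"),
] * 1000  # simulate volume: the same strings stored over and over

flat_bytes = sum(len(a) + len(b) + len(c) for a, b, c in rows)

# Kimball star schema: one row per distinct dimension member, plus a fact
# table holding three 4-byte surrogate keys per row.
patients   = {a for a, _, _ in rows}
clinics    = {b for _, b, _ in rows}
procedures = {c for _, _, c in rows}
dim_bytes  = sum(map(len, patients | clinics | procedures))
fact_bytes = len(rows) * 3 * 4

star_bytes = dim_bytes + fact_bytes
print(flat_bytes, star_bytes)  # the flat layout is several times larger
```

Real clinical data has far more repeated descriptive text per row than this toy, which is why moving from spreadsheet-style storage to conforming dimensions shrinks the footprint so dramatically.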