I recently had to watch my mom die of cancer. Like many who have seen the same thing -- had the same experiences -- I find it difficult to put into words the confusion, anger and frustration with a disease so unrelenting. I also watched, at the same time, the 'waste' and ridiculous near-fraudulent behavior surrounding the implementation of Microsoft Amalga at the University of WA Medical Center. During the last months of her life I was an employee of the University of Washington Medical Center working at Harborview (the primary trauma center serving the Pacific Northwest and the 'county hospital' for King County). As an employee at the UW, from July 2010 to May 2011, I worked directly and indirectly with a system called Microsoft Amalga. You might not know what Microsoft Amalga is. Frankly, given the number of re-brandings of this system in the last few years, I doubt Microsoft knows what it is. While interviewing at Microsoft recently, I received puzzled stares when mentioning it. It seems, other than the marketing pabulum available on the web, no one really knows much about this system.

What is Microsoft Amalga? Microsoft would describe it as a clinical data repository which supports the querying and analysis of patient data. Basically, it is a unified data warehouse for the health care system. What is Microsoft Amalga? It is a disaster and a waste of money! To say it has flaws is to seem trite. What are its main features?

1. A parser/ETL system that is more complex (complex being bad in this case) than any Rube Goldberg device.
2. A set of sub-standard HL-7 APIs.
3. A group of 'admin tools' which are badly designed from a user interface perspective and badly performing from an engineering perspective.
4. A serialization model that is essentially like calling a group of spreadsheets a database.
There is an assumption, amongst the political class in this country, that someone who self-identifies as 'liberal', and is in public service, must have a 'sense of right and wrong' -- a higher calling of public service and stewardship of the people's money. My experiences at the University of Washington since July of last year (2010) have gone a long way toward showing this to be FALSE. This would be 'expected' by many, but this was health care; this was work where people's lives depend upon the UW making wise decisions -- not just about care, but also about the way in which money is spent. Having watched my mom die recently of metastatic cancer and having seen how few financial resources she had towards the end, it made me sick to know that even ONE DIME of my money went to Microsoft for the abomination called Amalga. I would like to believe that this was an exception. In February of 2011 I switched jobs to a grant-based role working for the Dept. of Biomedical and Healthcare Informatics on a project called CBR. A central tool that we 'must use' was i2b2 -- a clinical informatics tool developed by Harvard University to allow researchers to do 'de-identified' queries (sometimes referred to as 'noodling'). There are many features of i2b2 that are flawed, few that make much sense, and a general model of construction that aligns with NOTHING I learned in any of my computer science classes -- what's worse, it does not even look like a system that could survive outside of publicly funded software. The following essay is a remembrance, a call to arms and a 'hope' that someone might make the right decision with respect to one or both of these systems before they spend precious time and money on them -- I am assuming that most are aware of the constraints of current budgets and the tough decisions ordinary Americans (like my Mom, for instance) are having to make every day. AS WITH MEDICINE, SEEK A SECOND OPINION!
If you think I am wrong, review both the Amalga and i2b2 systems with expert computer scientists and engineers in the fields of data warehousing and data storage. These are two very different systems (i2b2 and Amalga), but they both have one thing in common -- it is doubtful either would pass a consistent review or set of tests for scalability, usability, efficiency, TCO (total cost of ownership) or accuracy/audit worthiness. While it is true that i2b2 is 'free', this means less when you consider the labor cost of fixing and supporting its equally bad architecture.

Disclaimers:

1. Most of the men and women I have worked with here at the UW (with the exception of some management) have been hard working, intelligent and critical assets to the UW. In my view, the UW is lucky to have them and it is sad that they spend so much time keeping Amalga from crashing (i2b2 isn't really in the same boat as far as criticality -- if it were expected to be 'up' all the time then I suspect the UW would need to invest as much as Harvard has had to in order to keep it from blowing up).
2. I don't claim to know everything. In fact, I think what is troubling here is not simply my 'opinions' but the obvious absence of any structured, consistent review of either system -- I don't know WHAT the CDROC committee does, but clearly it is having no impact upon Amalga's quality and delivery. I was told, when I was hired, that Amalga had been tested/evaluated. This seems disingenuous. The best part is you DON'T need to take my word for it; go beyond the Amalga marketing materials and do your own discovery -- something I wish the UWMC had done.
3. Though I can draw a distinction between the direct payment of money to Microsoft versus the indirect costs of working with i2b2, I don't think i2b2 should be off the hook simply because it seems 'free' -- nothing is free if you must expend unreasonable resources in order to support it.

Section 1: Microsoft Amalga

High Level: Amalga Design Flaws

1.
Amalga comes with a few data models built in. Some of these models are rather naively based on HL-7. HL-7 is an EDI messaging format. Message formats are supposed to be redundant -- data warehouses (or clinical repositories or data aggregators or whatever the Microsoft marketing folks are calling this terrible system now) ought to store data accurately, efficiently and with as little redundancy as possible. Because of the way azADT is designed, there are MANY fields which NEVER get populated.
2. All Amalga primary keys (clustered indexes) are NONSEQUENTIAL random strings. I say random; really they are a 'hash' of actually meaningful information like MRN (Medical Record Number), date of service, and proprietary system episode or visit keys. This use of a non-sequential string key in a high throughput system hammers SANs and is very expensive. One estimate I performed before resigning showed that 50% of all unique string tokens in the UWMC data warehouse were composed of Amalga IDs.
3. The 'flexible' part of Amalga's data model amounts to database building by spreadsheet. Because there is very little or no sound data design in Amalga, the expansion of databases usually amounts to importing large, unwieldy and non-dimensionalized data.
4. Amalga, because of the way the azAEID database is designed, is very difficult if not IMPOSSIBLE to federate. What does this mean? It means that Amalga, to quote a former 'manager', really is just a 'giant garbage can for data'. The core tables get bigger and bigger and there is no rational or feasible way to archive data.
5. The HL-7 parsers and parser development are a joke. The tools which support HL-7 interface work are buggy and in general do not live up to expectations. Someone told me that Microsoft marketers sell Amalga as a system where the HL-7 interfaces can be developed in 4 hours.
I don't know in what bizarre universe this is true, but the estimate I was given initially when I started on the team was 80-100 hours -- not 4. I am very productive and I was working on a SIMPLE referral interface. This interface took at least 80 hours to complete (not including testing). When you have a problem building an interface in Amalga, you pretty much have to start from scratch. There are bugs in the SEE (Script Engine Explorer: the tool used to help build these interfaces) which cause it to crash often.
6. Amalga was almost ALWAYS in a failure state when I was there. During the winter we had serious problems with replication and missing data -- Microsoft 'support' was difficult to find. I suppose if you want to buy an enterprise system and get very little help, then buy Amalga; you can feel left out in the cold.
7. The Amalga 'ADO.NET connector' does not work as advertised. Their materials state that this connector can work with LDAP/AD; nothing of this is true.
8. I don't believe Amalga was ever properly tested prior to being deployed at the UWMC. I was told they had 'vetted' the system in the year and a half before my arrival -- I don't know what was evaluated, but it WAS NOT Amalga's ability to reliably ingest and store large volumes of data.
9. I proved that IF you moved from the 'post relational' Amalga way to a Kimball model, you would reduce your data footprint by roughly 50%. If you further broke free of the 'Amalga ID' this reduction MIGHT be as big as 70%. What does this mean? One estimate for data warehousing drive needs for the next year or so at UWMC for Amalga storage was 258 TB (to provide the uninitiated with a frame of reference, in 2005 I worked for a company that had HALF this much data and at the time it was one of the largest SQL Server databases in the USA).
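The trouble with non-sequential keys (point 2 above) is easy to demonstrate: a hash-based key lands at an arbitrary position in the sorted clustered index, so nearly every insert forces the storage engine to shuffle existing pages, while a sequential key simply appends at the end. A minimal sketch -- the SHA-1 hash here is hypothetical, standing in for the proprietary 'Amalga ID' generator:

```python
import bisect
import hashlib

sequential, random_keyed = [], []
mid_inserts = 0

for i in range(1000):
    # Sequential surrogate key: always lands at the end of the sorted index.
    bisect.insort(sequential, i)

    # Hash-based ("guid-like") key: lands at an arbitrary position.
    k = hashlib.sha1(str(i).encode()).hexdigest()
    pos = bisect.bisect_left(random_keyed, k)
    if pos < len(random_keyed):
        mid_inserts += 1  # this insert lands in the middle of the index
    random_keyed.insert(pos, k)

# Nearly all of the 1,000 hash-keyed inserts land mid-index; in a clustered
# B-tree each of those is a candidate page split.
print(mid_inserts)
```

On a real clustered index, every one of those mid-index landings risks a page split and random I/O against the SAN, which is exactly the expense described above.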
Drives, energy and computer equipment are NOT getting cheaper -- they are actually beginning to respond to the same inflationary pressures that exist in the wider economy. To shrug off a 50-70% reduction in hardware costs seems reckless and not in keeping with the stewardship of the people's money. Microsoft Amalga conforms to NONE of the standard features you would expect to find in a contemporary, high-volume, large-scale data warehouse. It does NOT have dimensions. Data in Amalga (the exceptions, of course, are the core azyxxi HL-7 databases and tables, which are BAD for their own reasons) is stored as a collection of spreadsheets. The data is redundant and could be improved by adopting standard approaches to data storage. As such, Amalga has VERY prohibitive features from a TCO (total cost of ownership) perspective. See below an estimate for JUST the labor costs of managing 7500 patients in Amalga for 1 year. The ADO.NET connector does not work, see below. Let's say you have 1,000,000 patients per year. Assume a roughly $9K cost per 7500 patients (not including servers and Amalga licenses); this is a variable cost of 1.2 million dollars per year for just the maintenance. Add in the fixed costs of licenses and servers and the REAL cost of Amalga for a 1-million-patient-a-year enterprise is closer to 10-15 million dollars for the first 5 years... This may not be the MOST expensive solution, but given how unwieldy and under-performing the system is, it also does not seem like a bargain. The HL-7 parsing is NO substitute for Cloverleaf. Cloverleaf is not perfect, but at least it has the right to call itself an HL-7 integration engine. I estimated, several months ago, that the true cost of building parsers in Amalga IS NOT the 4 hours they tell future customers. In fact, it is non-deterministic. The SEE (Script Engine Explorer) is SO buggy, you are often faced with doing the same tasks OVER and OVER again.
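The maintenance arithmetic above is linear, so it is easy to check. A back-of-envelope sketch -- the helper function is mine; only the $9K-per-7,500-patients figure comes from the estimate above:

```python
def annual_maintenance_cost(patients_per_year: int,
                            cost_per_block: int = 9_000,
                            block_size: int = 7_500) -> float:
    """Scale the per-block labor estimate linearly to a patient volume."""
    return patients_per_year * cost_per_block / block_size

# 1,000,000 patients/year at roughly $9K per 7,500 patients:
print(annual_maintenance_cost(1_000_000))  # -> 1200000.0
```

That is the $1.2 million variable cost per year quoted above, before any fixed costs for licenses and servers are added in.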
God forbid you need to update a parser; you are better off starting from scratch. We were told in October (by Microsoft) that with the 'next version' building HL-7 interfaces would be 'easier' -- see comments below from someone who had to work with the new system. Amalga is NOT federated. What does this mean? It means that there is no way to archive off or split the clinical repository into smaller sub-groups. Normally, in large scale data warehouses, people choose MEANINGFUL partition points in the data space. Logically, in the hospital context, Facility, Location and Date of Service are LOGICAL and semantically MEANINGFUL ways of breaking very large monolithic databases into smaller units. Microsoft's advice to our management was 'buy another license'... This is a really great solution (sarcasm). Amalga, because of its design, is almost impossible to audit. This may seem like a slight concern to some, but to ANYONE familiar with the risks and opportunities of healthcare informatics, this is not a good feature. In a system with conforming dimensions, it is possible to ask existential questions LONG before you need to query core facts. For instance, you can ask what procedure codes exist or what clinics exist without writing a very inefficient query against a 'post relational' table. Because of its (Amalga's) design, there is NO easy auditing query. One of my reasons for leaving the UW Amalga team was that I felt Microsoft had a responsibility to assist us in ensuring data quality by designing a system which allowed for these checks. Amalga IS a VENDOR LOCK-IN system. The likelihood, given the design of the 'amalga id' and azAEID structures, is VERY low that any hospital system could easily switch to something else. If Microsoft gets this terrible system into a hospital system, it is UNLIKELY that the hospital would be able to disentangle itself.
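Beyond lock-in, the 'amalga id' design raises a uniqueness problem: if the visit-level key is a deterministic function of MRN and date of service, two unrelated hospitals that happen to hold the same MRN will generate identical keys. A minimal sketch -- the SHA-1 function below is hypothetical, standing in for Amalga's proprietary key generator; the hospital names and MRN match the example discussed in this essay:

```python
import hashlib

def make_eid(mrn: str, date_of_service: str) -> str:
    """Hypothetical deterministic visit-level key: a hash of MRN + service date."""
    return hashlib.sha1(f"{mrn}|{date_of_service}".encode()).hexdigest()

# Same MRN, same date of service, three different hospitals:
eid_a = make_eid("U2345445", "2010-09-02")  # University of Wisconsin Medical Center
eid_b = make_eid("U2345445", "2010-09-02")  # University of Washington Medical Center
eid_c = make_eid("U2345445", "2010-09-02")  # Union-Wilmington Medical Center

# Deterministic inputs -> identical keys: NOT unique across facilities.
assert eid_a == eid_b == eid_c
```

Any deterministic function of these inputs behaves the same way, no matter how 'guid-like' its output looks.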
It doesn't take a math genius or Warren Buffett to figure out that IF Microsoft is successful in 'pushing' this bad system it could become a 1 billion dollar product line in 10 years. If they get into the DOD or VA systems, it is unlikely that Amalga could be stopped. Amalga's index-to-data space ratios are atrocious in most cases. And yet, despite the TERRIBLE indexing schemes (or lack of any sense of designing proper indexes), the queries are still more often than not slower than you would expect. Here is a snippet of data/index space analysis performed on the live system. The Amalga ID is a nightmare. It makes for a TERRIBLE clustered index, because no matter how you manipulate the 'guid-like' key, it will NEVER be an efficient sequential key. The excuse (and a weak one at that) is that it will be 'unique' for external data sharing. I would like to believe this, but since it is only 'guid-like' and since I can ONLY assume it is deterministic, it seems UNLIKELY that this argument is true. Let's say I have a deterministic hash function:
Hospital (A) is the University of Wisconsin Medical Center --> UWMC
Hospital (B) is the University of Washington Medical Center --> UWMC (I don't know if such a hospital exists)
Hospital (C) is the Union-Wilmington Medical Center --> UWMC (this one is made up)
MRNs are not unique across organizations, and they are OFTEN, via roll-over, not unique even within a hospital EMR. If each hospital has the MRN U2345445 for a visit on Sept 2, 2010, the 'amalga key generator' would be using precisely the same inputs at each hospital -- for the same patient -- to generate the EID (visit level key). If it is not deterministic, then you have a whole slew of other problems. If it is deterministic, then by definition the same EID would be generated for each institution. How is this cross-facility/system unique? See question/response below with respect to the Amalga ID.

Section 2: i2b2

High level design flaws:

A).
Ontology Storage: The 'ontology' or structured classification scheme is stored primarily in 2 tables (in 2 different databases) in i2b2. The 'tree' itself is stored as all possible paths in the tree. For instance, take the following simple ontology. The 'i2b2' way of storing it is: for each terminal node, store the complete path from root node to terminal node. For formatting reasons, store it 3 times (yes, in both the metadata and concept_dimension, this same path is stored multiple times). For this, it stores (and more than once!) the following:
/Automobiles/Trucks/Dodge/
/Automobiles/Trucks/Chevy/
/Automobiles/Cars/Color/Black/Speed/
/Automobiles/Cars/Color/Black/Steering/
/Automobiles/Cars/Color/Red/Mustang/
/Automobiles/Cars/Color/Red/Corvette/
/Automobiles/Cars/Cost/>30K/BMW/
/Automobiles/Cars/Cost/5-30K/Used/
/Automobiles/Cars/Cost/<5K/Used/
/Automobiles/Cars/Cost/<5K/Wreck/
The standard way of storing topologies (of which an ontology is a structural sub-class) is as one of the following: 1. Edge list (my preferred technique); 2. Jagged array / linked list (also workable). These two techniques strike a balance between mathematical complexity and usability. As an edge list, the above would look like this:
Automobiles --> Trucks
Trucks --> Chevy
Trucks --> Dodge
Automobiles --> Cars
Cars --> Color
Cars --> Cost
Color --> Red
Color --> Black
Red --> Mustang
Red --> Corvette
Black --> Speed
Black --> Steering
Cost --> >30K
Cost --> 5-30K
Cost --> <5K
>30K --> BMW
5-30K --> Used
<5K --> Used
<5K --> Wreck
My edge list cost: 19 x 2 ==> 38 memory units (an abstraction; ceteris paribus, assume node size is equivalent). i2b2 cost: 46 units! The difference between these two seems small. Please understand -- THIS IS NOT A LINEAR RELATIONSHIP! With healthcare data being MUCH more complex than my example, this difference becomes much worse.
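The two costs above can be computed directly. A small sketch, counting one 'memory unit' per stored node reference, as in the abstraction above:

```python
# i2b2-style path enumeration vs. an edge list, on the toy ontology above.

paths = [
    "/Automobiles/Trucks/Dodge/",
    "/Automobiles/Trucks/Chevy/",
    "/Automobiles/Cars/Color/Black/Speed/",
    "/Automobiles/Cars/Color/Black/Steering/",
    "/Automobiles/Cars/Color/Red/Mustang/",
    "/Automobiles/Cars/Color/Red/Corvette/",
    "/Automobiles/Cars/Cost/>30K/BMW/",
    "/Automobiles/Cars/Cost/5-30K/Used/",
    "/Automobiles/Cars/Cost/<5K/Used/",
    "/Automobiles/Cars/Cost/<5K/Wreck/",
]

# Path enumeration: every root-to-leaf path stores every node on it.
path_cost = sum(len(p.strip("/").split("/")) for p in paths)

# Edge list: one (parent, child) pair per distinct edge.
edges = set()
for p in paths:
    nodes = p.strip("/").split("/")
    edges.update(zip(nodes, nodes[1:]))
edge_cost = 2 * len(edges)

print(path_cost, len(edges), edge_cost)  # -> 46 19 38
```

And note this counts each path only once; store the same path in metadata and concept_dimension (and again for formatting) and the path-enumeration cost multiplies, while the edge list cost stays fixed.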
My edge list version of the 'i2b2 ontology' took SIGNIFICANTLY less space, did not have an asinine 'only 700 characters in length' requirement and is orthogonal/generic (good in systems which live in the real world and not fantasy land). As stated above, the maximum string length of the 'ontology' paths is 700 characters. It seems ridiculous to have to explain why this is bad. What's worse is the SAME ontology is stored in MULTIPLE fields in different tables -- wanna say update inconsistency? This ontology model has the following features: a) it is inefficient, b) it is difficult to manage, and c) it ONLY WORKS if the semantic ontology path depth never exceeds 700 characters (not a great feature of a system designed for the complex sciences of medicine and biology).
B). Over-design / difficult to disentangle: The documentation and install bits are overbearing. In reality, the primary use case for this at institutions other than Harvard is essentially as a set theory engine. As such, there are very few tables that are 'needed'. An analysis of LIVE data from our i2b2 system showed MOSTLY empty tables and empty structures. This does not create much of a data footprint cost, but from a design perspective it is the equivalent of the 8 headlights on the Family Truckster -- not really necessary and probably counterproductive. Over-design has costs in documentation and the risk of engendering FUTURE design flaws. i2b2 reminds me of the freeways in Seattle during the late 70's -- many of them went NOWHERE. i2b2 is 'sold' as a cellular model, which would imply an a la carte architecture. This is FAR from the truth. While a person COULD re-program and re-design portions of it to make it more flexible (one reason for my resignation was the desire -- in order to meet deadlines -- to fix and remedy some of the worst aspects of this), it is frowned upon. I think EGO rules i2b2.
C). Indexing is amateurish: Load factor of 7-10 times input data footprint.
I was told by a PhD at the UW not to be concerned by this. I'm glad a PhD in informatics allows someone to discount a memory leak. Please, download and look (you can download i2b2, although it has a strange version of the OFT MENTIONED 'open source' licensing). Observation_Fact, for i2b2 1.6, has so many indexes that SQL SERVER rejected the DDL. There were indexes on almost every field individually (though given how empty some of these fields were, it made little or no sense), but on top of this there is a covering index on EVERY field! A DB tuning class is needed -- stat.
D). UI is uninspiring and derivative: Set theory tabs; drag a node of the tree; it allows basic AND, OR and NOT operations. It's not bad, but not that impressive either. Plus, if you DO NOT update your ontology to reflect new concepts you will suffer from ORPHAN CONCEPTS... I do not need to explain how difficult it can be to maintain a system with this feature. What it is missing is dynamic connectedness. Tableau and systems like Tableau support a much more dynamic and less labor intensive means of noodling. A person could BUILD a set theory engine that would outperform i2b2 for large scale systems, and this alternative could be completed in 2-3 weeks. You could then call it a 'cell' and everyone could save face. This correct approach was not acceptable.
E). Method for de-id is limited and buggy: a) i2b2 randomizes the counts for results below a certain threshold, and b) if you query multiple times you are 'locked out'. The theory: use a randomizing function and then block users from using statistical narrowing. The problem: this doesn't really help when the populations you are dealing with drop below a certain level. At best it then becomes a noise generator.
F).
Data is stored in an unjustifiably redundant way (especially given i2b2's nature as a noodling/de-id tool): At first you might think, "Hey, it's as simple as storing concepts in Observation_Fact and keeping the metadata and concept_dimension consistent (remember, the ontology is stored and used in multiple places -- can I say update anomaly again?)". However, take a look at the hard-coded and stored SQL in the I2B2 table -- very revealing. I think the first thing a person should do once they download the i2b2 system is examine the following table -- i2b2metadata.I2B2 -- and pay attention to the generated SQL and the closure anomalies that exist. This kind of code undermines a modern database, creates 'false' sub-queries and is frankly VERY BAD PRACTICE. It is also unnecessary when you consider HOW SQL could be generated in the middle tier. There are Patient, Provider and Visit dimensions. I am wary of using the word 'dimension', because even though they are 'dimensions' of the data, the actual concept codes that represent the stored de-id values are ALSO stored there. So, if I want to use i2b2, I have to store the concept code in Observation_Fact, but I must ALSO store the same data in one of the core dimensions.

Conclusions: How does this happen?

1. Dr. Harvard (or docs PRACTICING computer science without a license): If I decided to begin practicing 'neighborhood cardiology' I would be arrested. I have no MD degree. I have no license to practice. And yet, in the healthcare system today, there are doctors selling themselves as 'experts' who know little or nothing of the discipline of computer science. Amalga was not designed by folks who understand complexity and data structures -- it was designed by docs, with some software developers, and it is the outcome of malpractice. I will make the docs who invented this a promise: if they stop practicing computer science, I will not pretend to be an ER doc.
We need to ask better questions of ANYONE selling an idea and not simply assume that because they are docs they must be good at everything -- sorry, but this is not generally the case. I joked once that given the current culture of hospitals, a doc from 'Harvard' could sell electronic mail as his or her own invention -- it might even make it to contract review before someone calls bullshit.
2. Bureaucracy -- good systems and good ideas don't make it: The circuitous route by which systems get built in a large healthcare system is both byzantine and painful. Instead of taking advantage of 'crowd' intelligence, the result is a gray, mediocre half-measure.
3. No cost accountability: I don't think there is currently any solid cost review or cost accounting with respect to these large healthcare systems. In fairness, this is common with MANY enterprise systems in and outside of healthcare. We must, as professionals, develop better and more objective metrics for measuring the performance and benefits of systems like Amalga and i2b2.
4. No transparency in WHY a system is selected: Amalga is one of the least transparent systems I have come across. It's ironic, because it is a terribly simplistic system with only 3 legs to the stool. Leg 1: the HL-7 parsers (crap). Leg 2: the Amalga ID and azAEID (super expensive and terribly convoluted crap). Leg 3: the 'console' or UI tools -- I used to beat myself up for cutting corners on UI development and design; now that I've seen Amalga and considered the raw dollars it has consumed, I am much more forgiving of myself. I doubt ONE of the UI tools (including the console) would pass a thorough review by an HCI (Human Computer Interaction) specialist. The UI tools which support development are buggy and have memory leaks. I figured out months ago that you MUST shut down your development environment periodically in order to avoid catastrophic memory leaks which can (and do) lead to lost parser work.
I don't really know why or how the UWMC picked Amalga or i2b2, but what I am fairly certain of is that NO competent computer scientist dug very deeply into either system. If they had, I doubt these systems would ever have been acquired.
5. Informatics IS NOT computer science, at least not yet: I have, in the last year, met MANY PhDs in Informatics. I remember my brother-in-law telling me about the very difficult tests he had to take (in the Computer Science Department) at Indiana University in Bloomington in order to move from the Masters program to the PhD portion of the program. The tests he described seemed extremely difficult -- I don't know if I would have passed (my bro-in-law was a graduate of Rose-Hulman, so not a light-weight when it comes to the science of information). I have no reason to believe -- given all the blank stares, given the fact that these 'experts' can look at Amalga and i2b2 and not immediately be concerned -- that the PhD in Informatics provides any real background in the foundational skills of being an information scientist in the biological and medical realm. Maybe Informatics has a place, but only if the stewards of the profession take seriously the basic skills they need to do their job. This is too bad. No one needs to take my word for it. If you are a hospital CIO, do yourself a favor -- evaluate Amalga or i2b2 yourself and do a good job of it. Microsoft sales folks will tell you about all the 'success' stories -- if they tell you the UWMC is a success they are being dishonest. Before you 'buy' Amalga, load it with 5-10 TB of test data. If Microsoft says 'we can't let you do that', then you should RUN from this disaster as far as you can. If they do consent, do your own analysis. Look at ingestion speed and replication, and run ingestion while you are writing queries.
In all likelihood, you will have to purchase servers at this scale, or you may have them already. Bottom line: Amalga doesn't get really ugly until it's almost too late to turn back. But it is never too late to do the right thing! So, in recap: at small scales of data Amalga is unnecessary and a burden; at large scales of data Amalga (and this applies to i2b2 as well) is unwieldy, buggy, provably inefficient and just a waste of money, time and human resources.

Miscellaneous

The joy of querying Amalga... (JOY is SARCASM here). The confusion over querying Amalga: our expert on our team gave me 3 explanations in 3 weeks during September of last year. The funny thing is, Microsoft originally recommended that MRN be thrown away. I'm glad that is one recommendation that was NOT followed (despite the most recent statement of what is 'correct' when querying Amalga). Converting from Amalga to Kimball.
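The closing note -- converting from Amalga to Kimball -- can be made concrete with toy data. A wide, spreadsheet-style table repeats every descriptive string on every row; a Kimball star schema stores each distinct dimension member once and keeps only small integer keys in the fact table. The rows, names and byte accounting below are entirely made up for illustration; the roughly-50% reduction quoted earlier in this essay came from live UWMC data, not from this sketch:

```python
# Denormalized, 'post relational' rows: (patient MRN, clinic name, procedure).
rows = [
    ("U2345445", "Harborview Cardiology", "ECHOCARDIOGRAM, COMPLETE"),
    ("U2345445", "Harborview Cardiology", "EKG, ROUTINE"),
    ("U7788123", "Harborview Cardiology", "EKG, ROUTINE"),
    ("U7788123", "UWMC Oncology", "CHEMOTHERAPY ADMIN, IV"),
] * 1000  # simulate volume: the same strings stored over and over

flat_bytes = sum(len(a) + len(b) + len(c) for a, b, c in rows)

# Kimball star schema: one row per distinct dimension member, plus a fact
# table holding three 4-byte surrogate keys per row.
patients   = {a for a, _, _ in rows}
clinics    = {b for _, b, _ in rows}
procedures = {c for _, _, c in rows}
dim_bytes  = sum(map(len, patients | clinics | procedures))
fact_bytes = len(rows) * 3 * 4

star_bytes = dim_bytes + fact_bytes
print(flat_bytes, star_bytes)  # the flat layout is several times larger
```

Real clinical data has far more repeated descriptive text per row than this toy, which is why moving from spreadsheet-style storage to conforming dimensions shrinks the footprint so dramatically.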