Social Infobuttons: Integrating Open Health Data with
Social Data using Semantic Technology
Xiang Ji
New Jersey Institute of Technology
323 Martin Luther King Jr. Blvd
Newark, NJ, 07102

Soon Ae Chun
Columbia University &
City University of New York
Staten Island, NY, 10314

James Geller
New Jersey Institute of Technology
323 Martin Luther King Jr. Blvd
Newark, NJ, 07102
ABSTRACT
There is a large amount of free health information available for a
patient to address her health concerns. HealthData.gov includes
community health datasets at the national, state and community
level, readily downloadable. There are also patient-generated
datasets, accessible through social media, on the conditions,
treatments or side effects that individual patients experience.
While caring for patients, clinicians or healthcare providers may
benefit from integrated information and knowledge embedded in
the open health datasets, such as national health trends and social
health trends from patient-generated healthcare experiences.
However, the open health datasets are distributed and vary from
structured to highly unstructured. An information seeker has to
spend time visiting many, possibly irrelevant, websites, and has to
select relevant information from each and integrate it into a
coherent mental model. In this paper, we present a Linked Data
approach to integrating these health data sources and presenting
contextually relevant information called Social InfoButtons to
healthcare professionals and patients. We present methods of data
extraction, and semantic linked data integration and visualization.
A Social InfoButtons prototype system provides awareness of
community and patient health issues and healthcare trends that
may shed light on patient care and health policy decisions.
Keywords
Social Network, Semantic Integration, Medical Data, Ontology
1. INTRODUCTION
In recent years, patients have begun to turn to social media,
particularly patient communities, for personal contact, social
support, and patient-generated knowledge. A study [1] by the Pew
Research Center found that 34% of Internet users used social
media to read other patients’ comments about health issues.
MedHelp [2] and PatientsLikeMe [3] are examples of medical
social network sites with large user bases.
As another data source, governments have published many open
health datasets. At the city level, NYC Open Data [4] and Chicago
Data Portal [5] are examples of Open Government Initiatives [19].
At the national level, the CDC has established the Behavioral
Risk Factor Surveillance System (BRFSS) [6] based on regularly
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
SWIM’13, June 23, 2013, New York, USA.
Copyright 2013 ACM 978-1-4503-2194-5/13/06 …$15.00.
held research community and in patient resource websites.
PubMed [8] is a database containing more than 22 million
scientific sources from MEDLINE, life science journals, and
online books. In a patient resource website such as WebMD [9], a
patient can search for professional advice from health specialists.
Although there are many different “open” health data sources
available, the stored information varies widely, from unstructured
to highly-structured. To address this heterogeneity, we use
Semantic Web technologies to model the integrated knowledge
network by linking health data from multiple data sources [7]. The
RDF triples of the Semantic Web provide a light-weight
integration method through linking entities from different sources
to represent the relationships among them. This integration of
patient-created social data with existing health research and
clinical practice data can provide a tool for contrasting the actual
situations of patients with the views officially accepted as “correct”
by healthcare experts. It may also spot community trends in
medication use, side effects or treatment methods that are not yet
known “at the textbook level.” Integrated information of this kind
supports comparison and trend spotting. It is useful for patients
looking for health-related information, for clinicians who want to
know what similar patients have experienced during a particular
treatment, and for government agencies adjusting their health
policies.
Thus, our research goal is to extract patient-generated social
health data, government data, and clinical research data, and link
them to develop an integrated semantic network model. This
model provides not only answers to simple queries on health
information, but also a set of analytic tools called “Social
InfoButtons” that can help end-users to become aware of socially
distributed health information, such as treatments, conditions,
experiences, attitudes, and behaviors reported by patients. In this
paper, we present a semantic model for integrated health
knowledge, a matching algorithm to determine cases of equivalent
semantics of different terms, semantic query functionalities, data
analytics and visualization functionalities. We present a prototype
system called Social InfoButtons to provide easy access to this
integrated social health information.
The rest of the paper is organized as follows. In Section 2, we
present related work. Section 3 describes the semantic model for
the integrated health data store and the RDF representation of data
extracted with the semantic matching algorithm. Sections 4 and 5
illustrate the Social InfoButtons analytics and the prototype
system. Section 6 contains the conclusions and future work.
2. RELATED WORK
During data integration, one critical problem is how to match
different terms with the same meaning. Copious amounts of work
on integrating various data sources have been reported, including
approaches using ontologies. Shvaiko and Euzenat [11]
summarized the state-of-the-art in ontology matching. The
Semantic Web has been used as a framework for data integration
in previous research. Sheth and Ramakrishnan [7] reviewed the
viability of the Semantic Web in integration. However, in
healthcare research,
there are fewer publications on this topic. Fernandez-Luque et al.
[12] surveyed different approaches to extracting information from
the “Social Web” for health personalization.
We utilize the idea of “Social InfoButtons” to provide context-aware
links to social health knowledge, aggregated from social
network sites. InfoButtons was developed by Cimino et al. [10] to
meet the clinician’s information needs. Cimino et al. [14]
described ten different information needs, their contexts, their
resources, and the applicable methods. They concluded that the
methods to implement InfoButtons included simple links,
concept-based links, simple search, concept-based search,
intelligent agents, and a calculator. We incorporated some of the
InfoButtons standard questions from Collins et al. [13] into our
prototype. The “Social InfoButtons” system answers these
questions, using a knowledge base which contains the user-generated
content, location information, and patients’
demographics, stored in a semantic triple store. Social InfoButtons
was strongly inspired by Cimino’s InfoButtons, but was
developed independently.
3. SEMANTIC INTEGRATION
APPROACH FOR HEALTH DATA
3.1 Semantic Model for Linking Health Data
To capture the variety of health information, we developed a
conceptual model that represents the relationships among health
data entities extracted from different sources. The Condition class and
Treatment class are the central concepts in the model. The
Condition class has the “isConditionOf” relationship to the
Patient class, the “hasSymptom” relationship to the Symptom
class, and the “hasTreatment” relationship to the Treatment class.
The Treatment class has the “hasSideEffect” relationship to the
SideEffect class, and the “hasPurpose” relationship to the Purpose
class. This semantic model also implements the relationships
between the concepts from PatientsLikeMe to the ones from other
Web sources, e.g., the Condition class can be linked to the web
resource in WebMD that describes the corresponding condition.
This link can be represented with the “hasResource” relationship
to denote that the Condition in PatientsLikeMe has a related
resource in WebMD. Similarly, the Condition class can be linked
to a research paper at the PubMed publication site by the predicate
“isReportedBy.” An advantage of the triple representation over a
relational database is that when a new condition from a new data
source needs to be included, the new instance can be added and
linked directly, without altering the model structure as a
relational schema would require. A detailed class and
relationship diagram was omitted due to space limitations.
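As a minimal sketch of this point, the model can be pictured as a
set of (subject, predicate, object) triples. The predicate names
come from the text above; the prefixes, the instance identifiers,
and the helper function are hypothetical and chosen only for
illustration.

```python
# Illustrative triples only: predicate names follow Section 3.1,
# while every prefix and instance identifier is a hypothetical placeholder.
triples = {
    ("plm:Migraine", "rdf:type", "Condition"),
    ("plm:Migraine", "isConditionOf", "plm:patient/1"),
    ("plm:Migraine", "hasSymptom", "plm:symptom/example"),
    ("plm:Migraine", "hasTreatment", "plm:Sumatriptan"),
    ("plm:Sumatriptan", "hasSideEffect", "plm:side-effect/example"),
    ("plm:Sumatriptan", "hasPurpose", "plm:purpose/example"),
}


def objects(store, subject, predicate):
    """Return all objects linked to `subject` by `predicate`, sorted."""
    return sorted(o for s, p, o in store if s == subject and p == predicate)


# Adding a condition from a new source needs no schema change:
# the new triples are simply added to the same store.
triples.add(("plm:Asthma", "rdf:type", "Condition"))
triples.add(("plm:Asthma", "hasResource", "webmd:example-page"))
triples.add(("plm:Asthma", "isReportedBy", "pubmed:example-article"))

print(objects(triples, "plm:Migraine", "hasTreatment"))
print(objects(triples, "plm:Asthma", "hasResource"))
```

In a relational design, the last three lines would typically
require a new column or a new join table; here they are ordinary
insertions.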
3.2 Data Extraction and Transformation
We collected publicly available data from PatientsLikeMe,
PubMed, the CDC website, and the UMLS Metathesaurus. We
utilized the PHP HTML DOM Parser [21] to scrape relevant
information from the above websites. Scraping multiple websites
is not an easy task, since different sites have different HTML
structures, and pages within a single website also differ. We
wrote separate scripts to scrape the relevant health information
from the various pages and stored the extracted data in a
relational database. Taking PatientsLikeMe as an example,
we began with the publicly available patients’ list, then visited
each patient’s profile page from the list, and scraped its id,
username, gender, age, location, primary condition, condition,
first symptom date, and diagnosis date. Similar procedures were
also used for scraping lists of treatments, symptoms, purposes,
side effects, and conditions. To retrieve the relevant documents
about conditions, we searched PubMed using 1228 condition
names collected from PatientsLikeMe. For each condition, the
available information, such as the PubMed URL, title, author, and
conference or journal, of the top 20 matched documents was
collected. Furthermore, we scraped the CDC BRFSS prevalence
data [22], which is published in tables. UMLS data can be
retrieved remotely through SQL queries.
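The profile-scraping step can be sketched as follows. This is not
the PHP Simple HTML DOM Parser code used in our system; it is an
illustrative stand-in written with Python's standard html.parser
module. The sample markup, the data-field attribute, and the names
ProfileParser and scrape_profile are assumptions, since the real
page structure differs per site and per page.

```python
from html.parser import HTMLParser

# Hypothetical profile markup; real pages differ per site and per page.
SAMPLE_PROFILE = """
<div class="profile">
  <span data-field="username">example_user</span>
  <span data-field="gender">Female</span>
  <span data-field="age">42</span>
  <span data-field="location">Newark, NJ</span>
</div>
"""


class ProfileParser(HTMLParser):
    """Collect the text of every element carrying a data-field attribute."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None  # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; None if no data-field.
        self._field = dict(attrs).get("data-field")

    def handle_data(self, data):
        if self._field and data.strip():
            self.record[self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None


def scrape_profile(html):
    """Parse one profile page into a flat record (a dict of strings)."""
    parser = ProfileParser()
    parser.feed(html)
    return parser.record


print(scrape_profile(SAMPLE_PROFILE))
```

Each record produced this way corresponds to one row to be stored
in the relational database before the RDF transformation.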
The extracted relational health data needs to be transformed into
RDF triples. One problem is the multiple-attribute issue: a
relational table can hold multiple attributes for one record, but an
RDF triple can only use a predicate to represent one attribute. We
utilized a reified statement to address this problem. In the reified
statement, the primary key of the original table was converted into
the subject, the attribute name of the foreign key was regarded as
the predicate, and the object was the value of the foreign key. The
reified statement was then regarded as a new subject, and the
other attributes in the original table were converted into different
predicates of the new subject. It took 16 seconds on a laptop with
an Intel i7 1.6 GHz CPU and 6 GB of RAM to transform the database
records into 612,017 triples in an RDF file (57.1 MB). It took
another 57 seconds on the same laptop to load the RDF file
into a Jena TDB [24][20] triple store (341 MB), a fast persistent
triple store that writes triples directly to
disk. For the visualization interface of Social InfoButtons, we
used the geocoding procedure [15] to transform original location
data into a canonical location name.
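The reified-statement transformation described above can be
sketched as follows. The column names, the ex: prefix, and the
function reify_row are hypothetical; the rdf:Statement,
rdf:subject, rdf:predicate, and rdf:object terms are the standard
RDF reification vocabulary. Emitting the plain triple alongside
its reified form is an assumption of this sketch.

```python
def reify_row(row, primary_key, foreign_key, stmt_id):
    """Turn one relational row into triples using a reified statement.

    The primary key becomes the subject, the foreign-key column name the
    predicate, and the foreign-key value the object. The statement node
    `stmt_id` then carries the remaining columns as its own predicates.
    """
    subject = f"ex:{primary_key}/{row[primary_key]}"
    predicate = f"ex:{foreign_key}"
    obj = f"ex:{foreign_key}/{row[foreign_key]}"

    triples = [
        (subject, predicate, obj),            # the plain statement
        (stmt_id, "rdf:type", "rdf:Statement"),
        (stmt_id, "rdf:subject", subject),
        (stmt_id, "rdf:predicate", predicate),
        (stmt_id, "rdf:object", obj),
    ]
    # Every other column becomes a predicate of the statement node.
    for column, value in row.items():
        if column not in (primary_key, foreign_key):
            triples.append((stmt_id, f"ex:{column}", value))
    return triples


# Hypothetical row linking a patient to a condition, with two extra attributes.
row = {
    "patient_id": 17,
    "condition_id": 5,
    "first_symptom_date": "2009-03",
    "diagnosis_date": "2010-01",
}
triples = reify_row(row, "patient_id", "condition_id", "ex:stmt/1")
for triple in triples:
    print(triple)
```

The two date attributes end up attached to the statement node
rather than to the patient or the condition, which is what lets a
single record with several attributes be expressed as triples.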
3.3 Ontology-Based Semantic Matching
In the process of data integration, one critical problem is how to
recognize that the terms from different resources actually
represent the same concept. As suggested in [16], we use the
UMLS [17] Metathesaurus, which provides a common vocabulary
and semantics for multiple terms that refer to the same concept. In
this paper, we utilize a matching algorithm, using the UMLS to
recognize identical concepts. We used CUIs, which are concept
unique identifiers for medical concepts in the UMLS.
Definition 3.1. Integrated data repository.
$D = \{D_{1'}, D_{2'}, D_{3'}, \ldots, D_{n'}\}$ is the aggregation
of all data sources being integrated. $D_{i'}$ represents the
$i'$-th data source.
Definition 3.2. Repository of instances.
$H_{i'} = \{h_{1,i'}, h_{2,i'}, h_{3,i'}, \ldots, h_{k,i'}\}$.
$H_{i'}$ is the set of instances from data source $D_{i'}$, and
$h_{i,i'}$ is the $i$-th instance from set $H_{i'}$.
Definition 3.3. Repository of CUIs.
$T_{i,i'} = \{t \mid t \text{ is the CUI of a UMLS concept that has at least one synonym identical to the name of instance } h_{i,i'}\}$.
Assumption 3.1. If the name of instance $h_{i,i'}$ matches exactly
with the label of the concept $C_i$ in UMLS, then $h_{i,i'}$ is a
term of the UMLS concept $C_i$, which is uniquely identified by the
CUI of $C_i$.
Assumption 3.2. If the intersection of the set $T_{i,i'}$ and the
set $T_{j,j'}$ is not empty, and exactly one CUI exists in the
intersection, then the instances $h_{i,i'}$ and $h_{j,j'}$ are
referring to the same concept.
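To make the definitions and assumptions concrete, the following
minimal sketch builds the CUI sets for two instance names and
tests whether they share exactly one CUI, following the wording
of Assumption 3.2. The in-memory UMLS_ROWS table is a toy stand-in
for the UMLS Metathesaurus: only the CUI "C0026769" is taken from
this paper, and all other identifiers, as well as the function
names, are hypothetical.

```python
# Toy stand-in for UMLS (cui, sui, str) tuples. Only "C0026769" comes from
# the paper; every other identifier below is a made-up placeholder.
UMLS_ROWS = [
    ("C0026769", "S0000001", "Multiple Sclerosis"),
    ("C0026769", "S0000002", "MS"),
    ("C9999001", "S0000003", "MS"),        # hypothetical unrelated concept
    ("C9999002", "S0000004", "Migraine"),  # hypothetical CUI
]


def cuis_for(name, rows=UMLS_ROWS):
    """Return the set T of CUIs that have a synonym exactly equal to `name`."""
    return {cui for cui, _sui, text in rows if text == name}


def same_concept(name_a, name_b, rows=UMLS_ROWS):
    """Apply Assumption 3.2: True if the two CUI sets share exactly one CUI."""
    shared = cuis_for(name_a, rows) & cuis_for(name_b, rows)
    return len(shared) == 1


print(cuis_for("MS"))
print(same_concept("MS", "Multiple Sclerosis"))
print(same_concept("MS", "Migraine"))
```

The short name "MS" maps to more than one CUI in this toy table,
yet intersecting its set with that of "Multiple Sclerosis" leaves
a single shared CUI. Relaxing the test to any non-empty
intersection is a one-line change to same_concept.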
Assumption 3.1 states that if a name of the instance matches any
one of a concept’s synonyms in the UMLS, this instance belongs
to that UMLS concept. This assumption holds in most cases, but it
is worth noting that the UMLS itself has some errors [18].
Naturally, there will be cases where several CUIs are possible
matches. For example, one condition is “Multiple Sclerosis,” with
the instance name “MS.” There are 14 different matching CUIs, of
which only “C0026769” is the desired one. In our triple store,
according to an empirical observation, the instances with short
names are rare. Assumption 3.1 selects a superset of the
correct CUI, and the intersection of two supersets is likely to
contain the correct answer. Assumption 3.2 assumes that if two
instances share one and only one possible unique concept, they
are referring to the same concept.
Algorithm 1, shown in Figure 1, is proposed for computing the
“identity” of two terms (or instances). In this algorithm, the CUIs
of two instances from two data sources are collected; when the
CUIs of the first instance have an overlap with the CUIs of the
second instance, the “identity” flag is set to 1. The algorithm
is run for every pair of instances between two data sources.
If the “identity” flag is 1, an OWL.sameAs relationship
will be added between the two instances.

Algorithm 1: Computing $S(h_{i,i'}, h_{j,j'})$
Input: instances $h_{i,i'}$ and $h_{j,j'}$; $S(h_{i,i'}, h_{j,j'}) = 0$
Output: $S(h_{i,i'}, h_{j,j'})$
Search the UMLS with name($h_{i,i'}$); if a str in a tuple
    (cui, sui, str) matches the term, add cui to $T_{i,i'}$.
Search the UMLS with name($h_{j,j'}$); if a match is found in
    the UMLS, add cui to $T_{j,j'}$.
for each t in $T_{i,i'}$
    for each t’ in $T_{j,j'}$
        if t is equal to t’
            set $S(h_{i,i'}, h_{j,j'}) = 1$;
            break;
        endif
    end
end
Figure 1. Procedure of the Triple Matching Algorithm

4. SOCIAL INFOBUTTONS
4.1 Social Data Analytics
SPARQL queries were designed to answer expected user
questions. Due to space limitations, only one question from each
category is listed in Table 1.
Table 1. Questions Answered by Social InfoButtons
Category        Question
Statistics      Top conditions with the most patients?
Demographics    Gender distribution of the patients?
Location        Where is the individual patient?
Condition       What are the symptoms of the condition?
Correlation     Difference between social and official data?
The Social InfoButtons system provides information in a
context-aware fashion, e.g., in the clinical patient care context or
in the government policy evaluation context. The Social InfoButtons
analytics that displays the information about the condition
“Migraine” is shown in Figure 2.
Figure 2. Social InfoButtons and Treatments
4.2 Use Cases
Consider a medical doctor, Christine, writing prescriptions for her
patient Bob, who is suffering from migraines. In addition to
referencing Bob’s lab reports, Christine also wants to explore the
social trends and experiences of other patients like Bob. The
treatments used by other similar patients for migraines and how
they reacted to them are also displayed. For migraines, 11
treatments were mentioned by other patients. We only show how
the system displays Sumatriptan, which is the most popular
treatment (Figure 2).
Government agencies could also use the Social InfoButtons
system to compare the reports of patients on social network sites
with official data. For example, as shown in Figure 3, the gender
distribution of Asthma in Pennsylvania from PatientsLikeMe
differs by 22.4 percentage points from government data. The
reasons behind this difference still require further investigation.
Figure 3. Difference of Gender Distribution between Social
and Government Health Data
5. SOCIAL INFOBUTTONS PROTOTYPE
The Social InfoButtons system architecture diagram is omitted
due to space limitations. The data collector extracts publicly
available health data from various data sources, which include the
social network site PatientsLikeMe, the government-maintained
CDC website, the PubMed website, and the patient resource portal
WebMD. All the text-based location information is sent to the
Google geocoding server to be translated into canonical locations.
Based on our medical semantic model, our data collector program
has extracted 17,407 publicly-viewable patient profiles from
PatientsLikeMe. Each profile contains the basic patient
information such as username, gender, age, and location. In
addition, 1,228 conditions, 911 side effects, 2,176 symptoms,
1,354 treatment purposes, and 5,608 treatments were collected.
The CDC Behavioral Risk Factor Surveillance System contains
annual reports about prevalent diseases, and provides statewide
data about the absolute numbers of patients and percentages of
patients among the whole population in US states. For PubMed
and WebMD, the related URLs of the medical concepts, such as
condition names, side effect names, symptom names, and
treatment names were extracted. All of the triples are stored as a
graph model in the Jena TDB, which is a high-performance RDF
store on a single machine. SPARQL queries are used to access
the stored graph model.
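As an illustration of this kind of access, the sketch below pairs
a possible SPARQL 1.1 formulation of the question “Top conditions
with the most patients?” with a plain-Python equivalent that runs
over a handful of toy triples. The query text, the ex: namespace
URI, the sample triples, and the function top_conditions are
assumptions made for illustration; they are not the queries
embedded in the prototype.

```python
from collections import Counter

# Hypothetical SPARQL 1.1 text for "Top conditions with the most patients?".
# The ex: namespace is a placeholder; isConditionOf follows Section 3.1.
TOP_CONDITIONS_QUERY = """
PREFIX ex: <http://example.org/health#>
SELECT ?condition (COUNT(?patient) AS ?patients)
WHERE { ?condition ex:isConditionOf ?patient . }
GROUP BY ?condition
ORDER BY DESC(?patients)
"""

# Toy data with made-up identifiers, used to mimic the query in plain Python.
TRIPLES = [
    ("ex:Migraine", "ex:isConditionOf", "ex:patient/1"),
    ("ex:Migraine", "ex:isConditionOf", "ex:patient/2"),
    ("ex:Asthma", "ex:isConditionOf", "ex:patient/3"),
    ("ex:Migraine", "ex:hasTreatment", "ex:Sumatriptan"),
]


def top_conditions(triples, limit=10):
    """Count patients per condition and return the most common first."""
    counts = Counter(s for s, p, _o in triples if p == "ex:isConditionOf")
    return counts.most_common(limit)


print(top_conditions(TRIPLES))
```

The Python function performs the same grouping and ordering that
the GROUP BY and ORDER BY clauses request from the triple store.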
6. CONCLUSIONS AND FUTURE WORK
We have developed a semantic integration method that links the
social data reported by patients with other health information. A
semantic network model for health data integration was defined as
a conceptual model to link various data sources. We utilized the
UMLS and presented a matching algorithm to search for
synonyms of resource names in existing medical ontologies, and
then to extract and compare the identifiers of each pair of
resources. Utilizing the integrated knowledge base, we
implemented the Social InfoButtons prototype, which can provide
the aggregated social health information that is relevant to an end
user’s context. Patients, clinicians, and government officials can
make social-network-informed decisions with the Social
InfoButtons prototype.
The following are topics for future work. (1) The scalability of the
approach will be investigated. The Jena TDB triple store currently
contains 612,017 triples, and the query execution times range
from instantaneous to a few seconds. However, real-time
execution might become infeasible when the extracted dataset
becomes very large. As proposed by Punnoose et al. [23], an
index on RDF triples might be useful for speeding up queries. In
addition, more analysis techniques will be used to discover the
underlying patterns in the aggregated knowledge base. (2) In cases
where the query terms do not match the target terms but match the
children of the target terms, semantic reasoning using the medical
ontological information can be exploited to expand the matching
and further improve the recall of the matching algorithm. (3)
More results of the matching algorithm and other relevant
algorithms will be collected and compared in systematic
experiments. (4) Semantic searches will be employed to improve
or replace the current, embedded SPARQL queries in order to
fully utilize the advantages of the Jena TDB triple store. (5)
Currently data collection is automatic but not in real time, so it is
desirable to expand the data collection process into a batch
procedure or a real-time process.
7. ACKNOWLEDGMENTS
This work was partially funded by PSC-CUNY Research
Foundation (2011-2013). This work was conducted while Chun
was on sabbatical leave at Columbia University.
8. REFERENCES
[1] Pew Research Center. 2011. The Social Life of Health
Information.
http://pewinternet.org/~/media/files/reports/2011/pip_social_life_of_health_info.pdf.
[2] MedHelp. http://www.medhelp.org/.
[3] PatientsLikeMe. https://www.patientslikeme.com.
[4] NYC Open Data. https://nycopendata.socrata.com/.
[5] City of Chicago. Data Portal. https://data.cityofchicago.org/.
[6] Behavioral Risk Factor Surveillance System.
http://www.cdc.gov/brfss/.
[7] Sheth A., Ramakrishnan C. 2003. Semantic Web Technology
in Action: Ontology Driven Information Systems for Search,
Integration and Analysis. IEEE Data Engineering Bulletin.
[8] PubMed. http://www.ncbi.nlm.nih.gov/pubmed.
[9] WebMD. http://www.webmd.com.
[10] Cimino J.J., Elhanan G., Zeng Q. 1997. Supporting
Infobuttons with Terminological Knowledge. Proceedings of
AMIA Annual Symposium.
[11] Shvaiko P., Euzenat J. 2013. Ontology Matching: State of The
Art and Future Challenges. IEEE Transactions on Knowledge
and Data Engineering, Volume: 25, Issue: 1.
[12] Fernandez-Luque L., Karlsen R., Bonander J. 2011. Review
of Extracting Information From the Social Web for Health
Personalization. J Med Internet Res 2011;13(1):e15.
[13] Collins S.A., Currie L.M., Bakken S., Cimino J.J. 2009.
Information Needs, Infobutton Manager Use, and Satisfaction
by Clinician Type: A Case Study. J Am Med Inform Assoc.
16(1): 140–142.
[14] Cimino J.J., Li J., Allen M., Currie L.M., Graham M.,
Janetzki V., Lee N.J., Bakken S., Patel V.L. 2004. Practical
Considerations for Exploiting the World Wide Web to Create
InfoButtons. Studies in Health Technology and Informatics.
107(Pt 1): 277-81.
[15] Ji X., Chun S., and Geller J. 2012. Epidemic Outbreak and
Spread Detection System Based on Twitter Data. Proceedings
of The 1st International Conference on Health Information
Science (HIS 2012): 152-163, Beijing, China.
[16] Chun S. and MacKellar B. 2012. Social Health Data
Integration using Semantic Web. Proceedings of the 27th
Annual ACM Symposium on Applied Computing. Pages 392-397.
[17] Bodenreider O. 2004. The Unified Medical Language System
(UMLS): Integrating Biomedical Terminology. Nucleic Acids
Research 32(Database-Issue): p267-270.
[18] Geller J., Morrey C.P., Xu J., Halper M., Elhanan G., Perl Y.,
Hripcsak G. 2009. Comparing Inconsistent Relationship
Configurations indicating UMLS Errors. Proceedings of
AMIA Annual Symposium.
[19] Open Government Initiative. www.whitehouse.gov/open.
[20] McBride B. 2001. Jena: Implementing the rdf model and
syntax specification. Proceedings of the Second International
Workshop on the Semantic Web.
[21] PHP Simple HTML DOM Parser.
http://simplehtmldom.sourceforge.net.
[22] CDC Prevalence Data of Asthma in 2010.
http://www.cdc.gov/asthma/brfss/2010/brfssdata.htm.
[23] Punnoose R., Crainiceanu A., Rapp D. 2012. Rya: a scalable
RDF triple store for the clouds. Proceedings of the 1st
International Workshop on Cloud Intelligence.
[24] Jena TDB. http://jena.apache.org/documentation/tdb.