Social Infobuttons: Integrating Open Health Data with Social Data using Semantic Technology

Xiang Ji
New Jersey Institute of Technology, 323 Martin Luther King Jr. Blvd, Newark, NJ, 07102

Soon Ae Chun
Columbia University & City University of New York, Staten Island, NY, 10314

James Geller
New Jersey Institute of Technology, 323 Martin Luther King Jr. Blvd, Newark, NJ, 07102

ABSTRACT
There is a large amount of free health information available for a patient to address her health concerns. HealthData.gov includes readily downloadable community health datasets at the national, state and community level. There are also patient-generated datasets, accessible through social media, on the conditions, treatments or side effects that individual patients experience. While caring for patients, clinicians or healthcare providers may benefit from integrated information and knowledge embedded in the open health datasets, such as national health trends and social health trends from patient-generated healthcare experiences. However, the open health datasets are distributed and vary from structured to highly unstructured. An information seeker has to spend time visiting many, possibly irrelevant, websites, and has to select relevant information from each and integrate it into a coherent mental model. In this paper, we present a Linked Data approach to integrating these health data sources and presenting contextually relevant information, called Social InfoButtons, to healthcare professionals and patients. We present methods of data extraction, semantic linked data integration, and visualization. A Social InfoButtons prototype system provides awareness of community and patient health issues and healthcare trends that may shed light on patient care and health policy decisions.

Keywords
Social Network, Semantic Integration, Medical Data, Ontology

1. INTRODUCTION
In recent years, patients have begun to turn to social media, particularly patient communities, for personal contact, social support, and patient-generated knowledge. A study [1] by the Pew Research Center found that 34% of Internet users used social media to read other patients’ comments about health issues. MedHelp [2] and PatientsLikeMe [3] are examples of medical social network sites with large user bases.

As another data source, governments have published many open health datasets. At the city level, NYC Open Data [4] and Chicago Data Portal [5] are examples of Open Government Initiatives [19]. At the country level, the CDC has established the Behavioral Risk Factor Surveillance System (BRFSS) [6] based on regularly

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SWIM’13, June 23, 2013, New York, USA. Copyright 2013 ACM 978-1-4503-2194-5/13/06 …$15.00.

held research community and in patient resource websites. PubMed [8] is a database containing more than 22 million scientific sources from MEDLINE, life science journals, and online books. In a patient resource website such as WebMD [9], a patient can search for professional advice from health specialists.

Although there are many different “open” health data sources available, the stored information varies widely, from unstructured to highly structured. To address this heterogeneity, we use Semantic Web technologies to model the integrated knowledge network by linking health data from multiple data sources [7].
The RDF triples of the Semantic Web provide a light-weight integration method through linking entities from different sources to represent the relationships among them. This integration of patient-created social data with existing health research and clinical practice data can provide a tool for contrasting the actual situations of patients with the views officially accepted as “correct” by healthcare experts. It may also spot community trends in medication use, side effects or treatment methods that are not yet known “at the textbook level.” This kind of integrated information for comparisons and trend spotting is useful for the patients who are looking for health-related information, for clinicians who are treating a patient to become aware of what similar patients have experienced during a particular treatment, and for government agencies to adjust their health policies. Thus, our research goal is to extract patient-generated social health data, government data, and clinical research data, and link them to develop an integrated semantic network model. This model provides not only answers to simple queries on health information, but also a set of analytic tools called “Social InfoButtons” that can help end-users to become aware of socially distributed health information, such as treatments, conditions, experiences, attitudes, and behaviors reported by patients. In this paper, we present a semantic model for integrated health knowledge, a matching algorithm to determine cases of equivalent semantics of different terms, semantic query functionalities, data analytics and visualization functionalities. We present a prototype system called Social InfoButtons to provide easy access to this integrated social health information. The rest of the paper is organized as follows. In Section 2, we present related work. Section 3 describes the semantic model for the integrated health data store and the RDF representation of data extracted with the semantic matching algorithm. 
In Section 4, the Social InfoButtons analytics and use cases are illustrated. Section 5 describes the prototype system. Section 6 contains the conclusions and future work.

2. RELATED WORK
During data integration, one critical problem is how to match different terms with the same meaning. Copious amounts of work on integrating various data sources have been reported, including approaches using ontologies. Shvaiko and Euzenat [11] summarized the state of the art in ontology matching. The Semantic Web has been used as a framework for data integration in previous research. Sheth et al. [7] reviewed the viability of the Semantic Web in integration. However, in healthcare research, there are fewer publications on this topic. Fernandez-Luque et al. [12] surveyed different approaches to extracting information from the “Social Web” for health personalization.

We utilize the idea of “Social InfoButtons” to provide context-aware links to social health knowledge, aggregated from social network sites. InfoButtons was developed by Cimino et al. [10] to meet the clinician’s information needs. Cimino et al. [14] described ten different information needs, their contexts, their resources, and the applicable methods. They concluded that the methods to implement InfoButtons included simple links, concept-based links, simple search, concept-based search, intelligent agents, and a calculator. We incorporated some of the InfoButtons standard questions from Collins et al. [13] into our prototype. The “Social InfoButtons” system answers these questions using a knowledge base that contains user-generated content, location information, and patients’ demographics, stored in a semantic triple store. Social InfoButtons was strongly inspired by Cimino’s InfoButtons, but was developed independently.

3. SEMANTIC INTEGRATION APPROACH FOR HEALTH DATA

3.1 Semantic Model for Linking Health Data
To capture the variety of health information, we developed a conceptual model that represents the relationships among health data entities extracted from different sources. The Condition class and Treatment class are the central concepts in the model. The Condition class has the “isConditionOf” relationship to the Patient class, the “hasSymptom” relationship to the Symptom class, and the “hasTreatment” relationship to the Treatment class. The Treatment class has the “hasSideEffect” relationship to the SideEffect class, and the “hasPurpose” relationship to the Purpose class.

This semantic model also implements the relationships between the concepts from PatientsLikeMe and the ones from other Web sources. For example, the Condition class can be linked to the web resource in WebMD that describes the corresponding condition. This link is represented with the “hasResource” relationship, which denotes that the Condition in PatientsLikeMe has a related resource in WebMD. Similarly, the Condition class can be linked to a research paper at the PubMed publication site by the predicate “isReportedBy.” An advantage of the triple representation compared with a relational database is that if a new condition from a new data source needs to be included, the new instance can be added and linked directly, without altering the model structure as would be required in a relational database. A detailed class and relationship diagram was omitted due to space limitations.

3.2 Data Extraction and Transformation
We collected publicly available data from PatientsLikeMe, PubMed, the CDC website, and the UMLS Metathesaurus. We utilized the PHP HTML DOM Parser [21] to scrape relevant information from the above websites. Web scraping on various websites is not an easy task, since different sites have different HTML structures, and there are also differences between pages within a website.
We wrote different scripts to scrape relevant health information from the various pages and stored the extracted data in a relational database. Taking PatientsLikeMe as an example, we began with the publicly available patients’ list, then visited each patient’s profile page from the list, and scraped the patient’s id, username, gender, age, location, primary condition, condition, first symptom date, and diagnosis date. Similar procedures were also used for scraping lists of treatments, symptoms, purposes, side effects, and conditions.

To retrieve the relevant documents about conditions, we searched PubMed using 1,228 condition names collected from PatientsLikeMe. For each condition, the available information on the top 20 matched documents, such as PubMed URL, title, author, and conference or journal, was collected. Furthermore, we scraped the CDC BRFSS prevalence data [22], which is published in tables. UMLS data can be retrieved remotely through SQL queries.

The extracted relational health data needs to be transformed into RDF triples. One problem is the multiple-attribute issue: a relational table can have multiple attributes for one record, but an RDF triple has a single predicate and can therefore represent only one attribute. We utilized a reified statement to address this problem. In the reified statement, the primary key of the original table was converted into the subject, the attribute name of the foreign key was regarded as the predicate, and the object was the value of the foreign key. The reified statement was then regarded as a new subject, and the other attributes in the original table were converted into different predicates of the new subject.
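The reification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `ex:` prefix, the function name `reify_row`, the statement identifier, and the example column names (`patient_id`, `treatment_id`, `dosage`, `start_date`) are hypothetical and chosen only for illustration. The sketch also assumes that the base triple is asserted alongside the reified statement and that the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`) is used; the paper does not state either detail. Triples are modeled as plain Python tuples rather than with an RDF library.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def reify_row(row: Dict[str, str], primary_key: str, foreign_key: str,
              statement_id: str) -> List[Triple]:
    """Turn one relational row into a base triple plus a reified statement.

    primary key value -> subject, foreign key column name -> predicate,
    foreign key value -> object. All remaining columns become predicates
    of the reified statement node.
    """
    subject = f"ex:{row[primary_key]}"      # hypothetical "ex:" namespace
    predicate = f"ex:{foreign_key}"
    obj = f"ex:{row[foreign_key]}"
    stmt = f"ex:{statement_id}"

    triples: List[Triple] = [
        (subject, predicate, obj),          # assumption: base triple is asserted too
        (stmt, "rdf:type", "rdf:Statement"),
        (stmt, "rdf:subject", subject),
        (stmt, "rdf:predicate", predicate),
        (stmt, "rdf:object", obj),
    ]
    # Every other attribute hangs off the statement node as a literal.
    for name, value in row.items():
        if name not in (primary_key, foreign_key):
            triples.append((stmt, f"ex:{name}", f'"{value}"'))
    return triples


if __name__ == "__main__":
    # Hypothetical row from a patient/treatment table.
    example_row = {
        "patient_id": "p42",
        "treatment_id": "sumatriptan",
        "dosage": "50mg",
        "start_date": "2012-05-01",
    }
    for triple in reify_row(example_row, "patient_id", "treatment_id", "stmt1"):
        print(triple)
```

With this layout, attributes such as the dosage describe the patient-treatment relationship itself rather than the patient or the treatment, which is the situation a single subject-predicate-object triple cannot express.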
It took 16 seconds on a laptop with an Intel i7 1.6 GHz CPU and 6 GB of RAM to transform the database records into 612,017 triples in an RDF file (57.1 MB), and another 57 seconds on the same laptop to load the RDF file into a Jena TDB [24][20] triple store (341 MB). Jena TDB is a fast persistent triple store that writes triples directly to disk. For the visualization interface of Social InfoButtons, we used the geocoding procedure of [15] to transform the original location data into canonical location names.

3.3 Ontology-Based Semantic Matching
In the process of data integration, one critical problem is how to recognize that terms from different resources actually represent the same concept. As suggested in [16], we use the UMLS [17] Metathesaurus, which provides a common vocabulary and semantics for multiple terms that refer to the same concept. In this paper, we utilize a matching algorithm that uses the UMLS to recognize identical concepts. We used CUIs, which are concept unique identifiers for medical concepts in the UMLS.

Definition 3.1. Integrated data repository. $D = \{D_{1'}, D_{2'}, D_{3'}, \ldots, D_{n'}\}$ is the aggregation of all data sources being integrated. $D_{i'}$ represents the $i'$-th data source.

Definition 3.2. Repository of instances. $H_{i'} = \{h_{1,i'}, h_{2,i'}, h_{3,i'}, \ldots, h_{k,i'}\}$ is the set of instances from data source $D_{i'}$, and $h_{i,i'}$ is the $i$-th instance from set $H_{i'}$.

Definition 3.3. Repository of CUIs. $T_{i,i'} = \{t \mid t \text{ is a CUI of instance } h_{i,i'}\}$ is the set of CUIs that have at least one synonym identical to the name of the instance $h_{i,i'}$.

Assumption 3.1. If the name of instance $h_{i,i'}$ matches exactly with the label of the concept $C_i$ in the UMLS, then $h_{i,i'}$ is a term of the UMLS concept $C_i$, which is uniquely identified by the CUI of $C_i$.

Assumption 3.2. If the intersection of the set $T_{i,i'}$ and the set $T_{j,j'}$ is not empty and contains exactly one CUI, then the instances $h_{i,i'}$ and $h_{j,j'}$ refer to the same concept.
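The definitions and assumptions above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the authors' code: the in-memory table `TOY_UMLS` stands in for the UMLS Metathesaurus, the SUIs and all CUIs other than "C0026769" (the Multiple Sclerosis CUI cited in this paper) are hypothetical placeholders, and the function names are invented for this example. Name comparison is done case-insensitively after trimming whitespace, which is an assumption about what "matches exactly" means in practice.

```python
from typing import Iterable, Set, Tuple

UmlsRow = Tuple[str, str, str]  # (cui, sui, str), as in a UMLS concept-name row

# Toy stand-in for the Metathesaurus. Only C0026769 comes from the paper;
# the other identifiers are hypothetical.
TOY_UMLS = [
    ("C0026769", "S0000001", "Multiple Sclerosis"),
    ("C0026769", "S0000002", "MS"),
    ("C9990001", "S0000003", "MS"),        # hypothetical second concept abbreviated "MS"
    ("C9990002", "S0000004", "Migraine"),  # hypothetical CUI
]


def candidate_cuis(name: str, rows: Iterable[UmlsRow]) -> Set[str]:
    """Definition 3.3: all CUIs having a synonym equal to the instance name."""
    key = name.strip().casefold()  # assumption: case-insensitive exact match
    return {cui for cui, _sui, label in rows if label.strip().casefold() == key}


def shared_cuis(name_a: str, name_b: str, rows: Iterable[UmlsRow]) -> Set[str]:
    """Intersection of the two candidate CUI sets."""
    rows = list(rows)
    return candidate_cuis(name_a, rows) & candidate_cuis(name_b, rows)


def same_concept(name_a: str, name_b: str, rows: Iterable[UmlsRow]) -> bool:
    """Assumption 3.2: same concept iff exactly one CUI is shared."""
    return len(shared_cuis(name_a, name_b, rows)) == 1


if __name__ == "__main__":
    print(candidate_cuis("MS", TOY_UMLS))
    print(same_concept("MS", "Multiple Sclerosis", TOY_UMLS))
    print(same_concept("MS", "Migraine", TOY_UMLS))
```

In this toy data, "MS" alone is ambiguous because it maps to two CUIs, but intersecting its candidates with those of "Multiple Sclerosis" leaves a single CUI, which is the disambiguation effect that Assumption 3.2 relies on.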
Assumption 3.1 states that if a name of the instance matches any one of a concept’s synonyms in the UMLS, this instance belongs to that UMLS concept. This assumption holds in most cases, but it is worth noting that the UMLS itself has some errors [18]. Naturally, there will be cases where several CUIs are possible matches. For example, one condition is “Multiple Sclerosis,” with the instance name “MS.” There are 14 different matching CUIs, of which only “C0026769” is the desired one. In our triple store, according to an empirical observation, the instances with short names are rare. Assumption 3.1 selects a superset of the correct CUI, and the intersection of two supersets is likely to contain the correct answer. Assumption 3.2 assumes that if two instances share one and only one possible unique concept, they are referring to the same concept.

Algorithm 1, shown in Figure 1, is proposed for computing the “identity” of two terms (or instances). In this algorithm, the CUIs of two instances from two data sources are collected; when the CUIs of the first instance overlap with the CUIs of the second instance, the “identity” flag is set to 1. The algorithm is applied to every pair of instances between two data sources. If the “identity” flag is 1, an OWL.sameAs relationship is added between the two instances.

Algorithm 1: Computing $S(h_{i,i'}, h_{j,j'})$
Input: instances $h_{i,i'}$, $h_{j,j'}$; $S(h_{i,i'}, h_{j,j'}) = 0$
Output: $S(h_{i,i'}, h_{j,j'})$
    Search the UMLS with name($h_{i,i'}$); if a str in a tuple (cui, sui, str) matches the term, add cui to $T_{i,i'}$.
    Search the UMLS with name($h_{j,j'}$); if a match is found in the UMLS, add cui to $T_{j,j'}$.
    for each t in $T_{i,i'}$
        for each t’ in $T_{j,j'}$
            if t is equal to t’
                set $S(h_{i,i'}, h_{j,j'}) = 1$;
                break;
            endif
        end
    end

Figure 1. Procedure of the Triple Matching Algorithm

4. SOCIAL INFOBUTTONS

4.1 Social Data Analytics
SPARQL queries were designed to answer expected user questions. Due to space limitations, only one question from each category is listed in Table 1.

Table 1. Questions Answered by Social InfoButtons

Category        Question
------------    --------------------------------------------
Statistics      Top conditions with the most patients?
Demographics    Gender distribution of the patients?
Location        Where is the individual patient?
Condition       What are the symptoms of the condition?
Correlation     Difference between social and official data?

The Social InfoButtons system provides information in a context-aware fashion, e.g., in the clinical patient care context or in the government policy evaluation context. The Social InfoButtons analytics that displays the information about the condition “Migraine” is shown in Figure 2.

4.2 Use Cases
Consider a medical doctor, Christine, writing prescriptions for her patient Bob, who is suffering from migraines. In addition to referencing Bob’s lab reports, Christine also wants to explore the social trends and experiences of other patients like Bob. The treatments used by other similar patients for migraines, and how they reacted to them, are also displayed. For migraines, 11 treatments were mentioned by other patients. We only show how the system displays Sumatriptan, which is the most popular treatment (Figure 2).

Figure 2. Social InfoButtons and Treatments

Government agencies could also use the Social InfoButtons system to compare the reports of patients on social network sites with official data. For example, as shown in Figure 3, the gender distribution of Asthma in Pennsylvania from PatientsLikeMe differs by 22.4 percentage points from government data. The reasons behind this difference still need further investigation.

Figure 3. Difference of Gender Distribution between Social and Government Health Data

5. SOCIAL INFOBUTTONS PROTOTYPE
The Social InfoButtons system architecture diagram is omitted due to space limitations.
The data collector extracts publicly available health data from various data sources, which include the social network site PatientsLikeMe, the government-maintained CDC website, the PubMed website, and the patient resource portal WebMD. All the text-based location information is sent to the Google geocoding server to be translated into canonical locations.

Based on our medical semantic model, our data collector program has extracted 17,407 publicly viewable patient profiles from PatientsLikeMe. Each profile contains basic patient information such as username, gender, age, and location. In addition, 1,228 conditions, 911 side effects, 2,176 symptoms, 1,354 treatment purposes, and 5,608 treatments were collected. The CDC Behavioral Risk Factor Surveillance System contains annual reports about prevalent diseases, and provides statewide data about the absolute numbers of patients and the percentages of patients among the whole population in US states. For PubMed and WebMD, the related URLs of the medical concepts, such as condition names, side effect names, symptom names, and treatment names, were extracted.

All of the triples are stored as a graph model in Jena TDB, which is a high-performance RDF store on a single machine. SPARQL queries are used to access the stored graph model.

6. CONCLUSIONS AND FUTURE WORK
We have developed a semantic integration method that links the social data reported by patients with other health information. A semantic network model for health data integration was defined as a conceptual model to link various data sources. We utilized the UMLS and presented a matching algorithm that searches for synonyms of resource names in existing medical ontologies, and then extracts and compares the identifiers of each pair of resources. Utilizing the integrated knowledge base, we implemented the Social InfoButtons prototype, which can provide the aggregated social health information that is relevant to an end user’s context.
Patients, clinicians, and government officials can make social-network-informed decisions with the Social InfoButtons prototype.

The following are topics for future work.
(1) The scalability of the approach will be investigated. The Jena TDB triple store currently contains 612,017 triples, and the query execution times range from instantaneous to a few seconds. However, real-time execution might become infeasible when the extracted dataset becomes very large. As proposed by Punnoose et al. [23], an index on RDF triples might be useful for speeding up queries. In addition, more analysis techniques will be used to discover the underlying patterns in the aggregated knowledge base.
(2) In cases where the query terms do not match the target terms but match the children of the target terms, semantic reasoning using the medical ontological information can be exploited to expand the matching and further improve the recall of the matching algorithm.
(3) More results of the matching algorithm and other relevant algorithms will be collected and compared in systematic experiments.
(4) Semantic searches will be employed to improve or replace the current, embedded SPARQL queries in order to fully utilize the advantages of the Jena TDB triple store.
(5) Currently data collection is automatic but not in real time, so it is desirable to expand the data collection process into a batch procedure or a real-time process.

7. ACKNOWLEDGMENTS
This work was partially funded by PSC-CUNY Research Foundation (2011-2013). This work was conducted while Chun was on sabbatical leave at Columbia University.

8. REFERENCES
[1] Pew Research Center. 2011. The Social Life of Health Information. http://pewinternet.org/~/media/files/reports/2011/pip_social_life_of_health_info.pdf.
[2] MedHelp. http://www.medhelp.org/.
[3] PatientsLikeMe. https://www.patientslikeme.com.
[4] NYC Open Data. https://nycopendata.socrata.com/.
[5] City of Chicago. Data Portal. https://data.cityofchicago.org/.
[6] Behavioral Risk Factor Surveillance System. http://www.cdc.gov/brfss/.
[7] Sheth A., Ramakrishnan C. 2003. Semantic Web Technology in Action: Ontology Driven Information Systems for Search, Integration and Analysis. IEEE Data Engineering Bulletin.
[8] PubMed. http://www.ncbi.nlm.nih.gov/pubmed.
[9] WebMD. http://www.webmd.com.
[10] Cimino J.J., Elhanan G., Zeng Q. 1997. Supporting Infobuttons with Terminological Knowledge. Proceedings of AMIA Annual Symposium.
[11] Shvaiko P., Euzenat J. 2013. Ontology Matching: State of The Art and Future Challenges. IEEE Transactions on Knowledge and Data Engineering, Volume: 25, Issue: 1.
[12] Fernandez-Luque L., Karlsen R., Bonander J. 2011. Review of Extracting Information From the Social Web for Health Personalization. J Med Internet Res 2011;13(1):e15.
[13] Collins S.A., Currie L.M., Bakken S., Cimino J.J. 2009. Information Needs, Infobutton Manager Use, and Satisfaction by Clinician Type: A Case Study. J Am Med Inform Assoc. 16(1): 140-142.
[14] Cimino J.J., Li J., Allen M., Currie L.M., Graham M., Janetzki V., Lee N.J., Bakken S., Patel V.L. 2004. Practical Considerations for Exploiting the World Wide Web to Create InfoButtons. Studies in Health Technology and Informatics. 107(Pt 1):277-81.
[15] Ji X., Chun S., and Geller J. 2012. Epidemic Outbreak and Spread Detection System Based on Twitter Data. Proceedings of The 1st International Conference on Health Information Science (HIS 2012): 152-163, Beijing, China.
[16] Chun S. and MacKellar B. 2012. Social Health Data Integration using Semantic Web. Proceedings of the 27th Annual ACM Symposium on Applied Computing, Pages 392-397.
[17] Bodenreider O. 2004. The Unified Medical Language System (UMLS): Integrating Biomedical Terminology. Nucleic Acids Research 32(Database-Issue): p267-270.
[18] Geller J., Morrey C.P., Xu J., Halper M., Elhanan G., Perl Y., Hripcsak G. 2009. Comparing Inconsistent Relationship Configurations indicating UMLS Errors. Proceedings of AMIA Annual Symposium.
[19] Open Government Initiative. www.whitehouse.gov/open.
[20] McBride B. 2001. Jena: Implementing the RDF Model and Syntax Specification. Proceedings of the Second International Workshop on the Semantic Web.
[21] PHP Simple HTML DOM Parser. http://simplehtmldom.sourceforge.net.
[22] CDC Prevalence Data of Asthma in 2010. http://www.cdc.gov/asthma/brfss/2010/brfssdata.htm.
[23] Punnoose R., Crainiceanu A., Rapp D. 2012. Rya: A Scalable RDF Triple Store for the Clouds. Proceedings of the 1st International Workshop on Cloud Intelligence.
[24] Jena TDB. http://jena.apache.org/documentation/tdb.