Identification of Biomedical Articles with Highly Related Core Contents

Rey-Long Liu
Dept. of Medical Informatics
Tzu Chi University
Taiwan

Outline
• Background
• Problem definition
• The proposed technique: CCSE (Core Content Similarity Estimation)
• Empirical evaluation
• Conclusion

Background

Core Contents of Biomedical Articles
• Core contents of a scholarly article a are the textual contents about
  – Research goal of a
  – Research background of a
  – Research conclusion of a

Similarity Estimation for the Core Contents
• Goal: retrieval of highly related articles
  – Mining & analysis of highly related evidence
  – A typical goal of existing search engines
• Challenge: recognition of the core contents
  – Core content of an article a may be briefly expressed in the title and scattered in the abstract

[Figure: example articles for <erythropoietin, anemia>]
• Selected by biomedical experts for <erythropoietin, anemia>: they are highly related to each other
• Recommended by PubMed, but NOT highly related to <erythropoietin, anemia>

Problem Definition

Goal
• Developing a technique CCSE (Core Content Similarity Estimation)
  – Given: titles and abstracts of two articles a1 and a2
  – Output: core content similarity between a1 and a2

Contributions
• CCSE works on titles and abstracts only, which are publicly available
• CCSE improves inter-article similarity estimation by considering the core contents of the articles

Related Work
• Inter-article similarity based on citation links
  – Example I: out-link citations (by bibliographic coupling, BC)
  – Example II: in-link citations (by co-citation, CC)
  – Weakness:
    • The citation links are often not available on the Internet (many articles even have no in-link citations)

Related Work (cont.)
• Inter-article similarity based on textual contents
  – Works on publicly available parts
    • Titles and abstracts
  – Considers weights of terms (e.g., TFIDF weight)
  – Weakness:
    • Does not consider core content similarity (because core contents are difficult to recognize)

The Proposed Technique: CCSE

Main Ideas
[Diagram: title and abstract of a1 compared with title and abstract of a2]
• How are goal terms of a1 related to the goal of a2?
• How are background and conclusion terms of a1 related to the background and conclusion of a2, respectively?

Main Ideas (cont.)
• Three kinds of relatedness of a term t to the core content of an article a:
  – Rgoal: Relatedness to goal
  – Rback: Relatedness to background
  – Rconc: Relatedness to conclusion
• Rgoal, Rback, and Rconc are estimated based on the positions of t in the title and the abstract of a

Step 1/2: Estimation of Rgoal, Rback, and Rconc based on the positions of the term
[Formulas on this slide are not preserved in the extracted text.]

Step 2/2: Estimating inter-article similarity between two articles a1 and a2

\[
\mathrm{Similarity}_{CCSE}(a_1, a_2) = \mathrm{CoreMatch}(a_1, a_2) \times \mathrm{CoreMatch}(a_2, a_1)
\]

• Any mismatch between the core contents will significantly reduce the inter-article similarity

\[
\mathrm{CoreMatch}(a_1, a_2) = \frac{\mathrm{Match}_{goal}(a_1, a_2) + \mathrm{Match}_{back}(a_1, a_2) + \mathrm{Match}_{conc}(a_1, a_2)}{3}
\]

• Similarity between a1 and a2 is based on goal match, background match, and conclusion match between a1 and a2

[The operators in the two formulas above were lost in extraction; the product and the average are reconstructed from the accompanying notes and the trailing 3.]

\[
\mathrm{Match}_{goal}(a_1, a_2) =
\frac{\displaystyle\sum_{t \in Title(a_1);\; t \in Title(a_2) \cup Abstract(a_2)} \mathrm{InterR}\big(R_{goal}(t, a_1), R_{goal}(t, a_2)\big) \times \log_2 \mathrm{IDF}(t)}
     {\displaystyle\sum_{t \in Title(a_1)} R_{goal}(t, a_1) \times \log_2 \mathrm{IDF}(t)}
\]

• Matchgoal(a1, a2) is based on how terms in the title of a1 are related to the goals of a1 and a2
• [Slide annotation next to the numerator: Min]

\[
\mathrm{Match}_{back}(a_1, a_2) =
\begin{cases}
\mathrm{Match}_{goal}(a_1, a_2), & \text{if } a_1 \text{ or } a_2 \text{ has no abstract} \\[2ex]
\dfrac{\displaystyle\sum_{t \in Abstract(a_1);\; t \in Abstract(a_2)} \mathrm{InterR}\big(R_{back}(t, a_1), R_{back}(t, a_2)\big) \times \log_2 \mathrm{IDF}(t)}
      {\displaystyle\sum_{t \in Abstract(a_1)} R_{back}(t, a_1) \times \log_2 \mathrm{IDF}(t)}, & \text{otherwise}
\end{cases}
\]

• Matchback(a1, a2) is based on how terms in the abstract of a1 are related to the backgrounds of a1 and a2
• [Slide annotation next to the numerator: Min]

\[
\mathrm{Match}_{conc}(a_1, a_2) =
\begin{cases}
\mathrm{Match}_{goal}(a_1, a_2), & \text{if } a_1 \text{ or } a_2 \text{ has no abstract} \\[2ex]
\dfrac{\displaystyle\sum_{t \in Abstract(a_1);\; t \in Abstract(a_2)} \mathrm{InterR}\big(R_{conc}(t, a_1), R_{conc}(t, a_2)\big) \times \log_2 \mathrm{IDF}(t)}
      {\displaystyle\sum_{t \in Abstract(a_1)} R_{conc}(t, a_1) \times \log_2 \mathrm{IDF}(t)}, & \text{otherwise}
\end{cases}
\]

• Matchconc(a1, a2) is based on how terms in the abstract of a1 are related to the conclusions of a1 and a2
• [Slide annotation next to the numerator: Min]

Interesting Features of CCSE
• Inter-article similarity is composed of three parts
  – Goal similarity
  – Background similarity
  – Conclusion similarity
• These similarities are estimated based on the positions of the terms appearing in the title and the abstract of the article
• Any mismatch between the core contents will significantly reduce the inter-article similarity

Empirical Evaluation

The Data
• Two sets of articles
  – Highly related biomedical articles:
    • For each gene-disease pair <g,d>, collect the biomedical articles that biomedical experts selected to annotate the pair (noted by DisGeNET)
  – Near-miss biomedical articles (non-highly related articles):
    • For each gene-disease pair <g,d>, collect articles using two queries: “g NOT d” and “d NOT g”
• Data statistics
  – 53 gene-disease pairs
  – 9,875 articles, including
    • 53 targets + 9,822 candidates
  – 435,786 out-link references

The Baseline Systems
(1) Link-based inter-article similarity
  – Bibliographic coupling (BC)

\[
\mathrm{Similarity}_{BC}(a_1, a_2) = \frac{|O_{a_1} \cap O_{a_2}|}{|O_{a_1} \cup O_{a_2}|}
\]

(2) Text-based inter-article similarity
  – BM25 (one of the best in the biomedical domain)

\[
\mathrm{Similarity}_{BM25}(a_1, a_2) = \sum_{t \in a_1 \cap a_2} \frac{TF(t, a_2)\,(k_1 + 1)}{TF(t, a_2) + k_1\left(1 - b + b\,\dfrac{|a_2|}{avgal}\right)} \times \log_2 \mathrm{IDF}(t)
\]

(3) Biomedical search engine
  – PubMed (popular and one of the best in the biomedical domain)

Evaluation Criteria
• MAP (Mean Average Precision)
  – If a system ranks the articles that are highly related to the target article r higher, the average precision (AvgP) for the gene-disease pair will be higher
  – MAP is simply the average of the AvgP values for all gene-disease pairs

\[
\mathrm{AvgP}(i) = \frac{\displaystyle\sum_{j=1}^{h_i} \frac{j}{Seen_i(j)}}{h_i},
\qquad
\mathrm{MAP} = \frac{\displaystyle\sum_{i=1}^{|T|} \mathrm{AvgP}(i)}{|T|}
\]

• Average P@X
  – If the articles that are highly related to the target article r are ranked in the top-X positions, P@X for the gene-disease pair will be higher
  – Average P@X is simply the average of the P@X values for all gene-disease pairs

\[
\mathrm{P@X}(i) = \frac{\text{Number of top-}X\text{ articles that are highly related to the target in the } i\text{-th test}}{X},
\qquad
\text{Average P@X} = \frac{\displaystyle\sum_{i=1}^{|T|} \mathrm{P@X}(i)}{|T|}
\]

Result
• CCSE performs significantly better than BC and BM25
• CCSE performs better than PubMed in all evaluation criteria

Conclusion
• Our motivation:
  – Core contents of scholarly articles are essential for retrieval of highly related scientific evidence, BUT
  – The core contents are scattered in titles and abstracts of articles
• We develop CCSE, which
  – Estimates inter-article similarity based on the similarities in goals, backgrounds, and conclusions of two articles
• The idea of CCSE can be
  – Incorporated into search engines to properly retrieve highly related scholarly articles
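
Supplementary sketch: the bibliographic coupling baseline and the two evaluation criteria (MAP and Average P@X) described earlier can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the evaluation: the function names are hypothetical, Seen_i(j) is assumed to be the rank position at which the j-th highly related article appears, h_i is assumed to be the number of highly related articles in the i-th test, and a highly related article missing from a ranking is assumed to contribute zero.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def bc_similarity(out_links_1: Set[Hashable], out_links_2: Set[Hashable]) -> float:
    """Bibliographic coupling: shared out-link references over all out-link references."""
    union = out_links_1 | out_links_2
    if not union:
        return 0.0  # assumption: two articles without references get similarity 0
    return len(out_links_1 & out_links_2) / len(union)


def average_precision(ranking: Sequence[Hashable], related: Set[Hashable]) -> float:
    """AvgP for one test: mean of j / Seen(j) over the highly related articles."""
    if not related:
        return 0.0  # assumption: a test without highly related articles scores 0
    hits = 0
    total = 0.0
    for position, article in enumerate(ranking, start=1):
        if article in related:
            hits += 1                 # this is the j-th highly related article
            total += hits / position  # j / Seen(j), with Seen(j) = rank position
    return total / len(related)


def precision_at_x(ranking: Sequence[Hashable], related: Set[Hashable], x: int) -> float:
    """P@X for one test: fraction of the top-X articles that are highly related."""
    return sum(1 for article in ranking[:x] if article in related) / x


Test = Tuple[Sequence[Hashable], Set[Hashable]]


def mean_average_precision(tests: List[Test]) -> float:
    """MAP: average of AvgP over all tests (one test per gene-disease pair)."""
    return sum(average_precision(r, rel) for r, rel in tests) / len(tests)


def average_precision_at_x(tests: List[Test], x: int) -> float:
    """Average P@X: average of P@X over all tests."""
    return sum(precision_at_x(r, rel, x) for r, rel in tests) / len(tests)
```

For example, with the ranking `["a", "b", "c", "d"]` and highly related set `{"a", "c"}`, the highly related articles appear at positions 1 and 3, so `average_precision` evaluates (1/1 + 2/3) / 2, roughly 0.83, and `precision_at_x` with X = 2 gives 0.5.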