Identification of Biomedical
Articles with Highly Related
Core Contents
Rey-Long Liu
Dept. of Medical Informatics
Tzu Chi University
Taiwan
Outline
• Background
• Problem definition
• The proposed technique: CCSE (Core
Content Similarity Estimation)
• Empirical evaluation
• Conclusion
Background
Core Contents of
Biomedical Articles
• Core contents of a scholarly article a
are the textual contents about
– Research goal of a
– Research background of a
– Research conclusion of a
Similarity Estimation for the
Core Contents
• Goal: retrieval of highly related articles
– Mining & analysis of highly related evidence
– A typical goal of existing search engines
• Challenge: recognition of the core contents
– Core content of an article a may be briefly
expressed in the title and scattered in the abstract
[Figure: example articles for <erythropoietin, anemia>. The articles selected by biomedical experts for <erythropoietin, anemia> are highly related to each other. The articles recommended by PubMed are NOT highly related to <erythropoietin, anemia>.]
Problem Definition
Goal
• Developing a technique CCSE (Core
Content Similarity Estimation)
– Given: titles and abstracts of two articles
a1 and a2
– Output: core content similarity between a1
and a2
Contributions
• CCSE works on titles and abstracts only,
which are publicly available
• CCSE improves inter-article similarity
estimation by considering the core contents
of the articles
Related Work
• Inter-article similarity based on citation links
– Example I: out-link citations (by bibliographic
coupling, BC)
– Example II: in-link citations (by co-citation, CC)
– Weakness:
• The citation links are often not available on the
Internet (many articles even have no in-link
citations)
Related Work (cont.)
• Inter-article similarity based on textual
contents
– Working on publicly available parts
• Titles and abstracts
– Considering weights of terms (e.g., TFIDF
weight)
– Weakness:
• Did not consider the core content similarity (due
to the difficulty in recognizing the core contents)
The Proposed Technique:
CCSE
Main Ideas
[Diagram: Title of a1 and Abstract of a1, alongside Title of a2 and Abstract of a2]
• How are goal terms of a1 related to the goal of a2?
• How are background and conclusion terms of a1 related to the
background and conclusion of a2, respectively?
Main Ideas (cont.)
• Three kinds of relatedness of a term t to the core
content of an article a:
– Rgoal: Relatedness to goal
– Rback: Relatedness to background
– Rconc: Relatedness to conclusion
• Rgoal, Rback, and Rconc are estimated based on
the positions of t in the title and the abstract of
a
Step 1/2: Estimation of Rgoal, Rback, and Rconc
based on positions of the term:
Step 2/2: Estimating inter-article similarity
between two articles a1 and a2
$$\mathrm{Similarity}_{CCSE}(a_1, a_2) = \mathrm{CoreMatch}(a_1, a_2) \times \mathrm{CoreMatch}(a_2, a_1)$$
→ Any mismatch between the core contents will
significantly reduce the inter-article similarity
$$\mathrm{CoreMatch}(a_1, a_2) = \frac{\mathrm{Match}_{goal}(a_1, a_2) + \mathrm{Match}_{back}(a_1, a_2) + \mathrm{Match}_{conc}(a_1, a_2)}{3}$$
→ Similarity between a1 and a2 is based on goal match,
background match, and conclusion match between a1 and a2
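The combination step can be sketched as below. This is a minimal illustration, assuming the two formulas are read as a product of the two directional CoreMatch scores, each being the mean of the three match scores. The function names and the numeric inputs are hypothetical and do not come from the slides.

```python
def core_match(match_goal: float, match_back: float, match_conc: float) -> float:
    """Mean of the goal, background, and conclusion match scores."""
    return (match_goal + match_back + match_conc) / 3.0


def similarity_ccse(core_match_12: float, core_match_21: float) -> float:
    """Product of the two directional CoreMatch scores."""
    return core_match_12 * core_match_21


# Hypothetical match scores, chosen only for illustration.
cm_12 = core_match(0.9, 0.6, 0.3)  # a1 matched against a2
cm_21 = core_match(0.6, 0.6, 0.6)  # a2 matched against a1
sim = similarity_ccse(cm_12, cm_21)
print(round(sim, 4))
```

Because the two directions are multiplied, a low score in either direction pulls the overall similarity down, which is consistent with the remark that any mismatch significantly reduces the similarity.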
$$\mathrm{Match}_{goal}(a_1, a_2) = \frac{\sum_{t \in Title(a_1);\; t \in Title(a_2) \cup Abstract(a_2)} \underbrace{\mathrm{InterR}(R_{goal}(t, a_1), R_{goal}(t, a_2))}_{\mathrm{Min}} \times \log_2 IDF(t)}{\sum_{t \in Title(a_1)} R_{goal}(t, a_1) \times \log_2 IDF(t)}$$
→ Matchgoal(a1, a2) is based on how terms in the title of a1
are related to the goals of a1 and a2
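A rough sketch of the goal match follows. It rests on several assumptions that are not confirmed by the slides: InterR is taken to be the minimum of the two relatedness values (suggested by the "Min" label), a term of a1's title counts when it appears in the title or the abstract of a2, and the Rgoal values are supplied as precomputed inputs because the position-based estimation is not reproduced here. All names and numbers are hypothetical.

```python
import math
from typing import Dict, Set


def match_goal(title1: Set[str], title2: Set[str], abstract2: Set[str],
               r_goal1: Dict[str, float], r_goal2: Dict[str, float],
               idf: Dict[str, float]) -> float:
    """Sketch of Match_goal(a1, a2) with InterR assumed to be min()."""
    inter_r = min  # assumption: InterR(x, y) = min(x, y)
    scope2 = title2 | abstract2  # assumption: title OR abstract of a2
    numerator = sum(
        inter_r(r_goal1[t], r_goal2.get(t, 0.0)) * math.log2(idf[t])
        for t in title1 if t in scope2
    )
    denominator = sum(r_goal1[t] * math.log2(idf[t]) for t in title1)
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical terms, relatedness values, and IDF values.
title1 = {"erythropoietin", "anemia"}
title2 = {"erythropoietin"}
abstract2 = {"anemia", "dialysis"}
r_goal1 = {"erythropoietin": 1.0, "anemia": 1.0}
r_goal2 = {"erythropoietin": 1.0, "anemia": 0.5, "dialysis": 0.2}
idf = {"erythropoietin": 8.0, "anemia": 4.0, "dialysis": 16.0}
score = match_goal(title1, title2, abstract2, r_goal1, r_goal2, idf)
print(round(score, 4))
```

The denominator normalizes by the total weighted goal relatedness of a1's title terms, so the score reaches 1 only when every title term of a1 is at least as goal-related in a2.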
$$\mathrm{Match}_{back}(a_1, a_2) = \begin{cases} \mathrm{Match}_{goal}(a_1, a_2), & \text{if } a_1 \text{ or } a_2 \text{ has no abstract} \\ \dfrac{\sum_{t \in Abstract(a_1);\; t \in Abstract(a_2)} \underbrace{\mathrm{InterR}(R_{back}(t, a_1), R_{back}(t, a_2))}_{\mathrm{Min}} \times \log_2 IDF(t)}{\sum_{t \in Abstract(a_1)} R_{back}(t, a_1) \times \log_2 IDF(t)}, & \text{otherwise.} \end{cases}$$
→ Matchback(a1, a2) is based on how terms in the abstract of
a1 are related to the backgrounds of a1 and a2
$$\mathrm{Match}_{conc}(a_1, a_2) = \begin{cases} \mathrm{Match}_{goal}(a_1, a_2), & \text{if } a_1 \text{ or } a_2 \text{ has no abstract} \\ \dfrac{\sum_{t \in Abstract(a_1);\; t \in Abstract(a_2)} \underbrace{\mathrm{InterR}(R_{conc}(t, a_1), R_{conc}(t, a_2))}_{\mathrm{Min}} \times \log_2 IDF(t)}{\sum_{t \in Abstract(a_1)} R_{conc}(t, a_1) \times \log_2 IDF(t)}, & \text{otherwise.} \end{cases}$$
→ Matchconc(a1, a2) is based on how terms in the abstract of
a1 are related to the conclusions of a1 and a2
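The background and conclusion matches share one shape, including the fallback to the goal match when an abstract is missing. The sketch below illustrates that shape under stated assumptions: InterR is taken to be min(), the relatedness values (Rback or Rconc) are precomputed inputs, and the function name and sample numbers are hypothetical.

```python
import math
from typing import Dict, Optional, Set


def match_abstract_part(abstract1: Optional[Set[str]],
                        abstract2: Optional[Set[str]],
                        r1: Dict[str, float], r2: Dict[str, float],
                        idf: Dict[str, float],
                        goal_match: float) -> float:
    """Sketch of Match_back / Match_conc with the no-abstract fallback."""
    if not abstract1 or not abstract2:
        # Either article lacks an abstract: fall back to the goal match.
        return goal_match
    inter_r = min  # assumption: InterR(x, y) = min(x, y)
    numerator = sum(
        inter_r(r1[t], r2.get(t, 0.0)) * math.log2(idf[t])
        for t in abstract1 & abstract2
    )
    denominator = sum(r1[t] * math.log2(idf[t]) for t in abstract1)
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical background relatedness values and IDF values.
abstract1 = {"anemia", "hemoglobin"}
abstract2 = {"anemia", "iron"}
r_back1 = {"anemia": 0.8, "hemoglobin": 0.4}
r_back2 = {"anemia": 0.6, "iron": 0.9}
idf = {"anemia": 4.0, "hemoglobin": 16.0, "iron": 8.0}
back = match_abstract_part(abstract1, abstract2, r_back1, r_back2, idf, goal_match=0.8)
no_abstract = match_abstract_part(None, abstract2, r_back1, r_back2, idf, goal_match=0.8)
print(round(back, 4), no_abstract)
```

Passing Rconc values instead of Rback values gives the conclusion match with the same code.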
Interesting Features of CCSE
• Inter-article similarity is composed of three parts
– Goal similarity
– Background similarity
– Conclusion similarity
• These similarities are estimated based on the
positions of the terms appearing in the title and
the abstract of the article
• Any mismatch between the core contents will
significantly reduce the inter-article similarity
Empirical Evaluation
The data
• Two sets of articles
– Highly related biomedical articles:
• For each gene-disease pair <g,d>, collect the
biomedical articles that biomedical experts
selected to annotate the pair (noted by DisGeNET)
– Near-miss biomedical articles (Non-highly
related articles):
• For each gene-disease pair <g,d>, collect articles
using two queries: “g NOT d” and “d NOT g”
• Data statistics
– 53 gene-disease pairs
– 9,875 articles, including
• 53 targets + 9,822 candidates
– 435,786 out-link references
The Baseline Systems
(1) Link-based inter-article similarity
– Bibliographic coupling (BC)
$$\mathrm{Similarity}_{BC}(a_1, a_2) = \frac{|O_{a_1} \cap O_{a_2}|}{|O_{a_1} \cup O_{a_2}|}$$
(2) Text-based inter-article similarity
– BM25 (one of the best in the biomedical domain)
$$\mathrm{Similarity}_{BM25}(a_1, a_2) = \sum_{t \in a_1 \cap a_2} \frac{TF(t, a_2)\,(k_1 + 1)}{TF(t, a_2) + k_1 \left(1 - b + b\,\frac{|a_2|}{avgal}\right)} \times \log_2 IDF(t)$$
(3) Biomedical search engine
– PubMed (popular and one of the best in the
biomedical domain)
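The two formula-based baselines can be sketched as follows. This is an illustration only: bibliographic coupling is read as the Jaccard overlap of the out-link reference sets, BM25 is read in its usual form, and the defaults k1 = 1.2 and b = 0.75 are common choices assumed here, not values stated in the slides. Function names and sample data are hypothetical.

```python
import math
from typing import Dict, Set


def similarity_bc(out1: Set[str], out2: Set[str]) -> float:
    """Shared out-link references divided by all out-link references."""
    union = out1 | out2
    return len(out1 & out2) / len(union) if union else 0.0


def similarity_bm25(terms1: Set[str], tf2: Dict[str, int], len2: int,
                    avg_len: float, idf: Dict[str, float],
                    k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 score of a2 against the terms of a1 (k1 and b are assumed)."""
    score = 0.0
    for t in terms1:
        tf = tf2.get(t, 0)
        if tf == 0:
            continue  # only terms shared by a1 and a2 contribute
        length_norm = k1 * (1 - b + b * len2 / avg_len)
        score += tf * (k1 + 1) / (tf + length_norm) * math.log2(idf[t])
    return score


# Hypothetical reference identifiers and term statistics.
bc = similarity_bc({"ref1", "ref2", "ref3"}, {"ref2", "ref3", "ref4"})
bm25 = similarity_bm25({"anemia", "iron"}, {"anemia": 2}, len2=100,
                       avg_len=100.0, idf={"anemia": 4.0, "iron": 8.0})
print(bc, round(bm25, 4))
```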
Evaluation Criteria
• MAP (Mean Average Precision)
– If a system ranks the articles that are highly related to
the target article r higher, the average precision (AvgP)
for the gene-disease pair will be higher
– MAP is simply the average of the AvgP values
for all gene-disease pairs
$$\mathrm{AvgP}(i) = \frac{\sum_{j=1}^{h_i} \frac{j}{Seen_i(j)}}{h_i}$$
$$\mathrm{MAP} = \frac{\sum_{i=1}^{|T|} \mathrm{AvgP}(i)}{|T|}$$
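A small sketch of the MAP computation is given below. It assumes the usual reading of the symbols, which the slides do not spell out: h_i is the number of highly related articles for the i-th test, and Seen_i(j) is the number of ranked articles seen when the j-th highly related article is reached. Each ranking is represented as a hypothetical list of booleans, one per ranked article.

```python
from typing import List


def average_precision(ranked_relevance: List[bool]) -> float:
    """AvgP for one test: mean of j / Seen(j) over the related articles."""
    hits = 0
    total = 0.0
    for position, is_related in enumerate(ranked_relevance, start=1):
        if is_related:
            hits += 1                 # j: the j-th highly related article
            total += hits / position  # Seen(j): articles seen so far
    return total / hits if hits else 0.0


def mean_average_precision(rankings: List[List[bool]]) -> float:
    """MAP: the average of the AvgP values over all tests."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two hypothetical rankings (True marks a highly related article).
ap_first = average_precision([True, False, True])
map_score = mean_average_precision([[True, False, True], [False, True]])
print(round(ap_first, 4), round(map_score, 4))
```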
• Average P@X
– If the articles that are highly related to the target article r are
ranked in the top-X positions, P@X for the gene-disease pair will be higher
– Average P@X is simply the average of the
P@X values for all gene-disease pairs
$$\mathrm{P@X}(i) = \frac{\text{Number of top-}X\text{ articles that are highly related to the target in the } i\text{-th test}}{X}$$
$$\text{Average P@X} = \frac{\sum_{i=1}^{|T|} \mathrm{P@X}(i)}{|T|}$$
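P@X and its average can be sketched in a few lines. As in any such illustration, the representation is an assumption: each ranking is a hypothetical list of booleans marking which ranked articles are highly related to the target.

```python
from typing import List


def precision_at_x(ranked_relevance: List[bool], x: int) -> float:
    """Fraction of the top-X ranked articles that are highly related."""
    return sum(1 for is_related in ranked_relevance[:x] if is_related) / x


def average_precision_at_x(rankings: List[List[bool]], x: int) -> float:
    """Average of the P@X values over all tests."""
    return sum(precision_at_x(r, x) for r in rankings) / len(rankings)


# Two hypothetical rankings, evaluated at X = 2.
p_first = precision_at_x([True, False, True, False], 2)
avg_p = average_precision_at_x([[True, False, True, False], [True, True, False]], 2)
print(p_first, avg_p)
```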
Result
CCSE performs significantly better than BC and BM25
CCSE performs better than PubMed in all evaluation criteria:
Conclusion
• Our Motivation:
– Core contents of scholarly articles are essential for
retrieval of highly related scientific evidence, BUT
– The core contents are scattered in titles and abstracts
of articles
• We develop CCSE that
– Estimates inter-article similarity based on the
similarities in goals, backgrounds, and conclusions
of two articles
• The idea of CCSE can be
– Incorporated into search engines to properly retrieve
highly related scholarly articles