biological and biomedical domain

BioSumm
A novel summarizer oriented to biological information
Elena Baralis, Alessandro Fiori, Lorenzo Montrucchio
Politecnico di Torino
Introduction
Preprocessing and Clustering
Preliminary Experimental Results
The availability of ever-larger text
repositories requires effective techniques to
manage the huge mass of unstructured information
they contain (e.g., to navigate, analyse and
represent it in the most suitable way).
General purpose blocks
Comparison with traditional summarizers
In particular, in the biological and biomedical
domain, a huge amount of information is generated
daily and contributed by a vast research
community spread all over the world.
• Preprocessing.
• Parses the PubMed Central XML inputs
• Removes XML tags and biologically irrelevant
information
• Represents the documents according to the
bag-of-words model using the RapidMiner Text
Plugin
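The bag-of-words representation mentioned above can be illustrated with a minimal Python sketch. It is not the RapidMiner Text Plugin workflow used by BioSumm: the tokenizer, the tiny stopword list, the function names and the example sentences are all assumptions made for illustration, and stemming is omitted.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "with"}


def bag_of_words(text):
    """Map a plain-text document to term -> count, ignoring stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def to_vectors(docs):
    """Turn documents into count vectors over a shared, sorted vocabulary."""
    bags = [bag_of_words(d) for d in docs]
    vocab = sorted(set().union(*bags))
    return vocab, [[bag[t] for t in vocab] for bag in bags]


# Hypothetical example documents (already stripped of XML tags).
docs = ["BRCA1 interacts with BARD1", "The BRCA1 protein binds BRCA1 partners"]
vocab, vectors = to_vectors(docs)
```

Each document becomes one row of term counts over the shared vocabulary; vectors of this kind are the usual input to a clustering step.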
Repositories like PubMed Central, the U.S. National
Institutes of Health (NIH) free digital archive of
biomedical and life sciences journal literature,
nowadays contain millions of documents.
RapidMiner work flow.
• BioSumm sentences have the same expressive
power as those of traditional summarizers, but a
stronger focus on biology
PubMed Central Logo.
Clustering Quality Evaluation
• Rand index is used as metric
• A Rand Index close to 1 means that the clustering
block succeeded in the division by topic
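As a reminder of how the metric works, the Rand index is the fraction of document pairs on which two partitions agree (both place the pair in the same cluster, or both place it in different clusters). The sketch below is a plain pair-counting version; the function name and the example labels are illustrative assumptions, not taken from the BioSumm evaluation.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which the two labelings agree."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        return 1.0  # zero or one item: the partitions trivially agree
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical topic labels for four documents.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # same partition, renamed clusters
partial = rand_index([0, 0, 1, 1], [0, 1, 1, 1])
```

The index does not depend on how clusters are named, so a clustering that reproduces the reference topics under different cluster identifiers still scores 1.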
Aim
• The BioSumm (Biological Summarizer) framework
analyses large collections of unclassified
biomedical texts and exploits clustering and
summarization techniques to obtain a concise
synthesis, explicitly aimed at emphasizing the text
parts that are most relevant for the disclosure of
gene (and/or protein) interactions
• Clustering. Exploits the bag-of-words
representation and produces the clusters using the
CLUTO software package
Rand Index close to 1
BioSumm Logo.
• The framework is designed to be flexible, modular
and oriented to biological information
• Researchers can exploit BioSumm for knowledge
inference and biological validation of the
interactions discovered in independent ways (e.g.,
by means of data mining techniques)
Summarization
Performance Evaluation
• Based on a traditional statistical summarizer (OTS)
• Measured in terms of completion times
• Roughly linear trend with the number of
documents
• Biases sentence selection using the information
contained in a Domain Specific Dictionary
• The dictionary contains gene and protein
names and aliases
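The idea of biasing sentence selection with a dictionary can be sketched as follows. This is not the BioSumm grading function, which is not reproduced here: the multiplicative combination, the parameter names `presence_bonus` and `count_weight`, their values, and the example data are all assumptions chosen only to show one way a term-frequency score could be boosted by dictionary hits.

```python
import re


def score_sentence(sentence, term_freq, dictionary,
                   presence_bonus=0.5, count_weight=0.25):
    """Term-frequency score, boosted when dictionary terms appear.

    term_freq: document-level frequencies of non-stopword terms.
    dictionary: lowercase gene/protein names and aliases.
    presence_bonus, count_weight: hypothetical tuning parameters.
    """
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    # Base score: sum of document-level frequencies of the sentence's terms.
    base = sum(term_freq.get(t, 0) for t in tokens)
    # Number of distinct dictionary terms found in the sentence.
    distinct_hits = len(set(tokens) & dictionary)
    multiplier = 1.0
    if distinct_hits > 0:
        # Flat bonus for containing any dictionary term, plus a
        # per-term bonus that grows with the number of distinct hits.
        multiplier += presence_bonus + count_weight * distinct_hits
    return base * multiplier


# Hypothetical document statistics and dictionary.
term_freq = {"brca1": 3, "binds": 1, "bard1": 2, "cells": 4, "grow": 1}
dictionary = {"brca1", "bard1"}
biological = score_sentence("BRCA1 binds BARD1", term_freq, dictionary)
generic = score_sentence("Cells grow", term_freq, dictionary)
```

With these made-up numbers the sentence that names two dictionary terms outranks the generic one, even though their unboosted term-frequency scores are close.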
Grading function for sentence j in document i:
Term frequency, in document i, of a non-stopword term k
Framework
Conclusions
Modular architecture composed of three blocks
• BioSumm can summarize large collections of
unstructured data by extracting the sentences
that are more relevant for knowledge inference and
biological validation of gene/protein relationships
• BioSumm has a strong focus on biology-related
sentences
Number of distinct occurrences of dictionary term g_n
BioSumm Framework Architecture.
Weights the number of distinct dictionary terms g_n
• Preprocessing. Extracts relevant parts of the
original document and performs text stemming
• Clustering. Divides rather diverse texts into
homogeneous clusters, in which the documents
cover the same topic
Favours sentences that contain dictionary terms, regardless of their number
• Extend the BioSumm approach to other
summarization techniques (e.g., based on Latent
Semantic Analysis)
• Validate on other domains (e.g., financial)
Contact:
Alessandro Fiori (PhD Student)
• Summarization. Produces a summary for each
cluster
Future Works
Phone: 0039 011 090 7194
Fax: 0039 011 090 7099
Email: alessandro.fiori@polito.it
Web: http://dbdmg.polito.it/