BioSumm: A novel summarizer oriented to biological information

Elena Baralis, Alessandro Fiori, Lorenzo Montrucchio
Politecnico di Torino

Introduction

The availability of ever larger text repositories requires effective techniques to manage the huge mass of unstructured information they contain (e.g., to navigate, analyse and represent it in the most suitable way).

In the biological and biomedical domain in particular, a huge amount of information is generated and contributed every day by a vast research community spread all over the world.

Repositories like PubMed Central, the U.S. National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature, now contain millions of documents.

Aim

• The BioSumm (Biological Summarizer) framework analyses large collections of unclassified biomedical texts and exploits clustering and summarization techniques to obtain a concise synthesis, explicitly designed to emphasize the parts of the text that are most relevant for the disclosure of gene and/or protein interactions
• The framework is designed to be flexible, modular and oriented to biological information
• Researchers can exploit BioSumm for knowledge inference and for biological validation of interactions discovered in independent ways (e.g., by means of data mining techniques)

Framework

Modular architecture composed of three blocks

• Preprocessing. Extracts the relevant parts of the original document and performs text stemming
• Clustering. Divides rather diverse texts into homogeneous clusters, in which the documents cover the same topic
• Summarization. Produces a summary for each cluster

[Figure: BioSumm Framework Architecture]

Preprocessing and Clustering

General purpose blocks

• Preprocessing.
  • Parses the PubMed Central XML inputs
  • Removes XML tags and biologically irrelevant information
  • Represents the documents according to the bag-of-words model using the RapidMiner Text Plugin
• Clustering. Exploits the bag-of-words representation and produces the clusters using the CLUTO software package

[Figure: RapidMiner workflow]

Summarization

• Based on a traditional statistical summarizer (OTS)
• Biases sentence selection using the information contained in a Domain Specific Dictionary
• The dictionary contains gene and protein names and aliases

Grading function for sentence j in document i, with the following components:

• Term frequency, in document i, of a non-stopword term k
• Number of distinct occurrences of dictionary term g_n
• A parameter that weights the number of distinct dictionary terms g_n
• A parameter that favours sentences that contain dictionary terms, regardless of how many they contain

[The formula and the two parameter ranges given on the poster are missing from the extracted text.]

Preliminary Experimental Results

Comparison with traditional summarizers
• BioSumm sentences have the same expressive power as the traditional ones, but a strong focus on biology

Clustering Quality Evaluation
• The Rand index is used as the metric
• A Rand index close to 1 means that the clustering block succeeded in dividing the documents by topic
• Result: Rand index close to 1

Performance Evaluation
• Measured in terms of completion times
• Roughly linear trend with the number of documents

Conclusions

• BioSumm can summarize large collections of unstructured data by extracting the sentences that are most relevant for knowledge inference and biological validation of gene/protein relationships
• BioSumm has a strong focus on biology-related sentences

Future Work

• Extend the BioSumm approach to other summarization techniques (e.g., based on Latent Semantic Analysis)
• Validate on other domains (e.g., financial)

Contact

Alessandro Fiori (PhD Student)
Phone: 0039 011 090 7194
Fax: 0039 011 090 7099
Email: alessandro.fiori@polito.it
Web: http://dbdmg.polito.it/
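
Supplementary note on the Clustering Quality Evaluation: the Rand index compares two groupings of the same documents by counting the document pairs on which they agree (both put the pair together, or both keep it apart) and dividing by the total number of pairs. The sketch below is a minimal illustration in Python, not the evaluation code used for BioSumm; the function name `rand_index` and the toy topic and cluster labels are assumptions made for the example.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Return the Rand index of two labelings of the same items.

    A pair of items counts as an agreement when both labelings place
    the two items together, or both place them apart.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("at least two items are required")

    agreements = 0
    for a, b in pairs:
        same_true = labels_true[a] == labels_true[b]
        same_pred = labels_pred[a] == labels_pred[b]
        if same_true == same_pred:
            agreements += 1
    return agreements / len(pairs)


# Hypothetical data: six documents, two reference topics, and the
# cluster identifiers assigned by some clustering run.
topics = ["p53", "p53", "p53", "brca", "brca", "brca"]
clusters = [0, 0, 1, 1, 1, 1]
print(rand_index(topics, clusters))
```

A value of 1 is reached only when the two groupings are identical up to a renaming of the cluster identifiers, which is why a value close to 1 indicates a successful division by topic.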