Feature Based Gene Summary Extraction with Re-ranking
Samir Gupta
Computer and Information Sciences
University of Delaware
Newark, DE 19716 USA
[email protected]
Abstract
Because of the vast amount of biomedical literature, searching medical databases for information about genes is becoming cumbersome. Searching PubMed with a gene name as the query returns thousands of results, many of them irrelevant. The Gene Ontology (GO) and UniProtKB databases indicate relevant terms associated with a gene, but these terms are not enough for a quick understanding of the gene's different properties. Moreover, these resources are manually written and curated, which is both labor-intensive and time-consuming. Automatically generating summaries for genes would help biologists get an overall picture of a gene quickly. In this paper we adapt generic feature-based extractive summarization techniques and augment them with biomedical domain-specific features. We also use the concept of “novelty” to reduce redundancy in the extracted summary. Our results show that including domain-specific features and “redundancy removal” improves the content of the summary significantly.
1 Introduction
Biomedical databases like PubMed (McEntyre and Lipman, 2001) and BioMed Central¹ are expanding rapidly and contain millions of articles. Due to this vast amount of information, biologists spend a large amount of time searching and reading articles to find relevant information. One such “information need” of life scientists is gene-specific information. A quick overview of the different properties, functions and other aspects of a gene would be very useful. Efforts have been made to construct databases such as EntrezGene (Maglott et al., 2005), Gene Ontology² and UniProtKB³, which provide “important information” about a gene. But these databases are manually created and require curation and regular updates, which is labor-intensive. This necessitates the development of an automatic gene summary extractor.
In this paper we describe an approach which expands on the generic features used in summary extraction by including domain-specific features. Feature-based summary extraction techniques were explored by Edmundson (1969) and Kupiec et al. (1995) for generic domains. We augment these features with certain domain-specific features such as the presence of the gene name and certain biological cue phrases. We also use a variant of Maximal Marginal Relevance (Carbonell and Goldstein, 1998) to reduce the “redundancy” in the final summary. We use different modules of eGIFT (Tudor et al., 2010), a gene information mining tool, to extract the set of abstracts relating to a gene, compute “descriptive words”, and extract gene name variations. The major contributions of this paper are:
• Applying the generic features used by Edmundson (1969) to the biomedical domain.
• Augmenting the generic features with biomedical domain-specific features.
• Using terms provided by the Gene Ontology and UniProtKB databases to re-rank sentences based on “information novelty”.
¹ http://www.biomedcentral.com/
² http://www.geneontology.org
³ http://www.uniprot.org/help/uniprotkb
2 Approach
In this section we discuss the details of the gene summarization system. The input to this system is a gene identifier, the same as the one used in the eGIFT system (Tudor et al., 2010). Given a gene identifier, we first extract a set of abstracts from Medline. The retrieval of relevant abstracts for a given gene is done using the eGRAB (Extractor of Gene-Relevant ABstracts) module of eGIFT. The eGRAB module considers all gene names, synonyms, and aliases to query the Medline database and return a set of abstracts for the given gene.
Each sentence in the set of abstracts is scored based on a number of features. A subset of these features, such as term frequency, sentence position, presence of title words and sentence length, is similar to the ones used by Edmundson (1969) and Kupiec et al. (1995). In addition to these, we use features like the presence of the gene name and certain biological phrases to adapt the generic techniques to the biological domain. After several iterations on some test genes, we manually assigned weights to each of the features and computed a final score for each sentence. The top-ranking sentences are then selected for inclusion in the summary.
We have also explored the notion of “information novelty” to reduce redundancy across the selected sentences. This approach is based on the Maximal Marginal Relevance (MMR) model of Carbonell and Goldstein (1998), but differs in how “novelty” is computed and used. Based on the MMR model, we re-rank a subset of the sentences returned by the feature-based system. In the next two subsections we discuss the features and the re-ranking system in detail.
2.1 Computing Sentence Importance
The set of abstracts returned by the eGRAB module is preprocessed and segmented into sentences. A set of features is used to score the sentences and compute their importance. Based on the weighted score, the sentences are ranked and the top-ranking sentences are included in the summary. The first four features are those used by generic extractive summarizers. We have added two new features which are more specific to the biomedical domain. We call the system using only the first four features System-A. System-A helps us understand whether and how well generic approaches adapt to a particular domain. We hope to see a significant improvement when the last two domain-specific features are added (System-B).
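The ranking step can be illustrated with a minimal sketch. It assumes that the final score is a weighted sum of the per-feature scores; the feature names, the uniform weights and the helper names (`weighted_score`, `rank`) are hypothetical, since the manually tuned weights are not reported here.

```python
from typing import Dict, List, Tuple

# Hypothetical feature names and placeholder weights; the manually tuned
# values used in the paper are not given in the text.
GENERIC_WEIGHTS: Dict[str, float] = {
    "position": 1.0,
    "title_words": 1.0,
    "length": 1.0,
    "frequency": 1.0,
}
BIO_WEIGHTS: Dict[str, float] = {"gene": 1.0, "cue_phrase": 1.0}

# System-A uses the generic features only; System-B adds the two
# domain-specific features.
SYSTEM_A: Dict[str, float] = dict(GENERIC_WEIGHTS)
SYSTEM_B: Dict[str, float] = {**GENERIC_WEIGHTS, **BIO_WEIGHTS}


def weighted_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum over the features named in `weights`; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def rank(sentences: List[Tuple[str, Dict[str, float]]],
         weights: Dict[str, float], top_k: int) -> List[str]:
    """Return the texts of the top_k sentences by descending weighted score."""
    ordered = sorted(sentences,
                     key=lambda item: weighted_score(item[1], weights),
                     reverse=True)
    return [text for text, _ in ordered[:top_k]]
```

Under these assumptions, switching from `SYSTEM_A` to `SYSTEM_B` only changes which feature scores contribute to the sum, which makes the two systems directly comparable on the same sentences.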
Sentence Position Feature
This feature encodes positional information about a sentence in an abstract. The sentence position can be one of the following: title, first, last or middle sentence. As argued in the early work on extractive summarization by Edmundson (1969), first and last sentences are typically more important than other sentences. Thus higher scores are assigned to the first and last sentence positions than to middle or title sentence positions.
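A minimal sketch of this feature follows. The numeric scores are illustrative assumptions (the text only states that first and last positions score higher than middle and title positions), and the convention that item 0 of an abstract is its title is also assumed.

```python
# Assumed scores: only the ordering (first/last above middle/title)
# comes from the text.
POSITION_SCORES = {"first": 1.0, "last": 1.0, "middle": 0.5, "title": 0.5}


def position_label(index: int, total: int) -> str:
    """Label item `index` of an abstract with `total` items, where item 0 is the title."""
    if index == 0:
        return "title"
    if index == 1:
        return "first"
    if index == total - 1:
        return "last"
    return "middle"


def position_score(index: int, total: int) -> float:
    """Positional score of a sentence under the assumed score table."""
    return POSITION_SCORES[position_label(index, total)]
```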
Title Words Feature
This feature assigns a score between 0 and 1 to a sentence based on the presence of title words in the sentence. The title of the abstract is decomposed into words, and the words are stemmed. These words are regarded as “descriptive words”, and each sentence is scored based on the frequency of occurrence of title words in it. The score is divided by the length of the sentence and then normalized.
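The computation can be sketched as follows. The stemmer here (`crude_stem`) is a hypothetical stand-in that strips a few suffixes, not the stemmer used in the system, and normalization is assumed to mean division by the highest raw score among the sentences of the abstract.

```python
import re
from typing import List


def crude_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def title_word_scores(title: str, sentences: List[str]) -> List[float]:
    """Score each sentence by its density of stemmed title words, scaled to [0, 1]."""
    title_stems = {crude_stem(w) for w in tokenize(title)}
    raw = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        hits = sum(1 for w in tokens if crude_stem(w) in title_stems)
        # Divide by sentence length so that long sentences are not favored.
        raw.append(hits / len(tokens) if tokens else 0.0)
    top = max(raw, default=0.0)
    # Assumed normalization: divide by the best raw score in the abstract.
    return [r / top if top > 0 else 0.0 for r in raw]
```

Stop words are not filtered in this sketch, so a title word such as "in" would also count as a match; a fuller version would presumably remove them first.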
Sentence Length Feature
Kupiec et al. (1995) used sentence length as one of the features for summarization. In their implementation the feature was true if the sentence length was above a certain threshold, thereby giving less importance to very short sentences. In our system, we use a low and a high threshold to assign low scores to very short or very long sentences. Very long sentences, along with some relevant information, contain unnecessary information (“noise”), and we argue that they should also be given a low score. This helps us select short sentences in which “noise” is minimal and which are thus more informative to the user. It also helps in the second phase, the re-ranking step, by allowing more relevant and novel sentences to be selected.
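A minimal sketch of the two-threshold variant is given below. The threshold values and the binary 1.0/0.0 scoring are illustrative assumptions; the actual thresholds and the exact low score are not stated in the text.

```python
def length_score(sentence: str, low: int = 8, high: int = 40) -> float:
    """Return 1.0 for sentences whose word count lies within [low, high], else 0.0.

    `low` and `high` are assumed values chosen only for illustration.
    """
    n_words = len(sentence.split())
    return 1.0 if low <= n_words <= high else 0.0
```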
Frequency Based Feature
This feature assigns a score between 0 and 1 to the sentence, indicating the presence of “descriptive words” in the sentence. Most of the early work in summarization used term frequency and its variations to identify the most descriptive words of a document. Term Frequency*Inverse Document Frequency (TF*IDF) has been used in the field of Information Retrieval (Salton and Buckley, 1988; Jones, 1972) as a measure for identifying “descriptive words” in a document.
We use eGIFT's (Tudor et al., 2010) iTerm scores, a variant of TF*IDF weights, to extract “descriptive words” from the set of abstracts relating to a gene. eGIFT automatically computes informative terms (iTerms) and associates them with a gene based on frequency information from the set of abstracts returned by the eGRAB module, which is called the About Set for the gene. It assigns scores to unigrams and bigrams, excluding stop-words, as well as to a set of biomedical terms matched in the text that we extracted from different knowledge bases, including EntrezGene, Gene Ontology, NCBI Taxonomy, UMLS, and MeSH. The terms are converted to base form for scoring purposes. Each term is assigned a score depending on its frequency in the About Set, contrasted with its frequency in the Background Set. The Background Set is the set of all abstracts in the biomedical database. For each term t, a score s(t) is assigned as follows:
\[
s(t) = \left( \frac{df_a(t)}{N_a} - \frac{df_b(t)}{N_b} \right) \cdot \ln\left( \frac{N_b}{df_b(t)} \right)
\]
where df_a(t) and df_b(t) are the number of abstracts containing term t in the About Set for the gene and in the Background Set, respectively, and N_a and N_b are the total numbers of abstracts in these two sets. The difference between the normalized document frequencies, df_a(t)/N_a − df_b(t)/N_b, rewards terms occurring more frequently in the About Set, and ln(N_b/df_b(t)) penalizes terms that are very frequent across all documents. An important thing to note is that eGIFT considers document frequency as opposed to term frequency in a specific document. This is because iTerms are “descriptive terms” across a set of abstracts rather than a single document, which yields a better measure of the relevance of a term to a gene. Given the score for each term, a set of top-ranking informative terms, or iTerms, is computed for the gene.
We score each sentence in the About Set of a gene by considering the occurrences of the iTerms and their scores. The final score is divided by the number of words in the sentence and normalized.
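The term score s(t) and the sentence-level score can be sketched as follows. The function names are hypothetical, only unigram iTerms are handled, terms absent from the Background Set are given a score of 0 by assumption, and normalization is assumed to mean division by the highest raw sentence score.

```python
import math
from typing import Dict, List


def iterm_score(df_about: int, n_about: int,
                df_background: int, n_background: int) -> float:
    """s(t) = (df_a/N_a - df_b/N_b) * ln(N_b/df_b), following the definition above."""
    if df_background == 0:
        return 0.0  # assumption: the logarithm is undefined, so the term is skipped
    return ((df_about / n_about - df_background / n_background)
            * math.log(n_background / df_background))


def sentence_iterm_scores(sentences: List[List[str]],
                          iterms: Dict[str, float]) -> List[float]:
    """Sum the iTerm scores in each tokenized sentence, divide by length, scale to [0, 1]."""
    raw = []
    for tokens in sentences:
        total = sum(iterms.get(token, 0.0) for token in tokens)
        raw.append(total / len(tokens) if tokens else 0.0)
    top = max(raw, default=0.0)
    return [r / top if top > 0 else 0.0 for r in raw]
```

A term that appears in half of the About Set but in only 0.1% of the Background Set receives a high score, whereas a term that is equally frequent in both sets scores zero because the difference of normalized document frequencies vanishes.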
Gene Feature
The abstracts returned by the eGRAB module are related to the gene whose summary is to be extracted. A sentence in these abstracts may or may not contain the gene name, and its presence can be used as an indicator of the sentence's importance. This feature assigns a score of 1 to sentences which contain the gene name and 0 otherwise, boosting the score of sentences that mention the gene. A gene is referred to in the biomedical literature by several names and abbreviations. For example, SMAD2 has variations such as Smad family member 2, smad-2, madr2 and xsmad2. eGIFT provides APIs which, given a gene identifier, return all the variations of the gene name. It uses official gene names provided by Entrez Gene (Maglott et al., 2005), synonyms, and word sense disambiguation techniques to return the different variations.
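A minimal sketch of the binary gene feature follows. It assumes the name variations have already been obtained (the eGIFT API itself is not shown or reproduced), and it uses simple case-insensitive matching with alphanumeric boundaries, which is an illustrative choice rather than the matching used in the system.

```python
import re
from typing import Iterable

# Variations of SMAD2 listed in the text, used here as example input.
SMAD2_VARIANTS = ["SMAD2", "Smad family member 2", "smad-2", "madr2", "xsmad2"]


def gene_feature(sentence: str, name_variants: Iterable[str]) -> int:
    """Return 1 if any name variant occurs in the sentence as a whole token, else 0."""
    for variant in name_variants:
        # Lookarounds prevent matches inside longer alphanumeric strings (e.g. "smad20").
        pattern = r"(?<![A-Za-z0-9])" + re.escape(variant) + r"(?![A-Za-z0-9])"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            return 1
    return 0
```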
Biological Cue Phrase Feature
This feature assigns a score between 0 and 1 depending upon the presence of certain phrases in the sentence. This approach is based on the observation that certain phrases in a document indicate sentence importance. Authors of technical documents follow certain writing styles, using certain phrases to indicate important relations between different entities in text. These writing styles are domain dependent, and identifying them requires study of the documents. We argue that some phrases are stronger indicators of sentence importance than others because they convey very strong relations between the entities in text.
EntrezGene (Maglott et al., 2005) contains manually created summaries for some genes. We did a preliminary study of the human-written summaries from Entrez in order to understand what types of information are typically conveyed in a summary. We identified several aspects which are covered in almost every summary:
• ATTRIBUTE: The different properties/attributes associated with a gene.
• FAMILY: The gene family the gene belongs to.
• FUNCTION: The various biological functions or processes the gene is involved in.
• DOMAIN: The domains the gene contains.
• INTERACTION: The interaction of this gene with other genes or proteins.
• DISEASE: Diseases caused by this gene.
These aspects were found to span multiple sentences, and a single sentence may mention several different aspects. For the purposes of this paper we explored the first three aspects. In the following paragraphs we examine these three aspects in some detail and discuss the biological phrases associated with each.
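Before turning to the individual aspects, the general scoring idea can be sketched in code. The cue phrases are taken from the aspect descriptions and example fragments in this section; the maximum gene-to-phrase gaps, the linear decay with distance, the division by the number of aspects and all function names are illustrative assumptions, and appositives and relative clauses are not handled.

```python
from typing import Dict, List, Optional, Tuple

# Aspect -> (cue phrases, assumed maximum number of tokens allowed between
# the gene mention and the phrase). Only the phrases come from the text.
ASPECT_PATTERNS: Dict[str, Tuple[List[str], int]] = {
    "ATTRIBUTE": (["is a"], 2),
    "FAMILY": (["belongs to", "member of"], 6),
    "FUNCTION": (["involved in", "required for", "implicated in", "play a role in"], 15),
}


def find_phrase(tokens: List[str], phrase: List[str], start: int) -> Optional[int]:
    """Index of the first occurrence of `phrase` in `tokens` at or after `start`."""
    for i in range(start, len(tokens) - len(phrase) + 1):
        if tokens[i:i + len(phrase)] == phrase:
            return i
    return None


def cue_phrase_score(sentence: str, gene: str) -> float:
    """Score in [0, 1]: per-aspect scores decay with gene-to-phrase distance, then average."""
    tokens = sentence.lower().split()
    if gene.lower() not in tokens:
        return 0.0  # the sentence must also contain the gene name
    gene_idx = tokens.index(gene.lower())
    total = 0.0
    for phrases, max_gap in ASPECT_PATTERNS.values():
        best = 0.0
        for phrase in phrases:
            pos = find_phrase(tokens, phrase.split(), gene_idx + 1)
            if pos is None:
                continue
            gap = pos - gene_idx - 1  # tokens between gene mention and phrase
            if gap <= max_gap:
                best = max(best, 1.0 - gap / (max_gap + 1))
        total += best
    return total / len(ASPECT_PATTERNS)
```

The per-aspect gap limits encode the requirement that the gene be close to the phrase for ATTRIBUTE and FAMILY while allowing a longer distance for FUNCTION; this sketch matches only single-token gene names.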
ATTRIBUTE: A gene typically has some well-known properties which need to be captured in a summary. These are typically isA relations between a gene and a noun phrase. For example, sentence fragments like “.. groucho proteins are transcriptional corepressors ..” and “.. groucho homolog tle-4 , a corepressor ..” both indicate that the gene groucho is a corepressor. Thus for this aspect we look for phrases like “is a”, appositives and relative clauses. The pattern should be immediately preceded by the gene in question for this feature to be considered.
FAMILY: Almost every gene belongs to a family of genes which share certain common characteristics. Including the family information helps biologists ascertain certain important attributes of the gene. For example, sentence fragments like “The Drosophila Groucho (Gro) protein is the defining member of a family of metazoan corepressors ..” and “Groucho (Gro) is the founding member of a family of transcriptional co-repressor..” indicate that groucho belongs to a family of genes which are corepressors. For this aspect we look for phrases like “belongs to” and “member of”. As with the ATTRIBUTE patterns, this pattern should be immediately preceded by the gene in question for this feature to be considered.
FUNCTION: Most of the sentences in the human-written summaries contain this aspect. They indicate the different biological processes and functions the gene is involved in or required for. These are typically mentioned together with other aspects, for example following an INTERACTION aspect. Identifying the different functions of a gene is very important, and sentences which mention such relations should be included in a summary. From the following sentence fragments we can easily determine that groucho is related to biological functions such as notch signaling, segmentation and neural development. Examples: “Groucho is a transcriptional repressor implicated in notch signaling..”, “.. Groucho .. involved in neural development and segmentation in drosophila”, “Groucho is required for Drosophila neurogenesis, segmentation..” and “that Gro/TLE proteins play a role in the repression of target genes”. We look for the phrases highlighted in the above sentences (e.g., “involved in” and “required for”) when assigning this bio-feature. The gene need not immediately precede the pattern for this aspect, but the further the gene is from the phrase, the lower the score.
Each sentence in the About Set for a gene is searched for the mentioned patterns. The sentence should also contain the gene name. The “lexical distance” between the gene mention and the pattern/phrase is considered while assigning the score for this feature. The distance should be small for the FAMILY and ATTRIBUTE aspects, and may be longer for the FUNCTION aspect. The scores for each bio-feature in a sentence are added and the result is normalized.
2.2 Re-Ranking based on Novelty
A gene summary should contain as much diverse information as possible, thereby reducing the redundancy of information, while maintaining maximal relevance to the gene. As the number of abstracts in the About Set for a gene is very large, sentences extracted based only on feature scores may contain a high amount of redundant information. Hence redundant sentences should not be selected when producing the final summary.
The main intuition behind this method is based on Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998). A sentence which is “similar” to a sentence already selected should be penalized. A weighted combination of the “feature score” and the “novelty score” is used to select sentences that are maximally diverse and maximally relevant to the gene.
Algorithm 1 provides the pseudo-code for the re-ranking system. Our re-ranking system takes as input the set of ranked sentences returned by the feature-based method discussed in Section 2.1. For every selected sentence a set of important terms is computed. These include GO terms and UniProtKB keywords. The Gene Ontology (GO) project is a major bioinformatics initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases. The project provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data from GO Consortium members. The ontology covers three domains: cellular component, molecular function and biological process. The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. UniProtKB gene entries are tagged with keywords relating to the gene.
Instead of considering and minimizing the similarity between two sentences as in MMR, we compute a “novel score” for each sentence. When a sentence is selected, its GO terms and UniProt keywords are added to the set selectedTerms. The novel score for a sentence is assigned based on the number of new GO terms and UniProt keywords that are contained in the sentence. The final score is a weighted combination controlled by a user-tunable parameter λ. The sentence with the highest final score is added to the set of re-ranked sentences and deleted from the original ranking. Finally, the GO terms and UniProt keywords of that sentence are added to the set selectedTerms. A λ value closer to 1 yields a relevance-based ranking, while a λ value closer to 0 yields a novelty-based ranking. When the initial ranked set of sentences is empty, the algorithm stops and yields a new ranking of sentences.

Input: Set of ranked sentences D
Tuning parameter: λ
Output: Set of re-ranked sentences R
selectedTerms ← empty;
R ← empty;
while D is not empty do
    foreach sentence s in the set D do
        fScore_s ← feature score for s;
        extract GO Terms for s;
        extract UniProt Keywords for s;
        add extracted terms to currTerms_s;
        newTerms_s ← diff(currTerms_s, selectedTerms);
        nScore_s ← novelScore(newTerms_s);
        score_s ← λ ∗ fScore_s + (1 − λ) ∗ nScore_s;
    end
    determine sentence s′ for which score_s′ is max;
    delete s′ from D;
    add s′ to R;
    add newTerms_s′ to selectedTerms;
end
return R;
Algorithm 1: Novelty Based Re-Rank

Table 1: Feature-Based Ranking: Summary Phrase Matches
Gene     System A   System B   Improvement
SMAD2    3          5          66.7%
VPS35    2          3          50.0%
BRI1     3          2          -33.3%
BAG3     0          3          NA
LTBP2    2          3          50.0%
KAT2A    2          3          50.0%
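The re-ranking loop of Algorithm 1 can be sketched in Python as follows. The sketch assumes that the GO terms and UniProt keywords of each sentence have already been extracted into a set per sentence; the form of `novel_score` (the count of new terms capped and scaled to [0, 1]) and the cap of 5 are illustrative assumptions, since the text only states that the score depends on the number of new terms.

```python
from typing import Dict, List, Optional, Set


def novel_score(new_terms: Set[str], cap: int = 5) -> float:
    """Assumed novelty score: number of unseen terms, capped at `cap`, scaled to [0, 1]."""
    return min(len(new_terms), cap) / cap


def rerank(feature_scores: Dict[str, float],
           terms: Dict[str, Set[str]],
           lam: float) -> List[str]:
    """Greedy re-ranking: repeatedly pick the sentence maximizing
    lam * feature score + (1 - lam) * novelty score."""
    remaining = dict(feature_scores)      # D in Algorithm 1
    selected_terms: Set[str] = set()      # selectedTerms
    reranked: List[str] = []              # R
    while remaining:
        best_sentence: Optional[str] = None
        best_score = float("-inf")
        best_new: Set[str] = set()
        for sentence, f_score in remaining.items():
            new_terms = terms.get(sentence, set()) - selected_terms
            score = lam * f_score + (1 - lam) * novel_score(new_terms)
            if score > best_score:
                best_sentence, best_score, best_new = sentence, score, new_terms
        assert best_sentence is not None  # remaining is non-empty here
        del remaining[best_sentence]
        reranked.append(best_sentence)
        selected_terms |= best_new
    return reranked
```

With `lam = 1` the output reproduces the feature-based ranking, and with `lam = 0` a sentence whose terms have all been covered by earlier selections drops to the end, which is the behavior described for the two extremes of λ.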
3 Results
In this section we present the results of our evaluation. We used six genes for evaluation purposes. The EntrezGene summaries for these genes were used as the gold set. We measured the number of phrases in the extracted sentences which matched phrases in the gold summary. While matching phrases we also considered the relation between the phrase and the gene. A phrase in an extracted summary sentence was said to match if it matched a phrase in the gold set and had the same relation with the gene as in the gold set. For example, for the gene kat2a a summary sentence is: “KAT2A, or GCN5, is a histone acetyltransferase (HAT) that functions primarily as a transcriptional activator.” The re-ranking system with λ = 0 extracted the following sentence: “histone acetyltransferases ( hats ) such as gcn5 play a role in transcriptional activation .” The phrase “transcriptional activation” is marked as matched because it has the same relation with the gene, i.e. the same function.
Figure 1 shows the matching phrases for the gene smad2 in the summary extracted by the feature-based system. System A refers to the output generated by using only generic features, while System B refers to the output generated by adding the bio-domain-specific features. The matched phrases are shown as bold text. Figure 2 shows matching phrases in the summary extracted by the re-ranking system with λ = 0, 0.3 and 0.7. A λ value closer to 0 gives more importance to “information novelty”. Table 1 shows the comparison between System A and System B with respect to the number of phrase matches each system achieved. The last column indicates the improvement of System B over System A, i.e. the improvement after adding bio-domain-specific features. The results indicate that adding domain-specific features increases the phrase matches for five of the six genes and thus improves the summary content. Table 2 shows the number of matched phrases for the re-ranking system over different values of λ. The first column, with λ = 1, is the same as System B in Table 1. In the evaluation of the re-ranking system we used only the set of ranked sentences returned by System B. The results indicate that a λ value closer to 0 yields the best results for most of the genes. For example, for the gene bri1 the set of summary sentences “BRI1 ligand is brassinolide which binds at the extracellular domain. Binding results in phosphorylation of the kinase domain which activates the BRI1 protein leading to BR responses.” is accurately captured by the re-ranking system (with λ = 0) in the sentence “brassinosteroids ( brs ) bind to the extracellular domain of the receptor kinase bri1 to activate a signal transduction cascade that regulates nuclear gene expression and plant development.” A similar example occurs for the gene smad2 with the extracted sentence “activated tbetari phosphorylates smad2 , which then heterodimerizes with smad4 , translocates into the nucleus , and subsequently effects gene transcription .”, which captures a set of summary sentences (see Figure 1).

Table 2: Novelty-Based Re-ranking: Summary Phrase Matches
Gene     λ = 1   λ = 0.9   λ = 0.7   λ = 0.3   λ = 0   Max Improvement over System B
SMAD2    5       5         5         5         7       20.0%
VPS35    3       3         4         3         2       33.3%
BRI1     2       2         2         3         4       50.0%
BAG3     3       3         3         3         2       0.0%
LTBP2    3       2         3         4         4       33.3%
KAT2A    3       2         2         2         2       0.0%

4 Conclusion
We combine generic features for computing sentence importance with certain biomedical domain-specific features such as the presence of the gene name and biological cue phrases. We also use GO terms and UniProt keywords as a “novelty measure” to re-rank sentences and remove “information redundancy”. Our evaluation suggests that the system augmented with biomedical features and “redundancy removal” extracts much more informative summaries. One of the problems of these extractive approaches is the presence of noise in addition to relevant information in the extracted sentences. For example, consider an extracted summary sentence for smad2: “second , the role of smad 2 , an intracellular mediator of activin and tgf-beta , in oocyte maturation was investigated”. Only the fragment “smad 2 , an intracellular mediator of activin and tgf-beta” is relevant, and there is no need to include the entire sentence. In future work, we hope that the biological relation patterns discussed in Section 2.1 will help us identify only the “relevant” portions of a sentence. These patterns will help us create an intermediate representation of the set of sentences, such as “smad2 [isA] intracellular mediator OF(activin)”. Instead of just extracting representative sentences from the About Set, these relations will help us generate phrases and move toward abstractive summarization. We could combine different relations in a single sentence depending on certain causal links, such as an INTERACTION aspect followed by a FUNCTION aspect.

References
[Carbonell and Goldstein1998] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336. ACM.
[Edmundson1969] H. P. Edmundson. 1969. New methods in automatic extracting. J. ACM, 16(2):264–285, April.
[Jones1972] Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21.
[Kupiec et al.1995] Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '95, pages 68–73, New York, NY, USA. ACM.
[Maglott et al.2005] Donna Maglott, Jim Ostell, Kim D. Pruitt, and Tatiana Tatusova. 2005. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research, 33(suppl 1):D54–D58.
[McEntyre and Lipman2001] Johanna McEntyre and David Lipman. 2001. PubMed: bridging the information gap. Canadian Medical Association Journal, 164(9):1317–1319.
[Radev et al.2002] Dragomir R. Radev, Eduard Hovy, and Kathleen McKeown. 2002. Introduction to the special issue on summarization. Comput. Linguist., 28(4):399–408, December.
[Salton and Buckley1988] Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523.
[Tudor et al.2010] Catalina O Tudor, Carl J Schmidt, and K Vijay-Shanker. 2010. eGIFT: Mining gene information from the literature. BMC Bioinformatics, 11(1):418.
Gene : SMAD2
Entrez Summary: The protein encoded by this gene belongs to the SMAD, a family of proteins similar to the gene products of the
Drosophila gene 'mothers against decapentaplegic' (Mad) and the C. elegans gene Sma. SMAD proteins are signal transducers and
transcriptional modulators that mediate multiple signaling pathways. This protein mediates the signal of the transforming growth factor
(TGF)-beta, and thus regulates multiple cellular processes, such as cell proliferation, apoptosis, and differentiation. This protein is recruited
to the TGF-beta receptors through its interaction with the SMAD anchor for receptor activation (SARA) protein. In response to TGF-beta
signal, this protein is phosphorylated by the TGF-beta receptors. The phosphorylation induces the dissociation of this protein with SARA
and the association with the family member SMAD4. The association with SMAD4 is important for the translocation of this protein into the
nucleus, where it binds to target promoters and forms a transcription repressor complex with other cofactors. This protein can also be
phosphorylated by activin type 1 receptor kinase, and mediates the signal from the activin.
System A (without Bio-Features)
smad2 overexpression suppressed osteocalcin mrna expression in ros17/2.8 cells .
tgfbeta signaling is initiated when the type i receptor phosphorylates the mad-related protein , smad2 , on c-terminal serine residues .
mad-related genes on chromosome 18q21.1 are altered infrequently in escc .
activation of transforming growth factor-beta ( tgf-beta ) receptors triggers phosphorylation of smad2 and smad3 .
cells that lack smad2 may escape from tgf-beta-mediated growth inhibition and promote cancer progression .
phosphorylation-dependent activation of the transcription factors smad2 and smad3 plays an important role in tgfbeta-dependent signal transduction .
furthermore , we observed a strong correlation between sustained smad2 phosphorylation and resistance to tgf-beta1-mediated growth inhibition .
System B (with Bio-Features)
phosphorylation-dependent activation of the transcription factors smad2 and smad3 plays an important role in tgfbeta-dependent signal transduction .
we report that smad2 , a transcription factor activated by tgf-beta , mediates tgf-beta induction of enos in endothelial cells .
identification of smad2 , a human mad-related protein in the transforming growth factor beta signaling pathway .
conclusions : the results suggest that mutation of smad2 does not play a key role in human stomach carcinogenesis .
second , the role of smad 2 , an intracellular mediator of activin and tgf-beta , in oocyte maturation was investigated .
thus , heteromeric complex formation of smad2 with smad4 is required for nuclear translocation of smad4 .
evidence that smad2 is a tumor suppressor implicated in the control of cellular invasion .
Figure 1: Feature-Based Ranked Summaries for SMAD2 for System A and B
λ = 0.0
1. phosphorylation-dependent activation of the transcription factors smad2 and smad3 plays an important role in tgfbeta-dependent
signal transduction .
2. second , the role of smad 2 , an intracellular mediator of activin and tgf-beta , in oocyte maturation was investigated .
3. smad2 and smad3 are signalling proteins that are involved in mediating the transcriptional regulation of target genes downstream of
transforming growth factor-beta and activin receptors .
4. activated tbetari phosphorylates smad2 , which then heterodimerizes with smad4 , translocates into the nucleus , and
subsequently effects gene transcription .
5. identification of smad2 , a human mad-related protein in the transforming growth factor beta signaling pathway .
6. xmad2 , a recently identified tgf-beta signal transducer , forms a complex with the transcription factor in an activin-dependent fashion
to generate an activated are-binding complex .
7. ligation of the t cell receptor complex results in phosphorylation of smad2 in t lymphocytes .
λ = 0.3
1. phosphorylation-dependent activation of the transcription factors smad2 and smad3 plays an important role in tgfbeta-dependent
signal transduction .
2. second , the role of smad 2 , an intracellular mediator of activin and tgf-beta , in oocyte maturation was investigated .
3. smad2 and smad3 are signalling proteins that are involved in mediating the transcriptional regulation of target genes downstream of
transforming growth factor-beta and activin receptors .
4. identification of smad2 , a human mad-related protein in the transforming growth factor beta signaling pathway .
5. thus , heteromeric complex formation of smad2 with smad4 is required for nuclear translocation of smad4 .
6. ubiquitination of smad2 is a consequence of its accumulation in the nucleus .
7. xmad2 , a recently identified tgf-beta signal transducer , forms a complex with the transcription factor in an activin-dependent fashion
to generate an activated are-binding complex .
λ = 0.7
1. phosphorylation-dependent activation of the transcription factors smad2 and smad3 plays an important role in tgfbeta-dependent
signal transduction .
2. second , the role of smad 2 , an intracellular mediator of activin and tgf-beta , in oocyte maturation was investigated .
3. identification of smad2 , a human mad-related protein in the transforming growth factor beta signaling pathway .
4. thus , heteromeric complex formation of smad2 with smad4 is required for nuclear translocation of smad4 .
5. we report that smad2 , a transcription factor activated by tgf-beta , mediates tgf-beta induction of enos in endothelial cells .
6. conclusions : the results suggest that mutation of smad2 does not play a key role in human stomach carcinogenesis .
7. evidence that smad2 is a tumor suppressor implicated in the control of cellular invasion .
Figure 2: Re-Ranked Summaries for SMAD2 with λ = 0, 0.3, 0.7