Refining JST thesaurus and discussing the effectiveness
in life science research
Tatsuya Kushida1, Takeshi Masuda2, Yuka Tateisi1, Katsutaro Watanabe3, Katsuji
Matsumura1, 3, Takahiro Kawamura3, Kouji Kozaki2, and Toshihisa Takagi1, 4
1 National Bioscience Database Center, Japan Science and Technology Agency, Tokyo
{kushida,tateisi}@biosciencedbc.jp
2 The Institute of Scientific and Industrial Research, Osaka Univ., Japan
{masuda, kozaki}@ei.sanken.osaka-u.ac.jp
3 Department of Information Planning, Japan Science and Technology Agency, Tokyo
{katsutaro.watanabe, matsumur, takahiro.kawamura}@jst.go.jp
4 Dept. Biological Sciences, Grad. School of Science, The University of Tokyo, Japan
[email protected]
Abstract. We refined the JST thesaurus, which the Japan Science and Technology Agency (JST)
uses for indexing scientific literature, to create an ontology for the general science and
technology domain. Four life science experts sub-classified the RT (related term,
skos:related) relations within the life science categories of the JST thesaurus into 31 kinds
of relations, including sio:SIO_000225 (‘has function’). We created RDF for the sub-classified
relations between terms, stored it in a triple store so that SPARQL queries could be
performed, and verified the effectiveness of this approach in life science research. As a
result, we were able to discover gene products associated with thromboembolism from a number
of related concepts by reasoning within the JST thesaurus. We conclude that the JST thesaurus
can be used as a hub among other thesauri and ontologies such as MeSH and Gene Ontology,
because it contains direct relations between different levels of biological concepts, as
exemplified by the terms platelet aggregation and thromboembolism.
Keywords: Ontology, Thesaurus, Reasoning, RDF, SPARQL, Life science
1 Introduction
The Japan Science and Technology Agency (JST) operates a literature retrieval service called
J-GLOBAL [1], in which two dictionaries, the JST thesaurus and the Large Science and
Technology Dictionary (LSTD), are used for indexing scientific literature. The JST thesaurus
contains about 40,000 terms related to the science and technology fields. The LSTD contains
over 1,100,000 synonyms, spelling variants, and terms related to terms in the JST thesaurus.
All of the terms are written in both English and Japanese.
At present, the JST thesaurus contains three kinds of relations between terms, namely
“BT (broader)”, “NT (narrower)”, and “RT (related)”. In J-GLOBAL, the relations between the
terms in the JST thesaurus (the thesaurus map [2]) are used to retrieve scientific and
technological literature that has been indexed with the thesaurus terms and with other
related concepts, rather than by simple keyword search.
Furthermore, we are considering the use of machine learning to extend the information in the JST thesaurus [3] and a Resource Description Framework (RDF) representation of the JST thesaurus, which will be available through a SPARQL endpoint
called “J-GLOBAL knowledge” [4].
However, we recognize that the JST thesaurus in its present form is insufficient for
performing intelligent exploration. Because the thesaurus has only three kinds of simple
relations, we cannot rigorously and appropriately express the diverse and complicated
relations between various kinds of terms. For example, in the JST thesaurus we can discover
simple connections among gene products, chemical substances, biological phenomena, and
diseases using the BT, NT, and RT relations. However, we cannot effectively discover more
rigorous connections among them, such as relations between diseases and the preceding
biological phenomena, relations between diseases and the related gene products, relations
between disease states and the succeeding more severe ones, and relations between chemical
substances and the biological phenomena that they positively regulate.
Accordingly, we have set out to improve the JST thesaurus, develop an ontology based on it,
use the richer semantics of the ontology for information retrieval, and use the ontology as
a knowledge base for intelligent exploration.
In this study, as a test case for improving the JST thesaurus, we focused on the terms and
relations within the life sciences, and tried to develop a new ontology on the basis of the
JST thesaurus. Here we discuss the problems associated with the development of the ontology
and its effectiveness in life science research.
This paper is organized as follows. In Section 2 we describe methods for improving the JST
thesaurus and creating the RDF. In Section 3 we present the results of the sub-classification
of 2065 Related Term relationships in the JST thesaurus. In Section 4 we demonstrate examples
of reasoning using the refined JST thesaurus and discuss its effectiveness in life science
research. In Section 5 we summarize our conclusions and suggest directions for future work.
2 Refining the JST thesaurus and creating the RDF
2.1 Sub-classification of RT into 31 kinds of relations by four biologists
The JST thesaurus has 51 life science subcategories and over 10,000 life science terms
representing various levels of life science concepts, such as gene products (e.g. CLEC2),
chemical substances (e.g. caffeine), biological phenomena (e.g. platelet aggregation),
diseases (e.g. thromboembolism), and anatomy (e.g. cartilage). The terms are structured using
three kinds of relations, namely BT, NT, and RT. RT is used for describing the relations
between the various levels of terms in the JST thesaurus and LSTD (hereinafter referred to as
“JST thesaurus”). For example, the biological phenomenon “platelet aggregation” and the
disease “thromboembolism”, and “platelet aggregation” and the gene product “CLEC2”, are
linked using RT. In contrast, almost no other ontologies or thesauri have direct links
between different levels of concepts in the life science fields. MeSH [5] is a controlled
medical vocabulary thesaurus, and SNOMED-CT [6] is a clinical healthcare terminology. Neither
of them has a direct relation between platelet aggregation and thromboembolism, or between
platelet aggregation and CLEC2. We aim to provide a JST ontology that contains a number of
links between various levels of biological concepts, that can be accessed in a
machine-readable manner, that is widely used to elucidate the mechanisms of various
biological phenomena, and that has a broad impact on the development of life science
research.
On the other hand, we know that RTs in the present JST thesaurus imply various kinds of
relations such as “has part”, “has attribute”, and “has function.” This is because the JST
thesaurus was developed as a controlled vocabulary for indexing scientific literature, which
does not require the fine-grained relation types of a proper ontology. However, when we use
the JST thesaurus for interpreting experimental data in life science research, more concrete
and rigorous relations between biological concepts are required. Therefore, as described
below, we are attempting to sub-classify RT within the life science categories to give the
relations between terms more precise semantics.
Fig. 1. Sub-classifying RT using the graphical ontology editor Hozo (in Japanese language)
Using the ontology editor “Hozo” [7], four life science experts (A, B, C, and D) set out to
sub-classify 2065 RTs, representing approximately 1/6 of the 12225 RTs in the life science
categories of the JST thesaurus (excluding LSTD). Initially, three experts (A, B, and C)
attempted to sub-classify the 2065 RTs into 10 different kinds of relations, namely “has
part”, “is part of”, “has function”, “is function of”, “has attribute”, “is attribute of”,
and “antonym”, along with BT, NT, and RT. Each of the 2065 sub-classified RTs was checked by
all three experts. The work was performed using the graphical interface in Hozo: the experts
selected the RT to be sub-classified with a left mouse click, opened the popup menu with a
right click, and converted the RT to an appropriate relation chosen from the displayed list
with a left click (Fig. 1).
An RT usually comes paired with its inverse RT, for example “X” [RT] “Y” and “Y” [RT] “X”.
Hozo takes this into account as follows. When an RT is converted to an appropriate relation
such as “has function”, the inverse RT is automatically converted to the inverse relation,
such as “is function of.” RTs that were difficult to sub-classify into other relations were
kept as RTs (e.g. the RT between “proteome” and “genome”). The sub-classification results of
the three experts were merged and saved in CSV format using Hozo. The merged file contains
the sub-classification results and the experts’ comments.
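The paired conversion described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Hozo's implementation: the function name `convert`, the dictionary layout, and the choice to treat “antonym” and RT as their own inverses are assumptions made for this example.

```python
# Inverse pairs taken from the relation names used in this section.
INVERSE = {
    "has part": "is part of",
    "has function": "is function of",
    "has attribute": "is attribute of",
    "BT": "NT",
}
# Make the mapping symmetric, so that "is part of" maps back to "has part".
INVERSE.update({v: k for k, v in list(INVERSE.items())})
# Assumption for this sketch: "antonym" and "RT" are their own inverses.
INVERSE.update({"antonym": "antonym", "RT": "RT"})


def convert(relations, subject, obj, new_relation):
    """Re-label the RT from subject to obj and, in the same step, its inverse RT."""
    relations[(subject, obj)] = new_relation
    relations[(obj, subject)] = INVERSE[new_relation]
    return relations


# Both directions start out as plain RT, as in the unrefined thesaurus.
rels = {
    ("Perforin", "cytotoxicity"): "RT",
    ("cytotoxicity", "Perforin"): "RT",
}
convert(rels, "Perforin", "cytotoxicity", "has function")
for pair, relation in rels.items():
    print(pair, relation)
```

Keeping the inverse table in one place means an expert only has to decide one direction of each pair, which is the benefit the editor provides.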
Finally, the remaining life science expert (D) checked the results obtained by the
three other experts. As a result of this process, two issues were recognized. First, the
usage of some relations differed among the three experts. For example, the relation
between blood and blood corpuscle was either “has narrower”, or “has part” (See
Table 2 in Section 3). Second, it is possible and sometimes necessary to sub-classify
RT to more than 10 relations (See Table 1 in Section 3). Subsequently, the last life
science expert created a guideline which contained 31 different kinds of relations with
examples and definitions.
For example, in the guideline the relation “has function” is described as follows: “has
function” is used for the relation between an entity (including materials, cellular
components, anatomical locations, and organisms) and the related events (including molecular
functions, biochemical reactions, biological phenomena, and diseases). Usage examples are
“Glucagon-hyperglycemic action”, “phycobilisome-Photosynthesis”, “mesencephalon-visual
sense”, and “Helicobacter pylori-gastric ulcer.” The inverse property is “is function of.”
Ultimately, this expert sub-classified the 2065 RTs into 31 different kinds of relations
using the guideline. The results were reviewed and revised by the three other experts, and
all four experts finalized them. We used the web services Ontobee [8], BioPortal [9], and
Linked Open Vocabularies (LOV) [10] to look for existing ontology terms corresponding to the
sub-classified RTs, and assigned standard ontology terms to the 31 kinds of
sub-classification (see Table 3 in Section 3).
2.2 Storing the RDF data in a triple store
The sub-classification data were converted to RDF in Turtle format using the Ontology API of
Hozo, as follows.
<http://stirdf.jst.go.jp/id/201006070470021247>
<http://semanticscience.org/resource/SIO_000225>
<http://stirdf.jst.go.jp/id/200906034198027470> .
The RDF of the refined JST thesaurus (which contains about 290,000 terms from the JST
thesaurus and LSTD) was stored in a triple store (Virtuoso) so that SPARQL queries could be
performed. A public SPARQL endpoint is being prepared for release.
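As a rough illustration of the conversion step, the statement shown above can be assembled from two JST term identifiers and a relation IRI. The sketch below does not use the Ontology API of Hozo; the helper name `triple` is hypothetical, and only the two namespaces and the identifiers come from the text.

```python
# Namespaces as they appear in the statement above.
JSTID = "http://stirdf.jst.go.jp/id/"
SIO = "http://semanticscience.org/resource/"


def triple(subject_id, predicate_iri, object_id):
    """Return one Turtle/N-Triples statement linking two JST term IDs."""
    return f"<{JSTID}{subject_id}> <{predicate_iri}> <{JSTID}{object_id}> ."


# SIO_000225 is the 'has function' relation.
line = triple("201006070470021247", SIO + "SIO_000225", "200906034198027470")
print(line)
```

Writing every statement with full IRIs, one per line, keeps the output valid as both Turtle and N-Triples, which makes bulk loading into a triple store straightforward.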
3 Results of the sub-classification of RT
Table 1. Sub-classification of 2023 RTs into 10 kinds of relations by three experts

Sub-classification relation             Number (%)
[RT]                                    1752 (86.2)
[BT] or [NT]                             170 (8.4)
[has part] or [is part of]                17 (1.0)
[antonym]                                  9 (0.4)
[has attribute] or [is attribute of]       2 (0.1)
[has function] or [is function of]        73 (3.6)

Table 1 shows the results of the sub-classification conducted by the three life
science experts. Each of the relations was decided by a majority. Note that 42
relations, which were assigned differently by all three experts, are not included in the
table.
In order to evaluate the concordance of the decisions made by the three life science experts,
we calculated the reproducibility using “Fleiss' Kappa for m Raters” and “Krippendorff's
alpha” in the package “irr” of R [11]. Fleiss' kappa (Subjects = 2065, Raters = 3, z = 45.6,
p-value = 0) was 0.354, and Krippendorff's alpha (Subjects = 2065, Raters = 3) was 0.354.
The reproducibility was thus found to be low.
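For readers who want to see what the agreement statistic measures, here is a minimal Python sketch of Fleiss' kappa. It is not the R “irr” code used in the study; the function name is ours, and the input is just the four disagreement examples listed in Table 2, so the resulting value is not comparable to the 0.354 obtained on all 2065 RTs.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for subjects that were each rated by the same number of raters.

    Note: the value is undefined (division by zero) if every rating is identical.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()      # how often each category was chosen overall
    agreement_sum = 0.0     # running sum of the per-subject agreement P_i
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        agreeing_pairs = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing_pairs / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_subjects      # observed agreement
    p_e = sum((t / (n_subjects * n_raters)) ** 2 for t in totals.values())  # chance agreement
    return (p_bar - p_e) / (1 - p_e)


# Ratings by experts A, B, and C for the four term pairs in Table 2.
table2_ratings = [
    ["RT", "NT", "RT"],
    ["has part", "RT", "NT"],
    ["RT", "RT", "has part"],
    ["RT", "has function", "has function"],
]
kappa = fleiss_kappa(table2_ratings)
print(round(kappa, 3))
```

Kappa compares the observed pairwise agreement with the agreement expected if raters chose categories at random with the observed overall frequencies, so a heavily used category such as RT raises the chance level considerably.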
We did not adopt the results shown in Table 1 as the final decision of the RT subclassification, because the usage of some relations differed among the three experts,
such as between “has narrower”, and “has part”, and between “RT” and “has
function” (Table 2).
Table 2. Examples of relations that differed among the three life science experts

Terms (concepts)                               Expert A    Expert B        Expert C
“cell adhesion receptor” and “CD18 antigen”    RT          NT              RT
“blood” and “blood corpuscle”                  has part    RT              NT
“chemoreceptor” and “binding site”             RT          RT              has part
“Perforin” and “cytotoxicity”                  RT          has function    has function

Bold and underlined terms represent relations which we ultimately judge as
appropriate.
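The majority rule used for Table 1 can be illustrated with the four rows of Table 2. The sketch below is an assumption about how such a rule could be coded, with a hypothetical helper name; a pair on which all three experts disagree yields no decision, which corresponds to the 42 relations left out of Table 1.

```python
from collections import Counter


def majority(labels):
    """Return the label chosen by more than half of the raters, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


# Decisions of experts A, B, and C, copied from Table 2.
table2 = {
    ("cell adhesion receptor", "CD18 antigen"): ["RT", "NT", "RT"],
    ("blood", "blood corpuscle"): ["has part", "RT", "NT"],
    ("chemoreceptor", "binding site"): ["RT", "RT", "has part"],
    ("Perforin", "cytotoxicity"): ["RT", "has function", "has function"],
}
decided = {pair: majority(labels) for pair, labels in table2.items()}
for pair, relation in decided.items():
    print(pair, "->", relation)
```

A majority decision is not necessarily the relation finally judged appropriate, which is one reason the Table 1 results were not adopted as final.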
Moreover, we confirmed that a number of the relations decided as RT could be sub-classified
into other relations. Based on this information, we prepared a guideline to sub-classify the
RTs effectively and appropriately.
Table 3. Results of the sub-classification of 2065 related terms (RTs) into 31 kinds of
relations following the guideline

Assigned ontology                         Number (%)
skos:related (RT)                         924 (44.5)
skos:broader (has broader)                187 (9.1)
skos:narrower (has narrower)              180 (8.7)
sio:SIO_000028 (has part)                 103 (5.0)
sio:SIO_000068 (is part of)               102 (4.9)
sio:SIO_000123 (antonym)                   12 (0.5)
sio:SIO_000008 (has attribute)              3 (0.1)
sio:SIO_000011 (is attribute of)            3 (0.1)
sio:SIO_000225 (has function)             122 (6.0)
sio:SIO_000226 (is function of)           121 (5.9)
sio:SIO_000122 (synonym)                   20 (1.0)
sio:SIO_000203 (is connected to)           46 (2.2)
xkos:precedes                              14 (0.7)
xkos:succeeds                              14 (0.7)
sio:SIO_000228 (has role)                  69 (3.3)
sio:SIO_000227 (is role of)                69 (3.3)
sio:SIO_001279 (has phenotype)             27 (1.3)
nbdc:isPhenotypeOf                         27 (1.3)
obo:RO_0002234 (has output)                 1 (0.0)
obo:RO_0002353 (product of)                 2 (0.1)
sio:SIO_000001 (is related to)              6 (0.2)
sio:SIO_000364 (has creator)                1 (0.0)
sio:SIO_000365 (is creator of)              1 (0.0)
sio:SIO_000066 (has provider)               1 (0.0)
sio:SIO_000064 (is provider of)             1 (0.0)
sio:SIO_000655 (transforms into)            5 (0.2)
sio:SIO_000657 (is transformed from)        5 (0.2)
sio:SIO_000061 (is located in)              0 (0.0)
sio:SIO_000145 (is location of)             0 (0.0)
sio:SIO_001154 (regulates)                  0 (0.0)
sio:SIO_001155 (is regulated by)            0 (0.0)

skos: <http://www.w3.org/2004/02/skos/core#>
sio: <http://semanticscience.org/resource/>
xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
nbdc: <http://purl.jp/2/nbdc/ontology/>
obo: <http://purl.obolibrary.org/obo/>
Table 3 shows the results after re-classification following the guideline. It shows that
1141 RTs (1141/2065 ≈ 55.3%) could be successfully sub-classified into the 31 kinds of
relations, such as “has function”, while the remaining 924 (924/2065 ≈ 44.7%) could not be
sub-classified and remained as RT (skos:related). We assigned standard ontology terms to 30
of the relation types, and a term of our own to the remaining one (nbdc:isPhenotypeOf). This
means that we have also succeeded in adding more appropriate and detailed semantics to the
1141 sub-classified relations. For example, we have given 91 kinds of functions such as
“apoptosis” to 121 terms such as “BAX protein”, and 41 kinds of roles such as
“anti-Parkinson drug” to 68 terms such as “ergot alkaloid.”
Examples of the RT sub-classification are shown below.

Before sub-classification             After sub-classification

Example 1
ABC transporter                       ABC transporter
  [RT] Biological Transport      →      [has function] Biological Transport
  [RT] P glycoprotein            →      [narrower] P glycoprotein

Example 2
spliceosome                           spliceosome
  [RT] splicing factor           →      [has part] splicing factor
splicing factor                       splicing factor
  [RT] RNA splicing              →      [has function] RNA splicing

4 Discussion
4.1 Reasoning using “is a”, and “has part” relations
Fig. 2. Examples of reasoning using skos:narrower, and sio:SIO_000028 (has part).
In this section, we demonstrate examples of reasoning using the refined JST thesaurus and
discuss its effectiveness in life science research, which becomes possible as a result of the
sub-classification of RT. Fig. 2 shows two examples of reasoning using skos:narrower and
sio:SIO_000028 (has part). Example 1 shows that a narrower term inherits the function
possessed by its broader term. The JST thesaurus term “P glycoprotein” inherits the function
“Biological Transport” from its broader term “ABC Transporter”. Using this ‘inheritance’
approach, we can infer that 916 terms such as “Alcadein” imply 74 kinds of functions such as
“Alzheimer Disease”. In addition, using the same approach, we can infer that 576 terms such
as “Nicotinic Agonists” imply 36 kinds of roles such as “anti-Alzheimer’s disease drug”.
Example 2 shows how a function of term “A” can be inferred from a function of term “B”,
which is a part of term “A”. The JST thesaurus term “spliceosome” has the same function “RNA
splicing” as its part, the “splicing factor”. Using this whole-part structure, we can infer
that five terms including “spliceosome” imply four kinds of functions. For example, it is
inferred that “NK cell” and “cytotoxic T cell” imply the function “cytotoxicity”, “Pharynx”
implies the function “vocalization”, and “pharmaceutical preparation” implies the function
“pharmacological action.”
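The two reasoning patterns can be written down as simple rules over triples. The following Python sketch applies them to the four relations of Examples 1 and 2; it is an illustration only, with plain strings instead of IRIs, a hypothetical function name, and the assumption that the rules are applied repeatedly until nothing new is derived.

```python
def infer_functions(triples):
    """Derive new 'has function' facts from 'narrower' and 'has part' relations."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(facts):
            if p == "narrower":
                # Rule 1: the narrower term o inherits every function of the broader term s.
                new = {(o, "has function", f)
                       for x, q, f in facts if x == s and q == "has function"}
            elif p == "has part":
                # Rule 2: the whole s acquires every function of its part o.
                new = {(s, "has function", f)
                       for x, q, f in facts if x == o and q == "has function"}
            else:
                continue
            if not new <= facts:
                facts |= new
                changed = True
    return facts - set(triples)   # only the newly inferred facts


asserted = {
    ("ABC transporter", "has function", "Biological Transport"),
    ("ABC transporter", "narrower", "P glycoprotein"),
    ("spliceosome", "has part", "splicing factor"),
    ("splicing factor", "has function", "RNA splicing"),
}
inferred = infer_functions(asserted)
for fact in sorted(inferred):
    print(fact)
```

Neither rule could be applied to the unrefined thesaurus, because a plain RT link does not say which term is the whole, the part, or the function.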
4.2 Discovery of gene products associated with disease and other biological phenomena
Fig. 3. Relations among terms related to thromboembolism in the refined JST thesaurus
Fig. 3 shows a graph of the relations among terms related to thromboembolism in the refined
JST thesaurus. The following SPARQL query finds gene products that ‘have a function of’
(sio:SIO_000225) thromboembolism (jstid:200906035454027960), or that ‘have a function of’
any biological phenomenon that precedes (xkos:precedes) thromboembolism.
PREFIX jstid: <http://stirdf.jst.go.jp/id/>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX ndl: <http://ndl.go.jp/dcndl/terms/>
PREFIX xl: <http://www.w3.org/2008/05/skos-xl#>
SELECT DISTINCT ?preevent ?enlabel_preevent ?gene ?enlabel_gene
WHERE
{
  # Gene products linked to thromboembolism in one step
  { ?gene sio:SIO_000225 jstid:200906035454027960 .
    OPTIONAL
    { ?gene xl:prefLabel ?label_gene .
      ?label_gene ndl:transcription ?enlabel_gene . }
  }
  UNION
  # Gene products linked to thromboembolism in two steps,
  # via a biological phenomenon (?preevent) that precedes it
  { ?gene sio:SIO_000225 ?preevent .
    ?preevent xkos:precedes jstid:200906035454027960 .
    OPTIONAL
    { ?gene xl:prefLabel ?label_gene .
      ?label_gene ndl:transcription ?enlabel_gene .
      ?preevent xl:prefLabel ?label_preevent .
      ?label_preevent ndl:transcription ?enlabel_preevent . }
  }
}
When this SPARQL query was performed, CLEC2 (CLEC1B (uniprot:Q9P126)) [12] was the only
result. This is because platelet aggregation is a biological phenomenon that precedes
thromboembolism, and CLEC2 is a gene product that functions in platelet aggregation. We
hypothesize that information about gene products that function in the biological pre-events
of a disease provides valuable knowledge about the prevention of the disease itself. When the
same SPARQL query was performed on the unrefined JST thesaurus (i.e. a graph consisting only
of skos:related relations among platelet aggregation, thromboembolism, and CLEC2), we
discovered 21 concepts, including ones only distantly related to thromboembolism such as the
PRKCH gene. This example suggests that queries using sub-classified relations such as
sio:SIO_000225 (has function) and xkos:precedes can be expected to have a lower false
positive rate.
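The effect of typed relations on the result set can be reproduced on a toy graph without a triple store. The Python sketch below mimics the one-step and two-step patterns of the query over three relations taken from this paper; the function names and the in-memory representation are assumptions for illustration, and the toy graph is far smaller than the real data that yielded 21 concepts.

```python
THROMBO = "thromboembolism"

# Three relations from the refined thesaurus, written as plain strings.
refined = {
    ("CLEC2", "has function", "platelet aggregation"),
    ("platelet aggregation", "precedes", THROMBO),
    (THROMBO, "related", "cerebral infarction"),
}


def genes_for(disease, triples, function="has function", precedes="precedes"):
    """One-step and two-step matches, mirroring the two UNION branches of the query."""
    one_step = {s for s, p, o in triples if p == function and o == disease}
    pre_events = {s for s, p, o in triples if p == precedes and o == disease}
    two_step = {s for s, p, o in triples if p == function and o in pre_events}
    return one_step | two_step


def as_plain_rt(triples):
    """Drop the relation types: every link becomes a symmetric 'related' pair."""
    plain = set()
    for s, _p, o in triples:
        plain.add((s, "related", o))
        plain.add((o, "related", s))
    return plain


typed_hits = genes_for(THROMBO, refined)
untyped_hits = genes_for(THROMBO, as_plain_rt(refined),
                         function="related", precedes="related")
print(sorted(typed_hits))
print(sorted(untyped_hits))
```

With typed relations only the gene product is returned; once every link is a plain RT, the same two-step pattern also returns the disease, its pre-event, and a related disease, none of which are gene products.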
4.3 Utilizing the information of biological processes and gene products in Gene Ontology
We recognize that the JST thesaurus alone is not sufficient for purposes other than indexing
scientific literature with a controlled vocabulary. To interpret experimental data in life
science research, the quantity of information in the JST thesaurus needs to be increased. We
think it would be effective to integrate the JST thesaurus with other ontologies and
thesauri. For example, Gene Ontology [13] contains a class “platelet aggregation”
(obo:GO_0070527) that is a biological process. The GO term “platelet aggregation” carries
information about related gene products, but lacks any disease-related information.
On the other hand, the JST thesaurus contains various levels of concepts and the relations
between them, such as a biological process “platelet aggregation” and a disease
“thromboembolism.” We incorporated the information about gene products that act in platelet
aggregation, based on Gene Ontology, into the RDF data of the refined JST thesaurus, and
discovered 60 human gene products (e.g. RAP2B (uniprot:P61225)) related to thromboembolism
using the same SPARQL query as for the refined JST thesaurus.
Furthermore, we confirmed that the information on several biological processes (e.g.
fibrinolysis (obo:GO_0042730)) and their associated gene products in Gene Ontology can be
integrated with the corresponding biological processes (e.g. fibrinolysis
(jstid:200906057747871335)) that precede diseases (e.g. fibrinolytic purpura
(jstid:200906056051568500)) in the JST thesaurus, and that gene products (e.g. 27 human gene
products associated with “fibrinolytic purpura”, including PLAT (uniprot:P00750)) can then
be discovered with a SPARQL query.
4.4 Related work: Comparing JST thesaurus, Gene Ontology, and DisGeNET
The database “DisGeNET” [14] contains gene-disease associations that are collected by expert
human curation and text mining from several public data sources and the literature. Using
its SPARQL endpoint [15], we found 114 gene products (e.g. F5 (uniprot:P12259)) that are
associated with thromboembolism.
Fig. 4 shows the number of gene products associated with thromboembolism discovered using
the JST thesaurus, Gene Ontology, and DisGeNET, respectively. Eight genes (e.g. GAS6
(uniprot:Q14393)) were found in both Gene Ontology and DisGeNET; however, there were no
overlapping genes between the JST thesaurus and DisGeNET, or between the JST thesaurus and
Gene Ontology. From this, we conclude that it is more appropriate to mash up these resources
to discover the gene products associated with thromboembolism than to use each of them
separately.
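The comparison behind Fig. 4 is a set comparison. The Python sketch below shows the calculation on a tiny excerpt that contains only the gene symbols named in this paper; the real result sets hold 1, 60, and 114 gene products, so the sizes printed here are not those of Fig. 4, and the variable names are ours.

```python
from itertools import combinations

# Excerpt only: the gene symbols mentioned in the text for each resource.
found = {
    "JST thesaurus": {"CLEC1B"},
    "Gene Ontology": {"RAP2B", "GAS6"},
    "DisGeNET": {"F5", "GAS6"},
}

# Pairwise overlaps between the resources.
overlaps = {(a, b): found[a] & found[b] for a, b in combinations(found, 2)}
# What a mash-up of all three resources would return.
combined = set().union(*found.values())

for pair, shared in overlaps.items():
    print(pair, sorted(shared))
print(len(combined), sorted(combined))
```

Small or empty overlaps mean that each resource contributes gene products the others miss, which is the argument for querying them together.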
Fig. 4. The number of genes/gene products associated with thromboembolism discovered using
the refined JST thesaurus, Gene Ontology (see footnote 1), and DisGeNET
On the other hand, as previously mentioned, 27 human gene products associated with
“fibrinolytic purpura” (jstid:200906056051568500), including PLAT (uniprot:P00750), were
discovered by using the information of Gene Ontology integrated with that of the JST
thesaurus. In contrast, we could not discover any genes related to “Fibrinolytic purpura”
(umls:C0311369) [16] in DisGeNET.
Moreover, DisGeNET does not contain biological process-disease associations; therefore,
starting from a biological process such as platelet aggregation or fibrinolysis, we cannot
find the related diseases. On the other hand, the refined JST thesaurus integrated with Gene
Ontology contains biological process-disease associations (e.g. “platelet aggregation
[precedes] thromboembolism”, and “fibrinolysis [precedes] fibrinolytic purpura”), biological
process-biological process associations (e.g. “platelet activation
(jstid:200906081976560414) [precedes] platelet aggregation”, and “complement activation
(jstid:200906074237853199) [precedes] immune adherence (jstid:200906025733875940)”), and
disease-disease associations (e.g. “thromboembolism [related] cerebral infarction
(jstid:200906091496103906)”, and “interstitial pneumonia (jstid:200906019954244040)
[precedes] pulmonary fibrosis (jstid:200906004756473001)”). Therefore, for example, starting
from a biological process (e.g. platelet activation) and following directional relations
(e.g. precedes), we can discover the related biological processes (e.g. platelet
aggregation) and diseases (e.g. thromboembolism and cerebral infarction) (Fig. 3).
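Following directional relations from a starting concept amounts to a graph traversal. As a minimal sketch, the Python code below walks the three associations quoted above breadth-first; the adjacency-list layout and the function name `downstream` are assumptions for illustration, and labels stand in for the jstid identifiers.

```python
from collections import deque

# Outgoing relations for each concept, taken from the associations quoted above.
edges = {
    "platelet activation": [("precedes", "platelet aggregation")],
    "platelet aggregation": [("precedes", "thromboembolism")],
    "thromboembolism": [("related", "cerebral infarction")],
}


def downstream(start, edges):
    """Breadth-first walk along outgoing relations; returns concepts in visit order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _relation, target in edges.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


reached = downstream("platelet activation", edges)
print(reached)
```

In SPARQL the same walk could be expressed with a property path, but the explicit loop makes it easier to see that only outgoing, directed relations are followed.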
1 The gene products in Gene Ontology were obtained using the SPARQL endpoint that contains Gene Ontology integrated with the JST thesaurus.
5 Conclusions
In this study, four life science experts sub-classified 2065 RTs in the life science
categories of the JST thesaurus into 31 kinds of relations and assigned 30 standard ontology
terms. We established an efficient workflow for the sub-classification: one expert performs
a preliminary sub-classification, several other experts review and revise it, and all of the
experts finalize it in accordance with the guideline.
We can infer that 916 terms imply 74 different types of functions, and 576 terms imply 36
different types of roles, by using “is a” relations. In addition, we can infer that five
terms imply four different types of functions by using “has part” relations. The
sub-classification relations are not specific to the life science domain, and are widely
applicable to science and technology terms. We also consider that reasoning using “is a” and
“has part” relations could be applied widely to science and technology terms.
Fig. 5. The relations among the JST thesaurus, Gene Ontology, and MeSH terms related to
thromboembolism, and platelet aggregation
We created the RDF of the refined JST thesaurus, stored it in a triple store, and prepared
to mash up the RDF with other ontologies, thesauri, and data sets such as MeSH and Gene
Ontology, so that the JST thesaurus can be used as a hub among them in SPARQL queries
(Fig. 5). In particular, we expect to be able to apply the highly reliable associations
between biological processes and gene products in Gene Ontology to the study of associations
among gene products, biological processes, and diseases using the JST thesaurus. This is one
of the major outcomes of this study.
We are planning to release a public SPARQL endpoint for the refined JST thesaurus, and
moreover, to make the RDF downloadable under a Creative Commons License. Using the endpoint
and/or the downloaded RDF, users will be able to perform SPARQL queries over semantic data
including the refined JST thesaurus, Gene Ontology, DisGeNET, and MeSH. As a result, with
the refined JST thesaurus as a hub, we expect users to be able to discover connections such
as relations between diseases (e.g. thromboembolism) and the preceding biological phenomena
(e.g. platelet aggregation), relations between diseases (e.g. fibrinolytic purpura) and the
related gene products (e.g. PLAT), and relations between disease states (e.g. interstitial
pneumonia) and the succeeding more severe ones (e.g. pulmonary fibrosis).
In the future we intend to sub-classify the remaining RTs that have not yet been
sub-classified, and to integrate the JST thesaurus with other data sources, ontologies, and
thesauri such as MeSH, DBpedia [17], ChEBI [18], and NikkajiRDF [19], which is a chemical
substance dictionary. In addition, we are trying to elucidate the associations between
genes, chemical substances, and various biological phenomena including diseases.
6 Acknowledgements
A part of this study was presented at BioHackathon 2016
(http://2016.biohackathon.org/), which is a research and development
meeting for database integration. We are grateful to all of the participants who gave us
their valuable and constructive comments.
7 References
1. J-GLOBAL, http://jglobal.jst.go.jp/en/
2. JST thesaurus map, http://thesaurus-map.jst.go.jp/
3. Kawamura T., Kozaki K., Kushida T., Kimura T., Watanabe K., Matsumura K.:
Preliminary Report of Fertilizing Science and Technology Thesaurus from Large-Scale Bibliographic Datasets using Word Embedding (in Japanese). SIG-SWO038, 1-6 (2015)
4. J-GLOBAL knowledge, https://stirdf.jglobal.jst.go.jp/
5. Bodenreider O., Nelson S.J., Hole W.T., Chang H.F.: Beyond synonymy: exploiting the UMLS semantics in mapping vocabularies. Proc AMIA Symp. 1998, 815-819 (1998)
6. SNOMED CT, http://www.ihtsdo.org/snomed-ct
7. Hozo: Ontology Editor, http://www.hozo.jp/
8. Ontobee, http://www.ontobee.org/
9. BioPortal, http://bioportal.bioontology.org/
10. Linked Open Vocabularies, http://lov.okfn.org/dataset/lov/
11. R package “irr”, https://cran.r-project.org/web/packages/irr/irr.pdf
12. UniProt Knowledgebase, http://purl.uniprot.org/uniprot/
13. Gene Ontology Consortium: Creating the gene ontology resource: design and implementation. Genome Res. 11(8), 1425-1433 (2001).
14. Piñero J., Queralt-Rosinach N., Bravo À., Deu-Pons J., Bauer-Mehren A., Baron
M., Sanz F., Furlong L.I.: DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015, 1-17 (2015).
15. DisGeNET SPARQL endpoint, http://rdf.disgenet.org/sparql/
16. Unified Medical Language System (UMLS), http://linkedlifedata.com/resource/umls/id/
17. Bizer C., Lehmann J., Kobilarov G., Auer S., Becker C., Cyganiak R.,
Hellmann S.: DBpedia - A crystallization point for the Web of Data. J. Web
Semant. 7(3), 154–165 (2009)
18. Hastings J., de Matos P., Dekker A., Ennis M., Harsha B., Kale N., Muthukrishnan
V., Owen G., Turner S., Williams M., Steinbeck C.: The ChEBI reference database
and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic
Acids Res. 41(Database issue), D456-463 (2013).
19. NBDC NikkajiRDF, http://doi.org/10.18908/lsdba.nbdc0153002-000