DaTo: an atlas of biological databases and tools

Journal of Integrative Bioinformatics, 13(2):297, 2016
http://journal.imbio.de/
DaTo: an atlas of biological databases and tools
1
2
Department of Bioinformatics, College of Life Sciences,
Zhejiang University, 310058 Hangzhou, China
Computational Life Sciences, Department of Computer and Information Science,
University of Konstanz, 78457 Konstanz, Germany
3
Faculty of Information Technology,
Monash University, 3800 Melbourne, Australia
4
Department of Bioinformatics and Medical Informatics, Faculty of Technology,
Bielefeld University, 33615 Bielefeld, Germany
Summary
This work presents DaTo, a semi-automatically generated world atlas of biological
databases and tools. It extracts raw information from all PubMed articles which contain
exact URLs in their abstract section, followed by a manual curation of the abstract and the
URL accessibility. DaTo features a user-friendly query interface, providing extensible
URL-related annotations, such as the status, the location and the country of the URL. A
graphical interaction network browser has also been integrated into the DaTo web interface
to facilitate exploration of the relationship between different tools and databases with
respect to their ontology-based semantic similarity. Using DaTo, the geographical
locations, the health statuses, as well as the journal associations were evaluated with respect
to the historical development of bioinformatics tools and databases over the last 20 years.
We hope it will inspire the biological community to gain a systematic insight into
bioinformatics resources. DaTo is accessible via http://bis.zju.edu.cn/DaTo/.
1
Introduction
In the past decades, a variety of publicly available data repositories and resources have been
developed for many medical and biology-related applications. Nowadays, many online
databases and data analysis tools have been developed for life science research. Although some
of these approaches have been collected in special journal issues, such as the Nucleic Acids
Research (NAR) Database Issues [1] or Webserver Issues [2], others remain scattered
throughout the Internet and within a vast amount of literature. Web searches via Google®,
Yahoo® and similar general search systems do not exclusively retrieve online resources.
Therefore, the extraction of useful information is quite difficult. The huge amount of resources
and the lack of a complete list of these resources make it difficult for researchers to quickly find
the appropriate tools or databases for their specific purpose [3].
To tackle these problems, we developed DaTo (Databases and Tools), a semi-automatic
approach to collect, curate and navigate through a large list of different tools and databases.
*
To whom correspondence should be addressed. Email: [email protected]
doi:10.2390/biecoll-jib-2016-297
1
Copyright 2016 The Author(s). Published by Journal of Integrative Bioinformatics.
This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).
Qilin Li1, Yincong Zhou1, Yingmin Jiao1, Zhao Zhang1, Lin Bai1, Li Tong1, Xiong Yang1,
Björn Sommer2,3, Ralf Hofestädt4, Ming Chen1,*
http://journal.imbio.de/
Moreover, DaTo adds a complete new dimension to the analysis and visualization of important
tools and databases. By using a Google MapsTM-based approach, it is possible to localize online
resources to specific countries, states and cities. Therefore, it is possible to find out which
research-related tools and databases are geographically close to the home university, plus which
online resources are well-developed at a certain university and which ones might have to be
extended in the future. Also, it is possible to located cooperation partners in the neighborhood,
providing services required for local research. Depending on the research area, it might be
important to cooperate with close-by institutions.
Moreover, we have analyzed the health status of the relevant web links, as well as the impact
of the respective publications’ countries, journals and years.
2
Related Work
To address the previously discussed issues, many groups have collected life-science-related
online resources, providing several bioinformatics-related resources with a search function that
are available via Internet.
For instance, the Neuroscience Information Framework (NIF) is a dynamic inventory of webbased neuroscience resources: data, materials, and tools accessible via any computer connected
to the Internet [4]. However, NIF does not exclusively return biomedical tool or database
resources, but also some literatures describing the application of these resources. This mixed
return page confuses researches to find the appropriate ones.
Other examples include the Bioinformatics Links Directory (BLD) [5], OReFiL [6] (using
URLs as a ‘proxy’ to extract items from BioMed Central papers), BIRI [7], JIBtools [8] and
bio.tools [9].
BLD is a catalogue of links to bioinformatics resources, tools and databases based on
recommendations from experts in the field. BIRI utilizes keywords and sentence structures to
identify relevant terms through custom patterns. BLD and BIRI subdivide the resources into
subclasses based on the research topic, making it impossible to return resources that correspond
to a specific search word. For example, a search for the word ‘miRNA’ in BLD as well as BIRI
returns no results. In addition, BIRI has stopped updating.
OReFiL is the only website collection which returns up-to-date and query-relevant online
resources based on peer-reviewed publications. But OReFiL does not focus on the collection of
tools, it includes all online resources relevant for a specific key word. In addition, some of the
returned resources either lack an accurate title, description, or contain unrelated items. Another
problem is that OReFiL cannot return all the search-word-related resources. Taking ‘miRNA’
as an example and restricting the search results to 500, many miRNA-unrelated items are
returned, some of which even link to PNG images without any relation to miRNA.
JIBtools is a collection of tools lists curated by a specific editor who is responsible for a specific
expert field. This approach highly depends on the motivation of single persons to provide and
update an field-adequate list of tools.
Bio.tools also provides a manually curated list of tools. Any researcher is allowed to register to
the system and to add additional entries. However, no automatization of this process is included,
and the registration process is quite complex, as many different key terms have to be registered
to the system.
Other large biological data repositories for researchers which have to be mentioned are Re3data
(http://re3data.org, last updated April 13 2016)[10] is a comprehensive registry of biological
data repositories available on the web, which lists over 1,500 research data repositories. It also
doi:10.2390/biecoll-jib-2016-297
2
Copyright 2016 The Author(s). Published by Journal of Integrative Bioinformatics.
This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).
Journal of Integrative Bioinformatics, 13(2):297, 2016
Journal of Integrative Bioinformatics, 13(2):297, 2016
http://journal.imbio.de/
Moreover, none of these tools provide the opportunity to visualize the geographical locations
of the database- or tool-publishing universities. But a large number of Google MapsTM -related
approaches in other research fields exist nowadays, which are making use of the underlying
Google MapsTM Javascript API to visualize specific scientific aspects in relation to their
location. To give only one recent example: it was used to provide geolocational information
concerning the health status of the Great Barrier Reef [12].
3
Architecture/Implementation
3.1
PubMed data and DaTo
PubMed currently includes citations and abstracts from over 5,000 life science journals for
biomedical articles reaching back to September 1, 1994. Since its inception, PubMed has served
as the primary tool for electronically searching and retrieving biomedical literature. However,
like NIF, PubMed also contains a large amount of literature which is not related to databases or
tools and its search results are too broad for users who want to find online resources.
Recognizing the limitations of the previously-discussed related approaches, we developed
DaTo (Databases and Tools), which is comprised of 21,159 bioinformatics resources, and
which is currently the most comprehensive database providing high actuality.
DaTo extracts the raw information from all the PubMed articles which contain exact URLs in
their abstract section, followed by manually checking the URL accessibility. DaTo features a
user-friendly query interface, providing comprehensive annotations for each result, such as the
description of the resources, the abstract of the original literature, the link to the corresponding
PubMed entries and corresponding webpage. A graphical interaction network browser has also
been integrated into the DaTo web interface, enabling the exploration of the relationship
between the tools and databases based on the similarity of MeSH term, a comprehensive
controlled vocabulary for the purpose of indexing journal articles and books in the life sciences
and serving as a thesaurus that facilitates searching.
3.2
Data collection and management
DaTo aims to be the most comprehensive repository of databases and tools extracted from
PubMed. Any PubMed article which provides a URL with “http” or “ftp” pattern in its title
and/or abstract section is downloaded and its name and URL is computationally extracted. The
workflow is shown in Figure 1. Specifically, the detail information such as the name, URL etc.
are checked and corrected manually. A care study of all the records’ simple description mined
is performed in order to save time and reach more reliable conclusions by choosing the most
appropriate analysis tools. Therefore, the database is capable to provide not only abundant but
also accurate records of databases and tools. DaTo is updated automatically from PubMed every
two weeks, and then the raw information is manually curated.
The search result page is organized in six columns: “Local ID”, “PubMed ID”,
“Database/Software Name”, “Description”, “URL Health” and “URL Country”. The advanced
doi:10.2390/biecoll-jib-2016-297
3
Copyright 2016 The Author(s). Published by Journal of Integrative Bioinformatics.
This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).
supports browsing by subject, content type and country, and offers an API for researchers.
However, the website is deficient in stability, making it easy to freeze when users try to search
for biological resources. Biosharing (https://biosharing.org, last updated Dec 7 2016) [11] is
another resource on inter-related data standards, databases, and policies. It consists of 671 data
standards, 831 databases and 85 policies, and displays them with different tags. On the other
hand, biosharing only showed these data in static pages, which makes user hard to analyze the
trend of biological data sources because of a lack of temporally and spatially dynamic module.
http://journal.imbio.de/
search option incorporates additional options, such as the abstract, publication date, journal etc.
Users can also specify different search connotations with logic terms – such as “and”, “or”, and
“not” – to find suitable biology-related software. To trace the URL health, we developed a Perl
script to check if the website is still available. Because each request of the website will be
returned with a response code, we can detect the status of the website. For example, the code
started with 2 means the request is successful, with 3 means the site has been redirected to other
location, and with 4 or 5 means some error happened at client or server side, respectively.
Another strategy for tracing the URL health is that the URL health information will be refreshed
automatically if the user submits a search query.
Figure 1: DaTo workflow.
3.3
Geographic location parsing and visualization
In order to facilitate efficient investigation of the international geographic distribution of the
hosts and to provide the user a straightforward and accurate perception of the geographic
location, we have traced all the first author’s affiliations and website IP addresses to reveal their
location: region code, city, latitude, longitude, ISP, and organization as well as country code
and country name. DaTo adopts MaxMind® GeoIP for the geographic location of the IP
addresses and Google MapsTM Javascript API to display the atlas.
4
Application
4.1
Geographic location
A map of all the geographic locations of the hosts is shown in Figure 2. Secondly, the
distribution over time was computed. This can help researchers to analyze the historical
development of tools and databases from a geographical perspective. The first biological host
appeared in Italy on September 1, 1994 [13]. The amounts of biological hosts increased to over
800 during the following years up to 2000, and the hosts spread over North America, Europe
and East Asia. In the new century, biological databases and tools underwent a rapid
development, especially in emerging countries such as China, Russia, Brazil, India and South
Africa. Currently, hosts of biological resources exist in six continents, and most of them are
located in United States and Europe Union.
doi:10.2390/biecoll-jib-2016-297
4
Copyright 2016 The Author(s). Published by Journal of Integrative Bioinformatics.
This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).
Journal of Integrative Bioinformatics, 13(2):297, 2016
http://journal.imbio.de/
Figure 2: Geographic locations of biological databases and tools (over the years 1994-2013).
Figure 3: San Francisco Bay Area geographic locations of biological databases and tools:
a) San Francisco Bay plus neighborhood and Los Angeles plus neighborhood;
b) San Francisco neighborhood only; c) San Francisco Bay Area.
Figure 3 shows as an example the San Francisco Bay area at different granularity levels. Figure
3a shows the amount of databases and tools in San Francisco and Los Angeles and the
corresponding neighborhoods. The detail level depends on the zoom level, just as known from
Google maps. Figure 3b shows now in more detail San Francisco and its neighborhood. The
205 entries are distributed over the San Francisco Bay Area.
doi:10.2390/biecoll-jib-2016-297
5
Copyright 2016 The Author(s). Published by Journal of Integrative Bioinformatics.
This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).
Journal of Integrative Bioinformatics, 13(2):297, 2016
Journal of Integrative Bioinformatics, 13(2):297, 2016
http://journal.imbio.de/
One of the geo locations in the top-left side of Figure 3b is showing a green “G”, indicating that
the website has a good health status. Finally, Figure 3c shows the San Francisco Bay Area
which contains a number of red “B”, indicating that the corresponding web resources are
offline. The 205 databases and tools from Figure 3b are distributed over San Francisco, South
San Francisco, and Berkeley/neighborhood. The data was collected over the years 1994 to 2013.
Country, Journal and Year Associations
Copyright 2016 The Author(s). Published by Journal of Integrative Bioinformatics.
This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).
4.2
Figure 3: DaTo status statistics of countries (A/top), journals (B/center) and years (C/bottom).
doi:10.2390/biecoll-jib-2016-297
6
http://journal.imbio.de/
The status statistics show the top ten countries, the top ten journals and the top ten years (Figure
3A-3C). Figure 3A/top shows that the USA has the highest number of resources based on the
publication number. We also ranked journals based on the publication number. The top three
magazines are Oxford Bioinformatics, Nucleic Acids Research, and BMC Bioinformatics.
These three journals account for more than the half of all database/tools-related publications.
Other journals do not provide a significant number of publications in this area in comparison to
the top three journals. Obviously, the highest impact for Bioinformatics database/tool
developers has N.A.R. and Bioinformatics (Figure 3B/center). Focusing to the recent years
(starting from 2007), the publication of new tools peeked in 2013 at 10%, whereas 2014 and
2015 showed only 8%. The 31% from the other years s go back to the year 1994 and in each of
the here accumulated years max. 5% of the overall relevant articles were published.
5
Discussion
In the bioinformatics resource fields, there are few databases. As a manually-curated database
focusing on not only collection but also system analysis for the bioinformatics resources, there
is much value for both experimental biologists and computational biologists.
As future developments, we will incorporate a large set of resources and improve the accuracy
of the detail information. While many publications provide the URL pattern in the abstract,
there are also some exceptions. We should continue to effectively mine this kind of big data
sets and it will take a long time to select them manually to make sure the records feature a high
accuracy.
In order to understand the maintenance of biomedical resources better, the systematic study of
the status of the URL links in biological systems is necessary to reveal the qualitative and
quantitative health status of these biological resources. Since the maintenance of different
systems, such as the update process or even the redesign might temporarily change the
availability of a website, DaTo must frequently check the resources availability. If we cannot
do this, the related information of the database will disappear from DaTo. The statistical results
of DaTo illustrate a really bad situation that a considerable amount of resource included has
been unavailable, and it is a common problem regardless of the corresponding countries or
journals. We have to pay more attention on this and establish some mechanism to monitor such
tasks.
As studied in DaTo, some developing countries show a fast progress within the last few years
in the biology-related resources area. But in comparison to the USA, most of these countries
have a long way to go and they have to invest more resources into bioinformatics research.
Some countries started relatively late their research in the bioinformatics area, but most of them
show a fast progress within the recent years.
The underlying algorithm to identify the geographical location requires that the URL is clearly
stated in the abstract. In the future, it might be possible to mine complete articles, however, this
will be a very complex task. Also, the health status is only checked automatically. In case, an
URL has changed over years, because, for example, a tool has changed its hosting institution,
DaTo will not find it. Again, manual curation is required in this case. Also, DaTo does not take
versioning into account. If for example a tool is published in two different, separately published
versions, DaTo will interpret them as two different tools. Moreover, a project might be
developed at different universities and hosted at different universities – also in this case DaTo
does not provide an optimal approach yet.
However, as a basic claim for all publications in the scope of DaTo, we strongly encourage
authors to include the URLs to their published services or tools in the corresponding abstract.
As previously mentioned.
doi:10.2390/biecoll-jib-2016-297
7
Copyright 2016 The Author(s). Published by Journal of Integrative Bioinformatics.
This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).
Journal of Integrative Bioinformatics, 13(2):297, 2016
Journal of Integrative Bioinformatics, 13(2):297, 2016
http://journal.imbio.de/
The rapid growth of available bioinformatics databases and tools requires new approaches to
organize and mine them. DaTo is a comprehensive atlas of biological databases and tools. It is
based on the biological resources collection from PubMed and facilitates the investigation of
the semantic relationship between different databases/tools based on the involved MeSH terms.
DaTo is a well-organized database with high accuracy. It provides comprehensive analysis tools
including geographical location, URL health status and semantic similarity visualization. We
hope it will inspire the biological community to systematically gain an insight into
bioinformatics resources.
Comparing DaTo with biological data repositories, such as Re3data and Biosharing, our DaTo
includes over 17,000 records of bioinformatics tools and databases based on first author’s
affiliations. In addition, DaTo supports a dynamic display with google maps, helping users to
explore the trend of bioinformatics since 1993. Currently, we are developing a new version of
DaTo called DaTo2, which will display biological data resources in a more concise format.
We believe that DaTo will be an important contribution enabling the systematic mining of
bioinformatics resources for users as well as developers. DaTo is accessible via
http://bis.zju.edu.cn/DaTo/.
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China [No.
30971743, 31371328, 31450110068, 31571366], and CSC & DAAD (PPP program No.
57136444).
References
[1]
[2]
[3]
[4]
[5]
[6]
[7]
X.M. Fernández-Suárez, D.J. Rigden, M.Y. Galperin, The 2014 nucleic acids research
database issue and an updated NAR online molecular biology database collection. Nucleic
acids research, 42(D1):D1-D6, 2014.
S. Hamperl, G. Benson, C. Brown, et al., Editorial: Nucleic Acids Research annual Web
Server Issue in 2014. Nucleic acids research, 42(1):W1-W2, 2014.
Y.-B. Chen, A. Chattopadhyay, P. Bergen, C. Gadd, N. Tannery, The Online
Bioinformatics Resources Collection at the University of Pittsburgh Health Sciences
Library System—a one-stop gateway to online bioinformatics databases and software
tools. Nucleic acids research, 35(suppl 1):D780-D785, 2007.
D. Gardner, H. Akil, G.A. Ascoli, et al., The neuroscience information framework: a data
and knowledge environment for neuroscience. Neuroinformatics, 6(3):149-160, 2008.
M.D. Brazas, D. Yim, W. Yeung, B.F. Ouellette, A decade of web server updates at the
bioinformatics links directory: 2003–2012. Nucleic acids research, gks632, 2012.
Y. Yamamoto, T. Takagi, OReFiL: an online resource finder for life sciences. BMC
bioinformatics, 8(1):1, 2007.
G. de la Calle, M. García-Remesal, S. Chiesa, D. de la Iglesia, V. Maojo, BIRI: a new
approach for automatically discovering and indexing available public bioinformatics
resources from the literature. BMC bioinformatics, 10(1):1, 2009.
doi:10.2390/biecoll-jib-2016-297
8
Copyright 2016 The Author(s). Published by Journal of Integrative Bioinformatics.
This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).
Nowadays, biological sciences are generating more data than ever. Therefore, it is required to
organize, catalogue and rate these resources, so that the contained information can be most
effectively explored [14]. So the next steps in the development of DaTo will be the integration
of a wiki-like concept and categories similar to the ones used by MetaBase [15]. Moreover, the
visualization of the databases/tools interconnections and evolutions has to be improved.
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
http://journal.imbio.de/
R. Hofestädt, B. Kormeier, M. Lange, et al., JIBtools: a Strategy to Reduce the
Bioinformatics Analysis Gap. Journal of Integrative Bioinformatics, 10(1):226, 2013.
J. Ison, K. Rapacki, H. Ménager, et al., Tools and data services registry: a community
effort to document bioinformatics resources. Nucleic acids research, gkv1116, 2015.
H. Pampel, P. Vierkant, F. Scholze, et al., Making research data repositories visible: the
re3data. org registry. PloS one, 8(11):e78080, 2013.
P. McQuilton, A. Gonzalez-Beltran, P. Rocca-Serra, et al., BioSharing: curated and
crowd-sourced metadata standards, databases and data policies in the life sciences.
Database, 2016(baw075, 2016.
H. Nim, T. Done, F. Schreiber, S. Boyd, Interactive geolocational and coral compositional
visualisation of Great Barrier Reef heat stress data, in: Big Data Visual Analytics
(BDVA), 2015, IEEE, 2015, pp. 1-7.
S. Pongor, Z. Hátsági, K. Degtyarenko, et al., The SBASE protein domain library, release
3.0: a collection of annotated protein sequence segments. Nucleic acids research,
22(17):3610, 1994.
J. Huang, B. Ru, P. Zhu, et al., MimoDB 2.0: a mimotope database and beyond. Nucleic
acids research, gkr922, 2011.
D.M. Bolser, P.-Y. Chibon, N. Palopoli, et al., MetaBase—the wiki-database of
biological databases. Nucleic acids research, 40(D1):D1250-D1254, 2012.
doi:10.2390/biecoll-jib-2016-297
9
Copyright 2016 The Author(s). Published by Journal of Integrative Bioinformatics.
This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).
Journal of Integrative Bioinformatics, 13(2):297, 2016