WorldQuant Perspectives
An Algorithmic Approach to Hacking Cancer
05.24.17

Advances in bioinformatics and new algorithms are giving researchers the tools to more effectively combine data from multiple sources and better tackle disease.
LAST YEAR AN INTERNATIONAL TEAM OF RESEARCHERS FROM
Canada, Germany, Russia and the U.S. published a paper about
macrophages — large white blood cells whose job is to search
out, engulf and destroy pathogens in the body. Macrophages, which
toggle among three different states depending on the concentration
of certain substances in the body, can also play a role in tumor
growth when one of those modes is activated. As part of their
research, the scientists looked at the Krebs cycle — a metabolic
process first identified and described in 1937 — and studied how
macrophages switch among states. Notably, the team used a
specially designed computer algorithm to combine RNA sequencing
with metabolic profiling data. They discovered that the chemical
compound known as itaconate plays a crucial role in both the Krebs
cycle and macrophage mode switching. One of the hypotheses for
future researchers to test is whether itaconate can be used to force
macrophages to switch among modes — essentially, employing a
computer algorithm to try to “hack” the human immune system to
help fight cancer.
The macrophage research was made possible by advances in bioinformatics, the science of using computers and large computational clusters to analyze biological and biochemical data. Similar to hacking a computer, where programmers decode machine-executable instructions to understand what is happening inside a program, scientists studying living cells need to understand all the internal processes at work. Cells consist of dozens of components and contain thousands of proteins and metabolites. Deciphering their interactions is much more complicated than understanding how a smartphone or another complex modern device works. Biological information is stored on multiple levels: deoxyribonucleic acid (DNA), ribonucleic acid (RNA), proteins and many variations of those biomolecules. Algorithms, which are capable of processing large amounts of varied types of data, become a crucial component in understanding that information.
Qualitative Science
DNA, which stores genetic information, is similar to a computer hard
drive. RNA is analogous to a computer’s working memory, where
only the fragments currently needed are loaded (in cells this operation
is called transcription). Proteins implement the many functions in the
cell; they are synthesized from RNA in a process called translation.
Proteins can be compared to the software programs executed on
a computer. But the computer analogy only goes so far, because
understanding all the intracellular processes requires more than
simply identifying the genetic code. Those processes depend not only on various objects and substances but also on their interactions, which can be quite complex. Moreover, metabolic processes like the Krebs cycle continually transform substances, so the picture never stays fixed. In total, the number of interacting components in a cell can reach the tens of millions, as the composition and concentration of chemicals change over time.
Dutch theoretical biologist Paulien Hogeweg and her colleague Ben
Hesper are generally credited with coining the word “bioinformatics”
in the early 1970s, using it to describe “the study of informatic
processes in biotic systems.” However, it wasn’t until the late 1980s
that the term was used to refer to the computational analysis
of genomics data. In fact, before the 1980s biology was not a
quantitative discipline; it was aimed at describing, classifying and
building qualitative models. Experimental data sets were small, and sophisticated methods were not needed to analyze them.
Bioinformatics would be impossible without the progress in
sequencing over the preceding decades, starting in the 1950s with
work by British biochemist Frederick Sanger, who determined the
amino acid sequence of the protein insulin. Sanger would go on to
make major breakthroughs in sequencing RNA molecules, in the
1960s, and the nucleotide order of DNA, in the 1970s; he
is one of only two people who have received a Nobel Prize twice in
the same category. (He won in chemistry in 1958 and 1980.)
The first sequencing experiments produced kilobytes of data. By the
time Sanger won his second Nobel, they were producing hundreds
of kilobytes of data thanks to improvements in technology. By
April 2003, when an international consortium completed the 13-year project to map and sequence the 23 chromosome pairs that
constitute the human genome, DNA sequencing was creating
gigabytes of data (six gigabytes in the case of the Human Genome
Project, which used the Sanger method for sequencing). Since then
several high-throughput sequencing technologies have emerged. The most notable is the next-generation sequencing technology introduced by Solexa,
a Hayward, California, company that was acquired by San Diego–
based Illumina in 2007. These technologies have lowered the cost of
sequencing the human genome from around $5 billion in 2003 to just
$1,000 in 2017.
Gigabytes of Data
Modern genome sequencing machines produce hundreds of
gigabytes of data on a daily basis; processing this data is not feasible
without computers. Advances in technology — namely, steady-state
profiling methods — allow researchers to take snapshots of what
is happening in cells on the RNA, protein and metabolic levels. To
process this information, scientists need new algorithms that are
capable of combining data from multiple sources in a rigorous and
systematic way.
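To make that idea concrete, the sketch below shows one simple way such data might be combined: two profiling snapshots, one of RNA expression and one of metabolite concentrations, are standardized and placed side by side in a single table. The file names, column labels and the use of plain z-scoring are illustrative assumptions, not a description of any particular published pipeline.

```python
# A minimal sketch of combining two profiling snapshots into one table.
# File names and column labels are hypothetical; real pipelines also need
# normalization steps specific to each assay (e.g., library-size correction
# for RNA-seq) before any joint analysis.
import pandas as pd

def load_and_standardize(path):
    """Read a samples-by-features table and z-score each feature column."""
    table = pd.read_csv(path, index_col="sample_id")
    return (table - table.mean()) / table.std(ddof=0)

rna = load_and_standardize("rna_expression.csv")              # samples x genes
metabolites = load_and_standardize("metabolite_levels.csv")   # samples x metabolites

# Align on shared samples and place both data types side by side, so that
# downstream algorithms see RNA and metabolic measurements in one matrix.
combined = pd.concat({"rna": rna, "metabolite": metabolites}, axis=1, join="inner")
print(combined.shape)
```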
Besides sequencing and steady-state profiling, researchers can
turn to databases curated by biology and bioinformatics community
members; examples include the National Center for Biotechnology
Information (NCBI) and the Kyoto Encyclopedia of Genes and
Genomes (KEGG). These databases store information in a structured
form around known biological objects (genes, proteins, enzymes,
molecules) and their interactions and transformations (pathways,
reactions, functional hierarchies, etc.). Unstructured information
can be found in research papers, which also are available in online
databases, such as PubMed, maintained by the National Institutes
of Health (NIH), and in open access journals such as PLOS Biology.
Although these databases are generally well maintained, they may not reflect all the available information at any given moment.
Therefore, the main goal of unstructured information analysis is
to extract relationships among biological objects not present in
databases. Tools and techniques that can be used to do this include
deep learning, latent Dirichlet allocation (LDA), Word2vec and other
natural language processing methods.
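As a rough illustration of the topic-modeling side of this toolkit, the sketch below runs latent Dirichlet allocation over a handful of toy abstracts using scikit-learn. The abstracts and topic count are made up; a real analysis would work from thousands of PubMed abstracts and examine which genes and metabolites co-occur in the same topics.

```python
# A small sketch of topic modeling over article abstracts with latent
# Dirichlet allocation. The abstracts below are toy placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "itaconate accumulation alters the Krebs cycle in activated macrophages",
    "macrophage polarization is driven by metabolic reprogramming",
    "RNA sequencing reveals gene expression changes in tumor cells",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the most heavily weighted words in each inferred topic.
words = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top = [words[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {topic_id}: {', '.join(top)}")
```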
However, the amount of data and knowledge available in these
databases far exceeds what a researcher can grasp without the aid
of computer algorithms. For example, there are more than 20 million
genes and more than 10,000 biochemical reactions in the KEGG
database, and PubMed contains more than 20 million abstracts.
These algorithms need to incorporate both the existing knowledge
available in databases and new experimental results from RNA,
protein and metabolic profiling. The first step in processing new data
is to map it to the existing knowledge base: sequences or networks
from databases. The second step is to check whether the new data
is consistent with existing data; this usually involves computing
relevance scores and solving an optimization problem. Where inconsistencies are found, researchers need algorithms to check whether they are statistically significant; if they are, new knowledge has been discovered.
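A common, simple instance of this map-and-test pattern is an over-representation test: map an experimental list of genes onto a curated pathway and ask whether the overlap could plausibly have arisen by chance. The sketch below uses a hypergeometric test from SciPy; the gene names, set sizes and genome size are hypothetical, and real analyses also correct for testing many pathways at once.

```python
# Map an experimental gene list onto a curated pathway gene set and test
# whether the overlap is statistically significant. All identifiers and
# sizes here are hypothetical.
from scipy.stats import hypergeom

genome_size = 20000                                           # assumed number of annotated genes
pathway_genes = {"IDH1", "ACO2", "CS", "SDHA", "FH"}          # hypothetical pathway gene set
experiment_hits = {"IDH1", "ACO2", "SDHA", "TP53", "MYC"}     # hypothetical experimental hits

overlap = len(pathway_genes & experiment_hits)

# P(overlap >= observed) if the hit list were drawn at random from the genome.
p_value = hypergeom.sf(overlap - 1, genome_size,
                       len(pathway_genes), len(experiment_hits))
print(f"overlap = {overlap}, enrichment p-value = {p_value:.3g}")
```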
Genes and Metabolites
The basic building blocks and methods used in these algorithms
depend on the type of data with which researchers are dealing. For
genome and RNA sequencing data, the main methods are approximate string matching and local sequence alignment, used in tools such as the NIH’s Basic Local Alignment Search Tool (BLAST). To deal with
networks that describe protein-to-protein interaction, metabolic
pathways, gene interactions or chemical reactions, researchers
typically use graph algorithms to find connected components and
solve optimization problems on subnetworks. Statistical tests are
needed to calculate probability values and relevance scores.
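For readers unfamiliar with local alignment, the sketch below spells out the underlying dynamic program (Smith-Waterman scoring). BLAST itself relies on fast seed-and-extend heuristics rather than filling the full matrix, and the match, mismatch and gap scores here are arbitrary illustrative choices.

```python
# A compact sketch of local alignment scoring (the Smith-Waterman dynamic
# program). The scoring parameters are illustrative defaults.
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0,                      # start a new local alignment
                              diagonal,               # extend by match/mismatch
                              score[i - 1][j] + gap,  # gap in b
                              score[i][j - 1] + gap)  # gap in a
            best = max(best, score[i][j])
    return best

print(local_alignment_score("GATTACA", "GCATGCA"))
```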
Consider a web service for integrated transcriptional and metabolic
network analysis, developed by several of the researchers who
published the 2016 paper on macrophages. Called GAM, which
stands for “genes and metabolites,” the new service was built
using an inventive subnetwork search algorithm. As a first step,
the researchers loaded the network of chemical reactions from the KEGG database. Then they mapped gene expression and metabolic data onto the network’s nodes, keeping only the part of the network likely to be related to the input data. After that they assigned relevance scores to each node and link in the network. The final step involved solving an optimization problem: finding a connected subnetwork that maximizes the total relevance score.
As computational power grows, researchers can handle ever-larger optimization problems, taking advantage of recent
improvements in convex and distributed optimization methods.
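To give a feel for the objective, the sketch below grows a connected subnetwork greedily around the highest-scoring node. This is only a toy heuristic, not the algorithm behind GAM, and the miniature metabolite network and relevance scores are invented for illustration.

```python
# A simplified greedy heuristic for the subnetwork step described above:
# start from the highest-scoring node and keep adding the best-scoring
# neighbor while it contributes positive relevance. The network and scores
# are made up for illustration.
import networkx as nx

network = nx.Graph()
network.add_edges_from([
    ("citrate", "aconitate"), ("aconitate", "itaconate"),
    ("aconitate", "isocitrate"), ("isocitrate", "alpha-ketoglutarate"),
    ("itaconate", "succinate"), ("alpha-ketoglutarate", "succinate"),
])
scores = {"citrate": 0.5, "aconitate": 1.2, "itaconate": 2.5,
          "isocitrate": -0.4, "alpha-ketoglutarate": -0.8, "succinate": 0.9}

def greedy_subnetwork(graph, node_scores):
    """Grow a connected subnetwork greedily by total relevance score."""
    current = {max(node_scores, key=node_scores.get)}
    while True:
        frontier = {n for v in current for n in graph.neighbors(v)} - current
        if not frontier:
            break
        best = max(frontier, key=lambda n: node_scores[n])
        if node_scores[best] <= 0:   # stop when no neighbor adds positive score
            break
        current.add(best)
    return current

print(greedy_subnetwork(network, scores))
```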
Algorithms such as those described in this article can help
discover chemical compounds that play important roles in disease
pathways. When applied to data from cancer cells, these algorithms may even help researchers find ways to regulate those cells directly. ◀
Thought Leadership articles are prepared by and are the property of WorldQuant, LLC, and are circulated for informational and educational purposes only.
This article is not intended to relate specifically to any investment strategy or product that WorldQuant offers, nor does this article constitute investment
advice or convey an offer to sell, or the solicitation of an offer to buy, any securities or other financial products. In addition, the above information is not
intended to provide, and should not be relied upon for, investment, accounting, legal or tax advice. Past performance should not be considered indicative of
future performance. WorldQuant makes no representations, express or implied, regarding the accuracy or adequacy of this information, and you accept all
risks in relying on the above information for any purposes whatsoever. The views expressed herein are solely those of WorldQuant as of the date of this article
and are subject to change without notice. No assurances can be given that any aims, assumptions, expectations and/or goals described in this article will be
realized or that the activities described in the article did or will continue at all or in the same manner as they were conducted during the period covered by
this article. WorldQuant does not undertake to advise you of any changes in the views expressed herein. WorldQuant may have a significant financial interest
in one or more of any positions and/or securities or derivatives discussed.
Copyright © 2017 WorldQuant, LLC