“Introduction” slide
OK, so you've all read that today's course is about metagenomics. But what does that
mean? Well, the answer is that the "meta" prefix is added simply to signify that there is
more than one genome involved. The picture here is supposed to illustrate that fact. In summary:
- Genomics studies one genome
- Metagenomics studies several genomes collectively
“Genomics” slide
Before defining metagenomics a bit better, I think we should step back and remind
ourselves what exactly genomics is and how it differs from the more usual term
"genetics".
Genomics is, in a sense, a sub-category of genetics: it studies the whole genetic
information of an organism, not just a single gene.
The "omics" suffix conveys the notion of a systematic and comprehensive study. A sort of
totality. For instance you could say "Genomics is the simultaneous study of all genes, all
discrete features in a genome" -- in the same way for proteins you would say: "Proteomics
is study of all the proteins in a cell". Then you have other terms such as: Lipidomics,
Metabolomics, etc.
So, genomics tries to view all the genes as a network or a system; it studies the
interactions between them within the context of a cell, via the mechanics of
transcription and translation. It tries to study complete processing pathways and attempts
to predict the global function of a cell.
To do this, genomics uses tools such as recombinant DNA, DNA sequencing in
combination with a lot of bioinformatics to make sense of the data.
So metagenomics is doing the same kind of study as genomics, only with several genomes
at once!
“Metagenomics” slide
Still, we should properly define the term before going any further.
Obviously metagenomics is a science. It is the science that studies metagenomes. That
doesn't help us much.
More precisely metagenomics is concerned with genetic material in samples that are
extracted straight from the environment. This means for instance taking a glass of water
from a lake, or taking a liter of soil from the ground with a shovel, and just extracting all the
DNA from it.
You will find many living organisms in a kilogram of earth, but the only reason you would
want to analyze all the genomes in there collectively is because you are targeting the
very small organisms, the microbes.
If you are interested in the worms living in the soil, for instance, you don't have to study
them collectively: you can just separate them by hand and study their genomes individually,
saving you lots of trouble.
The purpose of the metagenomic technique is to investigate the minuscule things that
you can't separate by hand or manipulate at all.
So, most of the time when someone uses the term metagenomics, you can replace it with
"microbial community genomics".
“Microbes” slide
So we said we were interested in microbes. What exactly does that term cover?
Most often it's defined by size: a microbe could be any living thing that's small enough
that you can't see it without a microscope. More precisely, anything smaller than 100 µm,
or a tenth of a millimeter.
However, you often see the term used with a more restrictive definition, under which
something is not a microbe unless it is a single cell. In this case it would exclude all
multicellular organisms as well as viruses.
For us, in this course, we are going to use the second, more restrictive definition.
It does, however, include all of the prokaryotes (namely bacteria and archaea) as well as
some of the unicellular eukaryotes, such as those found among the protozoa, the algae and the fungi.
That's basically what we mean by "microbes".
And microbes are a very interesting thing to study: they are arguably more important than
plants for the production of oxygen, as they fill the oceans and lakes. They are essential for
the life of plants themselves, as they fill all our soils and are responsible for the production
of inorganic nutrients.
So, now you are all going to tell me that this is not right and that you can perfectly
well isolate bacteria by growing them in colonies, like what happens to your strawberry jam
when you leave it too long in the fridge. Or like on this picture. (➤➤➤)
“Petridish” slide
Even though you can't manipulate them with your hands, why would you have to use
metagenomics and study them while they are still mixed together, just making your life harder?
Well, actually you can't really grow them so easily. Only a very few types of microbes
like to be grown like that and will start multiplying when placed in such a plastic box.
Imagine the following experiment: take a glass of water from, let's say, lake
Ekoln here to the south of Uppsala. Mix it, vortex it, allow it to settle, dilute the
supernatant and take two droplets of equal size from the result. Put the first droplet under
your microscope and spread the second droplet on a petri dish with some nice agar and
lots of nutrients. Finally, compare the results. What you will see is something like this.
(➤➤➤)
“Great plate anomaly” slide
Because the drops were of equal size and contained about the same number of
bacteria, you would expect the same number of colonies to appear on the agar
plate as the number of microbes you see under the microscope. But this is not the case.
You get maybe between a hundred and a thousand times fewer colonies on the petri dish.
This is one of the oldest and most profound puzzles in microbiology. We have some
hypotheses as to why this is the case, but the question has remained largely unsolved for
about 100 years.
What's the first explanation you could think of for this difference? Well, the first
possibility is that all but a few of the microbes you see under the microscope are dead,
since a microbe has to be alive and capable of growing to be detected by the other method,
the plate count: the petri dish will only reveal microbes that are viable. But this
is not a plausible explanation for such a big difference, as it would mean that almost all of
the microbes we see would have to be dead or in a dormant state.
The basic problem is that an agar plate is a very foreign environment for a bacterium to grow
on. Even on dishes prepared with minimal nutrients, the concentration of some
compounds will be much higher than what the microbes would encounter in their natural
environment, and might be toxic to them. One must also consider that some microbial
species are simply not adapted to grow in aggregates and form colonies. The current view
is that naturally occurring microbial communities form a complex network,
where one type of microbe will produce a substance that another microbe will use, and
the group becomes stable because it regulates itself. This makes it impossible to pull
out just one of the members and have it function alone.
At this point you should be telling me that it's just a question of finding the right recipe of
nutritional media for every strain of microbe one wants to isolate, and that scientists just
don't have enough of a "green thumb" with microbes. Well, the problem has been here for
about a century, and the tinkering with growth conditions by those three generations of
scientists has only produced minor improvements in the culturability problem.
There are other ways to cultivate them, such as growing them inside a container with
semi-permeable membranes that is placed back in their original environment, but that is subject to
other problems. Another approach that sometimes works is to create very simple
communities with only 3 or 4 species of microbes and hope they form a stable system.
“Great plate anomaly real pictures” slide
OK, it would look more like this in real life.
The consequence of all this is that we still don't have access to the majority of the
biodiversity contained in the microorganisms inhabiting the oceans, soils and air of our
planet. And for many of the divisions (the largest taxonomic units in the tree of life) we only
know they exist thanks to metagenomic experiments.
The conclusion of this slide is that:
1) Microbes don't like to be taken out of their environment.
2) The typical laboratory bacteria such as E. coli are the minority and are not representative at
all of the ecologically relevant organisms.
3) This is why we have no choice but to use metagenomics to study them directly
where they all live together.
It's a pity we can't isolate them all, but we can still learn lots of things by looking at the
bulk properties of the microbial communities.
“Bacterial counts” slide
Now that we have modern and accurate counting methods here are some numbers for
different environments.
The deep sea will have about ten thousands cells per milliliter of water, the open ocean
about a million. Highly productive estuaries can go up to 10 million. Soil almost always has
high bacteria abundance reaching about a hundred million. Then the last example is part
of your own body. In your colon you have concentrations of ten billion cells per ml.
If you do the math on that, you will quickly realize that you have more microbial cells inside
your body than cells of your own. This is possible because the microbes are so
much smaller than your human cells.
These numbers will be important later when considering how to analyze the data we get.
“The lab technique” slide
OK, so what is a metagenomics experiment in practice?
This is basically the procedure for doing it:
1. You collect a glass of water or dig up some earth from your favorite ecosystem. (➤➤➤)
2. You filter whatever you got through some very small pores. This is a pretty important
step, as it determines what kinds of organisms will be included in the result. Depending on
the filter size you choose, you define the cutoff for what is too big to be a microbe. You
also have to consider other things: for instance, the filamentous forms of bacteria that grow in
multicellular colonies may be filtered out because their colonies are too large. You
can also use a second filter with even smaller pore sizes to discard the viruses, defining
the second cutoff for what is too small to be a microbe. (➤➤➤)
3. Then, with the cells you have collected, you break their membranes and extract all the
DNA that comes out. You can do this either by physically beating them or by disrupting them
with high-energy ultrasound. At this step you also fragment the DNA into shorter pieces
that measure maybe 10,000 base pairs on average, although this size depends on the
protocol you use. (➤➤➤)
4. There are a few lab steps to prepare the extract, which will include amplifying the DNA
with PCR, but simply put you can basically just insert your solution of DNA fragments into
a sequencer. Here in the picture is a machine called "MiSeq", produced by the company
"Illumina". We don't have these machines in our lab; there is actually a centralized service
for Uppsala University, called the sequencing facility, where you can just submit your
samples. (➤➤➤)
5. They often have lots of work to do, so about two or three months later, you receive a file
placed somewhere on a server.
“The file” slide
If you open the resulting file in your text editor, here is what it would look like. Just
sequences, one after the other, endlessly. You can get between ten thousand and ten
million sequences per experiment, depending on how many samples you run in parallel in
the sequencer.
After the sampling expedition, the careful laboratory procedures and the expensive
sequencing machines, this is the result that is produced.
Every short DNA sequence here is usually called a read. It starts with an identifier, then
come the nucleotides in the ATGC alphabet, and finally every nucleotide is associated with a
quality score, which measures how sure the machine was of having read the correct base.
Because, indeed, these sequencing machines make mistakes and sometimes give you a G
instead of a C, etc.
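If you are curious, here is a minimal sketch in Python of how you could walk through such a file yourself. It assumes the common four-line FASTQ layout and the Phred+33 quality encoding; the file name is just a placeholder, and your sequencing facility might deliver a slightly different variant.

```python
# Minimal FASTQ reader: each read spans four lines --
# "@identifier", the nucleotides, a "+" separator, and the quality string.
def read_fastq(path):
    with open(path) as handle:
        while True:
            header = handle.readline().strip()
            if not header:
                break  # end of file
            seq = handle.readline().strip()
            handle.readline()  # skip the "+" separator line
            qual = handle.readline().strip()
            # Phred+33 encoding: subtract 33 from each character's ASCII code
            scores = [ord(c) - 33 for c in qual]
            yield header[1:], seq, scores

# Example usage: print each read's name, length and mean quality
for name, seq, scores in read_fastq("sample_reads.fastq"):
    print(name, len(seq), sum(scores) / len(scores))
```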
Of course, viewing the result like this in a text editor doesn't serve much purpose, and you
are unable to say anything about the microbes that were living in the environment you
sampled. You need to process it, and you can't do it by hand: you are going to need a series
of programs and tools to extract statistical properties from this mess of DNA fragments.
That's the bioinformatics part of things. And it's only with bioinformatics that you can use
this data to answer questions such as:
- What organisms are present in the community?
- How similar or different are environments with regard to their microflora?
- How are organisms within the community related, and how diverse are they?
- Can we attribute ecological traits to groups of organisms? An ecological trait could be
anything directly measurable, such as production of methane or speed of turnover.
- Can we get an idea of what the different groups of organisms are doing? For instance,
from a metabolic viewpoint, can we say if they are producing oxygen, or maybe they are
respiring and releasing CO2?
- Can we look into the evolutionary mechanisms and maybe say something about how
these microbes have adapted?
When the method was first developed, scientists thought it had the potential to
revolutionize our understanding of the entire living world. Of course, that's not exactly what
happened.
Those are some examples of interesting problems you can try to tackle with
metagenomics. The most important thing is that metagenomics is a great tool because it
enables us to view the previously hidden diversity of microscopic life.
“Resulting metagenomic data” slide
It's important to conceptualize where the sequences you see in that file are coming from.
You have to realize that you are getting more or less randomly shattered parts of microbial
chromosomes, without any distinction. For instance, your first read could be part of the
beginning of the chromosome of this guy, while your second read is a tiny piece of the end of
the chromosome of this other microbe floating further away, etc.
By the way, this picture is of course not to scale. A read is only a very small fraction
of a complete bacterial chromosome.
With current sequencing technology, the number of reads you get is roughly on the same
order of magnitude as the number of microbes in the original sample. For instance, if you
perform an experiment starting with 1 ml of water from your typical river and obtain one
million short sequence reads, you can conclude that on average each microbe has
contributed one read to your final dataset.
“The overview: two step process” slide
So let's summarize what we have learned up until now.
Metagenomics is really best understood as a two-step process, with the defining middle
point being the sequence file containing the reads.
The first step involves taking your glass of water from the lake, bursting all the cells inside,
extracting all the DNA and putting the result into a high-throughput sequencing machine.
This produces a file somewhere on a server.
The second step involves using the information produced in a meaningful manner. This is
where you use the data generated to test a hypothesis or answer a question. It will often
include synthesizing the data in some way and visualizing it with different kinds of graphs.
A simple example here is the breakdown of a community by order rank. But often you
might want to predict the metabolism of the bacterial community: what
compounds are they using for energy, and what compounds are they producing? Then you
would need some more complex visualization.
The keen eye will notice that the species in this pie chart are not ones that you would
find in the environment. Indeed, Burkholderia and Streptococcus look much more like a
host-associated community. This particular graph was copied from a study of the
bacterial community collected from the lung of an 18-year-old patient. You see already that
metagenomics is also applied in the medical field. More about that later.
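As an aside, once every read has been assigned a taxon, producing a breakdown like the one in this pie chart is conceptually just counting. Here is a toy sketch; the assignments are made-up placeholders standing in for the output of a real classifier.

```python
from collections import Counter

# Hypothetical taxonomic assignments, one per read
assignments = ["Burkholderiales", "Lactobacillales", "Burkholderiales",
               "Pseudomonadales", "Burkholderiales", "unclassified"]

counts = Counter(assignments)
total = sum(counts.values())
for order, n in counts.most_common():
    print(f"{order}: {100 * n / total:.1f} %")
```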
The first step of a metagenomic experiment happens in the real world and often includes
some exciting sampling expeditions to places that have not been explored before.
Microbe-wise, of course.
The second step happens entirely on the computer, almost exclusively through programming
languages and command-line interfaces.
There are different kinds of metagenomics and different procedures, but keep in mind
that this scheme holds for more or less all metagenomics experiments.
“The lecture from now on” slide
So we have gone through the introductory slides and we are ready to go deeper and get
into the juicy parts now.
We have many interesting things to cover.
- We need to talk about the sequencing technology, because in my previous example you
kind of magically got the sequences in a file starting from a DNA solution.
- We need to put some context around metagenomics and go over the other "-omics"
technologies.
- We need some more examples of actual ecological questions that this method can
answer, otherwise we will keep wondering why we are doing all this.
- We need to talk a bit about how microbial populations form.
- We need to see how the bioinformatics helps us go from the file with all the sequences to
something meaningful, and what problems to avoid there.
- And we will end by taking a bunch of examples from past studies and seeing some of the
cool stuff that was done using metagenomics.
I hope you guys are excited!
“Overview: two step” second slide
So we saw that metagenomics can be described as a two-step process.
To complete your understanding of what happens in the laboratory part of things, we will
now briefly review how the sequencing technology works. After that we will be
ready to turn to the bioinformatics and the data analysis.
“Sequencing technologies” slide
A sequencer is a pretty complicated machine. But it's actually quite interesting to see how
it works.
I imagine you have probably heard a bit about DNA sequencing technologies before, so I'll
go rather quickly over this step.
The important thing to know is that the price of sequencing has been dropping constantly
over the years. If you put on the Y axis the amount of money you need to pay to
sequence one megabase of DNA (one million nucleotides), and on the X axis you put
the date, it would look like this. (➤➤➤)
You can see that the Y axis is logarithmic, so the prices are in fact dropping exponentially.
Companies are continuously trying to improve the technology and come up with
sequencers that produce longer reads at a lower cost. It is quite a big market, and at the
annual conference for these things it is funny to see the stock price of one company go
up the second they make impossibly good announcements about the new products they
are going to be selling, while the stock of other companies that don't innovate falls down.
Often the impossibly good announcements never come true, of course.
The price in itself is completely irrelevant. But what it means for us is that more and more
laboratories can get access to the technology.
Several technologies have been picked up and then abandoned over the years. As I
was looking over last year's slides, I realized the information there was already completely
outdated. For instance, you might have heard of "454 pyrosequencing" in the past? That
was a product sold by the company "Roche", but just four months ago they decided they
were shutting down the factory and wouldn't sell any more of them. So all we had learned
about those is useless now.
What I am going to do here is present just one of the most popular sequencing methods,
the one we are currently using for our research in my group, the Illumina method,
so you can get an idea of how it can work.
“Illumina technology video” slide
To show you how it works inside I have a video.
What you see here are the DNA strands you want to sequence. First, the DNA is
fragmented and the broken ends are repaired and adenylated. Special adapter nucleotides
are ligated to both ends of every fragment. The fragments are then size selected and
purified.
Cluster generation:
What you see here is the flow cell, the central part of the machine; a dense lawn of short
oligonucleotides has been grafted onto its surface. These oligos have sequences
complementary to the adapters that were ligated to the fragments, hence the fragments are
going to be able to bind to them. That's what happens: once the fragments are hybridized,
they are extended with a polymerase to create copies. Then, by changing the temperature,
you can make them bend over and attach again. Repeating this will amplify the starting
sequence hundreds of times.
In the end you get hundreds of millions of unique little colonies. The reverse strands are
cleaved and washed away. Then the ends are blocked, and the library is ready for sequencing.
Sequencing:
What happens now is that every little colony is going to be sequenced in parallel, at the
same time. All four bases are added as a solution to the flow cell. Only these are not normal
nucleotides: they have a fluorescent particle attached to them. All Ts have a red dye
attached, for instance. The four bases compete with each other to bind to the template.
After each round of synthesis, the clusters are excited by a laser, emitting a color that
allows us to identify which base was added. The fluorescent label and blocking group are
then removed. And the next cycle starts: the four bases are added again, etc.
This enables you to read the full sequence.
“Illumina flow cell pictures” slide
Here on the left you can see what a flow cell actually looks like. It's very small, yet has
about 15 million small DNA clusters on it.
And on the right you can see the kind of pictures that come out of the camera filming the
flow cell. By classifying which color was detected for which colony, the software will
recompose all the DNA sequences. Of course, the automatic image processing program
isn't perfect and makes mistakes. For instance, it is hard to make a decision when clusters
are right next to each other or overlapping.
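Conceptually, the base calling amounts to translating, for every cluster, the series of colors observed over the cycles back into nucleotides. Here is a toy sketch of that idea; the color-to-base mapping is purely illustrative and not Illumina's actual dye chemistry.

```python
# Hypothetical dye-to-base mapping; real instruments use their own chemistry
DYE_TO_BASE = {"red": "T", "green": "A", "blue": "C", "yellow": "G"}

# One list of colors per cluster, one color per sequencing cycle
clusters = [
    ["green", "red", "yellow", "blue"],
    ["blue", "blue", "green", "red"],
]

for colors in clusters:
    print("".join(DYE_TO_BASE[c] for c in colors))  # e.g. "ATGC"
```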
“Indels and substitutions” slide
This leads to abnormalities in the results file. Imagine for instance that the real sequence
contained in one of the bacteria was the one at the top. The Illumina machine might make a
mistake and substitute an A for a G. (➤➤➤) This is not too problematic, depending
on what you want to do with your sequences later.
What is much more dangerous is when the machine drops a letter or adds one by mistake.
(➤➤➤) In this case, when you try to translate the sequence into the corresponding amino
acids, you will get what is called a frame shift. (➤➤➤) And every amino acid predicted
after the deletion will be wrong, because you will have jumped a letter.
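You can demonstrate the frame shift to yourself in a few lines. This sketch assumes you have Biopython installed; the coding sequence is made up.

```python
from Bio.Seq import Seq

original = Seq("ATGGCCATTGTAATGGGC")  # a made-up coding sequence
print(original.translate())           # MAIVMG

# Drop a single nucleotide, as a sequencer deletion error would
deleted = original[:4] + original[5:]
# Trim to a multiple of three so the translation is well defined
deleted = deleted[: len(deleted) // 3 * 3]
print(deleted.translate())            # MAL*W -- every amino acid after the error changed
```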
But we are getting ahead of ourselves.
“Bioinformatics” slide
OK, so now that we understand how that file with the DNA sequences is created, we slowly
turn to the bioinformatics part of things. Maybe we should start by defining that
term, because its meaning varies depending on who you ask. What does bioinformatics
mean?
- Originally it does not mean biological computer science, even though that's how you will
often hear it used. The "informatics" suffix isn't meant to refer to the computer, but simply
to the science of information.
- Here is the definition I use for the word: (➤➤➤) Bioinformatics is the science and
technology of biological information. It is a theoretical science that develops mathematical,
statistical, and algorithmic methods and makes use of experimental data (in digital form),
in an effort to discover the functions and modalities of information processing in the cell.
- The thing is: the management, storage and processing of biological data across large
computer networks is an important, although technical, aspect of bioinformatics, and is
often presented as bioinformatics itself, which it is not. This means, for instance, that when
you are writing a program to convert the reads file from one format to another because
the analysis program you want to use doesn't accept the Illumina file format, you are not
doing science; you are just losing your time on irrelevant technical problems.
I like to say that bioinformatics is concerned with modeling and thinking about biological
processes not as chemical processes nor physical ones, but as informational processes.
For instance, a biologist could describe DNA in a cell as chromatin and would think of
mRNAs as smaller, more fragile molecules. A bioinformatician would say DNA is a constant
piece of information in the cell, while the mRNA is a rapidly changing one that is derived
from the first, in analogy to a hard disk and the RAM.
“Digits from microbes” slide
There are a multitude of ways to observe microbes. But to perform bioinformatics with
them, we can't just look at them under a microscope; we need to obtain some type of digital
data from them. There are several ways to do that, and we just described one, but it's
important to keep in mind that there are others. Bioinformatics is not always done on genomics.
We could take all the mRNAs from a cell and put them through the same sequencing machines;
in that case we would be looking at the parts of the genome which are transcribed, and it
would be called transcriptomics.
You can also obtain digital information from microbes by extracting the proteins from the
cells and inserting them into a mass spectrometer. That would be called proteomics.
We are just going to focus on genomics for the moment, of course.
“Genomics from microbes” slide
So if we are doing genomics, we are focusing on data that is derived somehow from the
sequence content of the genomes of microbes. But here again there are several ways of
obtaining digits from the genome, and there are actually two different variations of the
metagenomic technique we have been talking about. (➤➤➤) We will describe these
two in more detail in the following slides, but it is good to know about the two others.
- Microarrays: they used to be the next big thing, but the scientific community realized after a
few years the numerous problems of this technology, and it has more or less disappeared
today.
- Single cell amplification: using apparatuses like flow cytometers, it is possible to isolate a
single microbial cell from the environment and extract DNA from it. The DNA is then heavily
amplified and sent off to sequencing. This solves some of the problems of metagenomics
but has others. We might come back to this at the end of the class.
Now let's look at the two types of metagenomic experiments. We will start by talking about
"targeted metagenomic sequencing", which is just a slight refinement of the technique we
have been discussing.
“The lab technique with primers” slide
If you remember the lab technique from the first part, we just need to modify it like this.
(➤➤➤)
We add a special pair of primers just after extracting the DNA and before amplifying it. This
has the effect that only the pieces of DNA that match the primers are going to be
amplified. This enables us to target just one gene.
Of course, we should choose a gene that all microbes have, otherwise we will be ignoring a
large part of the diversity with this technique.
“Amplicon primers” slide
Once you have chosen which gene to target, you artificially design two primer sequences
that will bind to that unique region on the microbial chromosome and enable the Taq
polymerase to amplify only that part in the PCR.
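To give you an idea of what the primers do, here is a hedged sketch of an in-silico version of this amplification: find where the forward primer matches, find where the reverse primer would bind on the opposite strand, and extract the stretch in between. All sequences are toy examples, and real primer matching tolerates mismatches and degenerate bases, which this ignores.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def amplicon(template, fwd_primer, rev_primer):
    """Return the fragment a perfect-match PCR would amplify, or None."""
    start = template.find(fwd_primer)
    if start == -1:
        return None
    # The reverse primer binds the other strand, so we look for its
    # reverse complement on this strand, downstream of the forward site
    rev_site = template.find(reverse_complement(rev_primer), start + 1)
    if rev_site == -1:
        return None
    return template[start : rev_site + len(rev_primer)]

chromosome = "GGGATTACCATGCGTTTAGCCGGTAAAGTCC"
print(amplicon(chromosome, "ATTACC", "GACTTT"))
```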
“Resulting targeted data” slide
If we go back to this picture: when you use targeted sequencing, in the reads file you are
actually getting very similar sequences each time, since you are targeting the same region
in every microbe. Of course, the sequences are not exactly the same, otherwise you would
learn nothing.
“A ribosome” slide
So what gene are we going to target? The most popular gene used for targeting is part of
the ribosome.
As you know, living cells make proteins, and to make proteins they need some apparatus
that assembles the correct chain of amino acids by reading the messenger RNA. This is the
role of the ribosome. Absolutely all living cells have ribosomes in them, which is nice, as it
means absolutely all living cells have the genes to make ribosomes in their genome. We can
use this to our advantage and target one of the particular genes that participates in this
process to probe a microbial population.
You are looking at a picture of the ribosome from E. coli, recomposed from a
crystallography experiment. As you can see, it is made of several pieces that fit together,
some pieces of protein and some of RNA. The particular gene that was chosen by the
microbiology community is the piece of RNA colored in blue here. It is called the 16S rRNA
because, when you centrifuge a denatured mixture of ribosomes, that part comes out
at the 16 Svedberg units mark.
Not only do we know the structure, we also know the nucleotide sequence of this gene.
(➤➤➤)
“2D ribosome” slide
Here you can see the full sequence projected onto two dimensions.
The 16S rRNA was chosen because the sequence is very conserved and is almost the
same in all bacteria we have ever seen. It differs somewhat in archaea, though, and is
quite different in eukaryotes.
It's present in all life forms: whatever genome you are looking at, maybe a bacterium,
maybe an archaeon, you can be sure that somewhere there is a gene to make ribosomes.
If you were to unroll it, you would see that it measures about 1500 nucleotides. Most of it is
very constant, but some parts are highly variable between species, as they
are not under the same selective pressure. (➤➤➤) These pieces of the gene are free to
mutate and drift randomly in their composition as evolutionary time goes by. This variation
is what enables us to extract valuable information from the genetic sequences and say
something about the diversity of bacteria living in a particular sample, sometimes even to
distinguish between species of bacteria.
It is considered a good method for classifying bacteria and, generally, the phylogenies built
using the 16S gene agree with phylogenies built using other marker genes.
“Primer choice” slide
Of course, the read length of typical sequencing technologies is pretty short, so unless you
are using the old Sanger sequencing, you are actually only getting a piece of the 16S gene,
of maybe 400 base pairs. In this case the choice of primers is important, and can greatly
influence the outcome of your experiment.
Take as an example the freshwater bacterial lineage "OD1", which was only recently added
as a new branch in the tree of life: it appears it went undetected for so long mostly because
the common choices of primers did not match its particular 16S sequence.
Now we realize it is quite abundant in almost every freshwater system. Several other
bacterial lineages suffer from the same problem and will be underestimated depending on
the primers you use.
“Reads to diversity” slide
After all this, we are finally ready to answer the question: how do we get from the
sequences to a proportional breakdown of microbial species?
Let's walk through a typical analysis pipeline.
“Typical 16S analysis pipeline” slide
In this example we are going to use a reduced data set: instead of having thousands of
sequences, we are going to start with these seven fake 16S sequences in a file.
[...]
You also have to consider that sometimes you don't find any similarity in the database, in
which case you cannot classify your sequence. Or you find some similarity, but you can
only classify your sequence at a very high level, saying for instance that it is part of the
Alphaproteobacteria class, but not which species it is.
This is of course somewhat technical, but it's important to understand what is going on
under the hood. Often someone presents a study and simply says "we collected a sample
from this place and here is the bacterial composition"; they often forget to mention all of
the processing between those two steps. You now have an idea of what is involved
in actually producing the measures.
“A real plot” slide
Of course, in a real study you would end up with a plot looking something more like this. It
will include several samples of different systems, or taken at different times.
This is just an example taken from one of the studies in my thesis, where we collected
bacterial samples from lakes with very high pH, called soda lakes.
“Statistics” slide
The next step in this process is to use some statistical tools in an attempt to understand
what is going on, or to extract some general mechanisms. You could try to make a plot like
this one, where you place every one of your lakes in a two-dimensional space according
to the microbial composition they had, and you add on top the correlation with the
environmental variables you measured, such as temperature, pH, conductance and the
like.
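An ordination like that can be sketched in a few lines of Python. This is only an illustration of the idea, assuming scipy and scikit-learn are installed; the abundance numbers are invented.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Rows are lakes, columns are species abundances (made-up numbers)
abundances = np.array([
    [120, 30,  0, 15],
    [100, 40,  5, 10],
    [  2, 80, 60,  0],
    [  0, 70, 90,  5],
])

# Bray-Curtis dissimilarity between every pair of lakes
dist = squareform(pdist(abundances, metric="braycurtis"))

# Squeeze the distance matrix into two dimensions for plotting
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
print(coords)  # one (x, y) point per lake
```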
At the end, it's up to you to choose the relevant comparison statistics or models to answer
your scientific question. You can do many different things by combining bacterial
population abundances from multiple samples with other types of data. For every new
study, different statistics will be applied.
“The 16S rRNA databases” slide
As you have seen, a crucial step is to choose a taxonomic database when you want to
assign species to your quality-filtered reads.
These databases contain the result of years of experiments on microbes. Without them, it
would be impossible to classify the sequences with the names we are familiar with.
There are three main providers of such databases: Silva, Greengenes, and RDP. They
are comparable, but sometimes use slightly different names for the same species and are
not entirely compatible. The most popular are probably these two, I would say. In our
studies we use this one, as it seems to be the most up to date and has the largest number
of sequences. But one can't say that any of them is clearly superior to the others. This is
once again a choice that the scientist has to make somewhat arbitrarily.
“Software for 16S” slide
If you need to work with this kind of data one day, there is software that can help you. But
these tools mostly have command-line interfaces and require knowledge of the Unix
operating system.
“What is a species” slide
This brings us to the question: what exactly is a species in the microbial world?
Generally, two bacteria are considered to be part of the same species if their 16S
sequences diverge by no more than 3%. But previous to that definition, one would use
DNA hybridization methods, and species were defined as any two bacteria whose DNA
hybridizes over 70%.
Of course, the two definitions agree for some species but not for others.
This shows us that the definition of what one species is can change depending on the
scientific community's opinion and the tools they have.
For instance, when studying diversity, the classification of what is different or, on the
contrary, considered the same is an area of debate, and will greatly influence any measure
of diversity you can make.
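To make the 3% rule concrete, here is a minimal sketch of the comparison. It assumes the two 16S sequences have already been aligned so they can be compared position by position; the sequences are toy fragments.

```python
def percent_divergence(seq_a, seq_b):
    """Percentage of differing positions between two aligned sequences."""
    assert len(seq_a) == len(seq_b), "sequences must be aligned first"
    diffs = sum(a != b for a, b in zip(seq_a, seq_b))
    return 100 * diffs / len(seq_a)

a = "ATGCGTTAGCCGGTAAAGTCATGCGTTAGC"
b = "ATGCGTTAGCCGATAAAGTCATGCATTAGC"  # two mismatches

d = percent_divergence(a, b)
print(f"{d:.1f} % divergence ->",
      "same species" if d <= 3 else "different species")
```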
“What is a species second” slide
In the macroscopic world, and especially with mammals, it is easy to define what a
species is: if two individuals can reproduce, they are from the same species; if they can't,
they are from different species. But bacteria reproduce asexually by just
dividing; they grow fast and mutate quickly. So what does a species really mean? They
are also able to exchange genetic material horizontally, by just secreting DNA and taking it
up through the cell membrane, making things even more complicated.
To develop the idea that the concept of a species is somewhat arbitrary in the microbial
world, consider the following:
if you were to classify the species of mammals in a forest, you could count the different
animals and come up with a result similar to the one presented here. There are 4 foxes and
6 rabbits in the forest. In this case you can be certain that there is nothing in between
the two species. If there ever was something genetically in between a rabbit and a fox, it is
extinct today, and you will never find it in your forest.
It is not necessarily the same with bacteria. With the 16S method, depending on which
classification algorithm you use and which database you choose, you can end up with
something such as the following. It should therefore perhaps better be viewed as a
continuum, and could be drawn as follows.
“16S rRNA conclusions” slide
To conclude this chapter, we could say that the 16S rRNA method is good because it was
able to reveal the previously hidden diversity of microbes in the environment. But one has
to be careful about the biases of the method, such as which primers are chosen, which
processing is chosen and which database is chosen. Also, one cannot really compare two
studies which have not followed exactly the same protocol.
Another shortcoming of this technique is that it doesn't give much information
about what the microbes are doing metabolically.
“Example studies” slide
Now, to illustrate what you can do with the 16S rRNA technique, here are three studies that
I picked from the literature.
1) In the first study, they sampled 18 different lakes in Wisconsin and looked at a
particular population of freshwater Actinobacteria. Each lake of course had its unique
composition, made up of different strains of Actinobacteria, showing that the microbes don't
disperse across the environment very rapidly. Furthermore, what best explained the
community structure of the population was not the distance between the different lakes but the
environmental factors of each lake, such as its pH.
This is interesting because it supports the general ecological theory that "everything is
everywhere, but the environment selects", meaning that every microbial species can
potentially arise anywhere as long as the conditions are good for it, and that the limiting
force is not dispersal.
2) Second, here is a study where they questioned what factors influence the
composition of the microbiota in the gut of different mammals, and how the host and
microbes co-evolve together. Is it what you eat that determines the population of microbes,
or is it your ancestry?
They found that the phylogeny of the host was the dominating effect, and that the diet had
a strong secondary effect. Another finding was that the communities of humans living a
modern life-style were typical of omnivorous primates.
3) Finally, an example of a study from the medical field where they compared the microbial
communities of newborn babies and tried to determine what factors shape their microbiota
in the beginning. They found, for instance, that the way a baby is born is the biggest factor
contributing to the composition, and that babies born by C-section don't get inoculated with
the same starting population.
Indeed, the medical field is also very interested in bacterial communities, as they are
linked to many different diseases. It also generally has more money than the
ecological field, which is perhaps a questionable choice of our society. But let's not talk
politics!
Microbes are mostly linked to gastrointestinal diseases, but a recent study, for instance,
showed a strong link between the gut microbiota and Parkinson's disease. There is also an
interesting procedure to cure some patients of inflammatory bowel disease where, in
order to reset the bad microbial population in the colon, the patient is given a fecal
transplant from another, healthy person.
“Example studies” references
Newton RJ, Jones SE, Helmus MR, McMahon KD: Phylogenetic ecology of the freshwater Actinobacteria acI lineage. Appl Environ Microbiol 2007, 73:7169-7176. http://www.ncbi.nlm.nih.gov/pubmed/17827330
Ley RE, Hamady M, Lozupone C, Turnbaugh PJ, Ramey RR, Bircher JS, Schlegel ML, Tucker TA, Schrenzel MD, Knight R et al.: Evolution of mammals and their gut microbes. Science 2008, 320:1647-1651. http://www.ncbi.nlm.nih.gov/pubmed/18497261
Dominguez-Bello MG, Blaser MJ, Ley RE, Knight R: Development of the Human Gastrointestinal Microbiota and Insights From High-Throughput Sequencing. Gastroenterology 2011. http://www.sciencedirect.com/science/article/pii/S0016508511001600
“The lab technique 3” slide
OK, you remember how we modified the lab technique to add primers targeting the 16S
rRNA gene? We are now going to look at what happens if you remove that step. (➤➤➤)
“The shotgun metagenomic” slide
In that case we don't target any particular gene, and we get random pieces from anywhere
on the microbial chromosomes, as shown in this picture. Because of the randomness of
breaking the DNA in the sample, this technique is called shotgun metagenomics.
“Classical shotgun sequencing” slide
It's important to remember the difference between shotgun metagenomics, which is applied
to free-living microbes, and classical shotgun sequencing, which can be applied to
monoclonal microbes growing on a plate.
In a classical sequencing experiment you only take microbes that are growing in the same
colony, so though you get random parts of the chromosome, you know that your reads are
all fragments coming from copies of the same chromosome. This makes things much
easier to piece together afterwards.
“The reads” slide
So what do you do this time with your reads? They are not coming from the same
region of the chromosome, nor from the same type of bacteria.
Does anyone have an idea?
“Genomic assembly” slide
What one can do, if one has enough reads, is try to assemble them together. That is, try to
find overlaps between the sequences and build bigger pieces.
That's a pretty easy task if you have three reads. You can simply compare the end of each
read with the start of each other read and check for matching parts. Finally, you can build a
longer sequence from the three starting reads. This is often called a contig, standing for a
contiguous piece of DNA. Finding matching parts between sequences seems like an easy
problem to solve, but when you have several million reads you need to find another
strategy; you can't just compare every read with every read. An algorithm that would
simply try all the comparisons would take longer to execute than your life
expectancy, and is thus useless.
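For a handful of reads, though, the naive approach fits in a few lines. Here is a hedged sketch; the reads are invented, and a real assembler must also handle sequencing errors, reverse complements and millions of reads.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` matching a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0

reads = ["ATGGCGT", "GCGTGCA", "TGCAACT"]

# Greedily chain the reads together using their pairwise overlaps
contig = reads[0]
for read in reads[1:]:
    size = overlap(contig, read)
    contig += read[size:]

print(contig)  # ATGGCGTGCAACT
```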
Many smart algorithms have been developed to counter this problem, and nowadays you
can assemble even human-sized genomes on a single computer. Of course, for a large
genome like that, you still need a pretty big computer.
“Networks and graphs” slide
The problem is often solved by taking the reads and building a network, or graph, with
them. For instance, in this simple example we have a very small microbial chromosome
with 10 nucleotides. It might generate the following reads when randomly broken up. By
drawing links between the reads that have an overlap, you can then just follow a path in
the graph that goes through each read once and recover the assembled sequence, or
contig.
“De Bruijn graphs” slide
However, this strategy quickly reaches a limit, as the problem of finding a path that goes
through every node in a graph exactly once is very demanding computationally when the
graph is big. This is why, for data coming from the Illumina machines, a different type of
graph is used, called a De Bruijn graph, where instead of putting the reads as nodes in the
graph, the reads become edges. This means you have to find a path that goes through every
edge once, which is much easier for a computer.
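For the curious, here is a toy sketch of the De Bruijn idea, using a made-up 10-nucleotide chromosome (in the spirit of the previous slide's example) and 3-mers as edges. Note that when a (k-1)-mer is repeated, several Eulerian paths can exist, which is exactly the repeat problem that makes real assembly hard.

```python
from collections import defaultdict, Counter

def de_bruijn(kmers):
    """Graph where nodes are (k-1)-mers and every k-mer is an edge."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph, start):
    # Hierholzer's algorithm: follow unused edges, backtrack when stuck
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop(0))
        else:
            path.append(stack.pop())
    return path[::-1]

genome = "ATGGCGTGCA"  # toy 10-nucleotide "chromosome"
kmers = [genome[i:i + 3] for i in range(len(genome) - 2)]

graph = de_bruijn(kmers)
# The start node has one more outgoing than incoming edge
outs, ins = Counter(), Counter()
for kmer in kmers:
    outs[kmer[:-1]] += 1
    ins[kmer[1:]] += 1
start = next(n for n in outs if outs[n] > ins[n])

path = eulerian_path(graph, start)
print(path[0] + "".join(node[-1] for node in path[1:]))  # ATGGCGTGCA
```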
We don't need to go into more detail about these complex mathematical tools, unless you
ask me to, of course. But it's interesting to point out that, as the tools for doing biology get
more and more technical, it becomes increasingly harder for a student coming from a
standard biology curriculum to understand what is really going on. Quite often, biologists
lack the programming and mathematical skills to deal with this new kind of biological data,
and almost always need to associate with or hire a specialist to conduct such studies.
“From contigs to genomes” slide
OK, but all you need to understand for the moment is that the result of the assembly
operation is a smaller set of longer sequences.
Unfortunately, when you assemble your metagenomic data you never end up with contigs
that are the size of a typical bacterial genome, which is about 3 megabase pairs. You get
lots of smaller contigs, measuring maybe only a few kilobase pairs. This is because sometimes a
piece of the bacterial chromosome is under-represented in the reads, or because of
multiple repeated regions in the genome.
So how do you decide that two contigs go together and are part of the same
original genome, since by definition they don't share any overlap in sequence? It's a hard
problem, especially if your sample contains lots of different species. But there are some
smart tricks. Even so, often one is only able to pull out a few semi-complete genomes, even
with a large number of starting reads.
If you think about such a problem, a real-world analogy would be the following task: take
five different English novels and print a thousand copies of each. Send all the five
thousand books through an industrial shredder, tearing them to pieces. Take all the small
pieces of paper left from the process and introduce errors in them. Then bury them in the
ground for a week, so that many of them become unreadable and rotten. Take all of them out
and try to recompose the original stories by piecing the papers together. You could use
tricks such as identifying each author's style and using that to group chapters
together.
In the same way, you can use the information contained in the contigs to separate them. A
simple statistic is to compute the ratio of GC nucleotides to AT nucleotides: a
bacterial genome usually has a fairly constant ratio across its chromosome. You can also use
the fact that microbes have different abundances in the sample and hence contribute more
or fewer reads to the dataset. If one microbe makes up 20% of a given community
and another only 10%, then all the contigs belonging to the first microbe should
be covered, on average, by twice as many reads at every position.
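The GC statistic, at least, is trivial to compute. Here is a small sketch of how contigs might be characterized before binning; the contigs are made up, and real binning tools also combine this with read coverage and other signals.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

# Made-up contigs; in practice these come from the assembler's output
contigs = {
    "contig_1": "ATGGCGCGCCGGCGCATGCGGC",
    "contig_2": "ATTATAATTGCATTAATATATT",
}

for name, seq in contigs.items():
    print(f"{name}: GC = {gc_content(seq):.2f}")
# Contigs with similar GC content (and similar coverage) are
# candidates for belonging to the same genome
```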
“Ideal case” slide
In an ideal case, you would get clearly separated clusters for both of these measures and
would be able to put together four different genomes. But that's rarely the case.
“Real case” slide
Here is a graph that was taken from a study.
You can see some clear groups, but most of it is very mixed.
“FASTA file to metabolic graph” slide
How do you use such data? How would you go from your contigs file to the conclusion
that the microbes in your sample have the genes to degrade chemical compound X, but
not the ones for its transformation into compound Y?
Let's walk through a typical analysis pipeline.
“Typical analysis pipeline” slide
[...]
One of the problems with this approach is that often half or more of your detected open
reading frames have no known function.
Once again, you can do this for every sample; it's then up to the scientist to choose the
mathematical methods or statistics to compare samples amongst each other, depending on
the question asked.
“Databases for proteins” slide
In such a pipeline, an essential part is again the database you choose for inferring function
when you detect a protein. There are many such databases; the main ones are listed here.
Uniprot, for instance, is one of the largest databases of proteins with functional information.
It has about 30 million entries, but only a few thousand have been confirmed by extracting
the protein; about 2% have evidence from some type of published experiment, 20% are
inferred by strong homology, and the rest is predicted function.
You have to keep in mind that most functions are just predictions, so be careful when
using these databases. There is a notable problem when new genomes are annotated with
information coming from the database, and then the same genomes are later used to fill the
database with the conclusions. This makes the information stored there a reflection of itself
rather than a reflection of reality.
“Software for metagenomics” slide
As I said earlier, bioinformatics is mostly done with interfaces like the command line and
programming languages like Perl and Python. There are no easy user interfaces. There
are, however, some online tools for processing shotgun metagenomic data that can
automate large parts of the procedure.
There is the MG-RAST website, where you can upload a FASTA file and it will give you a
page full of statistics, from taxonomic distribution to metabolic processes, including diversity
measures.
The European Bioinformatics Institute also offers a similar service, but it has been
changing lately and their new version is still in beta.
“Shotgun conclusions” slide
To conclude this chapter, we can say that the most difficult part of shotgun metagenomics
is assembling the reads, and then the contigs, in a satisfactory manner. Often the procedure
gives you incomplete genomes, or genomes that are called chimeric because they are the
result of wrongfully merging different species together.
In fact, the storage and analysis costs of a metagenomics experiment are nowadays always
higher than the lab supplies and sequencing costs. If you are a small lab
without the bioinformatics knowledge, it may not be accessible to you.
Predicting the function of the microbes is the second hardest problem, with about half of the
proteins having no similarity found in the databases.
Even when you are able to predict the function of a protein you found, one problem
remains: seeing a gene in the genome of an organism doesn't mean it's being
actively transcribed, translated and folded properly.
Hopefully, new sequencing technologies will appear in the coming years that produce
longer reads, making assembly easier. Since we are currently calling today's technology
"next generation sequencing", I have written "next next generation sequencing" on the slide.
Nonetheless, shotgun metagenomics enables us to gather lots of information about microbes
that are otherwise impossible to isolate and culture.
“Everything” slide
Doing metagenomics is also quite popular. On this slide are some randomly chosen
examples from a metagenomic repository. You can see that everything and anything has
been sequenced. You have:
- A Rice-straw enriched compost from Berkeley
- An Acid Mine Drainage
- Leg ulcer microbial communities
- Hot spring microbial community
- Saline water from the Dead Sea
“Everywhere” slide
Here is a screenshot showing the world map and the number of metagenomes recorded
in each particular area. This is probably just a fraction of all the metagenome sequencing
that has been going on, as many experiments are not submitted to the public databases or
tagged with the appropriate GPS coordinates.
“Example study 1” first slide
To demonstrate the kind of studies that can be done, I picked an article from the literature
that represents one of the first success stories of metagenomics.
In this study, scientists from Berkeley sampled microbes from biofilms growing in the water
coming out of an abandoned iron mine. You can see a picture here of the greenish
bacterial slime that was collected. This mine drainage water is extremely acidic, with one of
the lowest pHs ever measured in a natural system, around -3. This of course is detrimental
to the environment and is considered pollution. The microbes living there actually contribute
to the pollution, as they catalyze the reactions that release the metal ions.
I don't want to go into too much detail, but what they did was start with a targeted 16S study,
which showed that there were only three different bacterial lineages and three archaeal
lineages in the sample, making it an extremely simple microbial community. Usually you would
find thousands of different species in a sample. However, it is often the case that in extreme
environments the complexity is lower.
Next, they applied shotgun sequencing to the sample and were able to organize their
contigs into clearly distinct groups using coverage and GC content. In one of the groups
they recovered a nearly full genome that was related to the Ferroplasma group of
organisms. The 16S gene was similar to that of a cultivated representative of that group, and
the total size of the genome matched that relative. But the rest of the genome was
sufficiently different to establish it as a new species. They also found a new organism
related to the Leptospirillum group.
But what is really interesting is that by analyzing the contents of the newly retrieved
genome they could predict almost all the functions of the cell and draw the following
diagram.
“Example study 1” second slide
By finding which proteins were doing what, they were able to reconstitute an almost complete
metabolic map of the cell. You can also get information by looking at what is missing in the
genome. For example, neither the Leptospirillum group II genome nor the Ferroplasma
genomes seem to contain the genes necessary for nitrogen fixation, suggesting that
this essential task might be performed by other, less abundant members of the community.
You can start answering questions like "How do these communities resist the toxic metals
and maintain a neutral cytoplasm?"
You can also start looking at the evolutionary events that these organisms have undergone
by examining the amount of nucleotide diversity in each part of the genome. For example,
the Leptospirillum group appears to have undergone a recent selective sweep, judging by
the very low level of nucleotide polymorphism. On the other hand, our new Ferroplasma
friend seems to have been subject to homologous recombination events.
So now you see how one can predict the metabolism of microbes that don't want to be
cultured.
“Example study 1” reference
Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 2004, 428:37-43. http://www.ncbi.nlm.nih.gov/pubmed/14961025
“GOS” slides
1.045 billion base pairs of nonredundant sequence
-> Fishing expedition problem.
-> Found a rhodopsin in the Bacteria kingdom.
-> Found unexpected diversity and nitrification in the Archaea kingdom.
-> Have to make laboratory experiments to confirm findings. These results can only give
you clues.
-> DeLong et al., Science, 2006
“Targeting larger organism” slide
-> Environmental forensics.
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0035868
“Cryptic fluctuations” slide
Even after all these studies, the truth is we don't really have any universal models to
explain how microbial communities form and change over time. It is something that is very
hard to predict, even if you were given all the external inputs (such as the hydrology or the
amount of precipitation) and all the other driver variables (such as temperature, pH, or the
amount of sunlight).
The internal processes of growth and competition between microbes are so complex that
they can't really be summarized in simple equations. You also have to consider even more
factors, such as predation on the microbes by higher organisms and by viruses.
There is still lots of uncharted territory, so it's quite an exciting field.
“Unexplored territory” slide
Here is a graph showing all the different kinds of things we can still do to get a better
understanding.
Today we have mostly been speaking of taking whole communities and subjecting them to
DNA metagenomics, but one can also do transcriptomics and proteomics. One can also try
to lower the complexity of the community and apply the same tools, or try to
isolate single cells and do the same things.
Single-cell genome amplification, for instance, is just starting to get popular, and a new
center specialized in it is being created here in Uppsala.
“Combine and conquer” slide
One shouldn't forget to also combine all these experiments together; it's only by
attacking the problem from all sides that we will start to see things more clearly.
Thanks for listening.