“Introduction” slide
OK, so you've all read that today's course is about metagenomics. But what does that mean? Well, the answer is that the "meta" prefix is added only to signify that there is more than one genome involved. The picture here is supposed to illustrate that fact. In summary:
- Genomics studies one genome
- Metagenomics studies several genomes collectively

“Genomics” slide
Before defining metagenomics more precisely, I think we should step back and remind ourselves exactly what genomics is and how it differs from the more usual term "genetics". Genomics is in a sense a sub-category of genetics: it studies the whole genetic information of an organism, not just a single gene. The "omics" suffix conveys the notion of a systematic and comprehensive study. A sort of totality. For instance you could say "Genomics is the simultaneous study of all genes, all discrete features in a genome" -- in the same way, for proteins you would say: "Proteomics is the study of all the proteins in a cell". Then you have other terms such as lipidomics, metabolomics, etc. So, genomics tries to view all the genes as a network or a system; it studies the interactions between them within the context of a cell, via the mechanics of transcription and translation. It tries to study complete processing pathways and attempts to predict the global function of a cell. To do this, genomics uses tools such as recombinant DNA and DNA sequencing, in combination with a lot of bioinformatics to make sense of the data. So metagenomics is doing the same kind of study as genomics, only with several genomes at once!

“Metagenomics” slide
Still, we should define the term properly before going any further. Obviously metagenomics is a science. It is the science that studies metagenomes. That doesn't help us much. More precisely, metagenomics is concerned with genetic material in samples that are extracted straight from the environment. This means, for instance, taking a glass of water from a lake, or taking a liter of soil from the ground with a shovel, and just extracting all the DNA from it. You will find many living organisms in a kilogram of earth, but the only reason you would want to analyze all the genomes in there collectively is that you are targeting the very small organisms, the microbes. If you are interested in the worms living in the soil, for instance, you don't have to study them collectively: you can just separate them by hand and study their genomes individually, saving yourself lots of trouble. The purpose of the metagenomic technique is to investigate the minuscule things that you can't separate by hand or manipulate at all. So, most of the time when someone uses the term metagenomics, you can replace it with "microbial community genetics".

“Microbes” slide
So we said we were interested in microbes. What exactly does that term cover? Most often it's defined by size: a microbe could be any living thing that's too small to see without a microscope, more precisely anything smaller than 100 µm, or a tenth of a millimeter. However, you often see the term used with a more restrictive definition, where something is only a microbe if it is a single cell. In this case the definition excludes all multicellular organisms as well as viruses. For us, in this course, we are going to use the second, more restrictive definition. It does include, however, all of the prokaryotes (namely bacteria and archaea) as well as some of the unicellular eukaryotes, such as those found among the protozoa, the algae, and the fungi.
That's basically what we mean by "microbes". And microbes are very interesting to study: they are more important than plants for the production of oxygen, as they fill the oceans and lakes. They are also essential for the life of plants themselves, as they fill all our soils and are responsible for the production of inorganic nutrients. So, now you are all going to tell me that this is not right, and that you can perfectly well isolate bacteria by growing them in colonies, like what happens to your strawberry jam when you leave it too long in the fridge. Or like on this picture. (➤➤➤)

“Petridish” slide
Even though you can't manipulate them with your hands, why would you have to use metagenomics and study them while they are still mixed together, just to make your life harder? Well, actually you can't really grow them so easily. Only very few types of microbes like to be grown like that and will start multiplying when placed in such a plastic box. Imagine the following experiment: take a glass of water from, let's say, lake Ekoln here to the south of Uppsala. Mix it with water, vortex it, allow it to settle, dilute the supernatant, and take two droplets of equal size from the result. Put the first droplet under your microscope and spread the second droplet on a petri dish with some nice agar and lots of nutrients. Finally, compare the results. What you will see is something like this. (➤➤➤)

“Great plate anomaly” slide
Because the drops were of equal size and contained about the same number of bacteria, you would expect the number of colonies appearing on the agar plate to match the number of microbes you see under the microscope. But this is not the case. You get maybe between a hundred and a thousand times fewer colonies on the petri dish. This is one of the oldest and most profound puzzles in microbiology. We have some hypotheses as to why this is the case, but the question has remained largely unsolved for about a hundred years. What's the first explanation you could think of for this difference? Well, the first possibility is that all but a few of the microbes you see under the microscope are dead, since a microbe has to be alive and capable of growing to be detected by the plate count method. Indeed, the petri dish method will only identify microbes that are viable. But this is not a plausible explanation for such a big difference, as it would mean that most of the microbes we see would have to be dead or in a dormant state. The basic problem is that an agar plate is a very foreign environment for a bacterium to grow on. Even on dishes that are prepared with minimal nutrients, the concentration of some compounds will be much higher than what the microbes would encounter in their natural environment, and might be toxic to them. One must also consider that some microbial species are simply not adapted to growing in aggregates and forming colonies. The current view is that naturally occurring microbial communities form a complex network, where one type of microbe produces a substance that another microbe uses, and the group becomes stable because it regulates itself. This makes it impossible to pull out just one of the members and have it function alone. At this point you should be telling me that it's just a question of finding the right recipe of nutritional media for every strain of microbe one wants to isolate, and that scientists just don't have enough of a "green thumb" with microbes.
Well, the problem has been around for about a century, and three generations of scientists tinkering with the growth conditions have produced only minor improvements to the culturability problem. There are other ways to cultivate microbes, such as growing them inside a container with semi-permeable membranes that is placed back in their original environment, but that is subject to other problems. Another approach that sometimes works is to create very simple communities with only 3 or 4 species of microbes and hope they form a stable system.

“Great plate anomaly real pictures” slide
OK, it would look more like this in real life. The consequence of all this is that we still don't have access to the majority of the biodiversity contained in the microorganisms inhabiting the oceans, soils and air of our planet. And for many of the divisions (the largest taxonomic units in the tree of life) we only know they exist thanks to metagenomic experiments. The conclusions of this slide are that: 1) Microbes don't like to be taken out of their environment. 2) A typical laboratory bacterium such as E. coli is in the minority and is not at all representative of the ecologically relevant organisms. 3) This is why we have no choice but to use metagenomics and study them directly where they all live together. It's a pity we can't isolate them all, but we can still learn a lot by looking at the bulk properties of microbial communities.

“Bacterial counts” slide
Now that we have modern and accurate counting methods, here are some numbers for different environments. The deep sea has about ten thousand cells per milliliter of water, the open ocean about a million. Highly productive estuaries can go up to 10 million. Soil almost always has a high bacterial abundance, reaching about a hundred million. The last example is part of your own body: in your colon you have concentrations of ten billion cells per milliliter. If you do the math on that, you will quickly realize that you have more microbial cells inside your body than cells of your own. This is possible because microbes are so much smaller than your human cells. These numbers will be important later when considering how to analyze the data we get.

“The lab technique” slide
OK, so what is a metagenomics experiment in practice? This is basically the procedure:
1. You collect a glass of water or dig up some earth from your favorite ecosystem. (➤➤➤)
2. You filter whatever you got through some very small pores. This is a pretty important step, as it determines what kind of organisms will be included in the result. Depending on the filter size you choose, you define the cutoff limit for what is too big to be a microbe. You have to consider other things too; for example, filamentous bacteria that grow in multicellular colonies may be filtered out because their colonies are too large. You can also use a second filter with even smaller pore sizes to discard the viruses, defining the second cutoff for what is too small to be a microbe. (➤➤➤)
3. Then, with the cells you have collected, you break their membranes and extract all the DNA that comes out. You can do this either by physically beating the cells or by disrupting them with high-energy ultrasound. At this step you also fragment the DNA into shorter pieces measuring on average maybe 10,000 base pairs, although this size depends on the protocol you use. (➤➤➤)
4. There are a few lab steps to prepare the extract, which may include amplifying the DNA with PCR, but simply put you can basically just insert your solution of DNA fragments into a sequencer. Here in the picture is a machine called "MiSeq", produced by the Illumina company. We don't have these machines in our lab; there is actually a centralized service for Uppsala University, called the sequencing facility, where you can just submit your samples. (➤➤➤)
5. They often have lots of work to do, so about two or three months later you receive a file placed somewhere on a server.

“The file” slide
If you open the resulting file in your text editor, here is what it looks like: just sequences, one after the other, endlessly. You can get between ten thousand and ten million sequences per experiment, depending on how many samples you run in parallel in the sequencer. After all the sampling expeditions, the careful laboratory procedures and the expensive sequencing machines, this is the result produced. Every short DNA sequence here is called a read. It starts with an identifier, then come the nucleotides in the ATGC alphabet, and finally every nucleotide has an associated quality score which measures how sure the machine was of having read the correct base. Indeed, these sequencing machines make mistakes and sometimes give you a G instead of a C, etc. Of course, viewing the result like this in a text editor doesn't serve much purpose, and you are unable to say anything about the microbes that were living in the environment you sampled. You need to process it, and you can't do it by hand: you are going to need a series of programs and tools to extract statistical properties from this mess of DNA fragments. That's the bioinformatics part of things. And it's only with bioinformatics that you can use this data to answer questions such as:
- What organisms are present in the community?
- How similar or different are environments with regard to their microflora?
- How are organisms within the community related, and how diverse are they?
- Can we attribute ecological traits to groups of organisms? An ecological trait could be anything directly measurable, such as production of methane or speed of turnover.
- Can we get an idea of what the different groups of organisms are doing? For instance, from a metabolic viewpoint, can we say if they are producing oxygen, or maybe breathing and releasing CO2?
- Can we look into the evolutionary mechanisms and maybe say something about how these microbes have adapted?
When the method was first developed, scientists thought it had the potential to revolutionize our understanding of the entire living world. Of course, that's not exactly what happened. Those are some examples of interesting problems you can try to tackle with metagenomics. The most important thing is that metagenomics is a great tool because it enables us to view the previously hidden diversity of microscopic life.

“Resulting metagenomic data” slide
It's important to conceptualize where the sequences you see in that file are coming from. You have to realize that you are getting more or less randomly shattered parts of microbial chromosomes, without any distinction. For instance, your first read could be part of the beginning of the chromosome of this guy, while your second read is a tiny piece of the end of the chromosome of this other microbe floating further away, etc. By the way, this picture is of course not to scale.
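By the way, the layout I just described (identifier, nucleotides, quality scores) is known as the FASTQ format. If you ever want to peek into such a reads file programmatically, here is a minimal sketch of a parser in Python. The file name is hypothetical, and I assume the common four-lines-per-record layout and Phred+33 quality encoding that Illumina machines produce:

```python
# Minimal FASTQ reader: yields (identifier, sequence, quality) tuples.
# Assumes the simple four-line-per-record layout from Illumina machines.
def read_fastq(path):
    with open(path) as handle:
        while True:
            header = handle.readline().strip()
            if not header:
                break                            # end of file
            seq = handle.readline().strip()      # nucleotides (A, T, G, C)
            handle.readline()                    # the '+' separator line
            qual = handle.readline().strip()     # one quality character per base
            yield header.lstrip("@"), seq, qual

# Phred score: the probability that a base call is wrong.
def error_probability(qual_char):
    q = ord(qual_char) - 33                      # Phred+33 offset
    return 10 ** (-q / 10)

for name, seq, qual in read_fastq("reads.fastq"):   # hypothetical file name
    worst = max(error_probability(c) for c in qual)
    print(name, len(seq), "bp, worst per-base error probability:", worst)
```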
Keep in mind that a read is only a very small fraction of a complete bacterial chromosome. With current sequencing technology, the number of reads you get is roughly of the same order of magnitude as the number of microbes in the original sample. For instance, if you perform an experiment starting with 1 ml of water from your typical river and obtain one million short sequence reads, you can conclude that, on average, each microbe has contributed one read to your final dataset.

“The overview: two step process” slide
So let's summarize what we have learned up until now. Metagenomics is really best understood as a two-step process, with the defining middle point being the sequence file containing the reads. The first step involves taking your glass of water from the lake, bursting all the cells inside, extracting all the DNA and putting the result into a high-throughput sequencing machine. This produces a file somewhere on a server. The second step involves using the information produced in a meaningful manner. This is where you use the data generated to test a hypothesis or answer a question. It will often include synthesizing the data in some way and visualizing it with different kinds of graphs. A simple example here is the breakdown of a community by order rank. But often you might want to predict the metabolism of the bacterial community: what compounds are they using for energy, and what compounds are they producing? Then you would need some more complex visualization. The keen eye will notice that the species in this pie chart are not ones you would find in the environment. Indeed, Burkholderia and Streptococcus look much more like a host-associated community. This particular graph was copied from a study of the bacterial community collected from the lung of an 18-year-old patient. You see already that metagenomics is also applied in the medical field; more about that later. The first step of a metagenomic experiment happens in the real world and often includes some exciting sampling expeditions to places that have not been explored before. Microbe-wise, of course. The second step happens entirely on the computer, almost exclusively through programming languages and command-line interfaces. There are different kinds of metagenomics and different procedures, but keep in mind that this schema holds for more or less all metagenomics experiments.

“The lecture from now on” slide
So we have gone through the introductory slides, and we are ready to go deeper and get into the juicy parts now. We have many interesting things to cover:
- We need to talk about the sequencing technology, because in my previous example you kind of magically got the sequences in a file starting from a DNA solution.
- We need to put some context around metagenomics and go over the other "-omics" technologies.
- We need some more examples of actual ecological questions that this method can answer, otherwise we will keep wondering why we are doing all this.
- We need to talk a bit about how microbial populations form.
- We need to see how bioinformatics helps us go from the file with all the sequences to something meaningful, and what problems to avoid there.
- And we will end by taking a bunch of examples from past studies and seeing some of the cool stuff that has been done using metagenomics.
I hope you guys are excited!

“Overview: two step” second slide
So we saw that metagenomics can be described as a two-step process.
To complete your understanding of the laboratory part of things, we will now briefly review how the sequencing technology works. After that we will be ready to turn to the bioinformatics and the data analysis.

“Sequencing technologies” slide
A sequencer is a pretty complicated machine, but it's actually quite interesting to see how it works. I imagine you have probably heard a bit about DNA sequencing technologies before, so I'll go rather quickly over this step. The important thing to know is that the price of sequencing has been dropping constantly over the years. If you put on the Y axis the amount of money you need to pay to sequence one megabase of DNA, that is one million nucleotides, and on the X axis the date, it looks like this. (➤➤➤) You can see that the Y axis is logarithmic, so prices are dropping faster and faster. Companies are continuously trying to improve the technology and come up with sequencers that make longer reads at a lower cost. It is quite a big market, and at the annual conference for these things it is funny to watch the stock price of one company go up the second it makes impossibly good announcements about the new products it is going to sell, while the stocks of the companies that don't innovate fall. Often the impossibly good announcements never come true, of course. The price in itself is completely irrelevant; what it means for us is that more and more laboratories can get access to the technology. Several technologies have been picked up and then abandoned over the years. As I was looking over last year's slides, I realized the information there was already completely outdated. For instance, you might have heard of "454 pyrosequencing" in the past? That was a product sold by the company Roche, but just four months ago they decided to shut down the factory and stop selling them. So all we had learned about those machines is useless now. What I am going to do here is present just one of the most popular sequencing methods, the one we are currently using for our research in my group, which is the Illumina method, so you can get an idea of how it can work.

“Illumina technology video” slide
To show you how it works inside, I have a video. What you see here are the DNA strands you want to sequence. First, the DNA is fragmented, and the broken ends are repaired and adenylated. Special adapter nucleotides are ligated to both ends of every fragment. The fragments are then size-selected and purified. Cluster generation: what you see here is the flow cell, the central part of the machine; a dense lawn of short oligonucleotides has been grafted onto its surface. These oligos have sequences complementary to the adapters that were ligated to the fragments. Hence the fragments are able to bind to them. That's what happens: once the fragments have hybridized, they are extended with a polymerase to create copies. Then, by changing the temperature, you can make them bend over and attach again. Repeating this amplifies the starting sequence hundreds of times. In the end you get hundreds of millions of unique little colonies. The reverse strands are cleaved and washed away, the ends are blocked, and the library is ready for sequencing. Sequencing: what happens now is that every little colony is sequenced in parallel, all at the same time. All four bases are added as a solution to the flow cell. Only they are not normal nucleotides: they have a fluorescent particle attached to them.
All Ts have a red dye attached, for instance. The four bases compete with each other to bind to the template. After each round of synthesis, the clusters are excited by a laser and emit a color that identifies which base was added. The fluorescent label and the blocking group are then removed, and the next cycle starts: the four bases are added again, etc. This enables you to read the full sequence.

“Illumina flow cell pictures” slide
Here, on the left, you can see what a flow cell actually looks like. It's very small, yet has about 15 million small DNA clusters on it. And on the right you can see the kind of pictures that come out of the camera filming the flow cell. By classifying which color was detected for which colony, the software recomposes all the DNA sequences. Of course the automatic image processing program isn't perfect and makes mistakes. For instance, it is hard to make a decision when clusters are right next to each other or overlapping.

“Indels and substitutions” slide
This leads to abnormalities in the results file. Imagine for instance that the real sequence contained in one of the bacteria was the one at the top. The Illumina machine might make a mistake and substitute an A for a G. (➤➤➤) This is not too problematic, depending on what you want to do with your sequences later. What is much more dangerous is when the machine drops a letter or adds one by mistake. (➤➤➤) In this case, when you try to translate the sequence into the corresponding amino acids, you get what is called a frame shift. (➤➤➤) Every amino acid predicted after the deletion will be wrong, because you will have jumped a letter. But we are getting ahead of ourselves.

“Bioinformatics” slide
OK, so now that we understand how that file with the DNA sequences is created, we can slowly turn to the bioinformatics part of things. Maybe we should start by defining that term, because the meaning varies depending on who you ask. What does bioinformatics mean?
- Originally it does not mean biological computer science, even though that's how you will often hear it used. The "informatics" suffix isn't meant to relate to the computer; it simply relates to the science of information.
- Here is the definition I use: (➤➤➤) bioinformatics is the science and technology of biological information. It is a theoretical science that develops mathematical, statistical, and algorithmic methods and makes use of experimental data (in digital form), in an effort to discover functions and modalities of information processing in the cell.
- The thing is: the management, storage and processing of biological data across large computer networks is an important, although technical, aspect of bioinformatics, and it is often presented as bioinformatics itself, which it is not. This means, for instance, that when you are writing a program to convert the reads file from one file format to another because the analysis program you want to use doesn't accept the Illumina file format, you are not doing science; you are just losing your time on irrelevant technical problems.
I like to say bioinformatics is concerned with modeling and thinking about biological processes not as chemical or physical processes, but as informational processes. For instance, a biologist could describe the DNA in a cell as chromatin and would think of mRNAs as smaller, more fragile molecules. A bioinformatician would say DNA is a constant piece of information in the cell, while mRNA is a rapidly changing one derived from it, in analogy to a hard disk and the RAM.
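Going back to the frame shift for a second, here is a small sketch that shows why an indel is so much worse than a substitution. The toy sequence and the reduced codon table are made up for the example, and only translation in reading frame zero is shown:

```python
# Translation of a toy DNA sequence with a reduced codon table,
# assuming reading frame 0. Real code would use the full genetic code.
CODON_TABLE = {
    "ATG": "M", "GCT": "A", "GCA": "A", "AAA": "K", "GAA": "E",
    "TTC": "F", "GTA": "V", "AAG": "K", "AAT": "N",   # subset only
}

def translate(dna):
    # Read non-overlapping triplets; unknown codons are shown as 'X'.
    return "".join(CODON_TABLE.get(dna[i:i+3], "X")
                   for i in range(0, len(dna) - 2, 3))

real        = "ATGGCTAAAGAATTC"       # translates to MAKEF
substituted = "ATGGCAAAAGAATTC"       # one substitution (T -> A)
deleted     = real[:4] + real[5:]     # one base dropped: the frame shifts

print(translate(real))         # MAKEF
print(translate(substituted))  # MAKEF (silent here; at worst one residue changes)
print(translate(deleted))      # MVKN  (every codon after the indel is wrong)
```

A substitution stays local, but after the deletion every downstream codon is read in the wrong frame, which is exactly why indels are the dangerous kind of sequencing error.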
“Digits from microbes” slide
There is a multitude of ways to observe microbes, but to perform bioinformatics we can't just look at them under a microscope; we need to obtain some type of digital data from them. There are several ways to do that, and we have just described one, but it's important to keep in mind that there are others. Bioinformatics is not always done on genomic data. We could take all the mRNAs from a cell and put them in the same sequencing machines; in that case we would be looking at the parts of the genome which are transcribed, and it would be called transcriptomics. You can also obtain digital information from microbes by extracting the proteins from the cells and inserting them into a mass spectrometer; that would be called proteomics. We are just going to focus on genomics for the moment, of course.

“Genomics from microbes” slide
So, if we are doing genomics, we are focusing on data that is derived somehow from the sequence content of the genomes of microbes. But here again there are several ways of obtaining digits from the genome, and there are actually two different variations of the metagenomic technique we have been talking about. (➤➤➤) We will describe these two in more detail in the following slides, but it is good to know about the two others:
- Microarrays: they used to be the next big thing, but after a few years the scientific community realized the numerous problems of this technology, and it has more or less disappeared today.
- Single cell amplification: using apparatuses like flow cytometers, it is possible to isolate a single microbial cell from the environment and extract the DNA from it. The DNA is then heavily amplified and sent off to sequencing. This solves some of the problems of metagenomics but has others. We might come back to this at the end of the class.
Now let's look at the two types of metagenomic experiments. We will start with "targeted metagenomic sequencing", which is just a slight refinement of the technique we have been discussing.

“The lab technique with primers” slide
If you remember the lab technique from the first part, we just need to modify it like this. (➤➤➤) We add a special pair of primers just after extracting the DNA and before amplifying it. The effect is that only the pieces of DNA that match the primers are going to be amplified. This enables us to target just one gene. Of course, we should choose a gene that all microbes have; otherwise we will be ignoring a large part of the diversity with this technique.

“Amplicon primers” slide
Once you have chosen which gene to target, you artificially design two primer sequences that will bind to that unique region on the microbial chromosome and enable the Taq polymerase to amplify only that part in the PCR.

“Resulting targeted data” slide
If we go back to this picture: when you use targeted sequencing, the reads file actually contains very similar sequences each time, since you are targeting the same region in every microbe. Of course, the sequences are not exactly the same, otherwise you would learn nothing. I'll make the primer idea concrete with a small sketch in a moment.

“A ribosome” slide
So what gene are we going to target? The most popular gene used for targeting is part of the ribosome. As you know, living cells make proteins, and to make proteins they need some apparatus that assembles the correct chain of amino acids by reading the genetic code. This is the role of the ribosome. Absolutely all living cells have ribosomes in them.
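Before we go on, here is the promised sketch of what a primer pair does, seen computationally: it selects the stretch of DNA between two matching sites. All the sequences here are made up, and a real PCR is more subtle, since the reverse primer binds the reverse complement on the opposite strand and some mismatches are tolerated:

```python
# Toy in-silico PCR: extract the region bounded by a forward primer site
# and a reverse primer site. Sequences are invented for the example.

def amplicon(template, fwd_site, rev_site):
    start = template.find(fwd_site)
    end = template.find(rev_site, start + len(fwd_site)) if start != -1 else -1
    if start == -1 or end == -1:
        return None   # a primer doesn't match: this genome is missed entirely
    return template[start:end + len(rev_site)]

genome = "CCGTATGAGAGTTTGATCCTGGCTCAGGTTACCGCGGCTGCTGGCACAAA"
print(amplicon(genome, "AGAGTTTGATC", "CCGCGGCTGCTGG"))
# -> AGAGTTTGATCCTGGCTCAGGTTACCGCGGCTGCTGG
```

Notice the failure mode in the middle: if a primer does not match, that genome is simply missed. This is exactly the primer bias we will meet again in a moment.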
This is nice, as it means absolutely all living cells have the genes to make ribosomes in their genome. We can use this to our advantage and target one of the particular genes that participates in this process to probe a microbial population. You are looking at a picture of the ribosome from E. coli, recomposed from a crystallography experiment. As you can see, it is made of several pieces that fit together, some made of protein and some of RNA. The particular gene chosen by the microbiology community is the piece of RNA colored in blue here. It is called the 16S subunit because, when you centrifuge a denatured mixture of ribosomes, that part sediments at the 16 Svedberg unit mark. Not only do we know the structure, we also know the nucleotide sequence of this gene. (➤➤➤)

“2D ribosome” slide
Here you can see the full sequence projected onto two dimensions. The 16S rRNA was chosen because its sequence is very conserved and is almost the same in all the bacteria we have ever seen. It differs somewhat in archaea and is quite different in eukaryotes, but it is present in all life forms. Whatever genome you are looking at, maybe a bacterium, maybe an archaeon, you can be sure that somewhere there is a gene to make ribosomes. If you were to unroll it, you would see that it measures about 1500 nucleotides. Most of it is very constant, but some parts are highly variable between species, as they are not under the same selective pressure. (➤➤➤) These pieces of the gene are free to mutate, and their composition drifts randomly as evolutionary time goes by. This variation is what enables us to extract valuable information from the genetic sequences and say something about the diversity of bacteria living in a particular sample, sometimes even distinguishing between species of bacteria. It is considered a good method for classifying bacteria, and, generally, the phylogenies built using the 16S gene agree with phylogenies built using other marker genes.

“Primer choice” slide
Of course, the read length of typical sequencing technologies is pretty short, so unless you are using the old Sanger sequencing you are actually only getting a piece of the 16S gene, of maybe 400 base pairs. In this case the choice of primers is important, and can greatly influence the outcome of your experiment. Take as an example a lineage like the freshwater bacterium "OD1", which was only recently added to the tree of life as a new branch: it appears it went undetected for so long mostly because the commonly chosen primers did not match its particular 16S sequence. Now we realize it is quite abundant in almost every freshwater system. Several other bacterial lineages suffer from the same problem and will be underestimated depending on the primers you use.

“Reads to diversity” slide
After all this, we are finally ready to answer the question: how do we get from the sequences to a proportional breakdown of microbial species? Let's walk through a typical analysis pipeline.

“Typical 16S analysis pipeline” slide
In this example we are going to use a reduced data set: instead of having thousands of sequences, we start with these seven fake 16S sequences in a file. [...] You also have to consider that sometimes you don't find any similarity in the database, in which case you cannot classify your sequence. Or you find some similarity, but you can only classify your sequence at a very high level, saying for instance that it belongs to the Alphaproteobacteria, but not which species it is.
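To make the classification step less abstract, here is a toy sketch of the core idea: compare each read against a reference database by shared k-mers, keep the best hit, and give up below some similarity threshold. The reference names, sequences and the threshold are all made up, and real classifiers (and real databases) are far more sophisticated:

```python
# Toy taxonomic assignment of 16S reads by shared k-mers.
from collections import Counter

def kmers(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Hypothetical reference database: name -> 16S sequence (placeholders).
REFERENCE = {
    "Escherichia coli":  "AGAGTTTGATCCTGGCTCAGATTGAACGCTGGCGGCAGGC" * 3,
    "Bacillus subtilis": "AGAGTTTGATCATGGCTCAGGACGAACGCTGGCGGCGTGC" * 3,
}

def classify(read, min_similarity=0.8):
    best_name, best_score = "unclassified", 0.0
    read_kmers = kmers(read)
    for name, ref in REFERENCE.items():
        # Fraction of the read's k-mers found in this reference sequence.
        score = len(read_kmers & kmers(ref)) / max(len(read_kmers), 1)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_similarity else "unclassified"

reads = ["AGAGTTTGATCCTGGCTCAGATTGAACGCTGG",
         "AGAGTTTGATCATGGCTCAGGACGAACGCTGG",
         "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"]   # nothing like this in the database
print(Counter(classify(r) for r in reads))
# Counter({'Escherichia coli': 1, 'Bacillus subtilis': 1, 'unclassified': 1})
```

Tallying the assignments with a Counter, as in the last line, is essentially how the proportional breakdown behind those pie charts is produced.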
This is of course somewhat technical, but it's important to understand what is going on under the hood. Often someone presents a study and simply says "we collected a sample from this place and here is the composition in bacteria", forgetting to mention all of the processing between those two steps. You now have an idea of what is involved in actually producing the measures.

“A real plot” slide
Of course, in a real study you would end up with a plot looking more like this. It will include several samples from different systems, or taken at different times. This is just an example taken from one of the studies in my thesis, where we collected bacterial samples from lakes with a very high pH, called soda lakes.

“Statistics” slide
The next step in the process is to use some statistical tools in an attempt to understand what is going on, or to extract some general mechanisms. You could try to make a plot like this one, where you place every one of your lakes in a two-dimensional space according to its microbial composition, and you overlay the correlations with the environmental variables you measured, such as temperature, pH, conductance and the like. In the end, it's up to you to choose the relevant comparison statistics or models to answer your scientific question. You can do many different things by combining bacterial population abundances from multiple samples with other types of data. For every new study, different statistics will be applied.

“The 16S rRNA databases” slide
As you have seen, a crucial step is to choose a taxonomic database when you want to assign species to your quality-filtered reads. These databases contain the result of years of experiments on microbes. Without them, it would be impossible to classify the sequences with the names we are familiar with. There are three main providers of such databases: Silva, Greengenes, and RDP. They are comparable, but they sometimes use slightly different names for the same species and are not entirely compatible. The most popular are probably these two, I would say. In our studies we use this one, as it seems to be the most up to date and has the largest number of sequences. But one can't say that any one of them is clearly superior to the others. This is once again a choice that the scientist has to make somewhat arbitrarily.

“Software for 16S” slide
If you need to work with this kind of data one day, there is software that can help you. But these tools are mostly command-line interfaces, and require knowledge of the Unix operating system to be used.

“What is a species” slide
This brings us to the question: what exactly is a species in the microbial world? Generally, two bacteria are considered to be part of the same species if their 16S sequences diverge by no more than 3%. Before that definition, one would use DNA hybridization methods, and a species was defined as any two bacteria which hybridize over 70%. Of course, the two definitions agree for some species but not for others. This shows us that the definition of a species can change depending on the scientific community's opinion and the tools it has. For instance, when studying diversity, the classification of what is different and what is considered the same is an area of debate, and it will greatly influence any measure of diversity you can make.

“What is a species second” slide
In the macroscopic world, and especially with mammals, it is easy to define what a species is: if two individuals can reproduce, they are from the same species; if they can't, they are from different species.
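Going back to the 3% rule for a second, here is what it looks like in code: a minimal sketch that assumes two already-aligned, equal-length 16S fragments. The sequences are toy examples, and real pipelines align first and cluster thousands of reads at this threshold:

```python
# The "97% identity" species rule on two aligned toy 16S fragments.
def percent_identity(a, b):
    # Assumes equal length; '-' would denote an alignment gap.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)

seq1 = "AGAGTTTGATCCTGGCTCAGATTGAACGCTGGCGGC"
seq2 = "AGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGC"   # one mismatch out of 36

identity = percent_identity(seq1, seq2)
print(f"{identity:.1%} identity -> same species: {identity >= 0.97}")
# 97.2% identity -> same species: True
```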
But bacteria reproduce asexually by just dividing; they grow fast and mutate quickly. So what does a species really mean? They are also able to exchange genetic material horizontally, by simply secreting DNA and taking it up through the cell membrane, which makes things even more complicated. To develop the idea that the concept of a species is somewhat arbitrary in the microbial world, consider the following: if you were to classify the species of mammals in a forest, you could count the different animals and come up with a result similar to the one presented here. There are 4 foxes and 6 rabbits in the forest. In this case you can be certain that there is nothing in between the two species. If there ever was something genetically in between a rabbit and a fox, it is extinct today, and you will never find it in your forest. It is not necessarily the same with bacteria. With the 16S method, depending on what classification algorithm you use and which database you choose, you can end up with something like the following. It should therefore perhaps be better viewed as a continuum, and could be drawn like this.

“16S rRNA conclusions” slide
To conclude this chapter, we could say that the 16S rRNA method is good because it was able to reveal the previously hidden diversity of microbes in the environment. But one has to be careful about the biases of the method, such as which primers are chosen, which processing is chosen and which database is chosen. Also, one cannot really compare two studies which have not followed exactly the same protocol. Finally, one of the shortcomings of this technique is that it doesn't give much information about what the microbes are doing metabolically.

“Example studies” slide
Now, to illustrate what you can do with the 16S rRNA technique, here are three studies that I picked from the literature.
1) In the first study, they sampled 18 different lakes in Wisconsin and looked at a particular population of freshwater Actinobacteria. Each lake of course had its unique composition, made up of different strains of Actinobacteria, showing that microbes don't disperse across the environment very rapidly. Furthermore, what best explained the community structure was not the distance between the different lakes but the environmental factors of each lake, such as its pH. This is interesting because it supports the general ecological theory that "everything is everywhere, but the environment selects", meaning that every microbial species can potentially arise anywhere as long as the conditions are right for it, and that the limiting force is not dispersal.
2) Second, here is a study where they asked what factors influence the composition of the microbiota in the gut of different mammals, and how hosts and microbes co-evolve together. Is it what you eat that determines the population of microbes, or is it your ancestry? They found that the phylogeny of the host was the dominating effect, and that diet had a strong secondary effect. Another finding was that the communities of humans living a modern lifestyle were typical of omnivorous primates.
3) Finally, an example of a study from the medical field, where they compared the microbial communities of newborn babies and tried to determine what factors shape their microbiota in the beginning.
They found, for instance, that the way a baby is born is the biggest factor contributing to the composition, and that babies born by C-section don't get inoculated with the same starting population. Indeed, the medical field is also very interested in bacterial communities, as they are linked to many different diseases. They also generally have more money than the ecological field, which is perhaps a questionable decision of our society. But let's not talk politics! Microbes are mostly linked to gastrointestinal diseases, but a recent study, for instance, showed a strong link between the gut microbiota and Parkinson's disease. There is also an interesting procedure to cure some patients of inflammatory bowel disease where, in order to reset the bad microbial population in the colon, the patient is given a fecal transplant from another, healthy person.

“Example studies” references
Newton RJ, Jones SE, Helmus MR, McMahon KD: Phylogenetic ecology of the freshwater Actinobacteria acI lineage. Appl Environ Microbiol 2007, 73:7169-7176. http://www.ncbi.nlm.nih.gov/pubmed/17827330
Ley RE, Hamady M, Lozupone C, Turnbaugh PJ, Ramey RR, Bircher JS, Schlegel ML, Tucker TA, Schrenzel MD, Knight R et al.: Evolution of mammals and their gut microbes. Science 2008, 320:1647-1651. http://www.ncbi.nlm.nih.gov/pubmed/18497261
Dominguez-Bello MG, Blaser MJ, Ley RE, Knight R: Development of the human gastrointestinal microbiota and insights from high-throughput sequencing. Gastroenterology 2011. http://www.sciencedirect.com/science/article/pii/S0016508511001600

“The lab technique 3” slide
OK, you remember how we modified the lab technique to add primers targeting the 16S rRNA gene? We are now going to look at what happens if you remove that step. (➤➤➤)

“The shotgun metagenomic” slide
In that case we don't target any particular gene, and we get random pieces from anywhere on the microbial chromosomes, as shown in this picture. Because of the randomness of breaking up the DNA in the sample, this technique is called shotgun metagenomics.

“Classical shotgun sequencing” slide
It's important to remember the difference between shotgun metagenomics, which is applied to free-living microbes, and classical shotgun sequencing, which can be applied to monoclonal microbes growing on a plate. In a classical sequencing experiment you only take microbes that are growing in the same colony, so though you get random parts of the chromosome, you know that your reads are all fragments of copies of the same chromosome. This makes things much easier to piece together afterwards.

“The reads” slide
So what do you do with your reads this time? They come neither from the same region of the chromosome nor even from the same type of bacterium. Does anyone have an idea?

“Genomic assembly” slide
What one can do, if one has enough reads, is try to assemble them together; that is, try to find overlaps between the sequences and build bigger pieces. That's a pretty easy task if you have three reads: you can simply compare the end of each read with the start of each other read and check for matching parts. Finally, you can build a longer sequence from the three starting reads. This is often called a contig, standing for a contiguous piece of DNA. Finding matching parts between sequences seems like an easy problem to solve, but when you have several million reads you need another strategy: you can't just compare every read with every read, because an algorithm that naively tested all the comparisons would take longer to execute than your life expectancy, and is thus useless.
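Here is the easy three-read case in code: a toy sketch of overlap detection and greedy merging. The reads are made up, and a real assembler also has to handle sequencing errors, both DNA strands, and millions of reads at once:

```python
# Toy overlap assembly: repeatedly extend a contig with reads whose prefix
# matches the contig's suffix. Reads are invented for the example.

def overlap(a, b, min_len=4):
    # Length of the longest suffix of a that equals a prefix of b.
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

reads = ["ATGGCTAAAG", "CTAAAGAATT", "AGAATTCCGA"]

contig = reads[0]
for read in reads[1:]:
    matched = overlap(contig, read)
    contig += read[matched:]        # append only the non-overlapping tail

print(contig)   # ATGGCTAAAGAATTCCGA
```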
Many smart algorithms have been developed to counter this problem, and nowadays you can assemble even human-sized genomes on a single computer. Of course, for a large genome like that you still need a pretty big computer.

“Networks and graphs” slide
The problem is often solved by using the reads to build a network, or graph. For instance, in this simple example we have a very small microbial chromosome with 10 nucleotides, which might generate the following reads when randomly broken up. By drawing links between the reads that overlap, you can then just follow a path in the graph that goes through each read once and recover the assembled sequence, or contig.

“De Bruijn graphs” slide
However, this strategy quickly reaches a limit, as the problem of finding a path that goes through every node in a graph exactly once is computationally very demanding when the graph is big. This is why, for data coming from the Illumina machines, a different type of graph is used, called a De Bruijn graph, where instead of putting the reads as nodes in the graph, the reads become edges. This means you have to find a path that goes through every edge once, which is much easier for a computer. We don't need to go into more detail about these complex mathematical tools. Unless you ask me to, of course. But it's interesting to point out that, as the tools for doing biology get more and more technical, it becomes increasingly hard for a student coming from a standard biology curriculum to understand what is really going on. Quite often biologists lack the programming and mathematical skills to deal with this new area of biological data, and almost always need to associate with or hire a specialist to conduct such studies.

“From contigs to genomes” slide
OK, but all you need to understand for the moment is that the result of the assembly operation is a smaller set of longer sequences. Unfortunately, when you assemble your metagenomic data you never end up with contigs the size of a typical bacterial genome, which is about 3 megabase pairs. You get lots of smaller contigs measuring maybe only a few kilobase pairs. This is because sometimes a piece of the bacterial chromosome is under-represented in the reads, or because of multiple repeated regions in the genome. So how do you decide that two contigs go together and are part of the same original genome, given that by definition they don't share any overlap in sequence? It's a hard problem, especially if your sample contains lots of different species. But there are some smart tricks. Often one is only able to pull out a few semi-complete genomes, even with a large number of starting reads. To get a feeling for the problem, a real-world analogy would be the following task: take five different English novels and print a thousand copies of each. Send all five thousand books through an industrial shredder, tearing them to pieces. Take all the small pieces of paper left over and introduce errors into them. Then bury them in the ground for a week, so that many of them become unreadable and rotten. Dig them all up and try to recompose the original stories by piecing the papers together. You could use tricks such as identifying each author's style and using that to group different chapters. In the same way, you can use the information contained in the contigs themselves to separate them.
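The simplest such signal is just base composition. Here is a minimal sketch that computes it per contig; the contig sequences are made up, and real binning tools combine this with read coverage, k-mer profiles and more:

```python
# Crude binning signal: GC content per contig. Contigs with very different
# GC content probably come from different genomes. Sequences are invented.

def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

contigs = {
    "contig_1": "ATGCGCGGCCGCATGCGGCC",   # GC-rich
    "contig_2": "ATATTAAATTGCATTAATAT",   # AT-rich
    "contig_3": "GGCCGCGCATGCGCGGATCC",   # GC-rich: maybe same genome as contig_1
}

for name, seq in contigs.items():
    print(name, f"GC = {gc_content(seq):.0%}")
# contig_1 GC = 80%, contig_2 GC = 10%, contig_3 GC = 80%
```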
A simple statistic, as in the sketch above, is the ratio of GC to AT nucleotides: a bacterial genome usually has a roughly constant ratio across its chromosome. You can also use the fact that microbes have different abundances in the sample and hence contribute more or fewer reads to the dataset. If one microbe makes up 20% of a given community and another only 10%, then all the contigs belonging to the first microbe should be covered, on average, by twice as many reads at every position.

“Ideal case” slide
In an ideal case, you could get a bimodal distribution for both of these measures and be able to put together four different genomes. But that's rarely the case.

“Real case” slide
Here is a graph taken from an actual study. You can see some clear groups, but most of it is very mixed.

“FASTA file to metabolic graph” slide
How do you use such data? How would you go from your contigs file to the conclusion that the microbes in your sample have the genes to degrade chemical compound X, but not the ones to transform it into compound Y? Let's walk through a typical analysis pipeline.

“Typical analysis pipeline” slide
[...] One of the problems with this approach is that often half or more of your detected open reading frames have no known function. Once again, you can do this for every sample; it's then up to the scientist to choose the mathematical methods or statistics to compare samples amongst each other, depending on the question being asked.

“Databases for proteins” slide
In such a pipeline, an essential ingredient is again the database you choose for inferring function when you detect a protein. There are many such databases; the main ones are listed here. UniProt, for instance, is one of the largest databases of proteins with functional information. It has about 30 million entries, but only a few thousand have been confirmed by extracting the protein itself; about 2% have evidence from some type of published experiment, 20% are inferred by strong homology, and the rest is predicted function. You have to keep in mind that most functions are just predictions, so be careful when using these databases. There is also a notable problem when new genomes are annotated with information coming from the database, and the same genomes are later used to fill the database with the conclusions. This makes the information stored there a reflection of itself rather than a reflection of reality.

“Software for metagenomics” slide
As I said earlier, bioinformatics is mostly done with interfaces like the command line and programming languages like Perl and Python. There are no easy user interfaces. There are, however, some online tools for processing shotgun metagenomic data that can automate large parts of the procedure. There is the MG-RAST website, where you can upload a FASTA file and it will give you a page full of statistics, from taxonomic distribution to metabolic processes, including diversity measures. The European Bioinformatics Institute promises a similar service, but it has been changing lately and their new version is still in beta.

“Shotgun conclusions” slide
To conclude this chapter, we can say that the most difficult part of shotgun metagenomics is assembling the reads, and then the contigs, in a satisfactory manner. Often the procedure gives you incomplete genomes, or genomes that are called chimeric because they are the result of wrongfully merging different species together.
In fact, the storage and analysis costs of a metagenomics experiment are nowadays always higher than the lab supplies and sequencing costs. If you are a small lab without the bioinformatics knowledge, the method may not be accessible to you. Predicting the function of the microbes is the second hardest problem, with about half of the proteins having no similarity to anything in the databases. And even when you are able to predict the function of a protein you found, there remains the fact that seeing a gene in the genome of an organism doesn't mean it is being actively transcribed, translated and folded properly. Hopefully, new sequencing technologies producing longer reads will appear in the coming years, making assembly easier. Since we are currently calling today's technology "next generation sequencing", I have written "next next generation sequencing" on the slide. Nonetheless, shotgun metagenomics enables us to gather lots of information about microbes that are otherwise impossible to isolate and culture.

“Everything” slide
Doing metagenomics is also quite popular. On this slide are some randomly chosen examples from a metagenomic repository. You can see that everything and anything has been sequenced. You have:
- A rice-straw enriched compost from Berkeley
- An acid mine drainage
- Leg ulcer microbial communities
- A hot spring microbial community
- Saline water from the Dead Sea

“Everywhere” slide
Here is a screenshot showing the world map and the number of metagenomes recorded in each particular area. This is probably just a fraction of all the metagenome sequencing that has been going on, as many experiments are not submitted to the public databases or tagged with the appropriate GPS data.

“Example study 1” first slide
To demonstrate the kind of studies that can be made, I picked an article from the literature that represents one of the first success stories of metagenomics. In this study, scientists from Berkeley sampled microbes from biofilms growing in the water coming out of an abandoned iron mine. You can see a picture here of the greenish bacterial slime that was collected. This mine drainage water is extremely acidic, with one of the lowest pHs ever measured in a natural system, -3. This of course is detrimental to the environment and is considered pollution. The microbes living there actually contribute to the pollution, as they catalyze the decomposition of the minerals into metal ions. I don't want to go into too much detail, but what they did was start with a targeted 16S study, which showed that there were only three different bacterial lineages and three archaeal lineages in the sample, making it an extremely simple microbial community. Usually you would find thousands of different species in a sample, but it is often the case that in extreme environments the complexity is lower. Next, they applied shotgun sequencing to the sample and were able to organize their contigs into clear, distinct groups using coverage and GC content. In one of the groups they recovered a nearly full genome related to the Ferroplasma group of organisms. Its 16S gene was similar to that of a cultivated representative of that group, and the total size of the genome matched that relative. But the rest of the genome was sufficiently different to establish it as a new species. They also found a new organism related to the Leptospirillum group. But what is really interesting is that, by analyzing the contents of the newly retrieved genome, they could predict almost all the functions of the cell and draw the following diagram.
“Example study 1” second slide
By finding which proteins were doing what, they were able to reconstitute an almost complete metabolic map of the cell. You can also get information by looking at what is missing in a genome. For example, neither the Leptospirillum group II genome nor the Ferroplasma genomes seem to have the genes necessary for nitrogen fixation, suggesting that this essential task might be performed by other, less abundant members of the community. You can start answering questions like "How do these communities resist the toxic metals and maintain a neutral cytoplasm?" You can also start looking at the evolutionary events that these organisms have undergone by examining the amount of nucleotide diversity in each part of the genome. For example, the Leptospirillum group appears to have undergone a recent selective sweep, judging by the very low level of nucleotide polymorphism. On the other hand, our new Ferroplasma friend seems to have been subject to homologous recombination events. So now you see how one can predict the metabolism of microbes that don't want to be cultured.

“Example study 1” reference
Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 2004, 428:37-43. http://www.ncbi.nlm.nih.gov/pubmed/14961025

“GOS” slides
1.045 billion base pairs of non-redundant sequence
-> Fishing expedition problem.
-> Found a rhodopsin in the Bacteria kingdom.
-> Found unexpected diversity and nitrification in the Archaea kingdom.
-> One has to perform laboratory experiments to confirm the findings; these results can only give you clues.
-> DeLong et al., Science, 2006

“Targeting larger organisms” slide
-> Environmental forensics. http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0035868

“Cryptic fluctuations” slide
Even after all these studies, the truth is we don't really have any universal models explaining how microbial communities form and change over time. It's something that is very hard to predict, even if you were given all the external inputs (such as the hydrology or the amount of precipitation) and all the other driver variables (such as temperature, pH, and the amount of sunlight). The internal processes of growth and competition between microbes are so complex that they can't really be summarized in simple equations. You also have to consider even more factors, such as predation on the microbes by higher organisms and by viruses. There is still lots of uncharted territory, so it's quite an exciting field.

“Unexplored territory” slide
Here is a graph showing all the different kinds of things we can still do to get a better understanding. Today we have mostly been speaking of taking whole communities and subjecting them to DNA metagenomics, but one can also do transcriptomics and proteomics. One can also try to lower the complexity of the community and apply the same tools, and one can try to isolate single cells and do the same things. Single-cell genome amplification, for instance, is just starting to get popular, and a new center specialized in it is just being created here in Uppsala.

“Combine and conquer” slide
One shouldn't forget to combine all these experiments together; it's only by attacking the problem from all sides that we will start to see more clearly. Thanks for listening.