BEBaC user manual
Lu Cheng
12.06.2012
Email: [email protected]
BEBaC [1] (Bayesian estimation of bacterial communities) is a software package for estimating
bacterial community composition from large amounts of 16S rRNA gene deep sequencing data. A
dataset containing 4K~100K reads of length 400bp~2000bp is suitable for BEBaC. Larger
datasets may also work, but we have not tested them. Here “reads” means DNA
sequences produced by a sequencing machine.
BEBaC is written in Matlab. Currently BEBaC provides only a command-line interface; there is no
graphical mode. It is specifically designed for parallel use, so we strongly suggest running
BEBaC on a computer cluster.
Input data
1. The input data should be DNA reads of similar length, such as 450bp~550bp. Large
differences in read length might result in a poor multiple sequence alignment.
2. All reads should contain only the bases ‘ACGT’. ‘N’ and other characters are not accepted by the
software.
3. The input file should be in FASTA format. Matlab format (“.mat”) is also supported.
The “.mat” file should contain “headers” and “seqs” as cell arrays, which represent the reads.
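The input requirements above can be checked before running BEBaC. The following is a minimal Python sketch, not part of BEBaC itself; the function names and the length-ratio threshold of 1.25 are assumptions chosen for illustration (450bp~550bp corresponds to a ratio of about 1.22).

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into parallel lists (headers, seqs)."""
    headers, seqs = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            headers.append(line[1:])
            seqs.append("")
        elif headers:
            # Sequence lines are concatenated until the next header.
            seqs[-1] += line
    return headers, seqs


def check_reads(headers, seqs, max_length_ratio=1.25):
    """Return a list of problems; an empty list means the reads look usable.

    max_length_ratio is a hypothetical threshold, not a BEBaC parameter.
    """
    problems = []
    for header, seq in zip(headers, seqs):
        bad = set(seq) - set("ACGT")
        if bad:
            problems.append(f"{header}: disallowed characters {sorted(bad)}")
    lengths = [len(s) for s in seqs if s]
    if lengths and max(lengths) > max_length_ratio * min(lengths):
        problems.append(
            f"read lengths vary widely ({min(lengths)} to {max(lengths)} bp)"
        )
    return problems
```

Reads flagged by such a check would need to be filtered or trimmed before being passed to preprocSeq.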
BEBaC work flow
This section explains the BEBaC work flow. BEBaC consists of three main steps: pregroup, crude
clustering (L1cluster) and fine clustering (L2cluster). Figure-1 shows the work flow. Note:
here “reads” means the sequences in the input file.
Pregroup
1. In pregroup, we first transform the reads to 3-mer counts, so the data is transformed to an
nReads*64 matrix.
2. Next we cluster the rows of the matrix using K-means, with the distance calculated from the linear
correlation between rows. K is selected so that the average cluster size is around 500,
i.e. K=nReads/500. After K-means clustering, we get K initial clusters.
3. For each initial cluster k, we calculate the pairwise distance matrix of its reads. Please refer to the
supplementary material of our paper [1] for details.
4. For each initial cluster k, we cluster the reads using the complete linkage algorithm (furthest
neighbour). The reads are then assigned to pregroups using a distance cutoff of 0.1.
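Step 1 and the choice of K in step 2 can be sketched in Python as follows. This is an illustration only, not the BEBaC implementation: counting overlapping 3-mers, and rounding K with a minimum of 1, are assumptions, since the manual does not specify either detail.

```python
from itertools import product

# All 64 possible 3-mers over the alphabet ACGT, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=3)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_counts(read):
    """Count overlapping 3-mers in one read; returns a list of 64 integers."""
    counts = [0] * 64
    for i in range(len(read) - 2):
        counts[KMER_INDEX[read[i:i + 3]]] += 1
    return counts


def count_matrix(reads):
    """Build the nReads x 64 matrix, one row of 3-mer counts per read."""
    return [kmer_counts(r) for r in reads]


def initial_k(n_reads, target_size=500):
    """K = nReads / 500; the rounding and the floor of 1 are assumptions."""
    return max(1, round(n_reads / target_size))
```

For example, a dataset of 2000 reads would give 4 initial clusters under this rounding.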
Figure 1: BEBaC work flow
Crude clustering
In crude clustering, also called L1 clustering (level 1 clustering), we utilize the pregroup
information. For each pregroup, we sum the 3-mer counts of all the reads in the pregroup, so
a pregroup is represented by a 1*64 vector. We then model each crude cluster as a Dirichlet
distribution with 64 hyperparameters. Pregroups coming from the same distribution are put
into the same cluster, such that the marginal likelihood is maximized.
The user provides the maximum number of possible crude clusters, MAX_K. BEBaC searches
the partition space that satisfies the limit K<MAX_K and returns the partition that
maximizes the marginal likelihood. The number of crude clusters in the best partition should be
smaller than MAX_K. If not, BEBaC doubles MAX_K and searches for the best partition again.
BEBaC doubles MAX_K at most 4 times, i.e. up to 16*MAX_K.
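The MAX_K doubling rule can be illustrated with a short Python sketch. The function name and the “search” callable are hypothetical: “search” stands in for the partition search and is assumed to return the number of crude clusters found under a given limit.

```python
def search_with_doubling(search, max_k, max_doublings=4):
    """Repeat the search, doubling max_k while the limit is reached.

    search: hypothetical callable taking max_k and returning the number
            of crude clusters in the best partition found.
    Returns (clusters_found, final_max_k).
    """
    for attempt in range(max_doublings + 1):
        k_found = search(max_k)
        if k_found < max_k:
            # The best partition uses fewer clusters than the limit: accept it.
            return k_found, max_k
        if attempt < max_doublings:
            max_k *= 2
    # The limit was still reached after the last allowed doubling.
    return k_found, max_k
```

With an initial MAX_K of 20 and data that actually contain 30 crude clusters, the first search hits the limit, MAX_K is doubled to 40, and the second search is accepted.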
Fine clustering
After crude clustering, BEBaC aligns all the reads within each crude cluster. It then uses the codon
linkage model adopted in BAPS [2] to cluster the reads. By default, we set
MAX_K=1,2,3...USER_DEF_K (USER_DEF_K=10 by default) and calculate the DL
(description length) for each value. We select the MAX_K that minimizes the DL. This determines
the “best” partition, i.e. the fine clusters. Finally, the reads within each fine cluster are
realigned to derive the consensus sequences. However, if a fine cluster contains more
than half of the sequences in its corresponding crude cluster, it is not realigned.
Note that if no result file is produced in this stage, please check the standard output and set a
larger USER_DEF_K.
To save computation time, a crude cluster with more than 3000 reads is aligned in
“FAST” mode using MUSCLE [3], while crude clusters with fewer than 3000 reads are aligned
in “SLOW” mode. (See the Installation section about “FAST” and “SLOW” mode.)
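The two selection rules in this subsection can be sketched in Python as follows. Both function names are hypothetical. Treating a cluster of exactly 3000 reads as “SLOW”, and preferring the smaller MAX_K when two DL values tie, are assumptions, since the manual does not cover these cases.

```python
def choose_alignment_mode(n_reads, threshold=3000):
    """Return "FAST" for crude clusters with more than `threshold` reads."""
    return "FAST" if n_reads > threshold else "SLOW"


def best_max_k(description_lengths):
    """Pick the MAX_K with the smallest description length (DL).

    description_lengths: dict mapping each tried MAX_K to its DL.
    Keys are visited in ascending order, so the smallest MAX_K wins ties.
    """
    return min(sorted(description_lengths), key=lambda k: description_lengths[k])
```

For instance, DL values of 120.5, 98.0 and 101.2 for MAX_K=1, 2 and 3 would select MAX_K=2.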
Deriving consensus sequences
For each fine cluster, we derive a consensus sequence from its multiple sequence alignment
(MSA).
For each site (column in the MSA), the majority base is chosen as the base in the consensus
sequence. If the majority base for a site is an indel, i.e. “-”, that site is removed from the
alignment. The consensus is stored in “conseq.fasta”.
For each site, the frequency of the majority base is used as the quality score of the consensus
sequence. The quality scores are stored in “conseq.qual”.
We also calculate the average deviation of the reads from the consensus sequence, i.e. the
percentage of a read that differs from the consensus sequence. The deviation is
calculated as the Hamming distance divided by the length of the whole alignment. This
information is shown as “avgDev” in the headers of “conseq.fasta” and “conseq.qual”.
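The consensus, quality score and average deviation described above can be sketched in Python. This is an illustration under stated assumptions, not the BEBaC code: ties between bases are broken by first occurrence in the column, and the Hamming distance is taken over the retained (non-gap-majority) columns only.

```python
from collections import Counter


def consensus_from_alignment(aligned_reads):
    """Return (consensus, quality, avg_dev) for equal-length aligned reads."""
    n_reads = len(aligned_reads)
    aln_len = len(aligned_reads[0])
    consensus, quality, kept_columns = [], [], []
    for col in range(aln_len):
        column = [read[col] for read in aligned_reads]
        base, count = Counter(column).most_common(1)[0]
        if base == "-":
            continue  # majority is a gap: drop this site
        consensus.append(base)
        quality.append(count / n_reads)  # frequency of the majority base
        kept_columns.append(col)
    # Per-read deviation: mismatches against the consensus on the retained
    # columns (an assumption), divided by the full alignment length.
    deviations = [
        sum(read[col] != base for col, base in zip(kept_columns, consensus)) / aln_len
        for read in aligned_reads
    ]
    avg_dev = sum(deviations) / n_reads
    return "".join(consensus), quality, avg_dev
```

For the alignment “ACGT-”, “ACGA-”, “ACGTA”, the last column is dropped because its majority symbol is a gap, and the consensus is “ACGT”.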
Installation
BEBaC currently supports only Linux.
1. Add the BEBaC directory to the path
Here we take Ubuntu as an example. Open a terminal and edit the .bashrc file:
gedit ~/.bashrc
Append the following line to the end of the file (replace "PATH_TO_BEBaC" with
your BEBaC directory):
export PATH=PATH_TO_BEBaC:$PATH
Save the file and close the terminal, then start a new terminal.
2. Install MCR
If you have Matlab (2010a), you do not need to install MCR (the MATLAB Compiler Runtime).
To install MCR, type:
chmod u+x MCRInstaller_unix_2010a_64bit.bin
./MCRInstaller_unix_2010a_64bit.bin
3. Install MUSCLE
See the instructions here:
http://www.drive5.com/muscle/downloads.htm
4. Configure the BEBaC.sh file
gedit BEBaC.sh
Specify the installation directory of your MCR (or MATLAB) in the file.
Specify the path to the MUSCLE software on your machine.
5. Run a BEBaC analysis
Type:
mkdir OUTPUT_DIRECTORY
cd OUTPUT_DIRECTORY
Then copy testseq.fasta to OUTPUT_DIRECTORY.
Sequences in testseq.fasta should contain only "ACGT"; other characters such as "N" and "-" are not
allowed.
Type:
BEBaC.sh preprocSeq testseq.fasta .
You will get reads.mat.
Type:
BEBaC.sh preGroup reads.mat . initCluster
You will get pregroup_initial_K=4.mat and a folder "pgdist".
Type:
BEBaC.sh preGroup reads.mat . calDisMat 1
You will get the distance matrix file pgdist/dismat1.mat of initial cluster 1.
Type:
BEBaC.sh preGroup reads.mat . calDisMat 2:4
You will get the distance matrix files for initial clusters 2, 3 and 4.
Type:
BEBaC.sh preGroup reads.mat . pregroup
You will get the pregroup result file pregroup_final_K=102.mat.
Type:
BEBaC.sh clusterL1 20 pregroup_final_K\=102.mat .
This performs crude clustering. You will get 4 crude clusters, and the result file is L1_clusters_K=4.mat.
"20" is the maximum number of crude clusters.
Type:
BEBaC.sh clusterL2 1 L1_clusters_K\=4.mat .
This performs fine clustering for crude cluster 1. You will get a new folder "L1-clusters", whose
subfolder "1" contains the fine clustering results.
Type:
BEBaC.sh clusterL2 2:4 L1_clusters_K\=4.mat .
This performs fine clustering for crude clusters 2 to 4. You will get subfolders "2", "3" and "4" under the folder
"L1-clusters".
Type:
BEBaC.sh fetchConsensus 4 .
This fetches the consensus sequences of the OTUs. You will get a folder "results".
"conseq.fasta" and "conseq.qual" are the consensus sequence and quality files.
"conseq1-4.ps" is a graph showing the quality of each consensus sequence.
"crudeLabels.txt" and "fineLabels.txt" show the partition of the input sequences.
The BEBaC analysis is now complete.
6. Examples of extra commands
Type:
BEBaC.sh viewQuality final_result.mat "1 3:4"
This displays the quality scores for consensus sequences 1, 3 and 4.
You will see a figure displaying the quality score and other information.
Type:
BEBaC.sh calGroupDis final_result.mat testseq.groups groupOTUdistribution.txt
This calculates the OTU distribution for each pre-defined group.
"testseq.groups" contains the pre-defined group information of each read.
"groupOTUdistribution.txt" is the output file, each column
of which contains the OTU distribution of one pre-defined group.
Type:
BEBaC.sh seqAlnCluster L1-clusters/1/seqs.aln 4 tmp.mat
This applies CT's clustering algorithm (see the fine clustering section in our paper [1]) to the sequences in
crude cluster 1.
"L1-clusters/1/seqs.aln" is the multiple sequence alignment file in FASTA format.
"4" is the maximum number of clusters.
"tmp.mat" stores the output in MATLAB format. You will also get a file "tmp.mat.txt",
which shows the partition of the input sequences.
Commands and outputs
NOTE: this section describes running the BEBaC source code in Matlab. For the compiled version of
BEBaC, please transform “cmd(arg1,arg2)” to “cmd arg1 arg2”.
All string inputs should be in single quotes. For example, if the input sequence file is reads.fasta,
it should be written as ‘reads.fasta’.
1. add_BEBaC_to_path
Function: Add the BEBaC commands to the system path, so that the commands can be
recognized by the system.
Input: none
Output: none
2. outfile = preprocSeq(seqFile,outDir)
Function: Preprocess the sequences in the input file. If “reads.mat” already exists in “outDir”,
“seqFile” will not be processed.
Input: “seqFile” is the file that contains the reads; “outDir” is the output directory for storing all
results. Input sequences should contain only “ACGT”; “N” and “-” characters are not allowed.
Output: the preprocessed reads are saved in the file “reads.mat”; “outfile” is the full path to
it.
3. preGroup(procSeqFile,outDir)
Function: assign the preprocessed reads to pregroups. Running this command will overwrite
existing results.
Input: “procSeqFile” is the path to the preprocessed sequence file, i.e. “reads.mat”; “outDir” is
the output directory for storing all results, the same as in the previous step.
Output: “pregroup_initial_K=3.mat” is the result file of the initial K-means clustering, where “3”
is the number of initial clusters; “pgdist” is the folder in which the distance matrices for
the reads in each initial cluster are stored; “pregroup_final_K=992.mat” contains the final
pregroups, where “992” indicates that there are 992 pregroups in total.
Note: computing the distance matrices might be slow for large datasets. We suggest using
the parallel commands; please see details in section “Extra Commands”.
4. outfile = clusterL1(N_L1_CLUSTER, pregroupResFile, outDir)
Function: cluster the pregroups into crude clusters. N_L1_CLUSTER will be doubled up to 4
times, i.e. up to 16*N_L1_CLUSTER, if the number of detected crude clusters reaches the current limit.
Input: “N_L1_CLUSTER” is the maximum number of crude clusters; “pregroupResFile” is the
result file of pregroup, i.e. “pregroup_final_K=#.mat”; “outDir” is the output directory for storing
all results, the same as in the previous step.
Output: “L1_clusters_K=#.mat” contains the results of crude clustering, where “#” is the
number of crude clusters learned by the model.
“L1_clusters_size_K=#.txt” contains the sizes of the crude clusters. Crude clusters are labeled in
descending order of size.
“crudeLabels.txt” shows the crude cluster label of each read.
Note: “N_L1_CLUSTER” might be increased if BEBaC finds it too small. Please see section
“BEBaC work flow”, subsection “Crude clustering”.
5. clusterL2(L1cluster_ids, L1cluster_preproc_file, outDir)
Function: perform fine clustering for the given crude clusters.
Input: “L1cluster_ids” are the indices of the crude clusters, e.g. [5:10] means crude clusters 5 to 10;
“L1cluster_preproc_file” is the result file of crude clustering, i.e. “L1_clusters_K=#.mat”; “outDir”
is the output directory for storing all results, the same as in the previous step.
Output: the results are stored in the folder “L1-clusters/cluster_id”. “seqs.fasta” contains the reads in
the crude cluster; “seqs.aln” is the multiple alignment file for the sequences in that cluster;
“resK=#.mat” contains the fine clustering results obtained by setting MAX_K=#; “entropy_change.ps” shows
how the DL (description length) changes with K; “cluster.ps” shows the clustering results for
different K; “final_result#.mat” contains the final fine clustering results for crude cluster #.
6. fetchConsensus(nL1Cluster, outDir)
Function: collect the consensus sequence information. It should be used after all fine clustering
results are available.
Input: “nL1Cluster” is the number of crude clusters; “outDir” is the output directory for storing all
results, the same as in the previous step.
Output: the results are saved in “final_result.mat” in “outDir”. Other information is available
in the folder “results”.
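Several result files above carry the cluster count in their names (the “#” in “pregroup_final_K=#.mat” and “L1_clusters_K=#.mat”). When chaining the steps from a script, that number can be read back from the file name. The following Python sketch is a hypothetical helper, not part of BEBaC.

```python
import re

# Matches the ".mat" result files named in this section, e.g. "L1_clusters_K=4.mat".
RESULT_PATTERN = re.compile(
    r"^(pregroup_initial|pregroup_final|L1_clusters)_K=(\d+)\.mat$"
)


def parse_result_name(filename):
    """Return (stage, K) for a BEBaC result file name, or None if it does not match."""
    match = RESULT_PATTERN.match(filename)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

The extracted K can then be passed on, for example as the number of crude clusters given to fetchConsensus.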
Extra Commands
Commands 1~3 are meant for parallel use in the pregroup stage. Command 4 is meant for
corrections in the fine clustering stage.
1. preGroup(procSeqFile,outDir, 'initCluster') or preGroup(procSeqFile,outDir, 1)
Function: perform only the initial clustering in pregroup.
Input: the same as the “preGroup” command except for the 3rd parameter, which can be
‘initCluster’, 1 or ‘1’.
Output: “pregroup_initial_K=3.mat” will be the result file.
2. preGroup(procSeqFile,outDir, 'calDisMat', cluster_ids) or preGroup(procSeqFile,outDir, 2,
cluster_ids)
Function: calculate the distance matrices for the given cluster_ids, which refer to the initial clusters.
Input: the same as the “preGroup” command except for the 3rd and 4th parameters; the 3rd
parameter can be ‘calDisMat’, 2 or ‘2’; the 4th parameter indicates the initial clusters, e.g.
[1:3].
Output: distance matrix “dismat#.mat” in the folder “pgdist”, where “#” is a cluster_id.
3. preGroup(procSeqFile,outDir, 'pregroup') or preGroup(procSeqFile,outDir, 3)
Function: assign the reads to pregroups; it should be used after all distance matrices are available.
Input: the same as “preGroup” except for the 3rd parameter, which should be ‘pregroup’, 3 or
‘3’.
Output: result file “pregroup_final_K=992.mat”, where “992” is the number of pregroups.
4. clusterL2(L1cluster_ids, L1cluster_preproc_file, outDir, MAX_K)
Function: perform fine clustering for the given crude clusters, with MAX_K set to a larger
value.
Input: the same as “clusterL2” except for the 4th parameter, which the user should set to a larger value.
Please check the previous output to choose MAX_K.
Output: the same as “clusterL2”.
5. h=viewQuality(resFile, conseq_ids)
Function: view the quality scores for a given set of consensus sequences.
Input: “resFile” is “final_result.mat”; “conseq_ids” are the consensus sequence ids, e.g. [3 13 50] draws
the quality score and other information for consensus sequences 3, 13 and 50.
Output: a figure displaying the quality score and other information.
Note: Here we only handle one gene.
6. partition = seqAlnCluster(alnFile, nMaxPops, resFile)
Function: perform the clustering algorithm described in [2]. This function can help the user
determine how many fine clusters there could be in a crude cluster. The user can then
use this number to guide command 4.
Input: “alnFile” is the alignment file of the reads in FASTA format; “nMaxPops” is the maximum
number of clusters; “resFile” is the output file in which the results are stored.
Output: “partition” contains the label of each read after clustering.
Note: Here we only handle one gene.
7. h = calGroupDis(resFile,groupFile,outFile)
Function: if pre-defined group information is available for each read, this function calculates
the OTU distribution for each group.
Input: “resFile” is the result file ‘final_result.mat’; “groupFile” is the file of pre-defined groups,
each row of which corresponds to one input read; “outFile” is the name of the output file.
Output: a matrix of size #OTUs * #pre-defined groups is stored in outFile; “h” is the figure
handle of the plot showing the OTU distributions for these pre-defined groups.
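The #OTUs * #pre-defined groups matrix produced by calGroupDis can be illustrated with a short Python sketch. This is not the BEBaC code: it assumes one fine-cluster (OTU) label and one group label per read, in the same read order, and it reports raw read counts, whereas the actual output may be normalized.

```python
def otu_distribution(fine_labels, group_labels):
    """Count reads per (OTU, group); rows are OTUs, columns are groups.

    Returns (otus, groups, matrix) with otus and groups in sorted order.
    """
    if len(fine_labels) != len(group_labels):
        raise ValueError("need exactly one group label per read")
    otus = sorted(set(fine_labels))
    groups = sorted(set(group_labels))
    row = {otu: i for i, otu in enumerate(otus)}
    col = {group: j for j, group in enumerate(groups)}
    matrix = [[0] * len(groups) for _ in otus]
    for otu, group in zip(fine_labels, group_labels):
        matrix[row[otu]][col[group]] += 1
    return otus, groups, matrix
```

Each column of the resulting matrix then describes how the reads of one pre-defined group are spread over the OTUs.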
Parallel work flow
Figure-2: the work flow of parallel usage of BEBaC.
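On a cluster, the distance matrix computations (one per initial cluster) and the fine clustering runs (one per crude cluster) are independent of each other and can be submitted as separate jobs. The Python sketch below only generates the command lines, using the command syntax from the Installation section; the function name is hypothetical, and how the lines are submitted depends on your scheduler.

```python
import shlex


def parallel_commands(reads_file, out_dir, n_initial, l1_file, n_crude):
    """Return {stage: [command lines]}; commands within a stage are independent.

    The "calDisMat" commands can run after initCluster has finished, and the
    "clusterL2" commands after clusterL1 has finished.
    """
    reads_file, out_dir, l1_file = (
        shlex.quote(arg) for arg in (reads_file, out_dir, l1_file)
    )
    return {
        "calDisMat": [
            f"BEBaC.sh preGroup {reads_file} {out_dir} calDisMat {k}"
            for k in range(1, n_initial + 1)
        ],
        "clusterL2": [
            f"BEBaC.sh clusterL2 {c} {l1_file} {out_dir}"
            for c in range(1, n_crude + 1)
        ],
    }
```

For the test data in the Installation section this yields four calDisMat commands and four clusterL2 commands.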
Interpretation of outputs
1. entropy_change.ps in folder e.g. L1-clusters/3/
Figure-3: This figure shows how the DL and entropy change with K. The
software calculates them for K=1,2,...,USER_DEF_K (USER_DEF_K=10 by default). It is also
possible that only K=1 to K=4 are calculated, if the algorithm in [2] finds at most 4
clusters in the data.
2. cluster.ps in folder e.g. L1-clusters/3/
Figure-4: This figure shows the fine clusters. The x-axis is the reshuffled index of the reads; the y-axis is
the number of fine clusters. Each color represents a fine cluster for a specific K. Note that the
same color does not necessarily denote the same cluster for different K. This figure shows that
the reads are partitioned into 2 fine clusters, indicated by red and green.
3. conseq1-16.ps in folder “results”
Figure-5: This figure shows the quality scores of consensus sequences 1 to 16 and their
proportions in the data, as well as the average deviation from the consensus sequence within each
fine cluster. The x-axis shows the sites of the consensus sequence. Note: the quality score
can be lower than 0.5.
4. Header of “conseq.fasta” and “conseq.qual”
A sample header looks like this
>conseq1 | counts= 1805 | totalCount= 8961 | percentage=20.1428% | avgDev=1.0517%
“conseq1” means consensus sequence 1, which is derived from fine cluster 1; “counts” is
the number of reads in fine cluster 1; “totalCount” is the total number of reads in the
input data; “percentage” is “counts” divided by “totalCount”; “avgDev” is the average
deviation of the reads in fine cluster 1 from consensus sequence 1.
5. “crudeLabels.txt” and “fineLabels.txt”
The two files contain the labels of the input sequences according to crude clustering results and
fine clustering results. The 1st line corresponds to the 1st read in the input sequence file; the
2nd line corresponds to the 2nd read in the input sequence file; and so on.
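The header format shown in item 4 can be parsed with a few lines of Python when the results are post-processed outside Matlab. The function name is hypothetical, and the sketch assumes that every header follows the sample layout, with fields separated by “|” and written as key=value.

```python
def parse_conseq_header(header):
    """Parse a conseq.fasta / conseq.qual header line into a dict."""
    fields = [field.strip() for field in header.lstrip(">").split("|")]
    info = {"name": fields[0]}
    for field in fields[1:]:
        key, value = field.split("=", 1)
        value = value.strip()
        if value.endswith("%"):
            # Percentages are kept as plain numbers, e.g. 20.1428.
            info[key.strip()] = float(value[:-1])
        else:
            info[key.strip()] = int(value)
    return info
```

For the sample header, “percentage” agrees with “counts” divided by “totalCount”: 1805/8961 is approximately 20.1428%.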
References
[1] Cheng, L., Walker, A.W. and Corander, J. (2012) Bayesian estimation of bacterial community
composition from 454 sequencing data. Nucleic Acids Research. doi: 10.1093/nar/gks227
[2] Corander, J. and Tang, J. (2007) Bayesian analysis of population structure based on linked
molecular information. Mathematical Biosciences. 205:19-31.
[3] Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Research. 32(5):1792-1797.