EMMSA Researcher Guide.

OVERVIEW:
16S rRNA has become a popular marker for classifying microbial organisms. As more microbial
organisms are discovered, their whole genomes are also becoming available thanks to the
proliferation of next-generation sequencers. Collecting microbial samples from water and soil for
national sequencing projects, in an effort to define the national microbial environment, has become
a matter of national interest. Such efforts lead to national repositories of several thousand 16S
rRNA sequences, which are then typically organized into a multiple sequence alignment for further
analysis. Because EMMSA does not require prior alignment and provides quasi-alignment methods for
modeling, classification and all-against-all analysis, some researchers prefer to classify their
sequences using EMMSA.
This document is intended for those who have collected 16S sequence data from a variety of sources
and would like to perform classification and phylogenetic analyses on them.
EMMSA-based sequence analysis involves the following components:
1. Preprocess: Transform all the 16S rRNA sequences to Numerical Summarization Vectors (NSVs).
a. Prepare a FASTA file with an expanded header as described in the appendix.
b. Group related sequences for an organism into a single FASTA file.
c. Build count files for the sequence files of interest; these are used to build models (EMMs)
for further analysis. There are two types of count files:
- Numerical Summarization Vector (NSV) files that count 3-mers with or without
overlap. Overlap is useful for metagenomic fragment classification. For
example, 3-80-79 counts 3-mers in an 80 bp segment with an overlap of 79 bp
between successive segments: the first segment spans positions 1-80, the second
2-81, and so on.
- Individual Count Vector (ICV) files that count 1-mers, i.e., individual nucleotide
counts for each segment. For example, the ICV corresponding to an NSV of 3-80-79
is 1-80-0. Thus a vector of size 4 is created for each segment of 80 bases, counting
the occurrences of the individual letters.
Typically, when we say NSVs or count vectors, both NSV and ICV are implied. ICVs are also called
Individual Probability Vectors (IPVs) when summarized to probabilities.
2. Build: Generate EMMs for the training set as well as the query set.
a. Create/edit a descriptor file that defines each count file usable toward model building.
b. Build Models.
3. Classify: Determine the classification for each query sequence.
a. Build a descriptor file that divides models into train and test.
b. Run classifier.
c. Extract and interpret classifier results.
4. Differentiate: Build a distance matrix.
a. Create/edit a descriptor file to specify candidates for all-against-all analysis.
b. Run differentiator.
c. Extract and create a distance matrix file for phylogenetic analysis.
EMMSA uses and outputs various files; their detailed description is included in the appendix.
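The word-window-overlap notation used above (for example 3-80-79 and 1-80-0) can be illustrated with a minimal base-R sketch. It is not the counter.R implementation shipped with EMMSA: the function name count_words is hypothetical, and the sketch assumes that a final partial window is simply dropped and that only the letters A, C, G and T are counted.

```r
# Hypothetical helper (not part of EMMSA): count all words of length `word`
# in successive windows of `window` bases that share `overlap` bases.
count_words <- function(sequence, word = 3, window = 80, overlap = 0) {
  stopifnot(overlap < window, nchar(sequence) >= window)
  bases <- c("A", "C", "G", "T")
  # Every possible word over A/C/G/T in alphabetical order (64 for word = 3).
  grid <- expand.grid(rep(list(bases), word), stringsAsFactors = FALSE)
  words <- do.call(paste0, unname(as.list(rev(grid))))
  sequence <- toupper(sequence)
  # Window start positions; a trailing partial window is dropped (assumption).
  starts <- seq(1, nchar(sequence) - window + 1, by = window - overlap)
  counts <- vapply(starts, function(s) {
    segment <- substr(sequence, s, s + window - 1)
    pos <- seq_len(window - word + 1)
    found <- substring(segment, pos, pos + word - 1)
    as.integer(table(factor(found, levels = words)))
  }, integer(length(words)))
  counts <- t(counts)            # one row per window, one column per word
  colnames(counts) <- words
  counts
}

toy <- strrep("ACGT", 25)                      # made-up 100-base sequence
nsv <- count_words(toy, word = 3, window = 80, overlap = 79)  # "3-80-79"
icv <- count_words(toy, word = 1, window = 80, overlap = 0)   # "1-80-0"
dim(nsv)   # 21 windows by 64 possible 3-mers
icv        # a single window with one count per nucleotide
```

With an overlap of 79 the window advances one base at a time, which is why the 100-base toy sequence yields 21 NSV rows but only one ICV row.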
PREPROCESS:
Required: R
Sources, binaries and documentation for R can be obtained via CRAN, the “Comprehensive R Archive
Network”, whose current mirrors are listed at http://cran.r-project.org/mirrors.html. After you
install R, start R, then install and load the package “seqinr” from the nearest mirror site. In
addition, you will need a few more R programs, which you can download from
http://lyle.smu.edu/cse/dbgroup/IDA/EMMSA/rnaCounter.zip.
You will need the following:
1. Directory where the FASTA files are kept as .wri files (default: seqfiles)
2. Directory where the additional R programs are kept (default: rna_counters)
3. Directory where the output count files can be copied to (default: 3mer)
The RNA_counter zip file contains the following files:
1. counter.R //not used directly by user; called from batch_counter below
2. batch_counter_3mer.R //invoked by user to generate NSV count files
3. batch_counter_1mer.R //invoked by user to generate ICV count files
4. sample.R //invoked for creating train and test partitions
5. sample10x.R //invoked for creating 10X cross validation table
Preparing FASTA files for the 16S rRNA sequences:
A microbial organism is associated with one or more 16S rRNA sequences. Typically a 16S rRNA
sequence is approximately 1500 bases long. For example, the NCBI FTP site lists all the 16S rRNA
sequences for an organism. In general, we would like to create one FASTA file containing all the
copies of the 16S rRNA that are found for a given organism. This complete information is
subsequently used for building a model, i.e., an EMM, for the organism.
However, you may have one or more 16S rRNA sequences whose classification you are trying to find.
In such cases, keep them in separate FASTA files with appropriate headers. The header
format for a FASTA file is described in the appendix.
Sometimes you may have only a sequence fragment whose location along a 16S rRNA sequence is
not known, but you would still like to determine its classification. This is usually the case
with metagenomic or clinical samples. In this case too, create a separate FASTA file for each
fragment you are trying to classify.
Thus we have three types of 16S rRNA sequence data: 1) a complete list of complete 16S rRNAs for
some known/unknown organism, 2) a partial list of complete 16S rRNAs for some known/unknown
organism and 3) one or more sequence fragments of unknown position from some unknown organism.
The first type of sequence data, i.e., the complete list of complete 16S rRNAs, is used to build a
training library of models which serve as the classification labels. The remaining two types are
used as the test or query sequences to be classified.
Generating count (NSV/ICV) files:
What follows is the R code for batch_counter_3mer.R, located in the rna_counter directory; we will
use this file, with different parameter settings, to generate the counts. Note that the
directory "Ssixteen/organismFiles" is where the FASTA files are to be kept. If you have them in a
different directory, you will need to modify the R file accordingly.
source("counter.R")   # helper functions called below
library("seqinr")     # read.fasta() and getAnnot()

ss <- 99999           # passed to make_stream() as ss_val
window <- 80          # segment length in bases
overlap <- 0          # bases shared by successive segments
#overlap = 79
last_window <- FALSE
word <- 3             # word size: 3-mers for NSV counts
dir <- "Ssixteen/organismFiles/"
all_sequences <- list()

# remove count files left over from a previous run
unlink(paste(dir, "*.txt", sep="/"))
files <- list.files(dir)
for(f in files){
  f <- paste(dir, f, sep="/")
  cat("doing", f, "\n")
  if(!inherits(try(sequences <- read.fasta(f)), "try-error")){
    # parse the header of the first sequence into field/value pairs
    annot <- getAnnot(sequences)
    annot <- sub(">", "", annot)
    annot <- strsplit(annot, split=": ")
    annot <- matrix(annot[[1]], ncol=2, byrow=TRUE)
    desc <- annot[,2]
    names(desc) <- annot[,1]

    cnt <- count_sequences(sequences,
      window=window, overlap=overlap, word=word,
      last_window=last_window)
    stream <- make_stream(cnt, ss_val=ss)

    # build the output file name from the header fields
    desc["org name"] <- gsub("\\W", "-", desc["org name"])
    f <- paste(desc["genus"], desc["species"], desc["org name"], sep="_")
    f <- paste(dir, "/", f, "_counts.txt", sep="")
    write.table(stream, file = f,
      sep = "\t", col.names=FALSE, row.names=FALSE)

    ### keep the counts so they can be saved after the loop
    all_sequences[[desc["genus"]]][[desc["species"]]][[desc["org name"]]] <- cnt
  }
}
save(all_sequences, file="all_ssequences.rda")
The organism files, already in the appropriate FASTA format, are included at
http://lyle.smu.edu/cse/dbgroup/IDA/EMMSA/emmsaTools.zip.
[Screenshot: organismFiles directory]
To invoke batch_counter_3mer.R, use the R command source("batch_counter_3mer.R"). You may wish
to consult the R user guide to learn how to execute R commands.
[Screenshot: console output of batch_counter_3mer.R]
Once all files are processed, the count (NSV) files are collected in the same directory as
tab-delimited text files.
[Screenshot: organismFiles directory with count files]
Since we will run the batch counter again to generate the 1-mer count files, i.e., the ICVs, you
need to copy the .txt files elsewhere (for example to a new directory “3mer”, which you should have
created), as they will be removed at the start of the next run. Then run batch_counter_1mer.R,
which will generate the required 1-mer count files. To keep everything in one place, copy the
3-mer count files back to the directory where the FASTA and 1-mer count files are located.
Creating train and test datasets:
Two types of testing are possible: 1) partitioning and 2) 10X cross validation. Partitioning
datasets into training and test sets is done by the sample.R program.
[Listing: sample.R]
The sample.R program randomly selects 10% of the organisms as test organisms of the structure
<phylum.class.organism> and the remainder as training organisms; the training organisms’ sequence
data is then collected as separate files of the structure <phylum.class>, which are subsequently
used by the EMMSA program to generate training models.
[Screenshot: console output of sample.R]
The output files are written to the “\EMMSA\rna-counter\data” directory. The output file names
contain the substring “train” for files to be used for training models and “test” for files to be
used as test subjects for classification.
[Screenshot: data directory]
Thus, once sample.R has run, there is one directory, “data”, where all the training and test files
are collected. It is good practice to separate these into two directories, one for training and one
for test. This lets the EMMSA user refer to the training directory for model building and the test
directory for evaluation/classification. This is easily done in DOS.
The first step is to rename data to something like 2-50-0 to record the input parameters used for
count generation. Next, create two new empty directories such as 2-50-0-train and 2-50-0-test. Then
copy the files whose names contain “train” into 2-50-0-train and those whose names contain “test”
into 2-50-0-test. The directories are then ready for use with the EMMSA program.
[Screenshot: DOS session]
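For readers who prefer to stay in R rather than DOS, the same separation can be sketched with base-R file functions. This is only an illustration: split_partitions is a hypothetical helper, the demo file names are made up, and it assumes the partitioned count files end in _train.txt and _test.txt.

```r
# Hypothetical helper (not part of EMMSA): copy the partitioned count files
# from data_dir into "<prefix>-train" and "<prefix>-test".
split_partitions <- function(data_dir, prefix) {
  copied <- list()
  for (part in c("train", "test")) {
    target <- paste0(prefix, "-", part)
    dir.create(target, showWarnings = FALSE, recursive = TRUE)
    # Assumes file names end in "_train.txt" or "_test.txt".
    files <- list.files(data_dir, pattern = paste0("_", part, "\\.txt$"),
                        full.names = TRUE)
    file.copy(files, target)
    copied[[part]] <- basename(files)
  }
  copied
}

# Demo in a temporary directory with two empty, made-up count files.
root <- tempfile("emmsa_demo")
data_dir <- file.path(root, "2-50-0")
dir.create(data_dir, recursive = TRUE)
file.create(file.path(data_dir, c("Firmicutes.Bacilli_train.txt",
                                  "Firmicutes.Bacilli.demo_test.txt")))
result <- split_partitions(data_dir, file.path(root, "2-50-0"))
list.files(root, recursive = TRUE)
```

The original files stay in place, so the 2-50-0 directory can still be used for the later steps.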
With the train and test directories thus created, the user may proceed to launch EMMSA for
partition testing; however, such a test is not considered rigorous by some. 10X cross validation is
better suited to rigorous validation of a classifier. It requires a Descriptor file that specifies
which organisms are used for test or train in each of the 10 runs of a 10X cross validation.
The file format for the Descriptor is described in the appendix.
Generating the Descriptor file:
The sample10x.R program is run after the count files are generated. Since 10X cross validation
requires that each class contain at least 10 members, the program first selects for training only
those classes that satisfy this requirement and makes the others test candidates. For the classes
selected for training, it numbers the members of each class sequentially from 1 to 10, starting
again from 1 after every 10 members. When all the members of a class are thus numbered, their order
is randomized in R to remove any implicit order that might exist among the members.
For example, if there are classes A, B and C, with A and B at 15 members each and C with fewer than
10 members, the program will make all of class C test candidates and make classes A and B training
candidates. It will then number the 15 members of each training class sequentially, wrapping around
after 10, so that each member has a number in the range 1-10. Once numbered, the order is
randomized so that the first member in class A is not necessarily a 1, the second is not
necessarily a 2, and so on.
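The numbering scheme in the A/B/C example can be sketched in a few lines of base R. This is not the sample10x.R source: assign_folds is a hypothetical function, NA is used here to mark test candidates, and the "subfold" column that appears in the descriptor file is not modeled.

```r
# Hypothetical helper (not part of EMMSA): give every member of a class with
# at least k members a fold number 1..k; smaller classes get NA (test only).
assign_folds <- function(classes, k = 10) {
  sizes <- table(classes)
  fold <- rep(NA_integer_, length(classes))
  for (cl in names(sizes)[sizes >= k]) {
    idx <- which(classes == cl)
    numbers <- rep_len(seq_len(k), length(idx))     # 1..10, then 1..5 for 15 members
    fold[idx] <- numbers[sample.int(length(idx))]   # shuffle to remove implicit order
  }
  fold
}

set.seed(1)                                          # only to make the demo repeatable
classes <- c(rep("A", 15), rep("B", 15), rep("C", 4))
fold <- assign_folds(classes)
table(classes, fold, useNA = "ifany")
```

In the resulting table, classes A and B each have two members in folds 1 to 5 and one member in folds 6 to 10, while all four members of class C fall in the NA (test candidate) column.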
A sample descriptor file is shown in the BUILD section.
By the end of this step, i.e., preprocessing, you should have one directory containing:
1. FASTA files (.wri) and count files (.txt) of two types, i.e., NSV and ICV
2. Descriptor file (?10x.csv) and
3. Partitioned count files (?_train.txt, ?_test.txt) [OPTIONAL].
Let us refer to this directory simply as DATA in the subsequent sections.
BUILD:
For building models, you need the following:
1. DATA directory containing count files and the descriptor file.
2. Model level {4 for strain level, 3 for species, 2 for class, 1 for phylum}
3. Directory where the model files will be written to.
4. Real time Visualization [OPT; def = off] //displays the building of EMM in real time graphically
5. Node scale, link scale and maxcolors //visualization sliding bars (see the builder GUI screenshot)
6. Delete old models flag [OPT; def = off] //deletes the old models from the model directory first.
7. Clustering threshold
Download the EMMSA builder from http://lyle.smu.edu/cse/dbgroup/IDA/EMMSA/emmsaTools.zip.
The zip file includes a directory with current set of model files and a few descriptor files which you can
modify if required. The program can be invoked on a PC from MSDOS as described in the Readme file
(also in the zip folder).
The model level (ModelLevel) is set to 2 (genus) by default; at this point, the only way to change
it is to edit the code at module jemmview:line 1114 and recompile EMMSA. If the model level is
changed, the model name derivation should also be changed at line 1136 (name mname) for everything
to work correctly. This will become a configuration-file-based setting in the near future.
The clustering threshold is set to 0.01 by default and can be changed from the GUI. It represents
the percentage of change in the sum total of twice the unity-valued p-mer frequencies for a given
EMM state; this sum is referred to as the basis constant. For example, for a 3-mer EMM, if an EMM
state has all 3-mers, i.e., 64 counts, set to 2, the basis constant becomes 128, and a 0.1 threshold
equates to a value of 0.1*128=12.8. In this example, the 0.01 threshold setting would use a value of
1.28. The incoming NSV can be conditionally clustered into that state if its sum of 3-mer
frequencies is less than the threshold of 12.8 or 1.28 (for the 0.1 and 0.01 threshold settings
respectively). The condition is that the state to cluster into must be the one to which the
incoming NSV has the least squared Euclidean distance. The reader may refer to the EMMSA
publications included in the References section of this document. To change the basis constant, the
user would need to recompile the EMMCluster module after changing the number 130 {approximately
128!} in the call determineStateSimilarity.
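The threshold arithmetic can be reproduced with a short R sketch. The first part follows the numbers in the text directly. The second part is only one possible reading of the clustering rule: it assumes that the quantity compared against the cutoff is the squared Euclidean distance to the nearest state, and nearest_state is a hypothetical helper, not the EMMCluster code.

```r
word <- 3
n_words <- 4^word                        # 64 possible 3-mers
basis_constant <- 2 * n_words            # every count set to 2 gives 128
cutoff <- c("0.1" = 0.1, "0.01" = 0.01) * basis_constant
cutoff                                   # 12.8 and 1.28

# Hypothetical helper: find the state (row of `states`) closest to `nsv`
# by squared Euclidean distance.
nearest_state <- function(nsv, states) {
  diffs <- sweep(states, 2, nsv)         # subtract nsv from every state row
  d2 <- rowSums(diffs^2)
  best <- which.min(d2)
  list(state = unname(best), dist2 = unname(d2[best]))
}

# Two made-up states and an incoming NSV that differs from state 1 by one
# count in each of two 3-mers (squared distance 2).
states <- rbind(rep(2, n_words), rep(0, n_words))
nsv <- rep(2, n_words)
nsv[1:2] <- 3
hit <- nearest_state(nsv, states)
hit$dist2 < cutoff                       # merge at 0.1, new state at 0.01 (under this reading)
```

Under this assumption the same NSV would be absorbed into state 1 with a 0.1 threshold but would start a new state with the default 0.01 threshold, which illustrates how the setting controls model granularity.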
Sample Descriptor file {first line is header}
"V1" "V2" "V3" "V4" "V5" "fold" "subfold"
"Firmicutes" "Bacilli" "Anoxybacillus" "Anoxybacillus flavithermus" "Anoxybacillus-flavithermus-WK1" 3 "4, 0, 3, 0, 1, 2, 0, 5"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus amyloliquefaciens" "Bacillus-amyloliquefaciens-FZB42" 9 "4, 2, 0, 1, 0, 0, 0, 5, 3"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus anthracis" "Bacillus-anthracis-str---Ames-Ancestor-" 4 "3, 5, 4, 0, 4, 2, 2, 1, 1, 5, 3"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus anthracis" "Bacillus-anthracis-str--Ames" 7 "5, 4, 5, 0, 1, 1, 4, 3, 2, 2, 3"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus anthracis" "Bacillus-anthracis-str--Sterne" 6 "5, 2, 1, 2, 4, 5, 0, 1, 3, 3, 4"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus cereus" "Bacillus-cereus-AH187" 8 "4, 2, 1, 5, 3, 2, 3, 1, 0, 5, 4, 0, 0, 0"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus cereus" "Bacillus-cereus-AH820" 7 "1, 1, 2, 0, 2, 3, 5, 0, 3, 4, 4, 5"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus cereus" "Bacillus-cereus-ATCC-10987" 5 "5, 1, 4, 0, 5, 2, 2, 4, 1, 3, 0, 3"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus cereus" "Bacillus-cereus-ATCC-14579" 6 "2, 2, 0, 1, 0, 5, 0, 3, 3, 1, 4, 4, 5"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus cereus" "Bacillus-cereus-B4264" 8 "1, 2, 4, 3, 2, 5, 3, 0, 0, 0, 4, 0, 5, 1"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus cereus" "Bacillus-cereus-G9842" 9 "1, 3, 0, 5, 0, 2, 3, 4, 5, 2, 0, 1, 4"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus cereus" "Bacillus-cereus-E33L" 2 "1, 3, 4, 1, 2, 5, 3, 4, 0, 5, 2, 0, 0"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus cytotoxicus" "Bacillus-cereus-subsp--cytotoxis-NVH-391-98" 3 "0, 1, 3, 5, 2, 4, 5, 3, 4, 2, 0, 0, 1"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus clausii" "Bacillus-clausii-KSM-K16" 4 "4, 5, 0, 0, 3, 2, 1"
Invoking the program brings up the builder GUI.
[Screenshot: EMMSA builder GUI]
The user selects the descriptor file by clicking on the button for it. After overriding defaults if
required, the user presses start to initiate the building process, which will run for a while. The
user may select the MS-DOS console to see the running output. If the program crashes due to
insufficient heap storage, the user may reinvoke the program with the -Xmx option, which allows
non-default heap sizes.
Here is an example invocation with the heap size set to 1000 MB.
java -Xmx1000m -jar JEMM++.jar
When the GUI comes up, the user may choose to uncheck the Delete models button to continue building
models from where the last attempt stopped instead of starting over. The user may edit the
descriptor file to reduce the number of models to build by simply deleting some of the lines.
Similarly, the user may also choose to experiment with visualization and thresholds.
CLASSIFY:
For classifying sequences, you need the following:
1. DATA directory containing count files and the descriptor file.
2. Model level {4 for strain level, 3 for species, 2 for class, 1 for phylum} //set to 3
3. Directory where the model files were written to. //default provided in source code.
4. Directory where the classifier output files are written to. //default provided in source code.
5. Clustering threshold //set to 0.01
Download from http://lyle.smu.edu/cse/dbgroup/IDA/EMMSA/emmsaTools.zip unless already done
when building models.
The zip file includes a directory with current set of model files and a few descriptor files which you can
modify if required. The program can be invoked on a PC from MSDOS as described in the Readme file
(also in the zip folder).
The model level (ModelLevel) is set to 3 (species) by default; at this point, the only way to change
it is to edit the code at module jemmview:line 1090 and recompile EMMSA. If the model level is
changed, the model name derivation should also be changed at line 1105 (name mname) for everything
to work correctly. This will become a configuration-file-based setting in the near future.
The clustering threshold is set to 0.01 by default and can be changed from the GUI. It represents
the percentage of change in the sum total of twice the unity-valued p-mer frequencies for a given
EMM state; this sum is referred to as the basis constant. For example, for a 3-mer EMM, if an EMM
state has all 3-mers, i.e., 64 counts, set to 2, the basis constant becomes 128, and a 0.1 threshold
equates to a value of 0.1*128=12.8. In this example, the 0.01 threshold setting would use a value of
1.28. The incoming NSV can be conditionally clustered into that state if its sum of 3-mer
frequencies is less than the threshold of 12.8 or 1.28 (for the 0.1 and 0.01 threshold settings
respectively). The condition is that the state to cluster into must be the one to which the
incoming NSV has the least squared Euclidean distance. The reader may refer to the EMMSA
publications included in the References section of this document. To change the basis constant, the
user would need to recompile the EMMCluster module after changing the number 130 {approximately
128!} in the call determineStateSimilarity.
[Screenshot: EMMSA classifier GUI]
After locating the Descriptor and pressing the start button, the classifier runs through the
entries of the descriptor file, performing the following:
1. For a given model level, collect all entries in the descriptor; for each entry therein, add to
the current profile model using an already built organism (strain) model.
2. For each profile model thus built, for each test entry, generate classification metrics.
3. When all models for all test entries are processed, generate the classification metrics report.
4. Whenever two or more models generate the same metric value for the preferred metric,
generate a “ties” report for further analysis by the user.
The output file formats are described in the appendix.
Currently, all entries in the Descriptor file are treated as test entries, i.e., each and every
strain is evaluated against each and every species. Many other tests are possible, but they require
some study of the code, because capturing all combinations in an external configuration file gets
very complicated.
For example, the module jemmview:line 1172 or thereabouts can be changed to specify the fold
number of the entries to consider for test, assuming that those entries in the descriptor file are
changed to reflect the chosen fold number.
Similarly, by changing the fold loop counter, the user can also run 10X cross validation, in which
case, for every pass, models are generated using all entries except those in the fold set aside for
test, and only the entries numbered with the test fold (loop counter) are used for test.
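The fold logic described above can be sketched outside EMMSA in R, using four rows of the sample descriptor. This is not the jemmview code: fold_split is a hypothetical helper, and reading the descriptor with read.table assumes that the file is whitespace-separated with quoted fields, as the sample suggests.

```r
# Four rows taken from the sample descriptor shown in the BUILD section.
descriptor_text <- '"V1" "V2" "V3" "V4" "V5" "fold" "subfold"
"Firmicutes" "Bacilli" "Anoxybacillus" "Anoxybacillus flavithermus" "Anoxybacillus-flavithermus-WK1" 3 "4, 0, 3, 0, 1, 2, 0, 5"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus amyloliquefaciens" "Bacillus-amyloliquefaciens-FZB42" 9 "4, 2, 0, 1, 0, 0, 0, 5, 3"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus anthracis" "Bacillus-anthracis-str---Ames-Ancestor-" 4 "3, 5, 4, 0, 4, 2, 2, 1, 1, 5, 3"
"Firmicutes" "Bacilli" "Bacillus" "Bacillus anthracis" "Bacillus-anthracis-str--Ames" 7 "5, 4, 5, 0, 1, 1, 4, 3, 2, 2, 3"'
desc <- read.table(text = descriptor_text, header = TRUE,
                   stringsAsFactors = FALSE)

# Hypothetical helper: entries numbered k are held out for test,
# all other entries are used to build the models for that pass.
fold_split <- function(desc, k) {
  list(train = desc[desc$fold != k, ], test = desc[desc$fold == k, ])
}

for (k in 1:10) {
  pass <- fold_split(desc, k)
  cat("fold", k, ": train =", nrow(pass$train),
      ", test =", nrow(pass$test), "\n")
}
fold_split(desc, 9)$test$V5   # the entry held out in pass 9
```

Fixing k to a single value corresponds to the single-fold test described above; looping over all ten values corresponds to the full 10X cross validation.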
DIFFERENTIATOR:
For differentiating sequences, you need the following:
1. DATA directory containing count files and the descriptor file.
2. Model level {4 for strain level, 3 for species, 2 for class, 1 for phylum} //set to 4
3. Directory where the model files were written to. //default provided in source code.
4. Directory where the differentiator output files are written to. //default provided in source code.
5. Clustering threshold //set to 0.01
Download from http://lyle.smu.edu/cse/dbgroup/IDA/EMMSA/emmsaTools.zip unless already done
previously.
The zip file includes a directory with current set of model files and a few descriptor files which you can
modify as required. For example, for research purposes, you may limit the number of organisms you
want to include in the distance matrix analysis. The program can be invoked on a PC from MSDOS as
described in the Readme file (also in the zip folder).
The model level (ModelLevel) is set to 4 (strain) by default. The clustering threshold is set to
0.01 by default and can be changed from the GUI. The screen for the EMMSA differentiator is the
same as that of the EMMSA classifier.
After locating the Descriptor and pressing the start button, the differentiator runs through the entries of
descriptor file performing the following:
1. For each entry there-in, access an already built strain model or build a new one using the
threshold specified.
2. For each strain model thus built, for each test entry, generate classification metrics.
3. When all models for all test entries are processed, generate the distance matrix report for
the preferred metric.
The output file formats are described in the appendix.
Currently, all entries in the Descriptor file are treated as test entries, i.e., each and every
strain is evaluated against each and every species. Many other tests are possible, but they require
some study of the code, because capturing all combinations in an external configuration file gets
very complicated.
For example, the module jemmview:line 1172 or thereabouts can be changed to specify the fold
number of the entries to consider for test, assuming that those entries in the descriptor file are
changed to reflect the chosen fold number.
Similarly, by changing the fold loop counter, the user can also run 10X cross validation, in which
case, for every pass, models are generated using all entries except those in the fold set aside for
test, and only the entries numbered with the test fold (loop counter) are used for test.
APPENDIX
In this section we will describe the files used and output by the EMMSA set of tools.
1. Sequence Files
2. Count Files
3. Model Files
4. Classifier Metric File
5. Classifier Ties File
6. Distance Matrix File
Sequence Files:
The FASTA header needs to be in a particular format; here is a sample file {with truncated sequences}:
Acidithiobacillus_ferrooxidans_ATCC_23270.wri
>org name: Acidithiobacillus ferrooxidans ATCC 23270: class: Gammaproteobacteria: phylum: Proteobacteria: gene name: AFE_3084: rel date: 2008-12-19: mod date: 2009-01-28
CTCAGATTGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGGTAACAGGTCTTCGGATGCTGACGAGTGGCG
GACGGGTGAGTAATGCGTAGGAATCTGTCTTTTAGTGGGGGACAACCCAGGGCCTTCGGGAGGGCGGTTACCACGGT
ATGGTTCATGACTGGGGTGAAGTCGT
>org name: Acidithiobacillus ferrooxidans ATCC 23270: class: Gammaproteobacteria: phylum: Proteobacteria: gene name: AFE_0384: rel date: 2008-12-19: mod date: 2009-01-28
CTCAGATTGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGGTAACAGGTCTTCGGATGCTGACGAGTGGCG
GACGGGTGAGTAATGCGTAGGAATCTGTCTTTTAGTGGGGGACAACCCAGGGAAGGTATGGTTCATGACTGGGGTGA
AGTCGT
The header, which runs from “>” to the end of the line (the sequence starts on a separate line),
must contain all the fields exactly as shown, including the carefully placed delimiter “:”.
Multiple sequences may be specified, each with its own header as shown above. For example, the file
above contains both 16S rRNA sequences (truncated for easy reading) of the organism; note that the
organism name is used in the file name with spaces replaced by underscores.
For test sequences, the organism or its particulars are often not known. In such cases, pseudo
names may be used with the same syntax as shown above.
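To check that a header follows this format before running the counters, the field/value pairs can be pulled apart in base R. The sketch below mirrors the ": " splitting used in batch_counter_3mer.R, but parse_header is a hypothetical helper and not part of the EMMSA tools; it assumes that no field value itself contains ": ".

```r
# Hypothetical helper: split an EMMSA-style FASTA header into named fields.
parse_header <- function(header) {
  parts <- strsplit(sub("^>", "", header), ": ", fixed = TRUE)[[1]]
  # Fields and values must alternate, so the number of parts must be even.
  stopifnot(length(parts) %% 2 == 0)
  values <- parts[c(FALSE, TRUE)]
  names(values) <- parts[c(TRUE, FALSE)]
  values
}

header <- paste0(">org name: Acidithiobacillus ferrooxidans ATCC 23270: ",
                 "class: Gammaproteobacteria: phylum: Proteobacteria: ",
                 "gene name: AFE_3084: rel date: 2008-12-19: ",
                 "mod date: 2009-01-28")
fields <- parse_header(header)
names(fields)
# Non-word characters become "-", as in the counter script.
safe_name <- gsub("\\W", "-", fields[["org name"]])
safe_name
```

A header with a missing value or a misplaced delimiter produces an odd number of parts and stops with an error, which makes formatting problems visible early.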
Count Files:
These are tab-delimited files that are generated by the R preprocessors and used by the EMMSA
software for generating models. Since the format of the contents of these files is internal, it
will not be described here until it has stabilized.
Model Files:
These are binary object files and not readable in text editors.
Classifier Metric File:
This file has the suffix of “.ss.txt”.
For each query-model evaluation, a record of output is generated that contains several metrics for
subsequent analysis (in EXCEL). Here is a partial sample output:
0/Anoxybacillus flavithermus/Anoxybacillus flavithermus/Anoxybacillus/Anoxybacillus/Anoxybacillus flavithermus/Anoxybacillus flavithermus/Anoxybacillus-flavithermus-WK1/0.99430/0.97390/5/39.62/-2.19/2.19/62.28/1.000000/1.00/0.00/6/6/0.00/7/1.000000
1/Anoxybacillus flavithermus/Bacillus amyloliquefaciens/Anoxybacillus/Bacillus/Anoxybacillus flavithermus/Bacillus amyloliquefaciens/Bacillus-amyloliquefaciens-FZB42/0.98335/0.98335/4/17.46/-24.35/24.35/40.10/0.002057/0.00/0.00/2/1/0.03/2/0.001404
2/Anoxybacillus flavithermus/Bacillus anthracis/Anoxybacillus/Bacillus/Anoxybacillus flavithermus/Bacillus anthracis/Bacillus-anthracis-str---Ames-Ancestor-/0.00000/0.65171/0/17.87/-23.95/23.95/40.51/0.002816/0.00/0.00/30/7/0.01/0/0.001760
In the above file, the delimiter “/” separates fields for easy extraction into EXCEL for further
manipulation. The following are the values output for each record.
Index, model, org_model, model_genus, org_genus, model_species, org_species, organism,
single_p’_val, composite_p’_val, EScore, m[0], m[1], m[2], m[3], m[4], m[5], m[6], qsizeQA, msizeQA,
simQA, nSig, m[7]
The metrics are identified by m[0] to m[7]; of these, m[4] is the preferred metric. The file is not
sorted by organism but ordered by model. Typically, in EXCEL, one would sort the records first by
organism and then by a metric to see if the correct classification model comes out on top.
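The same extraction and sorting can be done in R instead of EXCEL. The sketch below reads the three sample records shown above; the column names are simplified versions of the field list (for example m4 for m[4]), and ranking larger m[4] values first is an assumption based on the sample, where the matching model scores 1.000000.

```r
records <- c(
  "0/Anoxybacillus flavithermus/Anoxybacillus flavithermus/Anoxybacillus/Anoxybacillus/Anoxybacillus flavithermus/Anoxybacillus flavithermus/Anoxybacillus-flavithermus-WK1/0.99430/0.97390/5/39.62/-2.19/2.19/62.28/1.000000/1.00/0.00/6/6/0.00/7/1.000000",
  "1/Anoxybacillus flavithermus/Bacillus amyloliquefaciens/Anoxybacillus/Bacillus/Anoxybacillus flavithermus/Bacillus amyloliquefaciens/Bacillus-amyloliquefaciens-FZB42/0.98335/0.98335/4/17.46/-24.35/24.35/40.10/0.002057/0.00/0.00/2/1/0.03/2/0.001404",
  "2/Anoxybacillus flavithermus/Bacillus anthracis/Anoxybacillus/Bacillus/Anoxybacillus flavithermus/Bacillus anthracis/Bacillus-anthracis-str---Ames-Ancestor-/0.00000/0.65171/0/17.87/-23.95/23.95/40.51/0.002816/0.00/0.00/30/7/0.01/0/0.001760"
)

# Simplified column names for the 23 fields listed above.
fields <- c("index", "model", "org_model", "model_genus", "org_genus",
            "model_species", "org_species", "organism",
            "single_p_val", "composite_p_val", "EScore",
            paste0("m", 0:6), "qsizeQA", "msizeQA", "simQA", "nSig", "m7")

# For a real run, replace `text = records` with the path to the .ss.txt file.
metrics <- read.table(text = records, sep = "/", quote = "",
                      comment.char = "", col.names = fields,
                      stringsAsFactors = FALSE)

# Sort by organism, then by the preferred metric, best first (assumption).
ranked <- metrics[order(metrics$organism, -metrics$m4), ]
ranked[, c("organism", "model", "m4")]
```

With a full results file, the first row for each organism in ranked is the model that the preferred metric favors for that organism.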
Classifier Ties File:
This file has the suffix of “.ties.txt”. This will be described in subsequent releases as it is required only
when strain level classification is not unique.
Distance Matrix File:
The name of the EMM [typically a strain] is truncated to its first 10 characters when the distance
matrix is generated row by row; often the first 10 characters are not sufficient to make the names
unique, so it may be necessary to edit them manually. This file may be used as input to the PHYLIP
package on the web, but some PHYLIP programs require that the first row contain the dimension N
(the matrix is NxN, where N is the number of rows). Sometimes it is also necessary to ensure that
the first number in each row starts exactly at the 11th position.
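The manual fixes described above (unique 10-character names, a leading dimension row, values starting at the 11th position) can be scripted. The following R sketch is not part of EMMSA: phylip_names and write_phylip are hypothetical helpers, the renaming rule (replacing the tail of clashing names with a running number) is just one possible choice, and the distances are made-up placeholders.

```r
# Hypothetical helper: truncate names to `width` characters and make them
# unique by replacing the tail of clashing names with a running number.
phylip_names <- function(full_names, width = 10) {
  short <- substr(full_names, 1, width)
  clash <- duplicated(short) | duplicated(short, fromLast = TRUE)
  if (any(clash)) {
    suffix <- as.character(ave(seq_along(short), short, FUN = seq_along))
    keep <- width - nchar(suffix[clash])
    short[clash] <- paste0(substr(short[clash], 1, keep), suffix[clash])
  }
  stopifnot(!anyDuplicated(short))
  formatC(short, width = width, flag = "-")   # left-justify, pad to 10
}

# Hypothetical helper: write a square distance matrix with a dimension row,
# 10-character names and the first value starting at position 11.
write_phylip <- function(d, file) {
  stopifnot(is.matrix(d), nrow(d) == ncol(d), !is.null(rownames(d)))
  labels <- phylip_names(rownames(d))
  rows <- apply(d, 1, function(r) paste(sprintf("%.6f", r), collapse = " "))
  writeLines(c(sprintf("%5d", nrow(d)), paste0(labels, rows)), file)
}

strains <- c("Bacillus-anthracis-str--Ames", "Bacillus-anthracis-str--Sterne",
             "Bacillus-cereus-AH187")
d <- matrix(c(0.0, 0.1, 0.4,
              0.1, 0.0, 0.5,
              0.4, 0.5, 0.0),                 # placeholder distances
            nrow = 3, byrow = TRUE, dimnames = list(strains, strains))
out <- tempfile(fileext = ".txt")
write_phylip(d, out)
phylip_lines <- readLines(out)
cat(phylip_lines, sep = "\n")
```

Because the renaming is automatic, it is worth keeping a note of which short name maps to which strain before interpreting the resulting tree.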