
SmProt User’s Guide
1. How to browse in SmProt database?
2. How to search in SmProt database?
3. The detailed information of a particular small protein
4. How to find a relevant small protein through sequence?
5. The Genome browser in SmProt database
6. How to submit small proteins to SmProt database?
7. How to download the data set in SmProt database?
8. The name scheme of small proteins in SmProt database
9. The genome versions of organism annotation in SmProt database
1. How to browse in SmProt database?
On the Browse web page, small proteins are listed with basic information,
including the ID of the small protein in the SmProt database (SmProt_ID), the length
of the small protein (SmProt_length), the type of the small protein (ORF_Type), the
gene encoding the small protein (Gene), gene type (Gene Type), organism
(Organism), and the data source of the small protein (Data Source) (Fig 1.1).
Fig 1.1 The Browse web page
On the Browse web page, users can select the species, ORF type, gene type, and
data source of interest and click the Browse button to retrieve the results, which
are shown below the browse table (Fig 1.2). For example, to get human small
proteins whose ORF type is sORF, gene type is protein-coding, and data source is
low-throughput literature mining, users choose these options in the browse table.
The result list can be navigated by changing the number of records per page or by
clicking the page numbers at the bottom right corner of the table. The result list
can also be sorted by clicking the sort button. Users can export the detailed
information with the export button (Fig 1.2).
Fig 1.2 Example of specific search
To view the detailed information of an item, users can click the corresponding
SmProt_ID to open its detailed information table (Fig 1.3). The new page shows the
detailed information about that small protein.
Fig 1.3 Detailed information of the small protein after clicking the SmProt_ID
2. How to search in SmProt database?
The Search web page allows users to search for any ID or keyword stored in the
database (Fig 2.1).
Fig 2.1 The Search web page
The search function is divided into three parts; users can search with any one
of them.
(1) Keyword search:
Users can search by ID. For example, they can use a SmProt_ID
(SPROHSA010899), NONCODE ID (NONHSAT131276, NONHSAG052263), PubMed ID
(26364619), ENSEMBL ID (ENST00000414264, ENSG00000227877), or RefSeq ID
(NR_003670) as the keyword without pre-selecting the ID category.
Users can also search by other keywords, such as gene symbol (e.g., MRPL43),
cell line or tissue (e.g., GM12878), ORF type (e.g., sORF), or gene type (e.g.,
lincRNA).
(2) Location search:
Users can choose the species and chromosome and type the region of interest to
obtain the small proteins whose locations overlap the input location.
(3) ID search:
Users can search by NONCODE ID (NONHSAT131276, NONHSAG052263),
ENSEMBL ID (ENST00000414264, ENSG00000227877), RefSeq ID (NR_003670),
PubMed ID (26364619), or SmProt_ID (SPROHSA010899).
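The following minimal Python sketch illustrates how a query string could be told apart by ID category, and how a location search could test for overlap. The regular expressions are assumptions inferred only from the example IDs in this guide, and the function names classify_id and overlaps are hypothetical; this is not the implementation used by the SmProt web site.

```python
import re

# Hypothetical patterns, inferred only from the example IDs shown in this guide.
ID_PATTERNS = [
    ("SmProt_ID", re.compile(r"^SPRO[A-Z]{3}\d{6}$")),
    ("NONCODE ID", re.compile(r"^NON[A-Z]{3}[TG]\d+$")),
    ("ENSEMBL ID", re.compile(r"^ENS[A-Z]*[TG]\d{11}$")),
    ("RefSeq ID", re.compile(r"^[A-Z]{2}_\d+$")),
    ("PubMed ID", re.compile(r"^\d+$")),
]


def classify_id(query):
    """Return the first ID category whose pattern matches, else 'keyword'."""
    text = query.strip()
    for name, pattern in ID_PATTERNS:
        if pattern.match(text):
            return name
    return "keyword"


def overlaps(start_a, end_a, start_b, end_b):
    """True if two closed intervals on the same chromosome share a position."""
    return start_a <= end_b and start_b <= end_a
```

For example, classify_id("NR_003670") falls into the RefSeq category, while a gene symbol such as MRPL43 matches none of the patterns and is treated as a free keyword.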
3. The detailed information of a particular small protein.
The small proteins are mainly collected from four different sources (described on
the Help web page). Based on these sources, we reorganized the small proteins and
defined 5 data sources. Small proteins collected from different data sources have
different detailed information. The detailed information is therefore divided into
three parts: general information, specific information, and reference (Fig 3.1).
Fig 3.1 The detailed information page.
● General information:
It includes the small protein ID, sequence, length, genomic location, ORF type,
transcript ID, gene symbol, gene type, organism, the transcript ID and gene ID in
the NONCODE, RefSeq, and ENSEMBL databases, tissue or cell line, and data source.
We also predicted the functions of the small proteins with the InterProScan
software. Clicking the Function button shows the functions of the small protein in
a new web page (Fig 3.2). Note that only small proteins with predicted functions
have the Function button in their general information.
Fig 3.2 The function web page.
● Reference:
The literature may provide raw ribosome profiling data, MS data, and
high-throughput or low-throughput information about proteins. The reference part
contains the PubMed ID, title, authors, and journal of each publication.
● Specific information:
Each data source has its own detailed information table.
(1) Low-throughput literature mining
This table includes the start codon of the small protein, the experimental
methods used to obtain or characterize the small protein, the disease in which the
small protein is involved, the function of the small protein, and its description
from the literature.
(2) High-throughput literature mining
The detailed information table for small proteins curated from
high-throughput literature mining varies, because different publications provide
different information. Some items that are easily confused are explained below.

○ sf score: produced by the SEQUEST algorithm, which was used to analyze the
acquired MS/MS spectra against a database derived from three-frame translation of
the RNA-Seq data. This final score indicates how good the protein match is.

○ ORFscore: reads were counted at each position within the ORF, excluding the
first and last coding codons. To filter out putative artifactual peaks, the most
abundant read position was masked if reads aligned to that position comprised
more than 70% of the total reads in the ORF. This filter was determined
empirically by applying a variable filter and minimizing 3′ UTR ORFs that were
misclassified as coding based on such peaks. The ORFscore was then calculated
as:
$$\mathrm{ORFscore} = \log_2\left(\sum_{n=1}^{3}\frac{(F_n-\bar{F})^2}{\bar{F}} + 1\right) \times \begin{cases} -1 & \text{if } F_1 < F_2 \text{ or } F_1 < F_3 \\ 1 & \text{otherwise} \end{cases}$$
where $F_n$ is the number of reads in reading frame $n$ and $\bar{F}$ is the total
number of reads across all three frames divided by 3.

○ coverage: the general equation is C = LN/G, where C is the coverage, G is the
haploid genome length, L is the read length, and N is the number of reads.

○ coding potential score: calculated by the sORFfinder algorithm; it indicates the
coding potential of the small protein.
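As an illustration of two of the scores above, the following Python sketch computes an ORFscore from per-frame read counts and a coverage value from C = LN/G. It assumes the published ORFscore definition, that is, log2 of a chi-square-like statistic over the three frames plus 1, made negative when frame 1 has fewer reads than frame 2 or frame 3. Returning 0.0 for an ORF with no reads is an assumption of this sketch, and the function names are hypothetical.

```python
import math


def orf_score(f1, f2, f3):
    """ORFscore from read counts in reading frames 1, 2 and 3."""
    total = f1 + f2 + f3
    if total == 0:
        return 0.0  # assumption: an ORF with no reads gets a neutral score
    mean = total / 3  # total reads across all three frames divided by 3
    statistic = sum((f - mean) ** 2 / mean for f in (f1, f2, f3))
    # Negative sign when frame 1 is not the dominant frame.
    sign = -1.0 if (f1 < f2 or f1 < f3) else 1.0
    return sign * math.log2(statistic + 1)


def coverage(read_length, n_reads, genome_length):
    """Coverage C = L * N / G."""
    return read_length * n_reads / genome_length
```

Reads spread evenly over the three frames give a score of 0, while reads concentrated in frame 1 give a large positive score.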
(3) MS Data
The detailed information table of the small protein curated from MS data
includes Raw Score, Spectrum ID, Peptide Rank, and Peptide Repeat Count.

○ Raw Score: reflects the strength of the peptide mapping, in contrast to the
Score field, which reflects the confidence of the mapping. The Score field is
computed as -100 × log10(E-Value) for the peptide mapping; scores of 200 or
greater have an estimated 5% false discovery rate (FDR), while scores of 230 or
greater have an estimated 1% FDR. The Raw Score offers an additional level of
confidence: raw scores of 300 or greater have an estimated 5% FDR. Note that Raw
Score is not normalized for the length of the peptide mapping, while Score is.
Consequently, short mappings might have a strong Raw Score but a weaker Score.

○ Spectrum ID: a semi-unique identifier of the spectrum associated with the
peptide mapping; it can be used to track the origin of the mapping.

○ Peptide Rank: indicates the rank of each peptide/spectrum mapping. A spectrum
can be chimeric, containing more than one peptide, so it can be mapped with
confidence to two or more distinct peptides. Peptides with ranks greater than 3
were deleted from the track.

○ Peptide Repeat Count: indicates the number of places in the genome that match
the peptide sequence. This reflects the uniqueness of the peptide mapping in the
genome. Mappings to highly duplicated regions have a high Peptide Repeat Count,
and peptides repeated more than 10 times in the genome were deleted from the
track.
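The thresholds described for the MS data can be sketched in Python as follows. The function names and the tier labels are hypothetical; only the formula -100 × log10(E-Value), the Score cut-offs of 200 and 230, and the Peptide Rank and Peptide Repeat Count limits are taken from the text above.

```python
import math


def score_from_evalue(e_value):
    """Score = -100 * log10(E-Value) for a peptide mapping."""
    return -100 * math.log10(e_value)


def fdr_tier(score):
    """Map a Score to the estimated FDR tier quoted in this guide."""
    if score >= 230:
        return "estimated 1% FDR"
    if score >= 200:
        return "estimated 5% FDR"
    return "below the 5% FDR threshold"


def keep_mapping(peptide_rank, repeat_count):
    """Keep mappings with rank at most 3 and at most 10 genomic repeats."""
    return peptide_rank <= 3 and repeat_count <= 10
```

With this definition, an E-Value of 0.01 corresponds to a Score of about 200, which is the edge of the 5% FDR tier.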
(4) Ribosome profiling data
The detailed information table for small proteins curated from ribosome
profiling data includes the following:

○ The TPM (transcripts per million) value of the small protein in the ribosome
profiling data.
○ The TPM (transcripts per million) value of the small protein in the RNA-Seq data.
○ The relative position of the small protein start codon in the transcript.
○ The relative position of the small protein stop codon in the transcript.
○ The relative position of the annotated CDS start codon in the transcript.
○ The relative position of the annotated CDS stop codon in the transcript.
○ The number of ribosome profiling reads in the small protein.
○ The number of RNA-seq reads in the small protein.
○ The number of P_sites in the small protein.
○ The number of RNA_sites in the small protein.
○ The p-value of the small protein calculated by the multitaper method using
ribosome profiling data.
○ The p-value of the small protein calculated by the multitaper method using
RNA-seq data.
○ The number of exons of the small protein.
○ The ribosome profiling dataset ID.
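This guide does not state how the TPM values were computed. For readers unfamiliar with the unit, the following Python sketch shows the conventional definition of TPM: each read count is divided by the feature length, and the resulting rates are rescaled so that they sum to one million. The function name tpm and the handling of an all-zero input are assumptions of this sketch.

```python
def tpm(read_counts, lengths):
    """Conventional transcripts-per-million values for a list of features.

    read_counts: reads assigned to each feature
    lengths: feature lengths in nucleotides (same order, all positive)
    """
    rates = [count / length for count, length in zip(read_counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)  # assumption: no reads anywhere gives all zeros
    return [rate / total * 1e6 for rate in rates]
```

Because of the rescaling, TPM values within one sample always sum to one million, which makes them comparable across features of different lengths.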
(5) Known databases
The detailed information table for small proteins curated from the UniProt
database includes the following:

○ The UniProt ID of the small protein, which links to the UniProt database.
○ The function of the small protein.
○ The annotation score, which provides a heuristic measure of the annotation
content of a UniProtKB entry or proteome.
○ The evidence of the small protein, which indicates the type of evidence that
supports the existence of the protein.
○ The gene alias of the small protein.
The detailed information table for small proteins curated from the CCDS
database includes the following:

○ The CCDS ID of the small protein, which links to the CCDS database.
○ The CCDS project description.
○ The CCDS project overview.
4. How to find a relevant small protein through sequence?
Users can use the Blast web page to find records of interest by sequence.
The Blast web page provides the following programs:
Blastx: compares a nucleotide query sequence translated in all reading frames
against a protein sequence database.
Blastp: compares an amino acid query sequence against a protein sequence
database.
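To illustrate what "translated in all reading frames" means for a nucleotide query, the following Python sketch translates a DNA sequence in the three forward and three reverse-complement frames using the standard genetic code. It is only an illustration of the idea, not the BLAST implementation; the function names are hypothetical, and writing unknown codons as X is an assumption of this sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(nucleotides):
    """Return translations for frames +1..+3 and -1..-3 keyed by frame name."""
    forward = nucleotides.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for strand, seq in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six translated sequences is then what a protein-level comparison would be run against.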
Fig 4.1 The Blast web page
5. The Genome browser in SmProt database.
SmProt has also integrated a local UCSC Genome Browser
(http://genome.ucsc.edu/) for visualization of the genomic locations of the small
proteins in the SmProtTable track (Fig 5.1). Small proteins curated from MS data are
shown as an independent track in the genome browser. For a small protein with no
recognized gene name or ID, users can also search SmProt by its genomic location
in the genome browser. Associated tracks such as NONCODE lncRNAs, NONCODE
Genes, RefSeq Genes, and ENSEMBL Genes are also shown in the genome browser.
Fig 5.1 The Genome browser
6. How to submit small proteins to SmProt database?
Users are encouraged to submit their small proteins on the Submit web page in
the requested data format.
Fig 6.1 The Submit web page
7. How to download the data set in SmProt database?
Specific information and sequence information of the small proteins stored in the
database can be downloaded in TXT or FASTA format on the Download web page (Fig
7.1). The high-confidence data sets can also be downloaded there. We defined a
high-confidence set of small proteins obtained from low-throughput literature
mining, known databases, high-throughput literature mining supported by MS data,
or ribosome profiling data supported by MS data; these represent the
highest-quality small protein entries in the database.
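The rule for the high-confidence set can be written as a small predicate. The sketch below is only an illustration: the data source labels are taken from the data source names in section 3, while the function name is_high_confidence and the ms_supported flag are hypothetical and do not correspond to columns in the downloadable files.

```python
# Data sources that qualify on their own.
ALWAYS_HIGH_CONFIDENCE = {
    "low-throughput literature mining",
    "known databases",
}
# Data sources that qualify only when supported by MS data.
NEEDS_MS_SUPPORT = {
    "high-throughput literature mining",
    "ribosome profiling data",
}


def is_high_confidence(data_source, ms_supported=False):
    """Apply the high-confidence rule described in this section."""
    source = data_source.strip().lower()
    if source in ALWAYS_HIGH_CONFIDENCE:
        return True
    return source in NEEDS_MS_SUPPORT and ms_supported
```

Under this reading, an entry from ribosome profiling data alone is not in the high-confidence set, but the same entry with MS support is.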
Fig 7.1 The Download web page
8. The name scheme of small proteins in SmProt database.
Each small protein is named as: SPRO + organism abbreviation + six digits.
Organism abbreviation:
Organism            Abbreviation
human               HSA
mouse               MUS
rat                 RAT
fruitfly            MET
zebrafish           DAR
yeast               SCE
C.elegans           CEL
Escherichia coli    ECO
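The name scheme can be checked programmatically. The following Python sketch splits a SmProt_ID into its organism and number, using the abbreviations listed above; the function name parse_smprot_id is hypothetical.

```python
import re

# Organism abbreviations as listed in this guide.
ORGANISM_BY_ABBREVIATION = {
    "HSA": "human",
    "MUS": "mouse",
    "RAT": "rat",
    "MET": "fruitfly",
    "DAR": "zebrafish",
    "SCE": "yeast",
    "CEL": "C.elegans",
    "ECO": "Escherichia coli",
}

# SPRO + three-letter organism abbreviation + six digits.
SMPROT_ID = re.compile(r"^SPRO([A-Z]{3})(\d{6})$")


def parse_smprot_id(smprot_id):
    """Return (organism, number) for a SmProt_ID, or raise ValueError."""
    match = SMPROT_ID.match(smprot_id.strip())
    if match is None:
        raise ValueError(f"not a SmProt_ID: {smprot_id!r}")
    abbreviation, digits = match.groups()
    if abbreviation not in ORGANISM_BY_ABBREVIATION:
        raise ValueError(f"unknown organism abbreviation: {abbreviation}")
    return ORGANISM_BY_ABBREVIATION[abbreviation], int(digits)
```

For the example ID used in section 2, SPROHSA010899, this yields the organism human and the number 10899.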
9. The genome versions of organism annotation in SmProt database.
Organism            Genome version
human               hg19
mouse               mm10
rat                 rn6
yeast               saccer3
zebrafish           dr7
fruitfly            dm3
C.elegans           ce10
Escherichia coli    EB1