SmProt User’s Guide

1. How to browse in SmProt database?
2. How to search in SmProt database?
3. The detailed information of a particular small protein
4. How to find a relevant small protein through sequence?
5. The Genome browser in SmProt database
6. How to submit small proteins to SmProt database?
7. How to download the data set in SmProt database?
8. The name scheme of small proteins in SmProt database
9. The genome versions of organism annotation in SmProt database

1. How to browse in SmProt database?

On the Browse web page, small proteins are listed with basic information: the ID of the small protein in the SmProt database (SmProt_ID), its length (SmProt_length), its ORF type (ORF_Type), the gene that encodes it (Gene), the gene type (Gene Type), the organism (Organism), and the data source (Data Source) (Fig 1.1).

Fig 1.1 The Browse web page

On the Browse web page, users can select the species, ORF type, gene type, and data source they are interested in and click the Browse button to retrieve the matching results, which are shown below the selection table (Fig 1.2). For example, to get human small proteins whose ORF type is sORF, whose gene type is protein-coding, and whose data source is low-throughput literature mining, choose these options in the browse table. The result list can be navigated by changing the number of records shown or by clicking the page numbers at the bottom right corner of the table. The list can also be sorted by clicking the sort button, and detailed information can be exported with the export button (Fig 1.2).

Fig 1.2 Example of specific search

To view the detailed information of an item, click the corresponding SmProt_ID to open the detailed information table (Fig 1.3). The new page shows the detailed information about that small protein.

Fig 1.3 Detailed information of the small protein after clicking SmProt_ID

2. How to search in SmProt database?

The Search web page allows users to search for any ID or keyword stored in the database (Fig 2.1).

Fig 2.1 The Search web page

The search function is divided into three parts; users can use any one of them.

(1) Keyword search: Users can search by ID. For example, they can use a SmProt_ID (SPROHSA010899), NONCODE ID (NONHSAT131276, NONHSAG052263), PubMed ID (26364619), ENSEMBL ID (ENST00000414264, ENSG00000227877), or RefSeq ID (NR_003670) as a keyword without pre-selecting the ID category. Users can also search by other keywords, such as gene symbol (e.g., MRPL43), cell line or tissue (e.g., GM12878), ORF type (e.g., sORF), or gene type (e.g., lincRNA).

(2) Location search: Users can choose a species and chromosome and type the region of interest to obtain the small proteins whose locations overlap the input location.

(3) ID search: Users can search by NONCODE ID (NONHSAT131276, NONHSAG052263), ENSEMBL ID (ENST00000414264, ENSG00000227877), RefSeq ID (NR_003670), PubMed ID (26364619), or SmProt_ID (SPROHSA010899).

3. The detailed information of a particular small protein.

The small proteins are mainly collected from four different sources (described in the Help web page). According to these sources, we re-organized the small proteins and defined 5 different data sources. Small proteins collected from different data sources have different detailed information. The detailed information is hence divided into three parts: general information, specific information, and reference (Fig 3.1).

Fig 3.1 The detailed information page

General information: This part includes the small protein ID, sequence, length, genomic location, ORF type, transcript ID, gene symbol, gene type, organism, the transcript ID and gene ID in the NONCODE, RefSeq, and ENSEMBL databases, tissue or cell line, and data source. We also predicted the functions of the small protein with the InterProScan software. Click the Function button and the functions of the small protein will be shown in a new web page (Fig 3.2). Note that only small proteins with predicted functions have the Function button in their general information.

Fig 3.2 The function web page

Reference: The cited literature may provide raw ribosome profiling data, MS data, or high-throughput or low-throughput information about proteins. This part contains the PubMed ID, title, authors, and the journal in which the literature was published.

Specific information: Each data source has its own detailed information table.

(1) Low-throughput literature mining

This table includes the start codon of the small protein, the experimental methods used to obtain or characterize it, the disease it is involved in, its function, and its description from the literature.

(2) High-throughput literature mining

The detailed information table of a small protein curated from high-throughput literature mining varies, because different publications provide different information. Some items that are easily confused are explained below.

sf score: produced by the SEQUEST algorithm, which was used to analyze the acquired MS/MS spectra against a database derived from three-frame translation of the RNA-Seq data. The final score indicates how good the protein match is.

ORFscore: reads were counted at each position within the ORF, excluding the first and last coding codons.
To filter out putative artifactual peaks, the most abundant read position was masked if reads aligned to that position comprised more than 70% of the total reads in the ORF. This filter was determined empirically by applying a variable filter and minimizing 3′ UTR ORFs that were misclassified as coding based on such peaks. The ORFscore was then calculated as:

$$\mathrm{ORFscore} = \log_2\left(\sum_{n=1}^{3} \frac{(F_n - \bar{F})^2}{\bar{F}} + 1\right) \times \begin{cases} -1 & \text{if } F_1 < F_2 \text{ or } F_1 < F_3 \\ 1 & \text{otherwise} \end{cases}$$

where $F_n$ is the number of reads in reading frame n and $\bar{F}$ is the total number of reads across all three frames divided by 3.

coverage: the general equation is C = LN/G, where C is the coverage, G is the haploid genome length, L is the read length, and N is the number of reads.

coding potential score: calculated by the sORFfinder algorithm; it indicates the coding potential of the small protein.

(3) MS Data

The detailed information table of a small protein curated from MS data includes Raw Score, Spectrum ID, Peptide Rank, and Peptide Repeat Count.

Raw Score: reflects the strength of the peptide mapping, in contrast to the Score field, which reflects the confidence of the mapping. The Score field is computed as -100 × log10(E-Value) for the peptide mapping; scores of 200 or greater have an estimated 5% false discovery rate (FDR), while scores of 230 or greater have an estimated 1% FDR. The Raw Score offers an additional level of confidence: raw scores of 300 or greater have an estimated 5% FDR. Note that Raw Score is not normalized for the length of the peptide mapping, while Score is. Consequently, short mappings might have a strong Raw Score but a weaker Score.

Spectrum ID: a semi-unique identifier of the spectrum associated with the peptide mapping, which can be used to track the origins of the mapping.

Peptide Rank: indicates the rank of each peptide/spectrum mapping. A spectrum can be chimeric, containing more than one peptide, and can therefore be mapped with confidence to two or more distinct peptides. Peptides with ranks greater than 3 are deleted from the track.
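
As an aside on the scores defined so far (ORFscore, coverage, and the MS Score), the following is a minimal Python sketch, not SmProt's own code. `coverage` and `ms_score` follow the formulas stated in the text (C = LN/G and Score = -100 × log10(E-Value)). `orf_score` assumes the chi-squared-style form of the ORFscore: log2 of the summed squared frame deviations over the mean, plus 1, made negative when frame 1 is not the dominant frame. All function names and the `fdr_tier` labels are hypothetical, and the 70% peak-masking step is left out for brevity.

```python
import math
from typing import Sequence


def orf_score(frame_reads: Sequence[float]) -> float:
    """ORFscore from read counts in the three reading frames (F1, F2, F3).

    Assumed form: log2(sum((Fn - Fbar)^2 / Fbar) + 1), with a negative sign
    when frame 1 has fewer reads than frame 2 or frame 3.
    """
    f1, f2, f3 = frame_reads
    f_bar = (f1 + f2 + f3) / 3.0  # total reads across all frames divided by 3
    if f_bar == 0:
        return 0.0  # no reads at all: no evidence either way (sketch convention)
    deviation = sum((f - f_bar) ** 2 / f_bar for f in (f1, f2, f3))
    sign = -1.0 if (f1 < f2 or f1 < f3) else 1.0
    return sign * math.log2(deviation + 1.0)


def coverage(read_length: float, n_reads: float, genome_length: float) -> float:
    """Coverage C = L * N / G, as given in the text."""
    return read_length * n_reads / genome_length


def ms_score(e_value: float) -> float:
    """Score = -100 * log10(E-Value); raises ValueError if e_value <= 0."""
    return -100.0 * math.log10(e_value)


def fdr_tier(score: float) -> str:
    """Map a Score to the estimated FDR tiers quoted in the text."""
    if score >= 230:
        return "about 1% FDR"
    if score >= 200:
        return "about 5% FDR"
    return "below the 5% FDR threshold"


if __name__ == "__main__":
    print(orf_score((30, 5, 4)))    # positive: reads concentrate in frame 1
    print(orf_score((5, 30, 4)))    # negative: frame 2 dominates
    print(coverage(100, 1_000_000, 100_000_000))
    print(fdr_tier(ms_score(1e-3)))
```

Keeping the three formulas as separate small functions makes it easy to check a value shown in a detailed information table against the definition it is supposed to follow.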

Peptide Repeat Count: indicates the number of places in the genome that match the peptide sequence, which reflects the uniqueness of the peptide mapping in the genome. Mappings to highly duplicated regions have a high Peptide Repeat Count, and peptides repeated more than 10 times in the genome were deleted from the track.

(4) Ribosome profiling data

The detailed information table of a small protein curated from ribosome profiling data includes the following:

- The TPM (transcripts per million) value of the small protein in ribosome profiling data.
- The TPM (transcripts per million) value of the small protein in RNA-Seq data.
- The relative position of the small protein start codon in the transcript.
- The relative position of the small protein stop codon in the transcript.
- The relative position of the annotated CDS start codon in the transcript.
- The relative position of the annotated CDS stop codon in the transcript.
- The number of ribosome profiling reads in the small protein.
- The number of RNA-seq reads in the small protein.
- The number of P_sites in the small protein.
- The number of RNA_sites in the small protein.
- The p-value of the small protein calculated by the multitaper method using ribosome profiling data.
- The p-value of the small protein calculated by the multitaper method using RNA-seq data.
- The number of exons of the small protein.
- The ribosome profiling dataset ID.

(5) Known databases

The detailed information table of a small protein curated from the UniProt database includes the following:

- The UniProt ID of the small protein, which links to the UniProt database.
- The function of the small protein.
- The annotation score, which provides a heuristic measure of the annotation content of a UniProtKB entry or proteome.
- The evidence of the small protein, which indicates the type of evidence that supports the existence of the protein.
- The gene alias of the small protein.

The detailed information table of a small protein curated from the CCDS database includes the following:

- The CCDS ID of the small protein, which links to the CCDS database.
- The CCDS project description.
- The CCDS project overview.

4. How to find a relevant small protein through sequence?

Users can use the Blast web page (Fig 4.1) to find records of interest by sequence. The Blast web page provides two programs:

Blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database.

Blastp: compares an amino acid query sequence against a protein sequence database.

Fig 4.1 The Blast web page

5. The Genome browser in SmProt database.

SmProt has also integrated a local UCSC Genome Browser (http://genome.ucsc.edu/) for visualizing the genomic locations of the small proteins in the SmProtTable track (Fig 5.1). Small proteins curated from MS data are shown as an independent track in the genome browser. For a small protein with no recognized gene name or IDs, users can also search SmProt by its genomic location in the genome browser. Associated tracks such as NONCODE lncRNAs, NONCODE Genes, RefSeq Genes, and ENSEMBL Genes are also shown in the genome browser.

Fig 5.1 The Genome browser

6. How to submit small proteins to SmProt database?

Users are encouraged to submit their small proteins on the Submit web page in the requested data format (Fig 6.1).

Fig 6.1 The Submit web page

7. How to download the data set in SmProt database?

Specific information and sequence information of the small proteins stored in the database can be downloaded in TXT or FASTA format on the Download web page (Fig 7.1). The high-confidence data sets can also be downloaded there.

In addition, we defined a high-confidence set of small proteins, obtained from low-throughput literature mining, known databases, high-throughput literature mining supported by MS data, or ribosome profiling supported by MS data. These represent the highest-quality small protein entries in the database.

Fig 7.1 The Download web page

8. The name scheme of small proteins in SmProt database.

Each small protein is named as: SPRO + organism abbreviation + six digits.

Organism abbreviation:

Organism            Abbreviation
human               HSA
mouse               MUS
rat                 RAT
fruitfly            MET
zebrafish           DAR
yeast               SCE
C.elegans           CEL
Escherichia coli    ECO

9. The genome versions of organism annotation in SmProt database.

Organism            Genome version
human               hg19
mouse               mm10
rat                 rn6
yeast               saccer3
zebrafish           dr7
fruitfly            dm3
C.elegans           ce10
Escherichia coli    EB1
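
The naming scheme and the genome version table can be combined to decode a SmProt_ID. The following Python sketch is illustrative only: the two dictionaries copy the tables in sections 8 and 9, while the names `parse_smprot_id`, `SmProtId`, and `SMPROT_ID_PATTERN` are hypothetical, and the assumption that the organism abbreviation is always three uppercase letters is inferred from the listed abbreviations.

```python
import re
from typing import NamedTuple, Optional

# Copied from the organism abbreviation table (section 8).
ORGANISM_BY_ABBREVIATION = {
    "HSA": "human",
    "MUS": "mouse",
    "RAT": "rat",
    "MET": "fruitfly",
    "DAR": "zebrafish",
    "SCE": "yeast",
    "CEL": "C.elegans",
    "ECO": "Escherichia coli",
}

# Copied from the genome version table (section 9).
GENOME_VERSION_BY_ORGANISM = {
    "human": "hg19",
    "mouse": "mm10",
    "rat": "rn6",
    "yeast": "saccer3",
    "zebrafish": "dr7",
    "fruitfly": "dm3",
    "C.elegans": "ce10",
    "Escherichia coli": "EB1",
}

# SPRO + three-letter organism abbreviation + six digits (assumed layout).
SMPROT_ID_PATTERN = re.compile(r"SPRO([A-Z]{3})(\d{6})")


class SmProtId(NamedTuple):
    organism: str
    genome_version: str
    serial: int


def parse_smprot_id(smprot_id: str) -> Optional[SmProtId]:
    """Return the decoded ID, or None if the string does not fit the scheme."""
    match = SMPROT_ID_PATTERN.fullmatch(smprot_id.strip())
    if match is None:
        return None
    abbreviation, digits = match.groups()
    organism = ORGANISM_BY_ABBREVIATION.get(abbreviation)
    if organism is None:
        return None  # well-formed, but not an organism listed in the table
    return SmProtId(organism, GENOME_VERSION_BY_ORGANISM[organism], int(digits))


if __name__ == "__main__":
    print(parse_smprot_id("SPROHSA010899"))
    print(parse_smprot_id("NONHSAT131276"))  # a NONCODE ID, so None
```

Returning None for anything that does not fit the scheme lets a caller tell SmProt IDs apart from the other ID types accepted by the search page (NONCODE, ENSEMBL, RefSeq, PubMed) before deciding which genome version to use for coordinates.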