GeneLooper 3.0 User’s Manual
Revised in 2014
© 2014 GeneHarbor, Inc.

GeneLooper User’s Manual

Copyrights
© 2002 GeneHarbor, Inc. All rights reserved. Revised in 2014.
GeneHarbor and GeneLooper are trademarks of GeneHarbor, Inc. All other trademarks and registered trademarks are the property of their respective owners.

License Agreement
GeneHarbor, Inc. grants a license to use the accompanying software and printed material to you, the original purchaser. This is a binding Agreement between you and GeneHarbor, Inc. Use of the software shall constitute your acceptance of this Agreement. The copying of the software is strictly prohibited and adherence to this requirement is your sole responsibility. GeneHarbor, Inc. reserves the right to modify and update the software and printed material without obligation to notify you, the original owner, of any change in the software and printed material.

Limited Warranty
If the performance of the software does not meet the standard described in the documentation, GeneHarbor, Inc. will replace the software if notified within 30 days of purchase. In the event of a replacement agreed to by GeneHarbor, Inc., the original software, accessories, and documentation must be received by GeneHarbor, Inc. in order for a replacement to be sent to you. Under no circumstances shall GeneHarbor, Inc. and its officers or its distributors be liable for any indirect, incidental, consequential or exemplary damages arising from the use of, or the inability to use, the software, even if they were aware of the possibility of such damages.

GeneHarbor, Inc.
[email protected]
www.geneharbor.com

Contents
GeneLooper 3.0
Copyrights
License Agreement
CD-ROM Installation
Sequence Formatting
Entrez Information Extraction
Sequence Collection
Sequence Separation
Sequence Retrieving
Open Reading Frame Detection
Multi-Sequence Alignment
Restriction Site Search
Translation and Reverse Complement
Batch Oligo Design
Sequence Viewing and Editing
Restriction Analysis
Sequence Alignment
Local Sequence Similarity Search
Primer Design
Oligo Database Search
ePCR
Sequence Formatting
Document Viewer

Introduction
In the new molecular genetic era, scientists frequently work with multiple genes in their research, for example when studying the function of a gene family or a pathway. They therefore need to analyze hundreds of DNA sequences. Currently, analyzing multiple sequences relies on bioinformaticists. Some scientists would like to analyze multiple sequences themselves because they want first-hand data and lower costs. So far, as far as we know, there are no efficient, low-cost tools that scientists can use themselves to analyze multiple sequences. Most multi-sequence analysis solutions are server-based and high-cost, and often require the involvement of professionals who are familiar with Unix and with programming languages such as Perl. GeneHarbor, Inc. is a company dedicated to providing innovative, versatile, user-friendly, and low-cost bioinformatic tools that directly serve field scientists. GeneLooper 3.0 was developed based upon these principles.
GeneLooper 3.0 is a Windows-based sequence analysis package that can be installed on a personal computer. Yet it is powerful enough to deal with as much as the entire transcriptome of any organism. Its high capacity enables users to analyze thousands of sequences at a time. GeneLooper 3.0 is multi-functional. It performs routine operations in bioinformatics study such as sequence retrieval, open reading frame detection, multi-sequence similarity search, translation, sequence alignment, and much more. From data preparation to data processing to data reporting, the concept of user-friendliness is implemented throughout the entire design. It has no complex database structure, so it is very flexible and easy to maintain. Operation is as easy as clicking a mouse button, and data output is instant and requires no programming or scripting.

In addition to its high-throughput capability, GeneLooper 3.0 is also meticulously designed to analyze individual sequences in detail. It offers many innovative and unique features needed by molecular biologists. The authors of GeneLooper 3.0 purposely streamlined the analysis process to provide the most straightforward interfaces. They also tried to provide results that make biological sense and to give users more power to adjust parameters to accommodate a variety of conditions.

GeneLooper 3.0 is a total package for multi-sequence analyses. To enjoy its power is to use it in your daily work. GeneHarbor, Inc. will continue to improve its products and services. Your input and feedback are greatly appreciated.

GeneHarbor, Inc.
November 11, 2014

Installation

System Requirements
The recommended system and configuration for GeneLooper 3.0:

Component          Minimum Requirement
Processor          Intel Pentium III or compatible, 1000 MHz or greater
RAM                1000 MB or greater
Display            800 x 600 resolution, 256 colors or greater, small fonts setting
Operating System   Windows XP, Windows 7, Windows 10
Drives             CD-ROM drive

CD-ROM Installation
If you have Autoplay turned on, your computer will automatically run the CD-ROM interface. Otherwise, follow these directions:
1. Insert the CD-ROM into your CD-ROM drive.
2. From the Windows desktop, double-click the My Computer icon.
3. Double-click the CD-ROM icon.
4. Double-click the setup icon to start the installation interface.

Follow the instructions for each step of the installation process. GeneLooper 3.0 will be installed into a default folder assigned by the program. If you do not want to use the default folder, you may install it to another location by choosing the custom installation method. All components are required by the program and should be kept in the designated folder at all times.

After installing GeneLooper 3.0, the interface will automatically continue to install the HASP device driver required to run GeneLooper 3.0. Follow the instructions to finish the process. An icon for GeneLooper 3.0 will be placed on the desktop of your computer.

After the installation, attach the HASP key to a USB port of your computer. Double-click the GeneLooper 3.0 icon on the desktop to run GeneLooper 3.0.

Part I Multi-Sequence Utilities

Chapter 1 Sequence Formatting

Objective: Format sequences for use with the utilities in GeneLooper 3.0.

To avoid using a complex database structure for storing and retrieving sequence data, GeneLooper 3.0 utilizes a modified fasta format for its routine operations. Fasta format is a simple and frequently used format for storing, transferring and viewing multiple DNA or protein sequences. Basically, it is a file containing many sequences. The information for each sequence has two parts: part one is a brief description beginning with a “>” sign, and part two is its sequence.
The description portion usually includes an accession number (ACCN), a GenBank identification number (GI) and a brief description of the sequence. These items are separated by vertical bars (“|”). A sample of a fasta file is shown below:

fasta format
>gi|11056007|ref|NM_021634.1| Homo sapiens leucine-rich repeat-containing G protein-coupled receptor 7 (LGR7), mRNA
ATGACATCTGGTTCTGTCTTCTTCTACATCTTAATTTTTGGAAAATATTTTTCTCATGGGGGTGGACAGG
ATGTCAAGTGCTCCCTTGGCTATTTCCCCTGTGGGAACATCACAAAGTGCTTGCCTCAGCTCCTGCACTG
TAACGGTGTGGACGACTGCGGGAATCAGGCCGATGAGGACAACTGTGGAGACAACAATGGATGGTCCATG
CAATTTGACAAATATTTTGCCAGTTACTACAAAATGACTTCCCAATATCCTTTTGAG…..
>gi|21489938|ref|NM_145066.1| Mus musculus G protein-coupled receptor 85 (Gpr85) mRNA
ATGGCGAACTATAGCCATGCAGCCGACAACATTTTGCAAAATCTCTCGCCTCTAACAGCCTTTCTGAAAC
TGACTTCCCTGGGTTTCATAATAGGAGTCAGCGTTGTGGGCAACCTTCTGATCTCCATTTTGCTAGTGAA
AGATAAGACCTTGCATAGAGCTCCTTACTACTTCCTGCTGGATCTGTGCTGCTCAGACATCCTCAGATCT
GCAATTTGTTTTCCATTTGTATTCAACTCTG…..

The major advantage of the fasta format is its simplicity. It can be opened, edited and saved with many text editors, including Microsoft Word and Notepad. GeneLooper 3.0 uses a modified fasta format in which both the descriptions and the sequences are enclosed in quotes; hereafter we call it Q-fasta. The main reasons for the change are to speed up operations and avoid errors.
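The quoting described above can be illustrated with a short Python sketch. This is not GeneLooper's own converter (use the Sequence Formatting utility for real work); the function name `fasta_to_qfasta` is hypothetical, and the layout it writes, one quoted description line followed by one quoted sequence line with the original line breaks joined, is an assumption, since the manual only states that quotes are added to both parts.

```python
def fasta_to_qfasta(fasta_text):
    """Convert plain fasta text to a quoted, Q-fasta-like layout.

    Assumed layout (not confirmed by the manual): each record becomes
    two lines, the description in quotes and the joined sequence in quotes.
    """
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)  # sequence line belonging to the current record
    if header is not None:
        records.append((header, "".join(chunks)))

    out = []
    for header, seq in records:
        out.append('"%s"' % header)
        out.append('"%s"' % seq)
    return "\n".join(out) + ("\n" if out else "")
```

Calling `fasta_to_qfasta(open("hs.fna").read())` would return the quoted text as a string. A description that itself contains a double quote is not handled by this sketch.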
A sample of a Q-fasta file is shown below:

Q-fasta format
">gi|17986271|ref|NM_000795.2| Homo sapiens dopamine receptor D2 (DRD2) transcript variant 1 mRNA"
"GGCAGCCGTCCGGGGCCGCCACTCTCCTCGGCCGGTCCCTGGCTCCC
GGAGGCGGCCGCGCGTGGATGCGGCGGGAGCTGGAAGCCTCAAGCAG
CCGGCGCCGTCTCTGCCCCGGGGCGCCCTATGGCTTGAAGAGCCTGGC
CACCCAGTGGCTCCACCGCCCTGATGGATCCACTGAATCTGTCCTGGT
ATGATGGAACTCCTTGGCCTCGAGAGCCCCTGGGGCCTAGACTCTGTA
ACATCACTATCCATGCACCAAACTAATAAAACTTTGACGAGTCACCTT
CCAGGACCCCTGGGTAAAAAAAAAAAAAAA…………"
">gi|4504136|ref|NM_000839.1| Homo sapiens glutamate receptor metabotropic 2 (GRM2) mRNA"
"CCATGGGATCGCTGCTTGCGCTCCTGGCACTGCTGCCGCTGTGGGGTGCTGTGGCTGAGGGCCCAGCCAAGAAGG
TGCTGACCCTGGAGGGAGACTTGGTGGCGGCTCCGTGGTGCTTGGCTGCCTCTTTGCGCCCAAGCTGCACATCATC
CTCTTCCAGCCGCAGAAGAACGTGGTTAGCCACCGGGCACCCACCAGCCGCTTTGGCAGTGCTGCTGCCAGGGCC
AGCTCCAGCCTTGGCCAAGGGTCTGGCTCCCAGTTTGTCCCCACTGTTTGCAATGGCCGTGAGGTGGTGGACTCGA
CAACGTCATCGCTTTGA…………"

Preparation of a Q-fasta file:
Converting a fasta file to a Q-fasta file is an easy but critical process, since Q-fasta files are used throughout the utilities in GeneLooper 3.0. Any non-Q-fasta file will cause errors in GeneLooper 3.0. Many public web sites allow free downloading of sequences in fasta format. We will use the curated human mRNAs (accession numbers starting with “NM_”, the so-called refseq) as an example to demonstrate the download and conversion processes.
1. To download the curated human mRNAs in fasta format from the National Center for Biotechnology Information (NCBI), use your Internet Explorer or Netscape browser to go to the FTP download site of the human genome project at NCBI, find the file named hs.fna.gz: ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Protein/hs.fna.gz and save the file to your local drive. This file is about 13 MB and takes a few minutes to download.
2. The downloaded file is usually compressed or “zipped” and must be decompressed before use.
This can be accomplished by using WinZip or an equivalent utility. If you don’t have WinZip installed on your PC, you can download an evaluation version from http://www.Winzip.com. Unzip the downloaded file “hs.fna.gz”. This generates a new file called hs.fna. The unzipped file contains about 16,000 human mRNA sequences and is about 45 MB in size.
3. Start GeneLooper 3.0 and then click on the Sequence Formatting button. The Sequence Formatting and Appending form will appear (Fig. 1). Use the browse button to locate the hs.fna file you have just created and then click on the New file button. The new file name, “hs_fmt.txt”, will automatically appear in the recipient text box. If you already have a Q-fasta file and want to append the downloaded sequences to it, use the Existing File button to locate that file.
4. There are two filter options. The first is designed to remove non-mRNA sequences, such as those with accessions beginning with “NG_” or “NC_”. The second, the truncation option, is designed to shorten very long transcripts such as those of the titin gene*. These options allow you to remove or truncate the few very long sequences that add computing time and slow down the other high-throughput utilities.
5. Click on the Start button to begin the formatting process. It usually takes a few minutes to finish. Wait until it is completed and then locate the formatted sequence file following the on-screen instructions. You can open the Q-fasta file, hs_fmt.txt, using the Document Viewer in GeneLooper 3.0 to see the formatted sequences. The Q-fasta file is now ready for use by GeneLooper 3.0.

Figure 1. Sample of the working interface for creating a Q-fasta file.

Note*:
1. Both nucleotide and protein fasta files can be used for formatting.
2. Very long sequences sometimes freeze the program due to the limited processing capacity of a PC.
Fortunately, only a few titin mRNAs (80 kb) have, so far, been found to be longer than 30,000 bp. In this file, NCBI also includes some non-mRNA sequences, such as those with accessions beginning with “NC_” (mitochondrial genome) or “NG_” (special genomic sequences and pseudogenes). These sequences are very long and may not be needed by users.

Sources of sequence files:
Many sets of mRNA sequences from NCBI are good collections, such as all refseq (accessions beginning with “NM_” or “XM_”) and sequences predicted by GenomeScan (GS_mRNA.fsa.gz). Mammalian Gene Collection (MGC) clone sequences are also excellent for collection. Considering the low cost of hard drive space, you shouldn’t have any problem storing all mRNA and protein sequences from many organisms on your PC. It is a good habit to organize your sequence files well, for example by giving them meaningful names, dating them clearly and putting them in a few sub-folders.

Internet Links for downloading fasta sequences:
1. Refseqs from NCBI: ftp://ftp.ncbi.nih.gov/refseq
2. GenBank FTP download site: ftp://ftp.ncbi.nih.gov/genbank/
3. MGC clone sequences from Mammalian Gene Collection: http://mgc.nci.nih.gov/
4. Download from the NCBI Entrez site by using a keyword search. Nucleotide or protein sequences are available from
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=Limits&DB=Nucleotide
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?CMD=Limits&DB=protein
Be sure to use the limit function to define your search and select the fasta format to save the data to your local drive.

Chapter 2 Entrez Information Extraction

Objective: Extract useful information from Entrez pages prepared by NCBI and tabulate the extracted data into spreadsheets.

Public databases contain a tremendous amount of information about sequences. The Entrez page is one of the most important information sources. A typical Entrez page is shown in Fig. 1. It contains an annotated sequence with detailed information.
In general, an Entrez page is designed for viewing one sequence at a time; however, it is sometimes necessary to examine a large number of sequences, such as the sequences from a gene family. Entrez Information Extraction in GeneLooper allows you to extract important information from multiple Entrez pages and put it into spreadsheets. With the tabulated data, you can classify genes into groups by sorting the data based on the items extracted.

Figure 1. A sample of an Entrez page from NCBI

LOCUS       NP_000471     776 aa     linear     PRI 31-OCT-2000
DEFINITION  adenosine monophosphate deaminase (isoform E) [Homo sapiens].
ACCESSION   NP_000471
PID         g4502079
VERSION     NP_000471.1 GI:4502079
DBSOURCE    REFSEQ: accession NM_000480.1
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1 (residues 1 to 776)
  AUTHORS   Mahnke-Zizelman,D.K. and Sabina,R.L.
  TITLE     Cloning of human AMP deaminase isoform E cDNAs. Evidence for a
            third AMPD gene exhibiting alternatively spliced 5'-exons
  JOURNAL   J. Biol. Chem. 267 (29), 20866-20877 (1992)
  MEDLINE   93015995
  PUBMED    1400401
REFERENCE   2 (residues 1 to 776)
  AUTHORS   Yamada Y, Goto H and Ogasawara N.
  TITLE     Cloning and nucleotide sequence of the cDNA encoding human
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from U29926.1.
FEATURES             Location/Qualifiers
     source          1..776
                     /organism=Homo sapiens
                     /db_xref=taxon:9606
                     /chromosome=11
                     /map=11p15
                     /cell_type=T-lymphocyte, cytotoxic
                     /clone_lib=RPMI 8402, lambda 2001 library of R. Baer
     Protein         1..776
                     /product=adenosine monophosphate deaminase (isoform E)
                     /EC_number=3.5.4.6
     Region          313..401
                     /region_name=Adenosine/AMP deaminase
                     /db_xref=CDD:pfam00962
                     /note=A_deaminase
     Region          471..716
                     /region_name=Adenosine/AMP deaminase
                     /db_xref=CDD:pfam00962
                     /note=A_deaminase
     CDS             1..776
                     /gene=AMPD3
                     /db_xref=LocusID:272
                     /db_xref=MIM:102772
                     /coded_by=NM_000480.1:345..2675

Procedure:
I. File preparation: There are several ways to download a large number of Entrez pages.
1. Download the original Entrez data from the NCBI FTP site. For example, the curated human nucleotide and protein sequence pages can be downloaded from ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Protein/hs.gbff.gz and ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Protein/hs.gnp.gz. Unzip the downloaded files with WinZip (see Chapter 1 on how to use WinZip).
2. Selectively download the original Entrez data from the NCBI Entrez batch download site by inputting a list of GI or Accession numbers: http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide . After the list is submitted, the site returns a list of brief descriptions of the retrieved sequences. Select Genbank from the pull-down menu and click on the save button. Save the file to your local drive.
3. Selectively download the original Entrez data from the NCBI Entrez site by using a keyword search. Nucleotide and protein Entrez pages can be downloaded from: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=Limits&DB=Nucleotide and http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?CMD=Limits&DB=protein. Be sure to use the limit function to define your search. Save the file to your local drive as described in step 2.
II. Extraction:
1. Start GeneLooper 3.0 and then click on the Entrez Information Extraction button to show the Entrez Information Extraction form (Fig. 2).
2. Use the browse button to locate the original Entrez data file.
3. Select nucleotide or protein based on the type of sequences in the file.
The default setting is for nucleotide. There are twelve predefined fields for both nucleotide and protein files. When the protein option is selected, the fields change to those for protein.
4. Check the check button of each of the twelve fields that you need to extract.
5. Check the Turn on MS Excel button if you need to monitor the process. The results are also saved in the same folder.
6. Click on the OK button to start. After the process is complete, locate the saved data following the on-screen instructions.

Notes:
1. A total of twelve fields are pre-designed for both nucleotide and protein pages. The fields selected for nucleotide are different from those for protein. The selected fields contain the most important information about the sequence. If a particular field is not in the list, please contact GeneHarbor, Inc. for a customized version.
2. For a nucleotide page, only one row of records is generated per page. For a protein page, however, multiple rows of records are allowed in order to accommodate proteins with more than one domain (region). Samples of the results from both nucleotide and protein pages are shown in Fig. 3 and Fig. 4.

Figure 2A. Entrez information extraction form for nucleotides.
Figure 2B. Entrez information extraction form for proteins.
Figure 3A. Samples of tabulated information from nucleotide pages.
Figure 3B. continued from A.

Column Legend:
A. Accession
B. Definition
C. Version
D. Organism
E. Tissue
F. Chromosome
G. Cytogenetics
H. Morbid ID
I. Gene Size
J. CDS
K. Protein ID
L. Vector

Figure 4A. Samples of tabulated information from protein pages.
Figure 4B. continued from A.
Column Legend: 1 Accession Definition Version Organism Tissue Cell Type Chromosome Cytogenetics Morbid ID Coded by Signal Peptide Region Domain Name Domain ID Domain Code

Chapter 3 Sequence Collection

Objective: Collect all “.seq” files in a folder and put them into a Q-fasta file, or combine all small Q-fasta files in a folder into a large Q-fasta file.

You have probably worked on many genes, and the related sequences you saved are scattered. Finding a particular sequence takes time and requires a good memory. It becomes even more difficult when the number of “.seq” files reaches the thousands. The Sequence Collection function of GeneLooper 3.0 can help you put all your “.seq” files into a Q-fasta file. Each of the sequences can then be opened and studied by using Sequence Viewing and Editing with a proper ID, and they can also be batch-retrieved by using the Sequence Retrieving function.

Procedure:
1. Put the individual “.seq” files or small Q-fasta files to be collected or combined into one file folder. Start GeneLooper 3.0 and then click on the Sequence Collection button. The Sequence Collection form will appear (Fig. 1).
2. Use the drive selection menu to find the drive and use the folder navigator to locate the folder containing the sequence files to be collected. When you double-click on the folder name, the names of all files in the folder will appear in the file list box on the right side of the form.
3. You have three options for collecting the sequence files: 1) “.seq” files, 2) small Q-fasta files and 3) all “.seq” and Q-fasta files. From the file-type option menu located above the file list box, select “.seq” to collect only the “.seq” files, select “.txt” to collect all Q-fasta files, or select “all” to collect both. Make sure that no other types of files are in the folder.
4. The collected sequences can be saved to an existing Q-fasta file or as a new Q-fasta file.
In the first case, locate the existing Q-fasta file using the “Browse…” button. In the second case, use the same button to locate the folder where you want the new file to be stored and then type a file name ending in “.txt”, or use the default path and name: “combined.txt” in the same folder as the sequence files.
5. Click on the Collect File button. All of the sequence files will be quickly put into a Q-fasta file, ordered alphabetically by file name. You can examine the collected sequences in the saved file using the Document Viewer in GeneLooper 3.0 or another text editor. A sample of the Q-fasta sequences created is shown in Fig. 2.

Notes:
1. The file name of each collected “.seq” file serves as its “Accession” number or identification number (ID) in the descriptive section of the Q-fasta sequence and is flanked by two “|” signs. When small Q-fasta files are collected, the order and content of the sequences in the files are not changed.
2. The above procedure only shows how to collect sequence files in one folder. To collect all the “.seq” files on your personal computer into a Q-fasta file, you can simply put all of them into one folder. To do so, first find all your “.seq” files by typing “*.seq” in the Windows search box and performing the search. Second, select and copy all the “.seq” files and paste them into a new folder. Using the method described above, you can then collect all of your “.seq” files and convert them to a Q-fasta file.
3. The Sequence Collection function is also very effective for collecting the “.seq” files created by many sequencing machines and converting them into a Q-fasta file.

Figure 1. Sequence Collection Interface.
Figure 2. A sample of collected sequences in Q-fasta.

"> |NM_000795| description"
"GGCAGCCGTCCGGGGCCGCCACTCTCCTCGGCCGGTCCCTGGCTCCCGGAGGCGGCCGCGCGTGGATGCGGCG
GGAGCTGGAAGCCTCAAGCAGCCGGCGCCGTCTCTGCCCCGGGGCGCCCTATGGCTTGACCCTGAGGAAGGAG
GGGAAGCTGCAGCTTGGGAGAGCCCCTGGGGCCTAGACTCTGTAACATCACTATCCATGCACCAAACTAATAA
AACTTTGACGAGTCACCTTCCAGGACCCCTGGGTAAAAAAAAAAAAAAA"
"> |NM_000839| description"
"CCATGGGATCGCTGCTTGCGCTCCTGGCACTGCTGCCGCTGTGGGGTGCTGTGGCTGAGGGCCCAGCCAAGAA
GGTGCTGACCCTGGAGGGAGACTTGGTGCTGGGTGGGCTGTTCCCAGTGCACCAGAAGGGTGGCAGTGCTGCT
GCCAGGGCCAGCTCCAGCCTTGGCCAAGGGTCTGGCTCCCAGTTTGTCCCCACTGTTTGCAATGGCCGTGAGGT
GGTGGACTCGACAACGTCATCGCTTTGA"

Chapter 4 Sequence Separation

Objective: Create individual sequence files (.seq) from a Q-fasta file. This process is the opposite of Sequence Collection (Chapter 3).

When performing large-scale sequence analysis, it is often necessary to create many individual sequence files, the so-called dot seq (.seq) files that are accepted by many sequence analysis tools. One of the sequence sources used to generate these files is Q-fasta files. The Sequence Separation function in GeneLooper 3.0 provides very efficient and flexible ways to create “.seq” files. The sequence files can be created from any portion of a Q-fasta file, stored in a common folder or in their own folders, and readily used by GeneLooper 3.0 or by sequence analysis tools from many other vendors.

Procedure:
1. Start GeneLooper 3.0 and then click on the Sequence Separation button. The Sequence Separation form will appear (Fig. 1).
2. Use the Browse button to locate the Q-fasta file whose sequences are to be separated. It takes a few seconds to load the file if it contains thousands of sequences. The program automatically detects the number of sequences in the loaded file.
3. If you want to separate all sequences in the loaded Q-fasta file, skip the selection box labeled “Define a portion…”.
If you don’t want to separate all sequences in the loaded Q-fasta file, you may indicate the portion of the file to be separated by using the pair of pull-down menus to select the starting sequence and the ending sequence.
4. Choose the save file option to save the resulting “.seq” files in a common folder (see Fig. 2) or to save each of the “.seq” files into its own folder (see Fig. 3).
5. Click on the Separate Sequences button to begin.
6. After the process ends, locate the newly created sequence files following the on-screen instructions. Sample files saved in a common folder are shown in Fig. 2, while sequence files saved in their own folders are shown in Fig. 3.

Figure 1. Sample of the Sequence Separation interface.

Notes:
1. All sequences are stored in a newly created folder named after the input file, with a suffix that combines the ending number of the separation and “_sep”. If you have previously created a folder with the same name, be sure to remove it prior to the operation to avoid overwriting files and other errors.
2. The accession number or ID of a sequence is used as its folder and file name.
3. It is a good idea not to store too many separated sequence files in a single folder, because locating a file in a folder that holds too many files is quite slow. You can avoid this by storing them in a few different folders.

Figure 2. Sample files saved in a common folder.
Figure 3. Sample files saved in their own folders.

Chapter 5 Sequence Retrieving

Objective: Selectively batch-retrieve sequences from a Q-fasta file using a list of Accession (ACCN) numbers, GenBank identification (GI) numbers or your own sequence IDs.

The ability to batch-retrieve sequences from your own sequence collections gives you more power to deal with a large number of sequences and saves time compared to retrieving sequences online and downloading them from other resources.
The Sequence Retrieving utility in GeneLooper 3.0 provides a fast and easy way to batch-retrieve sequences from your sequence collections, including those created by yourself or your organization, so long as you have the sequences in a Q-fasta file and a list of IDs (Fig. 1).

Figure 1. Diagram of the sequence retrieving process

Procedure:
1. Prepare a text file (.txt) containing a list of accession numbers, GI’s or the ID’s of your own sequences. To generate the list file, you can use Microsoft Excel, put the ID’s in a single column (column A only) and save the file as a “.txt” file. See below:
NM_014165
NM_014164
NM_014163
NM_014162
NM_014161
NM_014160
2. Start GeneLooper 3.0 and then click on the Sequence Retrieving button. The Sequence Retrieving form will appear (Fig. 2).

Figure 2. A sample of the Sequence Retrieving interface.

3. Use the upper browse button to locate the Q-fasta file you think contains the sequences to be retrieved. Use the lower browse button to locate the list file containing the ACCN’s, GI’s or ID’s. (Use the Sequence Formatting utility to create a Q-fasta file as described in its procedure.)
4. Select an ID type from the file type options. Click on ACCN if you have a list of Accession numbers or your own ID’s; select GI if you have GenBank identification numbers.
5. Click on the Retrieve button to start. It usually takes a few minutes to retrieve a few thousand sequences. Once the process is complete, the form will display the file paths of a Q-fasta file containing the retrieved sequences and, if any ID’s failed to retrieve a sequence, a text file containing the missed ID’s.

Notes:
1. To ensure successful retrieval, both the sequence file and the ID list file must be in their correct formats.
2. A single Q-fasta file may not contain all of the sequences you want. If some ID’s fail to retrieve a sequence, you can directly use the missed ID list file to retrieve them from other Q-fasta files.
Repeat the process until you have exhausted all of your collections.

Chapter 6 Open Reading Frame Detection

Objective: Detect the longest open reading frames (ORFs) of multiple sequences in a Q-fasta file and tabulate the ORF data along with other general information.

An ORF is an important piece of information about an mRNA sequence. From an ORF sequence, a predicted protein sequence can be derived and subsequently used for studying its functional domains. In general, for a full-length mRNA, the longest ORF is the region coding for its protein. The sequence upstream of the first ATG of the longest ORF is called the 5’ untranslated region (5’UTR), and the sequence downstream of the stop in the same frame is called the 3’ untranslated region (3’UTR) (Fig. 1).

Figure 1. The ORF map of an mRNA for demonstrating the terms used in the program. (Diagram labels: Pre_Stop, 5’ UTR, First ATG, Longest ORF, Stop, 3’ UTR; Frame 1, Frame 2, Frame 3.)

The ORF data illustrated in Fig. 1 represents a perfect case of a full-length mRNA. It has a long region without any termination codon, and has at least one in-frame termination codon before the first ATG. Therefore, one can be sure that this sequence represents a full-length mRNA. It must be pointed out that not all real full-length mRNAs have an in-frame stop codon before the first ATG. Most of the mRNA sequences in the databases are derived from cDNA sequences, which are frequently truncated at the 5’ end. It requires a great deal of effort to determine whether a cDNA sequence is full-length.

In hunting for new genes, it is often necessary to work with raw data or incomplete sequences such as EST sequences, which are normally a combination of DNA sequences of coding regions, UTRs, partially spliced transcripts and genomic DNA contamination. These sequences are about 300 to 800 bp in length. A high-throughput way to detect partial ORFs would help researchers prioritize their effort in the gene discovery process.
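The full-length case described above (longest ORF, first ATG, in-frame stop, and an in-frame stop before the first ATG) can be sketched in Python as follows. This is only an illustration under stated assumptions, not GeneLooper's algorithm: the function name `longest_complete_orf` and the result keys are hypothetical, only the three forward frames are scanned, the standard stop codons TAA, TAG and TGA are assumed, positions are reported 1-based, and the ORF length is assumed to include the stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def longest_complete_orf(seq):
    """Return a dict for the longest ATG...stop ORF in the three forward
    frames, or None if no frame contains a complete ORF."""
    seq = seq.upper()
    best = None
    for frame in range(3):
        atg_pos = None     # 0-based index of the first ATG since the last stop
        seen_stop = False  # True once an in-frame stop has been passed
        pre_stop = False   # True if an in-frame stop preceded the current ATG
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                if atg_pos is not None:
                    length = i + 3 - atg_pos  # assumption: stop codon counted
                    if best is None or length > best["orf_length"]:
                        best = {
                            "frame": frame + 1,
                            "atg_pos": atg_pos + 1,  # 1-based position of the A in ATG
                            "stop_pos": i + 1,       # 1-based start of the stop codon
                            "orf_length": length,
                            "pre_stop": pre_stop,
                        }
                atg_pos = None
                seen_stop = True
            elif codon == "ATG" and atg_pos is None:
                atg_pos = i
                pre_stop = seen_stop
    return best
```

With a result `r`, the 5’UTR in this sketch would be `seq[:r["atg_pos"] - 1]` and the 3’UTR would be `seq[r["stop_pos"] + 2:]`.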
The ORF Detection function in GeneLooper 3.0 is a high-throughput program. It allows users to analyze either full-length or partial mRNA sequences and presents the ORF data in an informative and straightforward way. Fig. 2 demonstrates the methods used in detecting ORFs and the nomenclature used for data reporting.

Figure 2. Diagram demonstrating the two methods used in detecting the longest ORF.
[Figure 2 panels: 1. Complete ORF method, shown with and without a pre-stop (Pre_stop: yes / Pre_stop: no); labels Pre_length, ORF_length, Pre-stop, ATG-pos, Stop-pos, mRNA_length. 2. Incomplete ORF method, four cases (5_stop / In-frame ATG / 3_stop): yes / yes / yes; no / yes / no; no / no / no; yes / yes / no; labels ATG, Stop, ORF_length, mRNA_length.]

Procedure:

1. Start GeneLooper 3.0 and then click on the Open Reading Frame Detection button. The ORF detection form will appear (Fig. 3).

2. Use the browse button to locate the Q-fasta file to be analyzed. (Use the Sequence Formatting utility to create a Q-fasta file as described in its procedure.)

3. Select a method for detecting ORFs in the ORF option panel. There are two options: the complete ORF method and the incomplete ORF method. An ORF is considered complete if it has an ATG at the 5’ end and a stop at the 3’ end. This option is suitable for full-length mRNAs. An incomplete ORF is defined as a region without any stop codon, regardless of whether it has an ATG or a termination codon at its ends. This option is designed for raw sequence data or partial sequences such as ESTs, for an initial screen of potential coding sequences. In both methods, the longest ORF for each sequence will be reported (see Fig. 2 for details).

4. Check the Turn on MS Excel box if you want to monitor the detection process.
A copy of the results will be automatically saved in the same folder as the input file, under the same file name with a suffix of “_Lorf.txt” (complete ORF) or “_Porf.txt” (incomplete ORF), depending on the method used. If the Turn on MS Excel box is not checked, the saved data file will appear in a spreadsheet of the Data Viewer after the process is completed.

5. Click on the Start button to start. Samples of ORF data are shown in Fig. 4 and 5.

Figure 3. A sample of the Open Reading Frame Detection interface.

Notes:

1. An ORF reported here is the longest ORF of a sequence. It may differ from the coding sequence (CDS) described in the NCBI record, because NCBI may not use the first ATG as the initiation codon and may not use the longest ORF as the coding sequence. To compare your data with NCBI’s data, see the Entrez Information Extraction chapter on how to extract information from the Entrez records.

2. The ORF detection function also extracts other information from a Q-fasta file, such as accession, GI and description. Therefore it is useful when you want to get a list of GIs or ACCNs from a Q-fasta file.

3. Detecting the ORFs of all available full-length human mRNA sequences takes about an hour, and the returned data are extremely informative. From this type of tabulated data, and with a few calculation steps, you can answer some fundamental biological questions, such as the average size and size distribution of human transcripts, and the average sizes of the coding regions and 5’ untranslated regions. The incomplete ORF detection method allows you to study the coding regions of partial cDNAs, for example, to screen for those with relatively long incomplete ORFs, because they are more likely to be genuine transcripts coding for proteins. The resulting table can also be used as the general information of your gene collections.

Figure 4. Sample of the data file obtained by using the complete ORF method:

Figure 5.
Sample of the data file obtained by using the incomplete ORF method:

Chapter 7 Multi-Sequence Alignment

Objective: Perform multi-sequence to multi-sequence similarity searches. Provide both a tabulated summary of similarity data and detailed whole-sequence alignments of homologous sequences in document files.

Many public and private organizations provide free, web-based sequence search services using the BLAST program (Basic Local Alignment Search Tool). BLAST has proven very effective in searching for sequence homology, especially against a large data set. However, most of these service sites only allow a single-sequence search and provide a standard alignment report. Setting up a high-throughput BLAST facility requires a great deal of equipment and human resources. Furthermore, tabulating the results of multiple-sequence searches into a summary is quite challenging. The Multi-Sequence Similarity Search utility in GeneLooper 3.0 is built on a search algorithm that offers more flexible selection of data sets, high throughput, low cost and better data presentation. It is very effective when performing multiple sequence similarity searches against a relatively small data set, such as the transcripts of all human genes (35,000 sequences).

Features:

1. Search sequences in flexible data sets with adjustable search parameters.
2. High throughput: processes thousands of sequences at once.
3. Instant data report: the tabulated data contain useful information including accessions, lengths, ORF, and proximal homologous locations of the two sequences.
4. Complete sequence alignments instead of fragmented alignments.
5. Query and returned sequences are saved in a common folder for further study.
6. A search process can be interrupted and resumed easily.

Procedure:

1. Start GeneLooper 3.0 and then click on the Multi-Sequence Blast button. The Multi-Sequence Similarity Search interface will appear (Fig. 1).
2. Use the browse button in the top box to locate the sequence database file in Q-fasta. (Use the Sequence Formatting utility to create a Q-fasta file as described in its procedure.)

3. Use the browse button in the lower box to locate the query sequence file in Q-fasta.

4. Select the search stringency. The default setting is 30, which reports sequences with low homology. See the notes below for more details.

5. Select a strand for the similarity comparison. The default setting is the plus strand. To compare sequences with both strands, you can perform two separate searches, one with the plus strand and the other with the reverse complement strand. This program only searches one strand against the sequences of a database, in order to reduce computing time.

6. Select the Save sequence option if you need the individual sequence files for further homologous sequence analysis.

7. Select the Save alignment option if you need a detailed, print-out-ready sequence alignment to review later. The alignment of each queried sequence and its homologues will be saved in individual document files. A search with this option selected will take slightly longer than one without it.

8. Check the Turn ON MS Excel button if you would like to monitor the search process.

9. Click on the Start button to begin. Locate the saved data by following the on-screen instructions after the process is finished. Samples of returned data are shown in Fig. 2-5.

Figure 1. Multi-Sequence Similarity Search interface.

Notes:

1. This program can be used for both nucleotide and protein sequences, but it is not suitable for large DNA sequences such as genomic DNA (>60,000 bp).

2. The search speed depends on the speed of the PC, the search stringency and the size of the database. It will take about 10 hours to search one thousand sequences against twenty thousand sequences using a regular PC (1.5 GHz, 500 MB RAM).

3.
A lower search stringency setting (<50) returns sequences with low similarity and usually slows down the program, while a higher setting (>100) reports only those with higher similarity. You can find a proper setting for your needs by testing some sequences you know.

4. The scoring system used in the resulting report is quite different from that used by the BLAST 2 program. The higher the score, the higher the similarity.

Figure 2. The tabulated similarity report. (A) (B) continued from A.

Column specification:

A. Query_ID: sequence search order.
B. Query_Accn: the query accession number.
C. Query_Size: the length of the query sequence.
D. Mtch_ACCN: the accession of the returned sequence.
E. Mtch_Name: the description of the returned sequence.
F. Mtch_Seq_Size: the length of the returned sequence.
G. Mtch_ORF: the ORF length of the returned sequence.
H. Mtch_atgpos: the initiation codon (ATG) position of the returned sequence.
I. Mtch_stoppos: the termination codon (stop) position of the returned sequence.
J. Q_beginpos: the proximal location of matching from the 5’ end of the query sequence.
K. M_beginpos: the proximal location of matching from the 5’ end of the returned sequence.
L. Score: the similarity value used by GeneLooper 3.0.

Figure 3. Sample of saved sequences.

Figure 4. Sample of saved alignment files.

Figure 5. A sample of the saved whole sequence alignments opened with Document Viewer

Chapter 8 Restriction Site Search

Objective: Search for restriction sites of multiple sequences in a Q-fasta file.

Work with a large number of sequences sometimes requires analysis of the restriction sites of many nucleotide sequences. The restriction information of DNA sequences is important for designing experiments such as subcloning. Traditionally, restriction sites are searched one sequence at a time, which is inefficient when multiple sequences are involved in an experiment.
The Restriction Search function of GeneLooper 3.0 provides a simple solution for this purpose.

Procedure:

1. Start GeneLooper 3.0 and then click on the Restriction Search button. The Restriction Site Search form will appear (Fig. 1).

Figure 1. The Restriction Site Search interface.

2. Use the browse button to locate the sequence file in Q-fasta format. (Use the Sequence Formatting utility to create a Q-fasta file as described in its procedure.)

3. Select the restriction enzymes you want in the frequently used enzyme panel. If a desired enzyme is not in the panel, it can be selected from the custom sites pull-down menu. If the enzyme is not in the pull-down menu, it can be entered manually by typing the enzyme name and its recognition site into the name and site boxes respectively. Only one custom enzyme is allowed per search. If multiple custom enzyme sites are needed, you can run multiple searches and combine the results.

4. Select the Turn on MS Excel check box if you wish to monitor the search process. The results of a search are always saved in the input file folder, using the input file name with a suffix of “_res”.

5. Click on the Detect sites button to start. If the Excel option is selected, the results will update in an Excel sheet immediately (Fig. 2). Otherwise, after the process is complete, the results will be shown in the Data Viewer sheet.

Figure 2. Sample of the restriction search data.

Notes:

1. The data from each selected enzyme occupy one column in the resulting spreadsheet, and the location of each site in a sequence is put in a pair of brackets in the corresponding column.

2. The ORF information is presented because the ATG and Stop positions are often taken into account when deciding whether a particular enzyme site is usable. How you use the restriction data depends on your needs.
However, simply sorting an enzyme column is an effective way to group DNA sequences based on a restriction enzyme site.

Chapter 9 Translation and Reverse Complement

Objective:
1. Translate mRNA sequences into protein sequences.
2. Generate the reverse complement strand.

Determination and translation of the coding regions of mRNA sequences are important steps in gene annotation and gene mining, since protein sequences are rich in information that provides clues about the biological functions of a protein. For example, conserved domains are derived from protein sequences. The ability to directly translate multiple DNA sequences into proteins using a PC gives researchers an advantage in studying a large number of genes.

Features:

1. Translate thousands of sequences in a short period of time.
2. Multiple ways to translate: direct translation, translation of the longest ORF or translation of all three frames.
3. Instant data report including protein length and molecular weight.
4. Translated protein sequences are saved in a Q-fasta file.

Procedure:

1. Start GeneLooper 3.0 and click on the Translation and Reverse Complement button. The Translation and Reverse Complement form will appear (Fig. 1).

2. Select a method to translate the sequences. There are three options: 1) translate the longest ORF, which tells the program to automatically detect the longest ORF and then translate it; 2) translate each sequence from its beginning; and 3) translate all three frames. With the last option, frame 1 starts translation at base position 1, frame 2 at base position 2 and frame 3 at base position 3.

3. Select the Turn on MS Excel check box if you wish to monitor the translation process. Once the translation starts, a summary of the results is immediately displayed in an Excel sheet (Fig. 2).

4. Click on the Translation button to start the program.

5.
When the program is complete, follow the on-screen instructions to locate the translated protein sequences in a Q-fasta file stored in the input file folder. The file name is the DNA file name with a suffix of “Lorf_pep”. In addition to the protein sequence file, a summary file is also saved in the input file folder; it is named after the input file with a suffix of “_pep_info”. The summary file can be opened in the Data Viewer of GeneLooper. A sample of the translated proteins is shown in Fig. 3A and B.

Figure 1. The Translation and Reverse Complement interface.

Figure 2. Sample of the summary of a translation process.

Notes:

1. The descriptive line of a translated protein in the Q-fasta file is the same as that of its DNA sequence, except that the word “protein” is added at the end of each description.

2. When direct translation or three-frame translation is selected, internal terminations may occur. Each termination codon is represented by a “*”.

3. The names of the three proteins translated from a DNA sequence by the three-frame method are the same, except that a plus number (+1, +2 or +3) is added after the “>” sign to indicate the frame in which the protein is encoded. See Fig. 3B.

Figure 3A. Sample of protein sequences translated by the ORF method.

">gi|4758283|ref|NM_004441.1| Homo sapiens EphB1 (EPHB1) mRNA"
"MALDYLLLLLLASAVAAMEETLMDTRTATAELGWTANPASGWEEVSGYDENLNTIRTYQVCNVFEPNQ
NNWLLTTFINRRGAHRIYTEMRFTVRDCSSLPNVPGSCKETFNLYYYETDSVIATKKSAFWSEAPYLKVDT
IAADESFSQVDFGGRLMKVNTEVRSFGPLTRNGFYLAFQDYGACMSLLSVRVFFKKCPSIVQNFAVFPETM
TGAESTSLVIARGTCIPNAEEVDVPIKLYCNGDGEWMVPIGRCTCKPGYEPENSVACKACPAGTFKASQEA
EGCSHCPSNSRSPAEASPI……"
">gi|4503680|ref|NM_003890.1| Homo sapiens IgG Fc binding protein (FC(GAMMA)BP) mRNA"
"MGALWSWWILWAGATLLWGLTQEASVDLKNTGREEFLTAFLQNYQLAYSKAYPRLLISSLSESPASVSIL
SQADNTSKKVTVRPGESVMVNISAKAEMIGSKIFQHAVVIHSDYAISVQALNAK….."

Figure 3B.
Sample of protein sequences translated by the three-frame method

">+1|4758283|ref|NM_004441.1| Homo sapiens EphB1 (EPHB1) mRNA (protein)"
"EFHMHTHTHARPHRPTRTHSCPRPRSAPGSPVRARARKDTEKPPAESAAAPWDAALSRRCCLGLVSACG
PSAGDGPGLSTTAPPGIRSGCDGRNVNGHQNGYCRAGLDGQSCVRVGRSQWLR*KPEHHPHLP…… "
">+2|4758283|ref|NM_004441.1| Homo sapiens EphB1 (EPHB1) mRNA (protein)"
"NSTCTPTPTRARTAPRAHTPAHAHAALREVRSGRERERIPRSHPRRAQRRPGTRRSPGAAASAWSRPAGR
RPAMALDYLLLLLLASAVAAMEETLMDTRTATAELG….. "
">+3|4758283|ref|NM_004441.1| Homo sapiens EphB1 (EPHB1) mRNA (protein)"
"IPHAHPHPRAPAPPHAHTLLPTPTQRSGKSGPGESAKGYREATRGERSGALGRGALPALLPRLGLGLRAV
GRRWPWIIYYCSSWHPQWLRWKKR*….. "
">+1|4503680|ref|NM_003890.1| Homo sapiens IgG Fc binding protein (FC(GAMMA)BP) mRNA (protein"
"LQPWVPYGAGGYSGLEQPSCGD*PRRLQWTSRTLAERNSSQPSCRTISWPTARPTPASLSPVCQRAPLQSP
SSARQTTPQRRSQ*GPGSRSWSTSVPRLR**AARSSSMRW*SILTMPSLCRH*MPSLTQRS*HCCGPSRP*

Reverse Complement

Occasionally, one needs to work with the reverse complement strand, for example in sequence alignment or antisense studies. The function to generate a reverse complement strand is provided in conjunction with the translation function. To derive a reverse complement strand, select Reverse complement from the option panel. This activates the Reverse complement button and deactivates the translation-related buttons. Use the browse button to select the sequence file in Q-fasta format, and then click on the Reverse complement button. Once the process is complete, the reverse complement sequence file in Q-fasta will be saved in the input file folder and named after the input file with a suffix of “_rev.txt”. The descriptive line of each reverse complement strand is the same as that of its input sequence, with the additional words “reverse complement strand” at the end.

Chapter 10 Batch Oligo Design

Objective: Design optimal oligos for multiple sequences.

When studying genes in large numbers, it is not unusual to design many oligos.
For instance, subcloning cDNA fragments from the members of a gene family into a particular vector, or generating multiple cDNA fragments for the production of a DNA array, depends on a large number of oligos used in the polymerase chain reaction (PCR). Automating and optimizing the design process can save time and reduce cost.

Features:

1. Design optimal oligos for multiple sequences with flexible parameter settings.
2. Innovative algorithm: selects the best pair of oligos from predefined regions.
3. Adjustable design regions based on the ORF information.
4. Automatically adds restriction sites and protection bases if needed.
5. Instant data report.

Procedure:

1. Start GeneLooper 3.0 and then click on the Batch Oligo Design button to show the design form (Fig. 1).

2. Use the browse button to locate the input DNA sequence file in Q-fasta format. (Use the Sequence Formatting utility to create a Q-fasta file as described in its procedure.)

3. Check either Forward or Reverse Oligo if only one oligo is needed for a sequence. Check both if a pair of oligos is needed. The Set Parameter buttons underneath become enabled after the selection.

4. Click on the Set parameter button for the forward oligo to define the region in which the forward oligo is located. Click on the Set parameter button for the reverse oligo to define the region in which the reverse oligo is located. The parameter setting forms will appear as shown in Fig. 2 A and B.

5. Set the parameters for a forward oligo:

5.1. Define the left boundary. There are four reference points in a typical mRNA sequence: 5’ end, ATG, Stop and 3’ end. You can select any of these as the starting point. For example, select the ATG as the reference point and set the left boundary 100 bases upstream of it by typing in –100. The minus sign indicates that the position is upstream of the ATG. See Fig. 3 for the terms used in the parameter forms.

5.2. Define the right boundary.
Let’s again use the ATG as the reference point and enter 1 in the input box. With these settings, the program will design an oligo in the region from 100 bases upstream of the ATG to the letter A of the ATG in every sequence.

Figure 1. The Batch Oligo Design interface.

Figure 2A. Forward Oligo Parameter form.

Figure 2B. Reverse Oligo Parameter form.

Figure 3. Diagram referencing location- and direction-terms used in the parameter settings.
[Figure 3 labels: 5’ end, ATG, Stop, 3’ end; minus (-) toward the 5’ end, plus (+) toward the 3’ end]

Note: There are four frequently used points in a typical mRNA sequence: the 5’ end, ATG, Stop and 3’ end. Any of the four can be used as a reference point. Bases counted from the reference point toward the 5’ end are minus; bases counted toward the 3’ end are plus.

5.3. Add a restriction site from the pull-down menu if needed, and type the protection bases if necessary. Click the OK button to close the forward parameter setting form.

6. If a reverse oligo is needed, click on the Set parameter button to define the region in which the reverse oligo is located. The reverse parameter setting form will appear (Fig. 2 B).

6.1. First, as with the forward parameters, set the left boundary. Let’s use the Stop as the starting point and type “1” in the box next to the Stop label.

6.2. Second, define the right boundary. Let’s again use the Stop and type 200 in the box next to the “Stop” label. With these settings, the oligo will be designed in the region between the stop and 200 bases downstream of the stop.

6.3. Add a restriction site from the pull-down menu if needed, and type the protection bases if necessary. Click on the OK button to close the reverse parameter setting form.

7. Set the annealing temperature (Tm) of the oligos from the pull-down menu in the batch oligo design form. The annealing temperature will be used for every oligo.

8. Select the Turn on MS Excel button if you need to monitor the design process.
Regardless of the selection, the results are always saved in the input file folder, under the input file name with a suffix of “_oligo.txt”.

9. Click on the Design Oligo button to start. After it has finished, locate the data by following the on-screen instructions. A sample of the designed oligos in a resulting file is shown in Fig. 4.

Figure 4A. Sample oligos designed by the program.

Figure 4B. Continued from A.

Column Legends:

A. Number: the accession of the input sequence.
B. Oligo1Name: the accession plus “_F”, used as the forward oligo name. If the region is too short to design an oligo, the word “impossible” is added to the cell.
C. Oligo1Seq: the sequence of the forward oligo.
D. Score1: the penalty score for the forward oligo. The lower the score, the better the oligo.
E. Fpos: the position (5’ end) of the forward oligo in the input sequence.
F. Oligo2Name: the accession plus “_R”, used as the reverse oligo name.
G. Oligo2Seq: the sequence of the reverse oligo.
H. Score2: the penalty score for the reverse oligo.
I. Rpos: the position (5’ end) of the reverse oligo in the input sequence.
J. FragmentSize: the result of Rpos minus Fpos.

Part II Single Sequence Utilities

Chapter 11 Sequence Viewing and Editing

Objective: View, edit and manipulate nucleotide and protein sequences.

Procedure: Start GeneLooper 3.0 and then click on the Sequence Viewing and Editing button to show the form (Fig. 1).

Figure 1. Sequence Viewing and Editing Interface.

1. Load sequence: There are three ways to load sequences onto the sequence pages. Click on the File menu to reveal the sub-menus (Fig. 1). The Sequence Viewer can open sequence files created by most sequencing machines and software, and allows multiple sequences to be loaded. You may load up to 50 individual sequences into 50 viewer sheets at once.

A. Option 1: Click on Open to load a single file, which can be a sequence file with a “.seq” extension or a Q-fasta file.
The first 50 sequences in the Q-fasta file can be loaded on 50 viewer sheets.

B. Option 2: If you want to work on many individual sequence files located in a folder, click on Open a folder of sequences on the File menu and use the folder navigator to locate the sequence folder to open. Then click on OK. Up to 50 “.seq” files in this folder can be loaded (Fig. 2).

C. Option 3: If you want to choose sequences in a Q-fasta database file to view, click on Open from a local database to open the dialog box shown in Fig. 3. Click on the browse button and locate the database file (Q-fasta). Click on the Get all IDs button, and all IDs (ACCN) of the sequences in the file will be loaded into the pull-down menu. You may then select a sequence from the pull-down menu to view, or type the ID of the sequence in the Input sequence ID box. Click OK to load the sequence onto a viewer sheet. The typed-in ID can also be the GI of a sequence.

Figure 3.

After opening a sequence on the viewer sheets, define the sequence type by selecting the Nucleotide or Peptide Sequence option. The default setting is nucleotide. The screen above the sequence box dynamically shows sequence information such as the size, the GC content for a nucleotide sequence or the molecular weight for a peptide, and the selected location. A sample of sequences loaded on the viewer is shown in Fig. 4.

Figure 4.

2. Sequence Editing: Use the sub-menus under the Edit menu or the icon shortcuts to perform standard sequence editing, such as Cut, Paste, Copy, Select all and Change cases. Under the Edit menu, the Filter function removes any characters other than the 26 letters of the alphabet. If the filter option on the form is checked, a sequence will be auto-filtered when pasted from the clipboard.

3. Detection: Click on the Detect menu to show the detection functions (Fig. 5).

A. First Open Reading Frame: detects the first ORF after the cursor position.
B.
Longest ORF: detects the longest open reading frame of the entire sequence.
C. Longest Incomplete ORF: detects the longest region without a termination codon. See Chapter 6 (Open Reading Frame Detection) for details about the ORF definition. This method is useful for examining the 5’ UTR and raw data.
D. ORF Map: detects all ATGs and termination codons in all three frames and presents them in a map (Fig. 6).

Note: Use the shortcuts to run each function.

Figure 5. The sub-menus under the Detect menu.

Figure 6. A sample of an Open Reading Frame (ORF) map. Each of the long vertical lines represents a termination codon and the short ones represent ATG locations. The adjacent nucleotides of the sequence are shown at the bottom as the cursor moves along each of the sequence lines.

E. Short Sequence search: Click on the Search menu to show the search box shown in Fig. 7. Type in the sequence to be searched for and then click on the OK button. The match is highlighted on the viewer sheet. The search is case-sensitive and proceeds from the cursor position onward.

Figure 7. The Search Input box.

F. Restriction Site: Click on the menu to export the current sequence to the Restriction Analysis form. See Chapter 12 (Restriction Analysis) for details.
G. Oligo Location: Click on the menu to export the current sequence to the Oligo Database Search form. See Chapter 15 for details.

4. Function:

A. Reverse Complement Strand: makes a reverse complement sequence to replace the selected sequence, or the whole sequence if nothing is selected.

Figure 8. Sub-menus under the Function menu

B. Protein Sequence: translates the selected sequence, or the whole sequence if nothing is selected, to protein and places it in a new form.
C. Sequence Similarity Search: exports the current sequence to the Local Sequence Similarity Search form. See Chapter 16 for details.
D. Align Sequence: exports all opened sequences to the Sequence Alignment form.
E. Oligo Design: exports the current sequence to the Oligo Design form.
F.
Hydropath Plot: generates a hydropathy plot of the current protein sequence. The calculation is based on the Kyte and Doolittle method (Fig. 9, see Chapter 10 for more information). Move the cursor along the curve in the left panel to view a zoomed-in picture in the right panel.

Figure 9. A sample Hydrophobicity Plot of a protein sequence.

G. Journal Format: exports the current sequence to the Sequence Journal Format form. See Chapter 16 for details.

Chapter 12 Restriction Analysis

Objective: Restriction analysis of DNA sequence, virtual enzyme digestion, gel electrophoresis, and vector map drawing.

Restriction digests and agarose gel electrophoresis are routine experiments in a molecular biology laboratory. Easy-to-use and informative software can help researchers design a successful experiment. The Restriction Analysis function is one of the innovative designs in GeneLooper 3.0. It has an all-in-one interface on which users can see all enzyme sites grouped by the number of cuts in the DNA sequence and test any enzyme digestion with a single click. It also provides the open reading frame (ORF) information of a sequence, which is an important factor in deciding which enzymes can be used. A typical restriction form is shown in Fig. 1.

Figure 1. The Restriction Analysis interface.

Procedure:

I. Load a DNA sequence:

Click on the Browse button and then select a DNA sequence file with a “.seq” extension to open. Any characters other than the 26 letters of the alphabet will be filtered out of the sequence automatically. A sequence can also be imported from the Sequence Viewing and Editing form you are using. The restriction search is case-insensitive. Once a sequence is loaded into the sequence box, the program automatically detects the longest ORF and draws the positions of the ORF in the picture box.
It also searches for all restriction sites in the sequence and groups the enzymes into four classes based on the number of sites found in the DNA sequence. The names of the enzymes are stored in the four pull-down menus, namely no site, 1 site, 2 site and 3 and more. The sequence in the box can be cut and pasted, and any change in the sequence will trigger a new round of ORF detection and restriction search.

II. Enzyme Selection:

1. Select any enzyme from one or more of the three pull-down menus (1 site, 2 site and 3 and more). The name and locations of the selected enzyme will appear in both the lower text box and the picture box. If the enzyme has multiple cutting sites, it will appear multiple times at the distinct locations.

2. If you want to choose all enzymes in a pull-down menu, click on the Show All button under the menu (optional). The names and locations of all enzymes that can digest the sequence will be shown in both the text box and the picture box.

3. If the name labels in the picture box overlap each other, use your mouse to separate them by holding down the left mouse button and dragging a label up or down.

4. Click on the Undo Pick button to unload all selected enzymes.

5. Click on the Hide ORF button to hide the ORF information.

6. Click the Copy map button to copy the restriction map to other document programs such as MS PowerPoint.

III. Virtual Gel:

1. Select any combination of restriction enzymes from the pull-down menus.

2. Choose the Linear or Circle option based on the nature of your DNA.

3. Click on the Virtual Gel button to see a computer-generated gel picture similar to the one shown in Fig. 2. Separate any overlapping band labels by holding down the left mouse button and dragging a label up or down.

4. Click on the Reset button to perform another digest by repeating steps 1 to 3.

Note: only the 1 kb ladder (Courtesy of Invitrogen) is used as the DNA Marker.
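The band sizes on a virtual gel follow directly from the cut positions and from the Linear or Circle choice in step 2. The Python sketch below shows that arithmetic only; it is an illustrative assumption rather than GeneLooper's own code, and the function name and the convention that each cut position is the number of bases to the left of the cut are hypothetical.

```python
# Hypothetical sketch: fragment sizes for a linear or circular digest.
def digest_fragments(seq_length, cut_positions, circular=False):
    """Return fragment sizes in bp, largest first.

    cut_positions: for each cut, the number of bases lying to its left
    (an assumed convention for this example).
    """
    cuts = sorted(set(cut_positions))
    if not cuts:
        return [seq_length]  # uncut DNA runs as a single molecule
    if circular:
        # Fragments between neighbouring cuts, plus the one that wraps
        # around the origin from the last cut back to the first.
        sizes = [b - a for a, b in zip(cuts, cuts[1:])]
        sizes.append(seq_length - cuts[-1] + cuts[0])
    else:
        bounds = [0] + cuts + [seq_length]
        sizes = [b - a for a, b in zip(bounds, bounds[1:])]
    return sorted((s for s in sizes if s > 0), reverse=True)


if __name__ == "__main__":
    print(digest_fragments(3000, [500, 2000]))                 # linear
    print(digest_fragments(3000, [500, 2000], circular=True))  # circular
```

The same two cuts give three fragments on linear DNA but only two on a plasmid, which is why the Linear or Circle option changes the gel picture.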
The virtual gel picture can be copied and pasted into other document programs.

Figure 2. Sample Virtual Agarose Gel electrophoresis.

Draw a plasmid map:

1. Select the restriction enzymes with which you wish to label the plasmid map.

2. Click on the Show Circle Map button. The plasmid map drawing form will appear.

3. Use the line width menu to adjust the line width of the circle.

4. To change the circle size, check the Change the circle size box, move your cursor to the circle, hold down the left mouse button and move it. Moving it toward the center will reduce the circle size, whereas moving it away from the center will increase the circle size. Once the size is set, uncheck the Change size box immediately to fix the drawing.

5. Separate any overlapping labels by holding down the left mouse button and then moving the label to the desired location.

6. To add a new label, click on the Add label button. The label entry box will appear. Select the font and font size, type the label and hit the Enter key. Use the mouse pointer to move the label to the desired location as described above.

7. Click on the Undo label button to remove all newly added labels if needed.

8. The picture can be copied and pasted into other programs such as MS PowerPoint.

Figure 3. The Vector Map Interface

Chapter 13 Sequence Alignment

Objective: Perform one-on-one sequence alignment and present alignment data in multiple ways.

Sequence alignment is one of the most frequently used operations in DNA and protein sequence analysis. It gives detailed information about the relationship between two DNA or protein sequences. From the alignment results, it is therefore possible to identify conserved regions that may suggest similarities in biological function. The Sequence Alignment function in GeneLooper 3.0 is innovative, sophisticated and informative.

Features:

1.
Easy to input sequences: a pool of sequences can be loaded simultaneously.
2. Sophisticated alignment algorithm: searches for the best alignment between two sequences.
3. Protein sequences can be translated directly from the loaded DNA sequences and aligned to immediately test the consequence of a difference in nucleotide sequence.
4. Align the reverse complement strands derived from loaded DNA sequences.
5. Adjustable and detailed alignment diagram.
6. Editable whole sequence alignment report.
7. Tabulated base-to-base comparison: ideal for SNP analysis.

Figure 1. The Sequence Alignment interface.

Procedure:
1. Input sequences:
A. Click on the Import Seq button to import a single sequence.
B. Click on the Import Folder button to import all sequences in a folder.
C. Import from the Sequence Viewing and Editing form.
Note: The sequences can be proteins or nucleotides. The names of all imported sequences are stored in the pull-down menu and remain available until the Clear All button is clicked or the form is closed.
2. Load two sequences for alignment:
A. Select the first sequence name from the pull-down menu as SeqI. Its sequence will be loaded into the SeqI row in the sequence box.
B. Select the second sequence name from the pull-down menu as SeqII. Its sequence will be loaded into the SeqII row in the sequence box.
Note: The proteins encoded by the two DNA sequences can be aligned after translating the DNA into protein. There are four methods to choose from in the pull-down menu on the Translation panel. Use the Rev complement function to generate the reverse complement of a loaded sequence.
3. Click on the Align button to align the two sequences. Identical bases in the two sequences are connected by a “|”. The alignment box can only show a portion of the alignment.
To see other regions of the alignment, move the cursor left or right within the lower picture box to view the entire aligned sequences (see below). To stop the sequence from moving, click the left mouse button. To resume the movement, click the left mouse button again.
4. Click on the Draw Diagram button to create an alignment diagram in the lower picture box (Fig. 1). In the diagram, a thin horizontal line indicates a dissimilar region between the aligned sequences, while a blue bar represents a homologous region. The location of each homologous region is labeled precisely, and the entire drawing is in proportion to the lengths of the sequences.
Note: The threshold for a significant homology displayed in the diagram can be adjusted by changing the settings in the Draw parameters panel. The default setting is 50 bp long and 100 percent identity. Any region fulfilling the requirements is represented by a blue bar. At 100 percent identity, the labels at the junctions of the blue bars (match) and the thin line (mismatch) indicate the exact transition positions. If this parameter is set to less than 100 percent, the labels only indicate approximate transition positions. The default setting is very useful for identifying exon junctions when comparing two splice variants. The percent identity is defined as the number of matched bases of SeqI divided by the length of SeqI within the region defined by the two guidelines.
5. To separate any overlapping labels, point the cursor at a label, hold down the left mouse button and move it. Click on the Show guide labels button to see two guidelines for dynamically calculating the identity of the region flanked by them. The guidelines can be moved by moving the cursor while holding down the left mouse button. The two guidelines can be hidden by clicking on the Hide guide labels button and revealed by clicking the same button again.
6.
Click on the Copy Diagram button to copy the diagram to the clipboard and paste it into other document programs such as MS PowerPoint.
7. Click on the Page View button to view the whole sequence alignment (Fig. 2). You can set the row width by changing the number of bp/row in the pull-down menu above the Page View button.

Figure 2. Sample of whole sequence alignment in a page view.

Note: The alignment is shown in a document editor so that it can be modified as a document file. The identity is defined as the number of matched bases in SeqI divided by the total length of SeqI. SeqI and SeqII can be exchanged during the sequence loading process (step 2).
8. Click on the Table View button to view a tabulated base-to-base alignment (Fig. 3). All bases of the two sequences are aligned vertically in two columns, and the position of each base relative to its 5’ end is precisely marked in the adjacent columns. The Table View function is very useful for SNP studies.
A. Click on the Show Mismatch button to show all mismatched bases.
B. Click on the Show Gap button to show all bases in the gap regions.
C. Click on the Show Match button to show all matched bases.

Figure 3. A sample of base-to-base alignment in a table view.

9. Click on the Reset button to clear all current alignment data, and repeat the above steps to perform a new alignment for another pair of loaded sequences. Click on the Clear All button to unload all the loaded sequences.

Chapter 14 Local Sequence Similarity Search

Objective: Perform a sequence similarity search against sequences stored on the hard drive of your personal computer.

Sequence similarity searches are usually done through public resources such as NCBI or your organization’s facility. NCBI provides a free-access service and does a great job. However, since the workload for NCBI is huge, it may take a long time to retrieve data.
There are many cases in which you only want a quick similarity search against a relatively small data set. The Local Sequence Similarity Search in GeneLooper does just this on your personal computer and has some unique features. It is extremely useful for searching one member of a gene family against all other members of the family saved in a common folder, or for searching for similarities among the members of a cluster saved in a folder created by the Sequence Clustering utility (see Sequence Clustering for details).

Features:
1. Quickly run a similarity search for a selected sequence against your sequence databases that are saved in Q-fasta files and/or “.seq” files on your PC. The searchable database can consist of many “.seq” files, or of “.seq” files mixed with Q-fasta files, as long as the files are all in a common folder.
2. Provide a whole sequence alignment instead of fragmented alignments for every positive sequence.
3. Adjustable search stringency.
4. Reliable: it uses your own computer so you can count on it.

Figure 1. A sample of the local sequence blast.

Procedure:
1. Start GeneLooper 3.0 and then click on the Local Sequence Blast button to open the search form (Fig. 1).
2. Loading database files: click on the Browse button in the Select database panel to open the file explorer form (Fig. 2). Use the file explorer to select the folder containing the sequence files that will be used as the search database. The files can be “.seq”, Q-fasta or a mixture of the two. Files in any other format must be removed from the folder to avoid operating errors. Click on the OK button to close the form. All file names from the folder will appear in the pull-down menu on the Select Database panel as well as in the pull-down menu in the Query Sequence box (Fig. 1).

Figure 2. The file explorer for loading database files with (.seq) and (.txt) extensions.

3. Input a query sequence in one of the four ways:
A. Select a sequence name from the pull-down menu.
This option is designed for searching for sequence similarity among sequences in the chosen folder.
B. Open a sequence from a saved sequence file by using the file open browser.
C. Import from the Sequence Viewing and Editing form.
D. Copy and paste a sequence directly from an opened sequence document.
4. Select a database name in the pull-down menu loaded in step 2. You can select any one of the files, or all files by selecting ALL.
5. Select forward, reverse or both strands for the similarity search.
6. Select the search stringency from the pull-down menu in the panel. A higher number corresponds to a higher stringency and a faster search.
7. Click on the Search button to begin. The search takes a few seconds to a few minutes, depending on the speed of your PC, the size of the data set and the search stringency.
Note: The names of matched sequences will appear in the lower text box in order from the highest score to the lowest score. The scoring system is used for comparing homologues within each search and is different from that used in Blast 2 (a third party program). The stringency setting will affect the search score.
8. Click on the Page View button to see the details of the whole sequence alignments, which are kept in a single document file following each sequence search (Fig. 3).
9. Click on the Export to Align button to export all related sequences to the alignment form for further alignment analysis (see Chapter 15 for details).
Note: The local sequence similarity search function is an alternative to the BLAST program provided by NCBI, but won’t replace the latter. The search capability in GeneLooper depends largely on the features of the PC, such as its processing speed and memory; with adequate resources it is able to search all mRNA sequences from an organism. It can be used to compare sequences from a gene family or your own collections by querying one against the rest. The process is quick, and the data are very informative.
Analyzing or searching extremely long sequences, such as genomic DNA, is not recommended with the current program.

Figure 3A. Sample of the alignment in page view (part A)
Figure 3B. Sample of the alignment in page view (part B)

Chapter 15 Primer Design

Objective: Design PCR primers from a nucleotide sequence.

Successful DNA amplification by PCR depends in part on the pair of oligos used. Optimized oligos can increase the yield of the amplified DNA and reduce the background caused by non-specific reactions. The Oligo Design function in GeneLooper 3.0 uses a reliable algorithm to select the best pair of oligos from defined regions in a DNA sequence.

Features:
1. Easy to set up: all-in-one design interface.
2. Diagram with ORF information, sliding guidelines and pull-down menus for precisely defining a region for the selection of an oligo.
3. Automatically detect internal restriction sites to guide the selection of enzyme sites for addition to the 5’ ends of the oligos.
4. One-step addition of restriction site and protection bases.
5. Return multiple pairs of oligos in a spreadsheet.

Figure 1. A sample of the Primer Design interface

Procedure:
1. Load a sequence into the sequence box using the browse button. You can also paste a sequence from the clipboard or import a sequence from the Sequence Viewing and Editing form.
2. Define the boundaries of the forward oligo and reverse oligo. The program automatically detects the ORF and displays it in the diagram. Use the two pairs of guidelines to set the boundaries, or set precise boundaries to a single base directly using the four pull-down menus.
3. Set the annealing temperature for the oligos from the Tm pull-down menu.
4. As an option, you may select enzyme sites from the enzyme selection pull-down menu (RE site) for forward or reverse oligos.
The enzymes listed in the menu have no internal sites between the 5’ end of the forward boundary and the 3’ end of the reverse boundary, and the list is dynamically updated following any change of the two guidelines. Add a few bases to the 5’ end of the restriction site to ensure complete digestion of the PCR product with the selected enzyme.
5. Select the number of oligo pairs to be designed from the Oligo Returned pull-down menu.
6. Click on the Design Oligo button to begin. It takes a few seconds for an oligo pair to be displayed in the resulting spreadsheet. The best pairs are listed at the top of the table.
7. Clicking on an oligo sequence in the table will highlight that sequence in the sequence box so that you can verify the oligo sequence and examine the adjacent bases.
8. Highlight the oligos you want to copy and then click on the Copy Oligo button. The copied oligos can be pasted into other document programs such as MS Excel.
9. Click on the Print Oligo button to print the spreadsheet containing the oligo information.

Notes:
1. Any change in the sequence will trigger an update of the ORF information and of the restriction enzymes in the pull-down menu.
2. The program uses an arbitrary scoring system to evaluate the oligos. The lower the penalty score, the better the oligo.
3. There are sixteen pre-selected, frequently used enzymes for the 5’-addition:

BamH I/GGATCC      Sac I/GAGCTC
Bgl II/AGATCT      Sac II/CCGCGG
EcoR I/GAATTC      Sal I/GTCGAC
Hind III/AAGCTT    Sca I/AGTACT
Kpn I/GGTACC       Sma I/CCCGGG
Not I/GCGGCCGC     Spe I/ACTAGT
Pst I/CTGCAG       Xba I/TCTAGA
Pvu II/CAGCTG      Xho I/CTCGAG

Chapter 16 Oligo Database Search

Objective: Search all your existing oligos based on a nucleotide sequence.

When studying multiple genes, it is not unusual to use many oligos. Managing the oligo data you have accumulated can sometimes be a daunting task.
GeneLooper 3.0 provides an easy and efficient way to organize your own oligos, or even those of your organization, into a searchable database. This function reduces the time spent looking up an existing oligo and avoids redundant oligo orders.

Procedure:
1. Create a searchable oligo database. GeneLooper 3.0 uses a simple data structure to store your oligo information. The file can hold about 60,000 oligos. It is a spreadsheet with the following three columns (Fig. 1): Column A is for the oligo names, Column B is for the oligo sequences (5’ to 3’) and Column C is for general information about the oligos, such as owners and locations.

Figure 1. Sample of oligo database file

Save the worksheet as a .csv file using the name “oligo_db.csv” in the same folder as GeneLooper 3.0.exe (Fig. 2), which is usually the “C:\program file\GeneLooper\” folder. Note that the name, type and location of the oligo file are critical for the search to work. Please follow the instructions precisely. If a file template with the same name is already in the folder, simply replace it with your oligo file.

Figure 2. Save the file as “oligo_db.csv” using the Excel Save as dialog box.

2. Oligo Search:
A. Start GeneLooper 3.0 and then click on the Oligo Database Search button. The Oligo Database Search form will appear (Fig. 3).
B. Use the Browse button to load the sequence, or paste the sequence into the sequence box from the clipboard. It can also be imported from the Sequence Viewing and Editing form.
C. The program searches for an exact match between the query sequence and the oligos, but can allow for mismatches at the 5’ end of an oligo in order to accommodate oligos with linker or adaptor sequences at the 5’ end. The default setting is for a whole sequence match. Adjust the number of bases from the pull-down menu to perform a 5’ end mismatch search.
D. Click on the Search Oligo button to start.
It takes a few seconds to finish. The matched oligo information is tabulated in a spreadsheet. For each matched oligo, a line with its location label is shown in the diagram in the picture box. Move the cursor to separate any overlapping labels.
E. Click on an oligo sequence to highlight it in the query sequence box.
F. Click on the Print form, Print Oligo, Copy Oligo or Copy picture button to print the entire form, print the spreadsheet, copy the highlighted oligos or copy the diagram in the picture box, respectively.

Figure 3. A sample of Primer Database Search interface

Chapter 17 ePCR

Objective: Check whether a PCR design is correct and generate the sequence of the PCR amplicon.

Procedure:
1. Click on the ePCR button to open the ePCR form (Figure 1). Load a sequence through File Open or by pasting. The sequence will appear in the sequence box.
2. Paste a forward primer sequence into the box next to “Paste forward Primer Seq”. A short line is drawn above the long line (DNA template) if there is a match between the forward primer and the DNA template sequence. Otherwise no line is drawn and a popup message indicates that there is no match between the two.
3. Paste a reverse primer sequence into the box next to “Paste reverse Primer Seq”. A short line is drawn below the long line (DNA template) if there is a match between the reverse primer and the template sequence. Otherwise no line is drawn and a popup message indicates that there is no match between the two.
4. If both primers match, click on the RUN PCR button. A line indicating the PCR amplicon will be drawn below.
5. Click on the Export PCR Seq button. The sequence will be exported to the Sequence view.
6. You can check the sequence and any restriction sites and adaptors you added at the 5’ ends of the PCR primers.

Figure 1. Sample of ePCR Interface

Chapter 19 Sequence Formatting

Objective: Format sequence for publication or presentation.
We often need to create a standard sequence format (Fig. 1) for publication or presentation, since it is one of the most informative ways to present an mRNA sequence. GeneLooper 3.0 provides a very flexible way to generate this format.

Features:
1. Multiple options for selecting the sequence to translate into protein.
2. Adjustable row width.
3. Single or triple letter code for amino acid residues.
4. Standard document editing environment for further editing.

Figure 1. A sample of working interface

Procedure:
1. Start GeneLooper 3.0 and then click on the Sequence Journal Format button. The formatting form will appear (Fig. 1).
2. Use the Open icon button to load the sequence, import it from the Sequence Viewing and Editing form, or paste it directly from the clipboard.
3. Select an option to translate your mRNA sequence. See Chapter 4 for details about the definitions of the different ORFs.
4. Set the formatting parameters, including font size, font name, number of amino acid residues per row, and single letter or triple letter code for representing the amino acid residues.
5. Click on the Get format button to generate the format, or click on the Undo button to start again.
6. Modify the formatted sequence using the functions provided by the document editor. The format can be saved and copied. Note that added modifications (color, underline and so on) may be lost when copying the file to the clipboard. To transfer the sequence with all modifications intact, save the file and re-open it using Microsoft Word.

Chapter 20 Document Viewer

I. Document Viewer

The Document Viewer in GeneLooper 3.0 is a text editor for viewing and editing files used for or created by GeneLooper 3.0. It is recommended that the following files be opened by the Document Viewer:
Fasta files.
Q-fasta files.
Downloaded Genbank files (Entrez Pages).
Alignment reports created by the Multi-Sequence Similarity Search utility.
Alignment reports saved by the Sequence Alignment and Local Sequence Similarity Search utilities.
Formatted sequences saved in the Sequence Journal Formatting utility.

The following operations are supported:
Editing: Cut, Copy, Paste, Print, and Save.
Find: Find, Replace, Replace all. (The Find function is case sensitive.)

Figure 1. A sample of fasta file opened by the Document Viewer

II. Data Viewer

The Data Viewer in GeneLooper 3.0 is designed for viewing the data created by the high-throughput utilities in GeneLooper 3.0. It is recommended that the following files be opened by the Data Viewer:

Utility                           Data file suffix
Open Reading Frame Detection      _Lorf.txt (complete ORF), _Porf.txt (incomplete ORF)
Sequence Clustering               _clust.txt
Multi-Sequence Blast              _mtch.txt
Restriction Site Search           res_site.txt
Translation                       _pep_info.txt
Hydrophobic Domain detection      _tmd.txt
Batch Oligo Design                _oligo.txt
Entrez Information Extraction     _entz_info.txt

The following operations are supported: Print, Copy, Sort.

Figure 1. A sample of data file opened by the Data Viewer

Note: As an alternative to the Data Viewer, these data files can be opened in MS Excel. When using MS Excel, select the delimited option and Comma as the delimiter.
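
Because the data files are comma-delimited text, they can also be read by a short script. The sketch below uses Python's standard csv module and assumes the file begins with a header row; the sample content and its column names are made up for illustration and do not reflect the actual layout of any GeneLooper data file.

```python
import csv
import io


def read_data_file(handle):
    """Parse a comma-delimited data file into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(handle) if row]


def sort_rows(rows, column, has_header=True):
    """Sort rows as text on one column index, keeping an assumed header row on top."""
    header, body = (rows[:1], rows[1:]) if has_header else ([], rows)
    return header + sorted(body, key=lambda r: r[column])


# Hypothetical content standing in for a file such as "sample_oligo.txt".
sample = "name,sequence,info\noligo_B,ACGTACGT,freezer 2\noligo_A,GGCCGGCC,freezer 1\n"
rows = read_data_file(io.StringIO(sample))
sorted_rows = sort_rows(rows, column=0)
```

To read a real file, pass an open file object instead of the in-memory sample, for example `open(path, newline="")`.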