Lecture 3.1
MySQL and JDBC
“Let's talk databases”

Sohrab Shah
UBC Bioinformatics Centre
[email protected]
http://bioinformatics.ubc.ca/people/sohrab

Objectives
1. Learn the basics of relational databases
2. Learn how to use MySQL
3. Learn how to use the Structured Query Language (SQL)
4. Learn to communicate with MySQL through the Java Database Connectivity (JDBC) protocol

Outline
• Why are databases important in bioinformatics?
• Brief background in databases
• Introduction to the Structured Query Language
• A worked example – a Sequence database in MySQL
• The JDBC protocol – using Connector/J for MySQL

What is a database?
• Collection of information
  – Spreadsheet
  – Filing cabinet
  – Oracle database
• Biology abounds with collections of data
  – ‘Tsunami’, ‘deluge’, ‘avalanche’, ‘flood’
• Databases help us efficiently organise, integrate and query data in order to make scientific inferences
http://bioteach.ubc.ca

Databases and bioinformatics
Nucleotide records                        36,653,899
Protein sequences                          4,436,362
3D structures                                 19,640
Interactions & complexes                      52,385
Human Unigene Cluster                        118,517
Maps and Complete Genomes                      6,948
Different taxonomy Nodes                     283,121
Human dbSNP                               13,179,601
Human RefSeq records                          22,079
bp in Human Contigs > 5,000 kb (116)   2,487,920,000
PubMed records                            12,570,540
OMIM records                                  15,138

Molecular biology needs databases!
• High volume + complex data structures = HELP!

NAR Database Issue - 2004
142 articles

RELATIONAL DATABASES

Relational Databases
• A brief history
  – Developed by E.F. Codd (IBM) 1969-70
    • Died 2003
  – Awarded the Turing Award for his work
  – Developed 12 rules to define a relational database; they call for a language to define, manipulate and query the data in the database
  – One rule led to the Structured Query Language (SQL), which is used in every RDBMS on the market
    • ANSI standard (92, 99)

SQL

Relational Model
• All data stored in tables
  – A table is a ‘relation’ made up of columns (fields) and rows (records)
  – The intersection of a column and a row is a typed ‘value’
    • Integer, Real, Varchar, Text, Blob, etc.
  – Operations on tables produce tables

Advantages of the relational model
• Data independence
  – Shielding the data from the application
• Efficiency
  – Storage, retrieval, integration
• Data integrity/security
  – Constraints, access controls

ACID test
• Atomicity
  – ‘All or nothing’ transaction
  – If one operation fails, all fail
• Consistency
  – Data integrity
  – Constraints
• Isolation
  – Every transaction has a consistent view of the database regardless of what other transactions are being processed
• Durability
  – Once a transaction is complete, the newly updated data will survive subsequent failures such as crashes or power loss
  – Logs

Research fuelled by corporate databases gives us great technology for biological science
• 30+ years of research into robust systems
• Industry standards for databases
• Vendors committed to high-quality products
  – Oracle, DB2, Sybase, MS SQL Server, etc.
• Emergence of the internet and database-driven web content set the stage for bioinformatics
• Data mining tools for creating statistical associations
  – Diapers and beer?
  – Teradata, a division of NCR Corporation

SQL

What drives a database?
SQL

SQL
• Structured Query Language (ANSI 92, 99)
  – Used in virtually every RDBMS product
• Has operations for:
  – Creating tables
  – Modifying tables
  – Relating tables
  – Inserting data
  – Updating data
  – Retrieving sets of data
  – Deleting sets of data
  – Deleting tables

SQL
• Not all implementations are consistent
• WARNING:
  – MySQL CREATE TABLE statements != PostgreSQL CREATE TABLE statements

Commercial RDBMS
• Oracle
  – According to Forbes, Larry Ellison is the 9th richest person in the US ($18 billion)
• DB2
  – IBM’s solution – free for academics
• Microsoft SQL Server
  – For Windows

Open Source RDBMS
• PostgreSQL
  – http://www.postgresql.org/
  – ‘The world's most advanced Open Source database software’
  – Began in 1986 at UC Berkeley
  – For many years considered the most ‘sophisticated’ open source RDBMS
  – Performance?
  – Comes with most Linux distros
  – Small but loyal user community

MySQL
• http://www.mysql.com/
• ‘The world's most popular open source database’
  – > 5,000,000 active installations
• Easy to use
• Very fast retrieval due to architecture
  – Considered by many to be a ‘toy’ database
  – For years – no row-level locking
  – Did not handle transactions well

MySQL
• Free
  – As in ‘free beer’
    • Dual license
    • Commercial: http://www.mysql.com/products/licensing/commercial-license.html
    • OpenSource: http://www.mysql.com/products/licensing/opensource-license.html
  – As in ‘free speech’
• Fast
  – Extremely fast reads for certain table types
  – Outperforms any RDBMS for reads
• Functional
  – Ease of use
  – APIs in Perl, C, C++, Java
  – Client/server architecture
  – Works well with Apache/PHP as a very popular open source dynamic web solution

MySQL versions
• 3.23.* (http://dev.mysql.com/doc/mysql/en/News-3.23.x.html)
  – Introduces row-level locking
  – Introduces full-text indexing
• 4.0.* (http://dev.mysql.com/doc/mysql/en/News-4.0.x.html)
  – Transactions, foreign keys with InnoDB
  – Improved full-text indexing
• 4.1.* (http://dev.mysql.com/doc/mysql/en/News-4.1.x.html)
  – Subqueries
• 5.0.* (http://dev.mysql.com/doc/mysql/en/News-5.0.x.html)
  – Stored procedures

MySQL – examples in bioinformatics
• Being free, fast and functional has made MySQL pervasive in bioinformatics:
• Ensembl (http://www.ensembl.org)
  – Automated eukaryotic annotation database
• Gene Ontology (http://www.geneontology.org)
  – Controlled vocabulary for genes and functions
• UCSC Genome Browser (http://genome.ucsc.edu)
  – Human and other genome browser
• BASE (http://base.thep.lu.se)
  – BioArray Software Environment – a web-based database solution for microarrays

Worked example – a relational model for sequences and features
• Create a relational model
• Tables to store:
  – Data
    • Sequence strings
  – Meta-data
    • Data about the data – features and their locations
• Insert some records
• Query the data to pull out useful subsets

Creating a Relational Database
• Start with a data set
• Divide data set into ‘records’
  – The data
• Divide records into useful fields that describe the particular record
  – The meta-data
• Create a model based on the useful fields
• Create a database from the model
• Insert the data into the database
• The data is now ‘computable’

LOCUS       YSCITRSA2    2075 bp    DNA    linear    PLN 26-APR-2004
DEFINITION  Saccharomyces cerevisiae isoleucyl tRNA synthetase (LAF1) gene,
            partial cds; and unknown gene.
ACCESSION   L32174
VERSION     L32174.1 GI:46561769
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1 (bases 1 to 2075)
  AUTHORS   Chen,E. and Bretscher,A.P.
  TITLE     The LAF1 open reading frame encodes a second isoleucyl tRNA
            synthetase in the yeast Saccharomyces cerevisiae
  JOURNAL   Unpublished
FEATURES             Location/Qualifiers
     source          1..2075
                     /organism="Saccharomyces cerevisiae"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:4932"
     gene            <1..1204
                     /gene="LAF1"
     CDS             <1..1204
                     /gene="LAF1"
                     /note="disruption results in an abnormal actin cytoskeleton; putative"
                     /codon_start=2
                     /product="isoleucyl tRNA synthetase"
                     /protein_id="AAT01099.1"
                     /db_xref="GI:46561770"
                     /translation="SLKLSKLPSPLYQVCLEGSDQHRGWFQSSLLTKVASSNVPVAPY
                     EEVITHGFTLDENGLKMSKSVGNTISPEAIIRGDENLGLPALGVVGLRYLIAHSNFTT
                     DIVAGPTVMKHVGEALKKVRTNFRYLLSNLQKSQDFNLLPIEQLRRVDQYTLYKINEL
                     LETTREHYQKYNFSKVLITLQYHLNNELSAFYFDISKDILYSNQISWSWQEGRSNNAC
                     PYTNAYRAILAPILPVMVQEVWKYIPEGWLQGQEHIDINPMRGKWPFLDSNTEIVTSF
                     ENFELKILKQFQEEFKRLSLEEGVTKTTHSHVTIFTKHHLPFSSDELCDILQSSAVDI
                     LQMDSNNNSHPTIELGRGINVQILVNVQILVERSKRHNCPRCWKANSAEEDKLCDRCK
                     EAVDHLMS"
     CDS             1452..2075
                     /note="putative"
                     /codon_start=1
                     /product="unknown"
                     /protein_id="AAT01100.1"
                     /db_xref="GI:46561771"
                     /translation="MTVMNLFFRPCQLQMGSGPLELMLKRPTQLTTFMNTRPGGSTQI
                     RFISGNLDPVKRREDRLRKIFSKSRLLTRLNKNPKFSHYFDRLSEAGTVPTLTSFFIL
                     HEVTANTTTVLLWWLLYNLDLSDDFKLPNFLNGLMDSCHTAMEKFVGKRYQECLNKNK
                     LILSGTVAYVTVKLLYPVRIFISIWGAPYFGKWLLLPFQKLKHLIKK"
ORIGIN
        1 aagcttaaag ttgtcaaaac tcccatcccc cctgtaccaa gtttgtctag aaggatctga
       61 tcaacataga ggatggtttc aaagttcact gctaacaaaa gtagcatcaa gtaatgtccc
      121 tgttgcacca tatgaagaag tgattactca tggttttacc ctagatgaga atggtctgaa
      181 aatgtcaaaa tctgtgggaa atacaatttc tcccgaagca ataattcgag gcgatgaaaa
      241 cttaggctta ccagctttgg gtgttgtagg cttgaggtat ctgatagcac attcgaattt
      301 cacaactgat atagttgctg gcccgactgt gatgaaacat gtaggagaag ctctaaaaaa
      361 ggttaggact aactttcgct atttattgag taatttacag aagtcccaag atttcaacct
      421 tttgccgatt gaacaattac gccgtgttga tcaatatacc ttgtataaga taaacgaact
      481 gctggaaacg acgagagaac actaccaaaa gtacaacttt tccaaggttc tcattactct
      541 acaatatcat ttaaataacg agctatcggc gttttatttt gatatctcaa aggatatttt
      601 atattccaac caaatatctt ggtcatggca agaaggcagg tcaaacaacg cttgtccata
      661 tactaatgca tatagggcaa ttcttgcacc aatattaccc gttatggtcc aagaagtatg
      721 gaagtatata ccagaaggat ggttacaagg acaagaacat atagacatta atccgatgcg
      781 tggaaaatgg ccgtttttgg actcaaatac ggaaatcgtc acctcctttg aaaactttg

GenBank sequence record
Slide callouts: 2075 bp; L32174; 26-APR-2004; ACCESSION; Example gene (<1..1204 /gene="LAF1"); Data

Simple example: a relational model for biological sequences and features

Remove the last page from your binder

CREATE Sequence
CREATE TABLE Sequence (
    sequence_id INT NOT NULL AUTO_INCREMENT,
    sequence LONGTEXT NOT NULL,
    defline TEXT,
    accession VARCHAR(255) NOT NULL,
    version INT DEFAULT 0,
    length INT DEFAULT 0,
    moltype INT NOT NULL,
    PRIMARY KEY (sequence_id)
);

CREATE Ontology
CREATE TABLE Ontology (
    ontology_id INT NOT NULL AUTO_INCREMENT,
    term VARCHAR(255) NOT NULL,
    description TEXT NOT NULL,
    PRIMARY KEY (ontology_id)
);

CREATE Feature
CREATE TABLE Feature (
    feature_id INT NOT NULL AUTO_INCREMENT,
    sequence_id INT NOT NULL,
    ontology_id INT NOT NULL,
    FOREIGN KEY (sequence_id) REFERENCES Sequence (sequence_id),
    FOREIGN KEY (ontology_id) REFERENCES Ontology (ontology_id),
    PRIMARY KEY (feature_id)
);

CREATE Location
CREATE TABLE Location (
    location_id INT NOT NULL AUTO_INCREMENT,
    feature_id INT NOT NULL,
    start INT NOT NULL,
    stop INT NOT NULL,
    strand INT NOT NULL,
    FOREIGN KEY (feature_id) REFERENCES Feature (feature_id),
    PRIMARY KEY (location_id)
);

CREATE Qualifier
CREATE TABLE Qualifier (
    qualifier_id INT NOT NULL AUTO_INCREMENT,
    feature_id INT NOT NULL,
    ontology_id INT NOT NULL,
    value TEXT NOT NULL,
    FOREIGN KEY (feature_id) REFERENCES Feature (feature_id),
    FOREIGN KEY (ontology_id) REFERENCES Ontology (ontology_id),
    PRIMARY KEY (qualifier_id)
);

INSERT an ontology
mysql> INSERT INTO Ontology (term, description) VALUES
    -> ('start codon', 'denotes an Methionine codon of a transcript');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM Ontology;
+-------------+-------------+---------------------------------------------+
| ontology_id | term        | description                                 |
+-------------+-------------+---------------------------------------------+
|           3 | start codon | denotes an Methionine codon of a transcript |
+-------------+-------------+---------------------------------------------+
1 row in set (0.01 sec)

INSERT some more ontologies
mysql> INSERT INTO Ontology (term, description) VALUES
    -> ('exon', 'an exon in genomic sequence');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO Ontology (term, description) VALUES
    -> ('exon type', '3\'UTR, initial, internal, terminal, 5\'UTR');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM Ontology;
+-------------+-------------+---------------------------------------------+
| ontology_id | term        | description                                 |
+-------------+-------------+---------------------------------------------+
|           3 | start codon | denotes an Methionine codon of a transcript |
|           4 | exon        | an exon in genomic sequence                 |
|           5 | exon type   | 3'UTR, initial, internal, terminal, 5'UTR   |
+-------------+-------------+---------------------------------------------+
3 rows in set (0.00 sec)

INSERT a sequence
mysql> DESC Sequence;
+-------------+--------------+------+-----+---------+----------------+
| Field       | Type         | Null | Key | Default | Extra          |
+-------------+--------------+------+-----+---------+----------------+
| sequence_id | int(11)      |      | PRI | NULL    | auto_increment |
| sequence    | longtext     |      |     |         |                |
| defline     | text         | YES  |     | NULL    |                |
| accession   | varchar(255) |      |     |         |                |
| version     | int(11)      | YES  |     | 0       |                |
| length      | int(11)      | YES  |     | 0       |                |
| moltype     | int(11)      |      |     | 0       |                |
+-------------+--------------+------+-----+---------+----------------+
7 rows in set (0.00 sec)

mysql> INSERT INTO Sequence (sequence, defline, accession, version, length, moltype)
    -> VALUES ('ATGACGATCAGCATCAGCTACAGCTG', '> seq1', 'seq1', 1, 26, 1);
Query OK, 1 row affected (0.00 sec)

INSERT a Feature on a sequence
mysql> SELECT * FROM Sequence;
+-------------+----------------------------+---------+-----------+---------+--------+---------+
| sequence_id | sequence                   | defline | accession | version | length | moltype |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
|           2 | ATGACGATCAGCATCAGCTACAGCTG | > seq1  | seq1      |       1 |     26 |       1 |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
1 row in set (0.03 sec)

mysql> SELECT * FROM Ontology;
+-------------+-------------+---------------------------------------------+
| ontology_id | term        | description                                 |
+-------------+-------------+---------------------------------------------+
|           3 | start codon | denotes an Methionine codon of a transcript |
|           4 | exon        | an exon in genomic sequence                 |
|           5 | exon type   | 3'UTR, initial, internal, terminal, 5'UTR   |
+-------------+-------------+---------------------------------------------+
3 rows in set (0.00 sec)

mysql> INSERT INTO Feature (sequence_id, ontology_id)
    -> VALUES (2, 3);
Query OK, 1 row affected (0.00 sec)

INSERT a Location
mysql> SELECT * FROM Feature;
+------------+-------------+-------------+
| feature_id | sequence_id | ontology_id |
+------------+-------------+-------------+
|          1 |           2 |           3 |
+------------+-------------+-------------+
1 row in set (0.01 sec)

mysql> DESC Location;
+-------------+---------+------+-----+---------+----------------+
| Field       | Type    | Null | Key | Default | Extra          |
+-------------+---------+------+-----+---------+----------------+
| location_id | int(11) |      | PRI | NULL    | auto_increment |
| feature_id  | int(11) |      |     | 0       |                |
| start       | int(11) |      |     | 0       |                |
| stop        | int(11) |      |     | 0       |                |
| strand      | int(11) |      |     | 0       |                |
+-------------+---------+------+-----+---------+----------------+
5 rows in set (0.00 sec)

mysql> INSERT INTO Location (feature_id, start, stop, strand)
    -> VALUES (1, 1, 3, 1);
Query OK, 1 row affected (0.02 sec)

Queries using SELECT
mysql> SELECT * FROM Sequence;
+-------------+----------------------------+---------+-----------+---------+--------+---------+
| sequence_id | sequence                   | defline | accession | version | length | moltype |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
|           2 | ATGACGATCAGCATCAGCTACAGCTG | > seq1  | seq1      |       1 |     26 |       1 |
|           3 | SLKLSKLPSPLYQVCLE          | > seq2  | L32174    |       1 |     17 |       3 |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
2 rows in set (0.00 sec)

mysql> SELECT sequence FROM Sequence WHERE accession = 'seq1';
+----------------------------+
| sequence                   |
+----------------------------+
| ATGACGATCAGCATCAGCTACAGCTG |
+----------------------------+
1 row in set (0.12 sec)

mysql> SELECT length FROM Sequence WHERE sequence_id = 3;
+--------+
| length |
+--------+
|     17 |
+--------+
1 row in set (0.03 sec)

Joining tables
mysql> SELECT * FROM Feature;
+------------+-------------+-------------+
| feature_id | sequence_id | ontology_id |
+------------+-------------+-------------+
|          1 |           2 |           3 |
|          2 |           2 |           4 |
|          3 |           2 |           4 |
+------------+-------------+-------------+
3 rows in set (0.04 sec)

Return me the descriptions of the features in the Feature table

mysql> SELECT feature_id, description
    -> FROM Feature, Ontology
    -> WHERE Feature.ontology_id = Ontology.ontology_id;
+------------+---------------------------------------------+
| feature_id | description                                 |
+------------+---------------------------------------------+
|          1 | denotes an Methionine codon of a transcript |
|          2 | an exon in genomic sequence                 |
|          3 | an exon in genomic sequence                 |
+------------+---------------------------------------------+
3 rows in set (0.04 sec)

Setting up a complex query
• Consider sequence ‘seq1’ with the following features:
  – Initial exon from 1..6
  – Internal exon from 15..20
• Note that with the relational model the term ‘exon’ only appears once in the database
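
A minimal Java sketch of what a query over these exon features is meant to return, assuming the example rows for ‘seq1’ are simply held in memory. The class and method names are hypothetical and not part of the lecture. Feature coordinates are 1-based and inclusive, so a feature from start to stop has length stop - start + 1. In MySQL the third argument of SUBSTRING is a length, so the matching expression would be SUBSTRING(sequence, start, stop - start + 1).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the Ontology, Feature and Location tables.
class ExonSketch {

    static List<String> exonSubsequences() {
        String sequence = "ATGACGATCAGCATCAGCTACAGCTG"; // seq1, sequence_id = 2

        // Ontology: ontology_id -> term (the term 'exon' is stored exactly once)
        Map<Integer, String> ontology = new HashMap<>();
        ontology.put(3, "start codon");
        ontology.put(4, "exon");
        ontology.put(5, "exon type");

        // Feature rows: {feature_id, ontology_id}
        int[][] features = { {1, 3}, {2, 4}, {3, 4} };
        // Location rows: {feature_id, start, stop}
        int[][] locations = { {1, 1, 3}, {2, 1, 6}, {3, 15, 20} };

        List<String> result = new ArrayList<>();
        for (int[] feature : features) {
            // Join Feature to Ontology on ontology_id and keep only 'exon' features
            if (!"exon".equals(ontology.get(feature[1]))) {
                continue;
            }
            for (int[] location : locations) {
                // Join Feature to Location on feature_id
                if (location[0] != feature[0]) {
                    continue;
                }
                int start = location[1];
                int stop = location[2];
                // 1-based inclusive [start, stop] maps to Java's 0-based, end-exclusive substring
                String sub = sequence.substring(start - 1, stop);
                result.add(sub + " " + start + ".." + stop);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String row : exonSubsequences()) {
            System.out.println(row);
        }
    }
}
```

Running main prints ATGACG 1..6 and CAGCTA 15..20. Changing the description of ‘exon’ would touch one Ontology entry only, which is the point of storing the term once.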

Complex query
mysql> SELECT * FROM Sequence WHERE sequence_id = 2;
+-------------+----------------------------+---------+-----------+---------+--------+---------+
| sequence_id | sequence                   | defline | accession | version | length | moltype |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
|           2 | ATGACGATCAGCATCAGCTACAGCTG | > seq1  | seq1      |       1 |     26 |       1 |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
1 row in set (0.04 sec)

mysql> SELECT * FROM Feature WHERE sequence_id = 2;
+------------+-------------+-------------+
| feature_id | sequence_id | ontology_id |
+------------+-------------+-------------+
|          1 |           2 |           3 |
|          2 |           2 |           4 |
|          3 |           2 |           4 |
+------------+-------------+-------------+
3 rows in set (0.03 sec)

mysql> SELECT * FROM Location;
+-------------+------------+-------+------+--------+
| location_id | feature_id | start | stop | strand |
+-------------+------------+-------+------+--------+
|           1 |          1 |     1 |    3 |      1 |
|           2 |          2 |     1 |    6 |      1 |
|           3 |          3 |    15 |   20 |      1 |
+-------------+------------+-------+------+--------+
3 rows in set (0.20 sec)

The relational model stores data efficiently and optimises the modifiability of the data.
What if ‘exon’ changes to something else?

mysql> SELECT * FROM Ontology;
+-------------+-------------+---------------------------------------------+
| ontology_id | term        | description                                 |
+-------------+-------------+---------------------------------------------+
|           3 | start codon | denotes an Methionine codon of a transcript |
|           4 | exon        | an exon in genomic sequence                 |
|           5 | exon type   | 3'UTR, initial, internal, terminal, 5'UTR   |
+-------------+-------------+---------------------------------------------+
3 rows in set (0.00 sec)

mysql> SELECT * FROM Qualifier;
+--------------+------------+-------------+----------+
| qualifier_id | feature_id | ontology_id | value    |
+--------------+------------+-------------+----------+
|            1 |          2 |           5 | initial  |
|            2 |          3 |           5 | internal |
+--------------+------------+-------------+----------+

Complex query
Return me the sub-sequences and coordinates of the ‘exon’ features of ‘seq1’

mysql> SELECT SUBSTRING(sequence, start, stop - start + 1), start, stop, term
    -> FROM Sequence, Ontology, Feature, Location
    -> WHERE accession = 'seq1' AND term = 'exon' AND
    -> Feature.ontology_id = Ontology.ontology_id AND
    -> Feature.sequence_id = Sequence.sequence_id AND
    -> Feature.feature_id = Location.feature_id;
+----------------------------------------------+-------+------+------+
| SUBSTRING(sequence, start, stop - start + 1) | start | stop | term |
+----------------------------------------------+-------+------+------+
| ATGACG                                       |     1 |    6 | exon |
| CAGCTA                                       |    15 |   20 | exon |
+----------------------------------------------+-------+------+------+
2 rows in set (0.04 sec)

Even more complex…
Return me the sub-sequences, coordinates, feature name and qualifier value of the ‘internal exon’ features of ‘seq1’

mysql> SELECT SUBSTRING(sequence, start, stop - start + 1), start, stop, o1.term, value
    -> FROM Sequence, Feature, Ontology o1, Ontology o2, Location, Qualifier
    -> WHERE accession = 'seq1' AND
    -> o1.term = 'exon' AND
    -> o2.term = 'exon type' AND
    -> value = 'internal' AND
    -> Feature.ontology_id = o1.ontology_id AND
    -> Qualifier.ontology_id = o2.ontology_id AND
    -> Qualifier.feature_id = Feature.feature_id AND
    -> Feature.sequence_id = Sequence.sequence_id AND
    -> Location.feature_id = Feature.feature_id;
+----------------------------------------------+-------+------+------+----------+
| SUBSTRING(sequence, start, stop - start + 1) | start | stop | term | value    |
+----------------------------------------------+-------+------+------+----------+
| CAGCTA                                       |    15 |   20 | exon | internal |
+----------------------------------------------+-------+------+------+----------+
1 row in set (0.05 sec)

Aggregate queries
mysql> SELECT * FROM Sequence;
+-------------+----------------------------+---------+-----------+---------+--------+---------+
| sequence_id | sequence                   | defline | accession | version | length | moltype |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
|           2 | ATGACGATCAGCATCAGCTACAGCTG | > seq1  | seq1      |       1 |     26 |       1 |
|           3 | SLKLSKLPSPLYQVCLE          | > seq2  | L32174    |       1 |     17 |       3 |
|           4 | MASQQQCGAR                 | > seq   | seq3      |       1 |     10 |       3 |
+-------------+----------------------------+---------+-----------+---------+--------+---------+

mysql> SELECT count(*), moltype FROM Sequence GROUP BY moltype;
+----------+---------+
| count(*) | moltype |
+----------+---------+
|        1 |       1 |
|        2 |       3 |
+----------+---------+
2 rows in set (0.08 sec)

Using LIMIT
mysql> SELECT * FROM Sequence LIMIT 2;
+-------------+----------------------------+---------+-----------+---------+--------+---------+
| sequence_id | sequence                   | defline | accession | version | length | moltype |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
|           2 | ATGACGATCAGCATCAGCTACAGCTG | > seq1  | seq1      |       1 |     26 |       1 |
|           3 | SLKLSKLPSPLYQVCLE          | > seq2  | L32174    |       1 |     17 |       3 |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
2 rows in set (0.08 sec)

UPDATING a table
mysql> SELECT * FROM Qualifier;
+--------------+------------+-------------+----------+
| qualifier_id | feature_id | ontology_id | value    |
+--------------+------------+-------------+----------+
|            1 |          2 |           5 | initial  |
|            2 |          3 |           5 | internal |
+--------------+------------+-------------+----------+
2 rows in set (0.00 sec)

mysql> UPDATE Qualifier SET value = 'terminal'
    -> WHERE qualifier_id = 2;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

DELETING from a table
mysql> DELETE FROM Qualifier
    -> WHERE qualifier_id = 2;
Query OK, 1 row affected (0.04 sec)

mysql> SELECT * FROM Qualifier;
+--------------+------------+-------------+---------+
| qualifier_id | feature_id | ontology_id | value   |
+--------------+------------+-------------+---------+
|            1 |          2 |           5 | initial |
+--------------+------------+-------------+---------+
1 row in set (0.03 sec)

Optimisation
• Perking up MySQL
  – Queries
  – Database server

Indexing
• In general, indexing your data makes retrieval orders of magnitude faster
• Consider a list of 1,000,000 sequences with accession numbers
• You need to find the one sequence with accession number ‘AC123456’
• If the accession field is not indexed, the lookup takes on the order of 1,000,000 operations, i.e. O(n)
  – Equivalent to scanning through a list
• If the accession field is indexed, the lookup takes on the order of log(1,000,000) operations, i.e. O(log n): about 6 steps counting in base 10, about 20 in base 2
  – Somewhat like a hashtable lookup, although the default MySQL index for MyISAM and InnoDB tables is a B-tree, which is where the logarithm comes from

Types of indexes
• PRIMARY KEY
  – To identify the main accessor field of the table
• UNIQUE
  – Constraint to ensure that all entries in a field are different
• INDEX
  – Creates a way to quickly search on a given field
• FULLTEXT
  – For large TEXT fields > 255 characters
• Compound indexes
  – (column1, column2, …)
• NOTE – INDEX is synonymous with KEY

Drawbacks to indexing
• Need more disk space
• Can slow down inserts
• Know your data and the queries you will perform on the data
  – Only index fields you think you will query on
  – Requires spending time in the design phase to define the ‘requirements’ of the database
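
To make the gap between an unindexed and an indexed lookup concrete, here is a small Java sketch that counts comparisons for both over 1,000,000 sorted accession numbers. It uses a binary search as a stand-in for an index, which is an assumption for illustration only and not how MySQL stores its indexes. The accession range AC000000..AC999999 and all class and method names are made up.

```java
// Hypothetical comparison of a full scan with a sorted (index-like) lookup.
class IndexSketch {

    // Build n zero-padded accessions; zero padding keeps them in sorted order.
    static String[] makeAccessions(int n) {
        String[] accessions = new String[n];
        for (int i = 0; i < n; i++) {
            accessions[i] = String.format("AC%06d", i);
        }
        return accessions;
    }

    // Unindexed lookup: check entries one by one. Returns {position, comparisons}.
    static int[] linearScan(String[] accessions, String target) {
        int comparisons = 0;
        for (int i = 0; i < accessions.length; i++) {
            comparisons++;
            if (accessions[i].equals(target)) {
                return new int[] {i, comparisons};
            }
        }
        return new int[] {-1, comparisons};
    }

    // Index-like lookup: halve the sorted range each step. Returns {position, comparisons}.
    static int[] binarySearch(String[] sorted, String target) {
        int lo = 0;
        int hi = sorted.length - 1;
        int comparisons = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            comparisons++;
            int c = sorted[mid].compareTo(target);
            if (c == 0) {
                return new int[] {mid, comparisons};
            }
            if (c < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return new int[] {-1, comparisons};
    }

    public static void main(String[] args) {
        String[] accessions = makeAccessions(1000000);
        int[] scan = linearScan(accessions, "AC123456");
        int[] indexed = binarySearch(accessions, "AC123456");
        System.out.println("scan comparisons:    " + scan[1]);
        System.out.println("indexed comparisons: " + indexed[1]);
    }
}
```

The scan needs one comparison per entry up to the match, while the sorted lookup never needs more than 20 comparisons for 1,000,000 entries. The price is the one named under the drawbacks: the sorted structure has to be stored and kept up to date on every insert.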

Creating an index
mysql> CREATE INDEX acindex ON Sequence (accession);
Query OK, 1 row affected (0.18 sec)
Records: 1 Duplicates: 0 Warnings: 0

Tuning the database (DBA)
• > mysqladmin variables
• > mysqld --help

Variables (--variable-name=value)
and boolean options {FALSE|TRUE}  Value (after reading options)
--------------------------------- -----------------------------
basedir                           /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/
bdb-home                          (No default value)
bdb-logdir                        (No default value)
bdb-tmpdir                        (No default value)
bind-address                      (No default value)
console                           FALSE
chroot                            (No default value)
character-sets-dir                /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/share/mysql/charsets/
datadir                           /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/data/
default-character-set             latin1
enable-locking                    FALSE
enable-pstack                     FALSE
gdb                               FALSE
innodb_data_home_dir              (No default value)
innodb_log_group_home_dir         (No default value)
innodb_log_arch_dir               (No default value)
innodb_flush_log_at_trx_commit    1
innodb_flush_method               (No default value)
innodb_fast_shutdown              TRUE
innodb_max_dirty_pages_pct        90
init-file                         (No default value)
log                               (No default value)
language                          /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/share/mysql/english/
local-infile                      TRUE
log-bin                           (No default value)
log-bin-index                     (No default value)
log-isam                          myisam.log
log-update                        (No default value)
log-slow-queries                  (No default value)
log-slave-updates                 FALSE
low-priority-updates              FALSE
master-host                       (No default value)
master-user                       test
master-port                       3306
master-connect-retry              60
master-retry-count                86400
master-info-file                  master.info
master-ssl                        FALSE
master-ssl-key                    (No default value)
master-ssl-cert                   (No default value)
master-ssl-capath                 (No default value)
master-ssl-cipher                 (No default value)
myisam-recover                    OFF
memlock                           FALSE
disconnect-slave-event-count      0
abort-slave-event-count           0
max-binlog-dump-events            0
sporadic-binlog-dump-fail         FALSE
new                               FALSE
old-protocol                      10
old-rpl-compat                    FALSE
pid-file                          /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/data/watson.pid
log-error
port                              3306
report-host                       (No default value)
report-user                       (No default value)
report-password                   (No default value)
report-port                       3306
rpl-recovery-rank                 0
relay-log                         (No default value)
relay-log-index                   (No default value)
safe-user-create                  FALSE
server-id                         1
show-slave-auth-info              FALSE
concurrent-insert                 TRUE
skip-grant-tables                 FALSE
skip-slave-start                  FALSE
relay-log-info-file               relay-log.info
slave-load-tmpdir                 /raid/tmp/
socket                            /tmp/mysql.sock
sql-bin-update-same               FALSE
sql-mode                          OFF
temp-pool                         TRUE
tmpdir                            /raid/tmp
external-locking                  FALSE
use-symbolic-links                TRUE
symbolic-links                    TRUE
log-warnings                      FALSE
warnings                          FALSE
back_log                          50
bdb_cache_size                    8388600
bdb_log_buffer_size               0
bdb_max_lock                      10000
bdb_lock_max                      10000
binlog_cache_size                 32768
connect_timeout                   5
delayed_insert_timeout            300
delayed_insert_limit              100
delayed_queue_size                1000
flush_time                        0
ft_min_word_len                   4
ft_max_word_len                   254
ft_max_word_len_for_sort          20
ft_stopword_file                  (No default value)
innodb_mirrored_log_groups        1
innodb_log_files_in_group         2
innodb_log_file_size              5242880
innodb_log_buffer_size            1048576
innodb_buffer_pool_size           8388608
innodb_additional_mem_pool_size   1048576
innodb_file_io_threads            4
innodb_lock_wait_timeout          50
innodb_thread_concurrency         8
innodb_force_recovery             0
interactive_timeout               28800
join_buffer_size                  131072
key_buffer_size                   402653184
long_query_time                   10
lower_case_table_names            FALSE
max_allowed_packet                1047552
max_binlog_cache_size             4294967295
max_binlog_size                   1073741824
max_connections                   100
max_connect_errors                10
max_delayed_threads               20
max_heap_table_size               16777216
max_join_size                     18446744073709551615
max_relay_log_size                0
max_seeks_for_key                 4294967295
max_sort_length                   1024
max_tmp_tables                    32
max_user_connections              0
max_write_lock_count              4294967295
bulk_insert_buffer_size           8388608
myisam_block_size                 1024
myisam_max_extra_sort_file_size   268435456
myisam_max_sort_file_size         2147483647
myisam_repair_threads             1
myisam_sort_buffer_size           67108864
net_buffer_length                 16384
net_retry_count                   10
net_read_timeout                  30
net_write_timeout                 60
open_files_limit                  0
query_cache_limit                 1048576
query_cache_size                  33554432
query_cache_type                  1
read_buffer_size                  2093056
read_rnd_buffer_size              262144
record_buffer                     2093056
relay_log_space_limit             0
slave_compressed_protocol         FALSE
slave_net_timeout                 3600
read-only                         FALSE
slow_launch_time                  2
sort_buffer_size                  2097144
table_cache                       512
thread_concurrency                8
thread_cache_size                 8
tmp_table_size                    33554432
thread_stack                      196608
wait_timeout                      28800
default-week-format               0

To see what values a running MySQL server is using, type
'mysqladmin variables' instead of 'mysqld --help'.

Tuning the system to your needs
• Need to think about uses of the database
  – How many concurrent connections?
  – Will there be large records?
  – Will there be repetitive queries?
  – Will I need large indexes?
• Tuning the system can give huge gains in performance – lets you get the most out of the system

Important parameters
• max_allowed_packet
  – Largest amount of data to be transmitted to the client in 1 packet
• max_connections
  – The largest number of concurrent connections to the database server
• datadir
  – The location of the data files on the system
• query_cache_size
  – Size of cache for repetitive queries
• Many, many others…

COMMUNICATING WITH MySQL

Communicating with MySQL
• Through a GUI
  – MySQL ControlCentre
    • http://www.mysql.com/products/mysqlcc/
    • Standalone application supported by MySQL
• Through the web
  – phpMyAdmin
    • http://www.phpmyadmin.net/home_page/
    • Works with Apache web server
• Through the Unix command line
  – MySQL client
  – Comes with MySQL
• Through APIs (Application Programming Interfaces)
  – MySQL C API
  – Perl DBI
  – MySQL++ (C++)
    • http://dev.mysql.com/downloads/other/plusplus/
  – JDBC (Java Database Connectivity)
    • Java protocol and API for RDBMS communication

Communicating with MySQL
• Choose the method that is ‘right’ for the job
• Administration
  – MySQL CC
  – phpMyAdmin
• Standalone Application
  – APIs
• Web Application
  – PHP/Java servlets
• ‘Low-throughput’ queries
  – Command line client

Working with JDBC
• JDBC is a standard API that provides database-independent connectivity to allow a Java application to interact with a database
http://java.sun.com/products/jdbc/overview.html

Connector/J
• JDBC implementation for MySQL is Connector/J
  – http://www.mysql.com/products/connector/j/
• Installation:
  – $ export CLASSPATH=/path/to/mysql-connector-java-[version]-bin.jar:$CLASSPATH

Connector/J Steps
1. Establish a connection
2. Prepare one or more queries
3. Execute one or more queries
4. Process the results (if applicable)
5. Destroy connection

Connector/J – Establishing a connection
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// we need the following 6 variables to make a JDBC connection
String DBSERVERNAME = "mysql";
String JDBCDRIVERNAME = "com.mysql.jdbc.Driver";
String host = "my.database.com";
String databaseName = "sequence";
String user = "me";
String password = "mypwd";

// load the driver into memory
Class.forName(JDBCDRIVERNAME).newInstance();

// create the connection URL
String connectionURL = "jdbc:" + DBSERVERNAME + "://" + host + "/" + databaseName + "?"
        + "user=" + user + "&password=" + password;

// get the connection from the driver manager
Connection connection = DriverManager.getConnection(connectionURL);

Preparing and executing a query
// object required to execute the query
Statement statement = null;

// object to store results of the query
ResultSet resultSet = null;

// create the query string
String query = "SELECT sequence_id FROM Sequence";

// initialise the statement
statement = connection.createStatement();

// execute the query; the results are returned in resultSet
resultSet = statement.executeQuery(query);

Process the results & close connection
// iterate through the 'rows' returned by the query
while (resultSet.next()) {
    // the column name must match the one selected in the query
    int sequenceId = resultSet.getInt("sequence_id");
    // do something with the sequenceId
}

// destroy the connection if it is no longer needed
connection.close();

Topics not covered…
• MySQL tools
  – mysqldump
    • Tool to dump a schema, the data, or both
  – mysqlimport
    • Tool to import delimited files
    • Look before you parse!
  – mysqladmin
    • For DBAs to create databases, change passwords, etc.
  – Read the MySQL documentation

Topics not covered…
• Setting connection parameters in JDBC
  – Consult Connector/J docs
• Database design
  – Extremely important process
  – Many courses at univ/college

Summary
• Relational databases are necessary in bioinformatics
• Relational databases allow us to efficiently store and query large amounts of data
• MySQL is a good choice of RDBMS engine because it is highly functional at no cost
• JDBC provides a way to access MySQL from within a Java program

Resources
• MySQL
  – http://www.mysql.com
  – http://dev.mysql.com.mysql/en/index.html
  – http://www.mysql.com/products/mysqlcc/
  – http://dev.mysql.com/doc/connector/j/en
• NAR Database Issue 2004
  – http://nar.oupjournals.org/content/vol32/suppl_1
• JDBC
  – http://java.sun.com/products/jdbc/
• Me
  – [email protected]
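
Appendix sketch: the five Connector/J steps (connect, prepare, execute, process, close) put together in one class. This is a minimal sketch under stated assumptions, not the lecture's own code: the class and method names are hypothetical, the host, database, user and password are the placeholder values from the slides, and it swaps the slides' Statement for a PreparedStatement with try-with-resources so that the accession is passed as a parameter and everything is closed even when a query fails.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; assumes the Sequence table from the worked example.
class SequenceLookup {

    // Build the JDBC URL; user and password are passed separately to getConnection.
    static String buildUrl(String host, String databaseName) {
        return "jdbc:mysql://" + host + "/" + databaseName;
    }

    // Steps 2 to 4: prepare, execute and process a query for one accession.
    static List<Integer> findSequenceIds(Connection connection, String accession)
            throws SQLException {
        String query = "SELECT sequence_id FROM Sequence WHERE accession = ?";
        List<Integer> ids = new ArrayList<>();
        try (PreparedStatement statement = connection.prepareStatement(query)) {
            statement.setString(1, accession);
            try (ResultSet resultSet = statement.executeQuery()) {
                while (resultSet.next()) {
                    ids.add(resultSet.getInt("sequence_id"));
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String url = buildUrl("my.database.com", "sequence");
        // Steps 1 and 5: the connection is opened here and closed automatically.
        try (Connection connection = DriverManager.getConnection(url, "me", "mypwd")) {
            System.out.println(findSequenceIds(connection, "seq1"));
        } catch (SQLException e) {
            // Also reached when Connector/J is not on the CLASSPATH or the server is unreachable
            System.err.println("Could not query " + url + ": " + e.getMessage());
        }
    }
}
```

With Connector/J on the CLASSPATH and a reachable server holding the example data, main would print the sequence_id values for accession ‘seq1’; without them it reports the SQLException instead of crashing.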