PPT - Bioinformatics.ca

Lecture 3.1 MySQL and JDBC
“Let’s talk databases”
Sohrab Shah
UBC Bioinformatics Centre
[email protected]
http://bioinformatics.ubc.ca/people/sohrab
Lecture 3.1
1
Objectives
1. Learn the basics of relational databases
2. Learn how to use MySQL
3. Learn how to use the Structured Query
Language (SQL)
4. Learn to communicate with MySQL through
the Java Database Connectivity (JDBC)
protocol
Outline
• Why are databases important in bioinformatics?
• Brief background in databases
• Introduction to the Structured Query Language
• A worked example – a Sequence database in MySQL
• The JDBC protocol – using Connector/J for MySQL
What is a database?
• Collection of information
– Spreadsheet
– Filing cabinet
– Oracle database
• Biology abounds with
collections of data
– ‘Tsunami’, ‘deluge’, ‘avalanche’,
‘flood’
• Databases help us
efficiently organise,
integrate and query data
in order to make scientific
inferences
http://bioteach.ubc.ca
Databases and bioinformatics
Nucleotide records                           36,653,899
Protein sequences                             4,436,362
3D structures                                    19,640
Interactions & complexes                         52,385
Human Unigene Cluster                           118,517
Maps and Complete Genomes                         6,948
Different taxonomy Nodes                        283,121
Human dbSNP                                  13,179,601
Human RefSeq records                             22,079
bp in Human Contigs > 5,000 kb (116)      2,487,920,000
PubMed records                               12,570,540
OMIM records                                     15,138
Molecular biology needs databases!
• High volume + complex data structures =
HELP!
NAR Database Issue - 2004
142 articles
RELATIONAL DATABASES
Relational Databases
• A brief history
– Developed by E.F. Codd (IBM) 1969-70
• Died 2003
– Received the Turing Award for his work
– Developed 12 rules to define a relational database, which call for a
language to define, manipulate and query the
data in the database
– One of these rules led to the Structured Query Language
(SQL), which is used in virtually every RDBMS on
the market
• ANSI standard (92,99)
SQL
Relational Model
• All data stored in tables
– Table is a ‘relation’ made up of columns (fields) and
rows (records)
– Intersection of a column and a row is a typed ‘value’
• Integer, Real, Varchar, Text, Blob, etc…
– Operations on tables produce tables
Advantages of the relational model
• Data independence
– Shielding the data from the application
• Efficiency
– Storage, retrieval, integration
• Data integrity/security
– Constraints, access controls
ACID test
• Atomicity
– ‘all or nothing’ transaction
– If one operation fails, all fail
• Consistency
– data integrity
– constraints
• Isolation
– Every transaction has a consistent view of the database
regardless of what other transactions are being processed
• Durability
– Once a transaction is complete, the newly updated data will
survive failures of any kind
– logs
Research fuelled by corporate
databases gives us great technology
for biological science
• 30+ years of research into robust systems
• Industry standards for databases
• Vendors committed to high-quality products
– Oracle, DB2, Sybase, MS SQLserver, etc…
• Emergence of the internet and database-driven web content set the stage for bioinformatics
• Data mining tools for creating statistical associations
– Diapers and beer?
– Teradata, a division of NCR Corporation
SQL
What drives a database?
SQL
SQL
• Structured Query Language (ANSI 92,99)
– Used in virtually every RDBMS product
• Has operations for:
– Creating tables
– Modifying tables
– Relating tables
– Inserting data
– Updating data
– Retrieving sets of data
– Deleting sets of data
– Deleting tables
SQL
• Not all implementations consistent
• WARNING:
– MySQL CREATE TABLE statements != PostgreSQL CREATE TABLE statements
Commercial RDBMS
• Oracle
– According to Forbes, Larry Ellison is the 9th
richest person in the US ($18 billion)
• DB2
– IBM’s solution – free for academics
• Microsoft SQL server
– For Windows
Open Source RDBMS
• PostgreSQL
– http://www.postgresql.org/
– ‘the world's most advanced Open
Source database software’
– Began in 1986 at UC Berkeley
– For many years considered the
most ‘sophisticated’ OS RDBMS
– Performance?
– Comes with most Linux distros
– Small but loyal user community
MySQL
• http://www.mysql.com/
• ‘The world's most popular open
source database ‘
– > 5,000,000 active installations
• Easy to use
• Very fast retrieval due to
architecture
– Considered by many to be a ‘toy’
database
– For years – no row-level locking
– Did not handle transactions well
MySQL
• Free
– As in ‘free beer’
• Dual license
• Commercial: http://www.mysql.com/products/licensing/commercial-license.html
• OpenSource: http://www.mysql.com/products/licensing/opensource-license.html
– As in ‘free speech’
• Fast
– Extremely fast reads for certain table types
– Often outperforms other RDBMSs for reads
• Functional
– Ease of use
– APIs in Perl, C, C++, Java
– Client/server architecture
– Works well with Apache/PHP as a very popular open-source dynamic web solution
MySQL versions
• 3.23.* (http://dev.mysql.com/doc/mysql/en/News-3.23.x.html)
– Introduces row-level locking
– Introduces full-text indexing
• 4.0.* (http://dev.mysql.com/doc/mysql/en/News-4.0.x.html)
– Transactions, foreign keys with InnoDB
– Improved Full-text indexing
• 4.1.* (http://dev.mysql.com/doc/mysql/en/News-4.1.x.html)
– Subqueries
• 5.0.* (http://dev.mysql.com/doc/mysql/en/News-5.0.x.html)
– Stored procedures
MySQL – examples in bioinformatics
• Free, fast and functional have made MySQL
pervasive in bioinformatics:
• Ensembl (http://www.ensembl.org)
– Automated eukaryotic annotation database
• Gene Ontology (http://www.geneontology.org)
– Controlled vocabulary for genes and functions
• UCSC Genome Browser (http://genome.ucsc.edu)
– Human and other genome browser
• BASE (http://base.thep.lu.se)
– BioArray Software Environment – a web-based
database solution for microarrays
Worked example – a relational model
for sequences and features
• Create a relational model
• Tables to store:
– data
• Sequence strings
– Meta-data
• Data about the data – features and their locations
• Insert some records
• Query the data to pull out useful subsets
Creating a Relational Database
• Start with a data set
• Divide data set into ‘records’
– The data
• Divide records into useful fields that describe the
particular record
– The meta-data
• Create a model based on the useful fields
• Create a database from the model
• Insert the data into the database
• The data is now ‘computable’
Genbank sequence record
LOCUS       YSCITRSA2               2075 bp    DNA     linear   PLN 26-APR-2004
DEFINITION  Saccharomyces cerevisiae isoleucyl tRNA synthetase (LAF1) gene,
            partial cds; and unknown gene.
ACCESSION   L32174
VERSION     L32174.1  GI:46561769
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 2075)
  AUTHORS   Chen,E. and Bretscher,A.P.
  TITLE     The LAF1 open reading frame encodes a second isoleucyl tRNA
            synthetase in the yeast Saccharomyces cerevisiae
  JOURNAL   Unpublished
FEATURES             Location/Qualifiers
     source          1..2075
                     /organism="Saccharomyces cerevisiae"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:4932"
     gene            <1..1204
                     /gene="LAF1"
     CDS             <1..1204
                     /gene="LAF1"
                     /note="disruption results in an abnormal actin
                     cytoskeleton; putative"
                     /codon_start=2
                     /product="isoleucyl tRNA synthetase"
                     /protein_id="AAT01099.1"
                     /db_xref="GI:46561770"
                     /translation="SLKLSKLPSPLYQVCLEGSDQHRGWFQSSLLTKVASSNVPVAPY
                     EEVITHGFTLDENGLKMSKSVGNTISPEAIIRGDENLGLPALGVVGLRYLIAHSNFTT
                     DIVAGPTVMKHVGEALKKVRTNFRYLLSNLQKSQDFNLLPIEQLRRVDQYTLYKINEL
                     LETTREHYQKYNFSKVLITLQYHLNNELSAFYFDISKDILYSNQISWSWQEGRSNNAC
                     PYTNAYRAILAPILPVMVQEVWKYIPEGWLQGQEHIDINPMRGKWPFLDSNTEIVTSF
                     ENFELKILKQFQEEFKRLSLEEGVTKTTHSHVTIFTKHHLPFSSDELCDILQSSAVDI
                     LQMDSNNNSHPTIELGRGINVQILVNVQILVERSKRHNCPRCWKANSAEEDKLCDRCK
                     EAVDHLMS"
     CDS             1452..2075
                     /note="putative"
                     /codon_start=1
                     /product="unknown"
                     /protein_id="AAT01100.1"
                     /db_xref="GI:46561771"
                     /translation="MTVMNLFFRPCQLQMGSGPLELMLKRPTQLTTFMNTRPGGSTQI
                     RFISGNLDPVKRREDRLRKIFSKSRLLTRLNKNPKFSHYFDRLSEAGTVPTLTSFFIL
                     HEVTANTTTVLLWWLLYNLDLSDDFKLPNFLNGLMDSCHTAMEKFVGKRYQECLNKNK
                     LILSGTVAYVTVKLLYPVRIFISIWGAPYFGKWLLLPFQKLKHLIKK"
ORIGIN
        1 aagcttaaag ttgtcaaaac tcccatcccc cctgtaccaa gtttgtctag aaggatctga
       61 tcaacataga ggatggtttc aaagttcact gctaacaaaa gtagcatcaa gtaatgtccc
      121 tgttgcacca tatgaagaag tgattactca tggttttacc ctagatgaga atggtctgaa
      181 aatgtcaaaa tctgtgggaa atacaatttc tcccgaagca ataattcgag gcgatgaaaa
      241 cttaggctta ccagctttgg gtgttgtagg cttgaggtat ctgatagcac attcgaattt
      301 cacaactgat atagttgctg gcccgactgt gatgaaacat gtaggagaag ctctaaaaaa
      361 ggttaggact aactttcgct atttattgag taatttacag aagtcccaag atttcaacct
      421 tttgccgatt gaacaattac gccgtgttga tcaatatacc ttgtataaga taaacgaact
      481 gctggaaacg acgagagaac actaccaaaa gtacaacttt tccaaggttc tcattactct
      541 acaatatcat ttaaataacg agctatcggc gttttatttt gatatctcaa aggatatttt
      601 atattccaac caaatatctt ggtcatggca agaaggcagg tcaaacaacg cttgtccata
      661 tactaatgca tatagggcaa ttcttgcacc aatattaccc gttatggtcc aagaagtatg
      721 gaagtatata ccagaaggat ggttacaagg acaagaacat atagacatta atccgatgcg
      781 tggaaaatgg ccgtttttgg actcaaatac ggaaatcgtc acctcctttg aaaactttg
[Slide callouts: example data highlighted in the record: 2075 bp; ACCESSION L32174; 26-APR-2004; gene <1..1204 /gene="LAF1"]
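To make "divide records into useful fields" concrete, here is a minimal Java sketch that turns a simple GenBank feature location string such as `<1..1204` into the start/stop values a Location row would hold. The class and field names (`LocationParseDemo`, `Loc`, `startPartial`, `stopPartial`) are hypothetical, and the sketch assumes only plain `start..stop` ranges with an optional `<` or `>` marker; compound locations such as `join(...)` or `complement(...)` are not handled.

```java
public class LocationParseDemo {
    /** Start/stop of one feature location (hypothetical holder, not a library type). */
    public static final class Loc {
        public final int start;
        public final int stop;
        public final boolean startPartial; // '<' prefix: feature starts before the first base shown
        public final boolean stopPartial;  // '>' prefix: feature ends after the last base shown

        Loc(int start, int stop, boolean startPartial, boolean stopPartial) {
            this.start = start;
            this.stop = stop;
            this.startPartial = startPartial;
            this.stopPartial = stopPartial;
        }
    }

    /** Parses a simple "start..stop" location, e.g. "<1..1204" or "1452..2075". */
    public static Loc parse(String location) {
        String[] ends = location.trim().split("\\.\\.");
        if (ends.length != 2) {
            throw new IllegalArgumentException("expected start..stop but got: " + location);
        }
        boolean startPartial = ends[0].startsWith("<");
        boolean stopPartial = ends[1].startsWith(">");
        int start = Integer.parseInt(startPartial ? ends[0].substring(1) : ends[0]);
        int stop = Integer.parseInt(stopPartial ? ends[1].substring(1) : ends[1]);
        return new Loc(start, stop, startPartial, stopPartial);
    }

    public static void main(String[] args) {
        // The two CDS locations from the record above.
        for (String s : new String[] {"<1..1204", "1452..2075"}) {
            Loc loc = parse(s);
            System.out.println(s + " -> start=" + loc.start + " stop=" + loc.stop
                    + " startPartial=" + loc.startPartial);
        }
    }
}
```

Each parsed value maps onto one field of a record, which is the step that makes the flat-file data 'computable'.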
Simple example: a relational model for
biological sequences and features
Remove the last page from
your binder
CREATE Sequence
CREATE TABLE Sequence (
    sequence_id INT NOT NULL AUTO_INCREMENT,
    sequence LONGTEXT NOT NULL,
    defline TEXT,
    accession VARCHAR(255) NOT NULL,
    version INT DEFAULT 0,
    length INT DEFAULT 0,
    moltype INT NOT NULL,
    PRIMARY KEY (sequence_id)
);
CREATE Ontology
CREATE TABLE Ontology (
    ontology_id INT NOT NULL AUTO_INCREMENT,
    term VARCHAR(255) NOT NULL,
    description TEXT NOT NULL,
    PRIMARY KEY (ontology_id)
);
CREATE Feature
CREATE TABLE Feature (
    feature_id INT NOT NULL AUTO_INCREMENT,
    sequence_id INT NOT NULL,
    ontology_id INT NOT NULL,
    -- name the referenced column explicitly
    FOREIGN KEY (sequence_id) REFERENCES Sequence (sequence_id),
    FOREIGN KEY (ontology_id) REFERENCES Ontology (ontology_id),
    PRIMARY KEY (feature_id)
);
CREATE Location
CREATE TABLE Location (
    location_id INT NOT NULL AUTO_INCREMENT,
    feature_id INT NOT NULL,
    start INT NOT NULL,
    stop INT NOT NULL,
    strand INT NOT NULL,
    -- name the referenced column explicitly
    FOREIGN KEY (feature_id) REFERENCES Feature (feature_id),
    PRIMARY KEY (location_id)
);
CREATE Qualifier
CREATE TABLE Qualifier (
    qualifier_id INT NOT NULL AUTO_INCREMENT,
    feature_id INT NOT NULL,
    ontology_id INT NOT NULL,
    value TEXT NOT NULL,
    -- name the referenced column explicitly
    FOREIGN KEY (feature_id) REFERENCES Feature (feature_id),
    FOREIGN KEY (ontology_id) REFERENCES Ontology (ontology_id),
    PRIMARY KEY (qualifier_id)
);
INSERT an ontology
mysql> INSERT INTO Ontology (term, description) VALUES
    -> ('start codon', 'denotes an Methionine codon of a transcript');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM Ontology;
+-------------+-------------+---------------------------------------------+
| ontology_id | term        | description                                 |
+-------------+-------------+---------------------------------------------+
|           3 | start codon | denotes an Methionine codon of a transcript |
+-------------+-------------+---------------------------------------------+
1 row in set (0.01 sec)
INSERT some more ontologies
mysql> INSERT INTO Ontology (term, description) VALUES
    -> ('exon', 'an exon in genomic sequence');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO Ontology (term, description) VALUES
    -> ('exon type', '3\'UTR, initial, internal, terminal, 5\'UTR');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM Ontology;
+-------------+-------------+---------------------------------------------+
| ontology_id | term        | description                                 |
+-------------+-------------+---------------------------------------------+
|           3 | start codon | denotes an Methionine codon of a transcript |
|           4 | exon        | an exon in genomic sequence                 |
|           5 | exon type   | 3'UTR, initial, internal, terminal, 5'UTR   |
+-------------+-------------+---------------------------------------------+
3 rows in set (0.00 sec)
INSERT a sequence
mysql> DESC Sequence;
+-------------+--------------+------+-----+---------+----------------+
| Field       | Type         | Null | Key | Default | Extra          |
+-------------+--------------+------+-----+---------+----------------+
| sequence_id | int(11)      |      | PRI | NULL    | auto_increment |
| sequence    | longtext     |      |     |         |                |
| defline     | text         | YES  |     | NULL    |                |
| accession   | varchar(255) |      |     |         |                |
| version     | int(11)      | YES  |     | 0       |                |
| length      | int(11)      | YES  |     | 0       |                |
| moltype     | int(11)      |      |     | 0       |                |
+-------------+--------------+------+-----+---------+----------------+
7 rows in set (0.00 sec)

mysql> INSERT INTO Sequence (sequence, defline, accession, version, length, moltype)
    -> VALUES ('ATGACGATCAGCATCAGCTACAGCTG', '> seq1', 'seq1', 1, 26, 1);
Query OK, 1 row affected (0.00 sec)
INSERT a Feature on a sequence
mysql> SELECT * FROM Sequence;
+-------------+----------------------------+---------+-----------+---------+--------+---------+
| sequence_id | sequence                   | defline | accession | version | length | moltype |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
|           2 | ATGACGATCAGCATCAGCTACAGCTG | > seq1  | seq1      |       1 |     26 |       1 |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
1 row in set (0.03 sec)

mysql> SELECT * FROM Ontology;
+-------------+-------------+---------------------------------------------+
| ontology_id | term        | description                                 |
+-------------+-------------+---------------------------------------------+
|           3 | start codon | denotes an Methionine codon of a transcript |
|           4 | exon        | an exon in genomic sequence                 |
|           5 | exon type   | 3'UTR, initial, internal, terminal, 5'UTR   |
+-------------+-------------+---------------------------------------------+
3 rows in set (0.00 sec)

mysql> INSERT INTO Feature (sequence_id, ontology_id)
    -> VALUES (2, 3);
Query OK, 1 row affected (0.00 sec)
INSERT a Location
mysql> SELECT * FROM Feature;
+------------+-------------+-------------+
| feature_id | sequence_id | ontology_id |
+------------+-------------+-------------+
|          1 |           2 |           3 |
+------------+-------------+-------------+
1 row in set (0.01 sec)

mysql> DESC Location;
+-------------+---------+------+-----+---------+----------------+
| Field       | Type    | Null | Key | Default | Extra          |
+-------------+---------+------+-----+---------+----------------+
| location_id | int(11) |      | PRI | NULL    | auto_increment |
| feature_id  | int(11) |      |     | 0       |                |
| start       | int(11) |      |     | 0       |                |
| stop        | int(11) |      |     | 0       |                |
| strand      | int(11) |      |     | 0       |                |
+-------------+---------+------+-----+---------+----------------+
5 rows in set (0.00 sec)

mysql> INSERT INTO Location (feature_id, start, stop, strand)
    -> VALUES(1,1,3,1);
Query OK, 1 row affected (0.02 sec)
Queries using SELECT
mysql> SELECT * FROM Sequence;
+-------------+----------------------------+---------+-----------+---------+--------+---------+
| sequence_id | sequence                   | defline | accession | version | length | moltype |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
|           2 | ATGACGATCAGCATCAGCTACAGCTG | > seq1  | seq1      |       1 |     26 |       1 |
|           3 | SLKLSKLPSPLYQVCLE          | > seq2  | L32174    |       1 |     17 |       3 |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
2 rows in set (0.00 sec)

mysql> SELECT sequence FROM Sequence WHERE accession = 'seq1';
+----------------------------+
| sequence                   |
+----------------------------+
| ATGACGATCAGCATCAGCTACAGCTG |
+----------------------------+
1 row in set (0.12 sec)

mysql> SELECT length FROM Sequence WHERE sequence_id = 3;
+--------+
| length |
+--------+
|     17 |
+--------+
1 row in set (0.03 sec)
Joining tables
mysql> SELECT * FROM Feature;
+------------+-------------+-------------+
| feature_id | sequence_id | ontology_id |
+------------+-------------+-------------+
|          1 |           2 |           3 |
|          2 |           2 |           4 |
|          3 |           2 |           4 |
+------------+-------------+-------------+
3 rows in set (0.04 sec)

Return me the descriptions of the features in the Feature table
mysql> SELECT feature_id, description
    -> FROM Feature, Ontology
    -> WHERE Feature.ontology_id = Ontology.ontology_id;
+------------+---------------------------------------------+
| feature_id | description                                 |
+------------+---------------------------------------------+
|          1 | denotes an Methionine codon of a transcript |
|          2 | an exon in genomic sequence                 |
|          3 | an exon in genomic sequence                 |
+------------+---------------------------------------------+
3 rows in set (0.04 sec)
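As a rough illustration of what the join above computes, the following Java sketch pairs every Feature row with every Ontology row and keeps only the pairs whose ontology_id values match. The class name `JoinDemo` and the in-memory row layout are assumptions made for this example; a real database engine would normally use an index rather than this naive nested loop.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JoinDemo {
    // Feature rows as {feature_id, sequence_id, ontology_id}, copied from the slide.
    static final int[][] FEATURE = { {1, 2, 3}, {2, 2, 4}, {3, 2, 4} };

    // Ontology rows as ontology_id -> description, copied from the slide.
    static Map<Integer, String> ontology() {
        Map<Integer, String> rows = new LinkedHashMap<>();
        rows.put(3, "denotes an Methionine codon of a transcript");
        rows.put(4, "an exon in genomic sequence");
        rows.put(5, "3'UTR, initial, internal, terminal, 5'UTR");
        return rows;
    }

    /** Nested-loop equi-join on ontology_id; returns "feature_id | description" lines. */
    public static List<String> joinFeatureOntology() {
        List<String> result = new ArrayList<>();
        Map<Integer, String> ontologyRows = ontology();
        for (int[] feature : FEATURE) {
            for (Map.Entry<Integer, String> o : ontologyRows.entrySet()) {
                // Same condition as WHERE Feature.ontology_id = Ontology.ontology_id
                if (feature[2] == o.getKey()) {
                    result.add(feature[0] + " | " + o.getValue());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String row : joinFeatureOntology()) {
            System.out.println(row);
        }
    }
}
```

The 'exon type' ontology row never appears in the result because no Feature row refers to ontology_id 5.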
Setting up a complex query
• Consider sequence ‘seq1’ with the following
features:
– Initial exon from 1..6
– Internal exon from 15..20
• Note that with the relational model the term ‘exon’
only appears once in the database
Complex query
mysql> SELECT * FROM Sequence WHERE sequence_id = 2;
+-------------+----------------------------+---------+-----------+---------+--------+---------+
| sequence_id | sequence                   | defline | accession | version | length | moltype |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
|           2 | ATGACGATCAGCATCAGCTACAGCTG | > seq1  | seq1      |       1 |     26 |       1 |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
1 row in set (0.04 sec)

mysql> SELECT * FROM Feature WHERE sequence_id = 2;
+------------+-------------+-------------+
| feature_id | sequence_id | ontology_id |
+------------+-------------+-------------+
|          1 |           2 |           3 |
|          2 |           2 |           4 |
|          3 |           2 |           4 |
+------------+-------------+-------------+
3 rows in set (0.03 sec)

mysql> SELECT * FROM Location;
+-------------+------------+-------+------+--------+
| location_id | feature_id | start | stop | strand |
+-------------+------------+-------+------+--------+
|           1 |          1 |     1 |    3 |      1 |
|           2 |          2 |     1 |    6 |      1 |
|           3 |          3 |    15 |   20 |      1 |
+-------------+------------+-------+------+--------+
3 rows in set (0.20 sec)

mysql> SELECT * FROM Ontology;
+-------------+-------------+---------------------------------------------+
| ontology_id | term        | description                                 |
+-------------+-------------+---------------------------------------------+
|           3 | start codon | denotes an Methionine codon of a transcript |
|           4 | exon        | an exon in genomic sequence                 |
|           5 | exon type   | 3'UTR, initial, internal, terminal, 5'UTR   |
+-------------+-------------+---------------------------------------------+
3 rows in set (0.00 sec)

mysql> SELECT * FROM Qualifier;
+--------------+------------+-------------+----------+
| qualifier_id | feature_id | ontology_id | value    |
+--------------+------------+-------------+----------+
|            1 |          2 |           5 | initial  |
|            2 |          3 |           5 | internal |
+--------------+------------+-------------+----------+

The relational model stores data efficiently and optimises the modifiability
of the data. What if ‘exon’ changes to something else?
Complex query
Return me the sub-sequences and coordinates of the ‘exon’ features of ‘seq1’
Note: SUBSTRING(str, pos, len) takes a length as its third argument, so the exon length stop - start + 1 is passed rather than stop itself.
mysql> SELECT SUBSTRING(sequence, start, stop - start + 1), start, stop, term
    -> FROM Sequence, Ontology, Feature, Location
    -> WHERE accession = 'seq1' AND term = 'exon' AND
    -> Feature.ontology_id = Ontology.ontology_id AND
    -> Feature.sequence_id = Sequence.sequence_id AND
    -> Feature.feature_id = Location.feature_id;
+----------------------------------------------+-------+------+------+
| SUBSTRING(sequence, start, stop - start + 1) | start | stop | term |
+----------------------------------------------+-------+------+------+
| ATGACG                                       |     1 |    6 | exon |
| CAGCTA                                       |    15 |   20 | exon |
+----------------------------------------------+-------+------+------+
2 rows in set (0.04 sec)
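Coordinate conventions are an easy place to slip when pulling sub-sequences: the Location table stores 1-based, inclusive start and stop values, while a length-based function such as MySQL's SUBSTRING(str, pos, len) needs stop - start + 1 as its third argument. The Java sketch below shows the equivalent conversion for `String.substring`; the class and method names (`SubsequenceDemo`, `extract`) are hypothetical.

```java
public class SubsequenceDemo {
    /**
     * Returns bases start..stop of seq, where both coordinates are
     * 1-based and inclusive (the convention used in the Location table).
     */
    public static String extract(String seq, int start, int stop) {
        if (start < 1 || stop < start || stop > seq.length()) {
            throw new IllegalArgumentException("bad range " + start + ".." + stop);
        }
        // Java's substring is 0-based with an exclusive end index,
        // so 1-based inclusive start..stop maps to [start - 1, stop).
        return seq.substring(start - 1, stop);
    }

    public static void main(String[] args) {
        String seq1 = "ATGACGATCAGCATCAGCTACAGCTG";
        System.out.println(extract(seq1, 1, 6));   // initial exon, 6 bases
        System.out.println(extract(seq1, 15, 20)); // internal exon, 6 bases
    }
}
```

Both exons are six bases long, which is a quick sanity check on any query that extracts them.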
Even more complex…
Return me the sub-sequences, coordinates, feature name and qualifier value
of the ‘internal exon’ features of ‘seq1’
Note: SUBSTRING(str, pos, len) takes a length as its third argument, so the exon length stop - start + 1 is passed rather than stop itself.
mysql> SELECT SUBSTRING(sequence, start, stop - start + 1), start, stop, o1.term, value
    -> FROM Sequence, Feature, Ontology o1, Ontology o2, Location, Qualifier
    -> WHERE accession = 'seq1' AND
    -> o1.term = 'exon' AND
    -> o2.term = 'exon type' AND
    -> value = 'internal' AND
    -> Feature.ontology_id = o1.ontology_id AND
    -> Qualifier.ontology_id = o2.ontology_id AND
    -> Qualifier.feature_id = Feature.feature_id AND
    -> Feature.sequence_id = Sequence.sequence_id AND
    -> Location.feature_id = Feature.feature_id;
+----------------------------------------------+-------+------+------+----------+
| SUBSTRING(sequence, start, stop - start + 1) | start | stop | term | value    |
+----------------------------------------------+-------+------+------+----------+
| CAGCTA                                       |    15 |   20 | exon | internal |
+----------------------------------------------+-------+------+------+----------+
1 row in set (0.05 sec)
Aggregate queries
mysql> SELECT * FROM Sequence;
+-------------+----------------------------+---------+-----------+---------+--------+---------+
| sequence_id | sequence                   | defline | accession | version | length | moltype |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
|           2 | ATGACGATCAGCATCAGCTACAGCTG | > seq1  | seq1      |       1 |     26 |       1 |
|           3 | SLKLSKLPSPLYQVCLE          | > seq2  | L32174    |       1 |     17 |       3 |
|           4 | MASQQQCGAR                 | > seq   | seq3      |       1 |     10 |       3 |
+-------------+----------------------------+---------+-----------+---------+--------+---------+

mysql> SELECT count(*), moltype FROM Sequence GROUP BY moltype;
+----------+---------+
| count(*) | moltype |
+----------+---------+
|        1 |       1 |
|        2 |       3 |
+----------+---------+
2 rows in set (0.08 sec)
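For readers more at home in Java than SQL, the GROUP BY query above can be thought of as building a map from each distinct moltype to a running count. This is only a sketch of the idea; the class and method names (`GroupByDemo`, `countByMoltype`) are hypothetical and the rows are hard-coded from the slide.

```java
import java.util.Map;
import java.util.TreeMap;

public class GroupByDemo {
    /** Counts rows per moltype, like SELECT count(*), moltype ... GROUP BY moltype. */
    public static Map<Integer, Integer> countByMoltype(int[] moltypes) {
        Map<Integer, Integer> counts = new TreeMap<>(); // sorted by moltype
        for (int moltype : moltypes) {
            counts.merge(moltype, 1, Integer::sum); // start at 1, then add 1 per row
        }
        return counts;
    }

    public static void main(String[] args) {
        // moltype column of the three Sequence rows shown above
        int[] moltypes = {1, 3, 3};
        for (Map.Entry<Integer, Integer> e : countByMoltype(moltypes).entrySet()) {
            System.out.println("moltype " + e.getKey() + ": " + e.getValue() + " row(s)");
        }
    }
}
```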
Using LIMIT
mysql> SELECT * FROM Sequence LIMIT 2;
+-------------+----------------------------+---------+-----------+---------+--------+---------+
| sequence_id | sequence                   | defline | accession | version | length | moltype |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
|           2 | ATGACGATCAGCATCAGCTACAGCTG | > seq1  | seq1      |       1 |     26 |       1 |
|           3 | SLKLSKLPSPLYQVCLE          | > seq2  | L32174    |       1 |     17 |       3 |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
2 rows in set (0.08 sec)
UPDATING a table
mysql> SELECT * FROM Qualifier;
+--------------+------------+-------------+----------+
| qualifier_id | feature_id | ontology_id | value    |
+--------------+------------+-------------+----------+
|            1 |          2 |           5 | initial  |
|            2 |          3 |           5 | internal |
+--------------+------------+-------------+----------+
2 rows in set (0.00 sec)

mysql> UPDATE Qualifier SET value = 'terminal'
    -> WHERE qualifier_id = 2;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
DELETING from a table
mysql> DELETE FROM Qualifier
    -> WHERE qualifier_id = 2;
Query OK, 1 row affected (0.04 sec)

mysql> SELECT * FROM Qualifier;
+--------------+------------+-------------+---------+
| qualifier_id | feature_id | ontology_id | value   |
+--------------+------------+-------------+---------+
|            1 |          2 |           5 | initial |
+--------------+------------+-------------+---------+
1 row in set (0.03 sec)
Optimisation
• Perking up MySQL
– Queries
– Database server
Indexing
• In general, indexing your data makes retrieval orders of
magnitude faster
• Consider a list of 1000000 sequences with accession numbers
• You need to find the one sequence with accession number
‘AC123456’
• Without an index on the accession field, the lookup may need
up to 1,000,000 comparisons, i.e. O(n)
– Equivalent to scanning through a list
• With an index on the accession field, the lookup needs about
log2(1,000,000) ≈ 20 comparisons, i.e. O(log n)
– Somewhat like a binary search through a sorted list (MySQL indexes are typically B-trees)
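The gap between an unindexed and an indexed lookup can be felt with a small Java experiment. The sketch below is only an analogy: it counts the comparisons made by a linear scan and by a binary search over 1,000,000 sorted, made-up accession strings. The class and method names are hypothetical, and a real MySQL index is typically a B-tree rather than a sorted array, although its lookup cost grows logarithmically in much the same way.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexLookupDemo {
    /** Builds n made-up accessions ("AC000000", "AC000001", ...) in sorted order. */
    public static List<String> makeAccessions(int n) {
        List<String> rows = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            rows.add(String.format("AC%06d", i));
        }
        return rows;
    }

    /** Unindexed lookup: rows examined by a scan from the top until the target is found. */
    public static int linearScanSteps(List<String> rows, String target) {
        int steps = 0;
        for (String row : rows) {
            steps++;
            if (row.equals(target)) {
                return steps;
            }
        }
        return steps; // target absent: every row was examined
    }

    /** Index-like lookup: probes made by a binary search over the sorted keys. */
    public static int binarySearchSteps(List<String> sorted, String target) {
        int lo = 0;
        int hi = sorted.size() - 1;
        int steps = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            steps++;
            int cmp = sorted.get(mid).compareTo(target);
            if (cmp == 0) {
                return steps;
            } else if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return steps; // target absent
    }

    public static void main(String[] args) {
        List<String> accessions = makeAccessions(1_000_000);
        String target = "AC123456";
        System.out.println("linear scan:   " + linearScanSteps(accessions, target) + " comparisons");
        System.out.println("binary search: " + binarySearchSteps(accessions, target) + " comparisons");
    }
}
```

The binary search never needs more than 20 probes for a million keys, whereas the scan cost depends on where the row happens to sit.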
Types of indexes
• PRIMARY KEY
– To identify the main accessor field of the table
• UNIQUE
– Constraint to ensure that all entries in a field are different
• INDEX
– Creates a way to quickly search on a given field
• FULLTEXT
– For large TEXT fields > 255 characters
• Compound indexes – (column1, column2, …)
• NOTE – index is synonymous with KEY
Drawbacks to indexing
• Need more disk space
• Can slow down inserts
• Know your data and the queries you will
perform on the data
– Only index fields you think you will query on
– Requires spending time in the design phase to
define ‘requirements’ of the database
Creating an index
mysql> CREATE INDEX acindex ON Sequence (accession);
Query OK, 1 row affected (0.18 sec)
Records: 1 Duplicates: 0 Warnings: 0
Tuning the database
• > mysqladmin variables
• > mysqld --help
Variables (--variable-name=value)
and boolean options {FALSE|TRUE}  Value (after reading options)
--------------------------------- ----------------------------
basedir                           /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/
bdb-home                          (No default value)
bdb-logdir                        (No default value)
bdb-tmpdir                        (No default value)
bind-address                      (No default value)
console                           FALSE
chroot                            (No default value)
character-sets-dir                /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/share/mysql/charsets/
datadir                           /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/data/
default-character-set             latin1
enable-locking                    FALSE
enable-pstack                     FALSE
gdb                               FALSE
innodb_data_home_dir              (No default value)
innodb_log_group_home_dir         (No default value)
innodb_log_arch_dir               (No default value)
innodb_flush_log_at_trx_commit    1
innodb_flush_method               (No default value)
innodb_fast_shutdown              TRUE
innodb_max_dirty_pages_pct        90
init-file                         (No default value)
log                               (No default value)
language                          /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/share/mysql/english/
local-infile                      TRUE
log-bin                           (No default value)
log-bin-index                     (No default value)
log-isam                          myisam.log
log-update                        (No default value)
log-slow-queries                  (No default value)
log-slave-updates                 FALSE
low-priority-updates              FALSE
master-host                       (No default value)
master-user                       test
master-port                       3306
master-connect-retry              60
master-retry-count                86400
master-info-file                  master.info
master-ssl                        FALSE
master-ssl-key                    (No default value)
master-ssl-cert                   (No default value)
master-ssl-capath                 (No default value)
master-ssl-cipher                 (No default value)
myisam-recover                    OFF
memlock                           FALSE
disconnect-slave-event-count      0
abort-slave-event-count           0
max-binlog-dump-events            0
sporadic-binlog-dump-fail         FALSE
new                               FALSE
old-protocol                      10
old-rpl-compat                    FALSE
pid-file                          /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/data/watson.pid
log-error
port                              3306
report-host                       (No default value)
report-user                       (No default value)
report-password                   (No default value)
report-port                       3306
rpl-recovery-rank                 0
relay-log                         (No default value)
relay-log-index                   (No default value)
safe-user-create                  FALSE
server-id                         1
show-slave-auth-info              FALSE
concurrent-insert                 TRUE
skip-grant-tables                 FALSE
skip-slave-start                  FALSE
relay-log-info-file               relay-log.info
slave-load-tmpdir                 /raid/tmp/
socket                            /tmp/mysql.sock
sql-bin-update-same               FALSE
sql-mode                          OFF
temp-pool                         TRUE
tmpdir                            /raid/tmp
external-locking                  FALSE
use-symbolic-links                TRUE
symbolic-links                    TRUE
log-warnings                      FALSE
warnings                          FALSE
back_log                          50
bdb_cache_size                    8388600
bdb_log_buffer_size               0
bdb_max_lock                      10000
bdb_lock_max                      10000
binlog_cache_size                 32768
connect_timeout                   5
delayed_insert_timeout            300
delayed_insert_limit              100
delayed_queue_size                1000
flush_time                        0
ft_min_word_len                   4
ft_max_word_len                   254
ft_max_word_len_for_sort          20
ft_stopword_file                  (No default value)
innodb_mirrored_log_groups        1
innodb_log_files_in_group         2
innodb_log_file_size              5242880
innodb_log_buffer_size            1048576
innodb_buffer_pool_size           8388608
innodb_additional_mem_pool_size   1048576
innodb_file_io_threads            4
innodb_lock_wait_timeout          50
innodb_thread_concurrency         8
innodb_force_recovery             0
interactive_timeout               28800
join_buffer_size                  131072
key_buffer_size                   402653184
long_query_time                   10
lower_case_table_names            FALSE
max_allowed_packet                1047552
max_binlog_cache_size             4294967295
max_binlog_size                   1073741824
max_connections                   100
max_connect_errors                10
max_delayed_threads               20
max_heap_table_size               16777216
max_join_size                     18446744073709551615
max_relay_log_size                0
max_seeks_for_key                 4294967295
max_sort_length                   1024
max_tmp_tables                    32
max_user_connections              0
max_write_lock_count              4294967295
bulk_insert_buffer_size           8388608
myisam_block_size                 1024
myisam_max_extra_sort_file_size   268435456
myisam_max_sort_file_size         2147483647
myisam_repair_threads             1
myisam_sort_buffer_size           67108864
net_buffer_length                 16384
net_retry_count                   10
net_read_timeout                  30
net_write_timeout                 60
open_files_limit                  0
query_cache_limit                 1048576
query_cache_size                  33554432
query_cache_type                  1
read_buffer_size                  2093056
read_rnd_buffer_size              262144
record_buffer                     2093056
relay_log_space_limit             0
slave_compressed_protocol         FALSE
slave_net_timeout                 3600
read-only                         FALSE
slow_launch_time                  2
sort_buffer_size                  2097144
table_cache                       512
thread_concurrency                8
thread_cache_size                 8
tmp_table_size                    33554432
thread_stack                      196608
wait_timeout                      28800
default-week-format               0
To see what values a running MySQL server is using, type
'mysqladmin variables' instead of 'mysqld --help'.
Tuning the system to your needs
• Need to think about uses of the database
– How many concurrent connections?
– Will there be large records?
– Will there be repetitive queries?
– Will I need large indexes?
• Tuning the system can give huge gains in
performance – lets you get the most out of
the system
Important parameters
• max_allowed_packet
– Largest amount of data to be transmitted to the client in 1
packet
• max_connections
– The largest number of concurrent connections to the
database server
• datadir
– The location of the data files on the system
• query_cache_size
– Size of cache for repetitive queries
• Many, many others…..
COMMUNICATING WITH MySQL
Communicating with MySQL
• Through a GUI
– MySQL ControlCentre
• http://www.mysql.com/products/mysqlcc/
• Standalone application supported by MySQL
• Through the web
– PhpMyAdmin
• http://www.phpmyadmin.net/home_page/
• Works with Apache web server
• Through the Unix command line
– MySQL client
– Comes with MySQL
• Through APIs (Application Programming Interface)
– MySQL C API
– Perl DBI
– MySQL++ (C++)
• http://dev.mysql.com/downloads/other/plusplus/
– JDBC (Java Database Connectivity)
• Java protocol and API for RDBMS communication
Communicating with MySQL
• Choose the method that is ‘right’ for the job
• Administration
– MySQL CC
– PHP MyAdmin
• Standalone Application
– APIs
• Web Application
– PHP/Java servlets
• ‘Low – throughput’ queries
– Command line client
Working with JDBC
• JDBC is a standard API
that provides database-independent connectivity
to allow a Java
application to interact with
a database
http://java.sun.com/products/jdbc/overview.html
Connector/J
• JDBC implementation for MySQL is Connector/J
– http://www.mysql.com/products/connector/j/
• Installation:
– $ export CLASSPATH=/path/to/mysql-connector-java-[version]-bin.jar:$CLASSPATH
Connector/J Steps
1. Establish a connection
2. Prepare one or more queries
3. Execute one or more queries
4. Process the results (if applicable)
5. Destroy connection
Connector/J – Establishing a connection
import java.sql.Connection;
import java.sql.DriverManager;

// we need the following 6 variables to make a JDBC connection
String DBSERVERNAME = "mysql";
String JDBCDRIVERNAME = "com.mysql.jdbc.Driver";
String host = "my.database.com";
String databaseName = "sequence";
String user = "me";
String password = "mypwd";
// load the driver into memory
Class.forName(JDBCDRIVERNAME).newInstance();
// create the connection URL
String connectionURL = "jdbc:" + DBSERVERNAME + "://" + host + "/"
    + databaseName + "?" + "user=" + user + "&password=" + password;
// get the connection from the driver manager
Connection connection =
    DriverManager.getConnection(connectionURL);
Preparing and executing a query
// object required to execute the query
Statement statement = null;
// object to store results of the query
ResultSet resultSet = null;
// create the query string
String query = "SELECT sequence_id FROM Sequence";
// initialise the statement
statement = connection.createStatement();
// execute the query – the results are returned in resultSet
resultSet = statement.executeQuery(query);
Process the results & close connection
// iterate through the ‘rows’ returned by the query
while (resultSet.next()) {
    // the column name must match a column selected by the query
    int sequenceId = resultSet.getInt("sequence_id");
    // do something with the sequenceId
}
// destroy the connection if it's no longer needed
connection.close();
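The five Connector/J steps can also be written more compactly with a PreparedStatement and try-with-resources, which close the statement and result set automatically. This is a hedged sketch rather than the lecture's own code: the class and method names (`JdbcFlowDemo`, `buildUrl`, `fetchSequenceIds`) are hypothetical, the host and database names are the placeholders from the slides, and an actual connection is attempted only if a user name and password are passed on the command line, with Connector/J on the classpath and a reachable server.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class JdbcFlowDemo {
    /** Builds a MySQL JDBC URL; credentials are passed separately to the driver manager. */
    public static String buildUrl(String host, String databaseName) {
        return "jdbc:mysql://" + host + "/" + databaseName;
    }

    /** Steps 2 to 4: prepare, execute and process a parameterised query. */
    public static List<Integer> fetchSequenceIds(Connection connection, String accession)
            throws SQLException {
        String query = "SELECT sequence_id FROM Sequence WHERE accession = ?";
        List<Integer> ids = new ArrayList<>();
        try (PreparedStatement statement = connection.prepareStatement(query)) {
            statement.setString(1, accession); // bind the value instead of concatenating it
            try (ResultSet resultSet = statement.executeQuery()) {
                while (resultSet.next()) {
                    ids.add(resultSet.getInt("sequence_id"));
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) throws SQLException {
        String url = buildUrl("my.database.com", "sequence");
        System.out.println(url);
        if (args.length == 2) {
            // Steps 1 and 5: the connection is opened here and closed automatically.
            try (Connection connection = DriverManager.getConnection(url, args[0], args[1])) {
                System.out.println(fetchSequenceIds(connection, "seq1"));
            }
        }
    }
}
```

Binding the accession with `setString` keeps user-supplied values out of the SQL text, which is one reason to prefer a PreparedStatement over building query strings by hand.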
Topics not covered…
• MySQL tools
– mysqldump
• Tool to dump a schema, all the data and/or both
– mysqlimport
• Tool to import delimited files
• Look before you parse!
– mysqladmin
• For DBAs to create database, change passwords, etc…
– Read the mysql documentation
Topics not covered…
• Setting connection parameters in JDBC
– Consult Connector/J docs
• Database design
– Extremely important process
– Many courses at univ/college
Summary
• Relational databases are necessary in bioinformatics
• Relational databases allow us to efficiently store and
query large amounts of data
• MySQL is a good choice for RDBMS engine because
it is highly functional at no cost
• JDBC provides a way to access MySQL from within a
Java program
Resources
• MySQL
– http://www.mysql.com
– http://dev.mysql.com/doc/mysql/en/index.html
– http://www.mysql.com/products/mysqlcc/
– http://dev.mysql.com/doc/connector/j/en
• NAR Database Issue 2004
– http://nar.oupjournals.org/content/vol32/suppl_1
• JDBC
– http://java.sun.com/products/jdbc/
• Me
– [email protected]