Chem 731 – Computer methods for studying protein structure and
function
These are the course notes for the ongoing graduate course. I will post updated versions
as I go along. The notes contain the slides I show in class, as well as
additional notes and explanations.
Please let me know if something is unclear, so that I can improve these notes.
This version is from Wednesday 4th December, 2013.
Contents
1 Introduction
1.1 Overview
1.1.1 Before we begin
1.1.2 Course topics
1.2 Linux
1.2.1 What is Linux?
1.2.2 Why use Linux, if it is a pain in the behind?
1.2.3 Recommended Linux distributions
1.2.4 Free Software
1.2.5 Web resources for Linux
1.2.6 Let’s install Linux, if we haven’t yet
1.3 Using the shell
1.4 The bash shell
1.4.1 Some basic shell commands
1.4.2 Becoming the super user
1.4.3 Installing some software
1.4.4 Fortune cookies
1.4.5 Throwing fortune cookies into black holes
1.4.6 Saving fortune cookies for posterity
1.4.7 Saving more fortune cookies for posterity
1.4.8 Our own cookie factory
1.4.9 A fancier cookie factory
1.4.10 Viewing documentation with man
1.4.11 Searching documentation with apropos
2 LaTeX
2.1 Prerequisites
2.2 Overview
2.2.1 What is LaTeX?
2.2.2 Example LaTeX markup (source code of slide above)
2.2.3 “Logical markup”: Separating content from presentation
2.3 Examples and exercises
2.3.1 The source file for this presentation
2.3.2 Should I use LaTeX or a word processor?
2.3.3 Exercise 1: Create a LaTeX document
2.3.4 Exercise 1 ctd.
2.3.5 Exercise 2: Use the UW thesis template
3 Gnuplot
3.1 Installation
3.2 Introduction
3.3 Plotting functions and files
3.3.1 Start Gnuplot
3.3.2 Running a gnuplot script file
3.3.3 Saving a plot to file
3.3.4 Including a plot in LaTeX
3.3.5 Plotting data files
3.3.6 Working with data files in different formats
3.3.7 Plotting CSV files; multiple data sets
3.4 Curve fitting with Gnuplot
3.4.1 Theories without adjustable parameters
3.4.2 Theories with adjustable parameters
3.4.3 Numerical curve fitting by gradient descent
3.4.4 Example: Receptor activation by ligand
3.4.5 The 5-HT2B receptor can be up- and down-regulated by ligands
3.4.6 Receptor activation or inhibition by ligand – theory
3.4.7 How many variable parameters should we use?
3.4.8 How good is the fit?
3.4.9 Testing exact theories with inexact data
3.4.10 Testing a theory with adjustable parameters
3.4.11 Evaluating the fit error: χ²
3.4.12 How do we obtain the standard deviations of the measured values?
3.4.13 A practical exercise: Calcium binding to daptomycin
3.4.14 What daptomycin is supposed to do
3.4.15 One or more types of binding sites for calcium?
3.4.16 Daptomycin fluorescence after addition of EDTA at t = 0
3.4.17 A single-exponential model
3.4.18 Fitting with 1 to 4 exponential terms
3.4.19 Where are the parameters obtained from the fit?
3.4.20 Which fit is the best?
3.4.21 Plotting the fit residuals
3.4.22 Residuals from a good fit (4 exponentials)
3.4.23 Residuals from a poor fit (2 exponentials)
3.4.24 So have we found the truth?
3.5 Code and data listings
4 Protein structure visualization with Jmol and Pymol
4.1 Introduction
4.1.1 Why X-rays?
4.1.2 Is it easy?
4.1.3 Protein structure databases
4.1.4 Protein structure family relations
4.1.5 The PDB data format
4.1.6 Software for molecular visualization
4.2 Jmol
4.2.1 Jmol exercises
4.2.2 The PDB file
4.2.3 The fields of the ATOM record
4.2.4 A hetero-atom record
4.2.5 Tweaking the view
4.2.6 Saving our hard work
4.2.7 Saving the current state
4.2.8 Saving images
4.2.9 Looking at protein folds
4.2.10 Folds
4.2.11 More on selections
4.2.12 Exercise: Try to reproduce this display
4.2.13 Hints
4.2.14 And another one
4.2.15 And a last one
4.3 Pymol
4.3.1 Documentation
4.3.2 The GUI
4.3.3 Opening files
4.3.4 Working with single structures
4.3.5 Exercise: HIV protease with the inhibitor saquinavir bound to it
4.3.6 What are virus proteases, anyway?
4.3.7 Saving a cleaned-up version of the molecule
4.3.8 Visualizing structure elements
4.3.9 Saving state
4.3.10 Selections
4.3.11 Prettyfication
4.3.12 Producing high-quality figures
4.3.13 Driving Pymol with scripts
4.3.14 The image produced by gyrase.pml
4.3.15 What is DNA topoisomerase anyway?
4.3.16 The reaction catalyzed by DNA topoisomerases
4.3.17 Understanding script files
5 Sequence analysis
5.1 Introduction
5.1.1 Sequence analysis resources: Starting points
5.2 Exercises
5.2.1 Proteins of unknown function in the Saccharomyces cerevisiae (baker’s yeast) genome
5.2.2 Sequence composition and inferred properties
5.2.3 Secondary structure prediction
5.2.4 Secondary structure prediction ctd.
5.2.5 Searching for sequence motifs
5.2.6 Sequence motifs are expressed as consensus motifs
5.2.7 How do we find motifs?
5.2.8 Searching sequence motifs
5.2.9 The CAAX box motif
5.2.10 Comparing sequences
5.2.11 Aligning sequences
6 Molecular docking
6.1 Introduction
6.1.1 Overview of the procedure
6.2 Exercise: Docking imatinib to abl protein tyrosine kinase
6.2.1 Preparing the receptor input file
6.2.2 Preparing the ligand input file
6.2.3 Defining the search area
6.2.4 Create the Vina configuration file
6.2.5 Run Vina
6.2.6 Inspect the results in Pymol
7 Python programming
7.1 Introduction
7.1.1 Python vs. Gnuplot or LaTeX
7.1.2 Is programming easy?
7.1.3 Why Python?
7.1.4 How Python programs are created and executed
7.2 First steps
7.2.1 Python’s interactive mode
7.2.2 Naming pieces of data: Variables
7.3 Keywords and builtins
7.3.1 Some names are special
7.3.2 Python keywords
7.3.3 Python built-ins whose names are not protected
7.4 Data types
7.5 Working with more data: Containers
7.5.1 Lists
7.5.2 How variables work with mutable objects such as lists
7.5.3 Testing for identity and equality
7.5.4 List slicing
7.5.5 Tuples
7.5.6 Sets
7.5.7 Dictionaries
7.5.8 Tuples vs. lists as dictionary keys
7.6 Repeated execution: Loops
7.6.1 Iterating over a dictionary
7.6.2 Iterating over strings
7.6.3 More fun with strings
7.6.4 Exercise: Translating DNA to protein
7.7 List comprehensions
7.7.1 Exercise: Use a list comprehension to translate a DNA sequence
7.8 Nested containers and loops
7.8.1 Nested containers
7.8.2 Nested loops
7.8.3 Rewrite this using list comprehensions?
7.9 Conditional execution
7.9.1 Conditional execution inside a loop
7.10 Boolean evaluation of expressions
7.10.1 Alternative formulation of conditionals
7.10.2 Exercise: What about 6?
7.11 Functions
7.11.1 Defining functions
7.11.2 Functions with default arguments
7.11.3 Exercise: Generating random passwords
7.12 Importing code
7.12.1 Importing self-written code
7.13 Exceptions
7.13.1 Catching exceptions
7.14 Reading and writing files
7.14.1 Writing files
7.14.2 Files and functions: Exercise
Listings
3.1 Gnuplot script to fit activation of dopamine receptors by aripiprazole
3.2 Datafile for Gnuplot script in listing 3.1
3.3 Gnuplot script to fit and plot dopamine receptor activation by aripiprazole
3.4 Gnuplot script to fit up- and down-regulation of serotonin receptors
3.5 The data file for listing 3.4
3.6 Gnuplot script that fits single- to quadruple-exponential decays to the daptomycin EDTA dissociation kinetics
4.1 The gyrase.pml script for Pymol
Chapter 1
Introduction
1.1 Overview
1.1.1 Before we begin. . .
1. This class is experimental – I have never taught it, or even anything like it, before
. . . update: this is actually the second time, but the following still applies
2. There will be a certain degree of chaos
3. There will very likely be things that I forget to explain. If you lose the plot, please
tell me. Please feel free to ask anything, at any time; you will be helping me and
each other in this way
4. Course website: http://watcut.uwaterloo.ca/chem731/
1.1.2 Course topics
1. Linux
2. LaTeX
3. Data evaluation and presentation with Gnuplot
4. Molecular visualization with Jmol and Pymol
5. Sequence analysis programs
6. Molecular docking
7. Programming in Python
1.2 Linux
1.2.1 What is Linux?
1. Re-implementation of Unix, started as a hobby but now developed by both volunteers and companies such as IBM and Novell, so no longer a toy
2. Used more commonly on servers but also usable as a desktop environment
3. Open source, meaning that everyone can download the source code, modify it, and
redistribute their modified code
4. Patchwork architecture: Multiple versions of everything, including graphical user
interfaces
5. Can be a pain in the behind to get running and trouble-shoot
6. Many different variations (“distributions”) – have a look at distrowatch.com if
you are interested.
1.2.2 Why use Linux, if it is a pain in the behind?
1. Many scientific programs were originally developed for Unix workstations and
therefore usually also run on Linux
2. Because of its heritage, Linux is a good learning environment for Unix – some
people may end up having to work with Unix workstations
3. No automatic “you forgot to update Adobe BlahBlahBlah” warnings. Update: Linux
is catching up – it now shows you nuisance messages aplenty by default, too
4. No viruses – I have never had anything in some six years of daily use, despite not
running any virus protection software (then again, I don’t go to ripped movie sites
a lot)
5. To scare away the amateurs
6. Some people like a pain in the behind. . .
1.2.3 Recommended Linux distributions
Debian Linux or one of its derivatives. Recommended flavours:
1. Debian itself – a little more involved to set up, but provides a very clean and stable
system
2. Mepis – easier install and slightly better hardware recognition, but no longer
leading in this regard
3. Ubuntu – very good hardware recognition and configuration, focus on user friendliness. My impression is that it contains more bugs than Debian.
4. Linux Mint – based on Ubuntu, with more bells and whistles pre-installed
Debian and Ubuntu have an excellent software packaging system that greatly facilitates
the installation and configuration of complex programs, including scientific software.
All this packaged software is freely available.
1.2.4 Free Software . . .
Good:
• Free – can’t argue with the price
• In most cases, source code free as well – can be used, modified and re-used by
others
Bad:
• Free – no paycheck for developers, many programs developed as a hobby
• Quality varies from excellent to horrible, some programs are not maintained –
scientific software often developed as part of scientists’ day jobs though, generally
of good quality
1.2.5 Web resources for Linux
1. ubuntu.org, mepis.org, debian.org
2. The Linux documentation project: tldp.org/guides.html – lots of documentation relevant for any Linux, has a section dedicated to Debian, too
1.2.6 Let’s install Linux, if we haven’t yet
1. Make sure you prepare for the worst – back up all data you care about
2. Put your CD into the drive and reboot
3. Cross your fingers and knock on wood
4. Create root and swap partitions as required: Root ≥ 10 GB, swap 0.5–1 GB
5. Install grub to the master boot record – lets you switch between Windows and Linux
during boot
6. After install: Try your internet connection; if it doesn’t work, try to fix it
If you plan on using Linux in the long term, it may be better to create additional
partitions. My setup usually looks similar to this:
• Two system partitions to hold Linux installations (6–8 GB per partition is enough)
• One large partition that holds my data. I do not make this my home directory,
because the home directory contains all kinds of hidden files with settings, and
if I use the same home directory from different Linux installs, these are going to
stomp on each other’s feet, overwriting each other’s settings.
• A swap partition, with 1–2 times the size of the RAM. May not be needed if you
have 3-4 GB of RAM or more.
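If you already have a running Linux system at hand, the numbers that go into the rule of thumb above can be read off directly. The following is only a sketch: it assumes a Linux system that provides the file /proc/meminfo, and the variable names are made up for illustration.

```shell
# /proc/meminfo is a Linux-specific text file that describes the memory;
# awk picks the number (in kB) from the line that starts with MemTotal.
if [ -r /proc/meminfo ]; then
    ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    ram_mb=$((ram_kb / 1024))          # convert kB to MB
    swap_max_mb=$((2 * ram_mb))        # upper end of the 1-2 times RAM rule
    echo "RAM: $ram_mb MB, so swap would be $ram_mb to $swap_max_mb MB"
fi

# Show size and free space of the partition that holds the root directory
df -h /
```

The same df command, given the path of your data partition instead of /, tells you how much room is left there.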
1.3 Using the shell
In the olden days, when you powered up a computer, it would land you in the shell,
which on PCs was called MS-DOS. The computer would wait for you to type a command
and then press Enter. It would then execute this command and dump you back into
the shell, waiting for the next command. Unix machines traditionally operated the same
way, as did the original Apple computers (pre-Macintosh), which used some dialect of
Basic as the shell language.
MS-DOS, or any other shell, understands a limited number of built-in commands. In
order to achieve anything useful, several commands usually have to be executed in
series. So that you don’t have to enter the same sequences of commands over and over
again, you can write the commands into a text file instead and save this file under a
suitable name. You can then simply enter the name of this script or batch file, just like
any built-in command, and the shell will execute all the commands it reads from this
text file, all in one pass. Therefore, by writing your file, you can craft a new command
from a set of existing ones – which is the essence of what we call programming.¹
When the first Macs appeared with easy-to-use graphical user interfaces (GUIs), they
found a mixed reception; both positive and negative reactions had their valid reasons.
On the one hand, a program with a well-designed GUI makes it much easier to pick up
its basic operation intuitively. On the other hand, GUIs tend to get cumbersome with
more advanced program usage. For this reason, some programs combine a GUI with
shell-style operation and scripting. We will see examples in this course.²
While GUI-driven programs may be easier to use, shell-driven ones are easier to write.
They receive all input at the beginning, and produce all output at the end. In contrast,
a GUI program has to continually watch for new user input while also processing the
previous input. To avoid such complexities, our own programming exercises in this
class will use the shell, and it therefore is necessary for us to learn how to use it.
1.4 The bash shell
On Linux, the most widely used shell is bash, based on the older Bourne shell, from
which it derives its name (bash = “Bourne Again Shell”). As stated above, it works similarly
to MS-DOS, but it has a more powerful and versatile set of commands. It also has a
pretty hostile syntax, so that using it for advanced tasks is not fun. However, basic
usage is easy, and for anything advanced it is pretty easy to replace it with something
more readable and pleasant such as Python; we will soon see how that is done.
We can use the bash shell from within our GUI by opening a console window from the
menu (the exact location will vary with your system).
1.4.1 Some basic shell commands
Try to bring up a console window from your menu. You need to hunt for it – the exact
location in the menu will vary with the distribution that you have installed.
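Once the console window is open, a first session might look like the sketch below. It assumes a standard bash shell with the usual core utilities; the directory and file names (chem731-demo, notes.txt and so on) are made up for illustration, and everything happens inside a throw-away directory.

```shell
# mktemp -d creates a new, empty scratch directory and prints its name
workdir=$(mktemp -d)
cd "$workdir"                # change into that directory

pwd                          # print the directory we are currently in
mkdir chem731-demo           # create a new directory
cd chem731-demo              # change into it
touch notes.txt              # create an empty file
cp notes.txt copy.txt        # copy a file
mv copy.txt backup.txt       # rename (move) a file
ls                           # list the directory contents
rm backup.txt                # delete a file; there is no trash can
listing=$(ls)                # capture the output of ls in a variable
echo "Left over: $listing"
```

Typing cd without any argument takes you back to your home directory.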
¹ Another widely used kind of program that is often not appreciated as such is the spreadsheet. Each
time you enter a formula into a spreadsheet, you are in fact programming – chances are, therefore, that
you have already successfully written your own programs.
² This is not limited to Linux or Unix. Microsoft Office has its own Basic dialect built in that lets you
program add-ins to extend its functionality. An example add-in for Excel is SpectraAnalysis.xla (see
http://www.science.uwaterloo.ca/~mpalmer/software.html).
1.4.2 Becoming the super user
On Debian or Mepis, type:
su
and press <enter>. This will prompt you for the root password that you entered during
installation. On Ubuntu or Mint, type
sudo su
When prompted for a password, type your normal user password – there is no separate
root password in this case. If you like, you can create a separate root password, while
in super user mode, with
passwd
Then type your chosen root password.
System administration, including software installation, requires super user or root privileges. On multi-user systems, such tasks are reserved for the systems administrator.
On your own laptop, that is you, and there may not be a need for a separate root account.
Ubuntu and its derivatives (Mint) have done away with it, relying instead on the sudo
command to perform administrative tasks. On these systems, you can either prefix
each single administrative command with sudo, or you can become super user for the
session with sudo su.
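If you lose track of whether you are currently the super user, you can ask the system. This is a minimal sketch that relies on the fact that the root account always has the numeric user ID 0; the variable names are only for illustration.

```shell
# id -u prints the numeric ID of the current user; root is always 0
uid=$(id -u)
if [ "$uid" -eq 0 ]; then
    status="super user"
else
    status="normal user"
fi
echo "You are currently working as a $status (user ID $uid)."
```

The command whoami answers the same question by printing the user name instead of the number.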
1.4.3 Installing some software
Software can be downloaded and installed directly from the command line. Let’s try it:
apt-get install fortunes fortune-mod
This will install two software packages. When all is done, issue
exit
to leave the super user mode.
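Before installing anything, it can be handy to check whether a program is already there. The sketch below does this without needing super user rights; the list of programs is just an example, and the variable report is made up for illustration.

```shell
# command -v prints the location of a program if the shell can find it
# and reports failure if it cannot. We discard the printed location by
# sending it to /dev/null (explained in section 1.4.5).
report=""
for prog in ls fortune nano
do
    if command -v "$prog" > /dev/null 2>&1; then
        report="$report$prog:installed "
    else
        report="$report$prog:missing "
    fi
done
echo "$report"
```

Anything reported as missing can then be installed with apt-get install as shown above.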
1.4.4 Fortune cookies
Let’s test our new piece of software. Issue
fortune
try it again. . . see how useful it is?
The shell keeps a command history. You can repeat the last command by hitting the
upward arrow. Hitting it twice takes you back to the command before that, and so on.
1.4.5 Throwing fortune cookies into black holes
If you don’t want to see the fortune cookie, you can type:
fortune > /dev/null
Even more useful. /dev/null is the system’s black hole device. If you don’t want to
see the output of some command, you can redirect it into the black hole as illustrated
here.
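One detail that the example above does not show: programs write their normal output and their error messages to two separate streams, and a plain > redirects only the former. The following sketch uses echo and ls as stand-ins for fortune, so that it runs even without the fortune package; the directory /no/such/directory is assumed not to exist.

```shell
# $(...) captures whatever a command writes to its normal output
visible=$(echo "this line is kept")
hidden=$(echo "this line vanishes" > /dev/null)   # nothing left to capture
echo "visible: [$visible]"
echo "hidden:  [$hidden]"

# 2> redirects the error messages instead of the normal output.
# The part after || runs only if the command before it failed.
ls /no/such/directory 2> /dev/null || echo "ls failed, but its complaint went into the black hole"
```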
1.4.6 Saving fortune cookies for posterity
Instead of the black hole, we can also redirect the output to a file and so preserve it.
Issue
fortune > wisdom
Then issue
ls
You will see a new file named wisdom, which contains your fortune cookie. Now issue
cat wisdom
to have it printed to your console window.
1.4.7 Saving more fortune cookies for posterity
If you repeat the steps above, each new cookie will overwrite the previous one. Now
issue:
fortune >> wisdom; printf "\n" >> wisdom
and repeat these two commands a couple of times (using the up-arrow for convenience).
Now, the output of the fortune command got appended to the file instead; the printf
command served only to insert empty lines between the cookies.
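The difference between > and >> is easy to see side by side. This is a small sketch that uses echo in place of fortune, so that the result is predictable; the temporary file is created by mktemp and removed again at the end.

```shell
demo=$(mktemp)                 # temporary file, name chosen by the system
echo "first"  >  "$demo"       # > creates the file or overwrites it
echo "second" >  "$demo"       # overwrites again: "first" is gone
echo "third"  >> "$demo"       # >> appends to the end of the file
content=$(cat "$demo")         # read the file back
echo "$content"
rm "$demo"                     # clean up
```

The file ends up with two lines, "second" and "third".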
1.4.8 Our own cookie factory
Issue the command
nano
This should open a text editor called “nano” within your console window. If it doesn’t,
become root (su) and issue
apt-get install nano
Type exit to become yourself again, and then bring up nano.
Type the following:
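#!/bin/bash
# Empty the file wisdom by copying "nothing" from the black hole into it
cat /dev/null > wisdom
# Repeat the commands between do and done for i = 1, 2, ..., 5
for i in $(seq 1 1 5)
do
    echo "Cookie no. $i" >> wisdom
    echo "------------" >> wisdom
    fortune >> wisdom
    echo >> wisdom    # empty line between cookies
done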
Now press ctrl-o, give the file name “cookiefactory”, press enter, and then press ctrl-x to
exit.
Confirm that your new file exists by issuing
ls -l
Now issue the command
cookiefactory
What happened?
Let’s try
./cookiefactory
This time, it denies permission, which is progress – at least it found the file. Let’s fix
that:
chmod +x cookiefactory
makes the file executable. Now
./cookiefactory
should work.
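What happened in the two failed attempts? The shell looks for commands only in the directories listed in the PATH variable, and the current directory is normally not among them, so the plain name was not found. The ./ prefix points the shell to the current directory, but a newly created file is not marked as executable until we say so with chmod. The following sketch replays these steps with a harmless two-line script; the file name hello is made up, and the example assumes that bash lives at /bin/bash.

```shell
cd "$(mktemp -d)"                          # work in a scratch directory
echo '#!/bin/bash' > hello                 # first line of the script
echo 'echo "hello from my script"' >> hello

echo "$PATH"       # the directories that the shell searches for commands
ls -l hello        # permissions typically start with -rw-: no x yet
chmod +x hello     # mark the file as executable
ls -l hello        # now an x shows up in the permissions
result=$(./hello)  # run it from the current directory
echo "$result"
```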
1.4.9 A fancier cookie factory
Open up the file again:
nano cookiefactory
Change it like this:
#!/bin/bash
# $1 is the first argument (output file), $2 the second (number of cookies)
for i in $(seq 1 1 "$2")
do
    echo "Cookie no. $i" >> "$1"
    echo "------------" >> "$1"
    fortune >> "$1"
    echo >> "$1"
done
Save with ctrl-o and quit nano with ctrl-x. Use it like
./cookiefactory w1 5
to write 5 cookies to a file named w1.
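The script above misbehaves if you forget one of the arguments. One possible refinement, shown here only as a sketch, is to check the number of arguments and to fall back to a default count. The script name cookiecheck and the default of 3 are made up, and echo stands in for fortune so that the example runs without the fortune package.

```shell
cd "$(mktemp -d)"                # work in a scratch directory

# Everything between << 'EOF' and EOF is written to the file cookiecheck
cat > cookiecheck << 'EOF'
#!/bin/bash
# $# holds the number of arguments given on the command line
if [ $# -lt 1 ]; then
    echo "usage: $0 outputfile [count]"
    exit 1
fi
outfile=$1
count=${2:-3}                    # use 3 if no second argument was given
for i in $(seq 1 "$count")
do
    echo "Cookie no. $i" >> "$outfile"
done
EOF

chmod +x cookiecheck
./cookiecheck w2                 # no count given: falls back to 3
./cookiecheck w3 5               # explicit count
first=$(grep -c "Cookie" w2)     # grep -c counts the matching lines
second=$(grep -c "Cookie" w3)
echo "w2 holds $first cookies, w3 holds $second"
```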
1.4.10 Viewing documentation with man
The man command lets you access documentation (so-called man-pages) for the various
programs and commands. For example, type
man less
to learn everything there is to know about the less command.
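Some commands, such as cd, are built into the shell itself rather than being separate programs, and bash documents these through its own help command. A short sketch, assuming that you are running bash:

```shell
# type tells us what kind of thing a command name refers to
what_cd=$(type cd)
echo "$what_cd"

# help prints bash's own documentation for a builtin; head keeps only
# the first three lines (the | is explained in the next section)
help cd | head -n 3
```

Inside man itself, you scroll with the arrow keys and leave with q.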
1.4.11 Searching documentation with apropos
Let’s say we want to convert something to pdf. How can we find out what programs
could help us with that? Type
apropos pdf
The apropos command searches all available man pages for a word or phrase (here
pdf). However, the output that it spits at us may be a bit longish. We can filter it with
the grep command:
apropos pdf | grep -i convert
Use apropos to find out how to get a list of the fonts available on your system.
Several new things have been introduced here:
1. The | character sets up a pipe – the output of the apropos command is fed as
input to the grep command
2. The -i option causes grep to ignore case – both "convert" and "Convert" will be
accepted
As for the list of fonts, try:
apropos fonts | grep -i list
That should give you a short list of search results, among which you should find the
command fc-list.
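Since the output of apropos depends on what is installed on your system, here is a sketch of the same filtering technique on a few made-up lines, so that the result is predictable. The four sample lines are invented and do not describe real programs.

```shell
sample="alpha - Convert apples to oranges
beta - list all oranges
gamma - convert oranges to PDF
delta - Print a PDF file"

# grep -c counts the matching lines instead of printing them
with_i=$(echo "$sample" | grep -i -c convert)      # ignores case
without_i=$(echo "$sample" | grep -c convert)      # case-sensitive

# Pipes can be chained: the second grep filters the output of the first
both=$(echo "$sample" | grep -i convert | grep -i pdf)

echo "with -i: $with_i, without -i: $without_i"
echo "convert and pdf: $both"
```

With -i, two lines match; without it, only the line with the lowercase "convert" does.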
We will learn a few more shell commands later in this course. If you want to learn more
on your own, have a look at http://tldp.org/LDP/Bash-Beginners-Guide/html/
index.html.
Chapter 2
LaTeX
2.1 Prerequisites
We first need to install LaTeX itself, as well as an editor suitable for writing LaTeX documents.
From your package manager, install the LaTeX editor Texmaker. This should automatically also install the essential parts of TexLive, a complete LaTeX installation. If it does
not, manually install package texlive as well.
If you search your package manager for TexLive, it will show you a long list of packages.
The names of some end in ’-recommended’. Install those as well.
Before you do this, it would be a good idea to make sure that your package manager
uses the repository at mirror.csclub.uwaterloo.ca. Downloads from there are very
quick.
2.2 Overview
2.2.1 What is LaTeX?
1. A programmable typesetting system, based on TeX
2. Good for typesetting mathematics – widely used for publishing books or journals
in math and physics
3. Suitable for large, structured documents like reports, papers, books, theses, with
or without mathematics
4. Documents contain a mixture of text and formatting instructions (“markup”)
5. Extensible by user – very many special-purpose packages have been programmed
6. Various output formats; in practice, PDF output is usually what we want
Using LaTeX well needs some study. There is a boatload of documentation available. Here
are some valuable resources:
• the LaTeX FAQ at
http://www.tex.ac.uk/cgi-bin/texfaq2html?introduction=yes
• the “not so short introduction to LaTeX” – should have come with your TexLive
installation. Type texdoc lshort into a console to view it
• CTAN http://www.ctan.org – a repository of all kinds of LaTeX packages. Most
of the mature, widely used packages come with TexLive though.
Notice the texdoc <package> trick used above. For most packages, this should find
and display information installed on your system. Try texdoc mhchem to see if it works.
2.2.2 Example LaTeX markup (source code of slide above)
\begin{frame}\frametitle{What is \LaTeX?}
\begin{enumerate}
\item A programmable typesetting system,
based on \TeX{}
\item Good for typesetting mathematics -- widely used for publishing books or
journals in math and physics
...
\end{enumerate}
\end{frame}
What we can see here is that LaTeX can be used not only for printed documents but also
for slides.
We also see that LaTeX uses the concept of logical markup. The key idea behind logical
markup is the separation of content and presentation: In the text, we only specify what
is a heading, what is a normal paragraph, and so on. Attributes such as font, font weight
and size, color etc. are defined elsewhere, and these definitions can easily be applied or
replaced with others, without changing the text.
2.2.3 “Logical markup”: Separating content from presentation
[Figure: content with logical markup, combined with an external style file that maps logical markup to actual formatting instructions, yields the typeset document]
It is advisable to use logical markup wherever possible. This has several advantages:
1. Your document will have a consistent look
2. You can easily change the layout, without going through your document again
3. Logical markup tends to be more legible and concise
4. You can use your content in different formats
As an example of the last item above, I produce my printed course notes and slides
from the same source files. Huge time saver.
2.3 Examples and exercises
2.3.1 The source file for this presentation
% the beamer document class produces slides
\documentclass[ignorenonframetext, serif]{beamer}
% some customizations reside in this package
\usepackage{beamerslides}
% tell LaTeX where to look for images
\graphicspath{{/data/chem731/images/}}
% here, we include the actual content
\include{latexcontent}
I have a separate file for producing these course notes, which is a bit longer. However,
the key point is again the instruction \include{latexcontent}, and similar instructions for the other chapters.
In the xy-content source files, I have one frame environment for each slide, and notes
like this one between the frame environments. The beamer class option ignorenonframetext
will cause this additional text to be disregarded when creating the slides. For the typeset
notes, I use a simple trick to convert the content of the slides to plain text.
This setup makes it easy to keep slides and notes in sync and is actually quite fun to
work with.
2.3.2 Should I use LaTeX or a word processor?
LaTeX is good
• with large documents (like a thesis)
• if you are in charge of the layout (thesis)
• if you don’t mind spending some time to learn it
LaTeX is no better than a word processor
• with small documents that don’t need much formatting – however, those may be
good for practicing
LaTeX is more trouble than it’s worth
• with paper manuscripts that are going to be typeset by the publisher anyhow
(exception: publishers that ask for LaTeX)
• if you must cooperate on the document with someone who refuses to use LaTeX
If you search the web for guidance on this choice, you will find a lot of LaTeX zealots
ranting about it, and quite often it is clear that they haven’t touched a word processor
in the last 50 years or so. However, there are still valid reasons to prefer LaTeX – it really
gives you more flexibility and power than a word processor.
Basic usage of LaTeX, for example for a thesis, does not take too long to learn. The
automatic placement of figures and tables alone will probably more than compensate
you for the amount of time you need to spend on learning it.
If you decide to stick with a conventional word processor, it is still a good idea to follow
the principle of separating the logical structure of the document from the formatting.
Both Word and OpenOffice let you do this, although it is a little less obvious how.
2.3.3 Exercise 1: Create a LaTeX document
1. Start Texmaker
2. Select File > New
3. Select Wizard > Quickstart
4. Set papersize to letterpaper
5. Set encoding to utf8x
6. Click OK
7. Save the document as exercise1.tex, preferably in a new folder
8. From the first drop-down menu in the tool bar, select PDFLaTeX, and then click on
the blue arrow next to it. This will compile the document.
9. From the second drop-down, select View PDF, and click on that blue arrow. You
should now see the compiled document on the screen.
You have now before you the skeleton of a LaTeX document.
2.3.4 Exercise 1 ctd.
• insert \maketitle
• use lipsum package
• add an abstract
• type some text
• type some lists
• insert sectioning commands
• some font formatting commands: bold, italics, font sizes
• super- and subscripts
• create a shortcut for “Pneumonoultramicroscopicsilicovolcanoconiosis”
• customize hyphenation
• load nicer fonts
• adjust margins
These things are illustrated in the file exercise.tex that I will be sending along. Open
it with Texmaker, run it through PDFLaTeX, and view the resulting PDF file.
2.3.5 Exercise 2: Use the UW thesis template
• Open the file testthesis.tex that I sent around earlier.
• Compile with PDFLaTeX. Does it work? Let me know if it does not.
• Adjust Texmaker’s Quickbuild command: Options > Configure Texmaker >
Quickbuild > User. Into the text field at the bottom, type (all in one line):
pdflatex -interaction=nonstopmode %.tex|
bibtex %.aux|pdflatex -interaction=nonstopmode %.tex|
pdflatex -interaction=nonstopmode %.tex
• From the first drop-down in the tool bar, select Quick Build and hit the blue
arrow next to it.
Chapter 3. Gnuplot
3.1 Installation
Gnuplot can be installed as a package of that name through your friendly package
manager. It is a good idea to also get the package gnuplot-doc, which contains a lot of
worked examples. The documentation for Gnuplot in PDF format does not seem to be
in the package but can be found on Gnuplot’s website.
3.2 Introduction
Gnuplot can
1. plot experimental data
2. plot mathematical functions (y = x^2)
3. plot data and functions together
4. fit function parameters to experimental data
5. plot 3D graphs
If you look around the Gnuplot website, you will see all kinds of fancy, colorful 3D
graphics. I haven’t got enough neurons left to appreciate those – the focus here will be
on 2D graphics and data fitting.
3.3 Plotting functions and files
3.3.1 Start Gnuplot
• Bring up console
• type gnuplot -V to see your program version; make sure you obtain the documentation that matches your version
• type gnuplot
• type plot (x+3)**2 title 'a parabola'
• close plot window, type ctrl+d
As you can see, Gnuplot is driven from the command line. Typing gnuplot -V tells
Gnuplot to simply print its version number and then exit. If you type gnuplot, Gnuplot
starts and sits there, waiting for you to tell it what to do, just like a regular shell. You
can then interactively plot a function, as we have done.
3.3.2 Running a gnuplot script file
• cd to the folder with the practice files I sent around
• run gnuplot poteffx.plt
• When you are done admiring the graph, click into the window
• If you clicked the close button of the window, Gnuplot hangs; press ctrl+d to exit
For anything advanced, however, you don’t want to use Gnuplot interactively, because
it will forget all your hard work once it exits. Instead, you will usually type up all
commands in a script file and then let Gnuplot run it.
The script file contains both commands and explanatory comments. The easiest way to
learn Gnuplot is by looking at and playing with examples. To take full advantage of it,
it is necessary to read the documentation, which is reasonably well written and quite
complete, although a bit short on examples. A good website with worked examples is
http://t16web.lanl.gov/Kawano/gnuplot/index-e.html.
In this exercise, we again saw an interactive display. It is more useful to save the plot
to a file, however.
3.3.3 Saving a plot to file
• Run gnuplot poteff.plt. That should give you a pile of strange-looking text.
This is in fact a PostScript description of the plot. PostScript is a document
description language that is similar to PDF and can easily be converted to it.
• Run gnuplot poteff.plt > test.eps to send the PostScript to a file.
• Run gv test.eps to admire the fruit of your hard work.
• Run epstopdf test.eps. This will convert the eps file to a pdf file, which we can
for example use in LaTeX.
• As a shortcut, run ./gnuplot-pdf poteff.plt
• Convert PDF to png: convert -density 300 poteff.pdf poteff.png
Gnuplot can produce plots in various formats. To this end, it uses a variety of different
“terminals”, or output routines. The dirty little secret is that these all have their different
settings, abilities and limitations.
The EPS (encapsulated PostScript) terminal is mature and versatile. EPS can easily be
converted to other graphics file formats, in particular pdf. For use in LaTeX, you should
always use a vector graphics format, which in practice means PDF.
The little gnuplot-pdf script runs gnuplot and epstopdf first and then shows you the
result in gv. Once you close gv, you still have the pdf file.
Only where you cannot use PDF, such as on a web page, should you use pixel graphics
(PNG). Most word processors still can’t use PDF, so you are stuck with PNG. The convert
utility lets you control the resolution of the resulting file. Use a high resolution to make
it look good in print, for example convert -density 600 plot.pdf plot.png.
Addendum: I have found that convert makes some dents into the plot graphs occasionally. It seems that pdftoppm works better. This program is part of the package
poppler-utils. Type man pdftoppm to find out how to use it.
3.3.4 Including a plot in LaTeX
• Bring up Texmaker
• Load the file poteffplot.tex
• Run it through PDFLaTeX and look at the output.
This is only a quick illustration that Gnuplot and LaTeX go well together. We won’t
elaborate further.
3.3.5 Plotting data files
• Run ./gnuplot-kpdf sw17.plt
• Looks nice, too, doesn’t it?
• Run less sw17.plt to inspect the file
• Hit q and then run less sw17.dat
This example shows how to plot data from a file. The data are organized in columns
separated by one or more spaces. The columns are selected with the using clause, for
example using 1:3 uses the first column as x, and the third column as y values.
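To make the column selection concrete, here is a minimal Python sketch of what a clause like using 1:3 does with a data file. The function name select_columns and the sample data are made up for illustration; they are not taken from sw17.dat, and this is of course not how Gnuplot itself is implemented.

```python
def select_columns(text, x_col, y_col):
    """Return (x, y) pairs from whitespace-separated columns, counting from 1."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        # skip empty lines and comment lines starting with '#'
        if not line or line.startswith("#"):
            continue
        fields = line.split()  # splits on one or more spaces
        pairs.append((float(fields[x_col - 1]), float(fields[y_col - 1])))
    return pairs


# hypothetical three-column data
sample = """# time  signal_a  signal_b
0.0   1.0   10.0
1.0   2.0   20.0
2.0   4.0   40.0
"""

points = select_columns(sample, 1, 3)  # the equivalent of "using 1:3"
print(points)
```

Changing the call to select_columns(sample, 1, 2) corresponds to using 1:2, that is, the second column becomes y.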
The plot file is much shorter than the previous one, since most of the settings have
now been factored out into a separate file that is simply loaded at the beginning. On
my computer, I keep a similar setup file in /gnuplot/setup_eps.plt, and I just load
it into new plot files with load ’ /gnuplot/setup_eps.plt’. This saves me finger
strokes and gives my plots a consistent look every time.
Settings that we want to change can still be overridden – for example if we first say set
logscale x and later unset logscale x, the second command will take effect.
3.3.6 Working with data files in different formats
• Run less a-chym.dat – note that the file contains no x values
• Press q, run gnuplot achym-dat – values are plotted, but from x=0 (wrong)
• Run gnuplot, then plot 'a-chym.dat' using ($0+178):1
• Other transformations could also be applied, for example plot 'a-chym.dat'
using (log($0+1)):1 could be used to construct a logarithmic x axis.
3.3.7 Plotting CSV files; multiple data sets
• Widely used low-tech format, easy export and import with spreadsheets
• Just say set datafile separator ','
• File chym.csv is an example – see whether you can get it plotted
• Run ./gnuplot-pdf waldhoer.plt; inspect files
Multiple data sets can reside in one file when separated by two or more empty lines. The
index clause selects one or more data sets, counting from 0; for example index 0:0 selects
only the first data set, whereas index 1:2 selects the second and third data set (you
would not often need multiple selection, though).
Not illustrated: Intervals can be selected with every, for example every 5 selects every
fifth value only.
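The following Python sketch mimics the two selection mechanisms just described: splitting a file into data sets at two or more empty lines (index), and picking every n-th value (every). The function name split_data_sets and the sample data are assumptions made for this illustration only.

```python
def split_data_sets(text):
    """Split file content into data sets separated by two or more empty lines."""
    data_sets, current, blanks = [], [], 0
    for line in text.splitlines():
        if not line.strip():
            blanks += 1  # count consecutive empty lines
            continue
        if line.lstrip().startswith("#"):
            continue  # ignore comment lines
        if blanks >= 2 and current:
            data_sets.append(current)  # two or more empty lines end a data set
            current = []
        blanks = 0
        current.append([float(f) for f in line.split()])
    if current:
        data_sets.append(current)
    return data_sets


# hypothetical file content with three data sets
sample = "0 10\n1 11\n2 12\n3 13\n\n\n0 20\n1 21\n\n\n0 30\n1 31\n"

data_sets = split_data_sets(sample)
second_and_third = data_sets[1:3]   # the equivalent of "index 1:2"
every_second = data_sets[0][::2]    # the equivalent of "every 2" on the first set
print(len(data_sets), second_and_third, every_second)
```

Note that the Python slice 1:3 excludes its upper bound, whereas index 1:2 includes it.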
3.4 Curve fitting with Gnuplot
In addition to plotting functions and data, Gnuplot also lets us fit functions to data sets.
Function fitting (or curve fitting) can be done for different purposes:
1. Testing a theoretical model with measurement data – here, we typically need to fit
and test alternative models. Example: Single- vs. double-exponential fluorescence
decay
2. Obtaining values for the parameters of an accepted model for a given set of
experimental data. Example: Time course of drug excretion in a single patient –
we assume it’s single exponential and don’t consider any alternatives
3. Creating trend lines to “guide the eye”, using some arbitrary, simple equations
that need not have any exact physical meaning (e.g. the Hill equation)
Gnuplot employs the widely used Levenberg-Marquardt numerical fitting algorithm,
which can be used to fit arbitrary functions to given data sets. The one important
limitation in Gnuplot is that the function must be provided in an explicit form, that is
we cannot use iterative numerical procedures to calculate the value of the function, as
may be necessary if for example the function is only defined by a system of differential
equations. In that case, we need a real programming language such as Python, which lets
us use the Levenberg-Marquardt algorithm together with arbitrary ways of computing
function values.
3.4.1 Theories without adjustable parameters
The law of Hagen and Poiseuille describes the volume flow rate of laminar flow in a capillary:
\[ \frac{dV}{dt} = \frac{\pi r^4 \Delta p}{8 \eta l} \]
The answer to the ultimate question of life, the universe, and everything, according to
Douglas Adams:
42
Curve fitting requires that some of the parameters of a given function can reasonably
be treated as variable. The point of these examples is to show that this is not always
the case – in such cases, curve fitting is not applicable.
3.4.2 Theories with adjustable parameters
Hooke’s law: The extension of a spring is proportional to the force applied to it
\[ F = -kx \]
Michaelis-Menten law of enzyme reaction velocity: The velocity is proportional to the
substrate saturation, which in turn follows mass action kinetics
\[ V = V_{max} \frac{[S]}{[S] + K_M} \qquad (3.1) \]
In the case of Hooke’s law, k is the variable parameter; it is not universal but is a property
of the particular spring under study that must be determined experimentally. Since
Hooke’s law is a linear equation, we don’t need numeric fitting but can simply apply
linear regression. However, numeric fitting can do the job also, and indeed Gnuplot
doesn’t seem to provide built-in linear regression.
The Michaelis-Menten law is not linear, so we cannot directly apply linear regression.
We can apply it if we transform the equation into a linear shape; this is the point of
the Lineweaver-Burk plot and some other plots that are traditionally taught in enzyme
kinetics. However, using numerical fitting, we can evaluate the data directly and can
avoid the distortion inherent in the linear transformations.
3.4.3 Numerical curve fitting by gradient descent
[Figure: the total error plotted as a surface over parameter 1 and parameter 2, with the starting point (arbitrary parameter values) marked]
Numerical fitting minimizes the total error as a function of all variable parameters.
With two variable parameters, we can envision the error function as a surface with
valleys and mountains; the lowest point is at the optimal combination of values for the
two parameters. Numerical fitting algorithms like the one devised by Levenberg and
Marquardt work by gradient descent, that is by climbing downhill on this error surface.
To get going, we need to provide a starting point, that is supply some arbitrary values
for the variable parameters. The algorithm then explores the terrain close by and moves
a step in a downhill direction. It repeats this until it can find no lower point, and then
stops.
Even without going into the mathematics (which I would have to retrieve from a textbook
also) this sketch explains why numerical fitting is general: The algorithm only considers
the slope of the error surface, it does not care about the underlying function that erects
the surface.
With more than two variable parameters, this visualization no longer works, but the
idea is still the same – march “downhill” in an (n + 1)–dimensional space.
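The downhill march can be sketched in a few lines of Python. This is plain gradient descent on a made-up error surface with two parameters, meant only to illustrate the idea; the Levenberg-Marquardt algorithm is more elaborate, since it blends gradient descent with the Gauss-Newton method. The error function, the starting point and the step size are all arbitrary assumptions.

```python
def total_error(p1, p2):
    # hypothetical error surface: a bowl with its lowest point at (10, 7)
    return (p1 - 10.0) ** 2 + 2.0 * (p2 - 7.0) ** 2


def gradient(f, p1, p2, h=1e-6):
    """Estimate the slope of f at (p1, p2) by probing points close by."""
    d1 = (f(p1 + h, p2) - f(p1 - h, p2)) / (2 * h)
    d2 = (f(p1, p2 + h) - f(p1, p2 - h)) / (2 * h)
    return d1, d2


def descend(f, p1, p2, step=0.1, tolerance=1e-9, max_steps=10000):
    """Move downhill until a step no longer lowers the error noticeably."""
    error = f(p1, p2)
    for _ in range(max_steps):
        d1, d2 = gradient(f, p1, p2)
        new_p1, new_p2 = p1 - step * d1, p2 - step * d2
        new_error = f(new_p1, new_p2)
        if error - new_error < tolerance:
            break  # no lower point found nearby: stop
        p1, p2, error = new_p1, new_p2, new_error
    return p1, p2, error


# arbitrary starting values for the two parameters
best_p1, best_p2, best_error = descend(total_error, 6.0, 12.0)
print(best_p1, best_p2, best_error)
```

The routine never looks inside total_error; it only probes its values, which is why the same procedure works for any model function.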
3.4.4 Example: Receptor activation by ligand
Assumptions:
1. Ligand binding to the receptor follows mass action kinetics
2. Receptor is completely inactive without ligand bound, and fully active with ligand
bound. That is, receptor activation equals receptor saturation.
3. Ligand binds according to law of mass action: K = [L][R] / [LR]
The degree of receptor activation, as a function of [L], then becomes:
\[ A = \frac{[L]}{[L] + K} \qquad (3.2) \]
This is a very simple case, with just one variable parameter (K). Example data, and a
Gnuplot file to fit them to the above equation are listed below. If you run the Gnuplot
file in listing 3.1, it will tell you that the best fit to the data is obtained with k = 716.717.
It also prints a whole lot more information, some of which we will consider later.
[Figure: receptor activation (%) versus ligand concentration (nM, logarithmic axis from 10^1 to 10^5), showing the data points together with the curves for k_start and k_fit]
This figure uses the same data as listing 3.2 and shows the curves obtained with an
arbitrarily chosen starting value for K, as well as with the fitted K. The plot file that
produces this figure is given in listing 3.3.
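For readers who want to see the fit without Gnuplot, here is a rough Python sketch that fits K in equation 3.2 (scaled to 100%) to the data from listing 3.2. It does not use Levenberg-Marquardt; since there is only one parameter, a simple golden-section search over log10(K) is enough. That choice, and the search interval from 1 to 10^5, are my assumptions. The result should land close to the value that Gnuplot reports.

```python
import math

# data from listing 3.2: ligand concentration, receptor activation (%)
DATA = [
    (9.82516e-1, -4.30702e-1),
    (1.01527e+1, 1.74836e+0),
    (9.68155e+1, 1.31456e+1),
    (3.09817e+2, 3.12565e+1),
    (9.15102e+2, 5.78764e+1),
    (3.16345e+3, 7.84713e+1),
    (1.01471e+4, 8.77169e+1),
    (9.67349e+4, 1.00178e+2),
]


def activation(x, k):
    # receptor activation function, scaled to 100%
    return 100.0 * x / (k + x)


def sum_of_squares(k):
    # unweighted total error for a given k
    return sum((activation(x, k) - y) ** 2 for x, y in DATA)


def golden_section_minimum(f, low, high, tolerance=1e-8):
    """Minimize f on [low, high], assuming a single minimum in that interval."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = low, high
    while b - a > tolerance:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) < f(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return (a + b) / 2.0


# search on log10(k), because plausible values span several orders of magnitude
log_k = golden_section_minimum(lambda u: sum_of_squares(10.0 ** u), 0.0, 5.0)
k_fit = 10.0 ** log_k
print(k_fit, sum_of_squares(k_fit))
```

With more than one variable parameter, a one-dimensional search like this no longer suffices, which is where general algorithms such as Levenberg-Marquardt come in.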
3.4.5 The 5-HT2B receptor can be up- and down-regulated by ligands
[Figure: IP3 release (%) versus ligand concentration (µM, logarithmic axis from 10^-3 to 10^4) for aripiprazole and serotonin]
The 5-HT2B receptor activates phospholipase C, which releases inositol trisphosphate
(IP3). The drug aripiprazole increases receptor activity, but the physiological ligand
serotonin decreases it; therefore, receptor activity must be greater than zero without
the drug.
Clearly, we need to modify our activity function to account for this behaviour.
3.4.6 Receptor activation or inhibition by ligand – theory
Assumptions:
1. Ligand still binds according to law of mass action: K = [L][R] / [LR]
2. Receptor has a basal level of activity A_free in the unbound state, and some other
level of activity A_bound with bound ligand.
→ Degree of receptor activation, as a function of [L]:
\[ A = A_{free} + (A_{bound} - A_{free}) \frac{[L]}{[L] + K} \qquad (3.3) \]
In equation 3.3, A_free could be treated as variable, or as fixed. The activity data shown in
section 3.4.5 have been normalized to 100% in the absence of ligand, so it seems natural
to fix A_free at that value. On the other hand, since not all data points were used in that
normalization, we might obtain a nicer looking fit by letting A_free float. Which choice
is better?
3.4.7 How many variable parameters should we use?
“With four parameters I can fit an elephant, and with five I can make him wiggle his
trunk.”
John von Neumann
The more variable parameters we allow, the more likely
it becomes that the theoretical model will be able to adopt a shape consistent with the
experimental data, even without any inherent physical validity. So, generally speaking,
the fewer variable parameters, the better; parameters should only be treated as variable
if there is a sound reason for it.
So, in our example, the right choice is to fix A_free.
3.4.8 How good is the fit?
If we run the Gnuplot file from the above example like so:
gnuplot iprelease.plt
we see the summary of the second fit close to the bottom of the screen, but the first
one is buried in clutter. Let’s filter the output for the information we want:
gnuplot iprelease.plt 2>&1 | grep variance
This gives us:
variance of residuals (reduced chisquare) = WSSR/ndf : 43.7331
variance of residuals (reduced chisquare) = WSSR/ndf : 37.1166
First off, how did the output filtering work? With the | character, we sent Gnuplot’s output
through a pipe to grep, which filtered it for lines containing the word “variance”. The
idea of a pipe is that one program’s output becomes the second program’s input.
To make this work, we had to first redirect Gnuplot’s output from its so-called stderr,
or standard error stream, to its stdout, or standard output stream, since only the latter
can be attached to a pipe. The numbers 2 and 1 are so-called file handles with which we
can refer to stderr and stdout, respectively.
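The same plumbing can be reproduced from Python, which may help to see what 2>&1 and the pipe each contribute. In this sketch, a small child process stands in for Gnuplot (an assumption, so that the example runs without Gnuplot or any data files); it writes its fit report to stderr, just as Gnuplot does.

```python
import subprocess
import sys

# hypothetical stand-in for Gnuplot: a child process that reports on stderr
child_code = (
    "import sys\n"
    "print('iteration 1', file=sys.stderr)\n"
    "print('variance of residuals: 43.7', file=sys.stderr)\n"
    "print('plot written')\n"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,     # capture stdout, like attaching it to a pipe
    stderr=subprocess.STDOUT,   # merge stderr into stdout, like 2>&1
    text=True,
)

# keep only the lines containing "variance", like "| grep variance"
matches = [line for line in result.stdout.splitlines() if "variance" in line]
print(matches)
```

If the stderr=subprocess.STDOUT argument is left out, the stderr lines bypass the captured stream and the filter finds nothing, which is exactly what happens on the shell without 2>&1.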
The two variance (total error) values that we obtain are for the two fits for the first and
the second data set, respectively. This is what Gnuplot thinks is the reduced χ² value
(see below). For a real calculation of χ², we would need the error of measurement. Since
we don’t have the experimental error here, these χ² values don’t mean much; they are
only useful if we have others to compare them to. Here, we could compare them to an
alternative fit, with A_free set as variable.
Rewriting the Gnuplot file to perform this fit is left as an exercise to the gentle reader.
If you did it right, you should see this output:
variance of residuals (reduced chisquare) = WSSR/ndf : 48.749
variance of residuals (reduced chisquare) = WSSR/ndf : 39.4221
These variances are higher than before – the fit got worse, not better. So, we were right
to use a fixed A_free in the first place.
3.4.9 Testing exact theories with inexact data
“From all this it is plain that these observations agree with theory, so far as they agree
with one another.”
Isaac Newton, in discussing his calculations on the comet of 1680
Succinct expressions of difficult concepts, like this specimen, are the hallmark of true genius. In
my own, not quite so succinct words: If we want to use curve fitting to test the validity
of some theoretical model, we need to know the limits of experimental accuracy. The
experimental error is usually estimated from the variation of repeated measurements.
We then use the following rationale:
total error = measured data − predicted value (3.4)
error of theory = total error − measurement error (3.5)
If all observed errors are accounted for by errors of measurement, error of theory is
zero, and the theory is true. Otherwise, the theory is false.
The above pseudo-equations outline the principle but are not literally true. Instead
of the differences between measurement and prediction, we consider the squares of
those differences. This is based on the assumption that errors of measurement are
statistically distributed around a mean value: Two small deviations are more likely and
therefore are of less concern, if observed, than one large deviation.
3.4.10 Testing a theory with adjustable parameters
A theory with adjustable parameters is considered valid if there exists a combination of
values for these parameters that will yield an overall error no greater than the expected
error of measurement.
Therefore, we need to
1. Find the best possible combination of values for the adjustable parameters, in
order to minimize the overall error. This is done through numerical fitting.
2. Compare the remaining error to the known error of measurement.
A parameter that helps us to compare the overall error to the experimental error is the
reduced χ².
3.4.11 Evaluating the fit error: χ²
Definition:
\[ \chi^2 = \left(\frac{t_1 - m_1}{\sigma_1}\right)^2 + \left(\frac{t_2 - m_2}{\sigma_2}\right)^2 + \ldots + \left(\frac{t_n - m_n}{\sigma_n}\right)^2 \]
t_1, t_2 ... t_n: theoretical values
m_1, m_2 ... m_n: measured values
σ_1, σ_2 ... σ_n: standard deviations for measured values
The reduced χ² is used to make the error estimate independent of the number of data
points:
\[ \chi^2_{red} = \frac{\chi^2}{n - p} \]
where n is the number of data points and p is the number of adjustable parameters; the
difference n − p is the number of degrees of freedom (ndf in Gnuplot’s output).
In the above example (section 3.4.6), we noticed that χ²_red increased with the introduction
of another free parameter. We can see now how this works – a greater number of
variable parameters decreases the denominator of χ²_red.
In the ideal case, in which all remaining error is error of measurement only, χ²_red should reach
a value of 1. In reality, it will usually remain somewhat higher; deciding whether a given
fit is “good enough” often is somewhat arbitrary.
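As a worked illustration of the two definitions, here is a short Python sketch. The numbers are made up so that the arithmetic is easy to follow by hand; the function names are my own.

```python
def chi_square(theory, measured, sigma):
    """Sum of squared differences, each scaled by its standard deviation."""
    return sum(((t - m) / s) ** 2 for t, m, s in zip(theory, measured, sigma))


def reduced_chi_square(theory, measured, sigma, n_parameters):
    """Chi-square divided by (number of data points - number of parameters)."""
    degrees_of_freedom = len(measured) - n_parameters
    return chi_square(theory, measured, sigma) / degrees_of_freedom


# hypothetical values, not taken from any of the course data sets
theory = [10.0, 20.0, 30.0, 40.0]
measured = [11.0, 19.0, 32.0, 40.0]
sigma = [1.0, 1.0, 1.0, 1.0]

chi2 = chi_square(theory, measured, sigma)                     # 1 + 1 + 4 + 0
chi2_red_1 = reduced_chi_square(theory, measured, sigma, 1)    # divided by 4 - 1
chi2_red_2 = reduced_chi_square(theory, measured, sigma, 2)    # divided by 4 - 2
print(chi2, chi2_red_1, chi2_red_2)
```

With the same total error, going from one to two adjustable parameters raises the reduced value from 2 to 3: an extra parameter has to lower the total error enough to pay for the smaller denominator.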
3.4.12 How do we obtain the standard deviations of the measured values?
1. Repeat the measurements a sufficient number of times. Triplicate measurements
are often used but not statistically reliable; 10 repetitions is more like it
2. If the signal consists of a number of discrete counts, such as photons in a photon-counting fluorescence detector or in a β- or γ-counter, we can estimate the standard
deviation of the signal N according to: σ = √N
The first approach is universal. The second one is convenient, but it only gives the
theoretical minimum value of the experimental variance, which results from counting
statistics alone. Possible sources of error such as baseline noise from the
detector or intensity fluctuations of the light source will not be accounted for.
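A quick numerical look at the counting rule σ = √N, as a minimal Python sketch with arbitrarily chosen count values: the absolute error grows with the signal, but the relative error shrinks.

```python
import math


def counting_sigma(counts):
    """Standard deviation expected from counting statistics alone."""
    return math.sqrt(counts)


# hypothetical count values
for counts in (100, 10000, 1000000):
    sigma = counting_sigma(counts)
    print(counts, sigma, sigma / counts)  # counts, absolute error, relative error
```

Going from 100 to 10000 counts reduces the relative error from 10% to 1%, which is why longer counting times give cleaner data.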
3.4.13 A practical exercise: Calcium binding to daptomycin
[Figure: chemical structure of daptomycin]
Daptomycin is a lipopeptide antibiotic. It contains some non-standard amino acids,
including a kynurenine residue (the lowermost aromatic side chain in the figure) that is
intrinsically fluorescent. Fluorescence is bright if daptomycin is bound to membranes,
but it is dim when it is in solution.
3.4.14 What daptomycin is supposed to do
[Figure: schematic of daptomycin and calcium (Ca) in solution, on PC membranes and on PC/PG membranes; further labels: PG, K+]
Daptomycin binds to membranes and forms oligomers. Binding and oligomerization
depend on calcium ions and on negatively charged lipids such as phosphatidylglycerol (PG) in the membrane.
3.4.15 One or more types of binding sites for calcium?
• Incubate daptomycin with membranes and calcium
• At t=0, add EDTA to capture calcium
• Follow kinetics of daptomycin fluorescence decrease
It is not known for certain how the calcium ions interact with daptomycin and with
the lipids on the membrane. In the simplest case, all binding sites for calcium could
be equivalent, with the same rate of binding and dissociation. In this case, withdrawal
of calcium with a chelator (EDTA) at t=0 should cause a single-exponential decrease in the
daptomycin fluorescence. On the other hand, if there are different classes of binding
sites with different rates of calcium dissociation, the time course of the fluorescence
should have two or more exponential terms.
3.4.16 Daptomycin fluorescence after addition of EDTA at t = 0
[Figure: fluorescence (cps/10^6, axis from 0 to 1.2) versus time (0 to 60 minutes)]
3.4.17 A single-exponential model
\[ F = F_{basal} + F_{inc} \, e^{-t/\tau} \qquad (3.6) \]
In the experimental time course, the intensity drops off fast initially and then seems to
level off. It is reasonable to assume that there should be some residual fluorescence at
t = ∞, which would correspond to the fluorescence of daptomycin in solution. On top
of this basal fluorescence, we assume an additional component that at t = 0 equals
F_inc and undergoes a single-exponential decay.
This model is easily extended by adding more exponential terms, each with its own
pre-exponential (F_inc) and time constant (τ).
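The extension from one to several exponential terms can be written down compactly. The Python sketch below evaluates equation 3.6 for any number of terms; the function name and all parameter values are made up for illustration and are not fit results.

```python
import math


def fluorescence(t, f_basal, components):
    """Basal fluorescence plus a sum of exponential decays.

    components is a list of (f_inc, tau) pairs, one per exponential term.
    """
    return f_basal + sum(f_inc * math.exp(-t / tau) for f_inc, tau in components)


# hypothetical parameter values (time constants in seconds)
single = [(1.0, 300.0)]
double = [(0.6, 60.0), (0.4, 900.0)]

start = fluorescence(0.0, 0.2, double)     # basal level plus all pre-exponentials
late = fluorescence(1.0e6, 0.2, double)    # all decays finished: basal level only
print(start, late, fluorescence(300.0, 0.2, single))
```

At t = 0 the model returns the basal level plus the sum of all pre-exponentials, and at very long times only the basal level remains, in line with the assumptions stated above.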
3.4.18 Fitting with 1 to 4 exponential terms
The Gnuplot script edtafit.plt (listing 3.6) contains the code for running all fits, one
after another. Invoke like so:
gnuplot edtafit.plt 2>&1 | tee fitresults
The | tee trick duplicates the (redirected) output of the gnuplot command – we get to
see it on the screen, and a copy is saved to the file fitresults. This is convenient for
later analysis.
Alternatively, you can use the file fit.log, in which Gnuplot accumulates the output
of all fits.
Sit back and enjoy the numbers scrolling by. Each screenful of numbers shows the
results of one iteration. You may notice that the iterations become slower, and substantially more numerous, as we go from simple to complex models.
The Gnuplot script contains a lot of comments and explanations – it merits a good going
over. In particular, notice how we obtain the error estimates from the fluorescence
intensities, according to σ = √N.
3.4.19 Where are the parameters obtained from the fit?
grep -A 11 'Final set' fitresults
With the -A 11 option, grep not only prints each line matching Final set but also the
next 11 lines. Try man grep to learn more about grep’s power. $0.50 (Canadian Tire)
for anyone who comes up with a problem that grep can’t solve.
3.4.20 Which fit is the best?
grep variance fitresults
should give you
variance of residuals (reduced chisquare) = WSSR/ndf : 1448.6
variance of residuals (reduced chisquare) = WSSR/ndf : 26.8559
variance of residuals (reduced chisquare) = WSSR/ndf : 2.89905
variance of residuals (reduced chisquare) = WSSR/ndf : 1.444
These are the χ²_red values for the 1-, 2-, 3- and 4-exponential fit, respectively. What do
we make of them?
Remember that χ²_red should ideally reach a value of 1 for a perfect fit. Looking at those
numbers above, it is clear that only the 3- and the 4-exponential models come close
enough for consideration. In contrast, the single-exponential model is way beyond the
moon, and the 2-exponential model is still in orbit.
3.4.21 Plotting the fit residuals
The results of the fit can also be visualized using the fit residuals:
\[ \text{residuals} = \frac{t - m}{\sigma} \qquad (3.7) \]
which ideally should be just statistical flicker around zero.
Run the file edtadiffs.plt to see the plots of residuals from all four fits.
3.4.22 Residuals from a good fit (4 exponentials)
[Figure: residuals (axis from -5 to 5) versus time (0 to 3600 seconds)]
Here, the residuals are fairly evenly distributed around zero; only in the first ~500
seconds is there some apparent systematic distortion that represents data not fitted
adequately.
3.4.23 Residuals from a poor fit (2 exponentials)
[Figure: residuals (axis from -20 to 20) versus time (0 to 3600 seconds)]
Here, the random noise is much smaller than the large movements of the entire curve,
which represent a substantial residue that is not adequately covered by the model.
Therefore, a 2-exponential model is too simple.
3.4.24 So have we found the truth?
Remember John von Neumann: We may have found the truth, but we may also have
fitted a wiggling elephant.
All we can really say is that the kinetic data do not support a model in which a single class of calcium binding sites, or even two classes of sites, kinetically control the
release of daptomycin from the membrane – the kinetics of calcium and daptomycin
dissociation is more complex than that.
3.5 Code and data listings
Listing 3.1: Gnuplot script to fit activation of dopamine receptors by aripiprazole
(gppractice/arifit.plt)
# this minimal file only performs the data fit, no plotting.
# read data from comma-separated file
set datafile separator ","
# define the receptor activation function, scaled to 100%
activation(x, k) = 100 * x / (k + x)
# set an initial value for k
k_fit = 5000
# next comes the call to the fit routine. The 'via' clause
# indicates which parameters are to be treated as variable
# here, we have only one, but we still need to declare it.
fit activation(x,k_fit) 'ari3-da.csv' using 1:2 via k_fit
Listing 3.2: Datafile for Gnuplot script in listing 3.1 (gppractice/ari3-da.csv)
#"GTPgS-binding(dopamine), created by Plot Digitizer, 2.4.1"
#"Date: 11/22/07, 7:28:29 PM"
#dopamine,GTP-gamma-S-binding
9.82516E-1,-4.30702E-1
1.01527E+1,1.74836E+0
9.68155E+1,1.31456E+1
3.09817E+2,3.12565E+1
9.15102E+2,5.78764E+1
3.16345E+3,7.84713E+1
1.01471E+4,8.77169E+1
9.67349E+4,1.00178E+2
Listing 3.3: Gnuplot script to fit and plot dopamine receptor activation by aripiprazole. The
data are again from listing 3.2 (gppractice/arifitplot.plt)
# settings for the plot
load "setup_eps.plt"                     # set up the eps terminal
set output "arifitplot.eps"              # write plot to this file
set logscale x                           # logarithmic x-axis
set xtics 10, 10, 1e5                    # set x-axis tics from 10 to 10^5
                                         # in intervals of 10
set mxtics 1                             # hide minor axis tics by
                                         # setting them to 1 per major tick
set format x "10^{%T}"                   # format numbers on x-axis as powers of 10
set xrange [8:1.05e5]                    # define the range of the x axis
                                         # a little space on the sides
set xlabel "Ligand concentration (nM)"
set ylabel "Receptor activation ({/Symbol %})" offset 1.5,-0.25
set yrange [-10:101]
set ytics 0, 20, 100
# note that we don’t set a log y scale
set key top left
# location of the plot legend
# done with the formatting stuff, now on to the actual work
set datafile separator ","   # read data from comma-separated file
# receptor activation function, scaled to 100%
activation(x, k) = 100 * x / (k + x)
k_start = 5000               # set an initial value for the variable
                             # parameter and remember it
k_fit = k_start              # k_fit will be different after fitting
# call the fitting routine
fit activation(x,k_fit) 'ari3-da.csv' using 1:2 via k_fit
# k_fit now contains the optimized value. Plot the data and the
# function, with both the initial and the fitted values for k.
plot "ari3-da.csv" using 1:2 title "" with points pt 6 , \
activation(x, k_start) title "k_{start}" with lines lt 2, \
activation(x, k_fit) title "k_{fit}" with lines lt 1
Listing 3.4: Gnuplot script to fit up- and down-regulation of serotonin receptors
(gppractice/iprelease.plt)
# IP3 release after 5-HT2B receptor activation
# fit and plot dose-effect curves for aripiprazole and serotonin
load "setup_eps.plt"
set output "iprelease.eps"
set datafile separator ","
set format x "10^{%T}"
set xlabel "Ligand concentration ({/Symbol m}M)"
set logscale x
set key top left width -6 font "Helvetica,18"
set ylabel "IP_{3} release (%)" offset 1.5,-0.25
set yrange [40:160]
set xrange [0.5e-3:1.1e4]
set ytics 40, 40, 160
set mytics 1
# the interesting part
# receptor activity function. We fix the starting activity at 100%.
# K, as well as the final activity, will vary freely.
activity(x, a_final, k) = 100 + (a_final - 100) * x / (k + x)
# define separate variables to be fitted for the two data sets
# aripiprazole
a_final_ari= 50
k_ari = 100
# serotonin (5-ht)
a_final_ht = 150
k_ht = 100
# perform the fits
fit activity(x, a_final_ari, k_ari) "pi-hydrolysis.csv" \
using 1:2 index 0:0 via a_final_ari, k_ari
fit activity(x, a_final_ht, k_ht) "pi-hydrolysis.csv" \
using 1:2 index 1:1 via a_final_ht, k_ht
# here, we plot only the fitted functions, not the starting ones
plot "pi-hydrolysis.csv" using 1:2 index 0:0 with points pt 6 title "aripiprazole", \
activity(x, a_final_ari, k_ari) with lines lt 3 title "", \
"" using 1:2 index 1:1 with points pt 7 title "serotonin", \
activity(x, a_final_ht, k_ht) with lines lt 1 title ""
Listing 3.5: The data file for listing 3.4 (gppractice/pi-hydrolysis.csv)
# phosphatidylinositol hydrolysis in response to
# serotonin receptor type 2B activation
# two data blocks, separated by two or more empty lines
# with aripiprazole. This data block is selected
# with ’index 0:0’ in the corresponding plot file
9.80193E-3,1.01172E+2
1.02716E+0,8.41158E+1
1.07407E+1,7.79321E+1
3.19429E+1,5.86920E+1
1.08057E+2,4.72681E+1
1.08000E+3,4.99374E+1


# with 5-HT (serotonin). Selected with 'index 1:1'
1.02478E-3,9.58997E+1
1.02129E-1,1.13217E+2
1.06958E+0,9.92212E+1
9.77547E+0,1.14909E+2
3.43921E+1,1.24840E+2
1.06137E+2,1.38414E+2
1.01405E+3,1.49415E+2
1.01268E+4,1.56251E+2
Listing 3.6: Gnuplot script that fits single- to quadruple-exponential decays to the daptomycin
EDTA dissociation kinetics (gppractice/edtafit2.plt)
# fit the edta kinetics experiment, no plots
# somewhat simplified from a version that was included earlier
set datafile separator ","
# we test out between one and four exponential decays. We will declare
# only one function for all these different cases (named exp4):
exp4(t) = fbas + \
finc1 * exp(-t/tau1) + \
finc2 * exp(-t/tau2) + \
finc3 * exp(-t/tau3) + \
finc4 * exp(-t/tau4)
# note that this function only receives one parameter - the time.
# The other parameters must exist in the "global" space - we will
# define them below.
#
# we glean initial values for the parameters from the data. At time 0,
# the intensity is ~1.2 million, at the end it's around 0.3 million.
# So, we use 0.3 million as the basal fluorescence, and the remainder
# as the incremental fluorescence that participates in the exponential
# decay(s) - the pre-exponential.
ftotal = 1.2e6
fbas = 3e5
finctotal = ftotal - fbas
# Initially, we use only one exponential term. We assign it all the
# incremental fluorescence as the pre-exponential.
finc1 = finctotal
# Fooling exp4: we set all unused pre-exponentials to zero,
# so that they will not affect the result of the calculation.
finc2 = finc3 = finc4 = 0
# we use a guess for the time constant
tau1 = 300
# we set all other time constants to 1, so that we don’t get a zero
# division error - the tau values are in the denominator of the exponent
tau2 = tau3 = tau4 = 1
# fit the single-exponential model. The only parameters that we will
# allow to vary are fbas, finc1 and tau1. This is determined by the 'via'
# clause.
# also note the 'using' clause: The third element is the error associated
# with each data point. Here, we use the square roots of the intensities
# as estimates for the error, which applies to measurements of stochastic
# signals such as fluorescence, radioactivity and similar.
fit exp4(x) "edta_kinetics.csv" \
using 1:2:(sqrt($2)) \
via fbas, finc1, tau1
# for the two-exponential fit, we assign finc2 and tau2 some initial
# values and include them in the via clause.
# we will use an ad-hoc construction method for the initial parameter
# values that we can later extend to the 3- and 4-exponential fits.
interval = 10
tau1 = 50
tau2 = interval * tau1
finc1 = finc2 = finctotal/2
fit exp4(x) "edta_kinetics.csv" \
using 1:2:(sqrt($2)) \
via fbas, finc1, tau1, finc2, tau2
# lather, rinse, repeat
interval = 5
tau1 = 20
tau2 = interval * tau1
tau3 = interval * tau2
finc1 = finc2 = finc3 = finctotal/3
fit exp4(x) "edta_kinetics.csv" \
using 1:2:(sqrt($2)) \
via fbas, finc1, tau1, finc2, tau2, finc3, tau3
# apply reconditioner. Here, I had to tweak the interval and tau1, because
# the fit would abort with ’undefined value’ errors. That probably resulted
# from attempted zero division. Changing the initial parameters will
# change all subsequent numbers computed during the fit, and so with
# some trial and error one can sidestep this problem.
# You can provoke an error with interval=5 and tau1=15.
interval = 5
tau1 = 10
tau2 = interval * tau1
tau3 = interval * tau2
tau4 = interval * tau3
finc1 = finc2 = finc3 = finc4 = finctotal/4
fit exp4(x) "edta_kinetics.csv" \
using 1:2:(sqrt($2)) \
via fbas, finc1, tau1, finc2, tau2, finc3, tau3, finc4, tau4
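The ad-hoc construction of initial values in listing 3.6 (time constants spaced by a constant factor, incremental fluorescence split evenly) can be written as a small function. The following Python sketch only restates that scheme and the model function; it does not perform a fit, and the names initial_parameters and multi_exp are invented for this illustration.

```python
import math


def initial_parameters(n, tau1, interval, finctotal):
    """Return (fincs, taus) as starting values for an n-exponential model.

    The time constants form a geometric series that starts at tau1 and grows
    by the factor `interval`; the incremental fluorescence is split evenly
    among the n exponential terms.
    """
    taus = [tau1 * interval ** i for i in range(n)]
    fincs = [finctotal / n] * n
    return fincs, taus


def multi_exp(t, fbas, fincs, taus):
    """Basal fluorescence plus a sum of exponential decays (like exp4)."""
    return fbas + sum(f * math.exp(-t / tau) for f, tau in zip(fincs, taus))


if __name__ == "__main__":
    ftotal, fbas = 1.2e6, 3e5
    # same settings as the four-exponential fit in the listing
    fincs, taus = initial_parameters(4, 10, 5, ftotal - fbas)
    print("time constants:", taus)
    print("model value at t = 0:", multi_exp(0.0, fbas, fincs, taus))
```

At t = 0 all exponentials equal 1, so the model value must come out as the total fluorescence, whichever number of terms is used. This is a quick sanity check on any set of starting values.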
Chapter 4
Protein structure visualization with Jmol and Pymol
4.1 Introduction
Protein structures
• are usually determined by X-ray diffraction analysis of protein crystals
• can sometimes be determined by NMR, particularly with smaller proteins
Protein crystallization
• requires relatively large amounts of pure protein - has really taken off only once
recombinant methods of protein expression became available
• is more difficult to achieve with membrane proteins; number of membrane protein
structures lags behind that of soluble protein structures, but the situation is
changing
4.1.1 Why X-rays?
• Diffraction of X-rays by crystals discovered by Max von Laue (Nobel prize
1914); proved both the wave nature of X-rays and the lattice structure of
crystals. Theory worked out by Bragg sen. and jun. (Nobel prize 1915), who
also solved the crystal structure of sodium chloride
• In general terms: Periodic assemblies (crystals) will diffract electromagnetic waves
by way of constructive interference if and only if the wavelength is similar to the
spacing of the diffracting centers
• The wavelength of X-rays is similar to the length of chemical bonds – the
wavelengths of γ-rays are too short, those of UV rays too long
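The wavelength condition can be made concrete with Bragg's law, n·λ = 2·d·sin θ. Since sin θ cannot exceed 1, a reflection of order n exists only if n·λ ≤ 2·d, and for very short wavelengths the diffraction angles become tiny. The Python sketch below plays this through; the numbers are illustrative assumptions that are not taken from these notes (a lattice plane spacing of 2.82 Å as in sodium chloride, 1.54 Å for a copper X-ray source, and round values for γ-rays and UV).

```python
import math


def bragg_angle(wavelength, spacing, order=1):
    """Return the Bragg angle theta in degrees, or None if no reflection exists.

    Bragg's law: order * wavelength = 2 * spacing * sin(theta).
    Wavelength and spacing must be given in the same unit (here: angstroms).
    """
    s = order * wavelength / (2.0 * spacing)
    if s > 1.0:
        return None      # wavelength too long for this lattice spacing
    return math.degrees(math.asin(s))


if __name__ == "__main__":
    spacing = 2.82       # assumed lattice plane spacing in angstroms
    for label, wavelength in (("gamma ray", 0.01), ("X-ray", 1.54), ("UV", 2000.0)):
        print(label, wavelength, bragg_angle(wavelength, spacing))
```

With these values, only the X-ray wavelength gives a first-order reflection at a comfortably measurable angle; the UV wavelength gives none at all.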
4.1.2 Is it easy?
Max Perutz
• was the first to tackle the structure of a protein crystal (hemoglobin)
• was declared a lunatic when he announced his intention to do so
• worked more than 25 years to finish it up
Even this first crystal structure was only solved after computers had become available.
The calculations involved are too much for humans.
4.1.3 Protein structure databases
1. The protein data bank: rcsb.org/pdb/home/home.do
2. NCBI Structure: ncbi.nlm.nih.gov/sites/entrez?db=structure
In the last couple of years, the number of protein crystal structures that have come out
has really exploded. There are now many structures available of proteins that have not
even been biochemically characterized.
4.1.4 Protein structure family relations
• Sequence homology families are a familiar concept
• 3D-structural homology usually accompanies sequence homology but may extend
even further, that is, it may occur even between proteins that have no significant
sequence similarity
4.1.5 The PDB data format
• Standard format for macromolecular structures
• Text-based – human-readable, sort of, but usually disfigured by lots of computer-generated tripe
• Contains annotation on protein structure (α-helices, β-sheets, disulfide bonds)
that may be displayed by molecular viewers
• Contains quite a bit of secondary information on experimental conditions, citations and the like that does not show up in molecular viewers, so it is often
worthwhile to look over a pdb file with one’s own eyes
4.1.6 Software for molecular visualization
Examples
1. Rasmol – excellent in its day and performs wonderfully on low-end hardware, but
now dated. The scripting language lives on in Jmol
2. Jmol – Java program that can run as an applet (inside web pages) and stand-alone.
Scriptable with an extended version of the Rasmol scripting language
3. Pymol – Programmed with a mixture of Python and C++, very flexible, produces
very good images but clunkier than Jmol in some ways
4. Cn3d – the “official” viewer of the NCBI. I haven’t used it much, so can’t comment
on its qualities
These programs are all freely available. Thorsten Dieckmann swears by another one,
UCSF Chimera. From looking at its web page, it seems similar in power to Pymol, but I
have no hands-on experience with it.
Jmol offers a nice balance of ease of use and capability, so that’s what we are going to
use for our initial exercises.
4.2 Jmol
4.2.1 Jmol exercises
Open a shell window and cd to your jmol-practice directory, then enter the command
jmol chloroA.pdb
Jmol should start and show you something like this:
Click and drag with the mouse to rotate. Hold down the shift key, and double-click,
then drag to shift the molecule. Roll your mouse wheel to zoom in and out. For now,
quit Jmol.
4.2.2 The PDB file
Before having some more fun with Jmol, let’s look at the data file, chloroA.pdb. Type
less chloroA.pdb
to look at it. As you can see, it is quite human-readable, and it starts with a lot of
information, including the protein sequence and the regular secondary structure motifs
(α-helices and β-sheets). The coordinates start with the first line prefixed with ATOM:
ATOM      5  N   THR A  10      92.241 155.870 190.344  1.00 33.86           N
4.2.3 The fields of the ATOM record
ATOM      record describes a “regular” atom, not a hetero-atom
5         running number of the atom (arbitrary)
N         atom name (relates to the residue)
THR       residue (threonine)
A         chain
10        residue number (in this file, residues 1-9 are missing)
92.241    x-coordinate (then y, then z)
1.00      occupancy
33.86     temperature factor (mobility of the atom)
N         element
. . . we can use most of these fields to select the atom within Jmol.
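The fields of an ATOM record sit in fixed columns, so a program should slice the line by column position rather than split it at whitespace (fields may run together when numbers get large). The following Python sketch shows this for the example record; the column positions follow the published PDB format description, while the function name and the dictionary keys are made up for this illustration.

```python
def parse_atom_record(line):
    """Split one ATOM or HETATM line into its fields, using the fixed PDB columns."""
    return {
        "record":     line[0:6].strip(),    # columns 1-6
        "serial":     int(line[6:11]),      # columns 7-11
        "name":       line[12:16].strip(),  # columns 13-16
        "residue":    line[17:20].strip(),  # columns 18-20
        "chain":      line[21],             # column 22
        "resno":      int(line[22:26]),     # columns 23-26
        "x":          float(line[30:38]),   # columns 31-38
        "y":          float(line[38:46]),   # columns 39-46
        "z":          float(line[46:54]),   # columns 47-54
        "occupancy":  float(line[54:60]),   # columns 55-60
        "tempfactor": float(line[60:66]),   # columns 61-66
        "element":    line[76:78].strip(),  # columns 77-78
    }


# the example record from chloroA.pdb, with its original column spacing
EXAMPLE = "ATOM      5  N   THR A  10      92.241 155.870 190.344  1.00 33.86           N"

if __name__ == "__main__":
    for key, value in parse_atom_record(EXAMPLE).items():
        print("%-10s %r" % (key, value))
```

The same function works for HETATM records, since they use the same column layout.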
4.2.4 A hetero-atom record
HETATM 5567 MG   BCL 1   3      58.663 173.879 180.379  1.00  7.66          Mg
Now, let’s start up jmol again with the same command as before: jmol chloroA.pdb.
Hetero-atoms usually follow below the “regular” atoms, that is those that are part of
the macromolecule itself. This HETATM record represents the first magnesium of the
first chlorophyll molecule. Note that it has been assigned the chain name 1, as have
all molecules of chlorophyll associated with the protein chain A. Such decisions are up
to the PDB file’s author; some PDB files are well-organized like this one, while others
aren’t.
BCL, I suppose, stands for “bacteriochlorophyll”. Again, such acronyms for prosthetic
groups, drugs, or other ligands are made up on the spot, and the fastest way to find out
about them is just to look at the pdb file.
4.2.5 Tweaking the view
Bring up a Jmol-console: Right-click in the main window to bring up the context menu
and then choose “console”. First, let’s change the background color:
background white
Let’s blow up the atoms to their (approximate) van der Waals size:
spacefill
We can also scale them up or down with for example
spacefill 50%
Or, we can assign them explicit radii (in angstroms):
spacefill 2.0
Jmol has a menu. This is for wimps; real men use the console. (The menu also is
somewhat limited, so you actually have no choice.)
Next, we want to distinguish the protein from the hetero atoms:
select protein; color white; select bcl; color green; select water; color blue
You should now see something like this (after some manual rotation with the mouse):
We don’t really need that second protein molecule. Let’s axe it, and while we are at it,
get rid of the water molecules, too:
restrict (chain=a or chain=1) and not water
Now, if you rotate the molecule, it feels a little awkward, since the visible part of the
molecule is off center. Click View→Define center in the menu to fix this.
4.2.6 Saving our hard work
We can
1. Save the current state in a Jmol script
2. Save an image (screenshot)
3. Export a povray script
An image is just a snapshot of the current display; it cannot be changed later from
within Jmol. Similarly, a povray script is for creating a still image; povray simply adds
some 3D spiffiness to the image.
Saving a Jmol script is different – you can load up the script later and continue to modify
the display of the molecule.
4.2.7 Saving the current state
You can do so from the menu (File→Export→Write state) or from the console:
write state state1.spt
Did it work?
zap; load state1.spt
should delete the current view and then restore it from the saved file. You can use the
load command anytime to revert to a previously saved state in case you goofed up.
4.2.8 Saving images
Two methods
• From the menu: Export→Export image
• From the console: write image 2000 2000 chloroA.png
The console method has the advantage that you can increase the resolution, which is
advisable for printed documents.
In either case, I recommend using the PNG format for exporting,
since it is widely compatible and gives better quality than the JPEG format. GIF is similar
to PNG but more compact; it works in many applications but not with pdfLaTeX, whereas
PNG and JPEG do.
4.2.9 Looking at protein folds
Let’s explore the molecule some more. How is it folded? Let’s inspect the backbone of
the polypeptide chain.
restrict chain=a and backbone
A better display for this is:
spacefill off; wireframe 0.3
With
antialiasDisplay=true
the image will look nicer, but at the expense of a slower response to mouse movements.
When saving images, the antialias switch seems to be set implicitly, so as long as you
only care about the exported images it’s not needed.
4.2.10 Folds. . .
Let’s highlight the secondary structure elements:
select helix; color blue; select sheet; color red
Another way to display the secondary structure is with the cartoon mode:
wireframe off; cartoon on
Our second protein molecule reappeared, since we did not exclude it explicitly in our
above select commands. Get rid of it with
restrict chain=a
Save the current display state in fold.spt.
4.2.11 More on selections
We can select atoms using various kinds of atom expressions. Each of these can be
used like so: select alpha, and they can be combined by boolean operators: select
buried and backbone, select basic or aromatic and so on.
• Role in structure: alpha, backbone, sidechain, hetero
• Solvent: solvent, water
• Properties of sidechains: surface, buried, acidic, aliphatic, basic,
charged, hydrophobic, neutral, polar
• Spatial relationships; example: within(10, [trp]179)
• Chemical element; example: element="N"
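What a distance-based selection such as within(10, [trp]179) has to compute can be sketched in a few lines of Python: keep every atom that lies no further than the given distance from at least one atom of the reference set. This is only an illustration of the idea, not Jmol's implementation, and the atoms and coordinates below are invented.

```python
import math


def distance(a, b):
    """Euclidean distance between two atoms given as (label, x, y, z) tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a[1:], b[1:])))


def within(cutoff, reference, atoms):
    """Return all atoms that lie within `cutoff` of any atom in `reference`.

    Coordinates and cutoff are in angstroms. Reference atoms that are part of
    `atoms` are included, since their distance to themselves is zero.
    """
    return [atom for atom in atoms
            if any(distance(atom, ref) <= cutoff for ref in reference)]


if __name__ == "__main__":
    # hypothetical atoms: (label, x, y, z)
    atoms = [("trp179.CA", 0.0, 0.0, 0.0),
             ("gly180.N", 3.0, 4.0, 0.0),
             ("bcl3.MG", 20.0, 0.0, 0.0)]
    reference = [atoms[0]]
    print([atom[0] for atom in within(10.0, reference, atoms)])
```

The boolean operators then correspond to set operations on such atom lists: "and" keeps the atoms common to both selections, "or" merges them.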
4.2.12 Exercise: Try to reproduce this display
• One protein chain in white and cartoon mode, helices highlighted in pink
• the associated chlorophyll molecules in wireframe and in different colors, with the
central magnesium atoms in spacefill and in blue
4.2.13 Hints
To select individual chlorophyll molecules, you can do
select group="bcl" and resno=3
and so on. . . as you can glean from the pdb file, the ones associated with chain A are
numbered 3–9. After selecting each group, apply
color red
and so on. If you run out of colours, try cornflowerblue, cyan, fuchsia, lime, orchid,
peachpuff, pink, purple, salmon, turquoise, violet. . .
Select helices with
select structure="helix"
When done, save your state, and save a picture.
4.2.14 And another one
To generate the surface for chain A, use
isosurface select(chain=a) sasurface; color isosurface white
To cut away the front of the molecule, use
slab on; slab 50
The keyword sasurface means solvent-accessible surface.
The number given to the slab command is the percentage of the molecule that remains
visible: smaller numbers cut away more of the molecule, larger numbers less.
4.2.15 And a last one
Use slab off to restore the molecule. Make the surface translucent: color isosurface
white translucent
Try to render this picture with povray. To use povray, you need to have it installed ;)
Jmol is supposed to be able to run povray itself, but I couldn’t get it to work. I ended
up saving the povray script to a file (which you can do from the Jmol menu) and then
invoking povray manually.
4.3 Pymol
Differences to Jmol:
1. Written in Python and C++, not Java – installation on different platforms can be a
bit more cumbersome
2. Doesn’t run inside web pages
3. More advanced graphics capabilities
4. License: If you visit pymol.org, it looks as if it’s commercial – but the code is
actually open source and will stay this way
4.3.1 Documentation
A wiki – current, but not quite complete: http://www.pymolwiki.org
A manual from the programmer himself – oldish, but still adequate for basic topics:
http://pymol.sourceforge.net/newman/userman.pdf
Assorted tutorials collected from the web:
http://watcut.uwaterloo.ca/chem731/2011/pymoldocs/
4.3.2 The GUI
The GUI is split across two windows, the external GUI and the internal GUI. This is
somewhat clunky, but it does permit a larger view of the molecule,
as you can maximize that window separately.
4.3.3 Opening files
1. From the menu (File→Open)
2. From the command line:
load file.pdb
3. Directly from the protein database (while online):
fetch 7ahl (no .pdb extension in this case!)
Commands can be entered either in the internal or external GUI.
You can specify multiple names to load multiple structures in one go (for example for
alignments, see later). Note that structures that you fetch from the web also get saved
locally into the current directory, so make sure you are in the right one before fetching
stuff.
4.3.4 Working with single structures
To create images, the basic workflow is similar to Jmol:
1. Arrange the molecule in space, using the mouse
• Left mouse button rotates
• Middle button or wheel moves in XY plane
• Right button zooms
2. Select parts of the molecule. Note, however, that selections are named in pymol
3. Apply formatting instructions to selections
4. Save image
4.3.5 Exercise: HIV protease with the inhibitor saquinavir bound to it
Load the molecule:
load 2nnp.pdb
You should now see the structure, and an entry representing it in the lower window:
Click S→as→spheres to display everything in spacefill mode
Click A→remove waters
The A S H L C buttons are menus that allow you to work
with the entire structure. For each selection you create (see below), you get a new set of
buttons that apply the same operations to this selection only.
4.3.6 What are virus proteases, anyway?
Some important viruses such as hepatitis C virus and human immunodeficiency virus
(HIV) translate their genomic nucleic acids into polyproteins. The domains of a polyprotein are then cleaved from one another to become the mature virus proteins.
As such, polyproteins are inactive. The only component that is active within the polyprotein is the protease that first cleaves itself and then the other components, which then
part ways to serve their respective roles in virus replication and assembly. Protease
inhibitors such as saquinavir prevent this necessary maturation and therefore block
virus replication. They are effective in the treatment of virus infections.
4.3.7 Saving a cleaned-up version of the molecule
Display hetero-atoms:
• Activate the sequence view: click on the S button close to the bottom right corner
of the bottom window
• In the sequence view, use the scroll bar to navigate to the right end
• You should now see the following: ROC ACY ACY SO4 GOL GOL
• Select ACY ACY SO4 GOL GOL
• Click S→as→spheres to verify you have the right selection (just a bunch of superficially associated small molecules)
• Click A→Remove atoms
• Save the molecule: File→Save molecule, or type save 2nnp_cleaned.pdb
Molecules like salts, glycerol and detergents are commonly used in crystallography to
facilitate crystallization. They often have no real meaning for the biological activity of
the protein as such. In our example, ACY is acetate, and GOL is glycerol. SO4 I’m sure
you can guess. ROC is the actual ligand (the drug saquinavir), so we want to keep it.
4.3.8 Visualizing structure elements
In the menu of the selection object “all”, click C→by chain→by chain
You should now see that the molecule contains two polypeptide chains, which between
them enclose a drug molecule (saquinavir).
4.3.9 Saving state
• From the command line: save 2nnp-state.pse
• From the menu: Save session
Saving state frequently is advisable, since many operations in Pymol can’t be undone.
The save command is the same as used above for the cleaned-up pdb file. Pymol infers
your intention from the file extension. The extension pse represents a pymol session.
Unlike the Pymol script files, the session files are not editable, but they do have the
advantage of saving the complete program state, not just those parts of the state that were
created from the command line.
4.3.10 Selections
Before we can apply further prettification, we first must get hold of the components of
the structure. For saquinavir, we can use the sequence view again:
1. Click on ROC (now at the far right)
2. In the “sele” menu, click A→rename selection, then type “saquinavir”.
For the polypeptide chains, it is easier to use the command line:
• select chain_a, chain a
• select chain_b, chain b
Save state: save 2nnp-state.pse
In Pymol, selections are persistent – you can have
multiple selections, each of which has a name. While this is not very important with
trivial selection criteria such as the ones created in this example, it is really useful with more
complicated criteria. For example, we can select the backbone atoms of polypeptide
chains like this:
select backbone, name c+o+n+ca
We could then narrow down this selection to specific chains:
select backbone_ab, backbone and (chain a or chain b)
At this level of complexity, persistent selections begin to make sense. Also, even trivial
selections such as the ones shown here have the advantage that they give you an ASHLC menu
bar.
Selections are retained in your session when you save it as a .pse file.
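The idea of named, persistent selections that build on one another can be mimicked in plain Python with a dictionary of sets. This sketch does not use Pymol's own Python interface; the atoms are a made-up mini-structure and the select function is a stand-in written only for this illustration.

```python
# hypothetical mini-structure: (chain, residue number, atom name)
ATOMS = [
    ("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "C"), ("A", 1, "O"), ("A", 1, "CB"),
    ("B", 1, "N"), ("B", 1, "CA"), ("B", 1, "CB"),
    ("C", 1, "CA"),
]

# name -> set of atoms; an entry stays around until it is overwritten
selections = {}


def select(name, predicate, source=ATOMS):
    """Store all atoms that satisfy `predicate` under `name` and return them."""
    selections[name] = {atom for atom in source if predicate(atom)}
    return selections[name]


# counterpart of: select backbone, name c+o+n+ca
select("backbone", lambda atom: atom[2] in {"C", "O", "N", "CA"})

# counterpart of: select backbone_ab, backbone and (chain a or chain b)
select("backbone_ab",
       lambda atom: atom in selections["backbone"] and atom[0] in {"A", "B"})

if __name__ == "__main__":
    for name, atoms in selections.items():
        print(name, sorted(atoms))
```

Because the first selection keeps its name, the second one can refer to it instead of repeating the criteria, which is exactly what makes named selections worthwhile once the criteria get longer.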
4.3.11 Prettification
• Apply S→as→surface to chain_a and chain_b
• Apply S→as→sticks to saquinavir
This view nicely illustrates how the drug molecule fills the active site of the HIV protease.
4.3.12 Producing high-quality figures
Invoke the built-in raytracer:
ray
Enhance the resolution with e.g.:
ray 2000
Smooth the edges:
set antialias, 3
To apply this antialias value, you have to invoke ray again.
In the command ray 2000, the number specifies the horizontal resolution in pixels. A value of
2000 should be o.k. for printed documents.
Possible values for antialias are 0 to 4. For printed output, it should be 2 or 3. A
value of 4 burns lots of CPU cycles, but I haven’t seen a great improvement over 3.
Combinations of high values for antialias, image resolution and surface transparency
may test the limits of your hardware.
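Whether 2000 pixels are enough depends on how wide the figure will be printed. Here is a small worked calculation in Python, assuming the commonly used print resolution of 300 dots per inch (that number is an assumption, not something Pymol prescribes); the function names are made up.

```python
CM_PER_INCH = 2.54


def pixels_for_print(width_cm, dpi=300):
    """Pixels needed to print an image `width_cm` wide at `dpi` dots per inch."""
    return round(width_cm / CM_PER_INCH * dpi)


def print_width_cm(pixels, dpi=300):
    """Largest print width in cm at which `pixels` still give `dpi` dots per inch."""
    return pixels / dpi * CM_PER_INCH


if __name__ == "__main__":
    print("width covered by 2000 pixels: %.1f cm" % print_width_cm(2000))
    print("pixels needed for an 8 cm wide figure:", pixels_for_print(8.0))
```

Under this assumption, 2000 pixels cover a figure roughly 17 cm wide, which is about the full text width of a page; single-column figures get by with less.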
4.3.13 Driving Pymol with scripts
Pymol understands two languages:
1. Python – after all, it is partially programmed in it.
2. Its own scripting language
Let’s try it: Run
@gyrase.pml
. . . and wait, and wait some more. . .
You need Python if you want to extend Pymol’s
abilities. Python programmed extensions are often packaged as plugins.
The Pymol scripting language is the same that we also have used at the command line
within pymol (load, save, select . . . ) and is documented in the user and reference
guides. This language is more straightforward to use and suffices to control the built-in
capabilities of Pymol.
What is the advantage of scripting over interactive usage? It depends on the scale of
usage. As long as you just need one or two figures, interactive usage is fine. Scripting
comes into play if you for example need to produce many figures in a consistent style.
You could for example create a script file that selects the backbone of each polypeptide
chain and displays it in a consistent way, and then run this script over each structure.
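One way to get many figures in a consistent style is to let a small program write the Pymol script for us. The Python sketch below only generates the text of a .pml file; it does not talk to Pymol. The file names are hypothetical, the selection name bb is arbitrary, and the png command (which writes the rendered image to a file) has not been used in these notes so far.

```python
def make_pml(pdb_files, width=2000):
    """Return the text of a Pymol script that renders each structure the same way."""
    lines = []
    for pdb in pdb_files:
        stem = pdb.rsplit(".", 1)[0]       # file name without its extension
        lines += [
            "reinitialize",
            "load %s" % pdb,
            "hide everything",
            "bg_color white",
            "select bb, name c+o+n+ca",    # backbone atoms, as in section 4.3.10
            "show sticks, bb",
            "set antialias, 2",
            "ray %d" % width,
            "png %s.png" % stem,
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # hypothetical input files
    script = make_pml(["structure1.pdb", "structure2.pdb"])
    with open("backbones.pml", "w") as handle:
        handle.write(script)
    print(script)
```

The resulting file can then be run from within Pymol with @backbones.pml, just like gyrase.pml.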
4.3.14 The image produced by gyrase.pml
4.3.15 What is DNA topoisomerase anyway?
The gyrase.pml script displays human DNA topoisomerase I, in complex with DNA
and the topoisomerase inhibitor topotecan. What do DNA gyrases and topoisomerases
do?
This picture illustrates the degree to which DNA is curled up inside the cell: On the left,
the packed form of the bacterial chromosome is shown (the light area in the center of
the cell), whereas on the right side the DNA is spilled out of the cell.
Transcription and replication of DNA require it to be unpacked or unwound. In both
eukaryotic and prokaryotic cells, this is accomplished by DNA topoisomerases I and II.
4.3.16 The reaction catalyzed by DNA topoisomerases
This figure illustrates the basic function of topoisomerases I and II: A DNA molecule
is cut, and the free ends are moved past the other DNA molecule and then rejoined.
In the case of DNA topoisomerase I, the two DNA molecules are single strands. DNA
topoisomerase II applies the same operation to double strands, that is it cleaves both
strands of one double helix and moves the free ends past another double helix. Both
activities are needed for transcription and replication.
Inhibitors of DNA topoisomerases are used in both antibacterial chemotherapy and
in tumour therapy. Irinotecan (shown in the figure above) is an inhibitor of human
topoisomerase I that is used in the treatment of cancer.
4.3.17 Understanding script files
Read through the file gyrase.pml – see comments for explanations.
Listing 4.1: The gyrase.pml script for Pymol (pmpractice/gyrase.pml)
# dna gyrase with dna and topotecan
# clear out and load file
reinitialize
load 1k4t.pdb
# don’t display anything while the settings are adjusted
hide everything
# surface transparency
set transparency, 0.6
# thickness of sticks in stick display
set stick_radius, 0.2
# diameter of spheres
set sphere_scale, 0.6
# illumination for the ray tracer
set direct, 0.2
set fog, 0.5
# define sub-structure selections
select drug, resn TGP
select protein, chain a
select dna, (chain b or chain c or chain d) and not drug
# create a dummy selection to deselect the dna
select dummy, chain z
# define colors for the sub-structures
color gray80, protein
# color one dna strand black, the other gray
color black,dna
color gray70, chain d
color gray50, drug
# set the view coordinates. These were copied from an
# interactive Pymol session. Click "Get view" in the
# outer (top) GUI window to get the current coordinates.
set_view (\
    -0.316715509,   -0.014092376,   -0.948415995,\
    -0.048230056,    0.998836935,    0.001265566,\
     0.947293043,    0.046144385,   -0.317026347,\
    -0.000038713,    0.000031844, -265.550964355,\
    21.163307190,   -1.675732613,   40.628482819,\
   191.892059326,  339.233642578,    0.000000000 )
# white background - looks so much better in print
bg_color white
# display everything according to the settings above
show surface, protein
show spheres, drug
show sticks, dna
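The 18 numbers passed to set_view are not as arbitrary as they look. As far as the Pymol documentation goes, the first nine should form a 3×3 rotation matrix, followed by the camera position, the origin of rotation, the two clipping plane distances and a flag for orthoscopic view; treat this assignment as an assumption rather than a reference. The rotation part at least is easy to check, since a rotation matrix must have rows of unit length that are perpendicular to each other. The Python sketch below does this for the values in listing 4.1.

```python
# the 18 view values from listing 4.1
VIEW = [
    -0.316715509, -0.014092376, -0.948415995,
    -0.048230056, 0.998836935, 0.001265566,
    0.947293043, 0.046144385, -0.317026347,
    -0.000038713, 0.000031844, -265.550964355,
    21.163307190, -1.675732613, 40.628482819,
    191.892059326, 339.233642578, 0.000000000,
]


def rotation_rows(view):
    """Return the first nine view values as three rows of three numbers."""
    return [view[0:3], view[3:6], view[6:9]]


def is_rotation(rows, tol=1e-3):
    """Check that all rows have unit length and are mutually perpendicular."""
    for i in range(3):
        for j in range(i, 3):
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            expected = 1.0 if i == j else 0.0
            if abs(dot - expected) > tol:
                return False
    return True


if __name__ == "__main__":
    print("rotation part is orthonormal:", is_rotation(rotation_rows(VIEW)))
```

This is also a handy check after editing view values by hand: if the first nine numbers no longer pass it, the view will come out distorted or be rejected.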
Chapter 5
Sequence analysis
5.1 Introduction
What is it good for?
• Genome sequencing has provided boatloads of information
• Many sequences encode proteins that have not yet been biochemically characterized
• The function of such uncharacterized proteins can often be inferred by comparison to known sequences and sequence motifs
5.1.1 Sequence analysis resources: Starting points
Web-based:
• Gene and genome databases accessible through NCBI: http://www.ncbi.nlm.nih.gov/
• Directories of analysis tools at EBI: http://www.ebi.ac.uk/ and Expasy: http://ca.expasy.org/
Local:
• The EMBOSS suite of programs
• Look around in your package manager (select the section “science”, then search for
“sequence”)
For one or a few sequences, the on-line resources are sufficient. If we want to analyze
and compare large numbers of sequences, it can be useful to download them and run
the analysis locally.
For our exercises, we will use the EMBOSS suite. Some additional exercises will be part
of the sessions on Python programming.
5.2 Exercises
5.2.1 Proteins of unknown function in the Saccharomyces cerevisiae (baker’s yeast) genome
File yuk.fasta contains the sequences of all proteins that were uncovered by genome
sequencing yet had not been characterized biochemically before.
less yuk.fasta
The sequences are listed in the so-called FASTA format – the first line starts with “>”
and contains name and description, followed by the protein sequence in single letter
code.
How many sequences?
grep -c '>' yuk.fasta
For our exercises, I have compiled a file that contains all
sequences of proteins with unknown function from the genome of Saccharomyces cerevisiae. This file dates back to 2009 – some of the sequences may have been biochemically
characterized in the meantime.
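The grep command counts header lines. If we also want the sequences themselves, a few lines of Python will read the FASTA format. The sketch below is a bare-bones reader (it assumes every record starts with a ">" line and ignores empty lines); the two example records are made up and only serve to show the format.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# two made-up records, only to show the format
EXAMPLE = """>protein_1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>protein_2 another hypothetical protein
MSDNE
"""

if __name__ == "__main__":
    records = list(read_fasta(io.StringIO(EXAMPLE)))
    print("number of sequences:", len(records))
    for header, sequence in records:
        print(header, len(sequence))
```

To apply it to the real file, replace io.StringIO(EXAMPLE) with open("yuk.fasta"); the number of records should agree with the count reported by grep.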
5.2.2 Sequence composition and inferred properties
pepstats -outfile yuk.pepstats yuk.fasta
Have a look at the results:
less yuk.pepstats
Some predictions are more reliable than others . . . The molecular weight should be
accurate, except that it does not take into account post-translational modifications
(cleavage, glycosylation, acylation). The absorbance at 280 nm should be accurate to
within a few percent.
The isoelectric point should be a reasonable approximation, whereas the “Improbability
of expression in inclusion bodies” looks a bit funny at 3 significant digits.
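The molecular weight is the simplest of these predictions: add up the masses of all amino acid residues and add one water molecule for the free ends of the chain. The Python sketch below shows the arithmetic. It is not how pepstats is implemented; the residue masses are rounded average values of the kind found in standard tables, and they should be checked against a reference table before relying on the result.

```python
# average residue masses in Da (mass of the amino acid minus one water)
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153


def molecular_weight(sequence):
    """Average molecular weight of an unmodified peptide in single letter code."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER


if __name__ == "__main__":
    for peptide in ("G", "GG", "MKTAYIAKQR"):
        print(peptide, round(molecular_weight(peptide), 2))
```

Post-translational modifications would have to be added on top of this sum, which is why the prediction is only as good as our knowledge of the mature protein.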
5.2.3 Secondary structure prediction
• α-helix: More sterically hindered, preferred by aa with smaller side chains
• β-sheet: More room, preferred by aa with side chain bulk close to the backbone
The standard amino acids and some of their properties, including preference for α-helix
or β-sheet structure, are listed in table 5.1 on page 57.
5.2.4 Secondary structure prediction ctd.
Find a program to use:
wossname secondary
Read its documentation:
tfm garnier
Run it:
garnier -outfile yuk.garnier yuk.fasta
Examine the output:
less yuk.garnier
The EMBOSS suite comes with a whimsically named utility, wossname,
which searches the documentation of all EMBOSS programs for a keyword, and lists all
programs that contain it. The tfm utility displays the full documentation for any EMBOSS program.
5.2.5
Searching for sequence motifs
The concept of sequence motifs applies to both nucleic acid and protein sequences. We
can distinguish
• Structural motifs (for example combinations of secondary structure elements)
• Functional motifs: Binding sites, target sites of enzyme action
The concept of sequence homology is the foundation of pretty much everything else
in sequence analysis. Simply put, functionally similar genes and proteins should have
similar sequences, the more so if the source organisms are phylogenetically related.
Indeed, the extent of sequence homology between genes or genomes is currently the
gold standard for establishing phylogenetic relationships.
The concept of structural motifs overlaps with that of functional motifs, so let’s not
try too hard to draw an artificial line. However, structural motifs do not necessarily
imply a high degree of sequence similarity. Instead, they may simply consist of clusters
of amino acids with similar preference for a given secondary structure (α-helices or
β-sheets, respectively; table 5.1), or they may combine a succession of helical and sheet
motifs into a higher order structure.
Functional sites often require some structural context, for example they must be exposed on the surface of the protein molecule in order to be accessible – so prediction
based on sequence will generate some false positives.
An exhaustive collection of functional and structural protein motifs is maintained in
the Prosite database, which also points to proteins that contain the sites in question.
5.2.6
Sequence motifs are expressed as consensus motifs
An example: The consensus motif for active sites of serine proteases
less serineprotease
The characters [LIVM]-[ST]-A-[STAG]-H-C mean: A leucine, isoleucine, valine or
methionine, followed by serine or threonine, followed by alanine, . . .
This consensus motif is described in the syntax that is used in the Prosite database, and
is also understood by the fuzzpro program (see below).
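To make the pattern syntax concrete, here is a rough Python sketch that rewrites such a consensus motif as a regular expression and searches a sequence with it. It covers only the elements that occur in these notes (residue lists in square brackets, X for any residue, repeat counts in parentheses); the function name and the test sequence are invented, and this is not how fuzzpro itself works internally.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple Prosite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split('-'):
        element = element.replace('X', '.').replace('x', '.')   # X = any residue
        element = element.replace('(', '{').replace(')', '}')   # repeat counts
        parts.append(element)
    return ''.join(parts)

regex = prosite_to_regex('[LIVM]-[ST]-A-[STAG]-H-C')
print(regex)

# An invented sequence that contains the motif once (LSAGHC)
sequence = 'MKKLSAGHCDE'
match = re.search(regex, sequence)
print(match.group(0))
```

The same function turns [LV]-X(1,5)-Y-X(1,5)-[KR] into [LV].{1,5}Y.{1,5}[KR].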
5.2.7
How do we find motifs?
List suitable programs:
wossname motif
In this list, we find fuzzpro. The pattern syntax used in the file serineprotease and a few
others (all downloaded from Prosite) is the one expected by fuzzpro.
In theory, we could run
fuzzpro
But that would require us to type the longish motifs. We don’t want that. Enter shell
scripting:
less fprun
The fprun script simply takes the name of a file that contains a search
pattern, reads the file content and constructs the full fuzzpro command for us.
Inside the script, reading the file contents is done with cat, and the output of the cat
command is captured with the backticks. We can apply the same steps directly:
cat serineprotease
Insert the file content into a command using backticks:
fuzzpro -sequence yuk.fasta -pattern `cat serineprotease` -outfile jnk
5.2.8
Searching sequence motifs
Run fuzzpro via fprun:
./fprun efhand
Inspect results:
less efhand_results
Try the same with the motif file shortchain.
Search for putative cholesterol binding motifs:
echo "[LV]-X(1,5)-Y-X(1,5)-[KR]" > cholesterol
./fprun cholesterol
5.2.9
The CAAX box motif
• Causes C-terminal farnesylation (attachment of hydrophobic moiety)
• Farnesylated proteins stick to membranes
• Cysteine, two aliphatics, one arbitrary, then end (C-terminus)
How to search for it? The difficulty here is that the CAAX box is supposed to be located
at the C-terminus. I have not found a way to instruct fuzzpro to limit the search by
location.
Another program for motif search is preg. This program uses a more powerful syntax
that also allows us to specify location. We can use it like so to search for the CAAX box:
In this command, [ILV] represents any of I, L or V (aliphatics), the 2
between braces denotes 2 of the foregoing, and A-Z denotes any letter (X).
The dollar sign represents the end of the sequence. Thus, we will only capture
the CAAX pattern if it runs right up to the C terminus.
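Read literally, the description above corresponds to the regular expression C[ILV]{2}[A-Z]$. That expression is reconstructed from the description, not copied from an actual preg command line, and the sketch below assumes that the syntax carries over to Python's re module. The three toy sequences are invented.

```python
import re

# Reconstructed from the description in the text: C, two of I/L/V,
# any one letter, then the end of the sequence
caax = re.compile(r'C[ILV]{2}[A-Z]$')

# Invented toy sequences, for illustration only
sequences = {
    'ends_with_caax': 'MSTKKDGCVLS',
    'caax_internal': 'MSCVLSTKKDG',
    'no_caax': 'MSTKKDGAVLS',
}

hits = [name for name, seq in sequences.items() if caax.search(seq)]
print(hits)
```

Only the first sequence is reported: the second one contains CVLS, but not at the C-terminus.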
5.2.10
Comparing sequences
wossname compare
Hm. Widen the search a bit . . .
wossname compar
There. seqmatchall is what we want.
seqmatchall -outfile matched -wordsize 20 yuk.fasta
That will take a little while. When done, sift through the file with less. Take note of
the names of some pair of matched sequences. Here, we compare all sequences in the
file against one another, and obtain all pairs that share at least one identical
stretch of 20 or more amino acids. That involves quite a bit of data churning, which is
why seqmatchall takes a little while.
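The idea of matching by shared identical stretches can be sketched in a few lines of Python. This is a naive illustration of the concept, not a description of how seqmatchall is implemented; the function name and the toy sequences are invented, and a word size of 5 instead of 20 keeps the example readable.

```python
def shared_words(seq1, seq2, wordsize):
    """Return the set of substrings of length wordsize that occur in both sequences."""
    words1 = set(seq1[i:i + wordsize] for i in range(len(seq1) - wordsize + 1))
    words2 = set(seq2[i:i + wordsize] for i in range(len(seq2) - wordsize + 1))
    return words1 & words2

# Invented toy sequences that share the stretch AYIAKQR
a = 'MKTAYIAKQRQISFVK'
b = 'GGGAYIAKQRLLL'
common = shared_words(a, b, 5)
print(sorted(common))
```

Doing this for every pair of sequences in a large file means a lot of comparisons, which gives a feeling for why the real program needs some time.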
5.2.11
Aligning sequences
Extract two matched sequences:
seqret
When prompted for the sequence to read, type something like:
yuk.fasta:NP_116593.1
and then, for the output file:
1.seq
Repeat this for the second sequence, giving 2.seq as the file name. Then do
cat 1.seq 2.seq > both.seq
clustalw
For a more detailed examination of similarity between any two sequences,
we can do a sequence alignment with clustalw. This procedure searches not just for
identity but also for similarity, and it tries to arrange the two sequences in such a way
that as many residues as possible are matched up with an identical or similar residue
in the other molecule.
Table 5.1: Amino acids and some of their properties
Amino Acid     3-Letter  1-Letter  Polarity  Charge  SS       other
Alanine        Ala       A         nonpolar  0       α
Arginine       Arg       R         polar     +       α
Asparagine     Asn       N         polar     0       neither
Aspartic acid  Asp       D         polar     –       neither
Cysteine       Cys       C         nonpolar  0       β        disulfides
Glutamic acid  Glu       E         polar     –       α
Glutamine      Gln       Q         polar     0       α
Glycine        Gly       G         nonpolar  0       neither
Histidine      His       H         polar     0       α
Isoleucine     Ile       I         nonpolar  0       β        aliphatic
Leucine        Leu       L         nonpolar  0       α        aliphatic
Lysine         Lys       K         polar     +       α
Methionine     Met       M         nonpolar  0       α(β)
Phenylalanine  Phe       F         nonpolar  0       α(β)     aromatic
Proline        Pro       P         nonpolar  0       break
Serine         Ser       S         polar     0       neither
Threonine      Thr       T         polar     0       β
Tryptophan     Trp       W         nonpolar  0       β        aromatic
Tyrosine       Tyr       Y         polar     0       β        aromatic
Valine         Val       V         nonpolar  0       β        aliphatic
Chapter
6
Molecular docking
6.1
Introduction
• Purpose: Find binding sites for small molecules on protein receptors
• Method: Position and conformation of the ligand is randomly varied, and the
binding energy is estimated for each variation
• Widely used for in silico screening of the interactions of existing or hypothetical
small molecules with drug targets
• Requires crystal or NMR structures of the receptors
• Various commercial and free software implementations. We will use Vina
The receptor can be treated as rigid or conformationally flexible; the latter increases
computational cost. In high-throughput applications, that is when screening a large
number of compounds, the receptor is therefore typically treated as rigid. We will adopt
the same approach in our exercises.
6.1.1
Overview of the procedure
• Use Autodock tools to prepare input files for receptor and ligand
• Define search box (limit the space in which the ligand can hunt for a binding site)
• Run Autodock Vina to perform the docking
• Examine output in Pymol
We need to install Vina (http://vina.scripps.edu) and MGLtools
(http://mgltools.scripps.edu).
The input files for the docking program are derived from .pdb files. The latter only
contain molecular coordinates but no information about charges, and they typically also
lack the hydrogen coordinates. Charges and hydrogen bonds are important in binding,
so we need to supply this information. We produce it with Autodock tools.
6.2
Exercise: Docking imatinib to abl protein tyrosine kinase
This exercise is a recapitulation of a video tutorial that is available on the Vina website.
A little background:
• abl kinase is a tyrosine kinase; its mutant form (the bcr-abl fusion protein) causes
chronic myeloid leukemia (CML)
• Imatinib is a tyrosine kinase inhibitor that is used against leukemia and some solid
tumors
• Good example of structure-based drug design
6.2.1
Preparing the receptor input file
1. In a console, cd to the tutorial folder
2. Start autodock tools: adt
3. Menu File→Read Molecule→receptor.pdb
4. Add hydrogens: Edit→hydrogens→add→polar only→OK.
5. Apply: Grid→Macromolecule→Choose→OK.
   Save the resulting file as receptor.pdbqt.
It is assumed that both autodock tools and Vina have been installed and are on your
shell’s PATH.
The .pdbqt file that is produced in this step assigns the charges to the basic and acidic
amino acid side chains. It also adds the hydrogen coordinates.
When adding the hydrogens, we choose “polar only”. The apolar ones will then be
treated by Vina by way of pseudo-atoms. For example, a methyl group is treated as if
it were a single atom, with a volume that includes both the central carbon and the
three hydrogens attached to it. This simplifies the calculations considerably, without
sacrificing too much accuracy. In contrast, polar hydrogens (on –OH groups for example)
must be treated explicitly and individually, since they can engage in hydrogen bonding.
6.2.2
Preparing the ligand input file
1. From the toolbar, execute: Ligand→input→open→drug.pdb→OK. This will load
the pdb file and automatically assign a polarity to each atom.
2. Hide protein in the dashboard panel (the white area), zoom and center onto
ligand. Zoom and rotation work with the mouse wheel; movement in the plane
works with the right mouse key.
3. Bond rotations: From the toolbar, execute
Ligand→Torsion tree→choose torsions. Bonds considered rotatable are highlighted in green, non-rotatable ones in magenta.
There is one non-rotatable bond that is next to a phenyl ring; click on it to make it
rotatable. Done.
4. Save: From the toolbar, execute Ligand→output→pdbqt, save file as drug.pdbqt.
When loading the ligand, hydrogen atoms get assigned automatically, so we don’t have
to do it manually in this case.
6.2.3
Defining the search area
1. Toolbar – Grid→Grid box. This brings up a dialog with cheesy “thumbwheel”
controls, and a cube in three colors. This cube represents the search area.
2. First, adjust the units: Turn up the “spacing” thumbwheel to 1.0 by stroking it
from left to right with the mouse.
3. Adjust the center coordinates. Don’t use the thumbwheel controls for these, but
the text fields; update the display with <enter> (Groan. So user friendly.)
4. Adjust the dimensions. Note that the colors of the controls and of the surfaces of
the box correspond to one another.
Restricting the search area in which Vina is supposed to look for docking sites avoids a
lot of unnecessary computation.
6.2.4
Create the Vina configuration file
center_x = 15
center_y = 50
center_z = 20
size_x = 30
size_y = 30
size_z = 30
receptor = receptor.pdbqt
ligand = drug.pdbqt
log = log.txt
exhaustiveness = 20
cpu = 1
Vexingly, the program does not let us save our hard work directly. So, we read the
coordinates of the box from the screen and type them into a text file. While we are at it,
we also add the names of the receptor and ligand files that we want to dock.
The log file will contain messages that vina produced during its run. If all goes well, we
can ignore it. The exhaustiveness parameter, here set to 20, can be given higher values
for more thorough optimization. The cpu parameter specifies the number of CPUs that
you want to let Vina use. If you plan on doing other stuff while Vina is running, keep at
least one to yourself (for example, let Vina use 3 out of 4 available CPUs).
Save this file as vina.cfg and you are ready to run the docking.
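If you would rather not type the file by hand, a few lines of Python can write it. This is merely a convenience sketch: the keys and values are the ones from the example above, while the variable names settings and config_text are invented here.

```python
# Values copied from the example configuration in the text
settings = [
    ('center_x', 15), ('center_y', 50), ('center_z', 20),
    ('size_x', 30), ('size_y', 30), ('size_z', 30),
    ('receptor', 'receptor.pdbqt'),
    ('ligand', 'drug.pdbqt'),
    ('log', 'log.txt'),
    ('exhaustiveness', 20),
    ('cpu', 1),
]

# One "key = value" line per setting
config_text = ''.join('%s = %s\n' % (key, value) for key, value in settings)

with open('vina.cfg', 'w') as handle:
    handle.write(config_text)

print(config_text)
```

Keeping the box coordinates in a script like this makes it easy to rerun the docking later with a shifted or resized search box.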
6.2.5
Run Vina
vina --config vina.cfg
This will take a while; a progress bar indicates how long the program will take to execute.
After it terminates, you should have a file drug_out.pdbqt; this should contain a series
of docking configurations, with the energetically most favourable one at the top.
6.2.6
Inspect the results in Pymol
• Load the file drug_out.pdbqt
• Load the receptor file and the X-ray coordinates of the ligand
• Highlight and contrast the experimental drug molecule with the one obtained by
docking
• Move through the various docked conformations with the arrow buttons at the
bottom right of the lower Pymol window
The best conformation should be pretty close to the one obtained by X-ray crystallography.
Chapter
7
Python programming
7.1
Introduction
Why? Let’s ask the German poet Friedrich Schiller:
Die Axt im Haus ersetzt den Zimmermann.
Translation:
An axe at home is worth a carpenter.
7.1.1
Python vs. Gnuplot or LaTeX
• All are programmable systems
• Gnuplot and LaTeX are adapted to special purposes, Python is a general purpose
language – manipulates arbitrary data in arbitrary ways
• Python can be used for text processing, serving web content, number crunching,
image processing – you name it
7.1.2
Is programming easy?
Yes. . .
• Simple tasks can be accomplished with simple programs
• Many code libraries are available that we can use as building blocks for our own
programs with little effort
No. . .
• Programs cannot be simpler than the problems they aim to solve
• Large programs need careful and sound design in order to remain manageable
In this first session, we will go over a few elementary concepts. Nevertheless, you will
see that even with very basic elements we will manage to write a program that translates
a DNA sequence to a protein sequence.
7.1.3
Why Python?
• Relatively easy to learn – increasingly used as an introductory language in university classes
• Emphasis on readability and on sane, consistent syntax
• Good all-round capabilities
• Many libraries available for scientific computing
• Well-designed – suitable both for small, one-off scripts and large, complex programs
• Well-behaved when program errors occur – essential, since errors happen often,
particularly during development
Python is nice but it does have some idiosyncrasies. As an alternative, you might
consider Ruby, which seems a bit simpler in several ways. However, it does not have the
same range of libraries for scientific computing. On balance, Python is a better choice.
7.1.4
How Python programs are created and executed
1. Python code is written and saved in a simple text file named something like
myprogram.py
2. We ask Python to execute it: python myprogram.py
3. Python reads the text file and translates or compiles it into an intermediate (byte
code) format
4. Python executes the byte code
For creating Python code files, we can use any text editor, but it will be helpful to use one
that assists us with Python-specific syntax highlighting. The editors Gedit (part of the
Gnome desktop) or Kate (part of the KDE desktop) are good enough and straightforward
to use. While we are at it: You should configure your editor to insert spaces instead of
tabs for indentation, with 4 spaces per indentation level. Indentation is important in
Python, and these settings are most widely used and recommended.
The execution model described above makes Python an interpreted language, as opposed
to e.g. C or Fortran, which are compiled languages. One practical consequence is that we
always need to have Python installed not only to develop but also to run Python code.
Another consequence is that Python code often runs slower than C or Fortran.
In certain situations, Python will store the byte-code format in a separate file, such as
myprogram.pyc; it will update such files as needed, so we can just leave them alone.
7.2
First steps
Bring up your text editor and type:
print "Hello, world!"
Save the file as hello.py and run it in a console:
python hello.py
Instead of calling python ourselves, we can also tell bash to do it for us, by inserting
the so-called “hash-bang” line at the very top of hello.py:
#!/usr/bin/python
After making it executable (chmod +x hello.py), we can then invoke the file like this:
./hello.py
If we insert the hash-bang line and store the program in some folder
on the PATH, we can invoke it from anywhere, just like gnuplot or any other script or
program.
7.2.1
Python’s interactive mode
If you just type
python
you will see something like this:
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license"
for more information.
>>>
Type print 2 + 3 and enter. The code is executed immediately, and the result is
displayed.
Type <ctrl-d> to exit. The interactive mode can be useful for trying things out, but
for anything that is longer than one or two lines you want to save the code as a file, so
that you can change it and observe the effect of your changes.
7.2.2
Naming pieces of data: Variables
a = 2
b = 3
c = a + b
print c
d = c**a
print d
firstname = 'john'
lastname = 'doe'
print firstname + ' ' + lastname
Variables are essential in any kind of programming language. They can be made up
from thin air and used right away; this works much in the same way as we have used
them in Gnuplot. Variable names must start with a letter or underscore and can be
continued with letters, underscores and digits. For example, joe, _joe_123 are valid
names, while 123joe is not. Also note that variable names, or in fact all names in
Python, are case-sensitive – to Python, joe is distinct from Joe and from JOE.
You can use almost any word to name a variable. It is advisable to use meaningful
variable names – a variable name should reflect the meaning of the data it contains.
Choosing good names can make a big difference to the readability of your code. This
also goes for the names of functions or objects (see later).
7.3
Keywords and builtins
7.3.1
Some names are special
If we try:
class = 'destroyer'
we get
  File "<stdin>", line 1
    class = 'destroyer'
          ^
SyntaxError: invalid syntax
This is because class is a Python keyword – a word that is reserved by Python for its
own use.
7.3.2
Python keywords
and       as        assert    break
class     continue  def       del
elif      else      except    exec
finally   for       from      global
if        import    in        is
lambda    not       or        pass
print     raise     return    try
while     with      yield
. . . not meant for immediate memorization. Your editor should highlight them in bold
face or in some special color. These words have a fixed meaning and cannot be used in
any other way. If you try, Python will flag a syntax error, just as in the example above.
7.3.3
Python built-ins whose names are not protected
dir(__builtins__)
print reload
reload = 55
print reload
Save this file as junk.py and run
pychecker junk.py
If you inadvertently rebind the name of a built-in, your own code
may work fine, but only as long as the code in the libraries you may be using does not
depend on the original meaning of that name.
I fail to see the benefit of this – flexibility is good, but allowing the user to clobber
built-in names is overdoing it. So, watch out for it. Happily, Pychecker helps you with
that – so it is a good idea to run it over your code files, particularly while you are still
learning Python.
Good editors should also recognize built-in names and colorize them as part of the
syntax highlighting. If a name you chose unexpectedly changes color, modify it until it
doesn’t.
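The following sketch shows how rebinding a built-in name can break code that sits elsewhere in the same file. It uses len rather than reload, because len exists in every Python version; the function count_items is invented for the example, and the code is meant to be run as a plain script.

```python
def count_items(items):
    return len(items)          # relies on the built-in len

print(count_items([1, 2, 3]))  # works as expected

len = 55                       # rebinds the name len in this module
try:
    count_items([1, 2, 3])
except TypeError as error:
    print('broken: ' + str(error))

del len                        # remove our binding; the built-in is visible again
print(count_items([1, 2, 3]))
```

The try and except keywords catch the error so that the script can continue; they are used here only to keep the demonstration in one piece.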
7.4
Data types
Try out the following in an interactive python prompt:
a = 5
b = 3
c = 'john'
d = 'doe'
print a + b                  # 8
print c + d                  # 'johndoe'
print a + c                  # throws TypeError
print "type(a):", type(a)    # <type 'int'>
print "type(c):", type(c)    # <type 'str'>
Variable names can be made up from thin air and used right away; this works in the
same way as we have seen in Gnuplot.
Variable values have types that determine what operations can be applied to them. In
the above example, the + operator effects addition of numbers, and concatenation of
strings (words). If we try to apply + to a string and a number, Python determines that
the two operands are of different type and that the + operation between them is not
defined. It therefore “throws” a TypeError exception.
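As a small illustration, the sketch below catches the exception instead of letting the program stop, and then shows the two explicit conversions that make the intent unambiguous. The variable names are invented; try and except are not discussed in these notes up to this point, so take them as a preview.

```python
a = 5
c = 'john'

try:
    result = a + c                 # int + str is not defined
except TypeError as error:
    result = None
    print('caught: ' + str(error))

# Explicit conversion makes the intent clear
as_text = str(a) + c               # concatenation of two strings
as_number = a + int('3')           # addition of two numbers
print(as_text)
print(as_number)
```

Python never guesses which of the two meanings of + we had in mind; we have to convert one operand ourselves.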
7.5
Working with more data: Containers
Names of individual pieces of data are useful, but we also need to work with larger and
variable amounts of data. For this, we use containers. The most important containers
are lists and dictionaries. A list can be created as follows:
7.5.1
Lists
a = [1, 2, 3, 4, 5, 6, 7]
print a        # a is a list
a.append(’joe’)
print a
a.pop()
print a
del a[5]
print a
a is mutable – we can change its contents in place. Lists can contain arbitrary objects –
numbers, first names, other lists, whatever – alone or in combination.
7.5.2
How variables work with mutable objects such as lists
a = range(5)
print a        # a is a list
b = a
print b        # b and a now point to the same list
c = list(a)
print c        # c is a copy of a
a.pop()
print a
print b
print c
Assignments like a = b create a second reference to the same object – if we change the
object through one reference, we will see the change through the other one also.
As in Gnuplot, we can insert comments into Python code. Anything preceded by a #
character is ignored. You should get into the habit of inserting comments into your code
that explain what the code is supposed to be doing, or why you chose this particular
way of doing things over another.
7.5.3
Testing for identity and equality
a = range(5)   # create a list
b = a          # create another reference to it
c = list(a)    # create a copy of the list
print ’a equals b?’, a == b
print ’a equals c?’, a == c
print ’a same data as b?’, a is b
print ’a same data as c?’, a is c
print id(a)
print id(b)
print id(c)
As we have just seen, with mutable objects such as lists, the distinction of equality and
identity becomes important. Identity can be tested for with the is keyword, whereas
equality is tested with the == operator. Identity implies equality, but not vice versa.
Notice the difference between assignment (a = 5) and comparison (a == 5). Assignment sets a variable to a new value, whereas comparison tests the current value for
equality to another.
The difference between equality (==) and identity (is) is as follows: Two lists (or other
objects) are equal if they contain the same value or values. They are identical only if
they are one and the same, stored at the same location in memory. If we create a new
list by copying an existing one, for example with the list function, the original and the
copy will be equal, but not identical, since the copy is stored in a new place in memory.
By default, Python does not create a copy – an assignment like b = a will always simply
give us a new reference to the same data. If we need a copy, we must create it explicitly.
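A compact way to see the difference is to change the list through one name and then look at it through the others. The sketch below also shows a full slice as a second way of making an explicit copy; it uses print with parentheses so that it also runs under Python 3.

```python
a = [0, 1, 2, 3, 4]
b = a            # second reference to the same list
c = list(a)      # explicit copy via the list function
d = a[:]         # explicit copy via a full slice

a.append(5)      # change the list through the name a
print(b)         # the change is visible through b
print(c)         # the copies are unaffected
print(d)
```

Note that such copies are shallow: if the list contains other lists, the inner lists are still shared between original and copy.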
7.5.4
List slicing
a = range(10)
print a
b = a[0:5]
print b
c = a[5:-1]
print c
d = a[:]
print d
e = a[::2]
print e
List slicing is an efficient way to extract parts of a list. Slicing also works with strings
(see later).
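The general form of a slice is a[start:stop:step], where the element at index stop is not included and each of the three parts may be omitted. The sketch below spells out what each slice selects; the variable names are invented, and list(range(10)) is used so that the code also runs under Python 3.

```python
a = list(range(10))    # [0, 1, ..., 9]

first_five = a[0:5]    # indices 0 to 4; the stop index is excluded
middle = a[5:-1]       # index 5 up to, but not including, the last element
whole = a[:]           # both bounds omitted: a copy of the whole list
evens = a[::2]         # every second element
backwards = a[::-1]    # negative step: reversed copy
codon = 'atgtat'[0:3]  # strings are sliced the same way

print(first_five)
print(middle)
print(evens)
print(backwards)
print(codon)
```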
7.5.5
Tuples
a = (1,2,3,4)        # make a tuple from scratch
print type(a)
b = (1)              # does NOT make a tuple
print b, type(b)
c = (1,)             # make a tuple with one element
print c, type(c)
print a[1:3]         # tuples can be sliced
a.append(5)          # fails - tuples are immutable
b = list(a)          # make a list from a tuple
c = tuple(b)         # make a tuple from a list
Tuples are similar to lists but immutable – you can’t add elements to tuples or remove
them. There are a few cases in which tuples must be used instead of lists; we will get to
that.
7.5.6
Sets
a = 'john doe was born in 1806'.split()
print a              # a is now a list of strings
sa = set(a)          # create a set from a
print sa
b = list(a)          # copy a
b.reverse()          # reverse order of elements
print b
sb = set(b)
print sb
c = a * 2            # merge two copies of a
sc = set(c)
print sc
print sa == sb == sc
The first line illustrates the split method of string objects. Strings are just pieces of
literal text. In Python, they are objects, that is they contain both data (the text itself) and
code that we can use to operate on those same data. A unit of code that is attached to
an object is called a method. We will get back to this topic later.
The remainder of this example illustrates the key properties of sets:
1. Order does not matter, and
2. each element occurs only once.
Sets are useful if you want to ensure that each single piece of data in a collection is
unique.
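A typical use is to remove duplicates from a list and to compare two collections. The sketch below goes slightly beyond the example above by showing the set operators for intersection, union and difference; the word list and the two letter sets are invented.

```python
words = 'the cat and the dog and the bird'.split()
unique = set(words)                # duplicates collapse
print(len(words))
print(len(unique))

# Membership tests and set arithmetic
print('cat' in unique)
a = set('ACDEF')                   # a string is split into its characters
b = set('DEFGH')
print(sorted(a & b))               # intersection: in both
print(sorted(a | b))               # union: in either
print(sorted(a - b))               # difference: in a but not in b
```

Since sets have no defined order, the results are passed through sorted before printing.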
7.5.7
Dictionaries
phonenumbers = {'john': 911, 'jane': 119, 'bill': 191}
print phonenumbers['john']     # look up john's phone number
phonenumbers['jim'] = 919      # assign new value to new key
print phonenumbers
phonenumbers['john'] = 119     # assigning to an existing key
print phonenumbers             # overwrites the previous value
print phonenumbers.keys()      # get a list of all keys
print phonenumbers.values()    # get a list of all values
print phonenumbers.items()     # list of all key-value pairs
Like lists, dictionaries are a real workhorse and are used all over the place in Python
programs. Dictionaries, or dicts for short, let us connect arbitrary pieces of data to one
another. The example above shows that each key can only occur once – if we assign a
new value to an existing key, the old value is forgotten.
In contrast, values need not be unique. In our example, John moved in with Jane, and
the same phone number was listed with both their names.
7.5.8
Tuples vs. lists as dictionary keys
phones = {
    ('joe','home') : 119,
    ('joe','work') : 191,
    ('jane','home') : 119,
    ('jane','work') : 911
}
phones = {                     # this won't work
    ['joe','home'] : 119,
}
If we need to combine several pieces of data into one dictionary key, we must use tuples.
This is one place where lists simply won’t work.
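The reason is that dictionary keys must be hashable, and mutable objects such as lists are not. The sketch below shows the error Python raises and the usual way around it, which is to convert the list to a tuple first. The try and except keywords only serve to keep the script running past the error.

```python
phones = {}
phones[('joe', 'home')] = 119          # a tuple works as a key

try:
    phones[['joe', 'work']] = 191      # a list does not
except TypeError as error:
    print('rejected: ' + str(error))

# If the data arrive as a list, convert to a tuple first
entry = ['jane', 'work']
phones[tuple(entry)] = 911
print(sorted(phones.keys()))
```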
7.6
Repeated execution: Loops
a = range(10, 30, 5)    # [10, 15, 20, 25]
for x in a:             # x adopts the value of each
    print x             # item in the list in turn
y = 10
while y > 0:            # condition is tested before
    print y             # each run of the loop
    y = y - 1           # decrement y
In all previous examples, each piece of code was executed only once. Sometimes, however, we want a line, or a block of code to execute repeatedly, for example to apply some
operation to each item in a list or dictionary. This is where loops come in.
There are two loop constructs: The for loop and the while loop. The for loop is well
suited for iterating over a container. The while loop works differently – it is controlled
by a condition that is evaluated before each run of the loop. If we do not want the loop
to continue until doomsday, the code inside the loop must change a variable so that the
controlling condition at some point becomes false.
The code that is supposed to be inside the loop, meaning that it should be executed in
toto for each loop iteration, is indicated by indentation. Be sure to indent each line to
the exact same extent – by 4 empty spaces exactly.
7.6.1
Iterating over a dictionary
from gencode import genetic_code
print genetic_code
for codon in genetic_code:
    amino_acid = genetic_code[codon]
    print codon, amino_acid
inverted = {}
for codon, aa in genetic_code.items():
    inverted[aa] = codon
print inverted
print len(genetic_code), len(inverted)
A new concept: Importing code from other files, or modules. In this case, we simply
import a dictionary that was defined in the file gencode.py. Note that in the import
statement we omit the .py extension.
There are several ways to iterate over a dictionary; the second one shown here is particularly useful. The len (for length) function counts the items in the two dictionaries.
Can you figure out why the inverted dictionary is shorter than genetic_code?
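If you want to experiment without the gencode file, the sketch below builds a stand-in dictionary for the standard genetic code from the usual TCAG ordering, with * marking stop codons. The layout of the real genetic_code dictionary may differ, so treat the names and the uppercase keys as assumptions. It also shows one way to invert the dictionary without losing codons, by collecting them in lists.

```python
# Stand-in for gencode.genetic_code: the standard genetic code, built
# from the usual TCAG ordering ('*' marks a stop codon)
bases = 'TCAG'
amino_acids = ('FFLLSSSSYY**CC*W' 'LLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRR' 'VVVVAAAADDEEGGGG')
codons = [x + y + z for x in bases for y in bases for z in bases]
genetic_code = dict(zip(codons, amino_acids))

# Simple inversion keeps only one codon per amino acid
inverted = {}
for codon, aa in genetic_code.items():
    inverted[aa] = codon

# Collecting codons in a list keeps all of them
codons_for = {}
for codon, aa in genetic_code.items():
    codons_for.setdefault(aa, []).append(codon)

print(len(genetic_code))
print(len(inverted))
print(sorted(codons_for['L']))
```

The setdefault method returns the list stored under the key aa, creating an empty one first if the key is new.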
7.6.2
Iterating over strings
from string import ascii_lowercase
print ascii_lowercase    # 'abcd...'
for character in ascii_lowercase:
    print character
length = len(ascii_lowercase)
for i in range(length):
    print i, ascii_lowercase[i]
for i, c in enumerate(ascii_lowercase):
    print i, c
Iterating over a string—that is, using a string as if it were a container inside a for loop—
gives us one character in each run of the loop. The second example shows that we can
also index into a string by numbers. Finally, the third example shows how to obtain the
running number as well as the character without first defining a list of numbers: The
enumerate function does it for us. The enumerate function also works with lists or
tuples.
7.6.3
More fun with strings
s = '''atgtatacta aaaattttag taattccaga
atggaagtaa aaggtaataa cggctgttct'''
fragments = s.split()
print fragments
joined = "".join(fragments)
print joined
print joined[0:3], joined[3:6], joined[6:9]
codons = []
for i in range(0, len(joined), 3):
    codon = joined[i:i+3]
    print i, codon
    codons.append(codon)
print codons
Note the triple-quote syntax that allows us to define strings that span more than one
line. It is also useful if the string itself contains single quote characters. Instead of
triple single quotes, we can also use triple double quotes.
You should run this code and see what it does, and make sure you understand how it
comes about. Also try ’-’.join(fragments) to understand what the .join method
does.
7.6.4
Exercise: Translating DNA to protein
1. Use the genetic_code dictionary to build a program that translates the DNA sequence into the corresponding protein sequence.
2. Build a program that reverse-translates the protein sequence back to a DNA sequence encoding it.
We now have all the tools to translate a DNA sequence into protein, and back again. Can
you figure it out?
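In case you get stuck, here is one possible approach, written as a sketch. It does not use the gencode file but builds a stand-in dictionary for the standard genetic code (uppercase codons, * for stop), so the names and the layout are assumptions; the DNA string is the first half of the sequence from the previous section.

```python
# Stand-in for gencode.genetic_code (standard genetic code, TCAG ordering)
bases = 'TCAG'
amino_acids = ('FFLLSSSSYY**CC*W' 'LLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRR' 'VVVVAAAADDEEGGGG')
codons = [x + y + z for x in bases for y in bases for z in bases]
genetic_code = dict(zip(codons, amino_acids))

def translate(dna):
    """Translate a DNA string codon by codon; an incomplete last codon is ignored."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(genetic_code[dna[i:i + 3]])
    return ''.join(protein)

# For the way back, pick one (arbitrary) codon for each amino acid
inverted = {}
for codon, aa in genetic_code.items():
    inverted[aa] = codon

def reverse_translate(protein):
    return ''.join([inverted[aa] for aa in protein])

dna = 'atgtatactaaaaattttagtaattccaga'
protein = translate(dna)
print(protein)
back = reverse_translate(protein)
print(back)
print(translate(back) == protein)
```

The reverse-translated DNA usually differs from the original, because most amino acids are encoded by more than one codon; translating it again nevertheless gives the same protein.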
7.7
List comprehensions
. . . can be used to quickly create a list by iterating over another container. Example:
Collect all keys from the genetic_code dictionary and convert them to lowercase.
from gencode import genetic_code
# use a loop
keys = []
for key in genetic_code:
    keys.append(key.lower())
# now use a list comprehension
keys2 = [x.lower() for x in genetic_code]
assert keys == keys2    # confirm that both are equal
List comprehensions offer a more concise syntax for simple loops. Where more complex
operations or conditions are involved, it is more readable to write the loops explicitly.
Note the somewhat backward looking syntax of the list comprehension. In this case,
we use the expression x.lower() before we have even assigned a value to x. In any
other context, this would cause an error: NameError: name 'x' is not defined.
Here, this works because internally Python translates a list comprehension to a loop
construct like the one above.
The assert statement ensures that some condition is fulfilled, which can be helpful
particularly in development. Try: assert 1==2, ’numbers unequal’
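A list comprehension can also carry a condition that filters the items. The sketch below writes the same selection once as a loop and once as a comprehension, and uses assert to confirm that both agree; the codon list is invented for the example.

```python
codons = ['atg', 'tat', 'taa', 'act', 'tag']
stops = ('taa', 'tag', 'tga')

# Loop version
coding = []
for codon in codons:
    if codon not in stops:
        coding.append(codon.upper())

# Equivalent list comprehension with a condition
coding2 = [codon.upper() for codon in codons if codon not in stops]

assert coding == coding2, 'loop and comprehension disagree'
print(coding2)
```

With one condition, the comprehension is still easy to read; with several nested conditions, the explicit loop tends to be clearer.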
7.7.1
Exercise: Use a list comprehension to translate a DNA sequence
from gencode import genetic_code
from random import shuffle
DNA = genetic_code.keys()
shuffle(DNA)
print DNA
translated = []
for codon in DNA:
    translated.append(genetic_code[codon])
translated2 = [...] ?
assert translated == translated2, 'you goofed'
Solution: translated2 = [genetic_code[codon] for codon in DNA]. Again, we use
the codon variable before assigning it a value.
7.8
Nested containers and loops
7.8.1
Nested containers
from atoms import atom_names
import pprint # pretty printing of data structures
pprint.pprint(atom_names, width=100)
pprint.pprint(atom_names.items(), width=100)
The pprint module exists in the Python standard library. It is not part of the language
itself, so the code has to be imported. However, it is available on each Python installation.
The Python standard library is very rich and provides solutions for many problems. For
serious use of Python, it is important to become acquainted with it. Documentation is
available on-line at http://docs.python.org/library/.
The atom_names dictionary specifies the names of atoms that occur in the amino acid
residues in pdb files. In this dictionary, each value is a list. If we extract the items from
this dictionary, the lists wind up inside the tuples, which in turn are the members of a
list.
We can nest containers to arbitrary depth. By the way, Python lists are one-dimensional;
we can mimic multi-dimensional arrays using nested lists. For numerical heavy lifting
though, it is preferable to use the numpy package, which introduces proper array and
matrix data types. Numpy is not part of the standard library but is available through
your package manager – as are many other useful and powerful libraries.
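As a small illustration of mimicking a two-dimensional array with nested lists, consider the following sketch. The numbers are made up, and the code uses only built-in lists, so it needs no extra packages.

```python
# A 2x3 "matrix" mimicked with a list of lists (made-up numbers)
matrix = [[1, 2, 3],
          [4, 5, 6]]

# the first index selects the row (an inner list),
# the second index selects the column within that row
corner = matrix[1][2]
print(corner)

# transposing requires nested iteration over rows and columns
n_rows = len(matrix)
n_cols = len(matrix[0])
transposed = [[matrix[r][c] for r in range(n_rows)] for c in range(n_cols)]
print(transposed)
```

With numpy, the same data would be held in a single array object created with numpy.array(matrix), and the transpose would be available through its T attribute, which is one reason why numpy is preferable for numerical work.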
7.8.2 Nested loops
Nested containers often go with nested loops. Example: For each amino acid, extract
the element (the first letter) from each of its atom names, and construct a
new dictionary.
from atoms import atom_names
from pprint import pprint

elements = {}
for aa, names in atom_names.items():
    elm = []
    for name in names:
        elm.append(name[0])
    elements[aa] = elm

pprint(elements)
Note the nested indentation that defines which lines are repeated in the outer loop and
which in the inner loop.
7.8.3 Rewrite this using list comprehensions?
elements2 = {}
for aa, names in atom_names.items():
    elements2[aa] = [name[0] for name in names]

elements3 = dict([(aa,[name[0] for name in names]) \
                  for aa, names in atom_names.items()])

assert elements == elements2 == elements3
While the elements3 version is the shortest, the elements2 version is simpler and
clearer. Terse one-liners can be nice as brain teasers but are not really recommended for
real programs – readability beats terseness.
7.9 Conditional execution
firstname = "Joe"
if firstname in ("Joe","James","Jim"):
    greeting = "Sir:"
elif firstname in ("Joan", "Jane", "Jill"):
    greeting = "Madam:"
else:
    greeting = "Sir/Madam:"
print "Dear", greeting
Conditional execution is an important building block of almost any program. The
if clause takes an expression that it subjects to boolean evaluation, that is, it
determines whether this expression is true or false. If it is true, the code controlled by the if
clause executes; otherwise it doesn't. This conditional code is indicated by indentation;
in the above example, it is just one line (greeting = "Sir:").
The elif clause takes effect only if the preceding if clause did not apply. Again, it
requires a statement or expression that is tested for truth or falsehood.
Finally, the else clause takes effect only if all preceding clauses failed. Unlike the if
and the elif clauses, the else clause does not take any test expression; once it is reached,
it executes unconditionally.
Only the if clause is required. There can be any number of elif statements, which will
be evaluated in order. The else statement is optional, too; it can occur only once.
7.9.1 Conditional execution inside a loop
for i in range(1,11):
    if i % 3 == 0:       # % - remainder of division
        print i, 'divisible by 3'
    elif i % 2 == 0:
        print i, 'divisible by 2'
    else:
        print i, 'divisible by neither 2 nor 3'
Conditional execution very often occurs inside loops, such that the if clause is controlled by one or more variables that change in each loop iteration.
7.10 Boolean evaluation of expressions
for thing in [0, 1, -1, 0.0001, [], [1,2], [0], \
              {}, {0:0}, 'joe', '', None, False, True]:
    print str(thing).ljust(10), bool(thing)

print False, True, False and True    # the and operator
print False, True, False or True     # the or operator
print True, not True                 # the not operator
print False, not False

print (False and False) or True      # 'and' or 'or' - which
print False and (False or True)      # one has precedence?
print False and False or True
Containers or strings that are not empty, and numbers that are not zero evaluate to True
in a boolean context. Boolean evaluation is implicitly performed by if...elif...else
statements and in while loops. We can also explicitly invoke it with the built-in bool
function; for example, bool(-3) returns True.
The False and True values used in the and...or examples are dummies. In real life,
and and or would be used like so:
if (firstname == 'Joe' or firstname == 'Jim') and age > 15:
    greeting = "Sir:"
The two == comparisons and the > comparison are then each evaluated to True or False,
and the results of that evaluation are combined with and and or.
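The question of precedence between and and or can be checked directly. The following sketch compares the unparenthesized expression with both possible groupings, and then shows why the parentheses matter in a realistic test. The values of firstname and age are made up for illustration.

```python
# 'and' binds more tightly than 'or', so these two are the same
a = False and False or True
b = (False and False) or True
# explicit parentheses change the grouping and the result
c = False and (False or True)
print('%s %s %s' % (a, b, c))

# Hypothetical values for illustration
firstname = 'Joe'
age = 12

# with parentheses, the age test applies to both names
with_parens = (firstname == 'Joe' or firstname == 'Jim') and age > 15
# without them, the age test applies to 'Jim' only
without_parens = firstname == 'Joe' or firstname == 'Jim' and age > 15
print('%s %s' % (with_parens, without_parens))
```

When in doubt, it is safest to write the parentheses even where they are not strictly needed; they cost nothing and make the intent obvious.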
7.10.1 Alternative formulation of conditionals
for i in range(1,11):
    if not i % 3:
        print i, 'divisible by 3'
    elif not i % 2:
        print i, 'divisible by 2'
    else:
        print i, 'divisible by neither 2 nor 3'
7.10.2 Exercise: What about 6?
One problem with the loop example in section 7.9.1 is that the number 6 gets evaluated
for divisibility by 3 only, but not for 2, since the elif clause is skipped once the if
clause has applied. Can we rewrite the previous example
so that each number gets evaluated for divisibility by both 2 and 3?
7.11 Functions
7.11.1 Defining functions
def divisible_by(dividend, divisors):
    '''
    Tests a dividend for divisibility
    by a list of divisors. Returns
    a list with all valid divisors.
    '''
    divs = []
    for d in divisors:
        if not dividend % d:
            divs.append(d)
    return divs

for i in range(1,11):
    result = divisible_by(i, [2,3])
    print i, 'divisible by ', result
We have already used a couple of built-in functions, for example range or dir. Here we
see an example of how to define functions ourselves. The function is declared with the
def statement, which determines the name of the function, as well as the arguments
that the function expects. Arguments in this context are data that are handed over to
the function when the function is called. In this example, the function calls occur inside
the loop, successively on each of the numbers from 1 to 10.
The body of the function, that is the code contained within, is again defined by indentation. In this example, the first part of the function body is a string – a short text that
explains the purpose of the function. This is not necessary but is good practice. This
so-called doc-string will be displayed by Python if we type (in an interactive session)
help(divisible_by).
The last statement in the function is return divs. This means that the divs list that
was computed inside the function should be handed back to the piece of code that
called the function. In our example, the divs list computed inside the function will be
assigned to the variable result that was used in the for loop, outside the function.
Note that the return statement is not mandatory. Some functions don’t produce any
data and accordingly don’t return any data either. An example is the pprint function
that we used above (section 7.8.1). If we say, for example, b = pprint(a), then b will
be equal to None.
Another point to note is that variables that were declared inside the function will not be
visible from the outside, and will cease to exist once the function has finished executing
and returned control to the code outside. In the example, the variable divs is declared
inside the function and exists only within it.
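The limited lifetime of divs can be demonstrated with a short sketch. It repeats the function from above and then tries to access divs from outside; the try-except construct used to catch the resulting error is explained in section 7.13. The variable visible is made up for this illustration.

```python
def divisible_by(dividend, divisors):
    '''Return the members of divisors that divide dividend evenly.'''
    divs = []                  # divs is local to the function
    for d in divisors:
        if not dividend % d:
            divs.append(d)
    return divs

# the returned list lives on under the name result
result = divisible_by(6, [2, 3, 4])
print(result)

# outside the function, the name divs is unknown
try:
    divs
    visible = True
except NameError:
    visible = False
print('divs visible outside: %s' % visible)
```

The list object itself survives because result now refers to it; only the name divs disappears when the function returns.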
7.11.2 Functions with default arguments
def greet(name, greeting = 'Hello'):
    print greeting, name

greet('Joe', 'Good morning')   # prints 'Good morning Joe'
greet('Jane')                  # prints 'Hello Jane'

def greet(name='Joe', greeting='Hello'):
    print greeting, name

greet()                        # prints 'Hello Joe'
greet('Hi')                    # prints 'Hello Hi'
greet(greeting='Hi')           # prints 'Hi Joe'
This example shows how to define functions with default arguments. Note in particular
the last example: By default, any arguments that we supply are used in order of declaration, meaning in this case that a single argument is used as a value for name, not for
greeting. If we want to supply a value for greeting but not for name, we can pass the
argument by name, as in greet(greeting='Hi').
7.11.3 Exercise: Generating random passwords
Write the appropriate code for the following function:
def pwd(length=8):
    '''
    produce a random password consisting of
    a random sequence of any number of lowercase
    letters, but with the length defaulting to 8.
    '''
When done, try to extend the function so as to also use uppercase letters and digits.
We skipped this exercise in class – but it might be fun for you to try on your own. All
the necessary tools have already been provided.
7.12 Importing code
import sys                        # import the sys module
print sys.path                    # access a name defined in sys

from re import compile            # import a name from a module
r = compile('^joe$')              # use the imported name directly

from scipy.stats import poisson   # scipy is a package,
                                  # containing stats as one
                                  # of its modules

# we can also rename modules or other objects
# during import, for example to use shorter names
import some_module_with_very_long_name as sm
from some_module import very_long_name as vln
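This example illustrates the different forms of the import statement: importing a whole
module, importing a single name from a module or from a module inside a package, and
renaming a module or object during import.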
7.12.1 Importing self-written code
from divisible import divisible_by

help(divisible_by)   # press 'q' to quit the help view

for i in range(1,11):
    result = divisible_by(i, [2,3])
    print i, 'divisible by ', result
When we have written a function, we often want to make it available and reusable in
other code files. In this example, we save the code from section 7.11.1 into the file
divisible.py. Then, we can import it into another code file, or into an interactive session
as shown here.
Where does Python look for code files (modules)?
import sys
print sys.path
# on my machine, shows:
# ['', '/home/mpalmer/',
#  '/data/python_mine',
#  '/data/python_foreign', ...
Without further action, this will work only as long as the two code files are in the same
directory. To make our own Python code files (modules) available across directory
boundaries, we save them in a dedicated directory that we then add to sys.path, a
list that contains the names of all directories that Python will search for modules to be
imported.
The recommended way of adding directories to sys.path is to set the PYTHONPATH
environment variable. For example, in the .profile file in my home directory, I have
the following line:
export PYTHONPATH=:/data/python_mine:/data/python_foreign
On start-up, Python reads PYTHONPATH and adds the names listed there to sys.path.
In my case, I store my own code in /data/python_mine.
If, on Linux, you install Python libraries, such as scipy or matplotlib, through the package
manager, then sys.path will usually be updated as required. On the other hand, if you
download Python code from somewhere on the web, you have to take care of this
yourself. I deal with this by saving such code in my folder /data/python_foreign,
which I have included in my PYTHONPATH.
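Since sys.path is an ordinary list, it can also be extended from within a running program or interactive session. The sketch below shows this; the directory name is the one from the example above and serves only as a placeholder (the directory does not need to exist for the append to work). Unlike PYTHONPATH, this change lasts only for the current session.

```python
import sys

# Placeholder directory; substitute wherever you keep your modules
my_code_dir = '/data/python_mine'

# sys.path is a plain list, so we can inspect and extend it
if my_code_dir not in sys.path:
    sys.path.append(my_code_dir)

# subsequent import statements will now also search my_code_dir
print(my_code_dir in sys.path)
```

This can be handy for quick experiments, but for code that is used regularly, setting PYTHONPATH once is less error-prone than repeating the append in every script.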
7.13 Exceptions
from divisible import divisible_by

r = range(5)            # [0, 1, 2, 3, 4]
print r
print divisible_by(10, r)

r = list('joe')         # ['j', 'o', 'e']
print r
print divisible_by(10, r)
Exceptions happen if Python cannot perform an instruction. This can be due to faulty
syntax, in which case we get a SyntaxError, or because of faulty or missing data. In the
first example here, we get a ZeroDivisionError, since we pass a 0 as the first divisor
to the divisible_by function. In the second example, we get a TypeError, because we
attempt a numeric operation on letters.
If an exception occurs, execution of the program stops, unless we provide code to catch
the exception, that is deal with it and arrange for the program to recover and continue.
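Catching is done with the try-except construct. The notes do not reproduce the divisible_by_protected function that is used in the next section, so the following is only a sketch of what such a function might look like, under the assumption that divisors which cannot be tested should simply be skipped.

```python
def divisible_by_protected(dividend, divisors):
    '''
    Like divisible_by, but skips divisors that cannot be
    tested instead of letting the exception stop the program.
    (A guess at the idea; not the version used in class.)
    '''
    divs = []
    for d in divisors:
        try:
            if not dividend % d:
                divs.append(d)
        except (ZeroDivisionError, TypeError):
            # d was 0, or the operands were not numbers:
            # ignore this divisor and carry on with the next
            continue
    return divs

for i in [4, 'joe', 6]:
    result = divisible_by_protected(i, [0, 2, 3, 'jim'])
    print('%s divisible by %s' % (i, result))
```

Only the two exception types we expect are caught. Anything else still stops the program, which is usually what we want, since an unexpected error points to a bug rather than to bad input.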
7.13.1 Catching exceptions
from divisible import divisible_by_protected

for i in [4, 'joe', 6]:
    result = divisible_by_protected(i, [0,2,3,'jim'])
    print i, 'divisible by', result
The try-except construct allows our code to recover from bad input. This is particularly
important in those parts of a program that receive data directly from the outside world.
For example, a program that expects certain options and arguments as its input on the
command line should have a way to fail gracefully and print a helpful message in plain
English, instead of just confronting the user with a cryptic Python error traceback, that
is a printout of the code lines that were executing before the error occurred. In other
cases, we might just replace the missing or faulty input with some default values and continue.
7.14 Reading and writing files
Here is how you read the content of a file:
f = open('1X9P.pdb', 'r')     # open file in read mode
blurb = f.read()              # read the entire file in one
                              # go into the variable blurb
print len(blurb)              # number of bytes in the file

lines = blurb.splitlines()    # break it up into lines
for line in lines[:10]:       # print the first ten lines
    print line

f.seek(0)                     # 'rewind' for reading again
lines = f.readlines()         # read lines directly
for line in lines[:10]:
    print line
The open function takes the file name and the opening mode, which can be, among others, 'r'
for reading or 'w' for writing. If not given, the mode is 'r', so open('myfile') is
equivalent to open('myfile', 'r').
Note that the open function does not search the sys.path list that is used by import.
Files have to be in the current directory, or else the path to the file has to be explicitly
specified. For example, a file in a sub-directory can be opened with open('subdir/myfile'),
and one in the parent directory can be specified like open('../myfile').
The difference between the .splitlines and the .readlines method is that in the
first case the linebreaks will be stripped from the end of each line, whereas they are
retained with .readlines.
Since print adds a newline to each line it prints, we get an extra empty line between
each two lines of text in the second example.
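The difference between the two methods is easy to see on a small file. The sketch below writes a throw-away file with made-up content, so that it does not depend on 1X9P.pdb, and reads it back both ways.

```python
import os
import tempfile

# write a small throw-away file so the example is self-contained
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
f = open(path, 'w')
f.write('first line\nsecond line\n')
f.close()

f = open(path)                       # mode defaults to 'r'
via_split = f.read().splitlines()    # line breaks stripped
f.seek(0)                            # rewind
via_readlines = f.readlines()        # line breaks retained
f.close()
os.remove(path)

print(via_split)
print(via_readlines)

# stripping the trailing newline makes both versions identical
stripped = [line.rstrip('\n') for line in via_readlines]
assert stripped == via_split
```

Stripping the line break, as in the last step, is also the usual way to avoid the extra empty lines when printing lines obtained with .readlines.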
7.14.1 Writing files
infile = open('1X9P.pdb')          # read mode
outfile = open('atoms.pdb','w')    # write mode

for line in infile:
    if line.startswith('ATOM'):
        outfile.write(line)

infile.close()
outfile.close()
This example shows a shortcut for iterating over all the lines in a file: The for line in
infile statement is equivalent to for line in infile.readlines(), except that
the whole file is not read in one go, but each line is fetched separately. On today’s
computers with abundant memory (my first computer had a whopping 1 MB memory!)
and fast hard drives, this difference rarely matters.
When we are done with files, we should close them. If we don’t, Python will do it for us
either on program exit or when the last variable that points to the open file is destroyed
(for example because the function in which it was declared has returned).
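Python also offers the with statement (available by default since Python 2.6), which closes a file automatically at the end of the indented block. It is not used elsewhere in these notes, so take the following as an optional sketch; the file names and the PDB-like content are made up so that the example is self-contained.

```python
import os
import shutil
import tempfile

# made-up PDB-like content in a temporary directory
workdir = tempfile.mkdtemp()
inname = os.path.join(workdir, 'in.pdb')
outname = os.path.join(workdir, 'atoms.pdb')

with open(inname, 'w') as f:
    f.write('HEADER  demo\nATOM  1\nHETATM 2\nATOM  3\n')

# both files are closed automatically when their with block
# ends, even if an exception occurs inside it
with open(inname) as infile:
    with open(outname, 'w') as outfile:
        for line in infile:
            if line.startswith('ATOM'):
                outfile.write(line)

closed_after = infile.closed and outfile.closed

with open(outname) as f:
    kept = f.read().splitlines()
print(kept)

shutil.rmtree(workdir)    # clean up the temporary directory
```

The explicit close calls shown earlier do the same job; the with form merely makes it impossible to forget them.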
7.14.2 Files and functions: Exercise
Write the appropriate code for this function:
def filter_pdb(infilename, outfilename, record='ATOM'):
    '''
    Reads PDB file <infilename> and writes each line
    that starts with <record> into <outfilename>.
    '''
1. Modify the code from the last example so that it takes an optional outfilename
argument. If given, the function should write the filtered lines to a file of that
name. In any case, the filtered lines should also be returned in a list.
2. Modify the function so that it takes one or more record arguments; lines should
be retained if they match any of those.
Again, we skipped this in class – trying it for yourself can't hurt.