NOTES ON PLOTTING GRAPHS

The purpose of a graph is twofold: to present your data
in a compact, easy-to-digest form, and to present the
outcome of your data analysis (if required), also in a
compact form that can be directly compared to your
experimental data.
This implies a number of things:
a) If the data set is small, a table is often sufficient.
Three point graphs are irritating, but still check your
data visually.
b) The data is the focus: do not clutter the graph with
shading, borders, pictures, unnecessary annotation, or
any other nonsense Excel can throw up (unless you
intend to conceal the data; that’s what “business
graphics” means).
c) You need properly annotated and scaled axes, and
some way of identifying the data if you have multiple
data sets. Various symbols are preferred, but color works
as well. For histograms (rarely used in chemistry), you
will need shading or color.
d) Do not join data points with a line (see below).
e) Acceptable clutter (if required) is a fitted line, error
bars, confidence bounds (usually meaningless outside
an analytical lab) and, in the initial stages, fitting
statistics. When you finalise the graph, move the
statistics to the caption. Do not include r² (in reports for
me); it’s meaningless in most cases.
f) Do not ‘write’ on the graph: all text, including your
title, should be in the caption. I prefer the caption at the
bottom of the graph. The caption should be numbered
(usually Figure n).
g) Elaborate graphs are a sign of weak data and small
minds.
Data points and lines:
a) In chemistry you will nearly always use a scatter
plot. Do not confuse this with a line plot (x data equally
spaced and not numeric — e.g. names).
b) If you are plotting an equation then plot it as a line.
Typically, the equation will be your fit for the data;
superimpose it on your data. Identify the fit in the
caption, along with its statistics, or, if applicable, give
the function the line represents.
c) If you are plotting spectra, or any other data set
where the points are very close together (say less than
the width of a 12pt letter), then plot the data as a line,
not with symbols or a line plus symbols. Symbols are
just too messy at that density.
d) If you are plotting plain old data with big gaps, plot
symbols. DO NOT put a line between them. You do not
have data between the data points. Putting a line there
implies you have data there. That is fraud.
e) Sometimes, at the start of data analysis, you may
wish to sketch a line through the data to assist you with
the initial fit, particularly with noisy data. This is called
adding a line “to guide the eye”. It’s also called “chi by
eye” (chi is a goodness-of-fit parameter). Sometimes,
when you are leading an audience through your analysis
you might want to do this, but the final graph should
not use it. Leaving a “chi by eye” line on a graph
completely undermines your credibility. If you do not
identify it as a line to “guide the eye”, it’s out and out
fraud.
f) Under rare circumstances, you may have data that
cannot be fitted (well) to any function. In this case you
have to resort to an approach that is, at least, unbiased.
If the data is clean, you can use interpolation routines.
These are an admission of defeat, but it is acceptable to
use them to “fit” the data. If the data is noisy, you can
smooth it. There are many ways of doing this; the best
is to use a cubic spline under tension. As long as you
don’t pretend that these methods are a substitute for
proper data processing, or for further experimentation,
they shouldn’t get you into too much trouble.
g) Spectroscopic data is usually closely spaced and
smoothed, usually by Fourier methods. Much of what I
have said here is not applicable in these cases, because:
i) the smoothing methods are standardised and well
understood; ii) the artefacts they introduce are also easy
to recognize; iii) a little bit of smoothing is always
required to remove well known instrument artefacts; iv)
usually, only the peak positions are required, and
smoothing has little effect on these.
h) Sometimes, you have to smooth data, by whatever
method you can, to keep your fitting algorithms stable
(at least initially). That is data processing though, not
presentation. Smoothing data does not improve the fit,
just the algorithmic stability.
Just a couple more general comments. If you are plotting data with a high data density (points very close
together), e.g. spectral data, click on the data, right-click and select Format Data Series, then use the options to turn
off the markers and turn on the line. If you don’t, the plot will be very messy.
[Figure, two panels: “Line only option” and “Markers on”; x axis 500 to 640]
On the other hand, for normal (low data density) graphs, never turn on the line option: you do not have data between
the points, therefore you cannot plot it. If you must add a line, add a fitted line, which is essentially your hypothesis
as to what the data should look like.
A figure showing the hazards of ad
hoc interpolation. All these fits
are perfect fits. The data may
represent the classic line break, but it
may be a phase transition (bottom
left) or an oscillating function. Once
you take into account that the data
may be noisy, it becomes even less
clear: the top left may represent the
classic absorption function with
saturation, or a simple quadratic. The
top right implies two intersecting
straight lines.
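The point of the figure can be checked numerically. The sketch below is only an illustration: the six data points and the two candidate functions are invented for the purpose, not taken from the figure. Both functions pass through every point, yet they disagree between the points.

```python
import math

# Hypothetical data: six points lying exactly on y = x.
xs = [0, 1, 2, 3, 4, 5]
ys = [float(x) for x in xs]

def straight_line(x):
    return x

def oscillating(x):
    # Passes through the same points, because sin(2*pi*x) vanishes at integer x.
    return x + 0.5 * math.sin(2 * math.pi * x)

def max_misfit(model):
    """Largest absolute difference between the model and the data points."""
    return max(abs(model(x) - y) for x, y in zip(xs, ys))

print("misfit, straight line:", max_misfit(straight_line))
print("misfit, oscillating  :", max_misfit(oscillating))

# Between the data points the two "perfect fits" disagree.
x_mid = 2.25
print("at x = 2.25:", straight_line(x_mid), "vs", oscillating(x_mid))
```

Both misfits are zero to within rounding error, so the data alone cannot tell you which function to draw.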
AR2GHH!
[Figure: Perfect straight line with one outlier]
All the graphs on the left have an r² value of 0.7. In the physical
sciences this would be considered a poor fit; in the life and
social sciences, it is a very good fit. Whatever your perspective,
the r² value is misleading, because the data is perfect, except in
the first case, where there is one outlier, and in the second case,
which is complete and utter nonsense (for a start, uniform
random numbers don’t occur in nature).
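The first case is easy to reproduce. This is a minimal sketch with invented numbers (ten points on y = x, one of them moved), not the data behind the figure; the helper name `r_squared` is also just for illustration.

```python
def r_squared(xs, ys):
    """r^2 of an unweighted least-squares straight line through (xs, ys)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    syy = sum((y - y_mean) ** 2 for y in ys)
    # For a straight-line fit, r^2 = Sxy^2 / (Sxx * Syy).
    return sxy ** 2 / (sxx * syy)

xs = list(range(1, 11))
clean = [float(x) for x in xs]   # hypothetical perfect data, y = x
spoiled = list(clean)
spoiled[4] = 11.0                # one hypothetical outlier (true value 5.0)

print("r2, perfect line:", r_squared(xs, clean))
print("r2, one outlier :", r_squared(xs, spoiled))
```

With these numbers the single bad point drags r² from 1 down to about 0.7, even though the other nine points are still perfect.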
Rule 1. Never quote r² without plotting the data.
[Figure: The integral of uniform random noise]
The regression coefficient is just a measure of how closely the
data conforms to the selected fitting function (usually a straight
line). You choose that function, so it’s not an independent
measure. You must have a good theoretical or empirical reason
for your fitting function. If you have no reason to choose a given
function, then the line serves no other purpose than to “guide
the eye”, or more likely, mislead somebody.
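For reference, and assuming r² here means the usual coefficient of determination, it can be written as follows, where f is the fitting function you chose and ȳ is the mean of the measured y values (the symbols are introduced here, not taken from the notes):

```latex
r^2 = 1 - \frac{\sum_i \bigl(y_i - f(x_i)\bigr)^2}{\sum_i \bigl(y_i - \bar{y}\bigr)^2}
```

The chosen function f appears explicitly in the numerator, which is why r² cannot be an independent judge of that choice.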
Rule 2. Never choose your fitting function to maximise r².
The data for the intersecting straight lines will have a much
better r² if you use a quadratic. All that proves is that you cannot
distinguish between the two cases.
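A rough sketch of that comparison, using invented data (two straight lines meeting at x = 5) and a small hand-rolled polynomial least-squares routine so that nothing beyond the standard library is needed. The helper names `polyfit` and `r_squared` are illustrative only.

```python
def polyfit(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... via the normal equations."""
    m = degree + 1
    # Augmented matrix [A^T A | A^T y] for the polynomial design matrix A.
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for k in range(col, m + 1):
                a[row][k] -= factor * a[col][k]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (a[i][m] - s) / a[i][i]
    return coeffs

def r_squared(xs, ys, coeffs):
    """1 - SS_res / SS_tot for the polynomial with the given coefficients."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
                 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: slope 0.2 up to x = 5, slope 0.8 afterwards.
xs = [float(x) for x in range(11)]
ys = [0.2 * x if x <= 5 else 1.0 + 0.8 * (x - 5) for x in xs]

r2_line = r_squared(xs, ys, polyfit(xs, ys, 1))
r2_quad = r_squared(xs, ys, polyfit(xs, ys, 2))
print("r2, single straight line:", r2_line)
print("r2, quadratic           :", r2_quad)
```

The quadratic scores higher, yet the data were generated from two straight lines; the better r² says nothing about which description is physically right.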
[Figure: Fit to y=1/x data]
[Figure: Two intersecting straight lines]
Rule 3. r² proves nothing. It doesn’t even point you in the right
direction.
An r² alone doesn’t prove anything. Statistics only works if you
compare two things. If the r² for one straight line is better than
the r² for a similar data set, then the former data set is considered
to be better, but the question is how much better? An r² of 0.8 is a
10x “better” fit than an r² of 0.7 (for 10 data points). How much
better is an r² of 0.99 vs. 0.9? It depends; to answer that you have
to resort to t-tests or similar (but note that confidence limits
are meaningless for standalone data sets).
Rule 4. r² alone is not a good measure of how good a fit is.
[Figure: Fit to partial sine wave; y=sin(x)]
Basically, if you are doing physical sciences and you get an r² of
less than 0.9, you need to go back to the drawing board and work out
how to improve your data, or go back to the theory to see if your
functional form is correct. If it’s less than 0.7 it’s probably not
even worth that effort: you may be able to measure useful
things, but you are not going to prove any theories.
Least squares fitting implicitly assumes that each point is equally important, that is, that the points are equally
weighted or, more explicitly, that they have the same measurement error. This is often the case, but not if you
transform your data. For example, if you collect a series of rate constants, k, vs. temperature, T, and do an Arrhenius
plot (ln k vs. 1/T), the error is no longer the same for every point (calculate the error on 1/x for x=5 and 3, each with
an error of ±1). In such cases r² is completely meaningless, unless you do a weighted linear regression.
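The suggested calculation, worked through as a short sketch. It uses the standard first-order propagation formula, error on 1/x ≈ (error on x)/x², which is only a rough guide when the error is as large as it is here; the function name is illustrative.

```python
def reciprocal_with_error(x, dx):
    """Return 1/x and its first-order propagated error, |d(1/x)/dx| * dx = dx / x**2."""
    return 1.0 / x, dx / x ** 2

for x in (5.0, 3.0):
    value, error = reciprocal_with_error(x, 1.0)
    print(f"x = {x} +/- 1.0  ->  1/x = {value:.3f} +/- {error:.3f}")

# x = 5.0 +/- 1.0  ->  1/x = 0.200 +/- 0.040
# x = 3.0 +/- 1.0  ->  1/x = 0.333 +/- 0.111
```

The same error on x becomes nearly three times larger on 1/x at x=3 than at x=5, so the transformed points are no longer equally reliable.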
Rule 5. Do not use r² with transformed data, unless you know how to do a weighted regression.
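For orientation, a weighted straight-line fit might look something like the sketch below. Everything in it is an assumption made for illustration: the rate constants are invented, every k is given the same absolute error, and the error on ln k is taken to first order as (error on k)/k. Each point is weighted by 1/σ².

```python
import math

def weighted_line_fit(xs, ys, sigmas):
    """Straight line y = a + b*x minimising sum(((y - a - b*x) / sigma)**2)."""
    ws = [1.0 / s ** 2 for s in sigmas]
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope   # (intercept, slope)

# Hypothetical rate constants; every k carries the same absolute error.
temps = [300.0, 320.0, 340.0, 360.0]    # T / K
ks = [0.011, 0.052, 0.190, 0.640]        # k, arbitrary units
sigma_k = 0.005

inv_t = [1.0 / t for t in temps]
ln_k = [math.log(k) for k in ks]
sigma_ln_k = [sigma_k / k for k in ks]   # first-order error on ln k

intercept, slope = weighted_line_fit(inv_t, ln_k, sigma_ln_k)
R = 8.314  # gas constant, J mol^-1 K^-1
print("Ea ~", -slope * R / 1000, "kJ/mol (hypothetical data)")
```

The smallest rate constant has by far the largest error on ln k, so it gets the smallest weight; an unweighted fit would treat it as being just as trustworthy as the rest.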
[Figure: graph of the tabulated data, annotated R² = 0.9996; both axes run from 0 to 16]

x       y
1.0     1.0
1.5     1.5
2.0     2.0
2.5     2.5
3.0     3.0
3.5     3.5
4.0     4.0
5.0     5.0
6.0     6.0
7.0     7.0
8.0     8.0
9.0     9.0
10.0    10.0
11.0    11.0
12.0    12.0
13.0    13.0
14.0    14.0
15.0    15.0

Examine the graph above: perfect data! No, not quite.
Look at the data between 1-4 and 14-16; it’s a little off.
Not convinced? Hold the paper horizontally and look
along the line; you’ll see it’s bent up at the low end. If
you are still not convinced, look at the table of data. In
fact, it’s two intersecting straight lines. (In practice, it would
probably be a straight line with a curl at the low end.)
The point is, r² will lull you into believing the data is a
straight line, when in fact it’s not. If there wasn’t so
much data between 0-4 you wouldn’t see it at all. That
data may be critical in discovering something new, or
proving your theory.
Rule 6. If you insist on using r², you had better know
what you are doing.
You do not know what you’re doing (yet), so...
Rule 7. Do not use r² in lab reports to me, ever.
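One way to catch the kind of bend that a high r² hides is to look at the residuals (data minus fitted line) rather than at r² itself. The sketch below is illustrative only: the x spacing follows the table, but the y values are invented (y = x from 4 upwards, a shallower line below 4), and the helper name is hypothetical.

```python
def fit_line(xs, ys):
    """Unweighted least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

# x spacing as in the table; the y values are hypothetical:
# y = x from 4 upwards, with a shallower line (slope 0.8) below 4.
xs = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0] + [float(x) for x in range(5, 16)]
ys = [x if x >= 4 else 4 + 0.8 * (x - 4) for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

y_mean = sum(ys) / len(ys)
r2 = 1 - sum(r ** 2 for r in residuals) / sum((y - y_mean) ** 2 for y in ys)
print(f"r2 = {r2:.4f}")

# Random scatter would give residuals of mixed sign all along the line;
# a run of one sign at an end of the range is the bend showing through.
for x, r in zip(xs, residuals):
    print(f"{x:5.1f}  {r:+.3f}")
```

The r² is above 0.99, but the residuals are positive at both ends and negative in the middle, which is exactly the systematic pattern that a single number cannot show.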
Be aware, though, that data handling in analytical chemistry
is a different game. Analytical instruments and methods
are designed to produce straight lines, and you have both a
historical context (the experiments have been repeated many
times to make sure the data is linear; the method has
been “developed”) and a theoretical context (somebody has
done the math to make sure you would expect the data to be
linear; that’s what analytical research is about).
In such cases r² may be a legitimate measure, and you
will be taught how to deal with it in that context. You do
not have that context in physical chemistry labs.