NOTES ON PLOTTING GRAPHS

The purpose of a graph is twofold: to present your data in a compact, easy-to-digest form, and to present the outcome of your data analysis (if required), also in a compact form that can be directly compared to your experimental data. This implies a number of things:

a) If the data set is small, a table is often sufficient. Three-point graphs are irritating, but still check your data visually.
b) The data is the focus: do not clutter the graph with shading, borders, pictures, unnecessary annotation, or any other nonsense Excel can throw up (unless you intend to conceal the data; that’s what “business graphics” means).
c) You need properly annotated and scaled axes, and some way of identifying the data if you have multiple data sets. Distinct symbols are preferred, but color works as well. For histograms (rarely used in chemistry), you will need shading or color.
d) Do not join data points with a line (see below).
e) Acceptable clutter (if required) is a fitted line, error bars, confidence bounds (usually meaningless outside an analytical lab) and, in the initial stages, fitting statistics. When you finalise the graph, move the statistics to the caption. Do not include R (in reports for me); it’s meaningless in most cases.
f) Do not ‘write’ on the graph: all text, including your title, should be in the caption. I prefer the caption at the bottom of the graph. The caption should be numbered (usually Figure n).
g) Elaborate graphs are a sign of weak data and small minds.

Data points and lines:

a) In chemistry you will nearly always use a scatter plot. Do not confuse this with a line plot (x data equally spaced and not numeric, e.g. names).
b) If you are plotting an equation, then plot it as a line. Typically, the equation will be your fit for the data; superimpose it on your data. Identify the fit in the caption, along with its statistics, or, if applicable, give the function the line represents.
c) If you are plotting spectra, or any other data set where the points are very close together (say, closer than the width of a 12 pt letter), then plot the data as a line, not with symbols or with line + symbols. Anything else is just too messy.
d) If you are plotting plain old data with big gaps, plot symbols. DO NOT put a line between them. You do not have data between the data points. Putting a line there implies you have data there. That is fraud.
e) Sometimes, at the start of data analysis, you may wish to sketch a line through the data to assist you with the initial fit, particularly with noisy data. This is called adding a line “to guide the eye”. It’s also called “chi by eye” (chi is a goodness-of-fit parameter). Sometimes, when you are leading an audience through your analysis, you might want to do this, but the final graph should not use it. Leaving a “chi by eye” line on a graph completely undermines your credibility. If you do not identify it as a line to “guide the eye”, it’s out-and-out fraud.
f) Under rare circumstances, you may have data that cannot be fitted (well) to any function. In this case you have to resort to an approach that is, at least, unbiased. If the data is clean, you can use interpolation routines. These are an admission of defeat, but it is acceptable to use them to “fit” the data. If the data is noisy, you can smooth it. There are many ways of doing this; the best is to use a cubic spline under tension. As long as you don’t pretend that these methods are a route to proper data processing, or to further experimentation, they shouldn’t get you into too much trouble.
g) Spectroscopic data is usually closely spaced and smoothed, usually by Fourier methods.
Much of what I have said here is not applicable in these cases, because:
i) the smoothing methods are standardised and well understood,
ii) the artefacts they introduce are also easy to recognize,
iii) a little bit of smoothing is always required to remove well-known instrument artefacts,
iv) usually, only the peak positions are required. Smoothing has little effect on these.
h) Sometimes, you have to smooth data, by whatever method you can, to keep your fitting algorithms stable (at least initially). That is data processing though, not presentation. Smoothing data does not improve the fit, just the algorithmic stability.

Just a couple more general comments. If you are plotting data with high data densities (points are very close together), e.g. spectral data, click on the data, right click and select format data series, then use the options to turn off the markers and turn on the line. If you don’t, the plot will be very messy.

[Figure: the same spectral data plotted twice, x axis 500 to 640, y axis -2.0E-03 to 1.0E-02. Panel titles: “Line only option” and “Markers on”.]

On the other hand, for normal (low data density) graphs, never turn on the line option: you do not have data between the points, therefore you cannot plot it. If you must add a line, add a fitted line, which is essentially your hypothesis as to what the data should look like.

A figure showing the hazards of ad hoc interpolation. All these fits represent perfect fits. The data may represent the classic line break, but it may be a phase transition (bottom left) or an oscillating function. Take into account that the data may be noisy and it becomes even less clear: the top left may represent the classic absorption function with saturation, or a simple quadratic. The top right implies two intersecting straight lines.

AR²GHH!

[Figure panel: “Perfect straight line with one outlier”.]

All the graphs on the left have an r² value of 0.7. In the physical sciences this would be considered a poor fit; in the life and social sciences, that is a very good fit. Whatever your perspective, the r² value is misleading because the data is perfect, except in the first case, where there is one outlier, and in the second case, which is complete and utter nonsense (for a start, uniform random numbers don’t occur in nature).

Rule 1. Never quote r² without plotting the data.

[Figure panel: “The integral of uniform random noise”.]

The regression coefficient is just a measure of how closely the data conforms to the selected fitted function (usually a straight line). You choose that function, so it’s not an independent measure. You must have a good theoretical or empirical reason for your fitting function. If you have no reason to choose a given function, then the line serves no other purpose than to “guide the eye”, or more likely, mislead somebody.

Rule 2. Never choose your fitting function just to maximise r².

The data for the intersecting straight lines will have a much better r² if you use a quadratic. All that proves is that you cannot distinguish between the two cases.

[Figure panels: “Fit to y=1/x data” and “Two intersecting straight lines”.]

Rule 3. r² proves nothing. It doesn’t even point you in the right direction.

An r² alone doesn’t prove anything. Statistics only works if you compare two things. If the r² for one straight line is better than the r² for a similar data set, then the former data set is considered to be better, but the question is how much better? An r² of 0.8 is a 10x “better” fit than an r² of 0.7 (for 10 data points). How much better is an r² of 0.99 vs. 0.9?
It depends; to answer that you have to resort to t-tests or similar (but note that confidence limits are meaningless for standalone data sets).

Rule 4. r² alone is not a good measure of how good a fit is.

[Figure panel: “Fit to partial sine wave; y=sin(x)”.]

Basically, if you are doing physical sciences and you get an r² of less than 0.9, you need to go back to the drawing board and work out how to improve your data, or go back to the theory to see if your functional form is correct. If it’s less than 0.7, it’s probably not even worth that effort: you may be able to measure useful things, but you are not going to prove any theories.

Least squares fitting implicitly assumes that each point is equally important, that is, the points are equally weighted, or more explicitly, they have the same measurement error. This is often the case, but not once you transform your data. For example, if you collect a series of rate constants, k, vs. temperature, T, and do an Arrhenius plot (ln k vs. 1/T), the error is no longer the same for every point (calculate the error on 1/x for x = 5 and 3, each with an error of ±1). In such cases r² is completely meaningless, unless you do a weighted linear regression.

Rule 5. Do not use r² with transformed data, unless you know how to do a weighted regression.

[Figure: data with a fitted straight line, annotated “R² = 0.9996”; both axes run from 0 to 16. The accompanying data table lists x = 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0.]

Examine the graph above; perfect data! No, not quite: look at the data between 1-4 and 14-16, it’s a little off. Not convinced? Hold the paper horizontally and look along the line; you’ll see it’s bent up at the low end. If you are still not convinced, look at the table of data. In fact, it’s intersecting straight lines. (In practice, it would probably be a straight line with a curl at the low end.) The point is, r² will lull you into believing the data is a straight line, when in fact it’s not. If there wasn’t so much data between 0-4 you wouldn’t see it at all.
That data may be critical in discovering something new, or proving your theory.

Rule 6. If you insist on using r² you had better know what you are doing. You do not know what you’re doing (yet), so...

Rule 7. Do not use r² in lab reports to me; ever.

[Data table, y column: 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0.]

But be aware that data handling in analytical chemistry is a different game. Analytical instruments and methods are designed to produce straight lines. You have a historical context (the experiments have been repeated many times to make sure the data is linear; the method has been “developed”) and a theoretical context (somebody has done the math to make sure you would expect the data to be linear; that’s what analytical research is about). In such cases r² may be a legitimate measure, and you will be taught how to deal with it in that context. You do not have that context in physical chemistry labs.
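
The remark under Rule 5 about transformed data can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not a prescribed method: the function names `sigma_reciprocal` and `weighted_line_fit` are hypothetical, the Arrhenius-style numbers are made up, and the uncertainties use first-order error propagation, which is only approximate when the error is a large fraction of the value.

```python
import math


def sigma_reciprocal(x, sigma_x):
    """First-order propagated uncertainty of 1/x: sigma_x / x**2."""
    return sigma_x / x ** 2


def weighted_line_fit(xs, ys, sigmas):
    """Weighted least-squares straight line, weights w = 1/sigma**2.

    Returns (slope, intercept).
    """
    ws = [1.0 / s ** 2 for s in sigmas]
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    delta = sw * swxx - swx ** 2
    slope = (sw * swxy - swx * swy) / delta
    intercept = (swxx * swy - swx * swxy) / delta
    return slope, intercept


# The example from the text: x = 5 and x = 3, each known to +/- 1.
err_at_5 = sigma_reciprocal(5.0, 1.0)  # 1/25
err_at_3 = sigma_reciprocal(3.0, 1.0)  # 1/9
print(err_at_5, err_at_3)

# Hypothetical Arrhenius-style data (illustrative numbers only):
# temperatures in K and rate constants k, same absolute error on every k.
temps = [300.0, 320.0, 340.0, 360.0]
ks = [0.010, 0.035, 0.110, 0.300]
sigma_k = 0.005

inv_t = [1.0 / t for t in temps]
ln_k = [math.log(k) for k in ks]
# First-order propagation for ln k: sigma_lnk = sigma_k / k,
# so the small rate constants carry the largest errors after the transform.
sigma_ln_k = [sigma_k / k for k in ks]

slope, intercept = weighted_line_fit(inv_t, ln_k, sigma_ln_k)
print(f"slope = {slope:.1f} K, intercept = {intercept:.2f}")
```

The same ±1 on x becomes an error on 1/x that is nearly three times larger at x = 3 than at x = 5 (1/9 against 1/25), which is why an unweighted fit to transformed data treats unequal points as equal.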
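
The warning that r² can hide a break in slope can be checked numerically. The sketch below uses made-up data (two straight lines meeting at x = 4; these are not the numbers from the table in these notes) and hypothetical helper names `fit_line` and `r_squared`. It fits a single unweighted straight line and then prints the residuals, which is where the break shows up.

```python
# Hypothetical data: two straight lines meeting at x = 4
# (slope 0.5 below the break, slope 1.0 above it).
xs = [float(x) for x in range(16)]
ys = [2.0 + 0.5 * x if x < 4 else x for x in xs]


def fit_line(xs, ys):
    """Ordinary (unweighted) least-squares line; returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(f"r^2 = {r2:.4f}")
for x, res in zip(xs, residuals):
    print(f"x = {x:4.1f}  residual = {res:+.3f}")
```

With these made-up numbers r² comes out at about 0.99, yet the residuals are positive at both ends and negative in the middle. Random scatter does not do that; a wrong functional form does. Plot the residuals, not just the fit.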