Engineering Statistics
EGE370
Term Project
Statistics Graphics in MATLAB
Name: Nicholas Crimmins
Department: CE
Course Professor: Dr. Vaziri
5/10/13
Table of Contents
Introduction and Background
Implementation
Stem and Leaf Diagram
Character Stem and Leaf Diagram
Histogram
Boxplot
Conclusion
Appendix (MATLAB Code)
Introduction and Background
To organize a given set of data clearly and logically, it is common to use graphs and figures
that make the data both useful for analysis and easy to read. This project implements four of
them: a stem and leaf plot, a character stem and leaf plot, a histogram, and a boxplot. Various
other figures may be used, each of which serves a specific purpose for analysis. Everything was
coded using MATLAB.
Before the data could be displayed on these graphs, a few calculations needed to be made. The
first was the sample mean, which is the arithmetic average of the sample set and serves as an
estimate of its central value:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
where n is the number of observations and x_i is the i-th observation. The next equation used
was the sample variance, which measures the variability in a set of data. In other words, if the
variance is low, the numbers in the sample set are close together, and vice versa:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
The first of the graphical representations used in this project was the stem and leaf
diagram. This figure organizes the data into ranges, or stems, and lists the corresponding
individual observations as leaves. A character stem and leaf plot is similar, but sorts the data
in ascending order and adds a running count beside each stem to make it easier to find values
such as the median, percentiles, or quartiles. A histogram is organized like a stem and leaf
diagram, but groups the data into bins and displays the count in each bin as a bar. Finally, a
boxplot makes it easy to see the median, the quartiles, and any outliers in a set of data.
Implementation
MATLAB was used to write the code for all calculations and figures. The first step after
inputting the data was to calculate the mean and variance of the sample:
mean = 0;
for i = 1:length(data)
    mean = mean + data(i);
end
mean = mean / length(data);
The code above loops over the entire data set, adding each value to a running sum. The sum
is then divided by the length of the data vector, which is the number of points. A similar
algorithm was used for calculating the variance:
var = 0;
if length(data) ~= 1
    for i = 1:length(data)
        var = var + ((data(i) - mean)^2);
    end
    var = var / (length(data) - 1);
end
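As a cross-check on the two loops above, the same quantities can be computed with vectorized sums. The sketch below is illustrative and not part of the project code: the names x_bar and s2 are chosen here so that MATLAB's built-in mean() and var() are not shadowed, and the five data values are the five smallest observations from the sample output shown later in this report.

```matlab
% Illustrative sketch: vectorized sample mean and sample variance.
% x_bar and s2 are hypothetical names, not taken from the project code.
data = [76 87 97 101 105];              % five smallest values in the report's sample

n     = numel(data);                    % number of observations
x_bar = sum(data) / n;                  % sample mean
s2    = sum((data - x_bar).^2) / (n - 1);   % sample variance, n - 1 divisor

fprintf('Mean = %g , Variance = %g\n', x_bar, s2);
```

Because the built-in functions are left untouched, the results can be compared directly against mean(data) and var(data).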
The next few pages discuss the coding techniques used to implement the various
diagrams.
Stem and Leaf Diagram
The first graph to be displayed was the stem and leaf diagram. Organizing the stems,
leaves, and frequencies was done as follows:
stems_string = sprintf('%d', stems(1));
leaves_string = sprintf('%2d', leaves(stems == stems(1)));
freq_string = sprintf('\t\t\t%d', length(leaves(stems == stems(1))));
disp( strcat([stems_string ' |' leaves_string freq_string]) );
Each component was formatted as a string with ‘sprintf’, and the pieces were joined with
‘strcat’ before a single call to ‘disp’, so that the stem, leaves, and frequency appear on one
line. The code above prints only the first line of the diagram; a loop prints the remaining
lines. The first line is handled separately because the loop compares each stem with the
previous one, which does not exist for the first element. The output for this plot goes directly
to the command window. An example is shown below:
Stem and Leaf Plot of Data:
Stem    Leaf                    Frequency
-----------------------------
7 | 6                           1
8 | 7                           1
9 | 7                           1
10 | 1 5                        2
11 | 0 5 8                      3
12 | 0 1 3                      3
13 | 1 3 3 4 5 5                6
14 | 1 2 3 5 6 8 9 9            8
15 | 0 0 1 3 4 4 6 7 8 8 8 8    12
16 | 0 0 0 3 3 5 7 7 8 9        10
17 | 0 1 1 2 4 4 5 6 6 8        10
18 | 0 0 1 1 3 4 6              7
19 | 0 3 4 6 9 9                6
20 | 0 1 7 8                    4
21 | 8                          1
22 | 1 8 9                      3
23 | 7                          1
24 | 5                          1
Key for stem and leaf: 1 | 2 = 12, 12 | 1 = 121, etc.
For decimal inputs: 12 | 2 = 12.2, 121 | 3 = 121.3, etc.
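The way each observation is split into a stem and a leaf can be illustrated with a short sketch. It uses the same floor and rem operations as the whole-number branch of the appendix code, but groups the rows with unique(), which is a substitution made here for illustration and avoids the special case for the first line. The variable names unique_stems and row are assumptions of this sketch.

```matlab
% Illustrative sketch: split whole-number data into stems and leaves.
data   = sort([101 76 105 97 87]);      % five values from the sample output
stems  = floor(data ./ 10);             % everything except the last digit
leaves = rem(data, 10);                 % the last digit

unique_stems = unique(stems);           % one row per distinct stem
for k = 1:numel(unique_stems)
    row = leaves(stems == unique_stems(k));   % leaves belonging to this stem
    fprintf('%d |%s\t%d\n', unique_stems(k), sprintf('%2d', row), numel(row));
end
```

Each printed row shows the stem, its leaves, and the number of leaves, which corresponds to the Frequency column of the diagram.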
Character Stem and Leaf Diagram
The character stem and leaf plot was very similar to the stem and leaf diagram, except that a
column for ‘N’ was added to provide a simple way to locate percentiles and quartiles, as well
as the median. The algorithm is shown below:
if any(data(stems == stems(i)) == floor(median))
    M = length(leaves(stems == stems(i)));
    M_string = sprintf('(%d) ', M);
    disp( strcat([M_string stems_string ' |' leaves_string]) );
    N = N + M;
    N = length(data) - N;
elseif M ~= 0
    leaf_count = leaves_string;
    leaf_count(leaf_count == ' ') = [ ];
    N_string = sprintf('%d\t', N);
    disp( strcat([N_string stems_string ' |' leaves_string]) );
    N = N - length(leaf_count);
else
    leaf_count = leaves_string;
    leaf_count(leaf_count == ' ') = [ ];
    N = N + length(leaf_count);
    N_string = sprintf('%d\t', N);
    disp( strcat([N_string stems_string ' |' leaves_string]) );
end
These ‘if’ statements add the leaf count of each stem to the running total ‘N’ until the stem
containing the median is reached. For that stem, the number of leaves ‘M’ is displayed in
parentheses, and from then on the displayed counts decrease until the end of the diagram. The
output is shown below:
Character Stem and Leaf Plot of Data:
N = 80 Leaf Unit = 1
-----------------------------
1     7 | 6
2     8 | 7
3     9 | 7
5     10 | 1 5
8     11 | 0 5 8
11    12 | 0 1 3
17    13 | 1 3 3 4 5 5
25    14 | 1 2 3 5 6 8 9 9
37    15 | 0 0 1 3 4 4 6 7 8 8 8 8
(10)  16 | 0 0 0 3 3 5 7 7 8 9
33    17 | 0 1 1 2 4 4 5 6 6 8
23    18 | 0 0 1 1 3 4 6
16    19 | 0 3 4 6 9 9
10    20 | 0 1 7 8
6     21 | 8
5     22 | 1 8 9
2     23 | 7
1     24 | 5
Median Value = 163
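The ‘N’ column in the output above can be reproduced from the per-stem frequencies alone. The sketch below is an alternative formulation, not the project's algorithm: it takes the frequencies from the stem and leaf output and computes cumulative counts from both ends with cumsum(). The names from_top, from_bottom, depth, and median_row are assumptions of this sketch, and the median row is taken to be the row containing position (n + 1)/2, which is one possible convention when that position falls between two observations.

```matlab
% Illustrative sketch: depth column of a character stem and leaf plot.
freq = [1 1 1 2 3 3 6 8 12 10 10 7 6 4 1 3 1 1];   % leaves per stem (stems 7 to 24)
n    = sum(freq);                                   % total number of observations

from_top    = cumsum(freq);                         % running count from the first stem
from_bottom = fliplr(cumsum(fliplr(freq)));         % running count from the last stem
depth       = min(from_top, from_bottom);           % smaller of the two counts

median_row = find(from_top >= (n + 1) / 2, 1);      % row holding the middle position

for k = 1:numel(freq)
    if k == median_row
        fprintf('(%d)\n', freq(k));                 % median row shows its own count
    else
        fprintf('%d\n', depth(k));
    end
end
```

For these frequencies the printed column matches the ‘N’ column of the diagram, with the stem 16 row shown as (10).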
Histogram
The histogram was the first graph that utilized Figures in MATLAB. The ‘bar()’ function was
particularly helpful in creating the rectangles for each bin. The first step was to define the
bin edges (the number of bins is the rounded square root of the number of data points, or 5
when there are 25 or fewer points):
bin_width = (max(data) - min(data)) / num_bins;
bin_x = round( min(data):bin_width:max(data) );
With the bin edges defined, all that was left was to calculate the corresponding
frequencies:
m = 1;
n = 1;
count_freq = 0;
freq = zeros(1,num_bins+1); %Preallocate space
for i = 1:length(data)
    if data(i) >= bin_x(m)
        if m == length(bin_x)
            freq(n) = logical(data(i)) * length(data) - i + 2;
            break;
        elseif data(i) < bin_x(m + 1)
            count_freq = count_freq + 1;
        else
            freq(n) = count_freq;
            n = n + 1;
            m = m + 1;
            count_freq = 1;
        end
    end
end
This algorithm steps through the sorted data and checks whether the current value lies within
the bounds of the current bin. If so, the frequency count increases. Otherwise, the total
frequency for that bin is recorded, the counters advance to the next bin, and the count restarts
at 1 for the current value. In order to display the histogram, the following piece of code was
implemented:
bar(bin_x-bin_width/2 + bin_width, freq)
xlim([min(bin_x) max(bin_x)+bin_width ])
set(gca, 'XTick', bin_x) %Changes the X-axis scale to match the bins
xlabel('Data');
ylabel('Frequency');
The above code creates a bar for each bin with the corresponding frequency determining
the height, and sets all axes and labels appropriately. The bars were adjusted to fit between each
bin rather than directly above each tick mark.
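The frequency count described above can also be written with logical indexing, one bin at a time. The sketch below is a simplified alternative and not the project's algorithm: it uses unrounded, evenly spaced edges from linspace(), produces exactly num_bins counts, and closes the last bin on the right so that the maximum value is included. The names edges and in_bin are assumptions of this sketch, and the ten data values are the ten smallest observations from the sample output.

```matlab
% Illustrative sketch: count the observations falling in each bin.
data     = [76 87 97 101 105 110 115 118 120 121];  % ten smallest sample values
num_bins = 5;                                        % report's rule for <= 25 points
edges    = linspace(min(data), max(data), num_bins + 1);

freq = zeros(1, num_bins);                           % preallocate space
for k = 1:num_bins
    if k < num_bins
        in_bin = data >= edges(k) & data < edges(k + 1);    % [left, right)
    else
        in_bin = data >= edges(k) & data <= edges(k + 1);   % last bin: [left, right]
    end
    freq(k) = sum(in_bin);
end

disp(freq);
```

Because each bin is counted independently, an empty bin simply yields a count of zero, and no separate step is needed for the last bar.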
The resulting plot of the histogram for the given data set is shown below. Notice how the
histogram is roughly bell-shaped, which suggests that the data are approximately normally
distributed:
Boxplot
Plotting the boxplot diagram was fairly simple, though it did have the most code out of
the four graphs. The majority of the work was calculating each necessary value, including the
median, each quartile, each whisker, each outlier vector, and the IQR (Interquartile Range). A
few examples of the calculations are shown below. The rest of the calculations are similar to
these:
if rem(N,2) == 0 %Even number of data points
    median = data( round(((N/2)+((N/2)+1))/2) );
else %Odd number of data points
    median = data(((N-1)/2)+1);
end
quart1 = data( round(length(data)*(1/4)) );
IQR = quart3 - quart1;
lower_whisk = quart1-IQR;
lower_outliers = data(data < lower_whisk);
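For comparison, the calculations above differ from two common textbook conventions. When N is even, the code above takes the element at position N/2 + 1, whereas the median is often defined as the average of the two middle values; for the 80-value sample in this report the two middle values are 160 and 163, so averaging would give 161.5 rather than 163. The whiskers above are placed one IQR beyond the quartiles, whereas a widely used convention places the outlier fences at 1.5 times the IQR. The sketch below shows both conventions on a small sample; the names med, q1, q3, iqr_val, lower_fence, and upper_fence are assumptions of this sketch, and the quartiles are still picked by position as in the report.

```matlab
% Illustrative sketch: averaged median and 1.5*IQR fences.
data = sort([76 87 97 101 105 110 115 118]);   % eight smallest sample values
N    = numel(data);

if rem(N, 2) == 0
    med = (data(N/2) + data(N/2 + 1)) / 2;     % average of the two middle values
else
    med = data((N + 1) / 2);                   % single middle value
end

q1      = data(round(N * (1/4)));              % quartiles by position, as in the report
q3      = data(round(N * (3/4)));
iqr_val = q3 - q1;

lower_fence = q1 - 1.5 * iqr_val;              % common convention: 1.5 * IQR
upper_fence = q3 + 1.5 * iqr_val;

fprintf('Median = %g, fences = [%g, %g]\n', med, lower_fence, upper_fence);
```

Which convention is appropriate depends on the course or reference being followed; the values printed in this report use the project's own definitions.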
In order to effectively plot the figure, vertical lines were drawn for the quartiles and
markers were used for the whiskers and outliers:
y = [median median+5]; %Draw vertical lines when plotting
In order to fit the boxplot on the graph without taking up the entire window, the axes
were set appropriately. Special conditions such as empty vectors were taken into consideration:
if isempty(upper_outliers)
    x_ax2 = upper_whisk + 10;
else
    x_ax2 = max(upper_outliers)+10;
end
axis( [x_ax1 x_ax2 median-(median/2) median+(median/2)] )
Following are a few examples of plotting these calculated values. The ‘hold on’
command was utilized to plot multiple times. In order to draw each vertical line, vectors with
specified coordinates were used:
plot([quart2 quart2], y);
plot(lower_whisk, median+2.5, 'x');
plot( [lower_whisk quart1], [median+2.5 median+2.5], ':');
for i = 1:length(lower_outliers)
    plot(lower_outliers(i), median+2.5, 'o');
end
The calculations for the boxplot are displayed in the command window as follows:
Calculations For Box Plot of Data:
-----------------------------------
Median = 163
1st Quartile = 143
3rd Quartile = 181
Interquartile Range (IQR) = 38
Lower Whisker = 105
Upper Whisker = 219
Lower Outliers: 76   87   97  101
Upper Outliers: 221  228  229  237  245
The resulting plot of the boxplot for the given data set is shown below. The ‘x’ markers
represent each whisker, while the ‘o’ markers are the outliers. Again, quartiles are drawn as
vertical lines. No scale is needed for the Y-Axis.
Conclusion
While these graphs are very easy to grasp conceptually, the actual coding process is
considerably more involved. When plotting the stem and leaf and character stem and leaf
diagrams, everything had to be printed on a single line (stem, leaf, frequency). Strings were
very helpful for organizing each piece of the plot, since they allowed each component to be
assembled outside of the display function. Also, when determining when to display a new line,
the algorithm checks the previous index of the stems vector, which would access stems(0) on the
first iteration and cause an error. This was handled by printing the first line of the diagram
separately from the rest of the stem and leaf. For the histogram, the frequency algorithm did
not include the last bin: a count is stored only when the first value of the following bin is
reached, so the final count is never stored inside the loop. The last element of the ‘freq’
vector therefore had to be set outside of the loop for the last bar on the graph. The only issue
with the boxplot arose when no outliers were present, because an empty vector would then be used
for the limiting values of the X-Axis, which are chosen so that the graph spans most of the
window with any given data set. A workaround was implemented which adjusts the axis depending on
whether or not any outliers are present. Overall, the project was instrumental in furthering the
understanding of these various statistical diagrams, as well as of the MATLAB programming
language.
Appendix (MATLAB Code)
clear all
clc
%Uses inputdlg() instead of input() function, prevents cluttering of
%command window.
data = inputdlg('Input data separated by commas or spaces: ');
data = str2num(data{1}); %Convert string of data to numbers
disp([ 'Data = ', num2str(data) ]);
% Mean
mean = 0;
for i = 1:length(data)
    mean = mean + data(i);
end
mean = mean / length(data);
% Variance
var = 0;
if length(data) ~= 1
    for i = 1:length(data)
        var = var + ((data(i) - mean)^2);
    end
    var = var / (length(data) - 1);
end
disp([ 'Mean = ', num2str(mean), ' , Variance = ', num2str(var)]);
disp(' ');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%***** Stem and Leaf Plot *****
%Create stem and leaf vectors
data = sort(data, 'ascend');
if any(floor(data) ~= data) % Decimals (any value with a fractional part)
    stems = floor(data);
    leaves = abs(round(rem(data,1) * 10));
else % Whole numbers
    stems = floor(data./10);
    leaves = abs(rem(data,10));
end
disp('Stem and Leaf Plot of Data: ');
disp('Stem    Leaf    Frequency');
disp('-----------------------------');
%Display first line of stem and leaf plot, necessary for this
%particular algorithm, otherwise stems(0) would be accessed.
stems_string = sprintf('%d', stems(1));
leaves_string = sprintf('%2d', leaves(stems == stems(1)));
freq_string = sprintf('\t\t\t%d', length(leaves(stems == stems(1))));
disp( strcat([stems_string ' |' leaves_string freq_string]) );
%Display rest of stem and leaf plot with a key
for i = 2:length(stems)
    if stems(i-1) ~= stems(i)
        stems_string = sprintf('%d', stems(i));
        leaves_string = sprintf('%2d', leaves(stems == stems(i)));
        freq_string = sprintf('\t\t\t%d', length(leaves(stems == stems(i))));
        disp( strcat([stems_string ' |' leaves_string freq_string]) );
    end
end
disp('Key for stem and leaf: 1 | 2 = 12, 12 | 1 = 121, etc.');
disp('For decimal inputs: 12 | 2 = 12.2, 121 | 3 = 121.3, etc.');
disp(' ');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%***** Character Stem and Leaf Plot *****
%Use appropriate leaf units depending on the data
N = length(data);
if any(floor(data) ~= data) %Decimals (any value with a fractional part)
    leaf_unit = 0.1;
else %Whole numbers
    leaf_unit = 1.0;
end
if rem(N,2) == 0 %Even number of data points
    median = data( round(((N/2)+((N/2)+1))/2) );
else %Odd number of data points
    median = data(((N-1)/2)+1);
end
disp('Character Stem and Leaf Plot of Data: ');
disp([ 'N = ', num2str(N), ' Leaf Unit = ', num2str(leaf_unit) ]);
disp('-----------------------------');
%Create a string for each piece of the plot, to be concatenated later.
%Similar algorithm as in the regular stem-and-leaf.
stems_string = sprintf('%d', stems(1));
leaves_string = sprintf('%2d', leaves(stems == stems(1)));
leaf_count = leaves_string;
leaf_count(leaf_count == ' ') = [ ];
N = length(leaf_count);
N_string = sprintf('%d\t', N);
disp( strcat([N_string stems_string ' |' leaves_string]) );
%Loop to plot 2nd line forward of stem and leaf
M = 0;
for i = 2:length(stems)
    if stems(i-1) ~= stems(i)
        stems_string = sprintf('%d', stems(i));
        leaves_string = sprintf('%2d', leaves(stems == stems(i)));
        if any(data(stems == stems(i)) == floor(median))
            M = length(leaves(stems == stems(i)));
            M_string = sprintf('(%d) ', M);
            disp( strcat([M_string stems_string ' |' leaves_string]) );
            N = N + M;
            N = length(data) - N;
        elseif M ~= 0
            leaf_count = leaves_string;
            leaf_count(leaf_count == ' ') = [ ];
            N_string = sprintf('%d\t', N);
            disp( strcat([N_string stems_string ' |' leaves_string]) );
            N = N - length(leaf_count);
        else
            leaf_count = leaves_string;
            leaf_count(leaf_count == ' ') = [ ];
            N = N + length(leaf_count);
            N_string = sprintf('%d\t', N);
            disp( strcat([N_string stems_string ' |' leaves_string]) );
        end
    end
end
disp([ 'Median Value = ', num2str(median) ]);
disp(' ');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%***** Histogram *****
%Calculate number of bins:
if length(data) <= 25
    num_bins = 5;
else
    num_bins = round( sqrt(length(data)) );
end
%Place data into the correct bins:
bin_width = (max(data) - min(data)) / num_bins;
bin_x = round( min(data):bin_width:max(data) );
%Calculate the corresponding frequencies:
m = 1;
n = 1;
count_freq = 0;
freq = zeros(1,num_bins+1); %Preallocate space
for i = 1:length(data)
    if data(i) >= bin_x(m)
        if m == length(bin_x)
            freq(n) = logical(data(i)) * length(data) - i + 2;
            break;
        elseif data(i) < bin_x(m + 1)
            count_freq = count_freq + 1;
        else
            freq(n) = count_freq;
            n = n + 1;
            m = m + 1;
            count_freq = 1;
        end
    end
end
%Plot last bar separately (above algorithm only plots up to last bar)
freq(length(freq)) = sum(data == bin_x(length(bin_x)));
%Plot the histogram with appropriate labels and axes:
plot1 = figure;
hold on;
bar(bin_x-bin_width/2 + bin_width, freq)
xlim([min(bin_x) max(bin_x)+bin_width ])
set(gca, 'XTick', bin_x) %Changes the X-axis scale to match the bins
xlabel('Data');
ylabel('Frequency');
hold off;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%***** Box Plot *****
disp('Calculations For Box Plot of Data: ');
disp('-----------------------------------');
quart2 = median; %From earlier calculation in character stem and leaf plot
disp([ 'Median = ', num2str(quart2) ]);
%Rest of calculations for each section of data on the plot
quart1 = data( round(length(data)*(1/4)) );
disp([ '1st Quartile = ', num2str(quart1) ]);
quart3 = data( round(length(data)*(3/4)) );
disp([ '3rd Quartile = ', num2str(quart3) ]);
IQR = quart3 - quart1;
disp([ 'Interquartile Range (IQR) = ', num2str(IQR) ]);
lower_whisk = quart1-IQR;
disp([ 'Lower Whisker = ', num2str(lower_whisk) ]);
upper_whisk = quart3+IQR;
disp([ 'Upper Whisker = ', num2str(upper_whisk) ]);
lower_outliers = data(data < lower_whisk);
disp([ 'Lower Outliers: ', num2str(lower_outliers) ]);
upper_outliers = data(data > upper_whisk);
disp([ 'Upper Outliers: ', num2str(upper_outliers) ]);
plot2 = figure;
hold on;
y = [median median+5]; %Draw vertical lines when plotting
%Make box plot condensed on graph:
if isempty(lower_outliers)
    x_ax1 = lower_whisk - 10;
else
    x_ax1 = min(lower_outliers)-10;
end
if isempty(upper_outliers)
    x_ax2 = upper_whisk + 10;
else
    x_ax2 = max(upper_outliers)+10;
end
axis( [x_ax1 x_ax2 median-(median/2) median+(median/2)] )
%Plot each quartile as a vertical line:
plot([quart2 quart2], y);
plot([quart3 quart3], y);
plot([quart1 quart1], y);
%Plot each whisker as an 'x'
plot(lower_whisk, median+2.5, 'x');
plot(upper_whisk, median+2.5, 'x');
%Draw a dotted line between the whiskers and the quartiles
plot( [lower_whisk quart1], [median+2.5 median+2.5], ':');
plot( [upper_whisk quart3], [median+2.5 median+2.5], ':');
%Plot the lower outliers
for i = 1:length(lower_outliers)
    plot(lower_outliers(i), median+2.5, 'o');
end
%Plot the upper outliers
for i = 1:length(upper_outliers)
    plot(upper_outliers(i), median+2.5, 'o');
end
hold off;
%Set appropriate labels
xlabel('Data');
title('Boxplot of Data');
set(gca, 'YTick', []) %Y Axis label removed since there are no units