Analysis of Multiple Linear Regression in Three Segments of the Houston Ship Channel
Zachary J. Van Brunt
Advisors: Dr. Hanadi Rifai, Anu Desai
Table of Contents
Abstract
Introduction
Background & Theory
Procedures
Results
Discussion
Acknowledgments
References
Abstract
With the understanding that clean water fit for contact recreation is a priority in a modern
urban environment, the Texas Commission on Environmental Quality (TCEQ) and
various other environmental sampling agencies have sampled water quality in Houston
for more than 30 years, more or less continuously. Data from six sampling sites, all
within the Houston Ship Channel, were analyzed using multiple linear regression to study
how pH, temperature, specific conductance, dissolved oxygen, and ammonia relate to levels
of fecal coliform. Three of the sites showed statistically significant relationships
(p-value < 0.05), while the other three showed no discernible relationship between fecal
coliform and any of the variables examined.
Introduction
Contaminated surface water is an issue that has become crucially important in recent
years, and is still gaining public prominence. Surface water quality is at a meeting point,
with concerns over environmental responsibility and sustainability on one side, and an
enduring watchfulness for public sanitation and health on another. Beach closures,
particularly those related to high levels of Escherichia coli (E. coli) or fecal coliform,
seem to have increased public awareness of this issue. (Stellin 2008) Beaches are clearly
not the only water bodies affected though, and in Houston levels of fecal coliform and E.
coli in the city’s extensive system of bayous have regularly failed to meet standards set
by the Environmental Protection Agency (EPA).
In order to assess the level of impairment various water bodies face, programs known as
Total Maximum Daily Loads (TMDLs) are undertaken. The purpose of a TMDL for
bacteria is to sample at various points in a water body over time in order to eventually
produce a single number, which is the total maximum load of bacteria that can be put into
a water body each day without the water becoming unsafe for human use. The TCEQ
instituted a TMDL for two central Houston bayous, Buffalo and White Oak Bayous, in
April of 2000, and then issued another TMDL for five additional bayous in the Houston
Metropolitan Area in June of 2005 (“Houston Metropolitan Area” 2008). This paper
discusses three segments added to the Houston Metro Area TMDL as of 2008 for
review, if not necessarily for sampling. Because all three lie in the vicinity of the
Houston Ship Channel, they are less likely to be used for contact recreation.
Background & Theory
The Houston Ship Channel, formed from the channelization of Buffalo Bayou in 1915
(“Basin 10 San Jacinto River” 2008), has exhibited problematic water quality for a
number of years, not just with respect to contact recreation but also in dioxin and PCB
ingestion through fish tissue and in a general regional perception of pollution.
Three specific segments, as defined by the TCEQ, are being examined in this paper. The
first of these is defined as “Buffalo Bayou (US 59 to upstream of 69th Street WWTP)” in
the TCEQ’s 303(d) list, a list of all impaired water bodies in the state of Texas. A visual
examination of this segment was made during sampling for an unrelated project over the
summer, and the whole segment has clearly been a part of the industrial ship channel area
for quite some time and subject to regular boat traffic. The land surrounding the channel
in this area includes open-air storage, warehouses, and as the title of the segment
indicates, a large Wastewater Treatment Plant. This area is the furthest west of the three
segments being examined, and at its westernmost point nears the northern section of
downtown Houston. Three of the sampling stations being examined (11295, 11296, and
15841) are located in this segment. The second segment is defined as “Houston Ship
Channel Tidal – Greens Bayou confluence to Patrick Bayou confluence.” Its location,
shown by the central red box in Figure 1, is again an industrialized section of the
Houston Ship Channel, in a more heavily trafficked part of the channel to the east
of the first segment. Greens Bayou and Patrick Bayou are frequently traveled by barges,
and their confluences define the bounds of this segment. This segment includes stations
11271 and 16617. The third and final segment being examined is described in the 303(d)
list as “Houston Ship Channel Tidal – Lynchburg Ferry Road to Goose Island.” This
segment lies near a turn in the Ship Channel and includes the constant ship traffic
created by the running of the Lynchburg Ferry. It includes station 11258, and is shown
by the red rectangle on the right in Figure 1.
Figure 1: Station Locations in Houston Ship Channel
Multiple linear regression analysis is a widely used method of statistically analyzing a
dataset in which a large number of variables could potentially affect a final result.
(Ge & Frick 2007) A reasonably intuitive way to understand the concept behind
multiple linear regression analysis is to first look at a simple linear regression. Finding a
linear regression for a dataset is a simple, well-understood tool for finding the
relationship between one independent and one dependent variable. A simple linear
regression would give an equation:
Log(FC) = b0 + b1(pH)
(The equation above gives the log of the fecal coliform value predicted for a given
pH value.)
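As an illustration only, the coefficients b0 and b1 of such an equation can be estimated by ordinary least squares. The sketch below assumes that Log means the base-10 logarithm, and the readings are hypothetical values chosen for the example, not TCEQ data.

```python
import math


def fit_simple_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical readings for illustration (not TCEQ data).
ph = [7.0, 7.5, 8.0, 8.5]
fecal_coliform = [1000.0, 100.0, 100.0, 10.0]

# Assumption: Log(FC) is taken as the base-10 logarithm.
log_fc = [math.log10(v) for v in fecal_coliform]

b0, b1 = fit_simple_regression(ph, log_fc)
print(f"Log(FC) = {b0:.2f} + {b1:.2f}(pH)")
```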
Multiple linear regression is a good deal more complicated, but in principle it achieves a
similar result, in that it determines the relationship between variables. Now, however,
there are several independent variables that could potentially affect the
dependent variable, each by a different amount. An equation for multiple linear
regression would generally take a form similar to:
Log(FC) = b0 + b1(pH) + b2(Temp) + b3(Cond) + b4(DO) + b5(Ammonia) + b6(Salinity)
(The above equation projects a value for the log of a Fecal Coliform value based on
different values for pH, Temperature, Conductivity, Dissolved Oxygen, Ammonia
Concentration, and Salinity.)
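One way to see how the coefficients b0 through b6 might be obtained is to solve the least-squares normal equations directly. The following is a minimal sketch under stated assumptions, not the SAS procedure used in this study: the function name is hypothetical, only two predictors are used to keep the example short, and the data are synthetic values generated from known coefficients.

```python
def fit_multiple_regression(rows, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bk*xk.

    rows: one list of predictor values per sample; y: one response per sample.
    Returns [b0, b1, ..., bk].
    """
    # Prepend 1.0 to each sample so that b0 acts as the intercept.
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    aug = [
        [sum(d[i] * d[j] for d in design) for j in range(k)]
        + [sum(d[i] * yi for d, yi in zip(design, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear or there are too few samples")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * c for a, c in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Synthetic (pH, Temp) samples; the response is built from known
# coefficients so that the fit can be checked by eye.
samples = [[7.0, 20.0], [7.5, 25.0], [8.0, 22.0], [8.2, 30.0], [6.8, 18.0]]
log_fc = [1.0 + 2.0 * p - 0.5 * t for p, t in samples]

coefficients = fit_multiple_regression(samples, log_fc)
print([round(c, 4) for c in coefficients])
```

Statistical packages generally use more numerically stable decompositions than the normal equations; the direct solve is shown here only because it maps most plainly onto the equation above.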
Calculating a multiple linear regression analysis for a large set of data would be
immensely time-intensive manually, but statistical analysis software can run the test
quickly. For a given data set, the output from such software reports how much of the
variation in the dependent variable is explained by the other variables (the R2 value)
and how statistically significant (p-value) the contribution of any single variable to the
final result is. As a method, multiple linear regression analysis
has been used widely to study the effects of various parameters on water quality. (Eleria
& Vogel 2005) It is naturally suited to examining complex systems such as streams,
rivers, and estuaries when it is felt that sufficient data have been collected to potentially
describe the cause of something like fecal coliform levels. (Ge & Frick 2007)
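To make the R2 value concrete, it can be computed from the observed values and the values predicted by a fitted equation. The sketch below uses hypothetical numbers; p-values are not shown because they additionally require the t-distribution, which is normally supplied by the statistical package.

```python
def r_squared(observed, predicted):
    """Fraction of the variation in `observed` explained by `predicted`."""
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: variation left unexplained by the model.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: variation of the observations about their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical Log(FC) observations and model predictions.
observed = [3.0, 2.0, 2.0, 1.0]
predicted = [2.9, 2.3, 1.7, 1.1]

r2 = r_squared(observed, predicted)
print(f"R2 = {r2:.2f}")
```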
Procedures
In terms of time spent, the vast majority of the procedural work for a multiple
regression analysis using data from the TCEQ lies in manipulating the data and
transferring them into a more readily accessible and useful format.
The data are provided in text files, accessible to the general public on the TCEQ website
(http://www.tceq.state.tx.us). These text files are delimited by the character “|”, so
they can be imported into Microsoft Access as databases. The headings for the data,
necessary to understand what the columns are, are provided by the TCEQ in a separate text
file and must be incorporated into the Access database by hand. At
this point there are two Access databases for every sub-basin: one database
containing the “Event” data, such as what type of sample was taken and when and where,
and another database containing the “Result” data, such as which parameters were
measured when the sample was taken and what the actual value of each parameter was for
that sampling time.
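The import step was carried out in Microsoft Access, but the same idea can be sketched in a few lines of Python for readers without Access. The column names and sample rows below are hypothetical; the actual TCEQ headings come from the separate headings file.

```python
import csv
import io


def read_pipe_delimited(text, column_names):
    """Parse '|'-delimited text into a list of dicts keyed by column name."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    return [
        dict(zip(column_names, (field.strip() for field in row)))
        for row in reader
        if row  # skip blank lines
    ]


# Hypothetical headings and rows standing in for an "Event" file.
event_columns = ["sample_id", "station_id", "sample_date"]
event_text = "E1|11258|2001-06-04\nE2|11271|2001-06-05\n"

events = read_pipe_delimited(event_text, event_columns)
print(events[0])
```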
In these two databases, a unique ID is assigned to every sample, which allows
the “Event” file and the “Result” file to be matched together. A query in Microsoft
Access is then used to pair the “Event” data with their matching “Result” data,
creating a new database that includes all pertinent data. This is done individually for
each of the three sub-basins containing TCEQ sampling stations in the three
segments of the Houston Ship Channel being examined. At this point the data have
roughly reached their desired form. The desired stations are then selected out of all
stations in the database, and these data are copied into Microsoft Excel.
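The matching query and station selection described above can be approximated as follows. This is a sketch only: the field names (sample_id, station_id, parameter, value) are assumptions made for the example rather than the TCEQ's actual headings, and station 99999 is a made-up station outside the study area.

```python
def join_events_results(events, results, stations):
    """Attach event fields to each result row that shares its sample ID,
    keeping only samples taken at the requested stations."""
    events_by_id = {
        e["sample_id"]: e for e in events if e["station_id"] in stations
    }
    joined = []
    for r in results:
        event = events_by_id.get(r["sample_id"])
        if event is not None:
            joined.append({**event, **r})
    return joined


# Hypothetical "Event" and "Result" rows.
events = [
    {"sample_id": "E1", "station_id": "11258", "sample_date": "2001-06-04"},
    {"sample_id": "E2", "station_id": "99999", "sample_date": "2001-06-05"},
]
results = [
    {"sample_id": "E1", "parameter": "pH", "value": "7.8"},
    {"sample_id": "E1", "parameter": "Temp", "value": "24.1"},
    {"sample_id": "E2", "parameter": "pH", "value": "7.2"},
]

joined = join_events_results(events, results, stations={"11258"})
for row in joined:
    print(row)
```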
The next step is to check that the data are in the right format
and to pare them down to just the parameters desired. Any reading in Fahrenheit must be
converted to Celsius, and data for parameters such as phosphate, though available, are
outside the scope of this report. All such out-of-scope data are culled, leaving a
substantially smaller Excel file that can be readily worked with. Excel’s
“VLOOKUP” function is then used to create a new spreadsheet that finds, by date, all the
various parameters that were sampled at any single point and pairs them up on a single
line in the spreadsheet. All that remains of the Excel formatting is to
remove all the date ranges that are missing data for one or more of the parameters.
Unfortunately, but perhaps reasonably considering the large temporal spread of the data,
the parameters that were measured have not remained constant, but have changed from
year to year and from sampling agency to sampling agency. The practical implication of
this is that approximately 60% of the data are missing one or more parameters, and are
discarded.
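The Excel steps above (unit conversion, pairing parameters by date, and discarding incomplete rows) can be sketched as follows. The field names and parameter labels are hypothetical, and a reduced set of required parameters is used in the example to keep it short.

```python
REQUIRED = ("pH", "Temp", "Cond", "DO", "Ammonia", "Salinity", "FC")


def fahrenheit_to_celsius(f):
    """Convert a temperature reading from Fahrenheit to Celsius."""
    return (f - 32.0) * 5.0 / 9.0


def pivot_complete_rows(records, required=REQUIRED):
    """Group readings by (station, date) into one row per sampling event,
    then keep only the rows that contain every required parameter."""
    by_key = {}
    for rec in records:
        key = (rec["station_id"], rec["sample_date"])
        by_key.setdefault(key, {})[rec["parameter"]] = rec["value"]
    return {
        key: row
        for key, row in by_key.items()
        if all(p in row for p in required)
    }


# Hypothetical readings: the second date lacks a temperature value.
records = [
    {"station_id": "11258", "sample_date": "2001-06-04", "parameter": "pH", "value": 7.8},
    {"station_id": "11258", "sample_date": "2001-06-04", "parameter": "Temp", "value": 24.1},
    {"station_id": "11258", "sample_date": "2001-06-04", "parameter": "FC", "value": 200.0},
    {"station_id": "11258", "sample_date": "2001-07-09", "parameter": "pH", "value": 7.5},
    {"station_id": "11258", "sample_date": "2001-07-09", "parameter": "FC", "value": 150.0},
]

complete = pivot_complete_rows(records, required=("pH", "Temp", "FC"))
print(complete)
```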
The data that remain are analyzed using Statistical Analysis Software (SAS). (SAS 2008)
The data from Excel are loaded into SAS. Within the software various
programs may be run to analyze data, and the one chosen is a program written in
FORTRAN specifically to call on and use SAS’s multiple linear regression capabilities.
This program is run for each of the six stations individually, and the results of each are
copied into a text file.
Results
The results (Table 1) show that three sampling stations returned both R2 values in the
desired range and p-values < 0.05. These stations were 11258, 11271, and 16617.
Both 11258 and 11271 gave very low p-values (<0.0001) for temperature and
conductance, and p-values below 0.03 for dissolved oxygen (0.0257 and 0.0203,
respectively). The R2 value for station
11258 was 0.30, while the R2 value for station 11271 was 0.31, suggesting that
approximately 30% of the variation in fecal coliform concentrations at these two stations
can be accounted for by temperature, conductance, and dissolved oxygen. The results for
these two stations are noticeably similar, though no clear reason for this is evident.
Station 16617, in the same segment as 11271, had an R2 value of 0.20, with the only
parameter with a p-value < 0.05 being pH. This suggests that, for the data from that
station, pH has a statistically significant relationship with fecal coliform levels.
Station   pH       Temperature   Conductance   Dissolved Oxygen   Salinity   Ammonia   R2 Value
11258     ----     <0.0001       <0.0001       0.0257             ----       ----      0.30
11271     ----     <0.0001       <0.0001       0.0203             ----       ----      0.31
16617     0.0324   ----          ----          ----               ----       ----      0.20
11295     ----     ----          ----          ----               ----       ----      0.00
11296     ----     ----          ----          ----               ----       ----      0.03
15841     ----     ----          ----          ----               ----       ----      0.17
Table 1: Results obtained from SAS software (p-value for each parameter and R2 value for each station)
The other three stations (11295, 11296, and 15841) showed no statistically significant
relationship between any of the parameters and the fecal coliform concentrations present.
Discussion
The results obtained from the multiple linear regression analysis show, generally, nothing
so decisive as to lead to a clear conclusion. At two stations, 11258 and 11271, not only
was there a statistically significant relationship between the parameters and fecal
coliform levels, but the same parameters were significant at both stations, with high
confidence (p-value < 0.03). These sites are reasonably close to each other geographically
but are not in the same segment, so although conditions at the two sites might be similar,
the agreement is harder to explain than it would be had 11271 and 16617, which are
in the same segment, produced such similar results. At station 16617, pH
had a statistically significant effect on fecal coliform values, but 16617 was the only
station where this was found.
In general, pH would be expected to have an effect, because bacteria prefer
close-to-neutral pH. Dissolved oxygen, as a measure of the ability of
certain wildlife to live in the water, would also have a potential effect, as would any of
the other parameters for various reasons. Despite the theory, none of this is borne out in
the data in a widespread fashion, and with no R2 value exceeding 0.31, it is clear that
there must be other factors involved in determining fecal coliform values in the Houston
Ship Channel.
There could be a number of explanations for this. One is that the nature of the data could
introduce too much variation: not only have the people gathering the data changed, but the
ways in which they gather it have evolved, and the recorded location of each sample is
almost always an estimate. (“Water Quality Sampling and Shipping
Procedures” 2008) Another explanation for the lack of clear trends in the multiple
regression analysis is that factors known to affect fecal coliform levels,
such as storm water runoff, are not taken into account in the data set because they
were never consistently measured historically. A third explanation is that the Ship
Channel itself has changed too much. The entire bank of the ship channel has been
created and recreated by old and new construction; it has been dredged to keep it
sufficiently deep for large ships; old barges have been incorporated into the banks to form
sturdy walls and left slowly to rust; new highways have been built and old bridges
demolished; and various chemical plants have released different effluents over
the years. Much water body analysis concerns water bodies that are changing slowly in
relation to the data being collected, or in only a few noticeable ways at a time. (Eleria &
Vogel 2005) It seems reasonable to conclude, after examining the results from this
multiple linear regression analysis, that only a much larger dataset would allow for a
large number of clear trends to emerge explaining the relationships between fecal
coliform and other variables in the Houston Ship Channel.
Acknowledgments
The research study described herein was sponsored by the National Science Foundation
under Award No. EEC-0649163. The opinions expressed in this study are those of the
authors and do not necessarily reflect the views of the sponsor.
Additional acknowledgment goes to Dr. Hanadi Rifai and Anu Desai for immense
professional support and assistance.
References
Eleria, A., & Vogel, R. M. (2005). Predicting Fecal Coliform Bacteria Levels in the
Charles River, Massachusetts, USA. Journal of the American Water Resources
Association, 41, 1195-1209.
Ge, Z., & Frick, W. E. (2007). Some statistical issues related to multiple linear regression
modeling of beach bacteria concentrations. Environmental Research, 103, 358-364.
Houston Metropolitan Area: A TMDL Project for Bacteria. (n.d.). In Texas Commission
on Environmental Quality. Retrieved July, 2008, from www.tceq.org/goto/tmdl/
Houston-Galveston Area Council. (n.d.). Greens Bayou Watershed Brochure [Brochure].
Author. Houston-Galveston Area Council. Retrieved July, 2008, from
http://www.h-gac.com/community/water/resources/default.aspx
Sampson, R. W., Swiatnicki, S. A., McDermott, C. M., & Kleinheinz, G. T. (2006). The
Effects of Rainfall on Escherichia coli and Total Coliform Levels at 15 Lake
Superior Recreational Beaches. Water Resources Management, 20, 151-159.
SAS. (2008). Predictive Analytics Software SAS. In SAS. Retrieved July, 2008, from
http://www.sas.com/technologies/analytics/index.html
Stellin, S. (2008, August 1). Is the Water Actually Fine? New York Times. Retrieved
August, 2008, from
http://travel.nytimes.com/2008/08/01/travel/escapes/01beach.html?scp=1&sq=beach%20contamination&st=cse
USA. Texas Commission on Environmental Quality. (n.d.). Basin 10 San Jacinto River.
Retrieved July, 2008, from http://www.tceq.state.tx.us
USA. Texas Commission on Environmental Quality. (n.d.). FY 2009 Monitoring
Priorities for Category 5c Impairments. Retrieved July, 2008, from
http://www.tceq.state.tx.us
USA. Texas Commission on Environmental Quality. (n.d.). Water Quality Sampling and
Shipping Procedures. Texas Commission on Environmental Quality. Retrieved
July, 2008, from http://www.tceq.state.tx.us
USA. Texas Commission on Environmental Quality. (2008, March 19). Texas 303(d) List.
Retrieved June, 2008, from
http://www.tceq.state.tx.us/assets/public/compliance/monops/water/08twqi/2008_303d.pdf
USA. U.S. Environmental Protection Agency. Office of Water. (2003). Bacterial Water
Quality Standards for Recreational Waters (Freshwater and Marine Waters)
Status Report. Washington, DC. www.epa.gov. Retrieved July, 2008, from
http://www.epa.gov/waterscience/beaches/local/statrept.pdf