Analysis of Multiple Linear Regression in Three Segments of the Houston Ship Channel

Zachary J. Van Brunt
Advisors: Dr. Hanadi Rifai, Anu Desai

Table of Contents
Abstract
Introduction
Background & Theory
Procedures
Results
Discussion
Acknowledgments
References

Abstract

With the understanding that clean water fit for contact recreation is a priority in a modern urban environment, the Texas Commission on Environmental Quality (TCEQ) and various other environmental sampling agencies have sampled water quality in Houston for more than 30 years, more or less continuously. Data from six sampling sites, all within the Houston Ship Channel, were analyzed using multiple linear regression to study the relationships between levels of fecal coliform and pH, temperature, specific conductance, dissolved oxygen, and ammonia. Statistically significant relationships (p-value < 0.05) were found at three of the sites, while the other three showed no discernible relationship between fecal coliform and any of the variables examined.

Introduction

Contaminated surface water is an issue that has become crucially important in recent years, and is still gaining public prominence. Surface water quality sits at a meeting point, with concerns over environmental responsibility and sustainability on one side, and an enduring watchfulness for public sanitation and health on the other. Beach closures, particularly those related to high levels of Escherichia coli (E. coli) or fecal coliform, seem to have increased public awareness of this issue (Stellin 2008). Beaches are clearly not the only water bodies affected, though, and in Houston levels of fecal coliform and E. coli in the city’s extensive system of bayous have regularly failed to meet standards set by the Environmental Protection Agency (EPA). In order to assess the level of impairment various water bodies face, programs known as Total Maximum Daily Loads (TMDLs) are undertaken.
The purpose of a TMDL for bacteria is to sample at various points in a water body over time in order to eventually produce a single number: the total maximum load of bacteria that can be put into a water body each day without the water becoming unsafe for human use. The TCEQ instituted a TMDL for two central Houston bayous, Buffalo and White Oak Bayous, in April of 2000, and then issued another TMDL for five additional bayous in the Houston Metropolitan Area in June of 2005 (“Houston Metropolitan Area” 2008). This paper discusses three segments added to the Houston Metropolitan Area TMDL as of 2008 for review, if not necessarily for sampling: because all three lie in the vicinity of the Houston Ship Channel, they are less likely to be used for contact recreation.

Background & Theory

The Houston Ship Channel, formed from the channelization of Buffalo Bayou in 1915 (“Basin 10 San Jacinto River” 2008), has exhibited problematic water quality for a number of years, not just concerning contact recreation, but also dioxin and PCB ingestion through fish tissue, and a general public perception in the region of pollution. Three specific segments, as defined by the TCEQ, are examined in this paper. The first of these is defined as “Buffalo Bayou (US 59 to upstream of 69th Street WWTP)” in the TCEQ’s 303(d) list, a list of all impaired water bodies in the state of Texas. A visual examination of this segment was made during sampling for an unrelated project over the summer, and the whole segment has clearly been a part of the industrial ship channel area for quite some time and subject to regular boat traffic. The land surrounding the channel in this area includes open-air storage, warehouses, and, as the title of the segment indicates, a large wastewater treatment plant. This area is the farthest west of the three segments being examined, and at its westernmost point nears the northern section of downtown Houston.
Three of the sampling stations being examined (11295, 11296, and 15841) are located in this segment. The second segment is defined as “Houston Ship Channel Tidal – Greens Bayou confluence to Patrick Bayou confluence.” Its location, shown by the central red box in Figure 1, is again an industrialized and more frequently trafficked section of the Houston Ship Channel, to the east of the first segment. Greens Bayou and Patrick Bayou are frequently traveled by barges, and their confluences define the bounds of this segment. This segment includes stations 11271 and 16617. The third and final segment being examined is described in the 303(d) list as “Houston Ship Channel Tidal – Lynchburg Ferry Road to Goose Island.” This segment lies near a turn in the Ship Channel, and includes the constant ship traffic created by the running of the Lynchburg Ferry. It includes station 11258, and is shown by the red rectangle on the right in Figure 1.

Figure 1: Station Locations in Houston Ship Channel

Multiple linear regression analysis is a widely used method of statistically analyzing a dataset where a large number of variables could potentially be impacting a final result (Ge & Frick 2007). A reasonably intuitive way to understand the concept behind multiple linear regression analysis is to first look at a simple linear regression. Finding a linear regression for a dataset is a simple, well-understood tool for finding the relationship between an independent and a dependent variable. A simple linear regression would give an equation:

Log(FC) = b0 + b1(pH)

(The equation above gives the log of the Fecal Coliform value that would result from a certain pH value.)

Multiple linear regression is a good deal more complicated, but in principle achieves a similar result, in that it determines the relationship between variables.
Now, however, there are a large number of independent variables that could potentially be affecting the dependent variable, each by a different amount. An equation for multiple linear regression would generally take a form similar to:

Log(FC) = b0 + b1(pH) + b2(Temp) + b3(Cond) + b4(DO) + b5(Ammonia) + b6(Salinity)

(The above equation projects a value for the log of a Fecal Coliform value based on different values for pH, Temperature, Conductivity, Dissolved Oxygen, Ammonia Concentration, and Salinity.)

Calculating a multiple linear regression analysis for a large set of data would be immensely time-intensive manually, but statistical analysis software can be used to run the test quickly. The results from such software, given a certain data set, report how much of the variation in the dependent variable can be explained by the other variables (the R² value), and also give a p-value for each variable, which indicates how confidently that variable can be said to contribute to the final result. As a method, multiple linear regression analysis has been used widely to study the effects of various parameters on water quality (Eleria & Vogel 2005). It is naturally suited to examining complex systems such as streams, rivers, and estuaries when it is felt that sufficient data have been collected to potentially describe the cause of something like fecal coliform levels (Ge & Frick 2007).

Procedures

Measured by time spent, the vast majority of the procedure for a multiple regression analysis on TCEQ data lies in manipulating and transferring the data into a more readily accessible and useful format. The data are provided in text files, accessible to the general public on the TCEQ website (http://www.tceq.state.tx.us). These text files are delimited by the character “|” so that they can be imported into Microsoft Access as databases.
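
As an illustration of the import step, the following is a minimal Python sketch of reading a “|”-delimited text file. It is not the Microsoft Access import used in this study. The column names and sample lines are hypothetical and made up for illustration, since the layout of the TCEQ files is not reproduced here.

```python
import csv
import io

def read_pipe_delimited(text, headers):
    """Parse '|'-delimited text into a list of dicts keyed by the given headers."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    rows = []
    for fields in reader:
        if not fields:  # csv.reader yields an empty list for blank lines
            continue
        rows.append(dict(zip(headers, (f.strip() for f in fields))))
    return rows

# Hypothetical column names and made-up sample lines, for illustration only.
EVENT_HEADERS = ["event_id", "station_id", "sample_date"]
sample_text = "E1|11258|2001-05-14\n\nE2|11271|2001-05-15\n"

events = read_pipe_delimited(sample_text, EVENT_HEADERS)
for event in events:
    print(event)
```

Passing the column names in as a separate list reflects the fact that the data files themselves carry no header row.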
The headings for the data, necessary to understand what the columns are, are provided by the TCEQ in a separate text file, and must be incorporated into the Access database by hand. At this point there are two Access databases for every sub-basin: one containing the “Event” data, such as what type of sample was taken and when and where, and another containing the “Result” data, such as which parameters were collected when the sample was taken, and the actual value of each parameter at that sampling time. In these two databases, a unique ID is assigned to every sample, which allows the “Event” file and the “Result” file to be matched together. A query in Microsoft Access is then used to pair the “Event” data with their matching “Result” data, creating a new database that includes all pertinent data. This is done individually for each of the three sub-basins that contain TCEQ sampling stations in the three segments of the Houston Ship Channel being examined. At this point the data have reached roughly their desired form. The desired stations are then selected out of all stations in the database, and their data are copied into Microsoft Excel.

The next step is to check that the data are in the right format, and to pare them down to just the parameters desired. Any reading in Fahrenheit must be converted to Celsius, and parameters such as phosphate, though available, are outside the scope of this report. All such data are culled, leaving a substantially smaller Excel file, which can be readily worked with. Excel’s “VLOOKUP” function is then used to create a new spreadsheet that finds, by date, all the various parameters that were sampled at any single point, and pairs them up into a single line in the spreadsheet.
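
The matching and pairing steps described above could be sketched in Python as follows. This stands in for the Access query and the Excel VLOOKUP step rather than reproducing them. All field names, parameter names, and records are hypothetical and made up for illustration.

```python
def fahrenheit_to_celsius(deg_f):
    """Convert a temperature reading from Fahrenheit to Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0

def join_and_pivot(events, results):
    """Match results to events by ID, then build one row per station and date."""
    events_by_id = {e["event_id"]: e for e in events}
    rows = {}
    for res in results:
        event = events_by_id.get(res["event_id"])
        if event is None:  # result with no matching event: skip it
            continue
        key = (event["station_id"], event["sample_date"])
        row = rows.setdefault(key, {"station_id": key[0], "sample_date": key[1]})
        name, value = res["parameter"], float(res["value"])
        if name == "temp_f":  # normalise Fahrenheit readings to Celsius
            name, value = "temp_c", fahrenheit_to_celsius(value)
        row[name] = value
    return list(rows.values())

# Made-up records for illustration only; all names here are hypothetical.
events = [
    {"event_id": "E1", "station_id": "11258", "sample_date": "2001-05-14"},
    {"event_id": "E2", "station_id": "11271", "sample_date": "2001-05-15"},
]
results = [
    {"event_id": "E1", "parameter": "ph", "value": "7.5"},
    {"event_id": "E1", "parameter": "temp_f", "value": "212"},
    {"event_id": "E2", "parameter": "ph", "value": "8.0"},
    {"event_id": "E9", "parameter": "ph", "value": "6.9"},  # no matching event
]

table = join_and_pivot(events, results)
for row in table:
    print(row)
```

Each output row holds every parameter recorded for one station on one date, which is the single-line layout the regression step needs.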
At this point all that remains of the Excel formatting portion is to remove all the dates that are missing data for one or more of the parameters. Unfortunately, but perhaps reasonably considering the large temporal spread of the data, the parameters that were measured have not remained constant, but have changed from year to year and from sampling agency to sampling agency. The practical implication is that approximately 60% of the records are missing one or more parameters and are discarded.

The data that remain are analyzed using Statistical Analysis Software (SAS) (SAS 2008). The data from Excel are loaded into the SAS software. Within the software various programs may be run to analyze data, and the one chosen is a program written in FORTRAN specifically to call on and use SAS’s multiple linear regression capabilities. This program is run for each of the six stations individually, and the results of each are copied into a text file.

Results

The results (Table 1) show that three sampling stations returned both R² values in the desired range and p-values < 0.05. These stations were 11258, 11271, and 16617. Both 11258 and 11271 gave very low p-values (<0.0001) for temperature and conductance, and p-values below 0.03 for dissolved oxygen. The R² value for station 11258 was 0.30, while the R² value for station 11271 was 0.31, suggesting that approximately 30% of the variation in fecal coliform concentrations at these two stations can be accounted for by temperature, conductance, and dissolved oxygen. The results for these two stations are noticeably similar, though no clear reason for this is evident. Station 16617, in the same segment as 11271, had an R² value of 0.20, with pH the only parameter with a p-value < 0.05. This suggests that, for the data from that station, pH has a statistically significant relationship with fecal coliform levels, although the model accounts for only about 20% of their variation.
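
The discarding of incomplete records and the regression fit described in the Procedures could be sketched in Python as follows. This is a pure-Python stand-in and not the SAS program used in this study. It solves the least-squares normal equations and reports only the coefficients and R²; the p-values that SAS reports are not computed here. The rows are synthetic, generated from a known linear relation purely so the fit can be checked, and the field names are hypothetical.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, predictors, response):
    """Fit response on predictors by least squares, using complete rows only.

    Returns (coefficients with intercept first, R^2, number of rows used).
    """
    needed = predictors + [response]
    complete = [r for r in rows if all(k in r for k in needed)]
    X = [[1.0] + [r[k] for k in predictors] for r in complete]
    y = [r[response] for r in complete]
    p = len(predictors) + 1
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * xi for b, xi in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot, len(complete)

# Synthetic rows (not measured data): log10(FC) = 1 + 0.1*temp_c - 0.2*do.
rows = []
for temp_c, do in [(10, 8), (15, 6), (20, 7), (25, 5), (30, 4)]:
    fc = 10 ** (1 + 0.1 * temp_c - 0.2 * do)
    rows.append({"temp_c": float(temp_c), "do": float(do), "log_fc": math.log10(fc)})
rows.append({"temp_c": 22.0, "log_fc": 2.0})  # missing "do": discarded

beta, r_squared, n_used = fit_ols(rows, ["temp_c", "do"], "log_fc")
print(beta, r_squared, n_used)
```

Because the synthetic response is an exact linear function of the predictors, the fit recovers the generating coefficients and an R² of essentially 1. Real sampling data would give a much lower R², as in the results reported here.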
Station  pH      Temperature  Conductance  Dissolved Oxygen  Salinity  Ammonia  R² Value
11258    ----    <0.0001      <0.0001      0.0257            ----      ----     0.30
11271    ----    <0.0001      <0.0001      0.0203            ----      ----     0.31
16617    0.0324  ----         ----         ----              ----      ----     0.20
11295    ----    ----         ----         ----              ----      ----     0.00
11296    ----    ----         ----         ----              ----      ----     0.03
15841    ----    ----         ----         ----              ----      ----     0.17

Table 1: Results obtained from SAS software

The other three stations (11295, 11296, and 15841) showed no statistically significant relationship between any of the parameters and the fecal coliform concentrations present.

Discussion

The results obtained from the multiple linear regression analysis show, generally, nothing so decisive as to lead to a clear conclusion. At two stations, 11258 and 11271, there was clearly a relationship between the various parameters and fecal coliform levels, and the same parameters were responsible at both, with high confidence (p-value < 0.03). These sites are reasonably close to each other geographically but not in the same segment, so although conditions at the two sites might be similar, the agreement is harder to explain than it would be if 11271 and 16617, which are in the same segment, had produced such similar results. At station 16617, pH had a statistically significant effect on fecal coliform values, though this was the only station where that was found. In general, pH would be expected to have an effect, because bacteria prefer close-to-neutral pH. Dissolved oxygen, as a measure of the ability of certain wildlife to live in the water, would also have a potential effect, as would any of the other parameters for various reasons. None of this, despite the theory, is borne out in the data in a widespread fashion, and with no R² value exceeding 0.31, it is clear that there must be other factors involved in determining fecal coliform values in the Houston Ship Channel. There could be a number of explanations for this.
One is that the nature of the data could introduce too much variation: not only have the people gathering the data changed, but the ways in which they gather it have evolved, and the recorded location at which a sample is collected is almost always an estimate (“Water Quality Sampling and Shipping Procedures” 2008). Another explanation for the lack of clear trends in the multiple regression analysis is that there are factors known to affect fecal coliform levels, such as storm water runoff, that are not taken into account in the data set, because they have never been consistently measured historically. A third explanation is that the Ship Channel itself has changed too much. The entire bank of the ship channel has been created and recreated by old and new construction; it has been dredged to keep it sufficiently deep for large ships; old barges have been incorporated into the banks to form sturdy walls and left slowly to rust; new highways have been built and old bridges demolished; and various chemical plants have released different effluents over the years. Much water body analysis, by contrast, concerns water bodies that change slowly relative to the period over which data are collected, or in only a few noticeable ways at a time (Eleria & Vogel 2005). It seems reasonable to conclude, after examining the results from this multiple linear regression analysis, that only a much larger dataset would allow clear trends to emerge explaining the relationships between fecal coliform and other variables in the Houston Ship Channel.

Acknowledgments

The research study described herein was sponsored by the National Science Foundation under Award No. EEC-0649163. The opinions expressed in this study are those of the authors and do not necessarily reflect the views of the sponsor. Additional acknowledgment goes to Dr. Hanadi Rifai and Anu Desai for immense professional support and assistance.

References

Eleria, A., & Vogel, R. M. (2005).
Predicting Fecal Coliform Bacteria Levels in the Charles River, Massachusetts, USA. Journal of the American Water Resources Association, 41, 1195-1209.

Ge, Z., & Frick, W. E. (2007). Some statistical issues related to multiple linear regression modeling of beach bacteria concentrations. Environmental Research, 103, 358-364.

Houston Metropolitan Area: A TMDL Project for Bacteria. (n.d.). In Texas Commission on Environmental Quality. Retrieved July, 2008, from www.tceq.org/goto/tmdl/

Houston-Galveston Area Council. (n.d.). Greens Bayou Watershed Brochure [Brochure]. Author. Houston-Galveston Area Council. Retrieved July, 2008, from http://www.h-gac.com/community/water/resources/default.aspx

Sampson, R. W., Swiatnicki, S. A., McDermott, C. M., & Kleinheinz, G. T. (2006). The Effects of Rainfall on Escherichia coli and Total Coliform Levels at 15 Lake Superior Recreational Beaches. Water Resources Management, 20, 151-159.

SAS. (2008). Predictive Analytics Software SAS. In SAS. Retrieved July, 2008, from http://www.sas.com/technologies/analytics/index.html

Stellin, S. (2008, August 1). Is the Water Actually Fine? New York Times. Retrieved August, 2008, from http://travel.nytimes.com/2008/08/01/travel/escapes/01beach.html?scp=1&sq=beach%20contamination&st=cse

USA. Texas Commission on Environmental Quality. (n.d.). Basin 10 San Jacinto River. Retrieved July, 2008, from http://www.tceq.state.tx.us

USA. Texas Commission on Environmental Quality. (n.d.). FY 2009 Monitoring Priorities for Category 5c Impairments. Retrieved July, 2008, from http://www.tceq.state.tx.us

USA. Texas Commission on Environmental Quality. (n.d.). Water Quality Sampling and Shipping Procedures. Texas Commission on Environmental Quality. Retrieved July, 2008, from http://www.tceq.state.tx.us

USA. Texas Commission on Environmental Quality. (2008, March 19). Texas 303(d) List. Retrieved June, 2008, from http://www.tceq.state.tx.us/assets/public/compliance/monops/water/08twqi/2008_303d.pdf

USA. U.S.
Environmental Protection Agency. Office of Water. (2003). Bacterial Water Quality Standards for Recreational Waters (Freshwater and Marine Waters) Status Report. Washington, DC. Retrieved July, 2008, from http://www.epa.gov/waterscience/beaches/local/statrept.pdf