MENDMC01_0131877062.QXD 3/29/06 11:48 PM Page 1 MARKED SET

CHAPTER 1
Introduction

OBJECTIVE
To identify the role of statistics in the analysis of data from engineering and the sciences

CONTENTS
1.1 Statistics: The Science of Data
1.2 Fundamental Elements of Statistics
1.3 Types of Data
1.4 The Role of Statistics in Critical Thinking
1.5 A Guide to Statistical Methods Presented in This Text

STATISTICS IN ACTION
Contamination of Fish in the Tennessee River: Collecting the Data

1.1 Statistics: The Science of Data

According to The Random House College Dictionary, statistics is “the science that deals with the collection, classification, analysis, and interpretation of numerical facts or data.” In short, statistics is the science of data.

Definition 1.1 Statistics is the science of data. This involves collecting, classifying, summarizing, organizing, analyzing, and interpreting data.

The science of statistics is commonly applied to two types of problems:
1. Summarizing, describing, and exploring data
2. Using sample data to infer the nature of the data set from which the sample was selected

As an illustration of the descriptive applications of statistics, consider the United States census, which involves the collection of a data set that purports to characterize the socioeconomic characteristics of the approximately 295 million people living in the United States. Managing this enormous mass of data is a problem for the computer software engineer, and describing the data utilizes the methods of statistics. Similarly, an environmental engineer uses statistics to describe the data set consisting of the daily emissions of sulfur oxides of an industrial plant recorded for 365 days last year. The branch of statistics devoted to these applications is called descriptive statistics.

Definition 1.2 The branch of statistics devoted to the organization, summarization, and description of data sets is called descriptive statistics.

Sometimes the phenomenon of interest is characterized by a data set that is either physically unobtainable or too costly or time-consuming to obtain. In such situations, we obtain a subset of the data, called a sample, and use the sample information to infer its nature. To illustrate, suppose the phenomenon of interest is the drinking-water quality on an inhabited, but remote, Pacific island. You might expect water quality to depend on such factors as temperature of the water, the level of the most recent rainfall, etc. In fact, if you were to measure the water quality repeatedly within the same hour at the same location, the quality measurements would vary, even for the same water temperature. Thus, the phenomenon “drinking-water quality” is characterized by a large data set that consists of many (actually, an infinite number of) water quality measurements, a data set that exists only conceptually. To determine the nature of this data set, we sample it (i.e., we record quality for n water specimens collected at specified times and locations) and then use this sample of n quality measurements to infer the nature of the large conceptual data set of interest. The branch of statistics used to solve this problem is called inferential statistics.

Definition 1.3 The branch of statistics concerned with using sample data to make an inference about a large set of data is called inferential statistics.

1.2 Fundamental Elements of Statistics

In statistical terminology, the data set that we want to describe, the one that characterizes a phenomenon of interest to us, is called a population. Then, we can define a sample as a subset of data selected from a population.
Sometimes, the words population and sample are used to represent the objects upon which the measurements are taken (i.e., the experimental units). In a particular study, the meaning attached to these terms will be clear by the context in which they are used.

Definition 1.4 A statistical population is a data set (usually large, sometimes conceptual) that is our target of interest.

Definition 1.5 A sample is a subset of data selected from the target population.

Definition 1.6 The object (e.g., person, thing, transaction, specimen, or event) upon which measurements are collected is called the experimental unit. (Note: A population consists of data collected on many experimental units.)

In studying populations and samples, we focus on one or more characteristics or properties of the experimental units in the population. The science of statistics refers to these characteristics as variables. For example, in the drinking-water quality study, two variables of interest to engineers are the chlorine-residual (measured in parts per million) and the number of fecal coliforms in a 100-milliliter water specimen.

Definition 1.7 A variable is a characteristic or property of an individual experimental unit.

Example 1.1 Engineers with the University of Kentucky Transportation Research Program have collected data on accidents occurring at intersections in Lexington, Kentucky. One of the goals of the study was to estimate the rate at which left-turn accidents occur at intersections without left-turn-only lanes. This estimate will be used to develop numerical warrants (or guidelines) for the installation of left-turn lanes at all major Lexington intersections. The engineers collected data at each of 50 intersections without left-turn-only lanes over a 1-year period.
At each intersection, they monitored traffic and recorded the total number of cars turning left that were involved in an accident.
a. Identify the variable and experimental unit for this study.
b. Describe the target population and the sample.
c. What inference do the transportation engineers want to make?

Solution
a. Since the engineers collected data at each of 50 intersections, the experimental unit is an intersection without a left-turn-only lane. The variable measured is the total number of cars turning left that were involved in an accident.
b. The goal of the study is to develop guidelines for the installation of left-turn lanes at all major Lexington intersections; consequently, the target population consists of all major intersections in the city. The sample consists of the subset of 50 intersections monitored by the engineers.
c. The engineers will use the sample data to estimate the rate at which left-turn accidents occur at all major Lexington intersections. (We learn, in Chapter 7, that this estimate is the number of left-turn accidents in the sample divided by the total number of cars making left turns in the sample.)

The preceding definitions and example identify four of the five elements of an inferential statistical problem: a population, one or more variables of interest, a sample, and an inference. The fifth element pertains to knowing how good the inference is, that is, the reliability of the inference. The measure of reliability that accompanies an inference separates the science of statistics from the art of fortune-telling. A palm reader, like a statistician, may examine a sample (your hand) and make inferences about the population (your future life). However, unlike statistical inferences, the palm reader’s inferences include no measure of how likely the inference is to be true.
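
The estimate described in part c of the Example 1.1 solution (left-turn accidents in the sample divided by left turns in the sample) can be sketched in a few lines of Python. This is only an illustration: the function name `left_turn_accident_rate` and the three pairs of counts are hypothetical and are not taken from the Lexington study.

```python
def left_turn_accident_rate(accidents, left_turns):
    """Pooled rate: total left-turn accidents divided by total left turns."""
    if len(accidents) != len(left_turns):
        raise ValueError("need one accident count and one turn count per intersection")
    total_turns = sum(left_turns)
    if total_turns == 0:
        raise ValueError("no left turns recorded in the sample")
    return sum(accidents) / total_turns


# Hypothetical counts for three sampled intersections (not data from the study).
accidents = [2, 0, 3]
left_turns = [12_000, 8_000, 20_000]

rate = left_turn_accident_rate(accidents, left_turns)
print(f"Estimated rate: {rate:.6f} accidents per left turn")
```

The sketch pools the counts over all sampled intersections before dividing, which matches the wording of the estimate above; averaging the per-intersection rates would generally give a different number.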

To illustrate, consider the transportation engineers’ estimate of the left-turn accident rate at Lexington, Kentucky, intersections in Example 1.1. The engineers are interested in the error of estimation (i.e., the difference between the sample accident rate and the accident rate for the target population). Using statistical methods, we can determine a bound on the estimation error. This bound is simply a number (e.g., 10%) that our estimation error is not likely to exceed. In later chapters, we learn that this bound is used to help measure our “confidence” in the inference. The reliability of statistical inferences is discussed throughout this text. For now, simply realize that an inference is incomplete without a measure of reliability.

Definition 1.8 A measure of reliability is a statement (usually quantified) about the degree of uncertainty associated with a statistical inference.

A summary of the elements of both descriptive and inferential statistical problems is given in the following boxes.

Four Elements of Descriptive Statistical Problems
1. The population or sample of interest
2. One or more variables (characteristics of the population or sample units) that are to be investigated
3. Tables, graphs, or numerical summary tools
4. Identification of patterns in the data

Five Elements of Inferential Statistical Problems
1. The population of interest
2. One or more variables (characteristics of the experimental units) that are to be investigated
3. The sample of experimental units
4. The inference about the population based on information contained in the sample
5. A measure of reliability for the inference

Applied Exercises

1.1 Steel anticorrosion study. Researchers at the Department of Materials Science and Engineering, National Technical University (Athens, Greece), examined the anticorrosive behavior of different epoxy coatings on steel. (Pigment & Resin Technology, Vol. 32, 2003.) Flat panels cut from steel sheets were coated with one of four different types of epoxy (S1, S2, S3, and S4). After exposing the panels to water for one day, the corrosion rate (nanoamperes per square centimeter) was determined for each panel.
a. What are the experimental units for the study?
b. Suppose you are interested in describing only the corrosion rates of steel panels coated with epoxy type S1. Define the target population and relevant sample.

1.2 Success/failure of software reuse. The PROMISE Software Engineering Repository, hosted by the University of Ottawa, is a collection of publicly available data sets to serve researchers in building prediction software models. A PROMISE data set on software reuse, saved in the SWREUSE file, provides information on the success or failure of reusing previously developed software for each project in a sample of 24 new software development projects. (Data source: IEEE Transactions on Software Engineering, Vol. 28, 2002.) Of the 24 projects, 9 were judged failures and 15 were successfully implemented.
a. Identify the experimental units for this study.
b. Describe the population from which the sample is selected.
c. Use the sample information to make an inference about the population.

1.3 Orchard contamination from insecticides. Pesticides applied to an extensively grown crop can result in inadvertent ambient air contamination. Environmental Science & Technology (Oct. 1993) reported on thion residues of the insecticide chlorpyrifos used on dormant orchards in the San Joaquin Valley, California. Ambient air specimens were collected daily at an orchard site during an intensive period of spraying (a total of 13 days), and the thion level (ng/m³) was measured each day.
a. Identify the population of interest to the researchers.
b. Identify the sample.

1.4 Ground motion of earthquakes. In the Journal of Earthquake Engineering (Nov., 2004), a team of civil and environmental engineers studied the ground motion characteristics of 15 earthquakes that occurred around the world between 1940 and 1995. Three (of many) variables measured on each earthquake were the type of ground motion (short, long, or forward directive), earthquake magnitude (Richter scale), and peak ground acceleration (feet per second per second). One of the goals of the study was to estimate the inelastic spectra of any ground motion cycle.
a. Identify the experimental units for this study.
b. Do the data for the 15 earthquakes represent a population or a sample? Explain.

1.5 Computer power system load currents. Electrical engineers recognize that high neutral current in computer power systems is a potential problem. To determine the extent of the problem, a survey of the computer power system load currents at 146 U.S. sites was taken (IEEE Transactions on Industry Applications, July/Aug. 1990). The survey revealed that less than 10% of the sites had high neutral to full-load current ratios.
a. Identify the population of interest.
b. Identify the sample.
c. Use the sample information to make an inference about the population.

1.6 Precooling vegetables. Researchers have developed a new precooling method for preparing Florida vegetables for market. The system employs an air and water mixture designed to yield effective cooling with a much lower water flow than conventional hydrocooling. To compare the effectiveness of the two systems, 20 batches of green tomatoes were divided into two groups; one group was precooled with the new method, and the other with the conventional method. The water flow (in gallons) required to effectively cool each batch was recorded.
a. Identify the population, the samples, and the type of statistical inference to be made for this problem.
b. How could the sample data be used to compare the cooling effectiveness of the two systems?

1.7 Weekly carbon monoxide data. The World Data Centre for Greenhouse Gases collects and archives data for greenhouse and related gases in the atmosphere. One such data set lists the level of carbon monoxide gas (measured in parts per billion) in the atmosphere each week at the Cold Bay, Alaska, weather station. The weekly data for the years 2000–2002 are saved in the COGAS file.
a. Identify the variable measured and the corresponding experimental unit.
b. If you are interested in describing only the weekly carbon monoxide values at Cold Bay station for the years 2000–2002, do the data represent a population or a sample? Explain.

1.8 Monitoring defective items. Checking all manufactured items coming off an assembly line for defectives would be a costly and time-consuming procedure. One effective and economical method of checking for defectives involves the selection and examination of a portion of the items by a quality control engineer. The percentage of examined items that are defective is computed and then used to estimate the percentage of all items manufactured on the line that are defective. Identify the population, the sample, and a type of statistical inference to be made for this problem.

1.3 Types of Data

Data can be one of two types, quantitative or qualitative. Quantitative data are those that represent the quantity or amount of something, measured on a numerical scale. For example, the power frequency (measured in megahertz) of a semiconductor is a quantitative variable, as is the breaking strength (measured in pounds per square inch) of steel pipe. In contrast, qualitative (or categorical) data possess no quantitative interpretation. They can only be classified.
The set of n occupations corresponding to a group of n engineering graduates is a qualitative data set. The type of pigment (zinc or mica) used in an anticorrosion epoxy coating also represents qualitative data.*

*A finer breakdown of data types into nominal, ordinal, interval, and ratio data is possible. Nominal data are qualitative data with categories that cannot be meaningfully ordered. Ordinal data are also qualitative data, but a distinct ranking of the groups from high to low exists. Interval and ratio data are two different types of quantitative data. For most statistical applications (and all the methods presented in this introductory text), it is sufficient to classify data as either quantitative or qualitative.

Definition 1.9 Quantitative data are those that represent the quantity or amount of something.

Definition 1.10 Qualitative data are those that have no quantitative interpretation, i.e., they can only be classified into categories.

Example 1.2 The Journal of Performance of Constructed Facilities (Feb., 1990) reported on the performance dimensions of water distribution networks in the Philadelphia area. For one part of the study, the following variables were measured for each sampled water pipe section. Identify the data produced by each as quantitative or qualitative.
a. Pipe diameter (measured in inches)
b. Pipe material (steel or PVC)
c. Pipe location (Center City or suburbs)
d. Pipe length (measured in feet)

Solution
Both pipe diameter (in inches) and pipe length (in feet) are measured on a meaningful numerical scale; hence, these two variables produce quantitative data. Both type of pipe material and pipe location can only be classified: material is either steel or PVC; location is either Center City or the suburbs. Consequently, pipe material and pipe location are both qualitative variables.
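
The classification in Example 1.2 can be mimicked with a small Python sketch. It assumes a simple rule that is not stated in the text: a variable is labeled quantitative when every recorded value is a number, and qualitative otherwise. The function name `data_type`, the column names, and the pipe values are all hypothetical.

```python
from numbers import Number


def data_type(values):
    """Label a variable 'quantitative' if all values are numeric, else 'qualitative'."""
    numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if numeric else "qualitative"


# Hypothetical records for three pipe sections (not data from the cited study).
pipes = {
    "diameter_in": [6, 8, 12],
    "material": ["steel", "PVC", "steel"],
    "location": ["Center City", "suburbs", "suburbs"],
    "length_ft": [120.5, 300.0, 85.25],
}

types = {name: data_type(values) for name, values in pipes.items()}
for name, kind in types.items():
    print(f"{name}: {kind}")
```

A rule based on storage type is only a first pass. Categories stored as numeric codes (for example, 1 for steel and 2 for PVC) would be mislabeled as quantitative, so the final call still rests on whether the numbers measure an amount.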

The proper statistical tool used to describe and analyze data will depend on the type of data. Consequently, it is important to differentiate between quantitative and qualitative data.

Applied Exercises

1.9 Drinking-water quality study. Disasters (Vol. 28, 2004) published a study of the effects of a tropical cyclone on the quality of drinking water on a remote Pacific island. Water samples (size 500 milliliters) were collected approximately 4 weeks after Cyclone Ami hit the island. The following variables were recorded for each water sample. Identify each variable as quantitative or qualitative.
a. Town where sample was collected
b. Type of water supply (river intake, stream, or borehole)
c. Acidic level (pH scale, 1 to 14)
d. Turbidity level (nephelometric turbidity units = NTUs)
e. Temperature (degrees Centigrade)
f. Number of fecal coliforms per 100 milliliters
g. Free chlorine-residual (milligrams per liter)
h. Presence of hydrogen sulphide (yes or no)

1.10 Extinct New Zealand birds. Environmental engineers at the University of California (Riverside) are studying the patterns of extinction in the New Zealand bird population. (Evolutionary Ecology Research, July 2003.) The following characteristics were determined for each bird species that inhabited New Zealand at the time of the Maori colonization (i.e., prior to European contact).
a. Flight capability (volant or flightless)
b. Habitat type (aquatic, ground terrestrial, or aerial terrestrial)
c. Nesting site (ground, cavity within ground, tree, cavity above ground)
d. Nest density (high or low)
e. Diet (fish, vertebrates, vegetables, or invertebrates)
f. Body mass (grams)
g. Egg length (millimeters)
h. Extinct status (extinct, absent from island, present)

1.11 Computer power system load currents. Refer to the IEEE Transactions on Industry Applications (July/Aug. 1990) survey of computer power system load currents in Exercise 1.5.
In addition to the ratio of neutral current to full-load current, the researchers also recorded the type of load (line-to-line or line-to-neutral) and the computer system vendor. Identify the type of data for each variable recorded.

1.12 CT scanning for lung cancer. A new type of screening for lung cancer, computed tomography (CT), has been developed. Medical physicists believe CT scans are more sensitive than regular X-rays in pinpointing small tumors. The H. Lee Moffitt Cancer Center at the University of South Florida is currently conducting a clinical trial of 50,000 smokers nationwide to compare the effectiveness of CT scans with X-rays for detecting lung cancer. (Todays’ Tomorrows, Fall 2002.) Each participating smoker is randomly assigned to one of two screening methods, CT or chest X-ray, and their progress tracked over time. In addition to the type of screening method used, the physicists recorded the age at which the scanning method first detects a tumor for each smoker.
a. Identify the experimental units of the study.
b. Identify the two variables measured for each experimental unit.
c. Identify the type (quantitative or qualitative) of the variables measured.
d. What is the inference that will ultimately be drawn from the clinical trial?

1.13 Deep hole drilling. “Deep hole” drilling is a family of drilling processes used when the ratio of hole depth to hole diameter exceeds 10. Successful deep hole drilling depends on the satisfactory discharge of the drill chip. An experiment was conducted to investigate the performance of deep hole drilling when chip congestion exists (Journal of Engineering for Industry, May 1993). Some important variables in the drilling process are described here. Identify the data type for each variable.
a. Chip discharge rate (number of chips discarded per minute)
b. Drilling depth (millimeters)
c. Oil velocity (millimeters per second)
d. Type of drilling (single-edge, BTA, or ejector)
e. Quality of hole surface

1.14 National Bridge Inventory. All highway bridges in the United States are inspected periodically for structural deficiency by the Federal Highway Administration (FHWA). Data from the FHWA inspections are compiled into the National Bridge Inventory (NBI). Several of the nearly 100 variables maintained by the NBI are listed below. Classify each variable as quantitative or qualitative.
a. Length of maximum span (feet)
b. Number of vehicle lanes
c. Toll bridge (yes or no)
d. Average daily traffic
e. Condition of deck (good, fair, or poor)
f. Bypass or detour length (miles)
g. Route type (interstate, U.S., state, county, or city)

1.4 The Role of Statistics in Critical Thinking

Experimental research in engineering and the sciences typically involves the use of experimental data (a sample) to infer the nature of some conceptual population that characterizes a phenomenon of interest to the experimenter. This inferential process is an integral part of the scientific method. Inference based on experimental data is first used to develop a theory about some phenomenon. Then the theory is tested against additional sample data.

How does the science of statistics contribute to this process? To answer this question, we must note that inferences based on sample data will almost always be subject to error, because a sample will not provide an exact image of the population. The nature of the information provided by a sample depends on the particular sample chosen and thus will change from sample to sample. For example, suppose you want to estimate the proportion of all steel alloy failures at U.S. petrochemical plants caused by stress corrosion cracking. You investigate the cause of failure for a sample of 100 steel alloy failures and find that 47 were caused by stress corrosion cracking.
Does this mean that exactly 47% of all steel alloy failures at petrochemical plants are caused by stress corrosion cracking? Of course, the answer is “no.” Suppose that, unknown to you, the true percentage of steel alloy failures caused by stress corrosion cracking is 44%. One sample of 100 failures might yield 47 that were caused by cracking, whereas another sample of 100 might yield only 42. Thus, an inference based on sampling is always subject to uncertainty.

On the other hand, suppose one petrochemical plant experienced a steel alloy failure rate of 81%. Is this an unusually high failure rate, given the sample rate of 47%? The theory of statistics uses probability to measure the uncertainty associated with an inference. It enables engineers and scientists to calculate the probabilities of observing specific samples or data measurements, under specific assumptions about the population. These probabilities are used to evaluate the uncertainties associated with sample inferences; for example, we can determine whether the plant’s steel alloy failure rate of 81% is unusually high by calculating the chance of observing such a high rate given the sample information.

FIGURE 1.1 Flowchart of statistical methods described in the text [figure not reproduced]
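
The sample-to-sample variation in the stress corrosion cracking illustration can be quantified with a short Python sketch. It rests on an assumption the text does not state: the 100 failures are treated as independent draws, so the number caused by cracking follows a binomial distribution with n = 100 and true proportion 0.44. The helper name `binom_pmf` and the choice of a 5-percentage-point window are illustrative only.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 100        # failures examined in one sample
p_true = 0.44  # true proportion caused by cracking (the value supposed in the text)

p_47 = binom_pmf(47, n, p_true)  # chance a sample shows exactly 47 cracking failures
p_42 = binom_pmf(42, n, p_true)  # chance a sample shows exactly 42

# Chance the sample percentage lands within 5 points of 44%, i.e. 39 to 49 failures.
p_within_5 = sum(binom_pmf(k, n, p_true) for k in range(39, 50))

print(f"P(X = 47) = {p_47:.4f}")
print(f"P(X = 42) = {p_42:.4f}")
print(f"P(39 <= X <= 49) = {p_within_5:.4f}")
```

Under this model, any single count such as 47 or 42 has a small probability, while a range of counts around the true value carries most of the probability. Statements of that second kind are what a measure of reliability attaches to an estimate.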

Thus, a major contribution of statistics is that it enables engineers and scientists to make inferences (estimates and decisions about the target population) with a known measure of reliability. With this ability, an engineer can make intelligent decisions and inferences from data; that is, statistics helps engineers to think critically about their results.

Definition 1.11 Statistical thinking involves applying rational thought and the science of statistics to critically assess data and inferences.

1.5 A Guide to Statistical Methods Presented in This Text

Although we present some useful methods for exploring and describing data sets (Chapter 2), the major emphasis in this text and in modern statistics is in the area of inferential statistics. The flowchart in Figure 1.1 is provided as an outline of the chapters in this text and as a guide to selecting the statistical method appropriate for your particular analysis.

Quick Review

Key Terms
Data
Descriptive statistics
Experimental unit
Inference
Inferential statistics
Measure of reliability
Measurement error
Population
Qualitative data
Quantitative data
Reliability
Sample
Statistical thinking
Statistics
Variable

Chapter Summary Notes
• Two types of statistical applications: descriptive and inferential
• Fundamental elements of statistics: population, experimental units, variable, sample, inference, measure of reliability
• Descriptive statistics involves summarizing and describing data sets.
• Inferential statistics involves using a sample to make inferences about a population.
• Two types of data: quantitative and qualitative

Supplementary Exercises

1.15 Reliability of a computer system. The reliability of a computer system is measured in terms of the lifelength of a specified hardware component (e.g., the hard disk drive).
To estimate the reliability of a particular system, 100 computer components are tested until they fail, and their lifelengths are recorded.
a. What is the population of interest?
b. What is the sample?
c. Are the data quantitative or qualitative?
d. How could the sample information be used to estimate the reliability of the computer system?

1.16 Traveling turtle hatchlings. Hundreds of sea turtle hatchlings, instinctively following the bright lights of condominiums, wandered to their deaths across a coastal highway in Florida (Tampa Tribune, Sept. 16, 1990). This incident led researchers to begin experimenting with special low-pressure sodium lights. One night, 60 turtle hatchlings were released on a dark beach and their direction of travel noted. The next night, the special lights were installed and the same 60 hatchlings were released. Finally, on the third night, tar paper was placed over the sodium lights. Consequently, the direction of travel was recorded for each hatchling under three experimental conditions: darkness, sodium lights, and sodium lights covered with tar paper.
a. Identify the population of interest to the researchers.
b. Identify the sample.
c. What type of data were collected, quantitative or qualitative?

1.17 Acid neutralizer experiment. A chemical engineer conducts an experiment to determine the amount of hydrochloric acid necessary to neutralize 2 milliliters (ml) of a newly developed cleaning solution. The chemist prepares five 2-ml portions of the solution and adds a known concentration of hydrochloric acid to each. The amount of acid necessary to achieve neutrality of the solution is determined for each of the five portions.
a. Identify the experimental units for the study.
b. Identify the variable measured.
c. Describe the population of interest to the chemical engineer.
d. Describe the sample.

1.18 PCB in soil.
A preliminary study was conducted to obtain information on the background levels of the toxic substance polychlorinated biphenyl (PCB) in soil samples in the United Kingdom (Chemosphere, Feb. 1986). For each soil sample taken, the researchers recorded the location (rural or urban) and the PCB level (measured in grams per kilogram of soil). Identify the variables measured as quantitative or qualitative.

1.19 Intellectual development of engineering students. Perry’s model of intellectual development was applied to undergraduate engineering students at Penn State (Journal of Engineering Education, Jan. 2005). Perry scores (ranging from 1 to 5) were determined for 21 students in a first-year, project-based design course. (Note: A Perry score of 1 indicates the lowest level of intellectual development, and a Perry score of 5 indicates the highest level.) The average Perry score for the 21 students was 3.27.
a. Identify the experimental units for this study.
b. What is the population of interest? The sample?
c. What type of data, quantitative or qualitative, are collected?
d. Use the sample information to make an inference about the population.

1.20 Type of data. State whether each of the following data sets is quantitative or qualitative.
a. Arrival times of 16 reflected seismic waves
b. Types of computer software used in a database management system
c. Brands of calculator used by 100 engineering students on campus
d. Ash contents in pieces of coal from three different mines
e. Mileages attained by 12 automobiles powered by alcohol
f. Numbers of print characters per line of computer output for 20 line printers
g. Shift supervisors in charge of computer operations at an airline company
h. Accident rates at 46 machine shops

1.21 Structurally deficient bridges. Refer to Exercise 1.14. The most recent NBI data were analyzed, and the results published in the Journal of Infrastructure Systems (June 1995). Using the FHWA inspection ratings, each of the 470,515 highway bridges in the United States was categorized as structurally deficient, functionally obsolete, or safe. About 26% of the bridges were found to be structurally deficient, and 19% were functionally obsolete.
a. What is the variable of interest to the researchers?
b. Is the variable of part a quantitative or qualitative?
c. Is the data set analyzed a population or a sample? Explain.
d. How did the researchers obtain the data for their study?

STATISTICS IN ACTION
Contamination of Fish in the Tennessee River: Collecting the Data

Chemical and manufacturing plants often discharge toxic waste materials into nearby rivers and streams. These toxicants have a detrimental effect on the plant and animal life inhabiting the river and the river’s bank. One type of pollutant, commonly known as DDT, is especially harmful to fish and, indirectly, to people. The Food and Drug Administration sets the limit for DDT content in individual fish at 5 parts per million (ppm). Fish with DDT content exceeding this limit are considered potentially hazardous to people if consumed. A study was undertaken to examine the DDT content of fish inhabiting the Tennessee River (in Alabama) and its tributaries.

The Tennessee River flows in a west–east direction across the northern part of the state of Alabama, through Wheeler Reservoir, a national wildlife refuge. Ecologists fear that contaminated fish migrating from the mouth of the river to the reservoir could endanger other wildlife that prey on the fish. This concern is more than academic. A manufacturing plant was once located along Indian Creek, which enters the Tennessee River 321 miles upstream from the mouth. Although the plant has been inactive for over 10 years, there is evidence that the plant discharged toxic materials into the creek, contaminating all the fish in the immediate area.
Have the fish in the Tennessee River and its tributary creeks also been contaminated? And if so, how far upstream have the contaminated fish migrated? To answer these and other questions, members of the U.S. Army Corps of Engineers collected fish specimens at different locations along the Tennessee River and three tributary creeks: Flint Creek (which enters the river 309 miles upstream from the river’s mouth), Limestone Creek (310 miles upstream), and Spring Creek (282 miles upstream). Each fish was first weighed (in grams) and measured (length in centimeters), then the fillet of the fish was extracted and the DDT concentration (in parts per million) in the fillet was measured. The FISHDDT file contains the length, weight, and DDT measurements for a total of 144 fish specimens.

Obviously, not all the fish in the Tennessee River and its tributaries were captured. Consequently, the data are based on a sample collected from the population of all fish inhabiting the Tennessee River. Here, the words population and sample are used to describe the objects upon which the measurements are taken, i.e., the fish. We could also use the terms to represent data sets. For example, the 144 DDT measurements represent a sample collected from the population consisting of DDT measurements for all fish inhabiting the river.

In addition to the quantitative variables length, weight, and DDT concentration, notice that the data set also contains information on the qualitative variables location (i.e., where the fish were captured) and species of the fish. Three species of fish were examined: channel catfish, largemouth bass, and smallmouth buffalo. The different symbols for location are interpreted as follows. The first two characters represent the river or creek, and the remaining characters represent the distance (in miles) from the mouth of the river or creek.
For example, FCM5 indicates that the fish was captured in Flint Creek (FC), 5 miles upstream from the mouth of the creek (M5). Similarly, TRM380 denotes a fish sample collected from the Tennessee River (TR), 380 miles upstream from the river’s mouth (M380).

The U.S. Army Corps of Engineers used the data in the FISHDDT file to compare the DDT contents of fish at different locations and among the different species, and to determine the relationship (if any) of length and weight to DDT content. In subsequent chapters, we demonstrate several of these analyses in examples, exercises, and in a Statistics in Action section.
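
The location coding described above (a two-character waterway prefix, then M and the distance in miles from the mouth) can be decoded with a short Python sketch. The function name `parse_location` and the lookup table are hypothetical. Only the two prefixes spelled out in the text, FC and TR, are mapped to names; any other prefix is returned unchanged rather than guessed.

```python
import re

# Prefixes explained in the text; codes for the other creeks are not given there.
WATERWAYS = {"FC": "Flint Creek", "TR": "Tennessee River"}


def parse_location(code):
    """Split a code such as 'FCM5' into (waterway, miles upstream from the mouth)."""
    match = re.fullmatch(r"([A-Z]{2})M(\d+)", code)
    if match is None:
        raise ValueError(f"unrecognized location code: {code!r}")
    prefix, miles = match.group(1), int(match.group(2))
    # Fall back to the raw prefix when its waterway name is not known.
    return WATERWAYS.get(prefix, prefix), miles


for code in ("FCM5", "TRM380"):
    waterway, miles = parse_location(code)
    print(f"{code}: {waterway}, {miles} miles upstream from the mouth")
```

Splitting the code this way turns one qualitative variable (location) into a qualitative part (the waterway) and a quantitative part (the distance), which can then be compared with the DDT measurements.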