Chapter 1: Introduction

What is statistics? Statistics is the science of data. It involves collecting, classifying, summarizing, organizing, analyzing, and interpreting numerical information. Statistics has two branches:

Descriptive statistics: mean, median, variance, standard deviation, histograms, pie charts.
Inferential statistics: hypothesis testing (drawing conclusions and making decisions).

1.1 Some Important Terminology

Population: the set of all measurements of interest to the researcher. A population may be finite or infinite, and real or hypothetical.

Sample: a subset of measurements selected from the population of interest.

Random sample: a sample chosen so that every possible sample of size n that can be selected from the population has the same probability of being selected.

Sample of convenience: a sample consisting of whatever measurements happen to be available and convenient.

Statistic: a measure computed from sample data. EX: the sample mean x̄.

Parameter: a constant that determines the specific form of a density function. Parameters are usually unknown, and we need to estimate them. EX: the population mean µ.

Random variable: the numerical data on which we perform statistical analyses are the outcomes of a random sampling procedure. A set of such outcomes is called a random variable.

Continuous variable: a random variable is continuous if the values it can assume consist of all real numbers in some interval.

Discrete variable: a random variable is discrete if it can assume only a finite or countably infinite number of values.

Population vs. sample: we try to describe or predict the behavior of the population on the basis of information obtained from a representative sample of that population.

1.2 Hypothesis Testing

Statistical inference takes two forms: hypothesis testing (a statement about one or more populations) and interval estimation (Section 1.3).

A statistical test of hypothesis consists of:
1. The null hypothesis, denoted by H0.
2. The alternative hypothesis, denoted by H1.
3. The test statistic and its p-value.
4. The decision rule.
5. The conclusion.

Alternative hypothesis H1: generally the hypothesis that the researcher wishes to support.
Null hypothesis H0: a contradiction of the alternative hypothesis. A statistical researcher always begins by assuming H0 is true and uses the sample data to reach a conclusion.

Statement 1: The average height for CMU students is different from 75 inches.
H0: µ = 75    H1: µ ≠ 75
Statement 2: The average height for CMU students is greater than 75 inches.
H0: µ ≤ 75    H1: µ > 75
Statement 3: The average height for CMU students is less than 75 inches.
H0: µ ≥ 75    H1: µ < 75

An alternative hypothesis is one-tailed (H1: µ < µ0 or H1: µ > µ0) or two-tailed (H1: µ ≠ µ0).

Test statistic: a single number calculated from the sample data.

p-value: the probability, computed assuming H0 is true, of observing a value of the test statistic at least as extreme as the one actually observed. Either or both of these measures act as decision makers for the researcher in deciding whether to reject or accept H0.

Rejection region: the set of values of the test statistic that support H1 and lead to rejection of H0.

Critical values: the values that separate the acceptance and rejection regions.

Level of significance α: the risk you are willing to take of making an incorrect decision.

Decision rule: Reject H0 if p-value ≤ α.
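As a concrete illustration of the five steps, the sketch below runs a two-tailed one-sample t-test of Statement 1 and applies the decision rule. It is a minimal sketch only: it assumes Python with scipy, the height data are hypothetical, and the t statistic is one common choice of test statistic, not the only one.

```python
# Minimal sketch of the five-step test for Statement 1 (two-tailed),
# using HYPOTHETICAL height data. Requires numpy and scipy.
import numpy as np
from scipy import stats

alpha = 0.05   # level of significance
mu0 = 75       # hypothesized mean height (inches)

# Hypothetical sample of student heights (illustration only)
heights = np.array([71.2, 74.5, 76.1, 69.8, 72.3, 75.4, 73.4, 70.9, 74.8, 72.6])

# Test statistic and p-value for H0: mu = 75 vs H1: mu != 75
t_stat, p_value = stats.ttest_1samp(heights, popmean=mu0)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")

# Decision rule: reject H0 if p-value <= alpha
if p_value <= alpha:
    print("Reject H0: the mean height differs from 75 inches.")
else:
    print("Fail to reject H0: insufficient evidence at the 5% level.")
```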
The courtroom analogy:

                                True State of Nature
  Evidence against suspect      Innocent              Guilty
  Innocent                      Correct decision      Mistake
  Guilty                        Mistake               Correct decision

Risk of a test of hypothesis:

                                True State of Nature
  Statistical Conclusion        H0 is true                      H0 is false
  Reject H0                     Type I error (probability α)    Correct decision
  Fail to reject H0             Correct decision                Type II error (probability β)

Note that:
α = P(Type I error) = P(Reject H0 | H0 is true) = P(test statistic falls in the rejection region | H0 is true)
β = P(Type II error) = P(Fail to reject H0 | H0 is false)
Power of a test = 1 − β = P(Reject H0 | H0 is false)

High power is always a desirable characteristic of a test. Power increases with:
a) A larger sample size.
b) A larger significance level: α = 0.05 will result in a more powerful test than α = 0.01.
c) A larger size of the effect to be detected: the greater the true departure from H0, the higher the power.

Efficiency of hypothesis tests: the relative efficiency of test A to test B (for the same H0, H1, α and β) is the ratio nB/nA, where nA and nB are the sample sizes required by tests A and B, respectively.

1.3 Estimation

Point estimation: a single value, called the estimate, computed from sample data.

Interval estimation: two values of the parameter being estimated, a lower value and an upper value. This is a more desirable and useful estimate, called a confidence interval.

The probability that a confidence interval will contain the estimated parameter is called the confidence coefficient, designated by (1 − α).

The probabilistic interpretation is based on the fact that in repeated sampling, 100(1 − α)% of the intervals constructed in the same manner (and with the same sample sizes) contain the parameter being estimated.

The practical interpretation: we say we are 100(1 − α)% "confident" that the single interval constructed contains the parameter we are estimating.

Confidence interval for the parameter θ in probability terms:
P(L0 < θ < U0) = 1 − α, where L0 and U0 are random variables.

Confidence interval for the parameter θ in terms of specified values:
C(L < θ < U) = 100(1 − α)%, where L and U are specified values.

Hypothesis testing and confidence intervals are related: the values of the parameter θ contained in the 100(1 − α)% confidence interval L < θ < U are exactly those values that would not be rejected by the corresponding two-sided test at level α.
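The repeated-sampling interpretation can be checked by simulation. The sketch below is an illustration under assumed conditions (normal data with a known true mean µ = 50, chosen arbitrarily here): it builds the usual t-interval many times and counts how often the interval contains µ. The empirical coverage should come out close to the nominal 100(1 − α)% = 95%.

```python
# Simulation of the repeated-sampling interpretation of a confidence
# interval. Assumes normal data; mu, sigma, n are arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, mu, sigma, n, reps = 0.05, 50.0, 10.0, 25, 10_000

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # critical value, constant across reps
covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, size=n)
    xbar = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)        # estimated standard error
    lower, upper = xbar - t_crit * se, xbar + t_crit * se
    covered += (lower < mu < upper)             # does this interval contain mu?

print(f"Empirical coverage: {covered / reps:.3f} (nominal {1 - alpha:.2f})")
```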
1.4 Measurement Scales

Nominal scale (count data, categorical data): distinguishes one object or event from another on the basis of a name. EX: blood type, gender.

Ordinal scale (categorical data): distinguishes objects or events from one another on the basis of the relative amount of some characteristic they possess. EX: letter grades (A, B, C, D, F).

Interval scale: objects or events can be distinguished from one another and ranked, and the differences between measurements also have meaning. EX: temperature in degrees Celsius or Fahrenheit.

Ratio scale: measurements have the properties of the first three scales and the additional property that their ratios are meaningful. A property of the ratio scale is a true zero, indicating a complete absence of the trait being measured. EX: height, weight.

1.5 Nonparametric Statistics

Why nonparametric statistics? Nonparametric procedures fall into two groups:
a) Truly nonparametric procedures.
b) Distribution-free procedures.

Advantages of nonparametric statistics:
a) They depend on a minimum of assumptions.
b) For some nonparametric procedures, the computations can be quickly and easily performed.
c) They require minimal preparation in mathematics and statistics.
d) They can be used when the data are measured on a weak measurement scale, as when only count data or rank data are available.

Disadvantages of nonparametric statistics:
a) Nonparametric procedures are sometimes used when parametric procedures would be more appropriate; such a practice often wastes information.
b) The arithmetic is in many instances tedious when the sample size is large.

When to use a nonparametric procedure:
a) The hypothesis to be tested does not involve a population parameter.
b) The data have been measured on a scale weaker than that required for a parametric procedure.
c) The assumptions necessary for the valid use of a parametric procedure are not met.
d) Results are needed in a hurry, a computer is not readily available, and calculations must be done by hand.

1.6 Availability and Use of Computer Programs in Nonparametric Statistical Analysis
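The notes do not name a specific package here. As one illustration (an assumption on my part, not necessarily the software the course uses), Python's scipy.stats implements many standard nonparametric procedures. The sketch below applies the Wilcoxon signed-rank test, a distribution-free counterpart of the one-sample t-test, to the same hypothetical height data as before.

```python
# Distribution-free test of H0: the population median height is 75 inches,
# via the Wilcoxon signed-rank test. Data are HYPOTHETICAL (same as above).
import numpy as np
from scipy import stats

m0 = 75
heights = np.array([71.2, 74.5, 76.1, 69.8, 72.3, 75.4, 73.4, 70.9, 74.8, 72.6])

# The test uses only the signs and ranks of the differences from m0,
# so it does not assume normality (only symmetry about the median).
w_stat, p_value = stats.wilcoxon(heights - m0)
print(f"W = {w_stat:.1f}, p-value = {p_value:.4f}")

# For comparison, the parametric one-sample t-test of H0: mu = 75
t_stat, t_p = stats.ttest_1samp(heights, popmean=m0)
print(f"t = {t_stat:.3f}, p-value = {t_p:.4f}")
```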