A frequency distribution is a kind of probability distribution. It gives the frequency, or relative frequency, with which given values have been observed among the data collected. For example, for age:

Frequency table: Var2 (Spreadsheet1)

  Interval           Count   Cumulative Count    Percent    Cumulative Percent
  0  < x <= 10          75          75           58.59375        58.5938
  10 < x <= 20          26         101           20.31250        78.9063
  20 < x <= 30          13         114           10.15625        89.0625
  30 < x <= 40           7         121            5.46875        94.5313
  40 < x <= 50           5         126            3.90625        98.4375
  50 < x <= 60           2         128            1.56250       100.0000
  60 < x <= 70           0         128            0.00000       100.0000
  Missing                0         128            0.00000       100.0000

From the above frequency distribution we can determine that:
- the probability of a patient being between 10 and 20 years of age = 0.2031;
- the probability of a patient being less than or equal to 30 years = 0.8906.

A random variable is a quantity that can assume a number of different values, such that any particular outcome is determined by chance. Discrete random variables take on isolated values, e.g. gender (male/female) or the number of bacteria on a slide or in a volume of suspension. Continuous random variables take on any value within a specified interval, e.g. weight, height or serum cholesterol.

A probability distribution describes the behaviour of a random variable. For discrete variables, it specifies all possible outcomes along with the probability that each will occur. For continuous variables, it specifies the probability associated with specified ranges of values.

Three theoretical probability distributions

1. The binomial distribution

Consider a random variable that can take on one of two possible values, where one of the values can be viewed as a success and the other as a failure, e.g. Y = the presence/absence of disease. Y is a Bernoulli random variable. Let X = the number of successes out of n trials, i.e. the number of patients who are diseased out of n patients examined. The probability distribution for X is the binomial distribution, from which we can determine the probability that X takes on a specific value x:

    P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}

where p = the probability of disease.

[Figure: binomial probability distribution for n = 10, p = 0.71]

Application of the binomial probability distribution (from Pagano & Gauvreau):

Suppose we are interested in investigating the probability that a patient who has been stuck with a needle infected with hepatitis B actually develops the disease. Let Y = the disease status, where Y = 1 if the individual develops hepatitis and Y = 0 if not. If 30% of the patients who are exposed to hepatitis B become infected, then P(Y = 1) = 0.30 and P(Y = 0) = 1 - 0.30 = 0.70.

Suppose we select 5 individuals from the population of patients who have been stuck with a needle infected with hepatitis B. Then X = the number of patients who develop the disease is a binomial random variable with n = 5 and p = 0.30, so we can calculate the probability that X assumes given values, e.g.

    P(X = 2) = \binom{5}{2} (0.30)^2 (1 - 0.30)^{5 - 2} = 10 \times 0.30^2 \times 0.70^3 = 0.309

In addition, the mean number of people who develop the disease in repeated samples of size 5 is np = 5 × 0.30 = 1.5, and the standard deviation is \sqrt{np(1 - p)} = \sqrt{5 \times 0.3 \times 0.7} = 1.02.
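As a purely illustrative aside (the worked output later in these notes uses Stata), the hepatitis B numbers above can be reproduced with a few lines of Python/SciPy; everything in this sketch comes from the example itself.

```python
# Minimal illustrative sketch (Python/SciPy; not part of the original Stata
# examples) reproducing the binomial calculations for the hepatitis B example.
from scipy.stats import binom

n, p = 5, 0.30                 # 5 needle-stick patients, P(infection) = 0.30
print(binom.pmf(2, n, p))      # P(X = 2) = 0.3087, i.e. ~0.309
print(binom.mean(n, p))        # mean np = 1.5
print(binom.std(n, p))         # sd sqrt(np(1-p)) = 1.0247 ~ 1.02
```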
2. The Poisson distribution

The Poisson distribution is used to model discrete events that occur infrequently in time or space. More specifically, let X = the number of occurrences of some event of interest over a given interval, and let λ = the average number of occurrences of the event in that interval. Then

    P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}

The Poisson distribution is used extensively in bacteriology. In this case we are not counting the number of events in a fixed time interval or a fixed one-dimensional length, but rather the number of bacteria on a two-dimensional microscopic slide of given area, or the number of bacteria in a fluid suspension of given volume.

Example (from Armitage & Berry): distribution of counts of a root nodule bacterium in a Petroff-Hausser counting chamber.

  No. of bacteria    Number of squares
  per square         Observed    Expected
  0                      34         32.8
  1                      68         82.1
  2                     112        102.6
  3                      94         85.5
  4                      55         53.4
  5                      21         26.7
  6                      12         11.1
  7                       4          5.7
  ----------------------------------------
  Total                 400        399.9

The mean number of organisms per square is

    (34×0 + 68×1 + 112×2 + 94×3 + 55×4 + 21×5 + 12×6 + 4×7) / 400 = 2.50

We then use the Poisson probability function with λ = 2.5 to calculate the probability of a given number of bacteria per square, e.g.

    P(X = 2) = \frac{e^{-2.5} \, 2.5^2}{2!} = 0.2565

To get the expected frequency, we multiply this probability by the total number of squares (400): 400 × 0.2565 = 102.6.

We note that the observed and expected frequencies correspond quite well, indicating that the bacteria do indeed follow a Poisson distribution.
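Again as an illustrative aside, the expected frequencies in the table can be reproduced in Python/SciPy. One detail is an assumption on my part: the table's final expected value (5.7) is only reproduced if the last cell pools P(X ≥ 7), which also makes the expected column sum to 399.9.

```python
# Minimal illustrative sketch reproducing the expected Poisson frequencies
# for the Armitage & Berry counting-chamber example.
from scipy.stats import poisson

observed = [34, 68, 112, 94, 55, 21, 12, 4]   # squares with 0, 1, ..., 7 bacteria
n_squares = sum(observed)                     # 400
mean = sum(k * f for k, f in enumerate(observed)) / n_squares  # 2.4975 ~ 2.50
lam = 2.5                                     # rounded value used in the notes

for k, obs in enumerate(observed):
    if k < 7:
        expected = n_squares * poisson.pmf(k, lam)
    else:
        # assumption: the last cell pools P(X >= 7), matching the 5.7 in the table
        expected = n_squares * poisson.sf(6, lam)
    print(k, obs, round(expected, 1))         # e.g. k = 2: 112 observed, 102.6 expected
```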
3. The normal distribution

The normal distribution is used to model continuous variables like height, weight, serum cholesterol level, etc. It is based on the Central Limit Theorem, which states that any variable that can be viewed as the sum of a large number of random increments will follow a Normal distribution. It is bell-shaped and symmetrical, and it is characterised by its mean (µ) and standard deviation (σ).

[Figure: density plot of Cmax_S, showing an approximately bell-shaped distribution]

A normal distribution with mean = 0 and standard deviation = 1 is called the Standard Normal distribution.

[Figure: probability density function y = normal(x, 0, 1)]

We know that 95% of the values of a variable that follows the N(0, 1) distribution lie between -1.96 and +1.96. We will be using these theoretical probability distributions to construct confidence intervals for population parameters. The sample mean satisfies

    \bar{x} \sim N(\mu, \sigma^2 / n) \quad \Rightarrow \quad \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1)

However, most of the time we do not know σ² and estimate it using s². In that case

    \frac{\bar{x} - \mu}{s / \sqrt{n}} \sim t_{n-1}

where t_{n-1} is the Student's t-distribution with n - 1 degrees of freedom.

STATISTICAL INFERENCE

In statistics we wish to make statements about the true values and relationships for a complete population. However, it is not practical to measure the entire population, so we take random samples that are representative of the population and use the information contained in these random samples to estimate the values or relationships for the population. If we repeat our sampling and draw another random sample from the population, we will come up with slightly different values of the measures and relationships that we are estimating, so our statistics are subject to uncertainty.

We can summarize the different values that our statistics can take on, and the frequency with which they are likely to take on these values, with a probability distribution. Some of the more common statistics have known probability distributions:

  Population parameter       Sample statistic    Probability distribution
  Mean, µ                    x̄                   Normal
  Proportion, π              p = n/N             Binomial
  Variance, σ²               s²                  Chi-square
  Ratio of two variances     s1²/s2²             F-distribution

Sample statistics are point estimates of the population parameters. Every time we draw a different random sample, we will get a slightly different point estimate. Hence we change the point estimate to an interval estimate that gives us two cut-off points between which we are 95% certain our true population parameter will fall. We call these interval estimates CONFIDENCE INTERVALS.

The general form of a confidence interval is

    (statistic - percentile × std.err,  statistic + percentile × std.err)

where
- percentile = the cut-off value that demarcates the required percentile of the probability distribution; e.g. for the normal distribution we know that 95% of the values lie between -1.96 and +1.96, so the percentile for a 95% confidence interval is 1.96;
- std.err = an estimate of the variability of the statistic, i.e. if you had to draw many random samples and calculate the sample statistic each time, how would these statistics vary among one another. Note that std.err = √(σ²/n), i.e. the standard deviation divided by the square root of the sample size.

NOTE the difference between a standard deviation and a standard error: a standard deviation tells you how variable the data values are, while a standard error measures the variability of the statistic (it is a function of the variability of the data and the sample size).

[Figure: density plot of Cmax_A1]

. ci Cmax_A1

    Variable |   Obs        Mean    Std. Err.    [95% Conf. Interval]
-------------+--------------------------------------------------------
     Cmax_A1 |   125    84.66538    2.616363     79.48686     89.84389

    ci = mean ± t_{n-1} × std.err = (84.665 - 1.979 × 2.616,  84.665 + 1.979 × 2.616)

where t_{n-1} refers to the Student t-distribution with n - 1 degrees of freedom.
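As an illustrative check (not part of the original Stata session), the interval above can be recomputed from the n, mean and standard error reported by `ci`:

```python
# Minimal sketch recomputing the 95% confidence interval from the `ci` output.
from scipy.stats import t

n, mean, std_err = 125, 84.66538, 2.616363
t_crit = t.ppf(0.975, df=n - 1)     # ~1.979, the 97.5th percentile of t with 124 d.f.
lower = mean - t_crit * std_err
upper = mean + t_crit * std_err
print(lower, upper)                 # ~ (79.487, 89.844), matching the Stata output
```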
Hypothesis testing (from Fisher & Van Belle, pg 106)

In estimation, we start with a sample statistic and make a statement about the population parameter: a confidence interval makes a probabilistic statement about straddling the population parameter. In hypothesis testing, we start by assuming a value for a parameter, and a probability statement is made about the value of the corresponding statistic. The basic strategy in hypothesis testing is to measure how far an observed statistic is from a hypothesized value of the parameter, or how "likely" an observed value for a statistic is given the hypothesized value of the parameter.

The p-value is the probability, under the null hypothesis, of observing a value as unlikely or more unlikely than the value of the test statistic.

[Figure: Student t density, y = student(x, 127), marking the hypothesized value, the observed value of the statistic, and the p-value area in the tail]

Hypothesis testing procedure:
• H0 = a null hypothesis that specifies a real value for a parameter (usually what you wish to reject, and a statement of equivalence)
• HA = an alternative hypothesis that specifies a range of values for the parameter, which will be considered when the null hypothesis is rejected
• A test statistic derived under the null hypothesis
• A p-value that states the probability of observing such a value for the test statistic given that the null hypothesis is true
• A decision to reject or not reject the null hypothesis

Decision rules and Type I and II errors

Consider H0: µ = 30, with test statistic

    t = \frac{\bar{x} - \mu}{s / \sqrt{n}}

. ttest weight=30

One-sample t test
------------------------------------------------------------------------------
Variable |   Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  weight |   128    29.71172    1.706777    19.30998    26.33432    33.08912
------------------------------------------------------------------------------
Degrees of freedom: 127

Ho: mean(weight) = 30

     Ha: mean < 30           Ha: mean != 30           Ha: mean > 30
       t = -0.1689             t = -0.1689              t = -0.1689
     P < t = 0.4331          P > |t| = 0.8661         P > t = 0.5669

[Figures: three Student t densities, y = student(x, 1), illustrating the lower-tail, two-sided and upper-tail alternatives]

Always use a 2-sided HA unless there are good a priori reasons to motivate a 1-sided HA.

Comparing two means

If the data follow a Normal distribution, use the t-test. The t-test has different forms depending on whether the variances in the two groups are equal or not. To compare the variances, use the F-test:

. sdtest Cmax_A1, by(trt)

Variance ratio test
------------------------------------------------------------------------------
   Group |   Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    trtA |    62    87.55352    3.776138    29.73334    80.00267    95.10437
    trtB |    63    81.82308    3.618681    28.72239    74.58944    89.05672
---------+--------------------------------------------------------------------
combined |   125    84.66538    2.616363    29.25182    79.48686    89.84389
------------------------------------------------------------------------------
Ho: sd(trtA) = sd(trtB)

F(61,62) observed   = F_obs         = 1.072
F(61,62) lower tail = F_L = 1/F_obs = 0.933
F(61,62) upper tail = F_U = F_obs   = 1.072

  Ha: sd(trtA) < sd(trtB)    Ha: sd(trtA) != sd(trtB)    Ha: sd(trtA) > sd(trtB)
      P < F_obs = 0.6067     P < F_L + P > F_U = 0.7871      P > F_obs = 0.3933

Here

    F_{n_1 - 1, n_2 - 1} = \frac{s_1^2}{s_2^2} = \frac{29.73334^2}{28.72239^2} = 1.072 < F_{61, 62}^{0.05} = 1.525531

so we do not reject the hypothesis of equal variances, and we use the equal-variances form of the t-test:

. ttest Cmax_A1, by(trt)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |   Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    trtA |    62    87.55352    3.776138    29.73334    80.00267    95.10437
    trtB |    63    81.82308    3.618681    28.72239    74.58944    89.05672
---------+--------------------------------------------------------------------
combined |   125    84.66538    2.616363    29.25182    79.48686    89.84389
---------+--------------------------------------------------------------------
    diff |          5.730441    5.228654               -4.619358    16.08024
------------------------------------------------------------------------------
Degrees of freedom: 123

Ho: mean(trtA) - mean(trtB) = diff = 0

     Ha: diff < 0            Ha: diff != 0            Ha: diff > 0
       t = 1.0960              t = 1.0960               t = 1.0960
     P < t = 0.8624          P > |t| = 0.2752         P > t = 0.1376

The test statistic is

    t_{n_1 + n_2 - 2} = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}

where s is the pooled estimate of the standard deviation.
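The same two-step procedure (variance-ratio F-test, then pooled t-test) can be sketched in Python/SciPy. The raw Cmax_A1 values are not reproduced in these notes, so the arrays below are simulated stand-ins with roughly the group sizes, means and SDs shown above; the output will therefore not match the Stata numbers exactly.

```python
# Minimal illustrative sketch of the F-test-then-t-test procedure. trt_a and
# trt_b are simulated stand-ins for the Cmax_A1 values in the two groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
trt_a = rng.normal(87.6, 29.7, size=62)
trt_b = rng.normal(81.8, 28.7, size=63)

# Step 1: variance-ratio F test, F = s1^2 / s2^2 on (n1 - 1, n2 - 1) d.f.
f_obs = np.var(trt_a, ddof=1) / np.var(trt_b, ddof=1)
df1, df2 = len(trt_a) - 1, len(trt_b) - 1
# one common two-sided convention: double the smaller tail probability
p_f = 2 * min(stats.f.cdf(f_obs, df1, df2), stats.f.sf(f_obs, df1, df2))

# Step 2: if the variances look equal, use the pooled (equal-variance) t-test;
# equal_var=False would give the Satterthwaite/Welch version instead
t_stat, p_t = stats.ttest_ind(trt_a, trt_b, equal_var=True)
print(f_obs, p_f, t_stat, p_t)
```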
If the variances were not equal, we would use the t-test for unequal variances:

. ttest Cmax_A1, by(trt) unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |   Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    trtA |    62    87.55352    3.776138    29.73334    80.00267    95.10437
    trtB |    63    81.82308    3.618681    28.72239    74.58944    89.05672
---------+--------------------------------------------------------------------
combined |   125    84.66538    2.616363    29.25182    79.48686    89.84389
---------+--------------------------------------------------------------------
    diff |          5.730441    5.230112               -4.622509    16.08339
------------------------------------------------------------------------------
Satterthwaite's degrees of freedom: 122.685

Ho: mean(trtA) - mean(trtB) = diff = 0

     Ha: diff < 0            Ha: diff != 0            Ha: diff > 0
       t = 1.0957              t = 1.0957               t = 1.0957
     P < t = 0.8623          P > |t| = 0.2754         P > t = 0.1377

The test statistic is

    t = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}

If the data do not follow a Normal distribution, use the non-parametric Mann-Whitney U-test (a sketch of this test is given at the end of this section):

. ranksum Cmax_A1, by(trt)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

         trt |   obs    rank sum    expected
-------------+-------------------------------
        trtA |    62        4115        3906
        trtB |    63        3760        3969
-------------+-------------------------------
    combined |   125        7875        7875

unadjusted variance      41013.00
adjustment for ties          0.00
                         --------
adjusted variance        41013.00

Ho: Cmax_A1(trt==trtA) = Cmax_A1(trt==trtB)
         z = 1.032
Prob > |z| = 0.3021

Confidence intervals and hypothesis testing

Recall that to construct a confidence interval for the mean we use

    (\bar{x} - t_{n-1}^{\alpha/2} \times std.err, \ \bar{x} + t_{n-1}^{\alpha/2} \times std.err)

and that to test whether the mean differs from zero we use the test statistic

    t = \frac{\bar{x} - 0}{s / \sqrt{n}} = \frac{\bar{x}}{std.err}

The two procedures are thus related, and in fact it can be shown that if the 100(1 - α)% confidence interval includes zero, then the hypothesis test will not be significant at the α level of significance (e.g. a 95% confidence interval corresponds to a test at the 5% level). On the other hand, if the confidence interval lies entirely to the right or left of zero, the test does give a significant result.
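This duality is easy to demonstrate numerically. The sketch below (illustrative Python on simulated data) computes both the two-sided one-sample t-test and the 95% confidence interval; the two decisions always agree.

```python
# Minimal illustrative sketch of the CI / hypothesis-test duality on
# simulated data: the 95% CI excludes zero exactly when p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.5, 2.0, size=30)            # simulated sample

t_stat, p_val = stats.ttest_1samp(x, popmean=0)

se = stats.sem(x)                            # s / sqrt(n)
t_crit = stats.t.ppf(0.975, df=len(x) - 1)
ci = (x.mean() - t_crit * se, x.mean() + t_crit * se)

# These two conditions always agree:
print(p_val < 0.05)
print(ci[0] > 0 or ci[1] < 0)                # CI lies wholly to one side of zero
```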
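Finally, the rank-sum comparison shown in the `ranksum` output above can be sketched the same way; again the arrays are simulated stand-ins for the two treatment groups, so the numbers will differ from the Stata output.

```python
# Minimal illustrative sketch of the Mann-Whitney / Wilcoxon rank-sum test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
trt_a = rng.normal(87.6, 29.7, size=62)      # simulated stand-ins
trt_b = rng.normal(81.8, 28.7, size=63)

# SciPy reports the Mann-Whitney U statistic and a two-sided p-value;
# Stata's ranksum reports the equivalent normal-approximation z instead.
u_stat, p_val = stats.mannwhitneyu(trt_a, trt_b, alternative="two-sided")
print(u_stat, p_val)
```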