DESCRIPTIVE STATISTICS
Lectures delivered by Prof. K. K. Achary
YENEPOYA RESEARCH CENTRE

Background assumed: data, data types, scales of measurement, data collection, organization and presentation (visualization or visual summary).

In this module we study the summarization of data through descriptive measures such as measures of central tendency and measures of variability (dispersion or spread), the contexts and data types in which each measure is used, and the computation and interpretation of the measures.

• A descriptive measure of a set of values (observations/data set) is a representative value (generally a single number) which summarizes the characteristic properties/features of the data. The common features observed are:
   – concentration around a central value
   – scatter (dispersion/spread) around a central value
   – asymmetry in the data distribution
   – peakedness of the data distribution
• The ‘central value’ and ‘spread’ are numerical summaries of the data. The center of a data set is commonly called the average. There are many ways to describe the average value of a distribution. The most appropriate measures of center and spread depend on the shape of the distribution.
• Once the shape, center and spread of the distribution are known, we can analyze the data for interesting features, including unusual data values, called outliers.
• Consider the following data sets.
   – Age at first child birth: 22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24
   – Blood cholesterol: 145, 145, 143, 189, 180, 196, 145, 198, 205, 209, 227, 221, 292, 302, 253, 233, 283
   – Day temperature: 27, 26, 27, 28, 29, 29, 29, 30, 30, 30, 29, 30, 31, 31, 31, 32, 31, 32, 32, 35
• What are your general comments?
• The descriptive measures can be computed for the whole mass of data related to a study (if available?), i.e. the population, or only for a representative part of the population, i.e. a sample (a random sample, to be specific).
• If we compute a measure from a random sample, it is termed a “statistic” (descriptive statistic).
• What is it called if we compute it for a population? (It is called a parameter.)

Measures of central tendency (averages)
• Some of the measures of central tendency are:
   – Mean (arithmetic mean, or average value as commonly used)
   – Median
   – Mode
• We also have positional averages such as quartiles, deciles and percentiles.

Arithmetic mean
• Definition: The arithmetic mean of a sample (or population) is the “average” of all observations in the sample/population.
• It is given by
   Arithmetic mean = (total of all values in the sample/population) / (number of values in the sample/population)
• Let there be n values in the sample, recorded as $x_1, x_2, x_3, \dots, x_n$ (you can also use $y_1, y_2, y_3, \dots, y_n$).
• Sum or total of the n observations: $x_1 + x_2 + x_3 + \dots + x_n = \sum x_i$
• Mean (arithmetic mean/average):
   \[ \bar{x} = \frac{\sum x_i}{n} \]
• This formula is used to compute the mean of ungrouped data. (What is ungrouped data? What is grouped data?)
• Computation steps:
   – Add all observations (sum up).
   – Find the number of observations in the sample data set (n).
   – Divide the total by n. This value is the mean.
• Example: Consider the age data 22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24. Here n = 15 and the sum (total) is 360. Hence mean = 360/15 = 24. The mean age at first child birth is 24 years.
• Compute the mean level of blood cholesterol.

Computation of mean in grouped data
• Grouped data may be in the form of a discrete frequency distribution or a continuous frequency distribution.
• Example of a discrete frequency table (distribution):

   Score level (xi)    No. of students (fi)
   0                   3
   1                   12
   2                   23
   3                   46
   4                   18
   5                   6

• The mean is computed in the same way as before.
• Total of all observations = 0×3 + 1×12 + 2×23 + 3×46 + 4×18 + 5×6 = 298
• No. of observations = 108
• Mean = 298/108 = 2.759259 ≈ 2.76
• For the above case, a computational formula can be developed as follows.
• Let the distinct values be $x_1, x_2, x_3, \dots, x_k$, where $x_i$ repeats $f_i$ times.
Then the total of all observations is $x_1 f_1 + x_2 f_2 + \dots + x_k f_k$. The total number of observations is $f_1 + f_2 + \dots + f_k = n$.
• Hence the formula for the mean is
   \[ \bar{x} = \frac{\sum x_i f_i}{n} \]

Computation of mean from a continuous frequency table (distribution)
• Recollect the method of constructing a frequency table when observations are recorded on a continuous variable. Let the midpoint of the ith class interval be denoted by $x_i$ and the corresponding class frequency by $f_i$. Then the mean can be computed using the above formula by taking the midpoint values as the $x_i$'s, with $f_1 + f_2 + \dots + f_k = n$, where k is the number of classes.

Properties of the mean
• For a given set of data the mean is unique, and it is computed using all the observations.
• Computation is easy.
• It can be computed for interval and ratio scale data.
• Extreme values influence the mean, and it may be completely distorted by such values.
• Ex: Consider the age-of-mothers data 22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24. The mean age is 24 years.
• Suppose one observation (say the last one) is 38 instead of 24.
• The revised total is 360 - 24 + 38 = 374.
• The mean is 374/15 ≈ 24.9.
• Suppose instead that 48 and 45 replace 21 and 23 in the original data. The total is then 360 - 21 - 23 + 48 + 45 = 409, and the mean age becomes 409/15 ≈ 27.3.

MEDIAN
• The median of a given data set is the value which divides the ordered set of values into two equal parts, such that the number of values greater than or equal to the median is equal to the number of values less than or equal to the median.
• In a sample data set having n values, the median is the ((n+1)/2)th value when the observations have been arranged in increasing order.
• Note that when n is odd the median is the middlemost value.
• If n = 11, the median is the sixth observation.
• If n = 12, the median is the value at position 6.5, i.e. the average of the 6th and 7th observations.
• Ex: age data 22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24
• Arrange in increasing order: 21, 22, 23, 23, 23, 24, 24, 24, 24, 24, 25, 25, 26, 26, 26
• n = 15, hence the position of the median is (15+1)/2 = 8, and the 8th observation is 24.
• The median age at first childbirth is 24 years.
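The computations worked out so far (mean of ungrouped data, mean from a discrete frequency table, and median of ungrouped data) can be sketched in Python as follows. This is a minimal illustration, not part of the lecture: the function names `mean`, `frequency_mean` and `median` are chosen here for convenience, and the data are the age and score examples above. The last lines repeat the "replace 24 by 38" experiment to show how the mean moves while the median of this data set does not.

```python
# Illustrative sketch; function names are assumptions, not from the lecture.

def mean(values):
    """Arithmetic mean: total of all values divided by the number of values."""
    return sum(values) / len(values)


def frequency_mean(xs, fs):
    """Mean from a frequency table: sum(x_i * f_i) / sum(f_i)."""
    return sum(x * f for x, f in zip(xs, fs)) / sum(fs)


def median(values):
    """Value at 1-based position (n + 1) / 2 of the ordered data.

    When n is even the position is fractional, so the two middle
    values are averaged.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24]
scores = [0, 1, 2, 3, 4, 5]
students = [3, 12, 23, 46, 18, 6]

mean_age = mean(ages)                           # 360 / 15
mean_score = frequency_mean(scores, students)   # 298 / 108
median_age = median(ages)                       # 8th ordered value

# Effect of an extreme value: the last observation (24) becomes 38.
ages_with_outlier = ages[:-1] + [38]
mean_with_outlier = mean(ages_with_outlier)     # 374 / 15
median_with_outlier = median(ages_with_outlier)

print(mean_age, round(mean_score, 2), median_age)
print(round(mean_with_outlier, 1), median_with_outlier)
```

For a continuous frequency table, the same `frequency_mean` call would apply with the class midpoints passed as `xs`, under the midpoint assumption described above.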
• Ex: 34, 45, 67, 87, 39, 93, 56, 48 (ages of persons)
• n = 8, and the median is the 4.5th value.
• Ordered sample: 34, 39, 45, 48, 56, 67, 87, 93
• The median is the average of the 4th and 5th values.
• Median = (48 + 56)/2 = 104/2 = 52 years
• Compute the mean age for the sample.
• The mean age is 469/8 ≈ 58.63 years.
• The mean is larger than the median here due to the influence of the large values.

Computation of median in grouped data
– For a discrete frequency table, use the definition and identify the value corresponding to the cumulative frequency (n+1)/2.
• Ex:

   Score level (xi)    (fi)    Cum. freq.
   0                   3       3
   1                   12      15
   2                   23      38
   3                   46      84
   4                   18      102
   5                   6       108

• n = 108 and (n+1)/2 = 54.5. Hence the median is the 54.5th value in the ordered sample. Observe that the 54th and 55th observations both have the value 3. Hence the median score is 3.
• Try more examples from books.

Computation of median from grouped data (continuous frequency table)
• First compute the true class limits (if required) and the cumulative frequency of each class.
• Identify the first class interval for which the cumulative frequency is greater than or equal to n/2. This class is the median class.
• Use the following formula to compute the median:
   \[ \text{Median} = L + \frac{i}{f}\left(\frac{n}{2} - c\right) \]
• Here the notations have the following meaning:
   – L = lower limit of the median class
   – i = width of the median class
   – f = frequency of the median class
   – n = total no. of observations (sample size)
   – c = cumulative frequency of the class preceding the median class
• Example:

   Class interval    Freq.    Cum. freq.
   5 – 10            3        3
   10 – 15           8        11
   15 – 20           16       27
   20 – 25           12       39
   25 – 30           9        48
   30 – 35           6        54
   35 – 40           6        60

• Here n = 60, so n/2 = 30.
• Hence the median class is 20 – 25, with L = 20, i = 5, f = 12 and c = 27.
• So median = 20 + (5/12)(30 - 27) = 21.25
• 50% of the observations fall below 21.25 and 50% of the observations are greater than or equal to the median value.
• HW: Compute the mean for the above data and compare it with the median. What are your observations?

Properties of median
• It is unique.
• It is computed from the ordered data.
• It is a positional average.
• It is not influenced by extreme values.
• If the extreme class intervals are open-ended, the median can still be computed.

MODE
• Mode is the most commonly occurring value in a data set.
• For ungrouped data, the mode can be obtained by finding the value which appears/repeats the maximum number of times.
• Ex: Consider the data set 3, 4, 3, 5, 5, 6, 4, 3, 5, 5, 7, 8, 8, 5, 8, 6, 9, 5. Here 5 repeats 6 times. Hence mode = 5.
• Note: The mode is not necessarily unique. In a data set more than one value might repeat the maximum number of times.
• Ex: Consider the data set 3, 4, 3, 5, 5, 6, 4, 3, 5, 5, 3, 8, 8, 5, 3, 6, 3, 5. Here 3 and 5 each repeat 6 times. Hence both 3 and 5 are modal values.
• Question: If all values in a data set repeat an equal number of times, what is the mode?

Grouped discrete frequency distribution
• In a discrete frequency table, the mode is identified as the value with the largest frequency.

   Score level (xi)    (fi)
   0                   3
   1                   12
   2                   23
   3                   46
   4                   18
   5                   6

• Mode = 3 (the score with the largest frequency, 46)

Computing mode in a continuous frequency distribution
• Mode is the value that has the highest frequency in a data set.
• For grouped data, the modal class is the class with the highest frequency.
• To find the mode for grouped data, use the following formula:
   \[ \text{Mode} = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) i \]
• Here
   – L = lower limit/boundary of the modal class
   – i = width of the modal class
   – $\Delta_1$ = difference between the frequency of the modal class and the frequency of the class before the modal class
   – $\Delta_2$ = difference between the frequency of the modal class and the frequency of the class after the modal class
• The above formula can be written in the following form:
   \[ \text{Mode} = L + \frac{(f_1 - f_0)\, i}{2 f_1 - f_0 - f_2} \]
   – L = lower limit/boundary of the modal class
   – i = width of the modal class
   – $f_0$ = frequency of the class preceding the modal class
   – $f_1$ = frequency of the modal class
   – $f_2$ = frequency of the class next to the modal class
• Note: compare with the previous formula, where $\Delta_1 = f_1 - f_0$ and $\Delta_2 = f_1 - f_2$.

Summary & comments
• Descriptive statistics are summary figures or measures that summarise a data set using a single number.
• The measures of central tendency provide a summary statistic giving an idea about the concentration of the data around a central value.
• Mean, median and mode are the three important measures of central tendency.
• Use the mean for data on an interval or ratio scale and the median for data on an ordinal scale. The mode is valid in all cases, but it is an ill-defined measure since it need not be unique.
• While the mean is used in parametric inference, the median is an important measure used in non-parametric inference.
• The median can be computed graphically from the cumulative frequency curve (ogive).
• Some remarks:
   – While computing the mean, median and mode from grouped frequency tables, we are not using the original sample values (this information is lost when we group the data).
   – To compute the mean we assume that the values in each class are concentrated at the midpoint. Hence we get an estimate of the mean.
   – To compute the median and mode, we use an interpolation technique to estimate the measures.
   – Hence, in grouped data we estimate the measures of central tendency.

Positional Averages
• Positional averages are measures computed from ordered data. They are defined with reference to the position of the measure. The important positional measures are:
• Quartiles: These are measures which divide the ordered data into four equal parts. The first quartile, $Q_1$, is the value which divides the ordered data set in such a way that 25% of the observations fall below it and the remaining observations lie above it. In ungrouped data, $Q_1$ is the value at position $(n+1)/4$.
• The second quartile is the median.
• The third quartile, denoted by $Q_3$, is the value which divides the ordered data set such that 75% of the observations fall below it and the remaining observations lie above it.
• Computation of quartiles:
-- Explained in the class with an example for ungrouped and grouped data.
-- Note down the formulae for computing quartiles from a continuous frequency distribution (as discussed in the class).
-- Solve more examples for practice.

Percentiles
• Self study
• Prepare a study report on percentiles: definition, computation, usage and interpretation.
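As a practice aid, the positional measures and the grouped-data formulas of this module can be sketched in Python. Several points here are assumptions rather than material from the notes: the grouped quartile formula is not written out above, so the sketch uses the usual interpolation form $Q_k = L + \frac{i}{f}\left(\frac{kn}{4} - c\right)$, which reduces to the median formula for k = 2; the ungrouped positions k(n+1)/4 for k = 2, 3 are taken by analogy with the stated position of $Q_1$; fractional positions are resolved by linear interpolation; and all function names are illustrative. Equal class widths are assumed throughout.

```python
# Illustrative sketch; function names and the handling of fractional
# positions are assumptions, not taken from the lecture.

def quartile_ungrouped(values, k):
    """k-th quartile (k = 1, 2, 3) at 1-based position k * (n + 1) / 4."""
    ordered = sorted(values)
    n = len(ordered)
    pos = k * (n + 1) / 4
    whole = int(pos)            # integer part of the position
    frac = pos - whole          # fractional part, 0 when pos is whole
    if whole < 1:
        return ordered[0]       # position falls before the first value
    if whole >= n:
        return ordered[-1]      # position falls at or after the last value
    below, above = ordered[whole - 1], ordered[whole]
    return below + frac * (above - below)


def quartile_grouped(lower_limits, width, freqs, k):
    """k-th quartile from a continuous frequency table with equal widths.

    Uses L + (i / f) * (k * n / 4 - c); k = 2 gives the median formula.
    """
    n = sum(freqs)
    target = k * n / 4
    cum = 0                     # cumulative frequency of preceding classes (c)
    for lower, f in zip(lower_limits, freqs):
        if cum + f >= target:   # first class whose cum. freq. reaches target
            return lower + width * (target - cum) / f
        cum += f
    raise ValueError("k must be 1, 2 or 3")


def mode_grouped(lower_limits, width, freqs):
    """Mode = L + (f1 - f0) * i / (2 * f1 - f0 - f2) for the modal class.

    Assumption: a missing neighbouring class is treated as frequency 0.
    The formula is undefined when 2 * f1 - f0 - f2 equals 0.
    """
    m = max(range(len(freqs)), key=freqs.__getitem__)  # first modal class
    f1 = freqs[m]
    f0 = freqs[m - 1] if m > 0 else 0
    f2 = freqs[m + 1] if m + 1 < len(freqs) else 0
    return lower_limits[m] + width * (f1 - f0) / (2 * f1 - f0 - f2)


ages = [22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24]
lowers = [5, 10, 15, 20, 25, 30, 35]     # lower class limits, width 5
freqs = [3, 8, 16, 12, 9, 6, 6]

q1_age = quartile_ungrouped(ages, 1)     # 4th ordered value
median_age = quartile_ungrouped(ages, 2) # 8th ordered value
q3_age = quartile_ungrouped(ages, 3)     # 12th ordered value
grouped_q1 = quartile_grouped(lowers, 5, freqs, 1)
grouped_median = quartile_grouped(lowers, 5, freqs, 2)
grouped_mode = mode_grouped(lowers, 5, freqs)

print(q1_age, median_age, q3_age)
print(grouped_q1, grouped_median, round(grouped_mode, 2))
```

With k = 2 the grouped function reproduces the worked median of 21.25 for the 5 – 40 table, which is a quick check that the interpolation is implemented as in the median formula. Statistical packages often use slightly different quantile conventions, so results for small samples may differ from library output.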