Chapter 4: Sampling

Learning Objectives

By the end of this chapter you should be able to:
• distinguish between random and non-random methods of sample selection
• describe the advantages of random sample selection
• identify different methods of sample selection
• match different methods of sample selection to research design
• describe the factors influencing sample size
• describe how to calculate the appropriate sample size

Introduction

Sampling is a crucial issue in health and social care research design. When undertaking a quantitative approach, researchers are striving to be able to generalise their findings to the wider world (achieve external validity). It is therefore essential that both the sampling method used and the sample size achieved are appropriate: firstly, if the results are to be representative, and secondly, if statistically significant associations or differences are to be identified. Representativeness matters because it is what allows the researcher to generalise the findings derived from a sample to the wider population. A representative sample is one that accurately reflects the population under investigation.

Definition: Internal validity relates to the validity of the study itself, including both the design and the instruments used.

Definition: External validity relates to the extent to which the findings of the study can be generalised to a wider population and be claimed to be representative.

In qualitative research, issues of sampling are different, although no less important. Here, the objective of research is more concerned with internal validity, that is, with providing an accurate representation of reality, than with the external validity that will enable generalisation to other settings. Hence, appropriate sampling to achieve representativeness is of importance.
In this chapter we will concentrate on the different sampling techniques that can be used in health and social care research to achieve the desired validity objectives. We will examine random and non-random methods of sample selection, which are typically used in quantitative and qualitative research respectively. In the second part of this chapter we will discuss issues and techniques of sample size.

Why do we need to select a sample anyway?

In some circumstances it is not necessary to select a sample. If the subjects of your study are very rare, for example children with a disease seen once in 100,000 children, then you might want to study every case you can find. More generally, however, you are likely to find yourself in a situation where the potential subjects of your study are more common and you cannot include everybody.

Definition: A census is an enumeration of a population, originally for purposes of taxation and military service. The word derives from the Latin censere, meaning ‘to assess’ or ‘to rate’.

If you were to include everybody who is eligible in your study, for example everybody in the UK who has been diagnosed as suffering from asthma, then this would be defined as a census. In many instances, a census is simply too large to handle. It would take too long and cost too much money. If a census is too large, it will be necessary to use a sample to carry out the research.

Before looking at different sampling techniques it is important to differentiate between two terms: population and sample. A population refers to the ‘units’ (e.g. people, schools, cities) that you wish to make a statement about, whereas the sample refers to the group of units selected from the population from which information will be collected. For example, imagine you wish to carry out a study investigating the height of seven-year-old boys in England.
The population would be all seven-year-old boys in England, whereas the sample would be a group of seven-year-old boys in Sheffield from whom you could gather data.

There are a number of different sampling techniques, but overall they can be split into two main groups: probability and non-probability sampling, or as they are more commonly referred to, random and non-random methods.

Probability Sampling

Probability samples are important because they allow the use of inferential statistics. Inferential statistics provide ways of judging whether it is possible to generalise the results from the sample to the population.

Definition: Inferential statistics are used to make generalisations from a sample to a population.

There are five different types of probability sampling, which are described in Fig 4.1 below:

Figure 4.1. Type of sampling and Selection Strategy (adapted from Henry 1990: 27)

Simple Random: Each member of the study population has an equal probability of being selected.
Systematic: Each member of the study population is either assembled or listed, a random start is designated, then members of the population are selected at equal intervals.
Stratified: Each member of the study population is assigned to a group or stratum, then a simple random sample is selected from each stratum.
Cluster: Each member of the study population is assigned to a group or cluster, then clusters are selected at random and all members of selected clusters are included in the sample.
Multistage: Clusters are selected as in a cluster sample, and then sample members are selected from the cluster members by simple random sampling. Clustering may be done at more than one stage.

Before we look at examples of these sample designs in more detail, it is important to clarify the difference between random sampling and randomisation.

Random Sampling versus Randomisation

For random sampling (probability sampling) to take place, you must first have defined your potential subjects or population.
The group from which you will extract your sample is also known as the ‘sampling frame’. So, for instance, you may be interested in doing a study of children aged between two and ten years diagnosed within the last month with ear infections, or you may be interested in studying adults (aged 16-65 years) diagnosed as having asthma, receiving drug treatment for asthma in the last six months, and living in a defined geographical region. Having defined your potential sampling frame, random sampling is a way of selecting a sample of people who will be representative of this population.

However, if the research design is based on an experimental design with two or more groups, such as an RCT, then the population frame may often be more tightly defined with strict eligibility criteria. Within an RCT, potential subjects are randomly allocated to either the intervention (treatment) group or the control group. Randomly allocating subjects to each of the groups evens out potential differences between the comparison groups. In this way confounding variables (i.e. variables you haven’t thought of or controlled for) will tend to be equally distributed between the groups and will be less likely to influence the outcome or dependent variables in either of the groups.

It is important to remember that randomisation is different from random sampling. As we described in chapter 2, randomisation within an experimental design is a way of ensuring control over confounding variables, and as such it allows the researcher to have greater confidence in identifying real associations between an independent variable (the cause) and a dependent variable (the effect or outcome measure).

The term ‘random’ may imply to you that it is possible to take some sort of haphazard or ad hoc approach, for example stopping the first 20 people you meet in the street for inclusion in your study. This is not random in the true sense of the word.
To be a ‘random’ sample, every individual in the population must have an equal probability of being selected. In order to carry out random sampling properly, strict procedures need to be adhered to. Random sampling techniques can be split into simple random sampling and systematic random sampling.

Simple Random Sampling

If selections are made purely by chance, this is known as simple random sampling. So, for instance, if we had a population containing 5,000 people, we could allocate every individual a different number. If we wanted to achieve a sample size of 200, we could do so by pulling 200 of the 5,000 numbers out of a hat. Another way of selecting the numbers would be to use a table of random numbers. These tables are usually to be found in the appendices of most statistical textbooks. Simple random sampling, although correct, is a very laborious way of carrying out sampling. A simpler and quicker way is to use systematic random sampling.

Systematic Random Sampling

Systematic random sampling is a more commonly employed method. After numbers are allocated to everybody in the population frame, the first individual is picked using a random number table and then subsequent subjects are selected using a fixed sampling interval, i.e. every nth person.

Assume, for example, that we wanted to carry out a survey of patients with asthma attending clinics in one city. There may be too many to interview everyone, so we want to select a representative sample. If there are 3,000 people attending the clinics in total and we only require a sample of 200, we would need to:
• calculate the sampling interval by dividing 3,000 by 200 to give a sampling interval of 15 (a sampling fraction of 1 in 15)
• select a random number between one and 15 using a set of random tables
• if this number were 13, select the individual allocated number 13 and then go on to select every 15th person
• this will give us a total sample size of 200 as required.
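The two procedures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the population is represented simply by the numbers allocated to individuals, and Python's pseudo-random generator stands in for the hat or the table of random numbers.

```python
import random


def simple_random_sample(population_size, sample_size, rng=None):
    """Draw distinct ID numbers from 1..population_size purely by chance."""
    rng = rng or random.Random()
    # Sampling without replacement: every individual is equally likely to be drawn.
    return rng.sample(range(1, population_size + 1), sample_size)


def systematic_sample(population_size, sample_size, rng=None, start=None):
    """Pick a random start within the first interval, then every nth ID."""
    rng = rng or random.Random()
    interval = population_size // sample_size  # 3,000 / 200 = 15
    if start is None:
        start = rng.randint(1, interval)  # random number between 1 and 15
    selected = list(range(start, population_size + 1, interval))
    # If the population is not an exact multiple of the interval,
    # trim any surplus so the sample has the required size.
    return selected[:sample_size]


# 200 numbers "out of a hat" from a population of 5,000.
hat_draw = simple_random_sample(5000, 200)

# The clinic example: 3,000 patients, sample of 200, random start of 13.
clinic_sample = systematic_sample(3000, 200, start=13)
print(clinic_sample[:3], clinic_sample[-1], len(clinic_sample))  # [13, 28, 43] 2998 200
```

Passing a seeded generator, for example random.Random(1), makes a draw repeatable, which can be useful when the selection procedure has to be documented.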
Care needs to be taken when using a systematic random sampling method in case there is some bias in the way that lists of individuals are compiled. For example, if all the husbands’ names precede wives’ names and the sampling interval is an even number, then we may end up selecting all women and no men.

Stratified Random Sampling

Stratified sampling is a way of ensuring that particular strata or categories of individuals are represented in the sampling process. For example, if we know that approximately 4% of our population frame is made up of a particular ethnic minority group, then there is a chance that with simple or systematic random sampling we could end up with no ethnic minorities in our sample. If we wanted to ensure that our sample was representative of the population frame, then we would employ a stratified sampling method.
• First we would split the population into the different strata, in this case separating out those individuals with the relevant ethnic background.
• We would then use random sampling techniques to sample each of the two ethnic groups separately, using the same sampling interval in each group.
• This would ensure that the final sample was representative of the minority group we wanted to include, on a pro-rata basis to the actual population.

If, however, we actually want to be able to compare the results of our minority group with the larger group, then we would have difficulty in doing so using this form of proportionate stratified sampling, because the numbers achieved in the minority group, although pro-rata to those of the population, would not be large enough to demonstrate statistical differences.

If we really want to be able to compare the survey results of the minority individuals with those of the larger group, then it is necessary to use a disproportionate sampling method. With disproportionate sampling, the strata are not selected pro-rata to their size in the wider population.
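The contrast between proportionate and disproportionate allocation can be made concrete with a short Python sketch. The figures are assumptions chosen for illustration: a population frame of 5,000 of whom 4% (200 people) belong to the minority group, a total sample of 200, and, for the disproportionate design, a hypothetical equal split between the two strata. The function names are likewise hypothetical.

```python
def proportionate_allocation(strata_sizes, total_sample):
    """Share the total sample across strata pro-rata to stratum size."""
    population = sum(strata_sizes.values())
    # Rounding can leave the total one or two out for awkward numbers;
    # the illustrative figures here divide exactly.
    return {name: round(total_sample * size / population)
            for name, size in strata_sizes.items()}


def sampling_fractions(strata_sizes, allocation):
    """Proportion of each stratum that ends up in the sample."""
    return {name: allocation[name] / strata_sizes[name] for name in strata_sizes}


# Assumed population frame: 5,000 people, 4% in the minority group.
strata = {"minority": 200, "majority": 4800}

proportionate = proportionate_allocation(strata, 200)
disproportionate = {"minority": 100, "majority": 100}  # hypothetical equal split

print(proportionate)                                 # {'minority': 8, 'majority': 192}
print(sampling_fractions(strata, proportionate))     # same fraction in both strata
print(sampling_fractions(strata, disproportionate))  # minority sampled far more heavily
```

Under the proportionate design only eight minority respondents are selected, which shows why comparisons between the two groups are difficult. Under the equal split the minority stratum is sampled at a much higher rate than the majority stratum.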
For instance, if we are interested in comparing the views and behaviour of particular minority groups with other larger groups, then it is necessary to over-sample the smaller categories in order to achieve statistical power, i.e. in order to be able to demonstrate statistically significant differences between groups. If during data analysis we wish to treat the total sample as a whole, representative of the wider population, then it will become necessary to re-weight the categories back into the proportions in which they are represented in reality.

For example, suppose we wanted to compare the views and satisfaction levels of women who gave birth in a birth pool with those of women who gave birth ‘normally’ in a bed. A systematic random sample, although representative of all women giving birth, would not produce a sufficient number of women giving birth in water to be able to compare the results, unless the total sample was so big that it would take many years to collate. We would also end up interviewing more women who have given birth ‘normally’ than we needed. In this case it would be necessary to over-sample or over-represent those women giving birth in water to have enough individuals in each group to compare them. We would therefore use disproportionate stratified random sampling to select the sample in this instance.

The important thing to note here is that random sampling is still taking place within each stratum or category. So we would use systematic random selection to select a sample of women giving birth in water and the same process to select women giving birth ‘normally’.

Cluster Sampling

Cluster sampling is a method frequently employed in national surveys where it is uneconomic to carry out interviews with individuals scattered across the country. Cluster sampling allows individuals to be selected in geographic batches.
So, for instance, before selecting at random, the researcher may decide to focus on certain towns, electoral wards, hospitals or general practices. An example of a piece of research that used a cluster sampling method can be found in:

Kennedy et al (2003) A cluster-randomised controlled trial of a patient-centred guidebook for patients with ulcerative colitis: effect on knowledge, anxiety and quality of life. Health and Social Care in the Community, 11: 64

You can find this article on-line at: http://www.shef.ac.uk/library/elecjnls/

The aim of the study was to evaluate the impact of an evidence-based guidebook on anxiety, knowledge and quality of life in patients with ulcerative colitis.

Stage 1: A random sample of six hospitals was chosen from the 20 district hospitals in the North West of England.
Stage 2: Hospitals were randomised to either receiving the guidebook (intervention group) or not receiving the guidebook (control group).

Multi-stage sampling

Multi-stage sampling allows the individuals within the selected cluster units to then be selected at random. Obviously care must be taken to ensure that the cluster units selected are generally representative of the population and are not strongly biased in any way. An example of a piece of research that used a multistage sampling method is:

Hughes et al (1997) Young people, alcohol and designer drinks: quantitative and qualitative study. BMJ, 314: 414

You can find this article on-line via the University of Sheffield library, at: http://www.shef.ac.uk/library/elecjnls/brbz.html

Aim of the study: to examine the appeal of designer alcoholic drinks to young people aged 12-17 years.

Stage 1: A list of all postcode sectors in the Argyll and Clyde Health Board (excluding those rural parts of the area i.e.
islands and those with fewer than 500 households)
Stage 2: A random sample of 30 postcode sectors was chosen
Stage 3: From each of the 30 postcode sectors, 40 people aged 12-17 years were identified using a random procedure which stratified for age and sex.

Non-Random Sampling

Non-random or non-probability sampling refers to sampling methods that do not adhere to the principles of probability sampling, i.e. not everyone in the population has an equal chance of being selected. For this reason, non-random sampling is not used very often in quantitative health and social care research.

Since the objective of qualitative research is to understand and give meaning to a social process, rather than quantify and generalise to a wider population, it is inappropriate to use random sampling and apply statistical tests. The issues for qualitative studies are therefore somewhat different, and the approach to sampling is distinctive. Sample sizes used in qualitative research are usually very small, and the application of statistical tests would be neither appropriate nor feasible.

For example, a study may wish to consider attitudes towards compliance with asthma treatment. A small sample of local asthmatic teenagers may be interviewed. The sample is thus chosen on the basis of ‘theory’: this kind of theoretical sampling aims to maximise the range of responses, but does not strictly seek to ‘represent’ all the people in the community and other stakeholders.

Non-random sampling techniques are also important to consider as they are being used increasingly in market research and commissioned studies such as political opinion polling. There are three main types of non-random sampling: quota sampling, convenience sampling and snowball sampling.
The technique most commonly used is known as quota sampling.

Quota Sampling

Quota sampling is a technique whereby the researcher decides in advance on certain key characteristics that they will use to stratify the sample. Interviewers are often set sample quotas in terms of age and sex. So, for example, with a sample of 200 people, they may decide that 50% should be male and 50% should be female, and that 40% should be aged 40 years or over and 60% aged 39 years or less. The difference from a stratified sample is that the respondents in a quota sample are not randomly selected within the strata. The respondents may be selected just because they are accessible to the interviewer. Because random sampling is not employed, it is not possible to apply inferential statistics and generalise the findings to a wider population.

Convenience or Opportunistic Sampling

Selecting respondents purely because they are easily accessible is known as convenience sampling. Although quantitative researchers generally frown upon this technique, it is an acceptable approach when using a qualitative design, since generalisability is not a main aim of qualitative approaches. In fact qualitative data are often collected using a convenience or opportunistic sampling approach, for instance where the researcher selects volunteers amongst his or her work colleagues.

However, as this is rather a haphazard method, many qualitative researchers employ a purposive sample to identify specific groups of people who exhibit the characteristics of the social process or phenomenon under study. For example, a researcher may be seeking to interview people who have recently been bereaved, or are being rehabilitated into intermediate care; another researcher may be seeking people who have experienced long-term unemployment and suffer from chronic asthma. Sometimes researchers try to find people who typify the characteristics they are looking for. This is known as an ‘ideal type’.
Occasionally it is useful to deliberately include people who exhibit the required characteristics in the extreme. Close examination of extreme cases can sometimes be very illuminating when trying to formulate a theory. However, caution must be adopted with this approach, to ensure these extreme cases are not used to generalise.

Snowball or Networking Sampling

Snowballing occurs when one respondent supplies you with the names of other individuals in a like position, who may also be interested in talking to you. It is another sampling technique commonly used in qualitative research. Snowballing is particularly useful when trying to reach individuals with rare or socially undesirable characteristics.

Let’s look at an example. Imagine you wish to explore the safe sex practices of female sex workers working in Sheffield. Once a female sex worker had participated in the study, she could then approach her colleagues to see if they would also be interested in participating in the research. A snowballing sampling method may be particularly useful in this situation given the sensitive nature of the research study.

Theoretical Sampling

If a researcher wishes to develop a social theory, a ‘theoretical’ sampling technique may be used. The idea is that the researcher selects the subjects, then collates and analyses the data to produce an initial theory that is then used to guide further sampling and data collection, from which further theory is developed. Since theoretical sampling is one of the main methods used in qualitative research, this raises issues about the degree to which findings can be generalised or achieve transferability. For more information on this topic and qualitative research in general see Chapter 7.

We have introduced you to random and non-random methods of sampling. As mentioned above, quantitative and qualitative research typically use random and non-random sampling techniques respectively to obtain a sample.
The main differences between quantitative and qualitative sampling techniques are shown in Figure 4.2.

Figure 4.2 Sampling in quantitative and qualitative research

Quantitative Research: Sample big enough for statistical inference. Selected to be representative.
Qualitative Research: Often very small, occasionally a single case. Rarely attempts to be representative: sample chosen to maximise range of responses.

Please now complete the following SAQ. Read the descriptions and decide what type of sample selection has taken place, and write in the boxes below.

Self-Assessment Questions 4.1 Sampling methods

1. A sample of school children, some with asthma and some without, is selected from GP records. The children are selected randomly within each of the two groups, and the number of children in each group is representative of the total patient population for this age group.
2. A sample of children is selected from social workers’ records. The children are selected so that 50% have been placed into foster care and 50% are waiting for foster parents. The children are randomly selected within each group.
3. A survey is carried out to examine the attitudes of mothers with children under one year. The sample is selected by interviewers stopping likely-looking women pushing prams in the street. The number of respondents who fall into different age bands and social classes is strictly collated.
4. A sample of drug users is gathered by advertising in the local newspaper for potential respondents.
5. All male adults whose National Insurance numbers end in ‘5’ are selected for a survey.
6. All patients attending wards 3, 5 and 10 in a hospital are selected for a study.

Now that you have been introduced to sampling techniques used in qualitative and quantitative research, we would like you to read more on these sampling methods. Please read chapter four, ‘Sampling’, of Alan Bryman’s book, which is provided for you in the supplementary reading.
When you have finished, please complete the following SAQ. You may wish to take notes as you read.

SAQ 4.2 Sampling

Having read the chapter from Bryman, answer the following:
1. What are the disadvantages often associated with using quota sampling?
2. For which research methodology are you most likely to use a snowball sampling technique and why?
3. What are the main advantages of using a stratified sampling method?
4. What is meant by non-response? In what way could this affect obtaining a representative sample?

Answers to SAQ 4.1

1. Stratified random sample. The sample is stratified because it has been selected to ensure that two different groups are represented.
2. Disproportionate stratified random sample. This sample is stratified to ensure that patients from the two different groups are picked up; however, the two groups are selected so that they are equal in size and are not representative of the patient base.
3. Quota. The sample is not randomly selected but the respondents are selected to meet certain criteria.
4. Convenience. The sample is not randomly selected and no quotas are applied.
5. Systematic random sample.
6. Cluster sample, because the patients are selected only from certain wards.

So far in this chapter we have covered the different sampling techniques that are used in health and social care research and how random and non-random methods are typically applied in quantitative and qualitative research respectively. However, as well as correctly using the most appropriate sampling technique, it is also important to consider another aspect of sampling: sample size, which we will go on to look at in a moment. First, please complete the following for your log book.

Reflective Exercise 4.1 Sampling

We would like you to look at the practical issues surrounding sampling techniques in research.
Please read the journal article:

Helder D et al (2002) Living with Huntington's disease: Illness perceptions, coping mechanisms, and patients' well-being. British Journal of Health Psychology, 7 (4): 449-462.

(Tip: read the methods and the discussion.)

1. How was the sample selected for this survey?
2. Did the researchers use random or non-random sampling methods?
3. Was the sample representative?
4. How might the sampling be improved?

Answers to SAQ 4.2

1. Quota samples may be non-representative because of possible interviewer bias in selecting respondents. Respondents may be untypical, as only those who are available and in the interviewer’s vicinity are typically selected. The interviewer may also misjudge people, for example assuming that a person is younger than required for the study and therefore not approaching them. Reliance on social class as a quota characteristic often causes problems, because it is difficult to be sure that respondents have been placed in the correct social class groups.
2. Qualitative, because generalisability, which is an essential requirement within quantitative research, isn’t possible with this non-probability sampling technique.
3. The main advantage of using a stratified sample is that the sample should be distributed in the same way as the population, which may not be achieved with a simple random sample alone.
4. Non-response arises in social survey research when people selected for the sample do not take part, for example by not returning self-completed questionnaires. Non-response can affect obtaining a representative sample, as those who have not participated may differ from those who have (e.g. in terms of age, social class, or reading and writing abilities), which may make the sample left in the study unrepresentative of the population.

Please take a break before moving on to the second part of this chapter.
Part 2: Sample size and the ‘power’ of research

In the previous section, we looked at methods of sampling. Now we want to turn to another aspect of sampling: how big a sample needs to be in quantitative research to enable a study to have sufficient ‘power’ to do the job of testing a hypothesis. At first glance, many pieces of research seem to choose a sample size merely on the basis of what ‘looks’ about right, or perhaps simply for reasons of convenience: ten seems a bit small, and a hundred would be difficult to obtain, so 40 is a happy compromise! Unfortunately, a lot of published research uses precisely this kind of logic. In the following section, we want to show you why using such reasoning could make your research worthless. Choosing the correct size of sample is not a matter of preference. It is a crucial element of the research process, without which you may well spend months trying to investigate a problem with a tool that is either completely useless or over-expensive in terms of time and other resources.

The truth is out there: hypotheses and samples

Most (but not all) quantitative studies aim to test a hypothesis, and we looked at the logic of hypothesis testing in Chapter 1. A hypothesis is a kind of ‘truth claim’ about some aspect of the world: the properties of a cell, the behaviour of a social group, or whatever. Research sets out to try to prove this truth claim right (or more properly, to disprove the null hypothesis, a truth claim phrased as a negative). For example, let us think about the following hypothesis:

Levels of childhood asthma are affected by traffic pollution

and the related null hypothesis:

Levels of childhood asthma are not affected by traffic pollution

Definition: An association between two variables represents some sort of relationship. An association can be a causal one or it might be spurious. Associations can be positive or negative.
Let us imagine that we have this as our research hypothesis, and we are planning research to test it. We will undertake a trial, comparing groups of children living in different traffic environments, to assess the extent of asthma in these different groups. Now, if you recall the chapter on validity, external validity is the extent to which the findings of a study can be generalised. Obviously the findings of a study, while interesting in themselves, only have value if they can be generalised, to discover something about the topic that can be applied. Were we to find an association, then we should want to do something to reduce exposure to traffic pollution. So our study has to have the capacity to be generalised beyond the few children actually in the study.

The generalisability of a study is assessed using statistical tests of inference. You may be familiar with some such tests: chi-squared, the t-test, and tests of correlation. We will not look at these tests in any detail in this course, but we need to understand that the purpose of these and other tests of statistical inference is to assess the extent to which the findings of a study can be accepted as valid for the population from which the study sample has been drawn. If the statistics we use suggest that the findings are ‘true’, then we can conclude (within certain limits of probability) that the study’s findings can be generalised, and we can act on them (to reduce exposure to traffic by school children, for instance). See Chapter 6 for further details of these tests.

From common sense, we see that the larger the sample is, the easier it is to be satisfied that it is representative of the population from which it is drawn: but how large does it need to be?
This is the question that we need to answer, and to do so, we need to think a little more about the possibility that our findings may not reflect reality: that we have committed an error in our conclusions.

Type 1 and Type 2 errors

What any researcher wants is to be right! She wants to discover that there is an association between two variables, say asthma and traffic pollution, if such an association really exists. Or, if there is no such association, she wants her study to support the null hypothesis that the two are not related. (While the former may be more exciting, both are important findings, and both accurately reflect the reality that our studies try to tap into.)

What no researcher wants is to be wrong! No-one wants to find an association which does not really exist, or, just as importantly, NOT find an association which DOES exist. Both such situations can arise in any piece of research. The first (finding an association which is not really there) is called a Type 1 error (sometimes written as Type I). It is the error of falsely rejecting a true null hypothesis. (Think this through carefully: an example might be a study that rejects the null hypothesis that there is no association between asthma and pollution. The findings suggest such an association, but in reality, no such relationship exists.)

The second kind of error, called a Type 2 error (sometimes written as Type II), occurs when a study fails to find an association that really does exist. It is then a matter of wrongly accepting a false null hypothesis. (Using the traffic and asthma example again, we conduct a study and find no association, missing one that really does exist.)

Both types of error are serious. Both have consequences: imagine the money that might be spent on reducing traffic pollution when all the time it does not really affect asthma (result of a Type 1 error). Or imagine allowing traffic pollution to continue while it really is affecting children’s health (result of a Type 2 error).
Good research will minimise the chances of committing both Type 1 and Type 2 errors as far as possible, although they can never be ruled out absolutely.

Statistical Significance and Statistical Power

Figure 4.1 shows the four possible outcomes of a piece of research diagrammatically. Each cell in the figure represents a possible relationship between the findings of the study and the 'real-life' situation in the population under investigation. (Of course, we cannot actually know the latter unless we survey the whole population: that is the reason we conduct studies that can be generalised through statistical inference.) Cells 1 and 4 represent desirable outcomes, while cells 2 and 3 represent potential outcomes of a study that are undesirable and need to be minimised. These possible outcomes are related to two concepts: statistical significance and statistical power. The former is well known to most researchers who use statistics; the latter is less well understood.

Figure 4.1: The Null Hypothesis (Ho), Statistical Significance and Statistical Power

                              POPULATION: Null Hypothesis is
STUDY: Null Hypothesis is     False                       True
False                         ❶ Correct Result            ❷ Type I Error (alpha)
True                          ❸ Type II Error (beta)      ❹ Correct Result

Cell 1. The null hypothesis has been rejected on the results of the study (that is, there is support for a hypothesis which suggests some difference between groups or association between variables). This is also the situation in the population. Thus, we can be satisfied that the study is reflecting the world outside the limits of the study and it is to be accepted as a 'correct' result.

Cell 4. The results from the study support the null hypothesis. This is the situation which pertains in the population, so we can be satisfied that our study reflects the circumstances in the population. Once again, this is a 'correct' result.

Cell 2.
In this cell, as in cell 1, the study results lead us to reject the null hypothesis, indicating some kind of difference or association between variables. However, in the world beyond the study, the null hypothesis is actually true and there is no effect. This is the Type I error: the error of wrongly rejecting a true null hypothesis. The likelihood of committing a Type I error is known as the alpha value or the statistical significance of the test. Some of you may be familiar with alpha as the level of significance against which the quoted p value of a test is judged. Alpha is the probability of rejecting the null hypothesis when it is in fact true; thus a significance level of 0.05 indicates a five per cent (or one in 20) chance of committing a Type I error when there is really no effect. Cell 2 thus reflects an incorrect finding from a study, and the alpha value represents the likelihood of this occurring.

Cell 3. This cell similarly reflects an undesirable outcome of a study. Here, as in Cell 4, a study supports the null hypothesis, implying that there is no difference or association in the population under investigation. But in reality, the null hypothesis is false and there is some kind of difference or association that the study is missing. This mistake is the Type II error of wrongly accepting a false null hypothesis. The likelihood of committing a Type II error is the beta value of a statistical test, and the value (1 - beta) is the statistical power of the test. Thus, the statistical power of a test is the likelihood of avoiding a Type II error. Conventionally, a value of 0.80 or 80% is the target value for statistical power, meaning that four times out of five a false null hypothesis will be rejected. Outcomes of studies that fall into cell 3 are incorrect; beta measures the likelihood of such an outcome, and its complement (1 - beta) is the power of the study.

All research should seek to avoid both Type I and Type II errors, which lead to incorrect inferences about the world beyond the study. In practice, there is a trade-off.
Reducing the likelihood of committing a Type I error by demanding a more stringent level of significance before one is willing to accept a positive finding (for example, 0.01 rather than 0.05) reduces the statistical power of the test, increasing the possibility of a Type II error, and vice versa. However, both statistical significance and statistical power are affected by sample size. Most researchers who use statistics will be aware that the chances of gaining a statistically significant result are increased by enlarging a study's sample. Similarly, the statistical power of a study is enhanced as sample size increases. Let us look at each of these aspects of quantitative research in turn.

The Statistical Significance of a Study

When a researcher uses a statistical test of inference, what she is doing is testing her results against a gold standard. If the test gives a positive result (this is usually known as ‘achieving statistical significance’), then she can be relatively satisfied that her results are ‘true’, and that the real world situation is that discovered in the study (Cell 1 in Figure 4.1). If the test does not give significant results (non-significant or NS), then she can be reasonably satisfied that the results reflect Cell 4, where she has found no association and no such association exists. However, we can never be absolutely certain that we have a result that falls in Cells 1 or 4.

Statistical significance represents the likelihood of committing a Type 1 error (Cell 2). Let us imagine that we have results suggesting an association between traffic pollution and asthma, and a t-test (a test to compare the results of two different groups) gives a value which indicates that, at the 5% or 0.05 level of statistical significance, the level of asthma is higher in children exposed to traffic fumes. What this means is that, if there were really no association, a difference at least this large would arise by chance no more than five per cent of the time, so we can be reasonably (though never absolutely) confident that this result reflects a true effect (Cell 1).
Put another way, if there were really no association, about five per cent of studies like ours would still produce such a result by chance, arising from random associations in the sample we chose. If the t-test value is higher, we might reach 1% or 0.01 significance. Then only one per cent of the time would chance alone produce such a result.

Tests of statistical significance are designed to account for sample size; thus the larger the sample, the ‘easier’ it is for results to reach significance. A study that compares two groups of 10 children will have to demonstrate a much greater difference between the groups than a study with 1000 children in each group. This is fair: the larger study is much more likely to be ‘representative’ of a population than the smaller one.

To summarise: statistical significance is a measure of how unlikely it is that positive results have arisen by chance alone, and hence of how far the findings can be used to make conclusions about differences which really exist.

The Statistical Power of a Study

As we have just seen, statistical tests build in a safety margin to avoid generalising false positive results, possibly with disastrous or expensive consequences. Researchers who use small samples thus run the risk of not being able to demonstrate differences or associations that really do exist. Thus they are in danger of committing a Type 2 error (Cell 3 in Figure 4.1), of accepting a false null hypothesis. Such studies are ‘underpowered’: they do not possess sufficient statistical power to detect the effects they set out to detect. Conventionally, the target is a power of 80% or 0.8, meaning that a study has an 80 per cent likelihood of detecting a difference or association that exists.

Examination of research undertaken in various fields of study suggests that many studies do not meet this 0.8 conventional target for power (Fox and Mathers 1997).
What this means is that many studies have a much reduced likelihood of being able to discern the effects which they set out to seek: a study with a power of 0.66 will only detect an effect two times out of three, while studies with a power of 0.5 or less will detect effects no more often than would be achieved by tossing a coin. A non-significant finding of a study may thus simply reflect the inadequate power of the study to detect differences or associations at levels that are conventionally accepted as statistically significant. In such situations one must ask the simple question of such research: ‘Why did you bother, when your study had little chance of finding what you set out to find?’

A hypothetical example will illustrate the importance of being able to detect a false null hypothesis. Imagine that researchers wish to compare two inhalers for their efficacy in delivering drugs for countering asthma attacks. Inhaler A is twice as expensive as inhaler B. In the study, inhaler A demonstrates a slightly higher level of efficacy, but because the study is small (only 50 people in each group), this difference does not reach significance, and the researchers conclude there is no demonstrable difference. They report this finding, and as a consequence, the cheaper but less effective inhaler comes to be the one recommended to doctors to prescribe. If this study is under-powered, this decision is based on the acceptance of a false null hypothesis (Cell 3: false negative), and the consequences for children using a less effective inhaler could be fatal.

Statistical power calculations can be undertaken after a study has been completed, to assess the likelihood that the study could have discovered effects. More importantly, such calculations need to be undertaken prior to a study, to avoid the wasteful consequences both of under-powering and of over-powering (in which sample sizes are excessively large, leading to very high power at the expense of higher than necessary study costs).
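A rough calculation shows how little power a two-group study of this size may have. The sketch below is a minimal illustration under stated assumptions rather than a definitive method: it uses a normal approximation to a two-sided comparison of two group means, the function name `approximate_power` is hypothetical, and the standardised effect sizes 0.2, 0.5 and 0.8 are Cohen's conventional benchmarks for small, medium and large effects, not estimates for the inhaler study.

```python
from statistics import NormalDist


def approximate_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    effect_size is the difference between the group means divided by the
    common standard deviation (a standardised difference). A normal
    approximation is assumed, so results are indicative only.
    """
    standard_normal = NormalDist()
    critical = standard_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect really exists.
    expected_z = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the statistic falls beyond either critical value.
    upper_tail = 1 - standard_normal.cdf(critical - expected_z)
    lower_tail = standard_normal.cdf(-critical - expected_z)
    return upper_tail + lower_tail


for label, effect_size in (("small", 0.2), ("medium", 0.5), ("large", 0.8)):
    power = approximate_power(effect_size, n_per_group=50)
    print(f"{label:>6} effect ({effect_size}): power is about {power:.2f}")
```

With 50 people per group, this approximation suggests that a small effect would be missed far more often than it is found, which is the situation the inhaler example warns against.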
Power is a function of three variables: sample size, the chosen level of statistical significance (alpha) and effect size. While calculation of power entails recourse to tables of values for these variables, the calculation is relatively straightforward in most cases.

Effect size and sample size

However, as was mentioned earlier, there is a trade-off between significance and power, because as one tries to reduce the chances of generating false positive results, the likelihood of a false negative result increases. Researchers need to decide which is more crucial, and set the significance level accordingly. In the example of the two inhalers, it might be concluded that wrongly accepting the null hypothesis of no difference between A and B was the greater danger clinically, and so a less stringent level of significance (0.05 rather than 0.01) might be used, to increase the power of the study to discern an effect.

Fortunately, statistical significance and power are both improved by increasing sample size, so a larger sample will reduce the risk of both Type 1 and Type 2 errors. However, that does not mean that researchers necessarily need to vastly increase the size of their samples, at great expense of time and resources. The other factor affecting the power of a study is the effect size (ES) that is under investigation in the study. This is a measure of how ‘wrong’ the null hypothesis is. For example, in the study of inhalers, the ES is the difference in efficacy in treating the symptoms of an asthma attack. An effect size may be a difference between groups or the strength of an association between variables such as asthma and traffic pollution. If an ES is small, then many studies with small sample sizes are likely to be under-powered. But if an ES is large, then a relatively small-scale study could have sufficient power to identify the effect under investigation.
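The way sample size depends on effect size, significance level and power can be sketched with a standard textbook approximation for comparing two group means: n per group is roughly 2 × ((z for alpha + z for power) / ES)², where ES is a standardised difference. The code below is a hedged illustration, not the method behind any published table: the function name `n_per_group` is hypothetical, a two-sided test and a normal approximation are assumed, and the effect sizes 0.2 and 0.5 are Cohen's conventional ‘small’ and ‘medium’ values.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided comparison of two means.

    Uses n = 2 * ((z_alpha + z_power) / effect_size) ** 2, a normal
    approximation; exact t-based calculations give slightly larger numbers.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # guards against Type 1 error
    z_power = standard_normal.inv_cdf(power)          # guards against Type 2 error
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for effect_size in (0.2, 0.5):
    for alpha in (0.05, 0.01):
        needed = n_per_group(effect_size, alpha=alpha)
        print(f"ES = {effect_size}, alpha = {alpha}: about {needed} per group")
```

Two features of the output are worth noting. Halving the effect size roughly quadruples the sample needed, which is why small effects demand large studies, and moving from alpha = 0.05 to the stricter 0.01 raises the required sample for the same power, which is the trade-off between significance and power described above. Figures obtained this way will not exactly match published tables, which may rest on exact distributions or on one-tailed tests.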
It is sometimes possible to increase effect size, but usually this is the intractable element in the equation, and accurate estimation of the effect size is essential for calculating power before a study begins, and hence the necessary sample size. An effect size can be estimated in four ways:

1. From a review of the literature or a meta-analysis, which can suggest the size of ES that may be expected.
2. By means of a pilot study, which can gather data from which the size of effect may be estimated.
3. One can make a decision about the smallest size of effect which it is worth identifying. To consider the earlier example of two rival inhalers, if we are willing to accept the two inhalers as equivalent if there is no more than a ten per cent difference in their efficacy of treatment, then this effect size may be set, acknowledging that smaller effects will not be discernible.
4. As a last resort, one can use a ‘guesstimate’ as to whether an ES is ‘small’, ‘medium’ or ‘large’.

These definitions and values for ‘small’, ‘medium’ and ‘large’ effects are conventions, as described by Cohen (1970). A ‘medium’ effect is defined as one which is 'visible to the naked eye', in other words one which could be discerned from everyday experience without recourse to formal measurement. For example, the difference between male and female adult heights in the UK would be counted as a medium ES. Most effects encountered in biomedical and social research should be assumed to be small, unless there is a good reason to claim a medium effect, while a ‘large’ effect size would probably need to be defined as one which is so large that it hardly seems necessary to undertake research into something so well established. Cohen offers the example of the difference between the heights of 13 and 18 year old girls as a ‘large’ effect.

Power calculations may be used as part of the critical appraisal of research papers.
Unfortunately it is rare to see beta values quoted for tests in research reports, and indeed often the results reported are inadequate to calculate effect sizes. Studies of statistical power have evaluated research in various scientific subjects, including nursing, education, management and medicine.

Calculating sample sizes in inferential studies

Calculating sample sizes to achieve sufficient power can be a complicated matter, and you should consult a statistician if you plan to use a quantitative sample methodology. However, there is a short-cut that can be used to give an estimate of sample size for a power calculation. Table 4.2 gives some rough and ready figures for a range of simple tests, based on a power (1 - beta) value of 0.8 and a significance level of 0.05 (5%), for ‘small’ and ‘medium’ effect sizes. You may be surprised to see how large the samples may need to be.

Table 4.2: Necessary sample sizes for statistical tests where alpha (p) = 0.05

Test                    Degrees of Freedom   ES = Small       ES = Medium
T-test                  -                    300 per group    50 per group
F-test                  2                    322 per group    64 per group
F-test                  3                    274 per group    45 per group
F-test                  4                    240 per group    49 per group
Chi squared             1                    785 total        87 total
Chi squared             2                    964 total        107 total
Chi squared             3                    1090 total       121 total
Pearson’s correlation   -                    618              68

Statistical power is important for you to consider when you plan your research study if you are using a quantitative design. If in doubt, you should discuss this with your supervisor and get expert statistical advice before planning your research protocol. Now please undertake the following reflective exercise for your log book.

Reflective Exercise 4.2: Statistical power in health research

Please read the article in the associated reading by Fox and Mathers entitled ‘Empowering your research: statistical power in general practice research’. Now answer these questions:

1. What was the average power of research in the papers surveyed?
2. What proportion of papers had too high a power, and why is this an issue?
3.
What recommendations do Fox and Mathers make to writers of papers which report tests of statistical inference?
4. At what point in your PhD studies should you consult a statistician, to ensure your study is neither under- nor over-powered?

Summary

Key points to remember when deciding on sample selection are:

1. Always try to use a random method where possible and remember that random does not mean haphazard.
2. Random selection means that everybody in your sampling frame has an equal opportunity of being included in your study.
3. If you need to be able to generalise about small or minority groups and to compare those with larger groups, consider using disproportionate stratified sampling, but remember to re-weight the results afterwards if you wish to generalise from the whole sample.

Key points to remember when deciding on sample size are:

1. A Type I error is the error of falsely rejecting a true null hypothesis.
2. The likelihood of committing a Type I error is known as alpha. The conventional level for alpha or p is usually 0.05 or 0.01.
3. A Type II error is the error of failing to reject a false null hypothesis or wrongly accepting a false null hypothesis.
4. The likelihood of committing a Type II error is known as beta. The conventional level of statistical power is set at 0.8 or 80%.
5. There is a trade-off between committing a Type I error and a Type II error, but historically science has placed the emphasis on avoiding Type I errors.
6. Increasing the sample size reduces both Type I and Type II error, but remember that it is costly and unethical to have too large a sample size.
7. To calculate statistical power, you need to estimate the effect size.

Recommended Reading

For details of sampling techniques

Bland M. An Introduction to Medical Statistics. Oxford: Oxford University Press, 1995.
Clegg F. Simple Statistics: A Course Book for the Social Sciences. Cambridge: Cambridge University Press, 1982.

For details of power calculations

Campbell MJ et al. Estimating sample sizes for binary, ordered, categorical and continuous outcomes in two group comparisons. BMJ 1995; 311: 1145-8.
Cohen J. Statistical Power Analysis for the Behavioural Sciences. New York: Academic Press, 1977.
Machin D, Campbell MJ. Statistical Tables for the Design of Clinical Trials. Oxford: Blackwell Scientific, 1987.

Studies of power in different scientific disciplines

Fox NJ, Mathers NJ. Empowering your research: statistical power in general practice research. Family Practice 1997; 14: 324-9.
Polit DF, Sherman RE. Statistical power in nursing research. Nursing Research 1990; 39: 365-369.
Reed JF, Slaichert W. Statistical proof in inconclusive 'negative' trials. Archives of Internal Medicine 1981; 141: 1307-1310.

Learning Review Form: Chapter 4

Please complete this form when you have completed the chapter. If you do not consider you have achieved the learning outcomes, you need to go back and do more work on the chapter or read the recommended reading. When you have completed the form, save it to hand in with your log book.

1. I am confident that I can:

(1 = Not at all, 2 = Partly, 3 = Quite well, 4 = Very well)

• distinguish between random and non-random methods of sample selection   1  2  3  4
• describe the advantages of random sample selection   1  2  3  4
• identify different methods of sample selection   1  2  3  4
• match different methods of sample selection to research design   1  2  3  4
• describe the factors influencing sample size   1  2  3  4
• describe how to calculate the appropriate sample size   1  2  3  4