PUBLIC HEALTH intelligence

Maintaining confidentiality in disseminated data - Dealing with small numbers and the risk of disclosure

May 2013

This briefing summarises the confidentiality issues that can arise when providing data involving small numbers, and describes the methods commonly used to deal with them. It is based on the guidance published by the Office for National Statistics (ONS):
http://www.ons.gov.uk/ons/guide-method/best-practice/disclosure-control-of-healthstatistics/index.html

What do we mean by confidentiality and disclosure in disseminated data?

Protecting confidentiality applies to more than just individual-level data that might contain personal information (e.g. name, date of birth, address or NHS number). The principles of confidentiality should also be considered with summary data, such as numbers in tables, particularly when numbers are small. Although a table may not directly reveal personal information, the availability of other information or the way the table is presented could mean that an individual’s identity or characteristics are revealed; this is often referred to as ‘deductive disclosure’. Providing or publishing data that risks disclosure could breach public trust, the National Statistics Code of Practice or other agreements under which data are collected or received.

When might data carry a disclosure risk?

The risk of disclosure is increased when data are for small geographic areas or organisation units (e.g. wards or GP practices), specific population sub-groups (e.g. ethnic groups), short or recent time periods, or where the data are considered particularly sensitive. Table cells with counts of 1 or 2 are considered potentially ‘unsafe’, and for particularly small areas and sensitive data this may also apply to counts of 3 and 4; care should also be taken where rows or columns contain a lot of zeros.
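The thresholds described above can be turned into a simple screening check. The following Python sketch is illustrative only: the function name `flag_unsafe_cells`, the `max_unsafe` parameter and the sample counts are assumptions made for this example and are not part of the ONS guidance. It flags counts from 1 up to a chosen threshold (2 by default, 4 for small areas or sensitive data); it does not look for rows or columns dominated by zeros, which would still need to be reviewed by eye.

```python
def flag_unsafe_cells(table, max_unsafe=2):
    """Return (row, column, count) for every cell with 1 <= count <= max_unsafe.

    `table` maps a row label to a dict of column label -> count.
    The default threshold of 2 reflects counts of 1 or 2; pass max_unsafe=4
    for particularly small areas or sensitive data.
    """
    flagged = []
    for row_label, row in table.items():
        for col_label, count in row.items():
            if 1 <= count <= max_unsafe:
                flagged.append((row_label, col_label, count))
    return flagged


# Illustrative counts only (not real data): cases by year and age group.
example = {
    2006: {"0-4": 1, "5-9": 3, "10-14": 3},
    2007: {"0-4": 0, "5-9": 2, "10-14": 4},
}

default_flags = flag_unsafe_cells(example)               # counts of 1 or 2
strict_flags = flag_unsafe_cells(example, max_unsafe=4)  # counts of 1 to 4
print(default_flags)
print(strict_flags)
```

With the default threshold only two cells are flagged; with the stricter threshold every non-zero cell in this small example is flagged, which shows how much the choice of threshold matters for small tables.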
The ONS considers three potential situations for disclosure:

General attribute disclosure – Someone who already knows something about a person who is counted in a data table may learn something new about them because of the way in which the data are presented or broken down.

The motivated intruder – Small numbers in tables may prompt people to find other local sources of information in order to identify those individuals and/or reveal information about them.

Identification or self-identification – Small numbers could lead individuals to discover that they are rare or unique in a population, which may cause them harm or distress if they feel exposed as a result.

How can the risk of disclosure be controlled?

Where unsafe cells are identified in tables, a number of options can be used to reduce the disclosure risk.

Redesigning tables

Data may be aggregated across a number of time periods, presented at a higher level of geography, and/or grouped into broader categories. For example, data may be presented as 3-year aggregates or averages rather than by year, at local authority district rather than ward level, or by broad age groups rather than 5-year age groups.

This method can increase the numbers in tables and reduce the risk of disclosure by making any identification more difficult. For example, it would be much harder for a ‘motivated intruder’ to identify an individual in a bigger population and across a longer time period; a greater time lag also increases the chance that an individual has moved. This method can also smooth out random variation, which helps with interpreting the data; it can, however, reduce the level of detail presented.

Modifying cell values

Cell suppression – Unsafe numbers are replaced by a dash or an ‘X’ or, if suppression is applied to all cells containing numbers less than 5, by ‘<5’.
Secondary suppression of other numbers may also be needed if the unsafe numbers could be worked out by subtraction from a total; this is a disadvantage of the method. The exact counts of most cells, however, are maintained.

Rounding – All counts in the table are rounded to a specified base, often 5. In controlled rounding, numbers are rounded down to the nearest multiple of the base; in random rounding, numbers are rounded up or down in a random manner. Conventional rounding, to the nearest multiple, is not considered to give a sufficient level of protection. Rounding ensures counts are still provided in all cells but creates uncertainty about the true values; it does mean, however, that all cells may be changed.

Barnardisation – All counts in the table may be adjusted by -1, 0 or +1 based on a specified probability, introducing a level of uncertainty into the counts. Information loss is minimised but many small numbers remain ‘unsafe’, so this method is rarely used.

Record swapping

This method involves introducing a small level of error in individual records before the data are tabulated. Pairs of records are matched on certain characteristics, e.g. age, and then other characteristics, e.g. geographical location, are swapped; some of the data presented in a table for one geographical area will therefore actually be data for another location.

Small numbers will still be visible in data where record swapping has been used, but the method introduces a level of uncertainty in all of the data, which protects against disclosure. The disadvantages are that a high level of swapping may be required to protect all unsafe cells, and that the method is not obvious to users. This method was applied to the 2001 and 2011 Census data.

Alternative presentation

Rather than providing counts, data may be presented as percentages, rates, percentage change over time, or by graph, map or a text-based commentary.
Alternative presentation methods allow data to be presented without disclosing the actual counts. They also provide the opportunity to guide the interpretation of the data and provide a more informative analysis. Care should be taken with percentages and rates, however: if the denominators used are widely available, it may be possible to back-calculate the original counts.

Examples

The methods most frequently used by Public Health Intelligence teams to avoid disclosure when providing data are table redesign, cell suppression and rounding, sometimes in combination. These methods are presented below using false data for an unspecified condition.

Original data table – Number of cases of the condition by year and age group, 2006-2010.

      Age group (years)
Year   0-4   5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84   85+ Total
2006     1     3     3     5    15    24    29    24    20    34    44    56    58    62    30    15     7     3   433
2007     0     2     4     1    16    22    29    20    21    36    40    58    60    60    35    18     6     1   429
2008     1     1     4     3    12    19    27    29    26    38    42    59    63    58    42    21     4     2   451
2009     4     0     1     5    18    45    26    35    28    33    49    60    66    61    45    22     9     0   507
2010     3     1     6     5    21    26    30    42    31    40    52    61    65    62    45    20    12     8   530

A number of cells for the youngest and oldest age groups are potentially unsafe and present a disclosure risk (these cells were shaded in the original table).

Table redesign 1

      Age group (years)
Year  0-14 15-24 25-34 35-44 45-54 55-64 65-74   75+ Total
2006     7    20    53    44    78   114    92    25   433
2007     6    17    51    41    76   118    95    25   429
2008     6    15    46    55    80   122   100    27   451
2009     5    23    71    63    82   126   106    31   507
2010    10    26    56    73    92   126   107    40   530

Using broader age groups increases the numbers in the cells so that the counts are no longer unsafe.
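The grouping used in Table redesign 1 can be reproduced with a few lines of code. The sketch below is a minimal Python illustration using the false 2006 counts from the original data table; the names `AGE_GROUPS`, `BROAD_GROUPS` and `aggregate` are assumptions chosen for this example, not part of any published tool.

```python
# 5-year age groups, in the order used by the original data table.
AGE_GROUPS = [
    "0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35-39", "40-44",
    "45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80-84", "85+",
]

# False example counts for 2006, taken from the original data table.
COUNTS_2006 = [1, 3, 3, 5, 15, 24, 29, 24, 20, 34, 44, 56, 58, 62, 30, 15, 7, 3]

# Broader age groups of Table redesign 1, each listing the 5-year groups it merges.
BROAD_GROUPS = {
    "0-14": ["0-4", "5-9", "10-14"],
    "15-24": ["15-19", "20-24"],
    "25-34": ["25-29", "30-34"],
    "35-44": ["35-39", "40-44"],
    "45-54": ["45-49", "50-54"],
    "55-64": ["55-59", "60-64"],
    "65-74": ["65-69", "70-74"],
    "75+": ["75-79", "80-84", "85+"],
}


def aggregate(counts_by_group, mapping):
    """Sum the narrow-group counts that fall inside each broad group."""
    return {
        broad: sum(counts_by_group[narrow] for narrow in narrow_groups)
        for broad, narrow_groups in mapping.items()
    }


counts_2006 = dict(zip(AGE_GROUPS, COUNTS_2006))
broad_2006 = aggregate(counts_2006, BROAD_GROUPS)
print(broad_2006)
```

The result for 2006 matches the first row of Table redesign 1, and the yearly total of 433 is unchanged, because aggregation only merges cells and does not alter any count.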
Table redesign 2

           Age group (years)
Year        0-4   5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84   85+ Total
2006-2008     2     6    11     9    43    65    85    73    67   108   126   173   181   180   107    54    17     6  1313
2007-2009     5     3     9     9    46    86    82    84    75   107   131   177   189   179   122    61    19     3  1387
2008-2010     8     2    11    13    51    90    83   106    85   111   143   180   194   181   132    63    25    10  1488

Using broader time periods increases the numbers in the cells. Although a few cells still contain small numbers, increasing the time period that they relate to reduces the disclosure risk.

Cell suppression

      Age group (years)
Year   0-4   5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84   85+ Total
2006    <5    <5    <5     5    15    24    29    24    20    34    44    56    58    62    30    15     7    <5   433
2007    <5    <5    <5    <5    16    22    29    20    21    36    40    58    60    60    35    18     6    <5   429
2008    <5    <5    <5    <5    12    19    27    29    26    38    42    59    63    58    42    21    <5    <5   451
2009    <5    <5    <5     5    18    45    26    35    28    33    49    60    66    61    45    22     9    <5   507
2010    <5    <5     6     5    21    26    30    42    31    40    52    61    65    62    45    20    12     8   530

All numbers less than 5 are replaced with '<5'.

Rounding

      Age group (years)
Year   0-4   5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84   85+ Total
2006     0     0     0     5    15    20    25    20    20    30    40    55    55    60    30    15     5     0   430
2007     0     0     0     0    15    20    25    20    20    35    40    55    60    60    35    15     5     0   425
2008     0     0     0     0    10    15    25    25    25    35    40    55    60    55    40    20     0     0   450
2009     0     0     0     5    15    45    25    35    25    30    45    60    65    60    45    20     5     0   505
2010     0     0     5     5    20    25    30    40    30    40    50    60    65    60    45    20    10     5   530

All numbers, including the totals, are rounded down to the nearest multiple of 5.

Further information

Please follow the link to the Office for National Statistics guidance given at the start of this briefing, or contact the Peterborough City Council or Cambridgeshire County Council Public Health Intelligence Teams:

Peterborough: [email protected]
Cambridgeshire: [email protected]
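As a worked illustration of the cell suppression and rounding examples in this briefing, the Python sketch below applies both methods to a few of the false 2006 counts. The function names `suppress` and `round_down` and their parameters are assumptions made for this example; the rounding shown is the simple rounding down used in the example table, and random rounding is not covered here.

```python
def suppress(count, threshold=5):
    """Replace any count below the threshold with a label such as '<5'."""
    return f"<{threshold}" if count < threshold else count


def round_down(count, base=5):
    """Round a count down to the nearest multiple of the base."""
    return (count // base) * base


# A subset of the false 2006 counts from the original data table.
counts_2006 = {"0-4": 1, "5-9": 3, "10-14": 3, "15-19": 5, "20-24": 15, "85+": 3}

suppressed = {group: suppress(count) for group, count in counts_2006.items()}
rounded = {group: round_down(count) for group, count in counts_2006.items()}
rounded_total = round_down(433)  # the 2006 total is rounded independently

print(suppressed)
print(rounded)
print(rounded_total)
```

Because the total is rounded on its own rather than recalculated from the rounded cells, the rounded cells in a row will generally no longer add up to the rounded total.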