PUBLIC - Cambridgeshire Insight

Public Health Intelligence
Maintaining confidentiality in disseminated data
- Dealing with small numbers and the risk of disclosure
May 2013
This briefing provides a summary of the confidentiality issues which can arise when providing data
involving small numbers, and describes the methods commonly used to deal with them. The
following information is based on the guidance published by the Office for National Statistics (ONS):
http://www.ons.gov.uk/ons/guide-method/best-practice/disclosure-control-of-healthstatistics/index.html
What do we mean by confidentiality and disclosure in disseminated data?
Protecting confidentiality applies to more than just individual-level data which might contain
personal information (e.g. name, date of birth, address or NHS number). The principles of
confidentiality should also be considered with summary data, such as numbers in tables, particularly
when numbers are small. Although a table may not reveal personal information, the availability of
other information or the way the table is presented could mean an individual’s identity or
characteristics are revealed; this is often referred to as ‘deductive disclosure’. Providing or
publishing data which risks disclosure could breach public trust, the National Statistics Code of
Practice or other agreements under which data are collected or received.
When might data carry a disclosure risk?
The risk of disclosure is increased when data are for small geographic areas or organisation units
(e.g. wards or GP practices), specific population sub-groups (e.g. ethnic groups), short or recent time
periods, or where the data are considered particularly sensitive. Table cells with counts of 1 or 2 are
considered potentially ‘unsafe’; for particularly small areas and sensitive data, this may also
apply to counts of 3 and 4. Care should also be taken where rows or columns contain many zeros.
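The thresholds described above can be expressed as a simple check. The following is a minimal sketch only: the function name `unsafe_cells` and the `sensitive` flag are assumptions made for illustration and are not part of the ONS guidance.

```python
def unsafe_cells(counts, sensitive=False):
    """Return the positions of counts treated as potentially unsafe.

    Counts of 1 or 2 are always flagged. Counts of 3 and 4 are also
    flagged when `sensitive` is True (small areas or sensitive data).
    Zeros are not flagged here and need separate review.
    """
    upper = 4 if sensitive else 2
    return [i for i, n in enumerate(counts) if 1 <= n <= upper]


# Hypothetical row of counts for one small area.
row = [1, 0, 1, 4, 3]
default_flags = unsafe_cells(row)
strict_flags = unsafe_cells(row, sensitive=True)
```

With the default setting only the two counts of 1 are flagged; with `sensitive=True` the counts of 3 and 4 are flagged as well.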
The ONS considers three potential situations for disclosure:
General attribute disclosure – Someone who already knows something about a person who
is counted in a data table may learn something new about them due to the way in which the
data are presented or broken down.
The motivated intruder – Small numbers in tables may prompt people to find other local
sources of information in order to identify those individuals and/or reveal information about
them.
Identification or self-identification – Small numbers could lead to the discovery of rareness
or uniqueness in a population by those individuals, which may cause them harm or distress if
they feel exposed as a result.
How can the risk of disclosure be controlled?
Where unsafe cells are identified in tables, there are a number of options which can be used to
reduce the disclosure risk.
Redesigning tables
Data may be aggregated across a number of time periods, presented at a higher level geography,
and/or be grouped into broader categories. For example, data may be presented as 3-year
aggregates or averages rather than by year, at local authority district rather than ward level, or by
broad age groups rather than 5-year age groups.
This method can increase the numbers in tables and reduce the risk of disclosure by making any
identification more difficult. For example, it would be much harder for a ‘motivated intruder’ to
identify an individual in a bigger population and across a longer time period; a greater time lag also
increases the chance an individual has moved. This method can also smooth out random variation,
which helps with interpreting the data; it can, however, reduce the level of detail presented.
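Grouping categories into broader bands amounts to summing the detailed counts. The sketch below uses the 2006 counts for the youngest age groups from the examples in this briefing; the band mapping and the helper name `aggregate` are illustrative assumptions.

```python
# 2006 counts for the youngest five-year age groups (example data
# from this briefing).
counts_2006 = {"0-4": 1, "5-9": 3, "10-14": 3, "15-19": 5, "20-24": 15}

# Illustrative mapping from broad bands to the five-year groups they cover.
BROAD_BANDS = {
    "0-14": ["0-4", "5-9", "10-14"],
    "15-24": ["15-19", "20-24"],
}


def aggregate(counts, bands):
    """Sum detailed counts into the broader bands given in `bands`."""
    return {
        band: sum(counts[group] for group in groups)
        for band, groups in bands.items()
    }


broad_2006 = aggregate(counts_2006, BROAD_BANDS)
```

The three small counts (1, 3 and 3) combine into a single count of 7, at the cost of losing the five-year detail.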
Modifying cell values
Cell suppression – Unsafe numbers are replaced by a dash or an ‘X’, or if applied to all cells
containing numbers less than 5, ‘<5’ may be inserted. Secondary suppression of other numbers may
also be needed if it is possible to work out the unsafe numbers by subtraction from a total; this is a
disadvantage. The exact counts of most cells, however, are maintained.
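The need for secondary suppression can be seen in a small example. This is a sketch with made-up numbers; the helper name `suppress` and the row values are assumptions for illustration.

```python
def suppress(counts, threshold=5):
    """Replace every count below `threshold` with a marker such as '<5'."""
    marker = f"<{threshold}"
    return [n if n >= threshold else marker for n in counts]


# Hypothetical row: three category counts plus a published row total.
row = [3, 12, 20]
total = sum(row)
published = suppress(row)

# Only one cell is hidden, so it can be recovered by subtracting the
# visible cells from the total. This is why a second cell (or the
# total) may also need to be suppressed.
visible = [n for n in published if isinstance(n, int)]
recovered = total - sum(visible)
```

Here `recovered` equals the suppressed count, so primary suppression alone would not protect this row.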
Rounding – All counts in the table are rounded to a specified base, often of 5. In controlled
rounding, numbers are rounded down to the nearest multiple of the base; in random rounding,
numbers are rounded up or down in a random manner. Conventional rounding, to the nearest
multiple, isn’t considered a sufficient level of protection. Rounding ensures counts are still provided
in all cells but creates uncertainty about the true values; it does mean, however, that all cells may be
changed.
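The two rounding rules described above might be sketched as follows. The function names are hypothetical. The random rule shown (round up with probability proportional to the remainder, leaving exact multiples unchanged) is one common choice and is an assumption here, since the text only says the direction is random.

```python
import random


def round_down(n, base=5):
    """Round a count down to a multiple of `base`."""
    return n - n % base


def random_round(n, base=5, rng=random):
    """Round a count up or down to an adjacent multiple of `base`.

    Assumption: exact multiples are left unchanged, and other values
    are rounded up with probability remainder / base.
    """
    remainder = n % base
    if remainder == 0:
        return n
    lower = n - remainder
    return lower + base if rng.random() < remainder / base else lower


rng = random.Random(2013)  # fixed seed so the sketch is repeatable
rounded_down = [round_down(n) for n in [3, 7, 24, 433]]
randomly_rounded = [random_round(n, rng=rng) for n in [3, 7, 24, 433]]
```

Under the round-down rule a count of 3 becomes 0 and 433 becomes 430; under random rounding each value moves to one of its two neighbouring multiples of 5.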
Barnardisation – All counts in the table may be adjusted by -1, 0 or +1 based on a specified
probability, introducing a level of uncertainty into the counts. Information loss is minimised but
many small numbers remain ‘unsafe’ and so this method is rarely used.
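As a rough illustration of Barnardisation, the sketch below adjusts each count by -1, 0 or +1. The function name, the probability `p` and the decision to leave zero counts untouched are all assumptions for illustration rather than a documented specification.

```python
import random


def barnardise(counts, p=0.1, rng=random):
    """Adjust each count by -1 or +1 with probability `p` each.

    With probability 1 - 2p the count is left unchanged. Assumption:
    zero counts are never adjusted, so no count can become negative.
    """
    adjusted = []
    for n in counts:
        u = rng.random()
        if u < p:
            delta = -1
        elif u < 2 * p:
            delta = 1
        else:
            delta = 0
        adjusted.append(n + delta if n > 0 else 0)
    return adjusted


original = [1, 0, 4, 12]
perturbed = barnardise(original, p=0.2, rng=random.Random(1))
```

Because most cells are unchanged or differ by only 1, a published count of 1 or 2 still points to a very small group, which is the weakness noted above.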
Record swapping
This method involves introducing a small level of error in individual records before the data are
tabulated. Pairs of records are matched based on certain characteristics, e.g. age, and then other
characteristics, e.g. geographical location, are swapped; some of the data presented in a table for
one geographical area will therefore actually be data for another location.
Small numbers will still be visible in data where record swapping has been used but the method
introduces a level of uncertainty in all of the data, which protects against disclosure. Disadvantages
are that a high level of swapping may be required to protect all unsafe cells, and the method is not
obvious to users. This method was applied to the 2001 and 2011 Census data.
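The idea of record swapping can be sketched on a handful of made-up records. This is not the procedure used for the Census; the function name, field names and the pairing rule (random pairs within each matching group) are assumptions for illustration.

```python
import random


def swap_records(records, match_key, swap_key, rate, rng=random):
    """Return copies of `records` with `swap_key` swapped between pairs.

    Records are paired at random within groups that share the same
    `match_key` value; each pair is swapped with probability `rate`.
    """
    swapped = [dict(r) for r in records]  # leave the input untouched
    groups = {}
    for i, record in enumerate(swapped):
        groups.setdefault(record[match_key], []).append(i)
    for indices in groups.values():
        rng.shuffle(indices)
        for a, b in zip(indices[0::2], indices[1::2]):
            if rng.random() < rate:
                swapped[a][swap_key], swapped[b][swap_key] = (
                    swapped[b][swap_key],
                    swapped[a][swap_key],
                )
    return swapped


# Hypothetical individual records.
records = [
    {"id": 1, "age_band": "0-14", "ward": "A"},
    {"id": 2, "age_band": "0-14", "ward": "B"},
    {"id": 3, "age_band": "15-24", "ward": "A"},
    {"id": 4, "age_band": "15-24", "ward": "B"},
]
result = swap_records(records, "age_band", "ward", rate=1.0)
```

With `rate=1.0` every matched pair exchanges wards, so each ward's total and each age band's total are unchanged while the individual records behind a cell may belong to another area.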
Alternative presentation
Rather than providing counts, data may be presented as percentages, rates, percentage change over
time, or by graph, map or a text-based commentary. Alternative presentation methods can enable
the presentation of the data without disclosing the actual counts. They also provide the opportunity
to guide the interpretation of the data and provide a more informative analysis. Care should be
taken with percentages and rates, however: if the denominators used are widely available, it may
be possible to back-calculate the original counts.
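Back-calculation is simple arithmetic, as this short sketch shows. The rate, the population and the function name are hypothetical values chosen for illustration.

```python
def back_calculate(rate_per_1000, population):
    """Recover a count from a published rate and a known denominator."""
    return round(rate_per_1000 * population / 1000)


# Hypothetical: a published rate of 1.5 per 1,000 in a ward whose
# population of 2,000 is available from another source.
count = back_calculate(1.5, 2000)
```

Anyone who knows the ward population can work out that the rate corresponds to 3 cases, so publishing the rate does not by itself hide a small count.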
Examples
Methods frequently used by Public Health Intelligence teams to avoid disclosure when providing data are table redesign, cell suppression and rounding,
sometimes in combination. These methods are illustrated below using fictitious data for an unspecified condition.
Original data table – Number of cases of the condition by year and age group, 2006-2010.
Age group (years)   2006   2007   2008   2009   2010
0-4                    1      0      1      4      3
5-9                    3      2      1      0      1
10-14                  3      4      4      1      6
15-19                  5      1      3      5      5
20-24                 15     16     12     18     21
25-29                 24     22     19     45     26
30-34                 29     29     27     26     30
35-39                 24     20     29     35     42
40-44                 20     21     26     28     31
45-49                 34     36     38     33     40
50-54                 44     40     42     49     52
55-59                 56     58     59     60     61
60-64                 58     60     63     66     65
65-69                 62     60     58     61     62
70-74                 30     35     42     45     45
75-79                 15     18     21     22     20
80-84                  7      6      4      9     12
85+                    3      1      2      0      8
Total                433    429    451    507    530
A number of cells for the youngest and oldest age groups are potentially unsafe and present a disclosure risk (shaded in the original table).
Table redesign 1
Age group (years)   2006   2007   2008   2009   2010
0-14                   7      6      6      5     10
15-24                 20     17     15     23     26
25-34                 53     51     46     71     56
35-44                 44     41     55     63     73
45-54                 78     76     80     82     92
55-64                114    118    122    126    126
65-74                 92     95    100    106    107
75+                   25     25     27     31     40
Total                433    429    451    507    530
Using broader age groups increases the numbers in the cells so that the counts are no longer unsafe.
Table redesign 2
Age group (years)   2006-2008   2007-2009   2008-2010
0-4                         2           5           8
5-9                         6           3           2
10-14                      11           9          11
15-19                       9           9          13
20-24                      43          46          51
25-29                      65          86          90
30-34                      85          82          83
35-39                      73          84         106
40-44                      67          75          85
45-49                     108         107         111
50-54                     126         131         143
55-59                     173         177         180
60-64                     181         189         194
65-69                     180         179         181
70-74                     107         122         132
75-79                      54          61          63
80-84                      17          19          25
85+                         6           3          10
Total                    1313        1387        1488
Using broader time periods increases the numbers in the cells. Although a few cells still contain small numbers, increasing the time period that they relate
to reduces the disclosure risk.
Cell suppression
Age group (years)   2006   2007   2008   2009   2010
0-4                   <5     <5     <5     <5     <5
5-9                   <5     <5     <5     <5     <5
10-14                 <5     <5     <5     <5      6
15-19                  5     <5     <5      5      5
20-24                 15     16     12     18     21
25-29                 24     22     19     45     26
30-34                 29     29     27     26     30
35-39                 24     20     29     35     42
40-44                 20     21     26     28     31
45-49                 34     36     38     33     40
50-54                 44     40     42     49     52
55-59                 56     58     59     60     61
60-64                 58     60     63     66     65
65-69                 62     60     58     61     62
70-74                 30     35     42     45     45
75-79                 15     18     21     22     20
80-84                  7      6     <5      9     12
85+                   <5     <5     <5     <5      8
Total                433    429    451    507    530
All numbers less than 5 are replaced with '<5'.
Rounding
Age group (years)   2006   2007   2008   2009   2010
0-4                    0      0      0      0      0
5-9                    0      0      0      0      0
10-14                  0      0      0      0      5
15-19                  5      0      0      5      5
20-24                 15     15     10     15     20
25-29                 20     20     15     45     25
30-34                 25     25     25     25     30
35-39                 20     20     25     35     40
40-44                 20     20     25     25     30
45-49                 30     35     35     30     40
50-54                 40     40     40     45     50
55-59                 55     55     55     60     60
60-64                 55     60     60     65     65
65-69                 60     60     55     60     60
70-74                 30     35     40     45     45
75-79                 15     15     20     20     20
80-84                  5      5      0      5     10
85+                    0      0      0      0      5
Total                430    425    450    505    530
All numbers, including the totals, are rounded down to a multiple of 5 (a base of 5).
Further information
Please follow the link to the Office for National Statistics website given at the start of this briefing, or contact the Peterborough City Council or Cambridgeshire County Council
Public Health Intelligence Teams:
Peterborough: [email protected]
Cambridgeshire: [email protected]