D2.2.2: Data format definition focusing on Competition 2 and beyond

www.visceral.eu

Deliverable number: D2.2.2
Dissemination level: Public
Delivery date: 31 October 2013
Status: Final
Author(s): Tomas Salas

This project is supported by the European Commission under the Information and
Communication Technologies (ICT) Theme of the 7th Framework Programme for Research
and Technological Development.
Grant Agreement Number: 318068
Executive Summary
VISCERAL will provide a very large data set of medical images which will be used for an image
retrieval benchmark and the automated annotation of these images.
These data will come mostly from electronic health records and were originally collected to provide
health care.
The original data will have to go through a series of transformations in order to address legal issues
and to ensure that it conforms to the needs of the benchmarking process.
This deliverable describes the format conventions for the collection, storage, and distribution of data in
the VISCERAL project with a focus on competition 2. The deliverable provides a detailed description
of these conventions, and how they should be implemented in the VISCERAL project.
To keep data management overhead to a minimum, the conventions are fixed at the beginning of the
project, and it is planned to keep them throughout the project lifetime. The conventions are
formulated in a way that allows additional information to be appended if relevant annotation aspects
arise later in the project. In any case, newer conventions should be backwards compatible, so that
existing pipelines can stay fixed.
Table of Contents
1 Introduction
2 Original data
3 Information classification
4 Final dataset
4.1 Objects included in the final dataset
4.2 Objects excluded from final dataset
4.3 Transformations performed on the final data set
4.4 DICOM headers
4.5 Metadata
4.6 Pixel data
4.7 Quality control
4.7.1 Conformity with DICOM standard
4.7.2 Information within the study
4.8 Remaining re-identification risks
5 Conclusion
6 References
List of Abbreviations
DICOM: Digital Imaging and Communications in Medicine
MRI: Magnetic Resonance Imaging
CT: Computed Tomography
RSNA: Radiological Society of North America
HIPAA: Health Insurance Portability and Accountability Act
CDA: Clinical Document Architecture
PDF: Portable Document Format
1 Introduction
Images provided by GENCAT to the VISCERAL project are a subset of the imaging studies stored in
its central medical imaging archive. This information consists solely of DICOM[1] objects.
This deliverable covers the main characteristics of the original data, the criteria used to obtain a
subset of this data, the transformations performed on the final data set to meet legal data protection
requirements, and a detailed description of the final dataset.
Prior to their distribution, the original DICOM objects and images have to be processed in order to
meet specific project requirements, such as ensuring data privacy and providing basic metadata that
allows targeted extractions.
2 Original data
Data to be provided to the VISCERAL project will be a subset obtained from a central medical
imaging archive containing more than 8,000,000 procedures.
It is important to note that the system contains only DICOM objects, and will therefore provide only
images (pixel data) and some metadata associated with them.
The current system is designed to grant authorised professionals access to the information they need in
order to provide healthcare. While it provides comprehensive information about a single patient, it
does not allow exploitation for research purposes.
Prior to exploitation, pre-processing, data analysis and modelling of the dataset are needed in order to
obtain a subset which addresses VISCERAL needs.
The main issues to solve through this process are:
- The system contains personal health data (patient information) and may also include non-health
  personal data (data identifying healthcare professionals, patients’ relatives, etc.).
- The original system only provides aggregated information about the modality performing the
  study, with no information about the procedure that has been performed.
Considering this, the goals of the pre-processing are:
- Make the original information as close to anonymous as possible.
- Obtain from the DICOM headers information that helps to create a dataset according to
  VISCERAL needs: patient’s sex and age, body part examined and, optionally, some additional
  details about the performed procedure (anatomical focus, reason for study, etc.).
3 Information classification
As a result of the processing performed prior to data extraction, the following information will be
provided in order to classify imaging procedures:
- Modality
- Patient’s age
- Patient’s sex
- Body Region
- Anatomical Focus (optional)
- Modality modifier (optional)
Body Region, Anatomical Focus, and Modality modifier have been obtained from the Study
Description, which in most cases is the description of the procedure requested from the imaging
service.
The RadLex Playbook[2] has been used for standardized classification of the information obtained
from the Study Description. It is a component of the RadLex controlled terminology that provides a
standard lexicon for radiology orderables. The RadLex terminology has been developed and is
maintained by the Radiological Society of North America.
The final data set will be created on demand, according to the available modalities, body regions,
anatomical focuses and modality modifiers.
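The classification step described above can be illustrated with a minimal sketch. It assumes a simple keyword lookup on the free-text Study Description; the keyword tables and the function name classify_study_description are hypothetical and only stand in for the much larger RadLex Playbook mapping, which this deliverable does not reproduce.

```python
# Hypothetical keyword tables (assumptions for illustration only).
# The target terms are taken from the Body Region, Anatomic Focus and
# Modality Modifier lists used in this deliverable.
BODY_REGION_KEYWORDS = {
    "abdomen": "Abdomen",
    "chest": "Chest",
    "head": "Head",
}
ANATOMIC_FOCUS_KEYWORDS = {
    # keyword -> (body region implied by the focus, anatomic focus)
    "liver": ("Abdomen", "Liver"),
    "kidney": ("Abdomen", "Kidney"),
    "brain": ("Head", "Brain"),
}
MODALITY_MODIFIER_KEYWORDS = {
    "angio": "Angiography",
    "colonography": "Colonography",
}


def classify_study_description(description: str) -> dict:
    """Derive body region, anatomic focus and modality modifier from free text."""
    text = description.lower()
    result = {"body_region": None, "anatomic_focus": None, "modality_modifier": None}

    for keyword, region in BODY_REGION_KEYWORDS.items():
        if keyword in text:
            result["body_region"] = region
            break

    for keyword, (region, focus) in ANATOMIC_FOCUS_KEYWORDS.items():
        if keyword in text:
            result["anatomic_focus"] = focus
            # Fall back to the region implied by the focus if none was named.
            if result["body_region"] is None:
                result["body_region"] = region
            break

    for keyword, modifier in MODALITY_MODIFIER_KEYWORDS.items():
        if keyword in text:
            result["modality_modifier"] = modifier
            break

    return result
```

Descriptions that match no keyword keep all three fields empty, which corresponds to the optional nature of Anatomical Focus and Modality modifier in the list above.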
The information available at the time of writing is as follows.
MRI
Body Region
Abdomen
Abdomen
Abdomen
Abdomen
Abdomen
Abdomen
Abdomen
Abdomen
Abdomen
Abdomen
Bone
Breast
Cervical Spine
Chest
Chest
Chest
Chest
Chest
Chest
Chest
Chest
Chest
Face
Face
Face
Anatomic Focus
Modality Modifier
Gastrointestinal Tract
Gastrointestinal Tract
Gastrointestinal Tract
Kidney
Kidney
Liver
Liver
Pancreas
Pancreas
Colonography
Enterography
Angiography
Cholangiography
Cholangiography
Cervical
Chest Wall
Heart
Mediastinum
Pulmonary Veins
Ribs
Sternoclavicular Joint
Thoracic
Thoracic
Angiography
Angiography
Maxillofacial
Parotid Gland
Body Region
Head
Head
Head
Head
Head
Head
Head
Lower Extremity
Lower Extremity
Lower Extremity
Lower Extremity
Lower Extremity
Lower Extremity
Lower Extremity
Lower Extremity
Lower Extremity
Lower Extremity
Lower Extremity
Lumbar Spine
Lumbosacral Spine
Neck
Pelvis
Pelvis
Pelvis
Pelvis
Pelvis
Spine
Thoracic Spine
Trunk
Upper Extremity
Upper Extremity
Upper Extremity
Upper Extremity
Upper Extremity
Upper Extremity
Upper Extremity
Upper Extremity
Upper Extremity
Upper Extremity
Anatomic Focus
Modality Modifier
Brain
Brain
Internal Auditory Canal
Paranasal Sinuses
Pituitary Gland
Sella Turcica
Angiography
Ankle
Femur
Fingers
Foot
Knee
Knee
Knee
Knee
Leg
Thigh
Lumbar
Lumbar
Arthrography
Arthrography
Total Arthrography
Hip
Prostate
Rectum
Sacrum
Thoracic
Arm
Carpal Bone
Elbow
Fingers
Forearm
Hand
Humerus
Shoulder
Wrist
CT
Body Region
Anatomic Focus
Modality Modifier
Abdomen
Body Region
Anatomic Focus
Modality Modifier
Abdomen
Abdomen
Abdomen
Abdomen
Abdomen
Abdomen
Abdomen
Bone
Cervical Spine
Chest
Chest
Chest
Chest
Chest
Chest
Chest
Chest
Chest
Chest
Chest
Face
Face
Face
Face
Head
Head
Head
Head
Head
Head
Head
Head
Head
Lower Extremity
Lower Extremity
Lower Extremity
Lower Extremity
Lower Extremity
Lower Extremity
Lower Extremity
Lower Extremity
Lumbar Spine
Gastrointestinal Tract
Gastrointestinal Tract
Gastrointestinal Tract
Kidney
Liver
Pancreas
Peritoneum
Colonography
Enterography
Cervical
Chest Wall
Clavicle
Coronary Arteries
Heart
Lung
Pulmonary Veins
Ribs
Sternoclavicular Joint
Sternum
Thoracic
Maxillofacial
Orbits
Paranasal Sinuses
Brain
Brain
Brain
Ear
Internal Auditory Canal
Middle Ear
Paranasal Sinuses
Sella Turcica
Angiography
Perfusion
Ankle
Femur
Fingers
Foot
Knee
Leg
Thigh
Lumbar
Body Region
Anatomic Focus
Lumbosacral Spine
Neck
Neck
Neck
Pelvis
Pelvis
Pelvis
Spine
Thoracic Spine
Trunk
Upper Extremity
Upper Extremity
Upper Extremity
Upper Extremity
Upper Extremity
Upper Extremity
Upper Extremity
Upper Extremity
Lumbar
Modality Modifier
Larynx
Thyroid Gland
Hip
Sacrum
Thoracic
Arm
Carpal Bone
Elbow
Forearm
Humerus
Shoulder
Wrist
4 Final dataset
The final dataset will be produced from studies selected from the previous list. It will consist of a
Windows file system storing a modified copy of most of the original DICOM objects, and a database
or Excel file with metadata describing those objects and their location within the file system.
The process will preserve the original structure of the study, with objects grouped into series, and
series grouped into studies. This structure will be conveyed both through the metadata database and
the image headers.
No modification will be performed on pixel data.
Reasons to modify or exclude DICOM objects from the final dataset are related to privacy concerns
or ethical issues.
4.1 Objects included in the final dataset
An imaging procedure generates a collection of DICOM objects, mainly images. However, other
DICOM objects may be present within the study.
The table below shows the objects that could be included within the study; image objects are
identified by an asterisk (*).
The remaining objects offer additional information consisting of annotations, measurements and
visualization parameters.
The table includes the DICOM unique identifier for each object (SOP Class UID).
Detailed information on the object definitions can be found in PS 3.3-2012, Digital Imaging and
Communications in Medicine (DICOM) Part 3: Information Object Definitions[3].
SOP Class UID | Description | Image | Comments
1.2.840.10008.5.1.4.1.1.2 | CTImageStorage | * |
1.2.840.10008.5.1.4.1.1.2.1 | Enhanced CT Image Storage | * |
1.2.840.10008.5.1.4.1.1.4 | MRImageStorage | * |
1.2.840.10008.5.1.4.1.1.4.1 | Enhanced MR Image Storage | * |
1.2.840.10008.5.1.4.1.1.7 | SecondaryCaptureImageStorage | * | Secondary Captures are strong candidates to present personal information burned into the pixel data. As indicated, instances with the tag ‘BurnedInAnnotation’ set to ‘YES’ will be removed from the data set. Privacy concerns make a manual review of the Secondary Captures present in the final dataset necessary. The results of this check will determine whether or not Secondary Captures are included in the final dataset.
1.2.840.10008.5.1.4.1.1.66.4 | Segmentation Storage | | Whether this object is included, and under which conditions, should be discussed, as it may introduce bias in the benchmark process.
1.2.840.10008.5.1.4.1.1.11.1 | GrayscaleSoftcopyPresentationState | | Whether this object is included, and under which conditions, should be discussed, as it may introduce bias in the benchmark process.
1.2.840.10008.5.1.4.1.1.88.59 | KeyObjectSelectionDocument | | Whether this object is included, and under which conditions, should be discussed, as it may introduce bias in the benchmark process.
4.2 Objects excluded from final dataset
1. The final data set will not include imaging procedures from patients under 18 years of age.
2. Studies known to have been performed because of rare diseases or, more generally, studies in
categories for which the number of available procedures is too small, will not be included in the
final dataset.
3. Objects which may include personal health data as non-structured data or binary data will be
excluded from the final dataset. These objects are mainly reports in different forms: DICOM
Structured Reports (SR), Adobe PDFs or CDAs (XML documents following the HL7
specification for clinical documents), but also objects containing raw data or procedure logs.
These objects share the characteristic that they may present personal health data as non-structured
information. Anonymization of this kind of information not only presents technical problems,
but also lacks a clear definition of how to deal with it, since the law does not properly
address the need to process personal data in order to anonymize it.
As above, the DICOM unique identifier for each object is included, and additional
information can be found in the DICOM Part 3 document.
SOP Class UID | Description
1.2.840.10008.5.1.4.1.1.66 | Raw Data
1.2.840.10008.5.1.4.1.1.88.11 | Basic Text SR
1.2.840.10008.5.1.4.1.1.88.22 | Enhanced SR
1.2.840.10008.5.1.4.1.1.88.33 | Comprehensive SR
1.2.840.10008.5.1.4.1.1.88.40 | Procedure Log
1.2.840.10008.5.1.4.1.1.88.65 | Chest CAD SR
1.2.840.10008.5.1.4.1.1.88.67 | X-Ray Radiation Dose SR
1.2.840.10008.5.1.4.1.1.88.69 | Colon CAD SR
1.2.840.10008.5.1.4.1.1.88.70 | Implantation Plan SR Document
1.2.840.10008.5.1.4.1.1.104.1 | Encapsulated PDF
1.2.840.10008.5.1.4.1.1.104.2 | Encapsulated CDA
4. Individual instances from a study which present privacy concerns will be excluded from the
final dataset. At present this restriction affects DICOM objects in which either of the following
tags is set to ‘YES’:
DICOM Tag | Description | Value | Exclusion reason
(0028,0301) | BurnedInAnnotation | YES | Indicates that personal data has been included as pixels within the object.
(0028,0302) | RecognizableVisualFeatures | YES | Indicates that the object contains sufficiently recognizable visual features to allow the image or a reconstruction from a set of images to identify the patient.
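The exclusion rules of this section can be sketched as a single decision function. This is only an illustration under stated assumptions: the header is represented as a plain dictionary keyed by attribute name, the function name exclusion_reason is hypothetical, and treating SOP classes that appear in neither table as excluded is an assumption, since the text does not state how unlisted classes are handled. Rule 2 (rare diseases and small categories) needs dataset-level counts and is not covered here.

```python
from typing import Optional

# SOP classes that could be included (section 4.1).
INCLUDED_SOP_CLASSES = {
    "1.2.840.10008.5.1.4.1.1.2": "CTImageStorage",
    "1.2.840.10008.5.1.4.1.1.2.1": "Enhanced CT Image Storage",
    "1.2.840.10008.5.1.4.1.1.4": "MRImageStorage",
    "1.2.840.10008.5.1.4.1.1.4.1": "Enhanced MR Image Storage",
    "1.2.840.10008.5.1.4.1.1.7": "SecondaryCaptureImageStorage",
    "1.2.840.10008.5.1.4.1.1.66.4": "Segmentation Storage",
    "1.2.840.10008.5.1.4.1.1.11.1": "GrayscaleSoftcopyPresentationState",
    "1.2.840.10008.5.1.4.1.1.88.59": "KeyObjectSelectionDocument",
}

# SOP classes excluded because they may carry unstructured personal data (section 4.2).
EXCLUDED_SOP_CLASSES = {
    "1.2.840.10008.5.1.4.1.1.66": "Raw Data",
    "1.2.840.10008.5.1.4.1.1.88.11": "Basic Text SR",
    "1.2.840.10008.5.1.4.1.1.88.22": "Enhanced SR",
    "1.2.840.10008.5.1.4.1.1.88.33": "Comprehensive SR",
    "1.2.840.10008.5.1.4.1.1.88.40": "Procedure Log",
    "1.2.840.10008.5.1.4.1.1.88.65": "Chest CAD SR",
    "1.2.840.10008.5.1.4.1.1.88.67": "X-Ray Radiation Dose SR",
    "1.2.840.10008.5.1.4.1.1.88.69": "Colon CAD SR",
    "1.2.840.10008.5.1.4.1.1.88.70": "Implantation Plan SR Document",
    "1.2.840.10008.5.1.4.1.1.104.1": "Encapsulated PDF",
    "1.2.840.10008.5.1.4.1.1.104.2": "Encapsulated CDA",
}

# Tags (0028,0301) and (0028,0302): a value of YES excludes the instance.
PRIVACY_FLAGS = ("BurnedInAnnotation", "RecognizableVisualFeatures")
MINIMUM_AGE_YEARS = 18


def exclusion_reason(header: dict, patient_age_years: int) -> Optional[str]:
    """Return why an instance is excluded, or None if it may be included."""
    if patient_age_years < MINIMUM_AGE_YEARS:
        return "patient under 18"

    sop_class = header.get("SOPClassUID", "")
    if sop_class in EXCLUDED_SOP_CLASSES:
        return "excluded SOP class: " + EXCLUDED_SOP_CLASSES[sop_class]
    if sop_class not in INCLUDED_SOP_CLASSES:
        # Assumption: classes listed in neither table are left out.
        return "SOP class not listed for inclusion"

    for flag in PRIVACY_FLAGS:
        if str(header.get(flag, "")).upper() == "YES":
            return flag + " is YES"
    return None
```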
4.3 Transformations performed on the final data set
The final data set will be processed in order to remove personal data and personal health data.
The transformations applied to DICOM objects take into account the HIPAA safe
harbour specifications[4] and DICOM Supplement 142[5], and include:
1. Removal of ‘identifiers’ and ‘quasi-identifiers’. Those most commonly found in image
headers are:
  - Names: patients’ names, names of patients’ relatives and healthcare professionals’ names
  - Any reference to the healthcare provider ordering, performing, or reporting the study
  - Dates: birth date, admission date, discharge date, study date
  - Medical record numbers
  - Device identifiers and serial numbers
2. As an exception to the above, the patient’s age will be provided, but with some limitations:
  a. Ages have been grouped into ranges:
     18-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79
     As DICOM only supports a single value for the patient’s age, each patient has been
     assigned an age within their range.
  b. Patients aged 80 or older have been grouped together and their age has been set to ‘80’.
3. All dates within the object have been modified or removed; in particular, the patient’s date of
birth has been removed.
4. UIDs have been modified, with the exception of the SOP Class UID, which has been preserved.
5. Tags intended to contain free-text information have been extracted to the database and
removed from the DICOM object.
6. Vendor proprietary tags have been removed.
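A minimal sketch of some of these transformations is given below, again on a header represented as a plain dictionary. Several points are assumptions rather than the method actually used: the text does not say how the age within a range is chosen (a random pick is used here), nor how UIDs are rewritten (a hash under the 1.2.3.4.5 root seen in the example headers is used here), and the list of removed attributes is only a small illustrative subset. The function names are hypothetical.

```python
import hashlib
import random

# Age ranges from item 2a: 18-24, then five-year ranges up to 75-79.
AGE_RANGES = [(18, 24)] + [(low, low + 4) for low in range(25, 80, 5)]

UID_ROOT = "1.2.3.4.5"  # root that appears in the example headers of this document
UID_ATTRIBUTES = {"StudyInstanceUID", "SeriesInstanceUID", "SOPInstanceUID",
                  "FrameOfReferenceUID"}
# Illustrative subset of identifying attributes (assumption, not the full list).
REMOVED_ATTRIBUTES = {"PatientBirthDate", "ReferringPhysicianName",
                      "InstitutionName", "DeviceSerialNumber"}


def generalise_age(age_years: int, rng: random.Random) -> str:
    """Return a DICOM age string (e.g. '048Y') inside the patient's age range."""
    if age_years < 18:
        raise ValueError("patients under 18 are not part of the dataset")
    if age_years >= 80:
        return "080Y"
    for low, high in AGE_RANGES:
        if low <= age_years <= high:
            # Assumption: any age within the range may be assigned.
            return f"{rng.randint(low, high):03d}Y"
    raise AssertionError("age ranges do not cover 18-79")


def remap_uid(original_uid: str) -> str:
    """Replace a UID with a value derived from a hash (assumed scheme)."""
    digest = hashlib.sha256(original_uid.encode("ascii")).digest()
    number = int.from_bytes(digest[:16], "big")  # at most 39 decimal digits
    return f"{UID_ROOT}.{number}"


def deidentify_header(header: dict, rng: random.Random) -> dict:
    """Apply the sketched transformations; SOPClassUID is left untouched."""
    cleaned = {}
    for name, value in header.items():
        if name in REMOVED_ATTRIBUTES:
            continue
        if name in UID_ATTRIBUTES:
            cleaned[name] = remap_uid(value)
        elif name == "PatientAge":
            # Assumes the original age is expressed in years, e.g. '047Y'.
            cleaned[name] = generalise_age(int(value[:3]), rng)
        elif name == "PatientName":
            cleaned[name] = "Anonymous"
        elif name == "PatientID":
            cleaned[name] = "Anonymous-ID"
        else:
            cleaned[name] = value
    return cleaned
```

A plain hash of the original UID keeps the study, series and instance relationships consistent across objects, but it could be reversed by anyone holding a list of original UIDs; a keyed hash or a lookup table kept by the data provider would avoid that.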
4.4 DICOM headers
The contents of the DICOM header depend on the modality, the manufacturer, the configuration chosen by
the image provider, decisions made by professionals while performing the examination, and any further
post-processing of the images.
Even though the exact content cannot be predicted, it will be very similar to the real examples given here:
CT
Tag | Attribute Name | VR | Value
(0002,0001) | FileMetaInformationVersion | OB | 00\01
(0002,0002) | MediaStorageSOPClassUID | UI | 1.2.840.10008.5.1.4.1.1.2
(0002,0003) | MediaStorageSOPInstanceUID | UI | 1.2.3.4.5.296729645083015812638688310940537360590
(0002,0010) | TransferSyntaxUID | UI | 1.2.840.10008.1.2.1
(0002,0012) | ImplementationClassUID | UI | 1.2.40.0.13.1.1
(0002,0013) | ImplementationVersionName | SH | dcm4che-1.4.27
(0008,0005) | SpecificCharacterSet | CS | ISO_IR 100
(0008,0008) | ImageType | CS | ORIGINAL\PRIMARY\AXIAL
(0008,0012) | InstanceCreationDate | DA | 20081211
(0008,0016) | SOPClassUID | UI | 1.2.840.10008.5.1.4.1.1.2
(0008,0018) | SOPInstanceUID | UI | 1.2.3.4.5.296729645083015812638688310940537360590
(0008,0020) | StudyDate | DA | 20081211
(0008,0021) | SeriesDate | DA | 20081211
(0008,0023) | ContentDate | DA | 20081211
(0008,0030) | StudyTime | TM |
(0008,0031) | SeriesTime | TM |
(0008,0050) | AccessionNumber | SH | 153940361384910000735631294983738438801
(0008,0060) | Modality | CS | CT
(0008,0090) | ReferringPhysiciansName | PN |
(0010,0010) | PatientName | PN | Anonymous
(0010,0020) | PatientID | LO | Anonymous-ID
(0010,0040) | PatientSex | CS | M
(0010,1010) | PatientAge | AS | 048Y
(0012,0062) | Undefined | UN | YES
(0012,0063) | Undefined | UN | DICOM-S142-Baseline
(0018,0022) | ScanOptions | CS | AXIAL MODE
(0018,0050) | SliceThickness | DS | 10.0
(0018,0060) | KVP | DS | 100.0
(0018,0090) | DataCollectionDiameter | DS | 250.0
(0018,1020) | SoftwareVersion | LO | 07MW11.10
(0018,1100) | ReconstructionDiameter | DS | 250.0
(0018,1110) | DistanceSourceToDetector | DS | 949.075
(0018,1111) | DistanceSourceToPatient | DS | 541.0
(0018,1120) | GantryDetectorTilt | DS | 0.0
(0018,1130) | TableHeight | DS | 179.9
(0018,1140) | RotationDirection | CS | CW
(0018,1150) | ExposureTime | IS | 500
(0018,1151) | XRayTubeCurrent | IS | 100
(0018,1152) | Exposure | IS | 50
(0018,1160) | FilterType | SH | HEAD FILTER
(0018,1170) | GeneratorPower | IS | 10000
(0018,1190) | FocalSpot | DS | 1.2
(0018,1210) | ConvolutionKernel | SH | SOFT
(0018,5100) | PatientPosition | CS | HFS
(0020,000D) | StudyInstanceUID | UI | 1.2.3.4.5.87617896750017244363385293660019016200
(0020,000E) | SeriesInstanceUID | UI | 1.2.3.4.5.32665506069341097807191958223385598552
(0020,0010) | StudyID | SH |
(0020,0011) | SeriesNumber | IS | 200
(0020,0012) | AcquisitionNumber | IS | 5
(0020,0013) | InstanceNumber | IS | 5
(0020,0032) | ImagePositionPatient | DS | -128.0\-119.7\-243.5
(0020,0037) | ImageOrientationPatient | DS | 1.0\0.0\0.0\0.0\1.0\0.0
(0020,0052) | FrameOfReferenceUID | UI | 1.2.3.4.5.185257057009278285923460146789534059097
(0020,1040) | PositionReferenceIndicator | LO | SN
(0020,1041) | SliceLocation | DS | -243.5
(0028,0002) | SamplesPerPixel | US | 1
(0028,0004) | PhotometricInterpretation | CS | MONOCHROME2
(0028,0010) | Rows | US | 512
(0028,0011) | Columns | US | 512
(0028,0030) | PixelSpacing | DS | 0.488281\0.488281
(0028,0100) | BitsAllocated | US | 16
(0028,0101) | BitsStored | US | 16
(0028,0102) | HighBit | US | 15
(0028,0103) | PixelRepresentation | US | 1
(0028,0120) | PixelPaddingValue | US|SS | -2000
(0028,1050) | WindowCenter | DS | 150.0
(0028,1051) | WindowWidth | DS | 700.0
(0028,1052) | RescaleIntercept | DS | -1024.0
(0028,1053) | RescaleSlope | DS | 1.0
(0028,1054) | RescaleType | LO | HU
(0040,0244) | PerformedProcedureStepStartDate | DA | 20081211
MR
Tag | Attribute Name | VR | Value
(0002,0001) | FileMetaInformationVersion | OB | 00\01
(0002,0002) | MediaStorageSOPClassUID | UI | 1.2.840.10008.5.1.4.1.1.4
(0002,0003) | MediaStorageSOPInstanceUID | UI | 1.2.3.4.5.69439814550354249985023682275304301129
(0002,0010) | TransferSyntaxUID | UI | 1.2.840.10008.1.2.4.70
(0002,0012) | ImplementationClassUID | UI | 1.2.40.0.13.1.1
(0002,0013) | ImplementationVersionName | SH | dcm4che-1.4.27
(0008,0005) | SpecificCharacterSet | CS | ISO_IR 100
(0008,0008) | ImageType | CS | ORIGINAL\PRIMARY\DIFFUSION\NONE\ND\NORM
(0008,0012) | InstanceCreationDate | DA | 20090311
(0008,0016) | SOPClassUID | UI | 1.2.840.10008.5.1.4.1.1.4
(0008,0018) | SOPInstanceUID | UI | 1.2.3.4.5.69439814550354249985023682275304301129
(0008,0020) | StudyDate | DA | 20090311
(0008,0021) | SeriesDate | DA | 20090311
(0008,0023) | ContentDate | DA | 20090311
(0008,0030) | StudyTime | TM |
(0008,0031) | SeriesTime | TM |
(0008,0050) | AccessionNumber | SH | 135515459970539662550887127401933904768
(0008,0060) | Modality | CS | MR
(0008,0090) | ReferringPhysiciansName | PN |
(0010,0010) | PatientName | PN | Anonymous
(0010,0020) | PatientID | LO | Anonymous-ID
(0010,0040) | PatientSex | CS | M
(0010,1010) | PatientAge | AS | 057Y
(0012,0062) | Undefined | UN | YES
(0012,0063) | Undefined | UN | DICOM-S142-Baseline
(0018,0020) | ScanningSequence | CS | EP
(0018,0021) | SequenceVariant | CS | SK\SP
(0018,0022) | ScanOptions | CS | PFP\FS
(0018,0023) | MRAcquisitionType | CS | 2D
(0018,0024) | SequenceName | SH | *ep_b1000#5
(0018,0025) | AngioFlag | CS | N
(0018,0050) | SliceThickness | DS | 4.0
(0018,0080) | RepetitionTime | DS | 6300.0
(0018,0081) | EchoTime | DS | 100.0
(0018,0083) | NumberOfAverages | DS | 1.0
(0018,0084) | ImagingFrequency | DS | 123.259445
(0018,0085) | ImagedNucleus | SH | 1H
(0018,0086) | EchoNumber | IS | 1
(0018,0087) | MagneticFieldStrength | DS | 3.0
(0018,0088) | SpacingBetweenSlices | DS | 5.2
(0018,0089) | NumberOfPhaseEncodingSteps | IS | 143
(0018,0091) | EchoTrainLength | IS | 1
(0018,0093) | PercentSampling | DS | 100.0
(0018,0094) | PercentPhaseFieldOfView | DS | 100.0
(0018,0095) | PixelBandwidth | DS | 1002.0
(0018,1020) | SoftwareVersion | LO | syngo MR B17
(0018,1251) | TransmittingCoil | SH | Body
(0018,1310) | AcquisitionMatrix | US | 192\0\0\192
(0018,1312) | PhaseEncodingDirection | CS | COL
(0018,1314) | FlipAngle | DS | 90.0
(0018,1315) | VariableFlipAngleFlag | CS | N
(0018,1316) | SAR | DS | 0.109896615
(0018,1318) | dBdt | DS | 0.0
(0018,5100) | PatientPosition | CS | HFS
(0020,000D) | StudyInstanceUID | UI | 1.2.3.4.5.183201517982418290293516120352495178384
(0020,000E) | SeriesInstanceUID | UI | 1.2.3.4.5.66601054998603300312696841387804239516
(0020,0010) | StudyID | SH |
(0020,0011) | SeriesNumber | IS | 2
(0020,0012) | AcquisitionNumber | IS | 6
(0020,0013) | InstanceNumber | IS | 150
(0020,0032) | ImagePositionPatient | DS | -126.54206\-70.64552\91.29098
(0020,0037) | ImageOrientationPatient | DS | 0.9985181\-0.040072843\0.036820725\0.048835676\0.9583674\0.28133065
(0020,0052) | FrameOfReferenceUID | UI | 1.2.3.4.5.82849207386744743147565708052495260877
(0020,1040) | PositionReferenceIndicator | LO |
(0020,1041) | SliceLocation | DS | 70.60576
(0028,0002) | SamplesPerPixel | US | 1
(0028,0004) | PhotometricInterpretation | CS | MONOCHROME2
(0028,0010) | Rows | US | 192
(0028,0011) | Columns | US | 192
(0028,0030) | PixelSpacing | DS | 1.25\1.25
(0028,0100) | BitsAllocated | US | 16
(0028,0101) | BitsStored | US | 12
(0028,0102) | HighBit | US | 11
(0028,0103) | PixelRepresentation | US | 0
(0028,0106) | SmallestImagePixelValue | US|SS | 0
(0028,0107) | LargestImagePixelValue | US|SS | 138
(0028,1050) | WindowCenter | DS | 81.0
(0028,1051) | WindowWidth | DS | 221.0
(0028,1055) | WindowCenterWidthExplanation | LO | Algo1
(0040,0244) | PerformedProcedureStepStartDate | DA | 20090311
4.5 Metadata
Metadata will include the following columns.
Table | Column | DICOM Tag | Comments
STUDY | STUDYDATETIME | |
STUDY | NUMBEROFSERIES | |
STUDY | MODALITIESINSTUDY | (0008,0061) |
STUDY | PATIENTSAGE | (0010,1010) |
STUDY | PATIENTSAGETYPE | (0010,1010) | D = Days; W = Weeks; M = Months; Y = Years
STUDY | PATIENTSWEIGHT | (0010,1030) |
STUDY | PATIENTSSEX | (0010,0040) | M = male; F = female; O = other
STUDY | TOTALINSTANCIES | | Total instances in the study
STUDY | ADMITTINGDIAGNOSISDESCRIPTION | (0008,1080) |
STUDY | MEDICALALERTS | (0010,2000) |
STUDY | ALLERGIES | (0010,2110) |
STUDY | ADDITIONALPATIENTHISTORY | (0010,21B0) |
STUDY | PREGNANCYSTATUS | (0010,21C0) | 0001 = not pregnant; 0002 = possibly pregnant; 0003 = definitely pregnant; 0004 = unknown
STUDY | PATIENTCOMMENTS | (0010,4000) |
STUDY | MAGNETICFIELDSTRENGTH | (0018,0087) |
STUDY | STUDYCOMMENTS | (0032,4000) |
STUDY | SPECIALNEEDS | (0038,0050) |
STUDY | PATIENTSTATE | (0038,0500) |
STUDY | PREMEDICATION | (0040,0012) |
STUDY | BODYREGION | NA | Head; Face; Neck; Chest; Breast; Upper Extremity; Abdomen; Pelvis; Lower Extremity; Spine; Cervical Spine; Lumbar Spine; Lumbosacral Spine; Thoracic Spine; Thoracolumbar Spine; Trunk; Bone
STUDY | ANATOMICFOCUS | NA | Acetabulum; Aorta; Appendix; Aortic Root; Carotid Arteries; Coronary Arteries; Perforator Arteries; Pulmonary Arteries; Joint; Sternoclavicular Joint; Temporomandibular Joint; Left Atrium; Forearm; Arm; Bladder; Popliteal Fossa; Leg; Wrist; Oral Cavity; Brain; Cervical; Clavicle; Elbow; Internal Auditory Canal; Heart; Vocal Cord; Ribs; Thigh; Fingers; Epidural Space; Shoulder; Sternum; Stomach; Femur; Liver; Posterior Cranial Fossa; Knee; Parotid Gland; Salivary Gland; Thyroid Gland; Pituitary Gland; Humerus; Small Bowel; Larynx; Lumbar; Hand; Hip; Maxillofacial; Mediastinum; Spleen; Muscle; Middle Ear; Orbits; Ear; Long Bone; Temporal Bone; Carpal Bone; Facial Bones; Pancreas; Lung Parenchyma; Chest Wall; Peritoneum; Fibula; Foot; Pleura; Circle Of Willis; Prostate; Lung; Cyst; Renal Cyst; Rectum; Retroperitoneum; Kidney; Sacrum; Sella Turcica; Paranasal Sinuses; Thoracic Outlet; Subdiaphragm; Adrenal; Soft Tissue Of The Neck; Tibia; Thoracic; Gastrointestinal Tract; Trachea; Ankle; Coronary Veins; Pulmonary Veins; Airway
STUDY | REASONFOREXAM | NA | Ablation; Radiofrequency Ablation; ARVD; Needle; Aspiration; Biopsy; Nerve Block; Calcium Score; Calculus; Stroke; Kyphoplasty; Screw Placement; Needle Placement; Foreign Body; Craniosynostosis; Cryoablation; Diagnostic; Donor; Drainage; Embolism; Structure; Screening; Facet Block; Fiducial; Fistula; Fracture; Function; Gout; Hematuria; Hemorrhage; Inflammation; Congenital Disease; Interstitial Disease; Myelopathy; Morphology; Nanoknife; Malignant Neoplasm; Nodule; Paracentesis; Pericardiocentesis; Post Op; Pre Op; Follow-Up Procedure; Prosthesis; Puncture; Chemoembolization; Radiculopathy; Thoracentesis; Trauma; Tube; Tumor; Vascular; Vertebroplasty
STUDY | MODALITYMODIFIER | NA | 3D Imaging Processing; High Resolution; Angiography; Arthrography; Bronchography; Cisternography; Cystography; Placement; Cholangiography; Colonography; Densitometry; Dynamic; Discogram; Low Dose; Enterography; Surgical Equipment; Dental Scan; Scanogram; Guidance; Limited; Localization; Measurement; Myelography; Single Phase; Multiphase; Pelvimetry; Perfusion; Portography; Reconstruction; Stereotaxis; Triphasic; Urography; Venography
SERIES | SERIESDATETIME | |
SERIES | MODALITY | (0008,0060) |
SERIES | SOPCLASSUID | (0008,0016) |
SERIES | SOPCLASSUID DESCRIPTION | |
SERIES | SERIESDESCRIPTION | (0008,103E) |
SERIES | BODYPARTEXAMINED | (0018,0015) |
SERIES | PROTOCOLNAME | (0018,1030) |
SERIES | IMAGETYPE | (0008,0008) |
SERIES | PERFORMEDPROCEDURETYPEDESCRIPTION | (0040,0255) |
SERIES | SCHEDULEDPROCEDURESTEPDESCRIPTION | (0040,0007) |
SERIES | REQUESTPROCEDUREDESCRIPTION | (0032,1060) |
SERIES | MANUFACTURER | (0008,0070) |
SERIES | MANUFACTURERMODELNAME | (0008,1090) |
SERIES | REASONFORSTUDY | (0032,1030) | NA
SERIES | REASONFORTHEREQUESTEDPROCEDURE | (0040,1002) |
SERIES | NUMEROINSTANCIAS | NA | Total instances in the series
SERIES | CONTRASTBOLUSAGENT | (0018,0010) |
SERIES | SCANNINSEQUENCE | (0018,0020) |
SERIES | SEQUENCEVARIANT | (0018,0021) |
SERIES | SCANOPTIONS | (0018,0022) |
SERIES | MRAACQUISITIONTYPE | (0018,0023) |
SERIES | PATIENTPOSITION | (0018,5100) | HFP = Head First-Prone; HFS = Head First-Supine; HFDR = Head First-Decubitus Right; HFDL = Head First-Decubitus Left; FFDR = Feet First-Decubitus Right; FFDL = Feet First-Decubitus Left; FFP = Feet First-Prone; FFS = Feet First-Supine
SERIES | LATERALITY | (0020,0060) | R = right; L = left
SERIES | REQUESTEDCONTRASTAGENT | (0032,1070) |
SERIES | SLICETHICKNESS | (0018,0050) |
SERIES | SPACINGBETWEENSLICES | (0018,0088) |
INSTANCE | INSTANCEDATETIME | NA |
INSTANCE | RELATIVEPATH | NA |
INSTANCE | IMAGECOMMENTS | (0020,4000) |
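Both PATIENTSAGE and PATIENTSAGETYPE are derived from the same DICOM tag, (0010,1010), whose value combines three digits and a unit letter (for example 048Y in the CT header above). The sketch below shows one way that split could be done; the function names and the dictionary representation of a header and of a STUDY row are assumptions made for illustration.

```python
from typing import Tuple

# Unit codes listed for the PATIENTSAGETYPE column.
AGE_UNITS = {"D": "Days", "W": "Weeks", "M": "Months", "Y": "Years"}


def split_patient_age(age_string: str) -> Tuple[int, str]:
    """Split a DICOM age string such as '048Y' into (48, 'Y')."""
    if len(age_string) != 4 or not age_string[:3].isdigit():
        raise ValueError("expected three digits followed by a unit: " + repr(age_string))
    unit = age_string[3]
    if unit not in AGE_UNITS:
        raise ValueError("unknown age unit: " + repr(unit))
    return int(age_string[:3]), unit


def study_age_columns(header: dict) -> dict:
    """Build the age and sex columns of a STUDY row from header attributes."""
    age, age_type = split_patient_age(header["PatientAge"])
    return {
        "PATIENTSAGE": age,
        "PATIENTSAGETYPE": age_type,
        "PATIENTSSEX": header.get("PatientSex"),
    }
```

For the CT example header this yields an age of 48 with type Y and sex M.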
4.6 Pixel data
Pixel data has not been modified. Below is a view of one image as retrieved through a DICOM viewer.
4.7 Quality control
The final data will be verified prior to its transfer. Verifications will be designed to ensure that
DICOM objects are well formed and that each study contains at least the minimum amount of
information required for VISCERAL.
4.7.1 Conformity with DICOM standard
Two types of verification will be performed:
1. Manual retrieval of some studies using several DICOM viewers, checking that the information in
the database is consistent with what the viewer shows (series and instances within the study)
2. Automated checking of the conformance of the resulting objects with the DICOM standard
4.7.2 Information within the study
For a number of reasons, a well-formed study may not include enough information for automated
processing.
The most common case relates to the post-processing of image instances, which should result in a new
series within the original study. Even so, it is not uncommon for the resulting instances to be saved
into a new study. As this new study will not include the original image instances, it is considered to
have no utility for the VISCERAL project. The most effective way to address this and similar issues is
to set a minimum number of instances for a study to be considered valid. This number is currently set
to 100 instances. Studies below this threshold will be removed from the dataset.
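The threshold rule can be sketched as follows, assuming the per-study instance counts have already been read from the TOTALINSTANCIES column of the metadata. The function name filter_studies and the dictionary input are hypothetical.

```python
from typing import Dict, List, Tuple

MINIMUM_INSTANCES = 100  # current threshold stated in this section


def filter_studies(total_instances: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Split study identifiers into (kept, removed) using the instance threshold."""
    kept: List[str] = []
    removed: List[str] = []
    for study_id, count in sorted(total_instances.items()):
        if count >= MINIMUM_INSTANCES:
            kept.append(study_id)
        else:
            removed.append(study_id)
    return kept, removed
```

A study with exactly 100 instances is kept, since the text describes 100 as the minimum.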
4.8 Remaining re-identification risks
Personal information is any combination of values that can make a person identifiable; therefore, no
definitive list of such values exists.
It must also be considered that effective anonymization would require a better understanding of how
data can be re-identified. At present, it is not known which information is required or available to
re-identify a given dataset, nor is it possible to assess the probability that an attacker could build
or obtain a re-identifying database.
Under these circumstances it cannot be guaranteed that the whole data set is completely anonymous.
The re-identification risk may be insignificant for small datasets, but unacceptable for large ones.
This issue will have to be addressed through organizational, technical and legal measures, including:
- Access control policies and mechanisms
- A commitment not to:
  o produce additional copies of the dataset
  o re-use the data
  o try to re-identify patients
- Deleting from the data set any records found to contain personal information, and notifying the
  data provider
- Reporting any data breach
5 Conclusion
Personal health data collected in order to provide healthcare cannot be used for research activities
without prior preparation.
In the case of GENCAT data, the main tasks performed to obtain the final data set to be provided to
VISCERAL have been:
1. A semi-automated process of information classification based on modality and body part
examined, with the body part examined obtained from the study description. The product of
this classification process is a catalogue that classifies the original information according to
these criteria. This catalogue allows further processing to be targeted at specific subsets of the
original information.
2. Assessment of DICOM objects according to the type of information they contain and the
viability of de-identifying them where applicable. This process produced a list of DICOM
objects not to be extracted.
3. Data extraction from the production systems according to selections made on the above
catalogue and the restrictions obtained from the assessment process. This extraction process
creates a candidate encrypted copy of the original objects and a database containing
metadata obtained from the DICOM headers.
4. De-identification of candidate objects.
5. Review of de-identified objects in order to remove any remaining personal health
information.
6. Quality controls.
As a result of these activities a transferable data set is obtained.
6 References
[1] http://medical.nema.org/
[2] http://www.rsna.org/RadLex_Playbook.aspx
[3] http://medical.nema.org/Dicom/2011/11_03pu.pdf
[4] http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/Deidentification/guidance.html
[5] ftp://medical.nema.org/medical/dicom/final/sup142_ft.pdf