Extracting `Hidden` Information from the Free Text

2013-11-18
How much extra information is there and how
can we extract it?
Rosemary Tate
CPRD Senior Research Fellow Informatics,
University of Sussex
Overview
• Introduction
• Challenges
• Extracting information on ovarian cancer symptoms
• What next?
Challenges
1D13.00 [complains of a pain] back and legs and abdo = loss appetite and heavingh - tatt - too wind no alterered nbowel habit no blood pr o/e - angular stomatitiis - looks unwell - abdo - ? sweling - marginal R side pr ok diag ???? divertuculitis? bllods and rv 1w
• wide variety of means of expression
• linguistic expressions may be idiosyncratic to individual GPs – many ways of describing same symptom
• very limited use of full-sentence syntax
• abbreviations, typos, missing punctuation and spaces
• negation usually very important
• only a limited amount of text available since it has to be manually anonymised
Ovarian Cancer study – extracting information on diagnosis and symptoms from the free text
• Anonymised records for 344 women (aged 40–80) diagnosed with ovarian cancer between 2004 and 2008
Methods: Ovarian Cancer symptoms
Found all records containing occurrences of the 5 most common symptoms, using three methods:
• finding Read codes for each symptom in the code field of records
• automatically recognising mentions of the symptoms in the text field of records by matching the Read codes’ descriptions – using extended string matching and a negation detection algorithm
• manual annotation of the text in the records – by 3 clinicians independently
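
The second method can be illustrated with a minimal sketch. It is not the study's algorithm: the term lists, the negation cues and the three-token window are all assumptions chosen for illustration, and the "extended string matching" is reduced here to exact matching against a few hand-written variants.

```python
import re

# Hypothetical lexicon: each symptom maps to a few surface variants.
# Illustrative only; these are not the study's Read code descriptions.
SYMPTOM_TERMS = {
    "abdominal pain": ["abdominal pain", "abdo pain", "pain in abdomen"],
    "abdominal swelling": ["abdominal swelling", "abdo swelling"],
    "loss of appetite": ["loss of appetite", "loss appetite"],
}

# Assumed negation cues and look-back window (in tokens).
NEGATION_CUES = {"no", "not", "nil", "denies", "without"}
WINDOW = 3


def tokenize(text):
    """Lower-case the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9/?]+", text.lower())


def find_symptoms(text):
    """Return (symptom, negated) pairs for every variant found in the text."""
    tokens = tokenize(text)
    found = []
    for symptom, variants in SYMPTOM_TERMS.items():
        for variant in variants:
            words = variant.split()
            n = len(words)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == words:
                    # A mention counts as negated if a cue occurs shortly before it.
                    preceding = tokens[max(0, i - WINDOW):i]
                    negated = any(t in NEGATION_CUES for t in preceding)
                    found.append((symptom, negated))
    return found


print(find_symptoms("c/o abdo pain, no loss appetite"))
```

A fixed look-back window is the simplest possible negation rule; it will misfire on notes where a cue belongs to a different finding, which is one reason the manual annotation serves as the reference.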
Results: incidence of symptoms

For each of the symptoms, we computed the number of women
recorded as having the symptom at least once in the 12 months
prior to diagnosis of cancer.
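
The count described above can be sketched as follows. The record layout (patient id, date, symptom), the approximation of 12 months as 365 days, and the exclusion of the diagnosis date itself are assumptions for illustration; the sample data is invented.

```python
from datetime import date, timedelta


def women_with_symptom(records, diagnosis_dates, symptom, window_days=365):
    """Count distinct patients with the symptom recorded at least once
    in the window before their diagnosis date (assumed: 365 days)."""
    patients = set()
    for patient_id, record_date, record_symptom in records:
        if record_symptom != symptom:
            continue
        diagnosis = diagnosis_dates.get(patient_id)
        if diagnosis is None:
            continue
        window_start = diagnosis - timedelta(days=window_days)
        if window_start <= record_date < diagnosis:
            patients.add(patient_id)  # a set counts each patient only once
    return len(patients)


# Invented example data, not taken from the study.
records = [
    ("p1", date(2005, 3, 1), "abdominal pain"),
    ("p1", date(2005, 6, 1), "abdominal pain"),
    ("p2", date(2003, 1, 1), "abdominal pain"),
    ("p3", date(2006, 2, 1), "bloating"),
]
diagnosis_dates = {
    "p1": date(2005, 9, 1),
    "p2": date(2005, 1, 1),
    "p3": date(2006, 5, 1),
}

print(women_with_symptom(records, diagnosis_dates, "abdominal pain"))
```

Running the same function once over coded mentions and once over mentions found in the text gives two counts per symptom that can be compared directly.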
Summary of findings
• Estimates based on codes significantly underestimate incidence of symptoms
• A significant proportion of the extra information can be extracted automatically using straightforward methods
• These methods are applicable to un-anonymised text

What next?

• We have a rich data set that can help us to learn generalisations
• Recognize common variations
• Learn mapping from Read terms to annotated terms
• Use ontologies, like UMLS
• Automatically created thesauruses
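
One simple way to recognise common spelling variations is approximate string matching against a list of known terms. The sketch below uses Python's standard `difflib` module as a stand-in; it is not a learned mapping or an ontology lookup, and the vocabulary and the 0.85 cutoff are assumptions for illustration.

```python
import difflib

# Hypothetical vocabulary of canonical terms (illustrative only).
VOCABULARY = ["stomatitis", "swelling", "diverticulitis"]


def normalise(token, vocabulary=VOCABULARY, cutoff=0.85):
    """Return the closest canonical term, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for token in ["stomatitiis", "sweling", "divertuculitis", "wind"]:
    print(token, "->", normalise(token))
```

A high cutoff keeps short unrelated words from being mapped to a clinical term by accident, at the cost of missing heavier misspellings and abbreviations such as "abdo".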
What next?

• Chunking – split text into meaningful segments
  • using machine learning trained on annotated records
  • apply chunker to recognise and classify medical entities
• Automated anonymisation – first attempt has produced promising results (10% false positives), but the algorithm is being improved
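
To make the chunking idea concrete, the sketch below shows one common way a chunker's output is represented and decoded: per-token B/I/O tags that are grouped back into labelled segments. The tag scheme and the `SYMPTOM` label are assumptions for illustration; the tags here are written by hand, whereas in practice they would come from a model trained on the annotated records.

```python
def bio_to_chunks(tokens, tags):
    """Group tokens tagged B-X / I-X / O into (text, label) chunks.

    An I- tag that does not continue a chunk of the same label is ignored.
    """
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            current_tokens.append(token)
        else:
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks


# Hand-written tags for a fragment of a note (hypothetical labels).
tokens = ["no", "blood", "pr", "looks", "unwell", "abdo", "sweling"]
tags = ["O", "B-SYMPTOM", "I-SYMPTOM", "O", "O", "B-SYMPTOM", "I-SYMPTOM"]

print(bio_to_chunks(tokens, tags))
```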
Acknowledgements
• John Carroll
• Rob Koeling
• Aleksandar Savkov
• Alex Martin
• Jackie Cassell
