2013-11-18

EXTRACTING ‘HIDDEN’ INFORMATION FROM THE FREE TEXT
How much extra information is there and how can we extract it?

Rosemary Tate
CPRD Senior Research Fellow
Informatics, University of Sussex

Overview
• Introduction
• Challenges
• Extracting information on ovarian cancer symptoms
• What next

Challenges
1D13.00 [complains of a pain] back and legs and abdo = loss appetite and heavingh - tatt too - wind too no alterered nbowel habit no blood pr o/e angular stomatitiis - looks unwell - abdo - ? sweling - marginal R side pr ok diag ???? divertuculitis? bllods and rv 1w

• wide variety of means of expression
• linguistic expressions may be idiosyncratic to individual GPs: many ways of describing the same symptom
• very limited use of full-sentence syntax; abbreviations, typos, missing punctuation and spaces
• negation usually very important
• only a limited amount of text available, since it has to be manually anonymised

Ovarian Cancer study: extracting information on diagnosis and symptoms from the free text
• Anonymised records for 344 women (aged 40–80) diagnosed with ovarian cancer between 2004 and 2008

Methods: Ovarian Cancer symptoms
Found all records containing occurrences of the 5 most common symptoms, using three methods:
• finding Read codes for each symptom in the code field of records
• automatically recognising mentions of the symptoms in the text field of records by matching the Read codes’ descriptions, using extended string matching and a negation detection algorithm
• manual annotation of the text in the records by 3 clinicians independently

Results: incidence of symptoms
For each of the symptoms, we computed the number of women recorded as having the symptom at least once in the 12 months prior to diagnosis of cancer.

Summary of findings
• Estimates based on codes significantly underestimate the incidence of symptoms
• A significant proportion of the extra information can be extracted automatically using straightforward methods which are applicable to un-anonymised text

What next?
We have a rich data set that can help us to learn generalisations.
• Recognise common variations
• Learn mapping from Read terms to annotated terms
• Use ontologies, like UMLS
• Automatically created thesauruses
• Chunking: split text into meaningful segments
  - using machine learning trained on annotated records
  - apply chunker to recognise and classify medical entities
• Automated anonymisation: first attempt has produced promising results (10% false positives, but algorithm being improved)

Acknowledgements
John Carroll, Rob Koeling, Aleksandar Savkov, Alex Martin, Jackie Cassell
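
Illustrative sketch (not the study's code)
The second method on the Methods slide (matching symptom descriptions in the text field, with negation detection) and the count described under Results (women with at least one mention in the 12 months before diagnosis) might look roughly like the Python below. Everything named here is an assumption made for illustration: the symptom variants in SYMPTOM_PATTERNS, the negation cue list, the three-token window, and the (patient_id, record_date, text) record layout are hypothetical and are not taken from the study. The study's "extended string matching" presumably also copes with typos such as "sweling", which this sketch does not attempt.

```python
import re
from datetime import date, timedelta

# Hypothetical surface variants per symptom (illustrative only).
SYMPTOM_PATTERNS = {
    "abdominal pain": re.compile(
        r"\b(abdo(minal)?|stomach|tummy)\s+pain\b", re.I),
    "loss of appetite": re.compile(
        r"\b(loss\s+(of\s+)?appetite|appetite\s+loss)\b", re.I),
    "abdominal swelling": re.compile(
        r"\babdo(minal)?\s+(swelling|distension|bloating)\b", re.I),
}

# Assumed negation cues and look-back window (in tokens).
NEGATION_CUES = {"no", "not", "nil", "denies", "without"}
NEGATION_WINDOW = 3


def is_negated(text, start):
    """True if a negation cue occurs shortly before position `start`.

    Only the current clause is inspected: the text before the mention is
    cut at the last '.', ',', ';', ':' or '-' so that a cue in an earlier
    clause does not negate this mention.
    """
    clause = re.split(r"[.,;:\-]", text[:start])[-1]
    tokens = clause.lower().split()[-NEGATION_WINDOW:]
    return any(token in NEGATION_CUES for token in tokens)


def find_symptoms(text):
    """Return the set of symptoms mentioned in `text` and not negated."""
    found = set()
    for symptom, pattern in SYMPTOM_PATTERNS.items():
        for match in pattern.finditer(text):
            if not is_negated(text, match.start()):
                found.add(symptom)
    return found


def women_with_symptom(records, diagnosis_dates, window_days=365):
    """Count patients with at least one non-negated mention of each
    symptom in the `window_days` before their diagnosis date.

    `records` is an iterable of (patient_id, record_date, text) tuples;
    `diagnosis_dates` maps patient_id to a datetime.date.
    """
    patients_by_symptom = {}
    for patient_id, record_date, text in records:
        diagnosed = diagnosis_dates[patient_id]
        window_start = diagnosed - timedelta(days=window_days)
        if window_start <= record_date < diagnosed:
            for symptom in find_symptoms(text):
                patients_by_symptom.setdefault(symptom, set()).add(patient_id)
    return {s: len(ids) for s, ids in patients_by_symptom.items()}
```

Counting sets of patient identifiers rather than mentions means a woman with several records of the same symptom is counted once, as the Results slide requires. The clause-bounded look-back is a deliberate simplification: it catches "no abdo pain" but can still misfire on notes with no punctuation at all, which the Challenges slide suggests are common.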