Medical information extraction with deep neural networks

Supervisors

Suitable for

MSc in Computer Science

Abstract

Background. Electronic Health Records (EHRs) are the databases used by hospitals and general practitioners to log, on a daily basis, all the information they record from patients (i.e. disorders, medications taken, symptoms, medical tests, etc.). In number of subjects (e.g. 50 million patients in the case of EMIF, http://www.emif.eu/), EHRs are the largest source of empirical data in biomedical research, and they have enabled major scientific findings in central disorders such as cancer and Alzheimer’s disease [1]. However, most of the information held in EHRs is in the form of natural language text (written by the physician during each session with each patient), making it inaccessible for research. Unlocking this information would bring a very significant advancement to biomedical research, multiplying the quantity and variety of scientifically usable data, which is why major efforts have relatively recently been initiated towards this aim (e.g. the I2B2 challenges, https://www.i2b2.org/NLP/, or the UK-CRIS network of EHRs, https://crisnetwork.co/uk-cris-programme).

In Artificial Intelligence, Information Extraction (IE) is the task of systematically extracting information from natural language text in a form that can later be processed by a computer. For instance, if a physician describes the symptoms and full treatment of a patient, an IE task could be to identify, from the text alone, all the drugs prescribed to the patient and their dosages. Although Natural Language Processing (NLP) algorithms can perform this task with fair accuracy in the simpler situations (e.g. well-structured text, large amounts of labelled data available), the challenge remains unsolved in the more complex cases (e.g. badly structured language, no labelled samples), which are closer to the text typically found in EHRs. Namely, physicians tend to use badly formatted shorthand and non-widespread acronyms (e.g. ‘transport pt to OT tid via W/C’ for ‘transport patient to occupational therapy three times a day via wheelchair’), while labelled records are scarce (ranging in the hundreds for a given task, with the best labelled datasets provided by I2B2).

Project. Recent Deep Neural Network (DNN) architectures [2] have shown remarkable results in traditionally unsolved NLP problems, including some IE tasks such as slot filling [3] and relation classification [4]. When transferring this success to EHRs, DNNs offer the advantage of not requiring well-formatted text, but the problem remains that labelled data are scarce (in the hundreds for EHRs, rather than the tens of thousands used in typical DNN studies). However, ongoing work in our lab has shown that certain extensions of recent NLP-DNN architectures can reproduce the typical remarkable success of DNNs in situations with limited labelled data (paper in preparation). Namely, incorporating interaction terms into feed-forward DNN architectures [5] can raise the performance of relation classification on I2B2 datasets from an F1 score of 0.65 to 0.90, while the highest performance previously reported on the same dataset was 0.74. With an F1 score of 0.90, the quality of the extracted information meets the standards required for it to be used in subsequent biomedical studies, promising to unlock the scientific data that is at present hidden in the free text of EHRs. We therefore propose to apply DNNs to the problem of information extraction in EHRs, using I2B2 and UK-CRIS data as a testbed; a minimal illustrative sketch of such a relation classifier is given below.
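As a rough illustration only (not the architecture developed in our lab), the following Keras sketch shows a feed-forward relation classifier where an element-wise product of the two candidate entity representations serves as an interaction term. All names, dimensions and the use of pre-computed fixed-length embeddings are assumptions made for the example.

# Hypothetical sketch: feed-forward relation classifier with interaction terms.
# Assumes pre-computed fixed-length embeddings for the two candidate entities
# and their sentence context; dimensions and class count are placeholders.
from tensorflow.keras import layers, models

EMB_DIM = 300      # e.g. averaged word embeddings per span (placeholder)
N_CLASSES = 8      # e.g. number of I2B2 relation types (placeholder)

ent1 = layers.Input(shape=(EMB_DIM,), name="entity1")
ent2 = layers.Input(shape=(EMB_DIM,), name="entity2")
ctx = layers.Input(shape=(EMB_DIM,), name="context")

# Interaction term: element-wise product of the two entity representations,
# capturing pairwise feature interactions that a plain concatenation misses.
interaction = layers.Multiply()([ent1, ent2])

x = layers.Concatenate()([ent1, ent2, ctx, interaction])
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.5)(x)
out = layers.Dense(N_CLASSES, activation="softmax")(x)

model = models.Model(inputs=[ent1, ent2, ctx], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()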
More specifically, the DNNs designed and implemented by the student should be able to extract medically relevant information, such as prescribed drugs or diagnoses given to patients. This corresponds to some of the challenges proposed by I2B2 in recent years (https://www.i2b2.org/NLP/Medication/), and to objectives of high interest in UK-CRIS, which have sometimes been addressed with older techniques such as rule-based systems [1,6,7]. The student is free to use the extension of the feed-forward DNN developed in our lab, or to explore other feed-forward or recurrent (e.g. RNN, LSTM or GRU) alternatives; a sketch of one recurrent alternative follows below. The DNN should be implemented in Python’s Keras (https://keras.io/), Theano (http://deeplearning.net/software/theano/), TensorFlow (https://www.tensorflow.org/), or PyTorch (http://pytorch.org/).
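Purely as an illustration of the recurrent alternatives mentioned above, the sketch below tags each token of a clinical note with BIO-style labels (e.g. for drug names and dosages) using a bidirectional LSTM in Keras. Vocabulary size, sequence length, tag set and layer sizes are placeholders, not values taken from the I2B2 or UK-CRIS data.

# Hypothetical sketch: bidirectional LSTM sequence tagger for medication
# extraction framed as token labelling. All sizes are illustrative.
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # token vocabulary size (placeholder)
MAX_LEN = 100        # maximum note length in tokens (placeholder)
N_TAGS = 5           # e.g. B-DRUG, I-DRUG, B-DOSE, I-DOSE, O

tokens = layers.Input(shape=(MAX_LEN,), dtype="int32", name="tokens")
x = layers.Embedding(VOCAB_SIZE, 128, mask_zero=True)(tokens)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
tags = layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax"))(x)

model = models.Model(tokens, tags)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()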

Bibliography

[1] G. Perera, M. Khondoker, M. Broadbent, G. Breen, R. Stewart, Factors Associated with Response to Acetylcholinesterase Inhibition in Dementia: A Cohort Study from a Secondary Mental Health Care Case Register in London, PLOS ONE 9 (2014) e109484. doi:10.1371/journal.pone.0109484.

[2] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444. doi:10.1038/nature14539.

[3] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, G. Zweig, Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding, IEEE/ACM Trans. Audio Speech Lang. Process. 23 (2015) 530–539. doi:10.1109/TASLP.2014.2383614.

[4] C.N. dos Santos, B. Xiang, B. Zhou, Classifying Relations by Ranking with Convolutional Neural Networks, CoRR abs/1504.06580 (2015). http://arxiv.org/abs/1504.06580.

[5] M. Denil, A. Demiraj, N. Kalchbrenner, P. Blunsom, N. de Freitas, Modelling, Visualising and Summarising Documents with a Single Convolutional Neural Network, CoRR abs/1406.3830 (2014). http://arxiv.org/abs/1406.3830.

[6] E. Iqbal, R. Mallah, R.G. Jackson, M. Ball, Z.M. Ibrahim, M. Broadbent, O. Dzahini, R. Stewart, C. Johnston, R.J.B. Dobson, Identification of Adverse Drug Events from Free Text Electronic Patient Records and Information in a Large Mental Health Case Register, PLOS ONE 10 (2015) e0134208. doi:10.1371/journal.pone.0134208.

[7] R.G. Jackson, M. Ball, R. Patel, R.D. Hayes, R.J. Dobson, R. Stewart, TextHunter – A User Friendly Tool for Extracting Generic Concepts from Free Text in Clinical Research, AMIA Annu. Symp. Proc. 2014 (2014) 729–738.