Medical information extraction with deep neural networks
Supervisors
Suitable for
Abstract
Background
Electronic Health Records (EHRs) are the databases used by hospitals and general practitioners
to log, day to day, all the information they record about patients (e.g. disorders, medications taken, symptoms, medical tests).
In terms of number of subjects (e.g. 50 million patients in the case of EMIF, http://www.emif.eu/), EHRs are the largest source of empirical
data in biomedical research, and have enabled major scientific findings on disorders such as cancer and Alzheimer’s
disease [1]. However, most of the information held in EHRs is in the form of natural language text (written by the physician
during each session with each patient), making it inaccessible for research. Unlocking this information would be a
significant advance for biomedical research, multiplying the quantity and variety of scientifically usable data, which
is why major efforts towards this aim have recently been initiated (e.g. the I2B2 challenges, https://www.i2b2.org/NLP/).
In Artificial Intelligence, Information Extraction (IE) is the task of systematically extracting information from
natural language text in a form that can later be processed by a computer. For instance, if a physician describes the symptoms
and full treatment of a patient, an IE task could be to identify, from the text alone, all the drugs prescribed to the patient
and their dosages. Although Natural Language Processing (NLP) algorithms can perform this task with fair accuracy in simpler situations
(e.g. well-structured text, large amounts of labelled data available), it remains an unsolved problem in
more complex cases (e.g. badly structured language, no labelled samples), which are closer to the text typically
found in EHRs. Specifically, physicians tend to use badly formatted shorthand and uncommon acronyms (e.g. ‘transport
pt to OT tid via W/C’ for ‘transport patient to occupational therapy three times a day via wheelchair’),
while labelled records are scarce (numbering in the hundreds for a given task, with the best labelled datasets provided by I2B2).
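To make the shorthand problem concrete, the following is a minimal Python sketch of the most naive approach: a dictionary lookup that expands the abbreviations from the example above. The table `ABBREVIATIONS` and the function `expand_shorthand` are hypothetical names introduced here for illustration; only the four entries taken from the example are grounded in the text.

```python
# Hypothetical abbreviation table: only these four entries come from the
# example in the text; a real clinical lexicon would be far larger.
ABBREVIATIONS = {
    "pt": "patient",
    "ot": "occupational therapy",
    "tid": "three times a day",
    "w/c": "wheelchair",
}


def expand_shorthand(note: str) -> str:
    """Replace each whitespace-separated token that appears in the table."""
    tokens = note.split()
    expanded = [ABBREVIATIONS.get(tok.lower(), tok) for tok in tokens]
    return " ".join(expanded)


print(expand_shorthand("transport pt to OT tid via W/C"))
# transport patient to occupational therapy three times a day via wheelchair
```

A lookup of this kind only works when every abbreviation has a single meaning and is already known. Clinical shorthand is often ambiguous (for example, 'OT' is also used for 'operating theatre') and varies between institutions, which is one reason context-aware models are of interest here.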
Project
Recent Deep Neural Network (DNN) architectures [5] have shown remarkable results
on traditionally unsolved NLP problems, including IE tasks such as Slot Filling [2] and Relation Classification [3].
When transferring this success to EHRs, DNNs offer the advantage of not requiring well-formatted text, but the problem
of scarce labelled data remains (hundreds of samples for EHRs, rather than the tens of thousands used in typical DNN studies).
However, ongoing work in our lab has shown that certain extensions of recent NLP-DNN architectures can reproduce the typical
success of DNNs in situations with limited labelled data (paper in preparation). Specifically, incorporating interaction
terms into feed-forward DNN architectures [4] can raise the performance of relation classification on I2B2 datasets from an
F1 score of 0.65 to 0.90, while the highest performance previously reported on the same dataset was 0.74. With an F1 score of 0.90,
the quality of the extracted information meets the standards required for such information to be used in subsequent biomedical
studies, promising to unlock the scientific data that is at present hidden in the free text of EHRs.
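For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for a set of predicted relations against a gold standard. The relation triples are invented for illustration, and the exact averaging scheme used in the I2B2 evaluation is not specified here, so this shows only the basic definition.

```python
def f1_score(predicted: set, gold: set) -> float:
    """F1 = 2PR / (P + R), with P = precision and R = recall."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented toy relations (entity, entity, label), for illustration only.
gold = {
    ("aspirin", "headache", "treats"),
    ("ibuprofen", "fever", "treats"),
    ("warfarin", "bleeding", "causes"),
}
predicted = {
    ("aspirin", "headache", "treats"),
    ("ibuprofen", "fever", "treats"),
    ("aspirin", "fever", "treats"),
}

# Two of three predictions are correct and two of three gold relations are
# found, so precision = recall = 2/3 and F1 = 2/3.
print(round(f1_score(predicted, gold), 3))  # 0.667
```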
We therefore propose to apply DNNs to the problem of information extraction in EHRs, using I2B2 data as a testbed. More specifically, the
DNNs designed and implemented by the student should be able to extract the drugs that were prescribed to the patient. This
corresponds to the challenge released by I2B2 in 2009 (https://www.i2b2.org/NLP/Medication/), for which a high-quality labelled
dataset was provided. The student is free to use the extension of the feed-forward DNN developed in our lab, or to explore
other feed-forward or recurrent (e.g. RNN, LSTM or GRU) alternatives. The DNN should be implemented in Python, as an extension
of the Keras (https://keras.io/) library running on either Theano (http://deeplearning.net/software/theano/) or TensorFlow (https://www.tensorflow.org/).
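One possible way to frame drug extraction for a recurrent model is token-level sequence labelling with BIO tags (Begin, Inside, Outside). This framing is an assumption made for illustration and is not prescribed by the project. The sketch below shows how training targets could be prepared under that assumption; the sentence, the spans, the `DRUG` label name and the function `bio_tags` are all hypothetical.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Label tokens as B-DRUG / I-DRUG / O given [start, end) token spans."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B-DRUG"          # first token of a drug mention
        for i in range(start + 1, end):
            tags[i] = "I-DRUG"          # remaining tokens of the mention
    return tags


# Invented example sentence; token indices 4-5 ("folic acid") and 7
# ("aspirin") are the annotated drug mentions.
tokens = "Patient was started on folic acid and aspirin daily".split()
spans = [(4, 6), (7, 8)]

for token, tag in zip(tokens, bio_tags(tokens, spans)):
    print(f"{token}\t{tag}")
```

Each token then has exactly one target label, which is the shape of output a per-token softmax layer on top of an LSTM or GRU would be trained to predict.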
Bibliography
[1] G. Perera, M. Khondoker, M. Broadbent, G. Breen, R. Stewart, Factors Associated with Response to Acetylcholinesterase Inhibition in Dementia: A Cohort Study from a Secondary Mental Health Care Case Register in London, PLOS ONE. 9 (2014) e109484. doi:10.1371/journal.pone.0109484.
[2] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, G. Zweig, Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding, IEEE/ACM Trans. Audio Speech Lang. Process. 23 (2015) 530–539. doi:10.1109/TASLP.2014.2383614.
[3] C.N. dos Santos, B. Xiang, B. Zhou, Classifying Relations by Ranking with Convolutional Neural Networks, CoRR. abs/1504.06580 (2015). http://arxiv.org/abs/1504.06580.
[4] M. Denil, A. Demiraj, N. Kalchbrenner, P. Blunsom, N. de Freitas, Modelling, Visualising and Summarising Documents with a Single Convolutional Neural Network, CoRR. abs/1406.3830 (2014). http://arxiv.org/abs/1406.3830.
[5] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature. 521 (2015) 436–444. doi:10.1038/nature14539.