
Architecture to solve the genome data mountain


The 100,000 genomes programme, which aims to read and decipher the genetic codes of 100,000 NHS patients, is receiving assistance from Oxford Professor of Software Engineering Jim Davies. As chief technology officer, he will define the information architecture. His goal is to sift, analyse and present the huge amounts of raw data to doctors in a way that they can understand.

The UK Government recently announced an investment of almost £300 million in Genomics England, a new initiative to read and decipher the genetic codes of 100,000 NHS patients. The project has the potential to transform the future of healthcare, with new and better tests, drugs and treatment, and to enable further scientific discoveries.

The human genome is the complete set of genetic information for humans, as recorded within DNA. Sequencing the human genome is another way of saying that the genetic code of the DNA molecule is ‘read’ letter by letter, until all three billion letters are written down in the right order. So, sequencing the genome of a person with cancer or someone with a rare disease – or better still thousands of people – has the potential to help scientists and doctors understand how disease works.

The Human Genome Project, which was launched in 1990 and completed in 2003, produced the first complete sequence of a human genome. That first sequence took 13 years and many hundreds of millions of pounds to produce, but a full genome can now be sequenced in a few days for less than £1,000. Now that sequencing has become relatively straightforward, the real difficulty lies in making sense of the genomic information, and in designing software that can detect patterns in the DNA that are too complex for the human eye.

Genomics England, which is wholly owned and funded by the Department of Health, has been set up to sequence 100,000 whole genomes from NHS patients by 2017. The data needed to do this is large and complex: to get to 100,000 genomes, Genomics England will be collecting 10 petabytes of sequence data, and detailed, relevant health data on up to 100,000 people. The genome data is also precious and needs to be stored securely and with rigorous conditions for access.

This ‘big data’ problem is where Oxford Professor of Software Engineering Jim Davies comes in. He’s the chief technology officer for Genomics England, where he is helping to define the information architecture for the UK 100,000 genomes programme.

The raw data from one genome would occupy almost all of the average laptop’s storage, and the annotations alone would easily fill a DVD. This mountain of data needs to be sifted, analysed and presented in a way that is helpful to doctors, most of whom will not have specialist knowledge of gene changes.

Jim explains: ‘The largest files are those that record the alignments of the reads from the sequencing machines: a whole genome sequence generated from a blood sample (at “30x” coverage) takes up 70 gigabytes, which is 16 DVDs in old money; the file for a cancer tumour sample (at “50x” coverage, or greater) will take up 100 gigabytes or more. Many of the users of the service won’t be interested in looking at these files, but we need to keep them around.’
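As a rough check on how these per-genome figures relate to the 10 petabytes mentioned earlier, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not Genomics England's own sizing method: the per-genome sizes and the genome count come from the article, while the 4.7 GB single-layer DVD capacity and the use of decimal units (1 PB = 1,000,000 GB) are assumptions made here.

```python
import math

# Figures quoted in the article
GENOMES = 100_000
GB_PER_BLOOD_GENOME = 70     # alignment file at "30x" coverage
GB_PER_TUMOUR_GENOME = 100   # alignment file at "50x" coverage (lower bound)

# Assumptions made for this sketch (not stated in the article)
GB_PER_PB = 1_000_000        # decimal units
DVD_CAPACITY_GB = 4.7        # single-layer DVD
GB_PER_GIB = 1.073741824     # 2**30 bytes expressed in decimal gigabytes


def total_petabytes(genomes, gb_per_genome):
    """Total storage in petabytes if every genome took gb_per_genome."""
    return genomes * gb_per_genome / GB_PER_PB


def dvds_needed(size_gb):
    """Whole DVDs needed to hold size_gb decimal gigabytes."""
    return math.ceil(size_gb / DVD_CAPACITY_GB)


low = total_petabytes(GENOMES, GB_PER_BLOOD_GENOME)
high = total_petabytes(GENOMES, GB_PER_TUMOUR_GENOME)
print(f"Alignment files alone: {low:.0f} to {high:.0f} PB")
print(f"DVDs for 70 decimal GB: {dvds_needed(70)}")
print(f"DVDs for 70 binary GiB: {dvds_needed(70 * GB_PER_GIB)}")
```

Under these assumptions the alignment files alone come to between 7 and 10 petabytes, which is in line with the 10 petabyte total. The DVD count comes out at 15 if the 70 gigabytes are decimal and 16 if they are binary, so the quoted "16 DVDs" is consistent with the binary reading.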
