This is a guest blog post by Tim Downing. Dr. Tim Downing is a lecturer in genomics at DCU’s School of Biotechnology. His Infection Genomics research group investigates the evolutionary history, present-day epidemiology and future threats associated with bacterial and parasitic drug resistance using genomics, computational models and population genetics. He is chairperson of the Genetics & Cell Biology (GCB) degree, and is a member of the ARC-SYM computational science research centre. You can learn more about ARC-SYM here and follow Tim on Twitter.
A major challenge in the areas of genomics and computational modelling of cell behaviour is the incredible volume of data available for investigation. This includes a lab’s own datasets as well as how they might compare these with published work using the same quality control, processing, analysis and visualisation steps. Current methods for processing such voluminous datasets frequently favour computational efficiency at the cost of accuracy, depending on the goal. At one end is the rapid turn-around required for analysing the output from (for instance) an infection test, where speed is the key factor. At the other lies high-confidence inference of exact DNA sequences, cell movements or gene activity levels, so that more detailed inferences can be made. In both cases, the problem is similar to squinting at a distant chart at an optician’s: how confident are you that the chart says “one” or “none”?
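That “one or none” confidence has a standard quantitative form in sequencing: each base call comes with a Phred quality score, Q = −10·log₁₀(p), where p is the probability the call is wrong. As a minimal sketch (the function names here are illustrative, not from any particular library):

```python
import math

def phred_to_error_prob(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> int:
    """Phred score corresponding to an error probability p."""
    return round(-10 * math.log10(p))

# A score of Q30 means a 1-in-1000 chance the base is wrong,
# i.e. 99.9% confidence in that single letter.
print(phred_to_error_prob(30))   # 0.001
print(error_prob_to_phred(0.01)) # 20
```

Filtering out low-quality calls before analysis is one place where the speed-versus-accuracy trade-off above shows up directly: stricter thresholds mean fewer, more trustworthy letters.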
Improving data accessibility, data interoperability, IT and HPC approaches is fundamental to advancing new areas of bioinformatics. Examining genomic data begins with a massive set of short DNA or RNA sequences known as “reads”, decoded (typically) from light signals for each individual letter in a given read. Most frequently, these reads are compared to a group of reference sequences – like a genome composed of a bundle of chromosomes. This comparison is computed by sequence alignment, during which the chance of a given read belonging to a particular location in the genome is determined by how well its letters match those reference sequences. CloudLightning exploits the potential for parallelising this process, which is effective because each read can be examined independently in parallel during sequence alignment. This facilitates broader application of genome sequence analysis, ranging from the swift deduction of medically relevant information to exhaustive searches of huge, complex datasets.
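The independence of reads is what makes alignment so amenable to parallelisation. The toy sketch below illustrates the idea only: the reference string and mismatch-count scoring are stand-ins for what real aligners do with indexed data structures and probabilistic match scores, but the parallel structure – one independent task per read – is the same.

```python
from multiprocessing import Pool

# Toy reference "genome"; real references span millions of letters.
REFERENCE = "ACGTACGTTGCAACGTAGGCTAGCTTACGT"

def align_read(read: str):
    """Find the reference offset with the fewest mismatches for one read.
    Each call depends only on its own read, so reads can be scored in parallel."""
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(REFERENCE) - len(read) + 1):
        window = REFERENCE[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return read, best_pos, best_mismatches

if __name__ == "__main__":
    reads = ["ACGTA", "GGCTA", "TTACG", "AAAAA"]
    with Pool() as pool:  # one worker process per CPU core by default
        for read, pos, mm in pool.map(align_read, reads):
            print(f"{read} -> position {pos} ({mm} mismatches)")
```

Because `align_read` shares no state between calls, scaling it out is just a matter of distributing reads across cores – or, in a CloudLightning-style deployment, across machines.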
The advent of faster computational modelling of biological data has scope beyond real-time diagnostics against threats like Zika or Ebola, extending to better research that links millions of human genomes together. Cells communicate, interact and coordinate their behaviour, whether they are bacteria in dental plaque or the cells of a limb during development or healing. Researchers need effective methods to measure how cells move in space and time, and what molecules they send as messages to other cells. Like the transition from large photography machines recording low-resolution images to the modern era of fast, small, high-resolution camera-phones, scalable portable genome sequencing will become commonplace. At present, this is emerging as individual sequencing machines connected to laptops, but eventually it will be miniaturised and opened up to the wider population as an app on your phone.