CloudLightning: The Opportunity for Genome Processing

Genomics is the study of all of a person’s genes (the genome), including interactions of those genes with each other and with the person’s environment. Since the late nineties, academic and industry analysts have identified the potential of genomics to shorten development time and reduce investment, largely through efficiency gains.

Genomics offers pharmaceutical companies long-term upside and competitive advantage in two ways. First, it yields savings right along the R&D value chain, including more efficient target discovery, lead discovery, and development. Second, it improves decision-making accuracy by providing more, better, and earlier information, which ultimately results in higher drug success rates. The net impact is more successful drug discovery.

Relatively small efficiency and accuracy gains in the pharmaceutical industry can result in disproportionately large benefits in terms of employment and GDP. However, genome processing requires substantial computational power and storage, which in turn demand significant infrastructure and specialist IT expertise. While larger organisations can afford such infrastructure, it is a significant cost burden for smaller pharmaceutical companies, hospitals and health centres, and researchers. Even when such infrastructure is in place, researchers may be stymied by inadequate offsite access.

Genomics has two core activities:

  • Sequencing: a laboratory-based process involving reading DNA from the cells of an organism and digitising the results
  • Computation: the processing, sequence alignment, compression and analysis of the digitised sequence

Historically, sequencing has accounted for the largest share of the total cost. However, this cost has decreased dramatically over the past decade due to breakthroughs in research and innovation in that area.

As the cost of sequencing has dropped, the cost of computation (alignment, compression and analysis) has formed a greater proportion of the total. The biggest consumer of compute runtime is sequence alignment – assembling the large number of individual short ‘reads’ which come out of the sequencer (typically a few hundred bases long) into a single complete genome. This work can be split into many processing jobs that run in parallel, each taking a batch of reads and aligning it against a reference genome. Significant input data is required, but little or no inter-node communication is needed. The most computationally intensive kernel in the overall process is local sequence alignment, using algorithms such as Smith-Waterman, which is very well suited to optimisation on heterogeneous compute technologies such as dataflow engines (DFEs).
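To make the alignment kernel and the batch decomposition concrete, the following is a minimal Python sketch of Smith-Waterman scoring with a simple linear gap penalty, plus a helper that splits reads into independent batches. It is an illustration under stated assumptions, not the CloudLightning implementation: the function names (smith_waterman, batches, align_batch) and the scoring values (match=2, mismatch=-1, gap=-2) are hypothetical choices made for this example, and production aligners typically use more elaborate scoring and heavily optimised kernels.

```python
def smith_waterman(read, reference, match=2, mismatch=-1, gap=-2):
    """Return (best_score, end_in_reference) for the best local alignment.

    Scoring values are illustrative assumptions, with a linear gap penalty.
    end_in_reference is the 1-based position where the best alignment ends
    (0 if no positive-scoring alignment exists).
    """
    prev = [0] * (len(reference) + 1)  # previous row of the scoring matrix
    best_score, best_end = 0, 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(reference) + 1)
        for j in range(1, len(reference) + 1):
            same = read[i - 1] == reference[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            up = prev[j] + gap        # gap in the reference
            left = curr[j - 1] + gap  # gap in the read
            # Local alignment: scores never drop below zero.
            curr[j] = max(0, diag, up, left)
            if curr[j] > best_score:
                best_score, best_end = curr[j], j
        prev = curr
    return best_score, best_end


def batches(reads, size):
    """Yield consecutive batches of at most `size` reads."""
    for start in range(0, len(reads), size):
        yield reads[start:start + size]


def align_batch(batch, reference):
    """Align every read in one batch; batches share no state."""
    return [smith_waterman(read, reference) for read in batch]


if __name__ == "__main__":
    reference = "TTACGTTT"
    reads = ["ACGT", "AAAA", "CGTT"]
    for batch in batches(reads, 2):
        print(batch, align_batch(batch, reference))
```

Because align_batch depends only on its own batch and the shared reference, each call could be dispatched to a separate worker or accelerator without any communication between workers, which is the property described in the paragraph above.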

This use case will demonstrate the use of the CloudLightning system in a large-scale simulation for genome processing. We anticipate greater energy efficiency resulting in lower costs. As the cost of the raw sequencing technology drops, the computing challenge becomes the final significant technology bottleneck preventing the routine use of genomics data in clinical settings. CloudLightning will target this bottleneck in two ways: by using heterogeneous computing technologies to offer significantly improved performance/cost and performance/Watt, and by enabling this computation to be hosted at large scale in the cloud, making it practical for wide-scale use.

In addition to bringing the cost of computation in genome processing back into line with sequencing costs, such a solution can significantly improve genome processing throughput and speed, thereby reducing the wider cycle time and increasing the volume and quality of related research. The benefit of a cloud solution for genome processing is clear: researchers, whether in large pharmaceutical companies, genomics research centres or health centres, can invest their energy and time in R&D rather than in deploying and managing complex on-site infrastructure.
