Getting to Grips with Sequencing Alignment

This post was submitted by Front Line Genomics, a media company with a mission to help deliver the benefits of genomics faster.

Front Line Genomics is transforming the way the genomics community interact, get inspired and do business. Through their website, their magazine and the Festivals of Genomics, they connect scientists, clinicians, business leaders and students with ideas, solutions, inspiration, news, and most importantly, with each other.

What exactly is sequencing alignment, you may ask? Using our educational guide, Clinical Genomics 101, we are going to try our very best to help you break down this process. Next Generation Sequencing (NGS) data analysis is broken down into three main stages, each with its own challenges: primary, secondary and tertiary analysis. It is the second stage that we focus on here, as it primarily consists of genome alignment and assembly, as well as quality filtering and variant calling.

After NGS has taken place, the sequencer will have produced the data as a series of short reads due to the breaking up of the DNA prior to sequencing; to recreate the true genome sequence, these reads need to be put back together in the correct order. The main purpose of secondary analysis is therefore to reassemble and organise these fragments into a single coherent sequence that represents the original genome, a process usually known as genome assembly.

Before the reassembly process begins, the reads are assessed and filtered by length or reported quality to ensure that the best results are obtained. Once the data has been optimised, genome assembly will usually proceed in one of three ways:

  1. Reference Genome Mapping: Arguably the simplest and most frequently used method, reference genome mapping involves aligning reads against a complete reference genome that is a representative example of the genome of the species in question. By using a completed genome as a template, a computer program can map the reads to their most likely positions with high accuracy and speed. The mapped reads can then be subjected to variant calling to highlight any differences between the sample and the reference, and potentially flag up important biological information.
  2. De Novo Sequence Assembly: In contrast to reference genome mapping, de novo sequence assembly doesn’t involve a reference genome at all. Instead, the reads are assembled through a complex computational process, using techniques such as constructing de Bruijn graphs from k-mers (for short reads) or an overlap-layout-consensus approach (typically for long reads), to build the sequence that is statistically most likely to be accurate. Long reads can span various repetitive and complex elements of the genome, regions that short reads are usually unable to resolve.
  3. Assembly Using Graph-Based Reference Genomes: This genome assembly technique is very similar to reference genome mapping in that reads are mapped directly to a reference. However, instead of aligning against a single reference genome, graph-based referencing uses a ‘graph’ that combines genomic data from hundreds of thousands of individuals in a population into one reference. This method has the potential to become more accurate and less computationally expensive than reference genome mapping. It also allows an individual’s data to be used while maintaining anonymity, and as such holds significant potential in solving the privacy issues related to genomic studies. Graph-based reference genomes also offer many of the advantages of de novo assembly, such as identifying structural variations.
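To make the first route a little more concrete, here is a toy Python sketch of the filter, map and call-variants sequence described above. It is only an illustration under some loud assumptions: the function names, the thresholds, the tiny reference string and the reads are all made up for this example, quality strings are assumed to use Phred+33 encoding, and the "seed with the first k-mer, then count mismatches" strategy is a heavy simplification of what production aligners actually do.

```python
from collections import defaultdict

def mean_phred(quality_string, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding is assumed)."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)

def filter_reads(reads, min_length=8, min_mean_quality=20):
    """Keep (sequence, quality) pairs that pass the illustrative thresholds."""
    return [
        (seq, qual) for seq, qual in reads
        if len(seq) >= min_length and mean_phred(qual) >= min_mean_quality
    ]

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read, reference, index, k, max_mismatches=2):
    """Seed with the read's first k-mer, keep the candidate with fewest mismatches."""
    best = None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return None if best is None else best[0]

def call_variants(mapped_reads, reference):
    """Report (position, reference_base, read_base) wherever a mapped read differs."""
    variants = set()
    for pos, read in mapped_reads:
        for offset, base in enumerate(read):
            if reference[pos + offset] != base:
                variants.add((pos + offset, reference[pos + offset], base))
    return sorted(variants)

# Hypothetical toy data: a short "reference genome" and four reads.
reference = "ACGTTGCAAGGCTTACCGATAGGCTA"
reads = [
    ("TGCAAGGCTTAC", "IIIIIIIIIIII"),      # perfect match to the reference
    ("GGCTTACCGTTAGG", "IIIIIIIIIIIIII"),  # carries one substitution
    ("ACGT", "IIII"),                      # too short, filtered out
    ("CCGATAGGCTA", "!!!!!!!!!!!"),        # quality too low, filtered out
]

k = 5
index = build_kmer_index(reference, k)
kept = filter_reads(reads)

mapped = []
for seq, _ in kept:
    pos = map_read(seq, reference, index, k)
    if pos is not None:
        mapped.append((pos, seq))

variants = call_variants(mapped, reference)
print(variants)
```

With this toy data, two reads survive filtering, they map to positions 4 and 9 (counting from zero), and a single A-to-T difference is flagged at position 18. Real variant callers also weigh read depth and base quality before reporting a difference, which this sketch ignores.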

Currently, of the three techniques, reference genome mapping offers the greatest speed and the lowest demand on computing resources. However, de novo methods and graph-based references have the advantage of avoiding bias towards a single reference, which helps in the detection of variations (such as structural variants) that could be overlooked when aligning a sequence against a single reference genome. At present, graph-based reference genomes are still recent developments and have not been fully optimised. In future they may become more popular tools, but for now they are still limited in their practical applications and need further development.
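For a feel of what de novo assembly does without a reference, here is a minimal Python sketch of the de Bruijn graph idea: reads are chopped into k-mers, each k-mer links its first k-1 letters to its last k-1 letters, and a contig is read off by following the links. The function names and the three toy reads are hypothetical, and the walk only handles the easy case of a single unbranched path. Real assemblers have to cope with sequencing errors, repeats and branching, none of which is attempted here.

```python
from collections import defaultdict

def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous_path(graph):
    """Follow edges from the unique start node while there is exactly one way forward."""
    targets = {t for successors in graph.values() for t in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        return None  # branching or circular graph: beyond this sketch
    node = starts[0]
    contig = node
    visited = {node}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in visited:
            break  # guard against cycles
        visited.add(node)
        contig += node[-1]  # each step extends the contig by one base
    return contig

# Hypothetical overlapping reads from a 12-base toy "genome".
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, 4)
contig = walk_unambiguous_path(graph)
print(contig)
```

Here the three overlapping reads collapse into the single contig ATGGCGTGCAAT. The choice of k matters in practice: too small and repeats tangle the graph, too large and reads stop overlapping.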

And there we have it! We hope this has helped with your understanding of this complex process. The rest of our guide talks you through the practical and legal complications you’ll need to navigate when collecting samples, how to obtain data and what to do with it to achieve results, and how these results can be presented to patients and clinicians so that we can see a real difference in patient treatment and care. If you have been left wanting more (hopefully you have), then you can download our full guide for free here.
