Linear integer programming approach for chloroplast genome scaffolding
Victor Epain  1@  , Rumen Andonov  2@  
1 : Inria Rennes – Bretagne Atlantique
Institut National de Recherche en Informatique et en Automatique, GenScale
2 : Institut de Recherche en Informatique et Systèmes Aléatoires
Universite de Rennes 1, GenScale

Introduction

Genome assembly aims to assemble the genome from a large amount of small DNA sequences named reads. This process can be roughly separated in two main steps. First, reads are merged into sequences named contigs. They correspond to paths in a graph that models relations between reads. For any contig, its orientation defines which DNA strand it belongs to. There are two orientations as the two DNA strands are read in reverse orientations. This study exclusively focuses on the second stage of the genome assembly, scaffolding, in which contigs are put in correct order and orientation towards the completion of the final assembly. Usually this step uses additional data e.g. mate pairs' distances, or homology references from near-species. In contrast to that, we demonstrate here that chloroplasts' genomes can be assembled without any raw additional data. It is well known that chloroplasts' genomes possess highly conserved circular and quadripartite structures [2] (with a pair of dispersed inverted repeat regions, separated by two unique regions, see Figure 1) is sufficient.

Method

The first step outputs an assembly graph where each vertex corresponds to a contig and is provided with an estimated multiplicity number. This number corresponds to an upper bound of repeats' contig in the genome. In the sequel we use another graph where each vertex is duplicated according to its multiplicity number and to the two contig orientations. Edges modelising contigs' successions between in provided assembly graph are duplicated respectively. We model the genome assembly as an elementary circuit in this graph, see Figure (2). We formulate the dispersed repeats with linear constraints and we search for such a circuit using Integer Linear Programming similarly to [1]. Inverted repeats correspond to occurrences of contigs paired with other occurrences of them but in reverse orientation, see Figure (1). Therefore paired contigs positions on the assembled sequence must satisfy nested-pairs pattern. We formulate the above constraints in terms of linear program where the objective is to maximise the nested-pairs number. This results in finding a couple of longest contiguous inverted repeats. Thus, we generalise a similar approach applied for RNA folding [4]. Indeed, in contrast to the later approach where the vertices correspond to bases with known sequence indices, in our case the positions of the contigs are variables. Our tool is implemented with Python 3 and uses the open-source PuLP package which integrates a free solver CBC [5] to solve the above optimisation problem.

Results

We verified our method with the well-known assembly evaluation tool QUAST [3]. We run 8 instances on a laptop (16GB RAM, 8 cores), and we obtained very encouraging preliminary results, with high genome coverage (mostly >99%), and very low mismatches and indels rates. In the case where all links between contigs are provided, our approach permits to finish pre-assembled genomes in just a few seconds, for graphs that not exceed 160 nodes and 284 edges and a dozen of candidate nested pairs.

Personnes connectées : 1