Living organisms have their DNA organized into chromosomes, with each complete set of chromosomes present in two copies (in diploid organisms such as humans and many other animals) or more (in polyploid organisms such as some plants). Each copy within a given set is called a haplotype. Haplotypes are highly similar but show biologically important differences, called variants, which are of high interest because they may be involved in biological processes or genetic diseases.
However, most genome representations available today are monoploid, in the sense that they are built from a mix of all haplotypes, which masks variants and leads to missing or erroneous information. Our aim is to build a reference sequence for each haplotype of a genome, taking as input genome subsequences, called reads, produced by a sequencing machine.
Here we propose a combinatorial method for diploid and polyploid haplotype phasing of long-read data. We formulate haplotype phasing as an optimization problem and solve it using Answer Set Programming (ASP) with the clingo system.
Rather than providing a single and likely erroneous answer to this hard problem, the ASP framework makes it possible to reason over the set of possible solutions. Moreover, ASP is a high-level declarative language that offers both efficiency (it builds on SAT-solver techniques) and expressiveness (greater than that of integer linear programming, ILP, for example): the user can easily express preferences and get a global view of confident and ambiguous positions in phased regions.
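To make the idea of reasoning over the set of solutions concrete, the following minimal sketch phases a toy diploid instance by brute force in Python. It is only an illustration and not the ASP encoding used in this work. It assumes a minimum error correction (MEC) objective, a common formulation of phasing that is not stated above, and it assumes every position is heterozygous, so the second haplotype is the complement of the first. The reads and the names `mec_cost`, `optimal_phasings` and `split_positions` are hypothetical.

```python
from itertools import product

# Toy reads over 4 variant positions (made-up data).
# 0/1 = allele observed at that position, None = position not covered by the read.
READS = [
    (0, 0, 1, None),
    (1, 1, 0, None),
    (0, 0, None, None),
    (None, None, 1, 0),
    (None, None, 1, 1),  # conflicts with the previous read at the last position
]


def mismatches(read, hap):
    """Count covered positions where the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, hap) if r is not None and r != h)


def mec_cost(reads, hap):
    """Assumed MEC objective: each read is assigned to whichever of the two
    complementary haplotypes it fits best; the cost is the total mismatches."""
    comp = tuple(1 - a for a in hap)
    return sum(min(mismatches(r, hap), mismatches(r, comp)) for r in reads)


def optimal_phasings(reads, n_pos):
    """Enumerate every candidate haplotype and keep all optimal ones.
    The first allele is fixed to 0 to avoid counting each pair twice."""
    candidates = [(0,) + rest for rest in product((0, 1), repeat=n_pos - 1)]
    costs = {hap: mec_cost(reads, hap) for hap in candidates}
    best = min(costs.values())
    return best, sorted(h for h, c in costs.items() if c == best)


def split_positions(solutions):
    """A position is confident if all optimal solutions agree on it."""
    confident, ambiguous = [], []
    for i, column in enumerate(zip(*solutions)):
        (confident if len(set(column)) == 1 else ambiguous).append(i)
    return confident, ambiguous


best, sols = optimal_phasings(READS, 4)
confident, ambiguous = split_positions(sols)
print("optimal cost:", best)
print("optimal haplotypes:", sols)
print("confident positions:", confident, "ambiguous positions:", ambiguous)
```

On this toy input, two haplotypes reach the same optimal cost and differ only at the last position, so that position is reported as ambiguous while the others are confident. Exhaustive enumeration is exponential in the number of positions and is used here only for readability; a solver such as clingo is what makes this kind of reasoning practical on real data.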