CMU-CB-22-100 Ray and Stephanie Lane Computational Biology Department School of Computer Science, Carnegie Mellon University
Theory and Practice of Low-Density Minimizer Sketches Hongyu Zheng February 2022 Ph.D. Thesis
Sequence sketch methods generate compact fingerprints of large string sets for efficient indexing and searching. Minimizers are one of such sketching methods, sampling k-mers from a string by selecting the minimal k-mer from each sliding window with a predetermined ordering. Minimizers sketches preserve information to detect sufficiently long substring matches. This favorable property of minimizers, and its ease of implementation, lead to wide adoption in a large number of software and pipelines, including read mappers, genome assemblers, sequence databases and more. Despite the method’s popularity, many theoretical and practical questions regarding its performance remain open. This is especially true regarding the density of minimizers, a simple yet powerful metric for minimizer performance. We have neither understanding of density growth under various conditions, nor principled approaches to construct minimizers with provably low density. This dissertation attacks the lack of knowledge in minimizer sketches from multiple fronts. In the first part, we are primarily concerned with asymptotic behavior for the density of minimizer sketches. Minimizers are parameterized by the k-mer length k, a window length w and an order on the k-mers. Using a number of new techniques, we are able to provide a complete picture of how the density of optimal minimizer grows with its parameters w and k. We also derive structural lemmas for universal hitting sets and local schemes, two highly related concepts for minimizer design. Together, these results serve as building blocks for future theoretical advances. The next two parts are focused on constructing low-density minimizers in two scenarios. In the second part, we propose the Miniception, the first construction of minimizers that provably achieves better densities than a random minimizer in practical configurations of w and k. This method is also simple to implement and requires minimal overhead compared to existing implementations of random minimizers. In the third and final part, we consider optimization of minimizer density on a given reference sequence. The most common use case is when the given sequence is the reference genome or transcriptome. We propose the concept of polar sets, which itself is a complementary concept of universal hitting sets and similarly can be used to construct minimizers. Using polar sets, we propose efficient algorithms to directly optimize the density of minimizers on a reference sequence up to some error. Experiments show that both the Miniception and the polar set algorithms can reliably outperform existing methods for constructing low-density minimizers, without a reference and with a reference, respectively
144 pages
| |
Return to:
SCS Technical Report Collection This page maintained by reports@cs.cmu.edu |