CMU-CB-22-100
Ray and Stephanie Lane Computational Biology Department
School of Computer Science, Carnegie Mellon University



CMU-CB-22-100

Theory and Practice of Low-Density Minimizer Sketches

Hongyu Zheng

February 2022

Ph.D. Thesis

CMU-CB-22-100.pdf


Keywords: Sequence Sketch, Sequence Analysis, Combintorial Algorithms, String Algorithms

Sequence sketch methods generate compact fingerprints of large string sets for efficient indexing and searching. Minimizers are one of such sketching methods, sampling k-mers from a string by selecting the minimal k-mer from each sliding window with a predetermined ordering. Minimizers sketches preserve information to detect sufficiently long substring matches. This favorable property of minimizers, and its ease of implementation, lead to wide adoption in a large number of software and pipelines, including read mappers, genome assemblers, sequence databases and more. Despite the method’s popularity, many theoretical and practical questions regarding its performance remain open. This is especially true regarding the density of minimizers, a simple yet powerful metric for minimizer performance. We have neither understanding of density growth under various conditions, nor principled approaches to construct minimizers with provably low density. This dissertation attacks the lack of knowledge in minimizer sketches from multiple fronts.

In the first part, we are primarily concerned with asymptotic behavior for the density of minimizer sketches. Minimizers are parameterized by the k-mer length k, a window length w and an order on the k-mers. Using a number of new techniques, we are able to provide a complete picture of how the density of optimal minimizer grows with its parameters w and k. We also derive structural lemmas for universal hitting sets and local schemes, two highly related concepts for minimizer design. Together, these results serve as building blocks for future theoretical advances.

The next two parts are focused on constructing low-density minimizers in two scenarios. In the second part, we propose the Miniception, the first construction of minimizers that provably achieves better densities than a random minimizer in practical configurations of w and k. This method is also simple to implement and requires minimal overhead compared to existing implementations of random minimizers. In the third and final part, we consider optimization of minimizer density on a given reference sequence. The most common use case is when the given sequence is the reference genome or transcriptome. We propose the concept of polar sets, which itself is a complementary concept of universal hitting sets and similarly can be used to construct minimizers. Using polar sets, we propose efficient algorithms to directly optimize the density of minimizers on a reference sequence up to some error. Experiments show that both the Miniception and the polar set algorithms can reliably outperform existing methods for constructing low-density minimizers, without a reference and with a reference, respectively

144 pages

Thesis Committee:
Carl Kingsford (Chair)
Guillaume Marçais
Hosein Mohimani
Maria Chikina (University of Pittsburgh)
Ron Shamir (Tel Aviv University)

Russell Schwartz, Head, Computational Biology Department
Martial Hebert, Dean, School of Computer Science



Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by reports@cs.cmu.edu