Design Topics¶
Here are various topics related to the design of art_modern.
Motivation¶
ART is an excellent software for simulating reads from a reference genome. However, it comes with the following limitations:
The implementation is not parallelized.
The implementation may consume excessive memory or fail on genomes with a limited number of chromosomes that are enormous in size. That is. larger than
signed int32.The implementation may consume too much memory or fail on transcriptome or template collection that contains an enormous number of contigs (cDNAs or other types of template molecules).
The implementation only supports unified coverage, which is not suitable for transcriptome simulation, where each cDNA molecule has its own expression level.
The implementation outputs alignments in SAM format, which is not space-efficient. We also spot some SAM records generated by
art_illuminawhose query length inferred from CIGAR strings differs from the actual query length.
So, we developed art_modern with the following ideas:
Parallelization is implemented using another layer of abstraction: a simulation job, which can be treated as a unit of work in a single thread.
The writer for the SAM output format was re-implemented using HTSLib, which supports BAM and headless SAM/BAM output formats with minimal code modifications.
Multiple FASTA parsers were added. For example, the
htslibparser enables on-disk random access of enormous genomes without reading them into memory, while thestreamparser allows streaming of FASTA files. All coordinates are now int64-based.The low-level I/O routines are made asynchronous to improve performance.
Support other random number generators like Intel OneAPI Math Kernel Library (OneMKL).
Parallelism¶
The parallelization strategy of different modes and input parsers is as follows:
Parser \ Mode |
|
|
|
|---|---|---|---|
|
Coverage |
Batch |
Batch |
|
Coverage |
ERROR |
ERROR |
|
ERROR |
Batch |
Batch |
Here, “coverage” means that all contigs are passed to all threads with coverage divided by the number of threads, while “batch” means that the entire data was split into batches (partitioning the existing in-memory data structure for memory parser or --i-batch_size for stream parser) and passed to different threads. The current batch-based design may be suboptimal if transcript/template lengths or coverages are not evenly distributed. You’re recommended to shuffle the input data to avoid such problems using, i.e., seqkit shuffle. For even larger data, a proposed MPI-based parallelization strategy is in TODO.md.
For MPI-based parallelization, the strategy is as follows:
For
wgsparser, just divide the sequencing depth.For non-
wgsparser, skip records based on MPI rank and word size.
The coverage-based parallelization strategy may yield a slightly different number of reads than single-threaded execution due to rounding errors. This is adjusted in 1.2.0, where the number of reads generated on the positive and negative strands will be adjusted by 1 (SE) or 2 (PE/MP) to make the number of generated bases and the number of required bases as close as possible.
For example, consider generating 125-nt reads, 5.0 positive and 5.0 negative depth for a contig of length 225.
In SE mode, 9 reads will be generated on positive and negative strands.
In PE/MP mode, 10 and 8 reads will be generated on positive and negative strands, respectively.
Random Generators¶
Bits Generation¶
The current random number generation function in each library is MT19937, which may not be the best choice for performance-critical applications. However, it is the most widely used, well-known, and moderate-performance cycle, and is implemented in all random number generator libraries (namely, Boost, STL, and Intel OneAPI MKL).
We choose not to use Abseil Random since (1) it is hard to bundle the entire Abseil Random library with the project, and (2) its performance is not satisfying. Using the Intel compiler, even absl::InsecureBitGen is slower than either boost::random::mt19937 or std::mt19937.
We choose not to use cuRAND since it is hard to configure.
We choose not to use the GSL random generator since its performance is unsatisfactory.
The current implementation also supports the PCG random generator experimentally. This random generator should be faster than STL random generators.
The current version supports seeding through a seed allocation algorithm described in our bioRxiv preprint.
Distribution Sampling¶
Distribution sampling is currently implemented using each library’s own method. For example, boost::random::uniform_int_distribution for Boost random generators, std::uniform_int_distribution for STL and PCG, and viRngUniform for Intel MKL. Further decoupling may be needed.
I/O¶
The generated read data, represented by the PairwiseAlignment class (src/lib/PairwiseAlignment.hh), will be passed to an output dispatcher, which dispatches the data to all output writers registered to it.
Output writers are asynchronous. The PairwiseAlignments are first formatted by each thread (since setting bam1_t* is time-consuming) and then passed to a multi-producer single-consumer lock-free queue, which is then consumed by the file appender in a separate thread.
We choose not to support UCSC 2-bit format since developing a 2-bit parser is time-consuming, especially when endianness (2-bit files allow both endianness), 64-bit offsets, masking, and other edge cases are considered. Additionally, whether the UCSC 2-bit format improves I/O performance is questionable. See this HTSJDK PR for details.
Other Performance Bottlenecks¶
A majority of the time was spent generating quality scores, which extensively calls the std::map::lower_bound() function. This was addressed by using Walker’s algorithm.
Miscellaneous¶
File System Support¶
We use boost::filesystem to handle file system operations, such as creating directories, since it provides a consistent interface across different platforms. This also allows us to avoid dealing with platform-specific file system APIs. The std::filesystem implementation across compilers is inconsistent. Some may require additional linker flags (-lstdc++fs for GCC prior to or equal to 9.1; -lc++fs for Clang prior to or equal to 9.0; -lc++experimental for Clang prior to or equal to 7.0). There are also various reports on how those implementations deal with the terminating / when invoking the std::filesystem::creare_directories(). So for simplicity, we use boost::filesystem instead.
See this note in cppreference, this StackOverflow question, and this AskUbuntu question.
Thread Pool Implementation¶
Intel TBB is an excellent library that supports diverse parallel programming models and data structures. However, this library loads into memory at run time, which can consume a lot of time in short-running applications. This also makes the project incapable of being distributed in a fully static form (See Design.md).