Design Topics

This document covers various topics related to the design of art_modern.

Motivation

ART is an excellent tool for simulating reads from a reference genome. However, it comes with the following limitations:

  • The implementation is not parallelized.

  • The implementation may consume too much memory or fail on genomes that contain a small number of enormous chromosomes (longer than a signed 32-bit integer can index).

  • The implementation may consume too much memory or fail on transcriptomes or template collections that contain an enormous number of contigs (cDNAs or other types of template molecules).

  • The implementation only supports uniform coverage, which is not suitable for transcriptome simulation, where each cDNA molecule has its own expression level.

  • The implementation outputs alignments in SAM format, which is not space-efficient. We also spotted SAM records generated by art_illumina whose query lengths inferred from their CIGAR strings differ from the actual query lengths.

So, we developed art_modern with the following ideas:

  • Parallelization is implemented using another layer of abstraction, the simulation job, which can be regarded as a unit of work executed by one thread.

  • The writer for the SAM output format was re-implemented using HTSLib, which allows supporting BAM and headless SAM/BAM output formats with minimal code modifications.

  • Multiple FASTA parsers were added. For example, the htslib parser allows on-disk random access to enormous genomes without reading them into memory, while the stream parser allows streaming of FASTA files. All coordinates are now int64-based.

  • The low-level I/O routines are made asynchronous to improve performance.

  • Support for other random number generators, like the Intel OneAPI Math Kernel Library (OneMKL), was added.

Parallelism

The parallelization strategies for different modes and input parsers are as follows:

Parser \ Mode   wgs        trans   templ
memory          Coverage   Batch   Batch
htslib          Coverage   ERROR   ERROR
stream          ERROR      Batch   Batch

Here, “coverage” means that all contigs are passed to all threads with the coverage divided by the number of threads, while “batch” means that the data are split into batches (by partitioning the existing in-memory data structure for the memory parser, or by --i-batch_size for the stream parser) and passed to different threads. The current batch-based design may be suboptimal if transcript/template lengths or coverages are not evenly distributed. You are recommended to shuffle the input data, e.g., using seqkit shuffle, to avoid such problems. For even larger data, a proposed MPI-based parallelization strategy is in TODO.md.

For MPI-based parallelization, the strategy is as follows:

  • For the wgs parser, just divide the sequencing depth.

  • For non-wgs parsers, skip records based on the MPI rank and world size.

The coverage-based parallelization strategy may generate a slightly different number of reads compared to single-threaded execution due to rounding errors. This was adjusted in 1.2.0, where the numbers of reads generated on the positive and negative strands are adjusted in steps of 1 (SE) or 2 (PE/MP) to make the number of generated bases as close as possible to the number of required bases.

  • For example, consider generating 125-nt reads at 5.0 depth on each of the positive and negative strands for a contig of length 225.

  • In SE mode, 9 reads will be generated on each of the positive and negative strands.

  • In PE/MP mode, 10 and 8 reads will be generated on the positive and negative strands, respectively.

Random Generators

Bits Generation

The random number generator currently used from each library is MT19937, which may not be the best choice for performance-critical applications. However, it is the most widely used and well-known, has moderate performance and a long cycle, and is implemented in all the random number generator libraries we support (namely, Boost, the STL, and Intel OneAPI MKL).

We choose not to use Abseil Random since (1) it is hard to bundle the entire Abseil random library with the project and (2) its performance is unsatisfactory: when compiled with the Intel compiler, even absl::InsecureBitGen is slower than either boost::random::mt19937 or std::mt19937.

We choose not to use cuRAND since it is hard to configure and its performance is unsatisfactory, likely due to the time spent copying data from the GPU, as the majority of our computation happens on the CPU.

We choose not to use the GSL random generator since its performance is unsatisfactory.

The current implementation also has experimental support for the PCG random generator, which should be faster than the STL random generators.

The current version does not support specifying a seed, since a fixed seed would result in reads with identical positions and error profiles in each thread. The seed is instead set from the combined hash of the nanoseconds since epoch, the hash of the current thread ID, and the MPI rank, which allows each thread to generate different data.

Distribution Sampling

Distribution sampling is currently implemented using each library’s own implementation: boost::random::uniform_int_distribution for Boost random generators, std::uniform_int_distribution for the STL and PCG, and viRngUniform for Intel MKL. Further decoupling may be needed.

I/O

The generated read data, represented as the PairwiseAlignment class (src/lib/PairwiseAlignment.hh), is passed to an output dispatcher, which dispatches the data to all output writers registered to it.

Output writers are asynchronous. The PairwiseAlignments are first formatted by each thread (since setting up bam1_t* is time-consuming) and then pushed onto a multi-producer single-consumer lock-free queue, which is consumed by the file appender in a separate thread.

We choose not to support the UCSC 2-bit format since developing a 2-bit parser is time-consuming, especially when endianness (2-bit files allow both endiannesses), 64-bit offsets, masking, and other edge cases are considered. Additionally, whether the UCSC 2-bit format improves I/O performance is questionable. See this HTSJDK PR for details.

Other Performance Bottlenecks

A majority of time was spent on the generation of quality scores, which extensively calls std::map::lower_bound(). This was addressed by using Walker’s algorithm (the alias method).

Miscellaneous

File System Support

We use boost::filesystem to handle file system operations like creating directories, since it provides a consistent interface across different platforms and lets us avoid dealing with platform-specific file system APIs. The std::filesystem implementation is not consistent across compilers. Some require additional linker flags (-lstdc++fs for GCC versions prior to or equal to 9.1; -lc++fs for Clang versions prior to or equal to 9.0; -lc++experimental for Clang versions prior to or equal to 7.0). There are also various reports on how those implementations deal with a terminating / when invoking std::filesystem::create_directories(). So, for the sake of simplicity, we use boost::filesystem instead.

See this note in cppreference, this StackOverflow question and this AskUbuntu question.

Thread Pool Implementation

Intel TBB is an excellent library that supports diverse parallel programming models and data structures. However, the library is loaded into memory at run time, which can consume considerable time in short-running applications. It also prevents the project from being distributed in a fully static form (see Design.md).