# Design Topics

Here lists various topics related to the design of `art_modern`.

## Motivation

ART is an excellent software for simulating reads from a reference genome. However, it comes with the following limitations:

- The implementation is not parallelized.
- The implementation may consume too much memory or fail on genomes that contain a limited number of chromosomes of enormous (larger than `signed int32`) sizes.
- The implementation may consume too much memory or fail on transcriptome or template collection that contains an enormous number of contigs (cDNAs or other types of template molecules).
- The implementation only supports unified coverage, which is not suitable for transcriptome simulation where each cDNA molecule has its own expression level.
- The implementation outputs alignments in SAM format, which is not space-efficient. We also spot some SAM records generated by `art_illumina` whose query length inferred from CIGAR strings differs from the actual query length.

So, we developed `art_modern` with the following ideas:

- Parallelization is implemented using another layer of abstraction, simulation job, which can be taken as a unit of work in one thread, etc.
- Writer for SAM output format was re-implemented using [HTSLib](https://www.htslib.org/), which allows supporting BAM and headless SAM/BAM output format with minimal modifications of code.
- Multiple FASTA parsers were added. For example, the `htslib` parser allows on-disk random access of enormous genomes without reading them into memory, while the `stream` parser allows streaming of FASTA files. All coordinates are now int64-based.
- The low-level I/O routines are made asynchronized to improve performance.
- Support other random number generators like Intel OneAPI Math Kernel Library (OneMKL).

## Parallelism

The parallelization strategy of different modes and input parsers are as follows:

| Parser \ Mode | `wgs`     | `trans`   | `templ`   |
|---------------|-----------|-----------|-----------|
| `memory`      | Coverage  | Batch     | Batch     |
| `htslib`      | Coverage  | **ERROR** | **ERROR** |
| `stream`      | **ERROR** | Batch     | Batch     |

Here, "coverage" means that all contigs are passed to all threads with coverage divided by the number of threads, while "batch" means that the entire data were split into batches (partitioning the existing in-memory data structure for `memory` parser or `--i-batch_size` for `stream` parser) and passed to different threads. The current batch-based design may be suboptimal if transcript/template lengths or coverages are not evenly distributed. You're recommended to shuffle the input data to avoid such problems using, i.e., `seqkit shuffle`. For even larger data, a proposed MPI-based parallelization strategy is in [TODO.md](TODO.md).

For MPI-based parallelization, the strategy is as follows:

- For `wgs` parser, just divide sequencing depth.
- For non-`wgs` parser, skip records based on MPI rank and word size.

The coverage based parallelization strategy may generate slightly different number of reads compared to single-threaded execution due to rounding errors. This is adjusted in [1.2.0](#v-1.2.0-section), where the number of reads generated at positive and negative strands will be adjusted by 1 (SE) or 2 (PE/MP) to make the number of generated bases and the number of required bases as close as possible.

- For example, consider generating 125-nt reads 5.0 positive and 5.0 negative depth for a contig of length 225.
- In SE mode, 9 reads will be generated on positive and negative strands.
- In PE/MP mode, 10 and 8 reads will be generated on positive and negative strands respectively.

## Random Generators

### Bits Generation

The current random number generation function in each library is [MT19937](https://doi.org/10.1145/272991.272995), which may not be the best choice for performance-critical applications. However, it is the most widely used, well-known, of moderate performance and cycle, and is implemented in all random number generator libraries (namely, Boost, STL, and Intel OneAPI MKL).

We choose not to use [Abseil Random](https://abseil.io/docs/cpp/guides/random) since (1) It is hard to bundle the entire Abseil random library with the project and (2) Its performance is not satisfying. Using Intel compiler, even absl::InsecureBitGen is slower than either boost::random::mt19937 or std::mt19937.

We choose not to use [cuRAND](https://docs.nvidia.com/cuda/curand/index.html) since it is hard to configure and its performance is not satisfying, which may be due to the time spent on copying data from GPU as the majority of our computation happens on CPU.

We choose not to use [GSL](https://www.gnu.org/software/gsl/) random generator since its performance is not satisfying.

The current implementation also supports [PCG](https://www.pcg-random.org/) random generator experimentally. This random generator should be faster than STL random generators.

The current version does not support the specification of seed since it will result in reads with identical position and error profile in each thread. The current seed is set by the combined hash of nanoseconds since epoch, hash of the current thread ID, and the MPI rank to allow each thread to generate different data.

### Distribution Sampling

The distribution sampling is currently implemented using each library's own implementation. For example; `boost::random::uniform_int_distribution` for Boost random generators, `std::uniform_int_distribution` for STL and PCG, and `viRngUniform` for Intel MKL. Further decoupling may be needed.

## I/O

The generated read data, which is represented as `PairwiseAlignment` class (`src/lib/PairwiseAlignment.hh`), will be passed to an output dispatcher which dispatches the data to all output writers registered to it.

Output writers are asynchronous. The `PairwiseAlignment`s are firstly formatted by each thread (since setting of `bam1_t*` is time-consuming) and then passed to a multi-producer single-consumer lock-free queue, which is then consumed by the file appender in a separate thread.

We choose not to support [UCSC 2-bit](http://genome.ucsc.edu/FAQ/FAQformat.html#format7) format since Developing a 2-bit parser is time-consuming, especially when endianness (2-bit files allow both endianness), 64-bit offsets, masking, and other edge cases are considered. Additionally, whether the UCSC 2-bit format improves I/O performance is questionable. See [this HTSJDK PR](https://github.com/samtools/htsjdk/pull/1417) for details.

## Other Performance Bottlenecks

A majority of time was spent on generation of quality scores, which extensively calls `map::lower_bound()` function. This was addressed by using Walker's algorithm.

## Miscellaneous

(filesystem-section)=
### File System Support

We use `boost::filesystem` to handle file system operations like creating directories since it provides a consistent interface across different platforms. This also allows us to avoid dealing with platform-specific file system APIs. The `std::filesystem` implementation in different compilers is not consistent. Some may require additional linker flags (`-lstdc++fs` for GCC prior or equal to 9.1; `-lc++fs` for Clang prior or equal to 9.0; `-lc++experimental` for Clang prior or equal to 7.0). There are also various reports in how those implementations deal with the terminating `/` when invoking `std::filesystem::creare_directories()`. So for the sake of simplicity, we use `boost::filesystem` instead.

See [this note in `cppreference`](https://en.cppreference.com/w/cpp/filesystem), [this StackOverflow question](https://stackoverflow.com/questions/53365538/how-to-determine-whether-to-use-filesystem-or-experimental-filesystem) and [this AskUbuntu question](https://askubuntu.com/questions/1256440/how-to-get-libstdc-with-c17-filesystem-headers-on-ubuntu-18-bionic).

### Thread Pool Implementation

Intel TBB is an excellent library that supports diverse parallel programming models and data structures. However, this library loads into the memory in run-time, which would consume a lot of time in short-running applications. This also makes the project incapable of being distributed in a fully static form (See [Design.md](Design.md)).