Usage of `art_modern`¶

This is the detailed usage of art_modern. Every parameter and its combinations are introduced in detail. In this documentation, commands will be represented as ls -lFh with in-line or block omission represented as [...].

Parameter Controlling Simulation Behavior¶

Simulation Modes (`--mode`)¶

The simulator currently supports 3 simulation modes:

wgs (DEFAULT) for whole-genome sequencing. For scenarios with constant coverage over a limited number of sized contigs.
trans for transcriptome simulation. For scenarios with many short contigs and/or contig-specific coverage.
template for template simulation. Often being used with a high-level simulator that produces read fragments.

Library Construction Methods (`--lc`)¶

The simulator currently supports 3 library construction methods:

`se` (DEFAULT) for single-end reads¶

For wgs and trans mode: A read may start from any position in the reference contig that is long enough to contain the read.
For template mode: A read will start from position 0 in the reference contig.

A diagram for single-end read extraction:

FRAGMENT:  |==============================|
SE READ:   |------->

`pe` for paired-end reads¶

For wgs and trans mode: A fragment (called insert in some other simulators), whose length is drawn from a Gaussian distribution specified by --pe_frag_dist_mean and --pe_frag_dist_std_dev, is created from any position in the reference contig that is long enough to contain the fragment.
For template mode: A fragment that spans the entire length of the reference contig is created.
Paired-end reads facing inward of the fragment are then created from both ends of the fragment.

A diagram for paired-end read extraction:

FRAGMENT:  |==============================|
PE READ 1: |------->
PE READ 2:                        |<------|

`mp` for mate-paired reads¶

Extraction of the fragment as-is in pe mode.
Mate-paired reads facing outward of the fragment are created from both ends of the fragment.

A diagram for mate-pair extraction:

FRAGMENT:  |==============================|
MP READ 1:                        |------->
MP READ 2: <-------|

Thread-Level Parallelism (`--parallel`)¶

Number of threads to use. Use -1 to disable parallelism. Use 0 (default) to use all available cores. Oversubscription is allowed but not recommended.

NOTE If MPI is enabled, each MPI process will use the specified number of threads. For example, if you run mpiexec -n 4 art_modern --parallel 2 ..., there will be 8 threads in total.

NOTE SAM/BAM output writer is parallelized using its own thread pool. See parameter --o-sam-num_threads and --o-hl_sam-num_threads for more information. For example, if you run art_modern --parallel 4 --o-sam-num_threads 2 --o-hl_sam-num_threads 3 ..., there will be 9 threads in total.

WARNING Do not oversubscribe the number of threads. It impacts the performance of the program.

Error-Profile Related Parameters¶

Which Quality Distribution File to Use¶

Quality distribution files specify the base quality distribution at each position of the read. They are used to simulating sequencing errors.

Users may supply self-trained quality distribution files using --qual_file_1 and --qual_file_2 for read 1 and read 2 respectively. The format of quality distribution files is compatible with the original ART. The first parameter is required while the second parameter is required only for pe and mp library construction mode.

Users may also use built-in quality distribution files using --builtin_qual_file (added in 1.1.0.). The built-in quality distribution files are bundled with the simulator. Note that the built-in quality distribution files comes with different maximum read-length and different support over paired-end reads. So please make sure that the read length specified by --read_len is compatible with the built-in quality distribution file.

Quality distributions bundled with the original ART can be found here. In details:

Here lists existing Illumina models supported by the original ART, which is still available in art_modern. The (8B)/(6B) means that the quality scores may be binned.

Name	Name (`art_modern`)	R1 `txt` file name	RLen1	RLen2	Model Name
`GA1`	`GA1Recalibrated_36bp`	`EmpR36R1`	36	36	GenomeAnalyzer I
`GA1`	`GA1Recalibrated_44bp`	`EmpR44R1`	44	44	GenomeAnalyzer I
`GA2`	`GA2Recalibrated_50bp`	`EmpR50R1`	50	50	GenomeAnalyzer II
`GA2`	`GA2Recalibrated_75bp`	`EmpR75R1`	75	75	GenomeAnalyzer II
`HS10`	`HiSeq1000_100bp`	`Emp100R1`	100	100	HiSeq 1000
`MSv1`	`MiSeq_250bp`	`EmpMiSeq250R1`	250	250	MiSeq
`HS20`	`HiSeq2000_100bp`	`HiSeq2kL100R1`	100	100	HiSeq 2000
`HS25`	`HiSeq2500_125bp`	`HiSeq2500L125R1`	126	126	HiSeq 2500
`HS25`	`HiSeq2500_150bp`	`HiSeq2500L150R1`	150	150	HiSeq 2500 (8B)
`HSXn`	`HiSeqX_PCR_Free_150bp`	`HiSeqXPCRfreeL150R1`	151	151	HiSeqX PCR free (8B)
`HSXt`	`HiSeqX_TruSeq_150bp`	`HiSeqXtruSeqL150R1`	151	151	HiSeqX TruSeq (8B)
`MinS`	`MiniSeq_TruSeq_50bp`	`MiniSeqTruSeqL50`	51	N/A	MiniSeq TruSeq (6B)
`MSv3`	`MiSeq_v3_250bp`	`MiSeqv3L250R1`	251	251	MiSeq v3
`NS50`	`NextSeq500_v2_75bp`	`NextSeq500v2L75R1`	76	76	NextSeq500 v2 (6B)

Following Illumina profiles are only available in art_modern but not in the original ART as a built-in model. However, the model error files are provided in the original ART.

Name (`art_modern`)	R1 `txt` file name	RLen1	RLen2	Model Name
`GA1_36bp`	`Emp36R1`	36	36	GenomeAnalyzer I
`GA1_44bp`	`Emp44R1`	44	44	GenomeAnalyzer I
`GA2_50bp`	`Emp50R1`	53	50	GenomeAnalyzer II
`GA2_75bp`	`Emp75R1`	75	75	GenomeAnalyzer II
`HiSeq2500Filtered_150bp`	`HiSeq2500L150R1filter`	150	150	HiSeq 2500 (8B)

The following profiles are built and distributed by art_modern maintainers. New models from Illumina, MGI Tech, Elements Biosciences, and PacBio are added. Sorted in the order of discovery of data.

The following models are added in 1.4.0 & 1.4.1:

Name (`art_modern`)	R1 `txt` file name	RLen1	RLen2	Model Name	Source
`PacbOnso_150bp`	`PacbOnsoL150R1`	150	150	PacBio Onso	Mendeley Data, DOI,
`MiSeq_v3_300bp`	`MiSeqv3L300R1`	300	300	Illumina MiSeq v3	SRR29636774, DOI
`GA2X_150bp`	`GA2XL150R1`	150	150	Illumina GenomeAnalyzer IIx	ERR038223, 1KG
`GA2X_100bp`	`GA2XL100R1`	100	100	Illumina GenomeAnalyzer IIx	SRR933371, DOI
`HiSeq1500_250bp`	`HiSeq1500L250R1`	250	250	Illumina HiSeq 1500	ERR3322785, DOI
`BGISeq500_150bp`	`BgiSeq500L150R1`	150	150	MGI Tech BGISeq-500	ERR14979854, DOI
`DNBSeqT7_150bp`	`DNBSeqT7L150R1`	150	150	MGI Tech DNBSeq-T7	SRR30016282, DOI
`DNBSeqG50_150bp`	`DNBSeqG50L150R1`	150	150	MGI Tech DNBSeq-G50	SRR30016282, DOI
`DNBSeqG400_150bp`	`DNBSeqG400L150R1`	150	150	MGI Tech DNBSeq-G400	GIAB
`DNBSeqG400_400bp`	`DNBSeqG400L400R1`	400	N/A	MGI Tech DNBSeq-G400	ERR2888331, DOI
`HiSeq4000_150bp`	`HiSeq4kL150R1`	150	150	Illumina HiSeq 4000 (8B)	SRR2664950, DOI
`NextSeq500_150bp`	`NextSeq500L150R1`	150	150	Illumina NextSeq 500 (8B)	SRR9733244, DOI

The following models are added in 1.5.0:

Name (`art_modern`)	R1 `txt` file name	RLen1	RLen2	Model Name	Source
`DNBSeqG99_300bp`	`DNBSeqG99L300R1`	300	300	MGI Tech DNBSeq-G99	CNR1261865, MGI Tech ,
`DNBSeqT10X4_100bp`	`DNBSeqT10X4L100R1`	100	100	MGI Tech DNBSeq-T10x4	CNR0117350, MGI Tech ,
`DNBSeqG400_200bp`	`DNBSeqG400L200R1`	200	200	MGI Tech DNBSeq-G400	CNR0117149, MGI Tech ,
`DNBSeqT1Plus_150bp`	`DNBSeqT1PlusL150R1`	150	150	MGI Tech DNBSeq-T1 Plus	CNR1513487, MGI Tech ,
`HiSeq3000_150bp`	`HiSeq3kL150R1`	150	150	Illumina HiSeq 3000 (8B)	ERR10705258, DOI
`NextSeq550_150bp`	`NextSeq550L150R1`	150	150	Illumina NextSeq 550 (6B)	SRR19308002, DOI

List of all models by distinguished NGS sequencer manufacturers:

Note:

For Illumina models:
- We do not distinguish Dx (forensics and diagnostics) models with non-Dx models.
- The following models use 3- or 4-bin quality score compression, so not included:
  - NextSeq 1000 (PE300)
  - NextSeq 2000 (PE300)
  - NovaSeq 5000 (PE150)
  - NovaSeq 6000 (PE250)
  - NovaSeq X (PE150)
  - NovaSeq X Plus (PE150)
  - iSeq 100 (PE150)
  - MiSeq i100 (PE500)
For MGI Tech models:
- We do not distinguish Rs (Research-only) models with non-RS models.
- DNBSEQ-G400ER is a nanopore-based long read sequencer, so not included.
- The following platforms are “haunted” – We cannot find authentic information of those models:
  - BGISEQ-1000
  - DNBSEQ-T1
  - DNBSEQ-T5
For GeneMind: All GeneMind models uses 4-bin quality score compression, so not used.

Shift and Clip Parameters¶

--q_shift_1 and --q_shift_2: Shift the base quality of the quality distribution by the specified value. Please note that the shifted quality distribution will be bounded to --min_qual and --max_qual.

--min_qual and --max_qual: Clip the quality distribution to the specified range.

Separate Quality Distributions for Different Bases (`--sep_flag`)¶

This flag can separate quality distributions for different bases in the same position.

unset (DEFAULT): Use the same quality distribution for different bases in the same position. For example, the quality distribution of pos 10 will be the same regardless whether the incoming base is A or T.
set: Use different quality distributions for different bases in the same position. For example, the quality distribution of pos 10 may NOT be the same regardless whether the incoming base is A or T.

Parameters Controlling Input¶

Input File Path (`--i-file`)¶

The input reference file path. It must exist in the filesystem. Device paths like /dev/stdin is supported for non-MPI builds.

Currently, we support input in FASTA and PBSIM3 Transcripts format. An additional parameter --i-type is used to specify the file type.

NOTE Some parser may support compressed FASTAs. See --i-parser for more detils.

Input File Type (`--i-type`)¶

The file type of input reference sequences. Currently, FASTA and PBSIM3 Transcripts format are supported.

auto (DEFAULT) for extension-based decision.
- If the file ends with .fna, .fsa, .fa, .fasta, resolve to fasta.
- Otherwise, an error will be raised.
fasta for FASTA files.
pbsim3_transcripts for PBSIM3 Transcripts format.

NOTE If you’re using UNIX devices like /dev/stdin as input, please make sure that this option is specified.

Input Parser (`--i-parser`)¶

`auto` (DEFAULT) for size-based determination.¶

If the file size larger than 1 GiB or cannot be told (which is quite common if the input was redirected from stdin or other devices), resolve to htslib (wgs mode) or stream (trans or template mode). Otherwise, use memory.

`memory` for in-memory reference parser¶

The entire file will be read into the memory. Fast for small reference files.

`htslib` for HTSLib FAIDX parser¶

Store FASTA Index in memory and fetch sequences from disk using HTSLib. Memory-efficient for large genome assembly FASTA files with limited number of long contigs, but inefficient for transcriptome/template FASTAs with large number of (relatively) short contigs.

The FASTA index, usually ends with .fai and can be built using samtools faidx, will be created automatically if not exist.

NOTE For wgs mode only.

NOTE Do not support PBSIM3 Transcripts format.

NOTE Do not support UNIX devices for input.

NOTE htslib parser can be used on a .fa.gz file. However, please note that:

The GZip file MUST be bgzip-compressed. Plain GZip will NOT work. If you try to use a GZip-compressed FASTA file, you’ll see error messages like:

[E::fai_build_core] File truncated at line 1
[E::fai_build3_core] Cannot index files compressed with gzip, please use bgzip
[E::fai_path] Failed to build index file for reference file [...]

Specify --i-type fasta to use this parser.

`stream` for one-pass streaming parser¶

Streamline the input reference as batches and process them one by one. Efficient for transcriptome/template FASTAs with large number of (relatively) short contigs.

Please note that here, the contig number and length are not accessible to the output dispatcher. So use headless SAM/BAM output if a SAM/BAM output is needed.

The batch size of each batch is controlled by --i-batch_size. Larger batch size may improve performance but requires more memory. Use larger batch size if the targeted sequencing depth is relatively shallow.

Coverage (`--i-fcov`)¶

This option allows you to specify coverage. art_modern supports the following coverage mode:

Unified coverage floating point in double data type (e.g., 10.0). Under this scenario, the coverage is identical for all contigs.
Per-contig coverage headless tab-separated value (TSV) without strand information.
Per-contig coverage headless TSV with strand information.

NOTE This parameter is ignored if the file type is set to pbsim3_transcripts. This format comes with contig-specific coverage information.

NOTE We prefer unified coverage floating points to TSVs. That is, if you set this parameter to 10.0, we’re NOT going to check whether there’s a file named 10.0. So please make sure that the file name is not a number if you want to specify a coverage file.

NOTE For coverage of the template mode:

If a unified coverage file is provided, the coverage will be intepreted as positive coverage. That is:
- If the library is SE, all reads will be generated from the positive strand at the 1st base.
- If the library is PE, all read 1 will be generated from the positive strand at the first base, with read 2 generated from the negative strand at the last base.
- If the library is MP, all read 1 will be generated from the positive strand to the last base, with read 2 generated from the negative strand to the first base.
If a 2-column (unstranded) coverage file is provided, the coverage will be intepreted as positive coverage.
- Introduced in 1.3.0.
If a 3-column (stranded) coverage file or input in format of pvsim3_transcripts is provided, the coverage will be intepreted as-is.

For example, if you specified --read_len_1 10 --read_len_2 150 with unified/unstranded coverage 5 and PE library construction mode, you’ll be likely to see:

|================================================|
|---->                              <------------|
|---->                              <------------|
|---->                              <------------|
|---->                              <------------|
|---->                              <------------|

With stranded coverage positive=3 and negative=2, you may see:

|================================================|
|---->                              <------------|
|---->                              <------------|
|---->                              <------------|
|------------>                              <----|
|------------>                              <----|

while for MP library construction mode you would see:

|================================================|
<------------|                              |---->
<------------|                              |---->
<------------|                              |---->
<----|                              |------------>
<----|                              |------------>

Compatibility Matrices of Input Parameters¶

Compatibility matrix of file type, simulation mode, and parser:

Parser \ Mode	`wgs`	`trans`	`template`
`memory`	`fasta`	`fasta` / `pbsim3_transcripts`	`fasta` / `pbsim3_transcripts`
`htslib`	`fasta`	ERROR	ERROR
`stream`	ERROR	`fasta` / `pbsim3_transcripts`	`fasta` / `pbsim3_transcripts`

Compatibility matrix of coverage mode, simulation mode, and file type:

Mode \ File Type	`fasta`	`pbsim3_transcripts`
`wgs`	Unified	ERROR
`trans`	Unified / Unstraded / Stranded	IGNORED
`template`	Unified / Unstraded / Stranded	IGNORED

Parameters Controlling Output¶

Here introduces diverse output formats supported by art_modern. You may specify none of them to perform simulation without any output for benchmarking purposes.

If the parent directory does not exist, it will be created using a mkdir -p-like manner.

For all output format, a queue size parameter (e.g., --o-pwa-queue_size) is provided to control the size of lockfree queue. Turn this up to improve performance when you have enough memory, and low if the application is consuming too much memory. Added in 1.3.2.

Pairwise Alignment Format (`--o-pwa`)¶

PWA is serialization of the internally used data structure of the simulator. Currently, PWA consists of some metadata lines (Started with #) and a list of alignments, each spanning 4 lines.

The alignment lines are as follows:

>, read name, a blank, and the leftmost position of the alignment.
The aligned read sequence, gapped with -.
The aligned reference sequence, gapped with -.
The gapless quality string.

Example:

#PWA
#ARGS: opt/build_debug/art_modern [...]
>NM_069135:art_modern:1:nompi:0	NM_069135:0:+
AGC-CAAACGGGCAACCAGACTCCGC[...]
AGCACAAA-GGGCAACCAGACTCCGC[...]
BCCCCGGGGGGGFGGGGGGGFGGGGG[...]
[...]

This output format supports am_compress-like compression arguments (Namely, --o-pwa-compression, --o-pwa-compression_level, --o-pwa-buffer_size and --o-pwa-num_threads). Added in 1.3.4.

FASTQ Format (`--o-fastq`)¶

The good old FASTQ format. The qualities are Phread encoded in ASCII with an offset of 33.

@NM_069135:art_modern:1:nompi:0
AGCCAAACGGGCAACCAGACTCCGCC[...]
+
BCCCCGGGGGGGFGGGGGGGFGGGGG[...]

FASTQ files can be easily converted to FASTA using seqtk:

cat in.fq | seqtk seq -A > out.fa

For PE/MP simulation: The generated file is interleaved FASTQ. Split to 2 FASTQ files using seqtk:

# Read 1
seqtk seq a.fq -1 > a_1.fq
# Read 2
seqtk seq a.fq -2 > a_2.fq

This output format supports am_compress-like compression arguments. Added in 1.3.4.

FASTA Format (`--o-fasta`)¶

Sequence-only output format. Example:

>NM_069135:art_modern:1:nompi:0
AGCCAAACGGGCAACCAGACTCCGCC[...]

This simulator generates one-line FASTA files. That is, each sequence occupies only one line.

This output format supports am_compress-like compression arguments. Added in 1.3.4.

SAM/BAM Format (`--o-sam`)¶

Sequence Alignment/Map (SAM) and Binary Alignment/Map (BAM) format supports storing of ground-truth alignment information and other miscellaneous parameters. They can be parsed using samtools, pysam and other libraries, and can be used as ground-truth when benchmarking sequence aligners.

This writes canonical SAM/BAM format using HTSLib. This output writer supports computing NM and MD tag. The mapping qualities for all reads are set to 255.

Example (Tabs replaced with spaces):

@HD	VN:1.4	SO:unsorted
@PG	ID:01	PN:art_modern	CL:opt/build_debug/art_modern [...]
@SQ	SN:NM_069135	LN:1718
[...]
NM_069135:[...] 0 NM_069135 1 255 3=[...]5= = 1 0 AGCC[...] BCCC[...] MD:[...] NM:[...]
[...]

Other SAM/BAM formatting parameters includes:

--o-sam-use_m: Computing CIGAR string using M (BAM_CMATCH) instead of =/X (BAM_CEQUAL/BAM_CDIFF). Rarely used in next-generation sequencing but common for long-read sequencing alignments.
--o-sam-write_bam: Write BAM instead of SAM.
--o-sam-num_threads: Number of threads used to compress BAM output.
--o-sam-compress_level: zlib compression level. Supports [u0-9] with u for uncompressed BAM stream, 0 for uncompressed stream with zlib wrapping, and 1–9 for fastest to best compression ratio. Defaults to 4. Please note that both u and 0 generates uncompressed output. However, 0 generates BAM stream with zlib wrapping while u generates raw BAM stream.
--o-sam-without_tag_MD: Do not write MD tag. Added in 1.3.3.
--o-sam-without_tag_NM: Do not write NM tag. Added in 1.3.3.
--o-sam-no_qual: Do not write quality string (Write quality string as *). Added in 1.3.3.

NOTE For PE/MP simulation, the 2 reads from one fragment may NOT appear together.

See also: SAMv1.pdf for more information on SAM/BAM format.

Headless SAM/BAM Format (`--o-hl_sam`)¶

This format is specially designed for the stream reference parser that allows handling of numerous contigs. As a side effect, the contig name and length information will not be written to SAM header. Each read will be written as unaligned with coordinate information populated in the OA tag.

Example (Tabs replaced with spaces):

@HD	VN:1.4	SO:unsorted
@PG	ID:01	PN:art_modern	CL:opt/build_debug/art_modern [...]
NM_069135:[...] 4 * 1 0 * * 1 0 AGCC[...] BCCC[...] OA:[...] MD:[...] NM:[...]
[...]

An additional option: --o-hl_sam-without_tag_OA, setting such which will not write OA tag. Added in 1.3.3.

This output writer supports other BAM formatting parameters.

Miscellaneous Parameters¶

Read Length Parameters¶

--read_len, --read_len_1, and --read_len_2 controls read length. Specifically:

If none of the 3 parameters are set, will use the longest read length supported by the quality distribution file(s).
If only --read_len is set, will use the specified read length for both read 1 and read 2.
--read_len cannot be used together with --read_len_1 or --read_len_2.
If --read_len_1 and/or --read_len_2 is set, will use the specified read lengths for read 1 and/or read 2 respectively.
For paired-end or mate-pair library construction mode, both read lengths must be set either using --read_len OR --read_len_1 and --read_len_2.

Indel Error Parameters¶

--ins_rate_1, --ins_rate_2, --del_rate_1, and --del_rate_2: Targeted insertion and deletion rate for read 1 and 2.
--max_indel: The maximum total number of insertions and deletions per read.

Ambiguous Base Parameters¶

--max_n: The maximum total number of ambiguous bases (N) per read.

Others¶

--id: Read ID Prefix. Used to distinguish between different runs.
--i-seed: Seed for random number generator to ensure reproducibility. Added in 1.3.2. NOTE To ensure reproducibility, please use the same number of threads, number of MPI progress under the same random generator when running art_modern with the same seed.

Logging Parameters¶

--reporting_interval-job_executor: The reporting interval (in seconds) for individual art_modern job. Added in 1.2.1.
--reporting_interval-job_pool: The reporting interval (in seconds) for ThreadPool. Added in 1.2.1.

Those logs are designed for observation of long-running jobs. Increase the interval to reduce log size.

Environment Variables¶

ART_NO_LOG_DIR: If set (to any value), disables creation of a log directory. Logging to files is skipped, and a warning is printed. Added in 1.1.4.
- NOTE If you’re using art_modern with MPI, this discards all logs generated from rank 1, 2, 3….
ART_LOG_DIR: If set, specifies the directory where log files will be written. If not set, defaults to log.d and a warning is printed. Ignored if ART_NO_LOG_DIR is set. Added in 1.1.4.
- NOTE If you’re running 2 art_modern processes simultaneously, please make sure that they are using different log directories.

Usage of art_modern¶

Parameter Controlling Simulation Behavior¶

Simulation Modes (--mode)¶

Library Construction Methods (--lc)¶

se (DEFAULT) for single-end reads¶

pe for paired-end reads¶

mp for mate-paired reads¶

See also¶

Thread-Level Parallelism (--parallel)¶