vg modules

Table of Contents

With all the problems I’ve been coming up against, I often felt that there is the right tool available, I just don’t know it (yet), and I’ve struggled to find it. vg help lists a lot of tools in four different categories: “main mapping and calling pipeline”, “useful graph tools”, “specialised graph tools”, and “developer commands”. In order for myself to have a better overview of everything that is available, I’m going to list and summarise their functions in this post.

This overview is currently based on vg v1.22.0 Rotella.

Main mapping and calling pipeline

The main mapping and calling pipeline is also introduced in the vg wiki, including examples and explanations.

construct

vg construct constructs a genome graph based on a reference sequence and known variants. It takes a FASTA file as input (-r), together with a VCF file (-v) and an optional file of insertion sequences (-I) (which have to be referenced in the VCF). It is possible to match contig names which differ between the two input files (-n), and to specify that paths for variants should be saved by variant ID (-a). Regions can be specified based on contig names or 1-based inclusive (-R), to exclude them from being parsed (-C) (I think). It’s possible to set the region size as a way to specify how many variants per region can be parallelised (-z), and the program can run on multiple threads (-t). If structural variants should be included in graph construction, the -S option has to be used. Alternate alleles are usually chopped up, which can be avoided by using -f, and 1bp reference bases are removed from indels, which can be avoided by using -i.

vg construct can also be used to construct a graph from a multiple sequence alignment (MSA). In that case, the MSA file is specified with -M, and its format with -F (default is fasta, but clustal is also possible). It is possible to not add paths for the sequences to the graph (-d), but options listed above apparently cannot be used when working with MSA files.

In both cases, it is possible to limit the node sequence length (default is 32 nucleotides) (-m) and to show the progress of the construction process (-p).

index

vg index creates indices for a genome graph. The tool comes with a ton of options and I’m not going to list them all here. There are also vg wiki pages available for index construction in general and indexing huge dataset specifically.

There are three general options: -b sets a temporary directory to save temporary files in, -t sets the number of threads to use, and -p shows the progress of the indexing.

The creation of an xg index has only two additional options: -x the name of the output file, and -L to include alternative paths in the xg index.
There are many options for gbwt (graph Burrows-Wheeler transform) and gcsa (Generalized Compressed Suffix Array, which is BWT-based). A gbwt index can be created from a graph based on a VCF file, and therefore has special options for the VCF data (i.e. VCF phasing, renaming contigs, setting a region in the VCF to operate on, and others). The gcsa index, on the other hand, has more general options: -g to set the name of the output file, -i to give a file with kmers to use and -f for a node mapping file, -k to set the kmer size, -X for the number of doubling steps (?). The temporary disk space can be limited (-Z) and the GCSA2 index can optionally be validated using the input kmers (-V).
There is also one gam indexing option (gam files are the standard output from mapping sequencing reads to a graph in vg): to specify that the input is an already sorted gam file (-l); and one option for vg in-place indexing (--index-sorted-vg).

Finally, vg index includes some options for rocksdb and snarl distance indexing (a snarl is a minimal subgraph in a biedged graph).
A rocksdb name can be given with -d, and this method of indexing expects a gam input. -m denotes a sorted input where the mappings in alignments should be stored by node, -a denotes a sorted input where the alignments themselves should be stored by node, -A just says that there are alignments which should be given out sorted, and -N says to cross-reference the nodes by alignment traversals. There is a database dump option (-D) and an option to improve performance by compacting the index (-C).
Snarls can be loaded from a file (-s) (probably created by vg snarls), -j sets the name of the output file, and -w is the cap beyond which the maximum distance is no longer accurate.

map

Mapping is the next part of the basic operations of vg. Options for vg map are sorted by graph/index (input), options affecting the algorithm or scoring, and input and output formats.

Graph input can be set as a base name (-d) that is used to look for both .xg and .gcsa indices, or as single file names (-x for xg, -g for gcsa, and -l for gbwt) (see also vg index).

Options for the algorithm include, but are not limited to, the number of threads to use (-t), a minimum MEM (maximal exact match) length (-k), ignoring MEMs with too many hits (-c) or MEMs longer than a set length (-Y). You can tell the tool to attempt to align up to INT best candidate chains of seeds (-u) or attempt to align at least the INT best candidate chains of seeds (-l). Of course, a minimum threshold for alignment identity is possible to set manually (-P), and there are a few options dealing with bands for long read alignment. For paired read mapping, a number of mate rescues to perform can be set with --mate-rescues, and -S denotes the penalty for an unpaired read pair.

The scoring can be influenced by specifically setting a match score (-q) and a mismatch penalty (-z), as well as gap open (-o) and gap extension (-y) penalties. There is a full-length alignment bonus (-L), which can be removed from the score before sorting and mapping quality calculation (--drop-full-l-bonus). There is an exponent for haplotype consistency likelihood (-a), and a log recombination penalty for GBWT scoring (--recombination-penalty). With -A it is possible to perform base quality adjusted alignments (if base quality inputs are available). Finally, many of these options can be set at the same time when using -m to select an alignment model - default is “short”, but there is also “long” for PacBio reads for example.

Input options in vg map are: -s for a single sequence string to align, with -V to name the sequence. -T denotes the read file where each line contains a read (sample and read group names for this can be set with -N and -R, respectively), and -b gives a htslib-compatible file like BAM (alignments then go to stdout). It is also possible to realign a GAM input (-G), or give fastq (-f) or fasta (-F) files as input. If the fastq or GAM file is interleaved paired-ended, the -i option has to be used additionally.

Standard output of vg map is an alignment stream, which can be converted to JSON (-j), e.g. for debugging. The output can be directly surjected into graph paths with --surject-to bam/sam/cram. Alignments can be buffered together before putting out the GAM file (--buffer-size), and for the GAM realignment (-G), the -X option says that the alignment is written with the “correct” field set to overlap with the input. For efficient testing, instead of a GAM file, a table with name, chromosome, position, mapping quality, and score can be printed (-v). It is also possible to produce up to INT alignments for each read (-M), or to keep secondary input alignments (-K) as well as primary ones. Mapping quality can be capped at a maximum (-Q), and unaligned reads can be excluded (--exclude-unaligned). Finally, -D prints debugging information to stderr.

augment

vg augment is used to embed GAM alignments into a graph to facilitate variant calling, and it has comparatively few options, most of them general:
The tool can merge the paths implied by alignments into the graph (-i), it can keep soft-clips from input alignments (-S), which are cut by default, and instead of actually augmenting, it can just use the alignments for labeling the graph (-B). Translations from the augmented graph back to the original can be saved in a file given with -Z, and augmented GAM reads are saved in a file given with -A. The graph to augment can be a subgraph of the one used to create the GAM, so alignments with missing nodes can be ignored (-s). -m sets the minimum coverage of a breakpoint to be added to the graph, and -c sets the expected coverage used for memory tuning. It is possible to ignore edits with low average base quality (-q), and alignments with low mapping quality (-Q), as well as setting a maximum fraction of bases that have to be in an edit for it to be included (-N). -p again shows the progress of the vg augment run, and -v makes it verbose with information and warnings about the VCF generation. The number of threads to use can be specified with -t.

There is also the possibility to add specific loci into the graph: -l merges all alleles in the given file into the graph, and -L merges only the alleles in called genotypes into the graph.

pack

vg pack is not part of the basic operations overview in the vg wiki, but it’s listed as part of the main pipeline in vg help (before vg call). This tool can be used to create compressed coverage packs as read support to be used in variant calling.

This tool uses a basis graph of any format (set with -x), and writes compressed coverage packs to an output file specified with -o. Input is usually a GAM file (-g), but there is also an option for other files (-i), and output can also be a table in stdout representing pack (-d) or edge coverage (-D), or focus only on specified nodes (with -n for IDs in-line, or -N for a list of IDs in a file). It is possible to record and write edits as well as only recording graph-matching coverage (-e), you can set a bin size with -b, scale the coverage by phred quality (-q), and ignore reads with mapping and base quality below a threshold (-Q). For tuning, it is possible to set the expected coverage with -c and the number of threads to use with -t.

call

Finally, vg call calls variants or genotypes known variants from data processed with the above commands using a samtools pileup style structure (but you could also use vg genotype, which uses a FreeBayes-type approach). There are a few support calling options and a number of general options to use.

Supports can be created with vg pack for a given input graph and added as file to vg call with (-k) - as far as I know, this is mandatory. It is then also possible to set a minimum allele and site support for a call (-m), and use an old ratio-based genotyping algorithm instead of the default probabilistic model (-B). You can also set a minimum bias for homozygous alt/ref alleles (-b), meaning they must have M/N times more support than the next best allele.

Generally, input can be the VCF file used to construct the graph that should now be used for genotyping (-v), together with a reference fasta (-f) if VCF contains symbolic deletions or inversions, or an insertions fasta (-i) if insertions are referenced in the VCF file. The sample name has to be given with -s, and if a snarls file has already been created with vg snarls it can be used as input with -r. It is possible to enter one or multiple paths that should be used as reference (-p) and to give offsets from the reference path(s) (-o). Additionally, you can override the length of the reference given in the contig field in the output VCF (-l), and you can again specify the number of threads to use (-t).

Useful graph tools

The useful graph tools include some I have already used, like vg view, prune, and mod, and some that I don’t know yet, like vg deconstruct.

deconstruct

vg deconstruct creates VCF records for snarls present in a graph relative to a chosen reference path. For that it needs the path(s) to deconstruct against (-p), or a prefix to select paths to use as reference (-P). It is also possible to give an alternative prefix that is used to lump together non-reference paths to one sample in the output VCF (-A). If snarls have already been computed, those can be given to the module with -r. It is possible to focus the algorithm only on traversals that correspond to paths in the graph (-e). As often before, a number of threads to use can be given with -t and there is a verbose mode (-v) for “some status messages”.

find

So far, I’ve mostly used vg find to find specific nodes or paths in my graph, but there are a lot more options available, grouped into “graph features”, “subgraphs by path range”, “alignments”, and “sequences”. General input is either a rocksdb database (-d) or an xg index (-x).

Graph features to find are nodes by direct ID input (-n) or with IDs listed in a file (-N), or edges on the end (-e) or start (-s) of a node with a given ID. It is also possible to expand the context of the generated subgraph by a given number of steps (-c), and to use the length in bases instead of steps (-L). vg find can also find the position of a node in a given path (-P), or get all nodes in a given range (-r) (can also be converted to length in bases with -L), or accumulate a graph touched by alignments in a GAM file (-G). Output can be specified to be just the path names in the index as well (-I).

Going by path range, vg find can find node(s) in a specified range inside a target path (-p) (I assume this means genomic positions), or read out targets from a BED file (-R). In both cases, -E will get any node in the partial order from the given positions, assuming a sorted DAG (Directed Acyclic Graph). The output can be redirected from stdout to one file per target with a given prefix (-W), and instead of subgraphs kmers from the subgraphs can be given out as well (-K).

Alignments can be used either on the basis of a rocksdb database (-d) or a sorted and indexed GAM file (-l). You can either get all alignments written out (-a), only those aligning to any of the nodes in a given range (-o), or alignments to a provided subgraph (-A).

To work with sequences, vg find needs the gcsa index of the sequence space of the graph (-g). The tool can then look for a sequence string given with -S, which is split into kmers of a given size (-z), with a specific distance between kmers (-j). -M describes the super-maximal exact matches (SMEMs) of the sequence in JSON, and -B finds non-super-maximal MEMs inside SMEMs of a given length. It’s possible to use a fast SMEM re-seeding algorithm for this with -f, and to set minimum (-Z) and maximum (-Y) lengths for the MEMs. Output can be a graph of edges and nodes matching a kmer (-k), or a table of kmers (-T), if kmers are already in the index.

Finally, the approximate count of a kmer given with -k in the db can be reported (-C), the distance between a pair of nodes given with -n can be reported either on the reference path given with -P or on the best path chosen heuristically (-D), and all paths with a specific prefix can also be returned (-Q).

ids

This is a small function with not a lot of documentation, but vg ids is mentioned for index construction and working with a whole genome in the vg wiki, where the tool is used to coordinate node IDs over multiple genome graphs.

It only has six options: -c, which minimises the space of integers used by the IDs; -i and -d, which increment or respectively decrement IDs by a given number; -j, which is used in the above mentioned articles to create a joint ID space for all supplied graphs; -m which creates an empty node mapping for vg prune (whatever that means); and -s, which sorts the node IDs, generating new ones in a generalised topological sort order.

mod

I have used vg mod before to prepare my graph for indexing, but this module can do a lot more than that.

In general, input is a graph in vg format, and output in the same format goes to stdout.
The simplest option to modify a graph is to use alignments to (re-)label the graph with -P. Similar to vg ids it is also possible to compact the node ID space (-c), and cycles inside the graph can be broken with -b. A general simplification of the graph can be reached by removing redundancy without changing the path space (-s).
The graph can be normalised to have non-redundant edges (-n), and this normalisation can be iterated until convergence is reached, or for a specified number of times (-U). Additional modification of edges includes the flipping of doubly-reversing edges to be on the forward strand.
Strongly connected components (SCC) of the graph can be reviewed with -T, and then DAGified (to create a Directed Acyclic Graph) in multiple ways: -d copies those SCCs a given number of times, while forwarding the edges from old to new copies, and -w does the same copying until the shortest path through the SCC is of a specified length. A DAGification step can also be stopped if the unrolling component has a specified sequence length (-L).
vg mod can unfold a graph with -f to represent inversions accessible up to a specific number from the forward component of the graph, and it can orient the nodes in the graph forward (-O). It is possible to remove everything (nodes and edges) that is not part of a path (-N), or to keep only those edges and nodes (-A). Additionally, only nodes and edges belonging to a specific path can be returned (-k), and nodes with no sequence can be removed while forwarding their edges (-R).
To create subgraphs, a node ID can be given for the node to function as root (-g), and -x sets the context (i.e. how many steps the subgraph should move out).

vg mod can also be used to prune a graph by removing complex nodes (-p) that are reached by paths of a specific length (-l) which cross more than a certain number of edges (-e). Subgraphs shorter than a specific length (-l) can also be removed (-S), and nodes can be chopped to be no more than a certain length (-X). On the other hand, nodes that are only connected to each other by one edge can be replaced (“unchopped”) with a single node (-u).
Labels can be deleted from the graph, leading to empty nodes (-K), and nodes can be unlinked if they have many edges (-M). It is also possible to remove nodes with specific IDs (-y), or to convert to a cactus graph representation (-a). A sample graph can be calculated from a given VCF file if the input has allele paths (-v), and using a locus file, an augmented graph can be subset to a sample graph (-G).

For debugging, vg mod can also join all head and tail nodes to marker nodes (-m), and for tasks that can be done in parallel, the number of threads can be set with -t, as always.

prune

vg prune is a little more than a specialised version of vg mod. It is used to prune complex regions of a graph for GCSA2 indexing, and it always removes embedded paths.

Pruning parameters are -k as the kmer length used, -e as the maximum number of edges, -s as the minimum size of subgraphs, and -M as the maximum degrees - all of these have defaults.

Important are the three pruning modes, which are mutually exclusive. The default is -P, the simple pruning, but there is also -r which restores edges on non-alternative paths, and -u which unfolds non-alternative paths and GBWT (Graph Burrows-Wheeler Transform) threads. An additional and potentially slow option is -v, which verifies that paths are still existing after any kind of pruning.

vg prune also comes with a few unfolding options: it is possible to unfold the threads from a GBWT index (-g), store node mapping for duplicates (-m), or append to the existing node mapping (-a).

Other options are the usuals: -p to show the progress, -t for the number of threads, and also -d as a dry-run option, which determines the validity of the chosen combination of options.

sim

We used vg sim right at the start of the course in the toy examples, but it has a few more options that we didn’t explore.

In general, the module needs an xg index to work on (-x), and it can also use a fastq file to match the error profile of the reads contained within (-F). This file can also contain interleaved read pairs (-I).
Simulation can be done from specified path(s) (-P) or an expression profile formatted as RSEM output (-T). If the expression profile contains haplotype transcripts, it is necessary to also input the transcript origin info table from vg rna -i (-H).

vg sim can generate a specified number of reads (-n) or a specified length (-l). A random seed to use can also be give (-s), as well as the base substitution (-e) and indel rates (-i). If you are using a fastq file for the error profile, -d can set the proportion of trained errors that are indels, and -S scales the trained error probabilities.
It is possible to avoid simulation from the reverse strand (-f), and to generate paired end reads with a given fragment length (-p), for which a standard deviation can also be set (-v). If required, the module will allow reads to be sampled from the graph with Ns in them (-N).

Instead of reads as output, vg sim can also directly generate a true alignment on stdout (-a), which can also directly be given in json (-J).

snarls

As already mentioned, a snarl is a “minimal subgraph in a biedged graph”, and vg snarls writes a list of protobuf snarls. I don’t know much about this topic yet, so let me just cite one more explanation from the paper: “a snarl is an ultrabubble if its separated component is acyclic and contains no tips”.

The module can also output variant paths as SnarlTraversals to stdout (-p), or write SnarlTraversals for ultrabubbles (-r). Traversals can be restricted to leaf ultrabubbles (-l) or to top level ultrabubbles only (-o). Traversals can be computed for any kind of snarl (-a), or only for snarls with a certain maximum number of nodes (-m).
Trivial snarls with only one edge have to be specifically included in the output with -T, and snarls can be returned sorted by node ID (for already topologically ordered graphs) (-s).
vg snarls can use a VCF file to find traversals (-v), and needs a reference fasta file for structural variants for this VCF (-f), and/or a fasta file with insertion sequences (-i).
It is possible to set a number of threads to use for this module as well (-t).

stats

vg stats does just what the name says and gives you different statistics about a graph. This includes the graph size (-z), and the number of nodes (-N) and edges (-E) in the graph. It can also output the length of sequences in the graph (-l) and describe subgraphs (-s). Head (-H) and tail (-T) nodes can be listed, as well as strongly connected components (-c).
vg stats also knows whether or not the graph is acyclic (-A), can analyse nodes with a given ID (-n) or in a given range (-r), and show the distance of each provided node to the head (-d) or tail (-t). The module also computes stats for reads aligned to the graph (-a), and can write a table with information on overlapping paths given a specific path name (-o) or for all (-O). Finally, statistics can be printed for snarls (-R), and the module can run in verbose mode (-v).

view

Last in this category of “useful graph tools” is vg view, which I’ve used quite a lot already to convert between graph formats. I’ll try to list the options as comprehensively as possible.

Input formats: GFA (-F), VG (-V), JSON (-J), graph translation transcription (-Z) (??), turtle format (-T), GAM format (-a), stream in Locus format (-q), stream in Locus format for dot output (-Q), BAM or other htslib-parseable alignment (b), FASTQ with GAM output (-f), VG Pileup format (-l), VG snarl format (-R), VG SnarlTraversal format (-E), VG MultipathAlignment format (GAMP) (-K).
The alignments from an input GAM file can be added to the graph (-A), and interleaved fastq files have to be marked with -i.

Output formats: GFA (-g), VG (-v), JSON (-j), GAM (-G), RDF/turtle format (-t), stream in Locus format (-z), DOT (-d), FASTQ with GAM input (-X), VG Pileup format (-L), VG MultipathAlignment format (GAMP) (-k).
It is also possible to do a streaming conversion of a VG format graph in line delimited JSON format (-c), and when using RDF output, the base URI can be set manually (-r). The DOT output can be simplified with -S to remove node labels and simplify alignments.

For visual outputs (e.g. DOT), ASCII labels can be used instead of emoji (-e), and nodes can be labeled with the labels used for ultrabubbles (-Y). When drawing alignments, nodes that are not in the graph can be skipped with -m, and for DOT output, nodes that are not in the reference path can be coloured (-C). Paths are shown in the DOT output only with -p, and will be walked (represented with labeled edges) only with -w. It is also possible to add labels to normal edges to represent paths with -n, and to add JSON formatted mappings to each path (shown with -p) using -M. Edge ports can be reverted in DOT (-I), and a random seed can be specified for the assignment of path symbols (-s).

In general it is also possible to hide warnings if nodes or edges are encountered multiple times (-D), and to use a specified number of threads (this time only with --threads!).

Specialised graph tools

I’ve used some of those specialised tools already, but there are so many of them… Let’s see what they do.

add

vg add adds variants to a graph from a VCF file (-v). Contigs in the VCF can be renamed/matched to contigs from the graph with -n, and contigs in the VCF not found in the graph can be ignored (-i). You can give a range in which to look for nearby variants to make a haplotype (-r), or add an extra flanking sequence to use outside of found variants (-f). Progress is shown with -p, and threads are selected with -t.

align

Instead of using vg map -s to align a single sequence to the graph, it is also possible to use vg align. -s sets the string to align, and -Q gives it a name. The output can be a GAM (default) or in json format (-j).
vg align lets you set a match score (-m) and a mismatch penalty (-M), or use a scoring matrix from a file (--score-matrix). There are also gap open (-g) and gap extension penalties (-e), as well as a full length alignment bonus (-T).
A banded global alignment algorithm can be used with -b, and the alignment traceback can be pinned to the optimal edge of the graph (-p) - here it is also possible to choose to pin the first rather than the last bases of graph and sequence (-L). Instead of using an input graph, it is possible to use an input sequence with -r to run an SSW (Striped Smith Waterman, I assume) alignment. Score matrices and other debugging info can be printed using -D.

annotate

I have experimented with the annotation function already, but used only the input and output specifier. There are also options for the annotation of alignments I haven’t used yet, though.

For graph annotation, -x set the xg index of the graph to annotate, and -b or -f sets the BED or GFF file with the annotation, respectively. Output is usually a GAM file, but can also be a GGFF subgraph annotation file (-g). To generate that, snarls also have to be given (-s).

To annotate an alignment, vg annotate needs a GAM file as input (-a) as well as the xg index of the reference graph (-x). You can give specific positions to annotate (-p) and a limit of how far the tool should be searching for paths around those (-l). The alignments can be annotated with overlapping regions named from a BED file (-b). Instead of the standard output, it is also possible to write a table with a header describing how much of each alignment is novel (-n).
-t can here again be used to set the number of threads for the tool to run on.

chunk

vg chunk has a longer explanatory text, but actually all it does is it splits a graph or alignment into smaller chunks. Graph chunks are saved in VG format, read chunks in GAM files, and haplotype annotations go to .annotate.txt files, all with automatic file names. Only for single-range chunks does the output go to stdout.

Input options are -x to input the xg index or graph to split, and -G to set the GBWT haplotype index for haplotype extraction. Instead of a graph, an alignment can be given with -a, or you can use both with an additional -g.
Other general options can be used to define the chunk size (-s) and overlap between chunks (-o), as well as to further control the output. All created chunks can be written to a BED file (-E), and automatically generated files (see above) can be given a specific base name (-b). The context of the chunks can also be controlled, either by a specified number of node steps (-c) or base pairs (-l). Haplotype threads can be traced in chunks with -T, and GAM alignment output can be restricted to only those fully contained within a chunk (-f). Finally, the output format can be specified to be either vg, pg, or hg (-O), and the number of threads is set with -t.

There are also several options for path chunking, id range chunking, gam chunking, and component chunking: a range (0-based inclusive) can be specified within a certain path (-p) or a list of paths (-P). Additionally, a BED file can be used to define regions to chunk (-e).
For id range chunking, a single node range can be given with -r, or a file of ranges with -R. It is also possible to determine how many id-range chunks should be generated (-n).
The only option for simple gam chunking is -m to set a maximum number of reads per chunk.
Component chunking can either be done for each connected component (-C), also in combination with paths or id ranges as targets, or for each path in the graph’s connected component (-M).

circularize

This module does what the name says: it makes specific paths or nodes in a graph circular. Just give it a path (-p) or a file with path names (-P), or alternatively a node id (-a) or a tail id (-z) (?). vg circularize can also list all paths currently in the graph with -d, which should be similar to vg paths -L, but the tool cannot work on xg indices and therefore takes longer.

combine

vg combine doesn’t have any options. You give it graphs to combine, and it combines them regardless of the input format (they have to be normal, seek-able files, though).

compare

vg compare compares the kmer sets of two graphs. Standard input are two graphs (I assume vg format, but it doesn’t say), or alternatively databases with options -d and -e. The number of threads to use can be set with -t.

concat

This module again doesn’t take any options. Input graphs are concatenated in order y adding edges from the tail nodes of the predecessor to the head nodes of the following graph. Be aware that node IDs are compacted.

convert

vg convert looks like a little sibling to vg view, but offers different output formats. Input is a graph of unspecified format, and output options are: VG (-v), HashGraph (-a), PackedGraph (-p), xg (-x) (how does that work?), or odgi (-o).

depth

vg depth creates additional information after mapping reads to a graph. It either works directly on the GAM file, or on vg pack output. -k identifies the file with supports created from vg pack for the given input graph, and -p sets the reference path(s) to use (defaults to all paths). It is possible to set a bin size (-b) and to count deletion edges within the bin (-d).
To calculate GAM coverage depths, input a GAM file with -g. There is a threshold for the maximum number of nodes to consider (-n), and the possibility to set a random seed manually (-s). Alignments with poor mapping quality can also be ignored (-Q).
In general, nodes with poor coverage can be ignored with -m, and threads to use can be set with -t.

dotplot

There is absolutely no explanation for this tool. It takes in xg index with -x, that’s it. vg dotplot writes a table to stdout with a column containing a “query” name and position, as well as the orientation, and also a “target” name and position, so the result is a kind of genome-genome alignment dot plot, only with all paths (?) in the graph queried against all.

explode

vg explode breaks a graph (in vg format) into connected components in their own files, which are saved in a given directory. It is possible to select a number of threads to use for this process with -t.

filter

vg filter is a tool to use on files in GAM format which can filter alignments for specified properties. These can be names - reads with certain prefixes either using a single one (-n) or a file of prefixes (-N) -, or annotations - to exclude reads with annotations on contigs matching a given regex (-X) or with specific feature annotations (-F).
It is also possible to set a minimum score to keep the second (-s) or primary (-r) alignment. Reads can be “re-scored” using default parameters (-O), and scores can be normalised based on length (of what, it doesn’t say) (-f). Instead of the score, a substitution count can be used with -u.
Reads can be filtered based on their alignment beginning or ending with an insert of a certain size (-o), or if they don’t begin with at least a specified number of matches at each end (-m). Split reads can also be removed (-S), but for that the xg index has to be loaded with -x, and regions can be appended to alignments (?) (-A).
vg filter can also filter alignments based on a mapping quality threshold (-q), and based on the presence of tandem repeats (-E). Additionally, there is an option to filter reads where a fraction of bases has a low base quality (-b).
The ends of ambiguously mapped reads can be clipped (-D) (this also needs the xg index with -x), and this “defraying” can be stopped after a certain number of nodes have been visited (-C). The set of reads can also be downsampled with -d.

The input can be interleaved, but this should be stated with -i or -I. -i filters out both ends if either fails a filter, while -I only filters them out if both fail. It is possible to use a complement of any of these filters with -U, and to set a number of threads with -t. Using a verbose mode, it is possible to get statistics on number of reads filtered by what (-v), and if only that is of interest, the output of a filtered GAM file can be omitted entirely (-V).

gamcompare

This looks like a tool with very limited use: vg gamcompare compares an alignment GAM file with a “true” GAM file, which I assume comes from simulated reads. Options are: -r to set the distance within which to consider reads as correct, -T to output a TSV file instead of a GAM file, -a to add an aligner name to the TSV output, and the standard -t for the number of threads to use.

gamsort

vg gamsort can sort a GAM file, or index a sorted GAM file. For indexing, use the -i option, to use a naive sorting algorithm use -d. You can also use RocksDB-style indexing for sorting (-r), or create a node-to-alignment index (-a). -p shows the progress and -t sets the number of threads.

gbwt

vg can also work with and manipulate files in Graph Burrows-Wheeler Transform (GBWT) format. General options are available to set the output format (-o) and show the progress of the process (-p).
vg gbwt can merge GBWT files with the -m option, with an optional fast merging algorithm (-f).

With one GBWT file as input, the tool can count the number of threads (-c) or extract threads in SDSL format (-e). vg gbwt can also print basic metadata of the file (-M), or specified metadata: contigs (-c), haplotypes (-H), and samples (-S). Contig or sample names can be listed with -L, thread names are listed with -T, and -R removes a specified sample from the index.

Using none or one GBWT file as input, vg gbwt can also build a GBWTGraph and serialise it (-g), for example using an xg index as input (-x). Local haplotypes can also be sampled from an input graph (-l). Further, you can build GBWT from a greedy path cover (-p), or find a certain number of paths in the path cover (-p) or haplotype sample (-l) options. For both of these options, you can also set the node context length (-k). In general, the GBWT generation buffer size can be adjusted with -b, and the interval in which path IDs are stored is set with -i.

genotype

To compute genotypes from a graph and an indexed collection of reads (I’m not sure about the format here, there seems to be a directory structure involved), use vg genotype. The output can be in VCF format (standard or -v) or JSON (-j), and in addition to the required input files, a GAM file can be specified to use with variant recall or in place of the read index (-G). There are also options to include a fasta file (-F) or a file with insertions (-I).
A given path can be used as reference path with -r, and a name can be specified for the contig (-c) and the sample (-s) in the output VCF file. Variant positions can be offset by a given amount of nucleotides (-o), and total sequence length can be overwritten with -l.
It is also possible to dump an augmented graph in a file (-a), and to avoid embedding gam edits into the graph (-E). There are four traversal finders to choose from (-T): reads, exhaustive, representative, and adaptive.
vg genotype can ignore mapping quality (-Q) and you can disable indel realignment with -A. Use a denominator for prior probability of heterozygousness with -d. The minimum amount of unique reads per strand for a called allele to accept the call can be set with -P.
Finally, progress is shown with -p and threads to use can be set with -t.

inject

vg inject probably injects alignments into a graph? It uses BAM/SAM/CRAM files as input, together with a graph of xg index (-x) and creates a GAM file. -t sets the number of threads to use.

join

vg join joins graphs and subgraphs into a single variant graph by joining their heads to a single root node. For this to work, they graphs should already be in the same ID space. This tool takes vg format graphs as input and no further options.

kmers

This tool generated kmers for both strands of the input graph(s) as a tab-separated file (kmer id pos).
General options to vg kmers are -k to print kmers of a specified size in the graph, -t to select the number of threads to use, and -p to show the progress.
There are also special options for GCSA output: -g returns a table that can be used as GCSA2 input, the graph of which can be written in binary format (-B). It is also possible to specify the ID for the GCSA2 head (-H) and/or tail (-T) sentinel node.

locify

There is no explanation for what “locify” means, but it seems to be useful to extract data after mapping. Here are the available options anyway:
It is possible, but apparently not mandatory, to input a file with loci over which to locify alignments (-l), and the alignments can be entered either as rocksdb alignment index (-a) or graph in xg format (-x). Instead of using full Paths, it is also possible to generate names for each allele with -n. Alignments on the reverse strand can be flipped to the forward (-f), and the output can be sorted (´-s´) or filtered to contain only the best alleles (-b). The output can be restricted to loci with only the best alleles kept (-o).

mpmap

There is a very good vg Wiki article on multipath alignments and the usage of vg mpmap. Nevertheless, here is an overview of the options the tool has to offer.

Input graphs and indices are specified as follows: -x for the required xg index, -g for a GCSA2/LCP index pair (also required), -H for a GBWT haplotype index, which is needed to population-based MAPQs, and -d for a snarl distance index for clustering (requires snarl input via -s as well). There are also --linear-index and --linear-path to specify a sublinear Li and Stephens index instead of an GBWT index and to specify the path that the linear index is against.

There are also other input options, like -f for the input FASTQ file(s), or -G for input of a GAM file; both file types can contain interleaved paired ends (-i). It is possible to set sample (-N) and read group (-R) names for the output GAMP (Graph Alignment MultiPath) file, and to use read pairs from the same DNA strand (-e). Instead of a GAMP file, a GAM (Graph Alignment / Map) file can be produced with -S, which then ignores the snarls input file that can be specified with -s. If base quality is not included in the read input, the alignment can be done without it (-A), and if long reads are used as input, alignment scoring can be adjusted with just a single option (-E).

There are also a lot of advanced options to tweak the algorithm or the scoring.
vg mpmap can use a target value search based clusterer (-v), or a minimum distance based clusterer (--min-dist-cluster). You can set a length for an exact match to a snarl from which on no alternate paths will be used for alignment (-X), or limit the number of alternate paths in between MEMs or snarls (-a). When reads are aligned to alternate paths in snarls, it is possible to suppress the production of extra anchors. There are multiple options to tweak the fragment length distribution calculation and mismapping detection, as well as settings for MEMs (maximum exact matches) and SMEMs (super-maximal exact matches). The standard number of mappings per read can be adjusted with -u and the number of mappings to report with -M. Some settings for scoring and penalties are also listed under the algorithm options.
Actual scoring options are -q to set a match score and -z for the mismatch penalty. It is possible to supply an integer substitution scoring matrix (--score-matrix), and to set gap open (-o) and gap extension (-y) penalties. The full length bonus can also be adjusted (-L), as well as removed from reported scores (-m).
Final computational parameters for vg mpmap are the number of threads to use -t and the buffer size (-Z).

msga

We used vg msga in the course already to align multiple smaller sequences, before switching to minimap2 for larger genomes.

Input to build a graph can be sequences in a FASTA file (-f), where it is possible to specify which sequence to include by name (-n) or which to use as base for the graph (-b). It is also possible to include a sequence directly from the command line (-s), or to include a graph (-g). The output graph is usually build starting with the biggest sequence in the FASTA, but vg msga can also use the order inside the file (-a). You can use a BED file to map sequence names to positions on a reference path (-R), and you can expand the context around the BED regions (-T).

To adjust the alignment parameters, it is possible to set the minimal MEM length (-k) or chance (-e). MEMs will be ignored if they have more than a specified number of hits (-c) or if they are longer than a threshold (-Y). vg msga supports re-seeding (-r), and can attempt to align up to a certain number of chains of seeds (-l) and to trace back up to a specified number of chains of bands (-u). Chains can be discarded if they are too short either by a specified threshold (-W) or by a fraction of the longest overlapping chain (-C). An alignment threshold can be set with -P, and a mapping quality threshold for each band with -F. vg msga can skip cluster subgraphs when they are too big (-H), and accepts a specified band/chunk width (-w) for long read alignment, as well as a band overlap (-O). A certain number of bands of insertion can be considered in the alignment chain model (-J) and a specified number of alignments is accepted fro each band in banded alignment (-B). vg msga can additionally consider a number of alternate alignments for the entire sequence, and you can set it to not patch banded alignment by locally aligning unaligned regions (--no-patch-aln).

There are also special options for the local alignment: match score (-q), mismatch penalty (-z) and gap open (-o) and extension (-y) penalties can all be specified, as well as the full-length bonus (-L). Additionally, vg msga can use the X-drop heuristic (--xdrop-alignment), which is much faster for long read alignment, and you can set the maximum gap length that is allowed in each contiguous alignment when using that heuristic (--max-gap-length).

Options for index generation are -K to set the kmer size for building the GCSA index, -E to set the maximum number of edges to reduce GCSA complexity, -Q to prune subgraphs, -m to set a node length limit, and -X for the number of doublings to use.
The output graph can be normalised with the -N option, and circularised with -Z. Also available is a debugging option for the construction (-D) and for the alignment (-A), as well as an alignment progress bar (-S), and -t to set the number of threads to use.

paths

vg paths is another tool I’ve used multiple times before, to list all paths in a graph, or to extract subgraphs based on path names. To use the tool, you can input either a vg format graph (-v), an xg index (-x), or a GBWT index (-g). The output can be a path-only graph covering the selected path(s) (-V), a graph without the selected path(s) (-d), or a graph retaining only the selected path(s) (-r).
It is also possible to print the paths as GAM alignments (-X), or as a list (-L); such a list can be supplemented with the lengths of the paths (-E). The paths can also be printed in FASTA format. Path selection works either by supplying a file with path names (-p), or by setting a prefix that all paths to extract should have (-Q). When working with a GBWT index, you can also select threads for a specified sample (-S). Finally, variant paths added by vg construct -a can also be selected.

recalibrate

vg recalibrate takes a GAM file as input and uses the “mapped_correctly” flags from vg gamcompare to train a model. The model will be saved to a given file name, and the same option (-m) can also be used to load an already existing model. The number of threads to use for this operation is again set with -t.

rna

It is also possible to use vg with RNA-Seq data (as briefly mentioned in the multipath mapping wiki), but there is no special Wiki describing a workflow for that kind of data yet. The vg rna tool creates a spliced graph in vg format, using transcripts in gtf/gff format (-n) and/or intron files in bed format (-m). It can be restricted to parse only a specified feature type from the gtf/gff file (-y) and to use a specified attribute tag as ID (-s). Transcripts can be projected onto haplotypes in a GBWT index file (-l) or onto embedded graph paths (-e).
It is standard to collapse identical transcripts across haplotypes, which can be switched off with (-c), and the standard sorting and compacting can be switched off with -o. Intergenic and intronic sequences can be deleted with -d, reference transcripts can be added as embedded paths with -r and non-reference transcripts with -a.
Output formats can also be specified: -u results in a GBWT file with reference transcripts, a FASTA and an info file (?), and -b creates a GBWT index file with transcripts as threads. Transcripts can also be added to a GBWT file as bidirectional threads (-g), or written as sequence to a FASTA file (-f). The info file is in TSV format and contains transcript origin info (-i). Progress of vg rna is shown with -p, and the number of threads to use is set with -t.

sift

vg sift sifts through a GAM file to select or remove reads with particular properties. It’s general options are -t for the threads to use, -p for paired-end reads, -R for a local re-mapping of soft-clipped, split, or discordant reads, and -o for an output prefix.
Paired-end reads can be selected by their mean insert size (-I) or by the standard deviation of that insert size (-W). Reads can be flagged when only one of the pair is unmapped (-O), when they map to distinct paths (-C), or when they don’t have the expected orientation (-D). Single-end reads can be flagged if their soft-clipped sections are too long (-c), or if they contains sections that reverse within the read (i.e. small inversions) (-r). There is also an option to flag reads with soft-clips that map independently of the anchored portion (-s), but this supposedly requires the -x and -g options, which both don’t exist.
Finally, there is an option to calculate and print the insert size mean and standard deviation every 1000 reads (-1).

simplify

vg simplify is mentioned in the Wiki about indexgin huge datasets, where it is used to remove alternate alleles of rare variants. The tool takes a vg graph as input and returns a simplified graph in the same format without changing the node IDs.
General options are the selection of the algorithm (either “small” or “rare”) (-a), the number of threads to use (-t), and an option to show the progress of the operation (-p). Input (-b) and output (-B) can also be a BED file.
Small snarl simplifier options are -m to remove leaf sites with few bases and -i to perform a specified number of iterations of simplification.
To simplify rare variants, you have to input a VCF file to determine variant frequency (-v). Then you can choose whether you want to remove variants with a total alternative frequency under a certain threshold (-f) or with a total alternate occurrence count under a certain threshold (-c).

sort

vg sort sorts a graph in vg or GFA (-g) format. Available algorithms (select with -a) are “eades”, “max-flow”, “id”, or “topo”; the first two algorithms need a specified reference (-r). For the eades method, grooming can be switched off with -w, and when using IDs to sort, an index file of the sorted graph can be produced with -I.

srpe

OK, here I have absolutely no idea. Usage is vg srpe [options] <data.gam> <graph.vg>, and options are -p for a reference path, -x for the xg index, and -g for the GCSA index.

surject

vg surject is one of the commands I can’t get to work on my data. The tool’s purpose is to transform alignments to be relative to particular paths, which means it basically converts a GAM file to a BAM or CRAM file. An xg index is needed as input in addition to the GAM file (-x), and the number of threads to use is set as usual (-t). The GAM file can also contain interleaved paired-end reads (-i).
It is possible to select one or multiple paths to use as reference as strings (-p) or in a file (-F), while the default is to use all paths. The multipath mapping surjection usually produces global alignments, but can also do local ones (-l).
Output formats are CRAM (-c), BAM (-b), or SAM (-s), and the level of compression can be set with -C. Long deletions can be interpreted as spliced alignments with -S, scoring can be adjusted for base qualities (-A), and reads with long fragment lengths can be marked as not properly paired in SAM/BAM (-f). vg surject can set sample (-N) and read group (-R) names for all reads.

trace

If you want to trace and extract haplotypes from an index, vg trace is the right tool. It takes an xg (-x) or GBWT haplotype (-G) index as input and can be restricted to start at a certain node (-n) with a specified search radius (-d). The haplotype frequency annotation output file name can be specified with -a and it’s possible to create a subgraph in json format instead of protobuf (-j).

translate

vg translate translates alignments or paths using “the translation map”. This map should probably be part of the input (usage is vg translate [options] translation), but there is no specification of a format. Options for this tool are -p to input a file with paths to project into “the from-graph”, -a to project alignments, or -l to project locus descriptions. It is possible to print the from-mapping corresponding to a given JSON mapping (-m), or the from-position corresponding to a JSON position (-P). Finally, a translation overlay file can be given with -o.

vectorize

I’ve used vg vectorize when I tried to find gene presence/absence information, but the result was too big to analyse or keep. In general, this tool is supposed to vectorise a set of alignments to a variety of vector formats. The input is an xg index (-x) and a GAM file with alignments. Additionally, a GCSA2 index can be supplied if MEM sketches are being generated (-g).

vg vectorize can rename every alignment with a specified label (-l) and can print a tab delimited output for use in e.g. R (-f). There is an annotation option to create a header with each node’s or edge’s name and a column with alignment names (-A).
It is possible to create a {0|1|2} vector for covered, reference, or alternative allele (-a), or to output a score vector based on percent identity and coverage (-i).
vg vectorize can also create a format that is friendly to vowpal wabbit, and writes a file with mappings used for vowpal wabbit classes with -M.
A MEM sketch (as mentioned above) can be generated of a given read with -m and -P adds positions to that sketch, while -H ignored MEMs with many hits when extracting positions.

viz

vg viz is a tool I first used when looking at HIV sequences. It creates nice SVG visuals of variant graphs and takes either a graph (-x) or a compressed coverage file (pack file) (-i) as input. When using pack files, you can also name the packs with -n.
The output will be saved to the file name given with -o, and image sizes can be specified with -X for width and -Y for height. CNVs are usually shown as text, but can also be visualised in paths on new rows (-C). Reference paths can be hidden with -P, and DNA sequences can be hidden with -D.

Developer commands

Developer commands are truly a mixed bag - from showing the currently used version of vg to specialised analysis tools.

benchmark

vg benchmark writes benchmarking results to stdout and has only one option: -p to show the progress (but it doesn’t seem to do anything).

bugs

You can use vg bugs to write a bug report for GitHub right in your console. In case vg crashes, part of the error message advises: “Run ‘vg bugs –new’ to report a bug.”

cluster

vg cluster is an analysis tool to find and cluster mapping seeds. It takes a GAM file as input, together with an xg index (-x), a GCSA2/LCP index pair (-g), and/or a minimiser index (-m). The tool also needs a distance index to be used for clustering (-d). It is possible to adjust the maximum number of locations for a minimiser to be considered (-c), and -t sets the number of threads to use.

crash

vg crash is a developer tool to throw and error and test error handling, with the only option being -t to set the number of threads to run.

gaffe

vg gaffe is a bigger tool, used to map unpaired reads using minimisers and gapless extension.

mcmc

This tool find haplotypes based on reads using Markov Chain Monte Carlo methods. It needs three input files: a graph in vg format, a file with snarls and a file with multipath alignments in mgam format, which I haven’t heard of before. Did they mean GAMP, like from vg mpmap? In any case, there are also two options available: -i to set the number of iterations to run, and -s to select a seed for the random number generator.

minimizer

vg minimizer takes a GBWT (-g), an index (-i) and a graph file and builds a minimiser index of the haplotypes in the GBWT index. Minimiser options are -k for the kmer length in the index, and -w for the window size. It is also possible to use other options to insert kmers into the provided index file (-l), to specify that the input graph is a GBWTGraph (-G), to show the progress of the process (-p), or to select the number of threads to use (-t).

test

vg test does what it says, it tests the currently installed version of vg. It runs with no required input, so further option information can only be obtained by using -? or -h.
It is possible to get a list of all/matching test cases (-l) or tags (-t). With -s, successful tests will be included in the output, and -b breaks into the debugger when a test fails. Exception tests can be skipped (-e), invisible syntax like tabs can be shown (-i), and names can be set for an output file (-o), a reporter to use (-r), and a suite (-n).
Testing can be aborted at first failure (-a) or after a set number of failures (-x). Warnings can be shown with -w, and test durations with -d. Tests to run can be selected via a file with test names (-f), and tags can be added to the file name (-#), or a specific section can be selected to run (-c).
Further listing options are --list-test-names-only and --list-reporters, which do what their names say. The test order can be specified with --order as “decl”, “lex”, or “rand”, a seed can be specified for random numbers (rng-seed), and output can be colourised (-use-colour).

validate

vg validate can validate a graph by either checking all aspects of it or specific things that can be selected via the options. -n checks if the expected number of nodes is present, -e verifies that the graph contains all nodes that are referred to by edges, -p checks if path segments are connected by edges, and -o verifies that all nodes have edges. It is also possible to input a GAM file and verify that the edits in the alignment fit on nodes in the graph (-a), this needs an index file as well (-x).

version

vg version prints the version information of the current install - either the full information (including compilation info) or only the version number (-s).

Last updated:
Tags: vg
comments powered by Disqus