Q&A

Table of Contents

vg

Day 1 - protocol

  • How to choose k for GCSA indexing or graph construction?
  • What are the length values in the mapping results?
  • Why do some reads not have a mapping score in vg?

Day 2 - protocol

  • How to decide on a -x setting in minimap2?
  • What is the significance of the additional output of odgi viz?
  • Is there still a pileup option/tool somewhere, if vg augment doesn’t do it?
  • What is the problem with vg augment -i?
  • What’s so tricky about augmenting the graph with long reads? Why don’t they fit to the references better?

Day 3 - protocol

  • Is there another way to visualise a whole genome graph?
    • You can install IVG locally to be able to scroll through a graph. There is also MoMI-G, which I haven’t tested yet.
  • How can I annotate the graph? Is a specific formatting of GFF files required?
    • The files have to follow the GFF specification strictly, at least for the “Name” attribute. All annotations without a “Name” attribute are added to a single unnamed path.
  • How does vg viz work compared to vg view (-d)? Are the nodes sorted differently, and if so, why?
  • How does vg find -N work? Why are more nodes included than were on the list?
  • What exactly happens when multiple paths have the same or no name (i.e. annotation of genes with the same name, or no name)?
  • Did my pipeline to extract sub-graphs for certain genes (like oprD) really extract and include all nodes of the target gene/region?
    • Yes, but annotating a graph with a GFF file suffers from an off-by-one error: the annotation paths lack either the first nucleotide (on the plus strand) or the last nucleotide (on the minus strand).
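
The off-by-one issue mentioned in the last answer is easiest to see with coordinates. GFF uses 1-based, inclusive start and end positions, whereas Python (like many tools) indexes sequences 0-based with half-open ranges. The following is a minimal sketch of that general pitfall, not vg's actual code and not a diagnosis of where vg goes wrong; the function names and the toy sequence are hypothetical.

```python
def gff_to_slice(start, end):
    """Convert 1-based inclusive GFF coordinates to a 0-based half-open slice."""
    return start - 1, end


def extract(seq, start, end):
    """Return the bases covered by a GFF feature spanning start..end."""
    lo, hi = gff_to_slice(start, end)
    return seq[lo:hi]


def extract_off_by_one(seq, start, end):
    """Deliberately wrong: treats the 1-based start as a 0-based index."""
    return seq[start:end]


seq = "ACGTTGCA"                            # toy sequence, made up for illustration
feature = extract(seq, 3, 6)                # bases 3..6 of the sequence
truncated = extract_off_by_one(seq, 3, 6)   # loses the base at position 3

print(feature, truncated)
```

The wrong variant drops the base at the lowest coordinate. For a plus-strand feature that is its first nucleotide; for a minus-strand feature, which is read from the end coordinate backwards, the same base is its last nucleotide.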

Mapping with vg - protocol

  • What is going on with vg surject when a path position could not be identified?
  • Where to get the best workflows/pipelines to use? Are there any good, up-to-date tutorials?
  • How is the mapping influenced by the number of nodes and edges?
  • How can I find unmapped reads?
  • How can I get the variant calling to work?

Pandora

Answers are mostly from Zamin Iqbal and Rachel Colquhoun.

General

  • What are the numbers in the PRG file?
    • They are separators of different paths in the graph.
  • Why are single gene graphs so loopy?
    • This is due to the way Bandage visualises the graphs; a different tool may give a clearer picture.
  • Why don’t we get the same output when mapping single or multiple samples?
    • “There are slightly different output files when running pandora map on a single sample or pandora compare on several. The reason for this is that they are designed to be used in different scenarios. It doesn’t really make sense to run pandora map separately on many samples and then ‘merge’ the VCFs because each will be with respect to a different reference by default. However, we may want to know what gene sequences we see when we only have a single sample and that is why we still have pandora map as an option.”

pandora map

  • Where can I find gene presence/absence information?
    • There should be a matrix file. Presence/absence can also be inferred from the pandora.consensus.fq.gz file, since that contains the mosaic sequences of all genes that were found in the sample.
  • Suggestion for fasta reference for VCF creation when using panX data?
    • A reference FASTA is not necessary to create a VCF file; you just need to use the --output_vcf or --genotype options (additional explanation here).
  • What does the de novo discovery do, exactly?
    • The de novo discovery tool can be used to augment/complement the original graph.
  • What are the graphs created after mapping?
  • What are the warnings “Input vcf_ref path was too short to be the ref” and “Could not find reference sequence in the PRG so using the consensus path” about?
    • “These warnings can safely be ignored. They are being triggered by the default user behaviour which is not to provide a VCF reference. The first of these is already fixed in a pull request I’m waiting to merge in. We will update the code to stop the second being triggered also (or at least call it something other than warning).”
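
As noted in the first answer of this list, the genes present in a single sample can be read off pandora.consensus.fq.gz. The following is a minimal Python sketch under two assumptions that are not confirmed here: the file is a gzipped FASTQ with exactly four lines per record, and the first whitespace-delimited token of each header is the gene name. `genes_in_consensus` is a hypothetical helper, and the demo records are made up.

```python
import gzip
import os
import tempfile


def genes_in_consensus(path):
    """Collect record names from a gzipped four-line-per-record FASTQ file."""
    genes = set()
    with gzip.open(path, "rt") as handle:
        for index, line in enumerate(handle):
            # Header lines are every fourth line and start with "@".
            if index % 4 == 0 and line.startswith("@"):
                tokens = line[1:].split()
                if tokens:
                    genes.add(tokens[0])
    return genes


# Toy demonstration with made-up records written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "pandora.consensus.fq.gz")
    with gzip.open(demo, "wt") as out:
        out.write("@geneA\nACGT\n+\nIIII\n@geneB extra\nTTGA\n+\nIIII\n")
    found = genes_in_consensus(demo)

print(sorted(found))
```

Comparing the returned set against the list of genes in the PRG would give a presence/absence vector for the sample.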

pandora compare

  • What are the sequences in pandora_multisample.vcf_ref.fa?
    • “The sequences in the pandora_multisample.vcf_ref.fa are the ‘reference sequence’ which the VCF is with respect to. Because the reference contains multiple alleles, we have to pick one of them to be the equivalent of the ‘wild type’. These reference sequences are chosen as paths through the graph, aiming to minimize the distance between each sample and this ‘reference’ (so that we get more SNPs in the VCF and fewer long alleles called).”
  • What is the GAPS value in the VCF files?
    • “When we calculate the coverage on an allele, we are actually calculating the coverage on kmers which cover the allele. Similarly, we can look at the fraction of these kmers which have no coverage. This is represented by the GAPS field. If an allele is the true allele, not only do we expect to see (relatively) consistent/high coverage over the allele, we also do not expect to see many kmers with no coverage overlapping that allele.”
  • What does it mean when a variant has almost equal forward and reverse coverage?
    • “For Illumina data, most variants should have almost equal forward and reverse coverage because we expect on average half of reads to have been generated in the forward direction along the genome, and half in the reverse. For Nanopore data, sequencing biases make it more likely to have a skew between the coverage each way.”
  • What is the reference in the VCF file?
    • The reference is the set of sequences in pandora_multisample.vcf_ref.fa.
  • Why does pandora compare leave out the last sample in the tab-separated sample list?
    • It seems the program expects the file to end with a newline (an empty last line).
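
Given the observation in the last answer, one workaround is to make sure the sample list ends with a newline before running pandora compare. The following is a minimal Python sketch; `ensure_trailing_newline` is a hypothetical helper, and the file name and the two-column content of the demo list are made up for illustration.

```python
import os
import tempfile


def ensure_trailing_newline(path):
    """Append a newline if a non-empty file lacks one; return True if changed."""
    with open(path, "rb+") as handle:
        handle.seek(0, os.SEEK_END)
        if handle.tell() == 0:
            return False              # empty file: nothing to fix
        handle.seek(-1, os.SEEK_END)
        if handle.read(1) == b"\n":
            return False              # already ends with a newline
        handle.seek(0, os.SEEK_END)
        handle.write(b"\n")
        return True


# Toy demonstration: a tab-separated list whose last line has no newline.
with tempfile.TemporaryDirectory() as tmp:
    sample_list = os.path.join(tmp, "samples.tsv")
    with open(sample_list, "wb") as out:
        out.write(b"sample1\treads1.fq\nsample2\treads2.fq")
    changed_first = ensure_trailing_newline(sample_list)
    changed_second = ensure_trailing_newline(sample_list)
    with open(sample_list, "rb") as handle:
        content = handle.read()

print(changed_first, changed_second)
```

The helper is idempotent, so it can be run on every sample list without checking first.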