Day 3 - Bacterial Pangenomics
Table of Contents
The exercises of day one and day two were centered around relatively small data / genomes. Now, day three moves on to - bacteria!
The objectives were two-fold: we were provided with whole genome and single gene sequences and were supposed to create a graph of a single gene and build a graph-based pangenome. Then we should have figured out ways to learn if a gene is present or absent in a strain, and identify a set of genes present in all strains (the core genome) versus genes that are missing in some strains (accessory genes).
In the course we were working with E. coli data that Mikko had prepared for us. I am going to work with Pseudomonas aeruginosa instead, since the objectives (except maybe for the single gene graph) are my work goals as well, so I can use this to start working on my own data.
Getting started
I have a protocol from the course as inspiration, in addition to the actual course materials, where we extracted sequences of gyrA from all provided E. coli strains to create a graph, and then moved on to create a whole genome graph using 10 random E. coli strains. The latter we used to map the gyrA fasta sequence to see where it was located in the graph, and we also mapped provided short reads and all gene sequences of a not included strain to the “pangenome”.
Pseudomonas aeruginosa references
I’m going to do something similar here. I’ll start with a few reference sequences from pseudmonas.com; there are 208 strains with complete genomes included in the database right now, while NCBI lists 209 complete genomes. We usually use P. aeruginosa PA14 as reference, or PAO1 (which is the recommended standard), and in phylogenetic trees we use strains from the popular five on pseudomonas.com (PAO1, PA14, LESB58, PA7). In general, it seems that most sets of P. aeruginosa strains can be divided into “PA14-like” and “PAO1-like” strains and a small group of outliers related to PA7. The PAO1 group usually includes most other reference strains as well (e.g. here, here, or here).
“Pangenome” graph based on five reference genomes
To start, I will create a genome graph based on the top five P. aeruginosa reference strains, and depending on how long that takes I can then think about using all 208 strains instead. Since P. aeruginosa has a well conserved genome, it might not even be necessary to use so many whole genomes - augmenting the starting graph with missing sequences and variations could also be a good strategy.
mkdir day3
cd day3/
mkdir pics
mkdir references
cd references/
wget http://pseudomonas.com/downloads/pseudomonas/pgd_r_19_1/Pseudomonas_aeruginosa_PAO1_107/Pseudomonas_aeruginosa_PAO1_107.fna.gz
wget http://pseudomonas.com/downloads/pseudomonas/pgd_r_19_1/Pseudomonas_aeruginosa_UCBPP-PA14_109/Pseudomonas_aeruginosa_UCBPP-PA14_109.fna.gz
wget http://pseudomonas.com/downloads/pseudomonas/pgd_r_19_1/Pseudomonas_aeruginosa_LESB58_125/Pseudomonas_aeruginosa_LESB58_125.fna.gz
wget http://pseudomonas.com/downloads/pseudomonas/pgd_r_19_1/Pseudomonas_aeruginosa_PA7_119/Pseudomonas_aeruginosa_PA7_119.fna.gz
wget http://pseudomonas.com/downloads/pseudomonas/pgd_r_19_1/Pseudomonas_aeruginosa_PAK_6441/Pseudomonas_aeruginosa_PAK_6441.fna.gz
gunzip *.gz
Case in point: the group that created PPanGGOLiN, a different pangenome graph approach, analysed 3958 P. aeruginosa genomes from GenBank and found a quite large “soft core” genome (as shown in the publication).
Graph creation with minimap2 and seqwish
As I’ve already seen in day two that using vg msga
can be quite slow when creating a graph based on multiple genomes, I’m going to use the minimap2+seqwish approach again. Since the genomes are circular in Pseudomonas anyway, there shouldn’t be any confusion about that this time.
First, I have to combine the five fasta sequence files into one multi-fasta file:
for i in $(ls *fna)
do
cat $i >> FivePsae.fasta
done
Now I can run minimap2 to do the alignment and then use seqwish to create a graph from that:
minimap2 -X -c -x asm20 FivePsae.fasta FivePsae.fasta | gzip > FivePsae.paf.gz
seqwish -s FivePsae.fasta -p FivePsae.paf.gz -b FivePsae.work -g FivePsae.gfa | pv -l
I’m still using the -x asm20
for lack of a better idea. Very useful is also the option to pipe in gzip to create a compressed output and save some space.
[M::mm_idx_gen::1.495*1.00] collected minimizers
[M::mm_idx_gen::1.695*1.23] sorted minimizers
[M::main::1.695*1.23] loaded/built the index for 5 target sequence(s)
[M::mm_mapopt_update::1.796*1.21] mid_occ = 100
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 0; #seq: 5
[M::mm_idx_stat::1.860*1.21] distinct minimizers: 2341228 (48.14% are singletons); average occurrences: 2.504; average spacing: 5.506
[M::worker_pipeline::19.029*2.24] mapped 5 sequences
[M::main] Version: 2.17-r954-dirty
[M::main] CMD: minimap2 -X -c -x asm20 FivePsae.fasta FivePsae.fasta
[M::main] Real time: 19.072 sec; CPU: 42.664 sec; Peak RSS: 0.868 GB
Creating the alignment was again very fast (19 seconds), so I believe that adding at least a few more genomes wouldn’t be a problem. Since seqwish doesn’t include output with run time, I’ve included | pv -l
in my call to be able to accurately stop the time.
0 0:01:09 [ 0 /s] [<=> ]
This is a pretty handy tool, and shows that seqwish is also really fast (one minute) for the five reference genomes.
Graph visualisation
Of course I now want to take a look at my P. aeruginosa mini-pangenome! I’ll start with odgi, but I’m probably going to use vg viz
as well.
odgi build -g FivePsae.gfa -o - | odgi sort -i - -o FivePsae.og
odgi viz -i FivePsae.og -o ../pics/FivePsae.png -x 2000 -R
Again, odgi viz
gives out some additional information about the paths in the graph:
path 0 refseq|NC_011770|chromosome 0.65 0.152778 0.197222 249 58 75
path 1 refseq|NC_009656.1|chromosome 0.254032 0.254032 0.491935 97 97 188
path 2 refseq|NZ_CP020659.1|chromosome 0.474088 0.0690979 0.456814 181 26 175
path 3 refseq|NC_002516.2|chromosome 0.460952 0.48 0.0590476 176 184 23
path 4 refseq|NC_008463.1|chromosome 0.430809 0.0443864 0.524804 165 17 20
Pangenome graph based on five references
There are a lot of edges in this graph, and while odgi viz
creates a nice overview of this, I think vg viz
is better if you want to see some details (like which genome is which).
vg view -F FivePsae.gfa -v > FivePsae.vg
vg index -x FivePsae.xg FivePsae.vg
vg viz -x FivePsae.xg -o ../pics/FivePsae.svg
These steps take a little time, but now I have an SVG file that I can load into Inkscape to check out… and it’s empty. I can’t see any graph inside, just the border box that’s usually around the image. I wonder where the graph got lost - let’s find it!
File sizes of my different graph formats:
format | size |
---|---|
GFA | 97M |
OG | 92M |
VG | 28M |
XG | 136M |
Hmm, the vg graph is a lot smaller than the other two, but the index is big and at least for day two, the graph created with seqwish and then converted to vg looks similar (size wise). The conversion had worked there, too, as I was able to map reads to the converted graph.
What does the vg file look like, then? Let’s ask jq:
vg view FivePsae.vg -j | jq ".path"
Great, paths are there just fine. Maybe the whole graph is too big for vg viz
? I’ll try looking at just a few nodes instead.
Since jq listed a bunch of nodes to the console for me, I can just choose a number from there:
vg find -n 668831 -x FivePsae.xg -c 10 | vg view -dp - | dot -Tpdf -o ../pics/FivePsae_668831.pdf
That is also working perfectly: DOT version of node 668831 and surroundings (PDF)
So now I know that the graph and the index are fine, and it’s vg viz
that… can’t cope with the amount of data?
vg find -n 668831 -x FivePsae.xg -c 10 | vg view -v - > FivePsae_668831.vg
vg index -x FivePsae_668831.xg FivePsae_668831.vg
vg viz -x FivePsae_668831.xg -o ../pics/FivePsae_668831.svg
Apparently yes, since this worked fine as well:
vg viz version of node 668831 and surroundings
There is yet another visualisation tool that I have not yet tested, but we’re very interested in: IVG, which creates sequence tube maps. Sadly, this online tool can only take uploads of up to 5 MB, and apparently it can’t deal with graphs that only contain a few extracted nodes, either:
terminate called after throwing an instance of 'std::runtime_error' what(): Attempted to get handle for node 101 not present in graph ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug. Stack trace path: /tmp/vg_crash_aQ6V67/stacktrace.txt
Apparently my only way to look at the graph is using odgi viz
(or just looking at parts and not the whole thing). Since the additional output of odgi viz
lists the path identities, at least I believe I know which genome is which:
path | colour | RefSeq | name |
---|---|---|---|
0 | red | NC_011770 | LESB58 |
1 | blue | NC_009656.1 | PA7 |
2 | pink | NZ_CP020659.1 | PAK |
3 | green | NC_002516.2 | PAO1 |
4 | purple | NC_008463.1 | PA14 |
Scrolling through the graph, there are multiple regions visible which are specific for a single genome, mostly for LESB58, but also for PA7 and PAK. I’m a little surprised that LESB58 seems to be so different from the others, since it usually clusters with PAO1 and PAK, while PA7 is always shown as a - very different - outlier. Of course, the coloured paths are basically just the multiple sequence alignment and the edges show the real genome composition in the end - only they are very hard to interpret in this format.
Annotating the graph
Before I move on to mapping samples to the graph, I would like to be able to identify genes and other features of the reference genomes in the graph. Similar to my analysis of HIV I could then also look at “variation” in the genes, based on the number of nodes per gene. My problem here is that I have five annotated references that make up the graph, and many of the annotated genes will be the same (remember, big soft core genome). Since vg gives each gene its own path, I assume it would be a bit too much to augment the graph with all five complete annotations. Instead, I want to try and create one annotation for all five references.
First, I’ll download the annotations for all five in GFF3 format, since vg needs GFF (or BED, but that’s not listed on pseudomonas.com):
mkdir annotations
cd annotations
wget http://pseudomonas.com/downloads/pseudomonas/pgd_r_19_1/Pseudomonas_aeruginosa_PAO1_107/Pseudomonas_aeruginosa_PAO1_107.gff.gz
wget http://pseudomonas.com/downloads/pseudomonas/pgd_r_19_1/Pseudomonas_aeruginosa_UCBPP-PA14_109/Pseudomonas_aeruginosa_UCBPP-PA14_109.gff.gz
wget http://pseudomonas.com/downloads/pseudomonas/pgd_r_19_1/Pseudomonas_aeruginosa_LESB58_125/Pseudomonas_aeruginosa_LESB58_125.gff.gz
wget http://pseudomonas.com/downloads/pseudomonas/pgd_r_19_1/Pseudomonas_aeruginosa_PA7_119/Pseudomonas_aeruginosa_PA7_119.gff.gz
wget http://pseudomonas.com/downloads/pseudomonas/pgd_r_19_1/Pseudomonas_aeruginosa_PAK_6441/Pseudomonas_aeruginosa_PAK_6441.gff.gz
gunzip *.gz
Since I don’t want to do a homology search or anything, I’ll keep it “simple”: I’ll look for gene names and if they are the same in multiple files, I’ll keep only one of them. I will also sort through the gene/CDS duplicates and probably only keep the gene entries if start and stop are equivalent to a CDS entry.
There will be another problem, though: the GFF files I downloaded start every entry with “chromosome” as seqid instead of a valid RefSeq or other identifier for the genomes. I don’t think that will work with vg annotate
, but I’ll check first:
vg annotate -x FivePsae.xg -f annotations/Pseudomonas_aeruginosa_PAO1_107.gff > FivePsaePAO1.gam
warning: path "chromosome" not found in index, skipping
No, it’s not working. So here is what I have/want to accomplish using a Python script:
- exchange “chromosome” with an actual path/reference name
- maybe change the path names from “
refseq|NC_011770|chromosome
” to just “NC_011770” using jq first?
- maybe change the path names from “
- remove duplicate gene/CDS entries
- filter genes in different references so every gene is only annotated once (or at least most genes are)
Preparing the annotation
I decided to keep the path names as they are to save some time. The Python script I wrote (using Python 3) is called CleanMergeGff.py
and resides in the same directory as the annotation files. The script reads a file listing GFF files to clean and combine, ignores features that are not needed, removes duplicate gene/CDS entries and only keeps one gene annotation for all references (based on gene name).
python3 CleanMergeGff.py annot_list.tab
wc -l PseudomonasAnnotation.gff
25187 PseudomonasAnnotation.gff
Each single annotation file contains between 11,000 and 13,000 entries, while the combined annotation file contains only 25,000 hopefully unique entries. Now let’s add those to the variant graph.
cd ../
vg annotate -x FivePsae.xg -f annotations/PseudomonasAnnotation.gff > FivePsaeAnnot.gam
vg augment FivePsae.vg -i FivePsaeAnnot.gam > FivePsaeAnnot.vg
vg index -x FivePsaeAnnot.xg FivePsaeAnnot.vg
The annotation is really fast, while augmentation and indexing take some time. Again, we can’t possibly visualise the whole graph with all the paths, so I have to select a region I want to visualise. This can be done using vg paths
, which - I believe - can select paths by name from a variant graph:
vg paths -x FivePsaeAnnot.xg -X -Q "gyrA" > FivePsaeAnnotgyrA.gam
Again (similar to vg annotate
in day two, vg paths
does not create a vg output file even when given the -V
option - it turns out as gam anyway, so I’m using -X
(to create a gam file) just to be consistent. The output file does not contain any paths, though - interesting…
vg paths -x FivePsaeAnnot.xg -L
This command is supposed to list all paths in the file. The output is this:
refseq|NC_002516.2|chromosome
refseq|NC_008463.1|chromosome
refseq|NC_009656.1|chromosome
refseq|NC_011770|chromosome
refseq|NZ_CP020659.1|chromosome
Since vg annotate
did not complain about the annotation format or seqid (first column in the gff), I assumed that everything worked this time, but the annotation does not seem to be there.
What is inside FivePsaeAnnot.gam, then?
vg view -a FivePsaeAnnot.gam | jq '.path' > FivePsaeAnnotPaths.json
I tried to have a look at the gam file after conversion to json using tidyjson, but I get an error in return:
Error: parse error: trailing garbage
"393756" } } ] } { "mapping": [ { "e
(right here) ------^
Something seems to have gone wrong in this file somewhere, even though there was no previous error message.
Does something similar happen when I just try to annotate one of the references?
In order to figure that out, I have to adjust an annotation file to have the right seqid (fitting one of the paths in the variant graph), which I’m just going to do manually with the PAO1 annotation.
vg annotate -x FivePsae.xg -f annotations/Pseudomonas_aeruginosa_PAO1_107_corrected.gff > FivePsaePAO1Annot.gam
vg augment FivePsae.vg -i FivePsaePAO1Annot.gam > FivePsaePAO1Annot.vg
vg index -x FivePsaePAO1Annot.xg FivePsaePAO1Annot.vg
vg paths -x FivePsaePAO1Annot.xg -L
Pseudomonas aeruginosa PAO1 refseq|NC_002516.2|chromosome, complete genome.
refseq|NC_002516.2|chromosome
refseq|NC_008463.1|chromosome
refseq|NC_009656.1|chromosome
refseq|NC_011770|chromosome
refseq|NZ_CP020659.1|chromosome
OK, so now I have one single path from the annotation, that is new. I did exactly the same thing for the HIV data, and when I use vg paths -L
on these data, I get a list of two paths per gene, just as I saw in the visualisation. Why is this not happening here? It’s not like I updated vg or anything, so the software didn’t change.
For some reason vg annotate
took only the first line from my PAO1 gff file and annotated that into the graph, right? I think I’d like to have a look at that…
vg paths -x FivePsaePAO1Annot.xg -X -Q "Pseudomonas aeruginosa PAO1 refseq|NC_002516.2|chromosome, complete genome." > FivePsaePAO1Annotpath.gam
Since this file is now supposed to only contain parts of the graph that contain the path I chose, I should be able to get any random node ID from it (or the json version of it) and visualise that.
vg view -a FivePsaePAO1Annotpath.gam | jq '.path'
vg find -n 489302 -x FivePsaePAO1Annot.xg -c 10 | vg view -v - > FivePsaePAO1Annot_489302.vg
vg index -x FivePsaePAO1Annot_489302.xg FivePsaePAO1Annot_489302.vg
vg viz -x FivePsaePAO1Annot_489302.xg -o ../pics/FivePsaePAO1Annot_489302.svg
graph path 'Pseudomonas aeruginosa PAO1 refseq|NC_002516.2|chromosome, complete genome.' invalid: edge from 489292 start to 489312 end does not exist
graph path 'Pseudomonas aeruginosa PAO1 refseq|NC_002516.2|chromosome, complete genome.' invalid: edge from 489292 start to 489312 end does not exist
[vg view] warning: graph is invalid!
Despite a warning from the vg find
command, this code works:
vg viz version of node 489302 and surroundings
Right, so in general I know what I have to do. Now the question remains why it doesn’t work as expected, and I have an idea - there is one interesting difference between the gff file I downloaded for HIV and the one for PAO1 (or any of the other reference strains): every line in the bacterial annotation ends with a semicolon (semicolons separate the entries in the attributes string), while the HIV annotation does not have a semicolon at the end of a line. Maybe that is taken as a sign that the annotated entry is not finished yet?
I’m going to remove the trailing semicolons from the already corrected PAO1 annotation and try again:
vg annotate -x FivePsae.xg -f annotations/Pseudomonas_aeruginosa_PAO1_107_corrected.gff > FivePsaePAO1Annot.gam
vg augment FivePsae.vg -i FivePsaePAO1Annot.gam > FivePsaePAO1Annot.vg
vg paths -v FivePsaePAO1Annot.vg -L
refseq|NC_002516.2|chromosome
refseq|NC_008463.1|chromosome
refseq|NC_009656.1|chromosome
refseq|NC_011770|chromosome
refseq|NZ_CP020659.1|chromosome
Interesting! We’re down again to the five original paths, with nothing new added. This must somehow be related to the format of my files, but I struggle to see it. Both the HIV annotation and my file have LF line breaks according to Notepad++, both are tab separated files with the same number of columns.
In the attributes column there is one more difference I only noticed now after checking out the GFF specifications: their tags usually start with a capital letter, which is only partly true in the HIV annotation (but true for all features that were annotated in the graph) and only partly true in PAO1, where the official (?) “name” tag does not start with a capital “N” as expected, except for in the first entry… Could it be that stupid?
vg annotate -x FivePsae.xg -f annotations/Pseudomonas_aeruginosa_PAO1_107_corrected.gff > FivePsaePAO1Annot.gam
vg augment FivePsae.vg -i FivePsaePAO1Annot.gam > FivePsaePAO1Annot.vg
vg paths -v FivePsaePAO1Annot.vg -L
And suddenly I have a loooong list of paths in my annotated graph! I need emojis in this protocol…
I still don’t understand why the version without trailing semicolons didn’t even get the first feature annotated, but that’s a question for another day and for now I just hope it’s not a relevant one.
For now, I’ve adjusted my Python script to correct the formatting of my gff files:
python3 CleanMergeGff.py annot_list.tab
cd ../
vg annotate -x FivePsae.xg -f annotations/PseudomonasAnnotation.gff > FivePsaeAnnot.gam
vg augment FivePsae.vg -i FivePsaeAnnot.gam > FivePsaeAnnot.vg
vg paths -v FivePsaeAnnot.vg -L
Working with an annotated graph
It worked!
Time to look at some genes. The gene list I just got ended with “zwf”, “zwf1”, and “zwf2”, I’m curious enough about that to start with those.
vg index -x FivePsaeAnnot.xg FivePsaeAnnot.vg
vg paths -x FivePsaeAnnot.xg -X -Q "zwf" > FivePsaeAnnotZwf.gam
vg view -a FivePsaeAnnotZwf.gam | jq '.path' > FivePsaeAnnotZwf_path.jq
Extracting the paths for those genes (-Q
in vg paths
sets the “name prefix”, so I can find all three genes) is really quick, and there don’t seem to be many nodes involved.
Luckily, we extracted all nodes for certain paths on day 4 of the course and I have the jq command to do it (I don’t know how long it would have taken me to figure it out by myself).
vg view -a FivePsaeAnnotZwf.gam | jq '{name: .name, nodes: ([.path.mapping[].position.node_id | tostring] | join(","))}'
zwf1 has by far the most nodes, and I think I would have to load this into R, or at least sort it somewhere to get an overview, because the node IDs jump around a lot (e.g. “1109132,1019278,688979,1019279,688981,1019280,688983”). I’ll just have a look at a random location:
vg find -n 1019278 -x FivePsaeAnnot.xg -c 10 | vg view -v - > FivePsaeAnnot_1019278.vg
vg index -x FivePsaeAnnot_1019278.xg FivePsaeAnnot_1019278.vg
vg viz -x FivePsaeAnnot_1019278.xg -o ../pics/FivePsaeAnnot_1019278.svg
vg view -dp FivePsaeAnnot_1019278.vg | dot -Tpdf -o ../pics/FivePsaeAnnot_1019278.pdf
vg viz version of node 1019278 and surroundings
DOT version of node 1019278 and surroundings (PDF)
Cool, it worked! I already wish I could select which tag will be used as path name, though… Gene names are fine (although it would be cool to also have the information in which genome this gene is now officially annotated), but the protein/product names are a bit annoying.
Beside that, we can easily see (in the SVG/PNG version) that this is not the best way to look at specific genes inside the graph - the nodes are sorted by node ID, and the list of IDs already showed that the gene itself is not represented in a linear set of nodes.
My chosen example, zwf1 (the purple path at the bottom), is a gene from PA7 (NC_009656.1, the light blue middle of the reference paths), but does - even in this small example - not touch all nodes that the PA7 genome touches.
The DOT version looks quite different from the SVG version. Here, the node IDs are displayed, which helps in navigating the graph. Together with other genes (cdhA, phzC2, pys2, and QueF), the path for zwf1 only starts in the middle of the depicted region. This is what I would actually have expected, since the gene starts at node 1109132 and then touches node 1019278, which is the node I selected (and therefore the numeric middle of the displayed region). I wonder how vg viz
decides what to display and in which order…
So, being aware of the different visualisation methods and their quirks is important if you want to visually inspect your graph (or parts of it).
With regards to the actual content of these visualisations - I think I would really have to see the whole gene. Why? Because of all the other genes starting(?) at the same node - none of them are listed as orthologs for zwf1 on pseudomonas.com. I have to see the whole gene.
Luckily, vg find
could actually be able to do that, using the -N
(node list file) option instead of -n
(node ID). Since I don’t want to create another GAM file for only zwf1, I’m going to write all three paths to the file and manually remove the two I’m not interested in right now.
vg view -a FivePsaeAnnotZwf.gam | jq '{name: .name, nodes: ([.path.mapping[].position.node_id | tostring] | join(","))}' > zwf_nodes.txt
vg find -N zwf1_nodes.txt -x FivePsaeAnnot.xg -c 10 | vg view -v - > FivePsaeAnnot_zwf1.vg
vg index -x FivePsaeAnnot_zwf1.xg FivePsaeAnnot_zwf1.vg
vg viz -x FivePsaeAnnot_zwf1.xg -o ../pics/FivePsaeAnnot_zwf1.svg
vg view -dp FivePsaeAnnot_zwf1.vg | dot -Tpdf -o ../pics/FivePsaeAnnot_zwf1.pdf
graph path 'GDP-mannose 4,6-dehydratase (pseudogene)' invalid: edge from 1142390 start to 1120667 end does not exist
[vg view] warning: graph is invalid!
Again I got a warning when using vg find
- or, more correctly, vg view
, as I got the same error when converting to DOT format - to create a sub-graph. I assume that not all nodes for this specific path were included in my list. Creating the DOT format also takes a lot of time and cannot be recommended for this (sub-)graph size any more. Creation of an SVG is quick, but opening it in Inkscape is not.
A quick glance using Internet Explorer (I am forced to use this to remotely connect to our server in Braunschweig) shows that there are a lot of paths and a lot of edges between the nodes which presumably make rendering difficult for many programs.
Since it will be hard to analyse this visually/manually, I’ll try to look at this programmatically. My nodespergene.R script from day 2 is a good start for that, using the FivePsaeAnnot_zwf1.vg file that I also used for the visualisation of the sub-graph.
vg view FivePsaeAnnot_zwf1.vg -j > FivePsaeAnnot_zwf1.json
jq '.path' FivePsaeAnnot_zwf1.json > FivePsaeAnnot_zwf1_paths.json
graph path 'GDP-mannose 4,6-dehydratase (pseudogene)' invalid: edge from 1142390 start to 1120667 end does not exist
[vg view] warning: graph is invalid!
Is it bad that I’m getting used to ignoring these kinds of errors?
I wrote I quick R script to have a look at the paths in this sub-graph - or more specifically at zwf1, cdhA, phzC2 and pys2.
Venn diagram of zwf1, cdhA, phzC2 and pys2
Each gene has at least one node that distinguishes it from the others. Additionally, zwf1 has a lot more nodes than the other three genes.
Looking at all paths included in this sub-graph, zwf1 has by far the most nodes (3329, compared to 647 of the gene with the next most nodes), except for the reference genomes, which all contain more than 4000 nodes here. This is interesting, since I set vg find
to use the nodes touched by zwf1, but obviously more nodes are included…?
Caveats
As I’ve already seen when I extracted zwf1 from the whole graph, the nodes that make up this gene’s paths are not in a numerical order. Since the node IDs are so far apart (1109132,1019278,688979,…), I assume that the nodes are created per genome/path, so zwf1 touches nodes that came with different references. The problem I see here is with my approach to merge the annotations - for example, I have now one annotated version of oprD, but I know there are phylogenetic differences in the different strain versions. So how well would I be able to find “everything” that is oprD in such an approach?
To test this, and to generally get a “data-driven” feeling for this problem, I’m going to create a mini-annotation with all versions of oprD (manually) and annotate the graph with that.
Note: oprD is not a core gene, and PAK has a number of “OprD family” annotated genes, but none with the gene name oprD, so I’ll leave that strain out for now.
vg annotate -x FivePsae.xg -f annotations/oprD_annotation.gff > FivePsaeoprD.gam
vg augment FivePsae.vg -i FivePsaeoprD.gam > FivePsaeoprD.vg
vg index -x FivePsaeoprD.xg FivePsaeoprD.vg
vg paths -x FivePsaeoprD.xg -X -Q "oprD" > FivePsaeoprD_path.gam
vg view -a FivePsaeoprD_path.gam | jq '{name: .name, nodes: ([.path.mapping[].position.node_id | tostring] | join(","))}' > oprD_nodes.txt
Great, I have the gam file with only the oprD path in it. Wait, with only one path? Well, yes, since the name is always the same, the information is… what? Merged into a single path? Or is the path always being replaced? Let’s check!
I replaced the “Name” in the annotation with the “Name”” and the “Alias” (the locus tag) separated by an underscore (so that I’m able to extract all the paths with “oprD” in them), so now I should be able to generate four paths.
vg annotate -x FivePsae.xg -f annotations/oprD_annotation2.gff > FivePsaeoprD2.gam
vg augment FivePsae.vg -i FivePsaeoprD2.gam > FivePsaeoprD2.vg
vg index -x FivePsaeoprD2.xg FivePsaeoprD2.vg
vg paths -x FivePsaeoprD2.xg -X -Q "oprD" > FivePsaeoprD2_path.gam
vg view -a FivePsaeoprD2_path.gam | jq '{name: .name, nodes: ([.path.mapping[].position.node_id | tostring] | join(","))}' > oprD2_nodes.txt
OK, now I can extract the paths and nodes for another visualisation.
vg view FivePsaeoprD.vg -j > FivePsaeoprD.json
jq '.path' FivePsaeoprD.json > FivePsaeoprD_paths.json
graph path 'oprD' invalid: edge from 1104943 end to 1104948 start does not exist
graph path 'oprD' invalid: edge from 1104943 end to 1104947 start does not exist
graph path 'oprD' invalid: edge from 1104941 end to 1104948 start does not exist
[vg view] warning: graph is invalid!
Ah, here we go again with the warnings. Will they come up with the second version as well? That would be a hint, indicating that maybe the path does get overwritten when the same name appears multiple times (when they don’t appear with properly separated paths).
vg view FivePsaeoprD2.vg -j > FivePsaeoprD2.json
jq '.path' FivePsaeoprD2.json > FivePsaeoprD2_paths.json
No error this time. What happens in the other version, then? The node IDs for all versions of oprD are saved somewhere, but not in the path information I extracted, so nodes are missing?
vg find -N oprD_nodes_clean.txt -x FivePsaeoprD.xg -c 10 | vg view -v - > FivePsaeoprD_nodes.vg
This results in a lot of errors. A few excerpts of all the different kinds are listed below:
[vg] warning: node ID 495199 appears multiple times. Skipping.
[vg] warning: node ID 495200 appears multiple times. Skipping.
[vg] warning: node ID 931098 appears multiple times. Skipping.
[vg] warning: edge 495199 end <-> 495200 start appears multiple times. Skipping.
[vg] warning: edge 931097 end <-> 495200 start appears multiple times. Skipping.
[vg] warning: edge 495200 end <-> 495201 start appears multiple times. Skipping.
[vg] warning: path oprD rank 39 appears multiple times. Skipping.
[vg] warning: path oprD rank 40 appears multiple times. Skipping.
[vg] warning: path oprD rank 42 appears multiple times. Skipping.
[vg] warning: path refseq|NC_002516.2|chromosome rank 13 appears multiple times. Skipping.
[vg] warning: path refseq|NC_002516.2|chromosome rank 14 appears multiple times. Skipping.
[vg] warning: path refseq|NC_002516.2|chromosome rank 15 appears multiple times. Skipping.
graph path 'oprD' invalid: edge from 1104943 end to 1104948 start does not exist
graph path 'oprD' invalid: edge from 1104943 end to 1104947 start does not exist
graph path 'oprD' invalid: edge from 1104941 end to 1104948 start does not exist
[vg view] warning: graph is invalid!
So the information is not overwritten in the xg index, but maybe in the json file? I’m only guessing here, though.
vg index -x FivePsaeoprD_nodes.xg FivePsaeoprD_nodes.vg
vg viz -x FivePsaeoprD_nodes.xg -o ../pics/FivePsaeoprD_nodes.svg
vg view -dp FivePsaeoprD_nodes.vg | dot -Tpdf -o ../pics/FivePsaeoprD_nodes.pdf
graph path 'oprD' invalid: edge from 1104943 end to 1104948 start does not exist
graph path 'oprD' invalid: edge from 1104943 end to 1104947 start does not exist
graph path 'oprD' invalid: edge from 1104941 end to 1104948 start does not exist
[vg view] warning: graph is invalid!
Warning: Could not load "/usr/bin/miniconda3/lib/graphviz/libgvplugin_pango.so.6" - It was found, so perhaps one of its dependents was not. Try ldd.
Warning: Could not load "/usr/bin/miniconda3/lib/graphviz/libgvplugin_pango.so.6" - It was found, so perhaps one of its dependents was not. Try ldd.
Format: "pdf" not recognized. Use one of: canon cmap cmapx cmapx_np dot dot_json eps fig gv imap imap_np ismap json json0 mp pdf pic plain plain-ext png pov ps ps2 svg svgz tk vdx vml vmlz xdot xdot1.2 xdot1.4 xdot_json
Again the invalid paths come up here, and apparently we now have problems with GraphViz so I can’t use the dot format. I’ll try to resolve this after creating the same sub-graph for all four paths of oprD.
vg find -N oprD2_nodes_unique.txt -x FivePsaeoprD2.xg -c 10 | vg view -v - > FivePsaeoprD2_nodes.vg
vg index -x FivePsaeoprD2_nodes.xg FivePsaeoprD2_nodes.vg
vg viz -x FivePsaeoprD2_nodes.xg -o ../pics/FivePsaeoprD2_nodes.svg
This ran without any error messages, so I now have the svg files for both versions of oprD (one path and four).
Start of the vg viz
representation of a sub-graph with a merged oprD path
Start of the vg viz
representation of a sub-graph with four oprD paths
This graph representation nicely shows that nodes are covered multiple times by the merged oprD path. It’s a bit difficult to directly compare it to the graph with the four oprD paths, since the order of the nodes is different, but I think it’s safe to presume that the number of repeats for a node correlates to the number of actual paths going through it in the second graph.
I also used the Sequence Tube Map online tool to visualise my sub-graphs in a different format. Here, the nodes are in the same order, but the number of repetition per node is not shown.
Start of the IVG representation of a sub-graph with a merged oprD path
Start of the IVG representation of a sub-graph with four oprD paths
When scrolling through the graph with only one oprD path, it’s interesting to see that the path is not following a single reference path, but instead seems to jump from one genome to the next. I assume that (at least in this tool?) part of the information does get overwritten, so if at one position the path should hit multiple nodes, only one is selected (but how?). In these cases, the vg viz
representation is much clearer - if not all paths hit a node, the number of hits/repetitions for oprD goes down as well. Since the genomic position information is missing here, there is no conflict with two nodes being touched simultaneously.
Differences in visual representation aside, my main interest here was to see which nodes are touched by oprD and if a merging of reference annotations makes sense or not.
It doesn’t. If I want to know which “type” of oprD I have in my sample, I want to be able to evaluate all known paths the gene can take, so I need to know all nodes that were touched. This means two things:
- I cannot merge the reference annotations, so there will be a lot more paths than I hoped.
- I have to adjust the annotations again, to avoid duplicate gene names.
What’s the problem with duplicate gene names? They would basically lead to what I wanted in the beginning: one path instead of multiple. The merged path of oprD contains all nodes from the four separate paths, so I wouldn’t loose any information. On the other hand, I’d gain a lot of error messages for duplicates (which I could clean up, of course), and visualisation would be even more tricky. Mostly, though, I like to easily be able to see from which reference the path I’m hitting with an isolate originates (e.g. “is this the PA14 oprD or the PAO1 oprD?").
Annotating everything
Well then, time to annotate everything and see how that works! I’ll write a CleanCombineGff.py
script based on my previous annotation merging script for that.
Side note 1: A realisation I had while preparing the script for the new annotation is that there are actually a lot of times when the “Name” attribute in my GFF files is empty, which could in theory have led to empty path names and multiple paths being joined into one.
Side note 2: There were multiple problems with my annotation files, usually due to semicolons in the attributes where no semicolons should go, so I had to manually edit files in order for the script to run.
cd annotations/
python CleanCombineGff.py annot_list.tab
cd ..
vg annotate -x FivePsae.xg -f annotations/PseudomonasAnnotationAll.gff > FivePsaeAnnotAll.gam
Error parsing gtf/gff line 28618: refseq|NC_008463.1|chromosome P
Error parsing gtf/gff line 28717: refseq|NC_008463.1|chromosome P
Error parsing gtf/gff line 29480: refseq|NC_008463.1|chromosome P
There were a lot more of these error lines, so apparently I made a mistake in the annotation creation. I don’t know how this happened, but apparently I have a lot of lines that end at the “P” shown in the error message, instead of containing a whole annotation line. Huh, that is a strange outcome.
I probably should have used python3
instead of python
when calling the script, as I got a new error from the script now (because I used “iteritems” instead of “items” at one point). This alone did not resolve the problem, though.
Which is no surprise, really, as the problem was in the last part of my code. I either have a string or a list item to write to my output file, and when I treat both as a list item (and index with “[0]", as I did), I will only get the first letter from the string. I changed my list comprehension to get a string from there as well to avoid this problem. This is one of a few parts of the script I’m not happy with, as it’s so specialised to these GFF files, but I can’t think of a generalisation right now.
cd annotations/
python3 CleanCombineGff.py annot_list.tab
cd ..
vg annotate -x FivePsae.xg -f annotations/PseudomonasAnnotationAll.gff > FivePsaeAnnotAll.gam
vg augment FivePsae.vg -i FivePsaeAnnotAll.gam > FivePsaeAnnotAll.vg
vg index -x FivePsaeAnnotAll.xg FivePsaeAnnotAll.vg
vg paths -v FivePsaeAnnot.vg -L | wc -l
vg paths -v FivePsaeAnnotAll.vg -L | wc -l
The file annotated with the reduced annotation contains 3497 paths, the new one with all features annotated contains 30578 paths. While this is a lot, the file sizes of the vg graphs are not so different, with 144M and 153M, respectively. At least this annotation should provide me with all information I need to analyse the features of my selected reference genomes and find them in our clinical isolates.
Can I also now use it to find “the real” oprD of the PAK genome in my graph?
vg paths -x FivePsaeAnnotAll.xg -X -Q "oprD" > FivePsaeAnnotAlloprD.gam
vg view -a FivePsaeAnnotAlloprD.gam | jq -r '([.path.mapping[].position.node_id | unique | tostring] | join("\n"))' > oprDAll_nodes.txt
vg find -N oprDAll_nodes.txt -x FivePsaeAnnotAll.xg -c 10 | vg view -v - > FivePsaeAnnotAlloprD_nodes.vg
vg index -x FivePsaeAnnotAlloprD_nodes.xg FivePsaeAnnotAlloprD_nodes.vg
vg viz -x FivePsaeAnnotAlloprD_nodes.xg -o ../pics/FivePsaeAnnotAlloprD_nodes.svg
I extracted all paths that have “oprD” in their name to a gam file and then extracted the nodes for each path. The file oprDAll_nodes.txt contains the four paths I expected for the four annotated oprDs (I checked that before generating the current version which only contains one node ID per line). Again, I end up with a lot of warnings due to duplicated nodes which I cannot easily avoid using jq
so far:
[vg] warning: node ID 495192 appears multiple times. Skipping.
[vg] warning: node ID 931095 appears multiple times. Skipping.
[vg] warning: node ID 495194 appears multiple times. Skipping.
Loading the index into IVG results in an easy answer to my question: Y880_RS01600 is oprD in PAK:
Start of the IVG representation of a sub-graph with all oprD paths
Scrolling through the sub-graph, the path of Y880_RS01600 mostly follows the one of oprD in PAO1, but there are some differences. The same can also be seen in the vg viz
representation: vg viz
representation of a sub-graph with all oprD paths (SVG)
Start of the vg viz
representation of a sub-graph with all oprD paths
I’m pretty excited to see that there are no other paths (from all five complete annotations!) found that touch the nodes that are touched by the different oprD genes. If this works with other genes as well, this could be an interesting application to complete genome annotations.
Finding inversions
A question that came up a few times in group discussions about genome graphs was about inversions. In theory, graphs should show even big genomic inversions just fine, but I’m not entirely sure how this works when the basis for the graph is a multiple sequence alignment.
The publication of the PAO1 genome describes a large inversion (more than one-quarter of the genome) compared to a previously mapped isolate, and another publication describes this inversion as being between PA14 and PAO1. It resulted from a homologous recombination event between the rRNAs rrnA and rrnB and should be visible in the graph. The overview generated with odgi viz
might already be showing that, but I still find it a little hard to interpret.
A quick check could be to look at either the rRNA clusters where the recombination event took place, or at example genes that are located in the inverted or not inverted parts of the genome. For example, rhl, encoding the ATP-dependent RNA helicase RhlB, is on the positive strand in PAO1 (PA3861), but on the negative strand in PA14 (PA14_14040), while dnaA, encoding the chromosomal replication initiator protein DnaA, as the first gene in the annotation is on the positive strand of both strains.
vg paths -x FivePsaeAnnotAll.xg -X -Q "rhl" > FivePsaeAnnotAll_oprD_rhl.gam
vg view -a FivePsaeAnnotAll_oprD_rhl.gam | jq '{name: .name, nodes: ([.path.mapping[].position.is_reverse | tostring] | join(","))}'
This results in an output of different rhl genes (including rhlR and others) and for each node they touch whether that is reversed or not. The rhl genes (listed for PA14, PAO1, LESB58 and PA7) are all reversed.
Upon closer inspection, this makes sense. The gene might be located in the genomes with different orientations, but it is still read in the same direction, so the path will traverse it in the same direction no matter from which reference it comes.
What about rrnA and rrnB, then? It would be interesting to see the region surrounding those genes. In the PAO1 paper, rrnA is located approximately between bases 722096 and 727255, and rrnB is in the region 4788574 to 4793731. While there is no direct reference to these clusters by name or location in PA14, genes encoding the same features (5S, 16S and 23S rRNA together with tRNA for alanine and isoleucine) can be found in similar regions: 733095 to 738251 and 4952020 to 4957169.
It’s a pity that it’s not possible to simply find these genomic locations in the graph, and I think it’s also not possible to select a gene/feature and to specifically extract its surroundings, as I believe this only works with node IDs. Therefore, I will have to either decide which nodes to extract, or which annotated features.
The vg Wiki on Visualisation uses graphviz
to visualise different types of node connections, but there, too, I will have to find the right sub-graph. Would it be easier to reduce the graph to only PA14 and PAO1 for this? For that I first have remember the names of the paths; I think I can use my usual method to extract node IDs, but it will probably take some time.
vg paths -x FivePsae_new.xg -X -Q "refseq|NC_002516.2|chromosome" > FivePsae_PAO1.gam
vg paths -x FivePsae_new.xg -X -Q "refseq|NC_008463.1|chromosome" > FivePsae_PA14.gam
vg view -a FivePsae_PAO1.gam | jq -r '([.path.mapping[].position.node_id | unique | tostring] | join("\n"))' > PAO1_nodes.txt
Well, this is even more tricky than I thought! The GAM files which were supposed to hold the genome paths are empty. I know the original graph in GFA format had the paths, since odgi showed their names (which is where I copied them from), but what about the converted vg graph?
vg view FivePsae_new.vg -V -pM > test.txt
I don’t know if this is the best solution, but I’m using this command to give me paths and mappings in dot format. The file is huge, but can be searched for the paths to see if they’re there - and they are. Are there any restrictions to the format of path names so vg paths
can pick them out? Maybe it doesn’t like the “|”?
vg paths -v FivePsae_new.vg -L
refseq|NZ_CP020659.1|chromosome
refseq|NC_002516.2|chromosome
refseq|NC_011770|chromosome
refseq|NC_008463.1|chromosome
refseq|NC_009656.1|chromosome
OK, this doesn’t seem to be a problem in general. Maybe the paths are missing in my xg index?
vg paths -x FivePsae_new.xg -L
refseq|NZ_CP020659.1|chromosome
refseq|NC_002516.2|chromosome
refseq|NC_011770|chromosome
refseq|NC_008463.1|chromosome
refseq|NC_009656.1|chromosome
No, they are there all right. Is it the string format, then? Maybe I can’t use this type of string when looking for a path prefix, but there is a different way… I can use vg paths
with the -p
option and provide a list of paths to extract. That would also make node extraction easier.
vg paths -x FivePsae_new.xg -X -p paths_to_extract.txt > FivePsae_PAO1_PA14.gam
vg view -a FivePsae_PAO1_PA14.gam | jq -r '([.path.mapping[].position.node_id | unique | tostring] | join("\n"))' > PAO1_PA14_nodes.txt
jq: error (at <stdin>:1): Cannot iterate over string ("1")
jq: error (at <stdin>:2): Cannot iterate over string ("1")
OK, why am I so sure that the GAM files are empty anyway?
vg view -a FivePsae_PAO1_PA14.gam -pM > test.txt
vg view -pM
gives me paths and mappings from the input GAM file, and the result is absolutely not empty. So what’s jsons deal here? Apparently this code isn’t working as expected anymore.
vg view -a FivePsae_PAO1_PA14.gam -j > FivePsae_PAO1_PA14.json
jq '.path.mapping[].position.node_id' FivePsae_PAO1_PA14.json > PAO1_PA14_nodes.txt
Using the above command gives me a list of node IDs just fine, so I could probably use that to get the IDs. I would prefer to be able to remove duplicates, though…
jq '[.path.mapping[].position.node_id | unique]' FivePsae_PAO1_PA14.json
jq: error (at FivePsae_PAO1_PA14.json:1): Cannot iterate over string ("1")
jq: error (at FivePsae_PAO1_PA14.json:2): Cannot iterate over string ("1")
The unique
is the problem. jq manual to the rescue! “The unique function takes as input an array and produces an array of the same elements, in sorted order, with duplicates removed.” My command only produces single elements, though, not an array. I need more square brackets.
jq '[.path.mapping[].position.node_id] | unique | join("\n")' FivePsae_PAO1_PA14.json > PAO1_PA14_nodes.txt
While this is nice in general, it still gives me the unique nodes per path, so I get most of the nodes in duplicates. Is there a way to avoid that? jq play is a nice platform to test code, but so far I can’t find anything that works, because I need an option before the filter…
jq -nr '[inputs.path.mapping[].position.node_id] | unique[]' FivePsae_PAO1_PA14.json > PAO1_PA14_nodes.txt
vg find -N PAO1_PA14_nodes.txt -x FivePsae_new.xg -c 10 | vg view -v - > FivePsae_PAO1_PA14.vg
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 7616 end to 7644 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 7658 end to 7662 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 7658 end to 7662 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 25410 end to 25470 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 723069 end to 39745 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 733289 end to 733616 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 734902 end to 67220 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 747439 end to 747569 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 747439 end to 747569 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 760189 end to 760233 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 763740 end to 763758 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 271103 end to 271157 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 820448 end to 155070 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 820463 end to 820597 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 820625 end to 271276 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 820920 end to 821043 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 92129 end to 92177 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 284133 end to 284215 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 828729 end to 271157 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 835875 end to 835896 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 7616 end to 7644 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 7658 end to 7662 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 836008 end to 836019 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 838225 end to 315992 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 838407 end to 838427 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 747569 start to 747439 end does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 898713 end to 899397 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 915862 end to 917582 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 820597 start to 820463 end does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 733616 start to 733289 end does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 965429 end to 965618 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 967616 end to 967650 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 976481 end to 976506 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 977081 end to 977126 start does not exist
graph path 'refseq|NC_009656.1|chromosome' invalid: edge from 967616 end to 967650 start does not exist
[vg view] warning: graph is invalid!
Well, this is a big “sub-graph” and these invalid edges use to come up at this step, so I’m not worried (yet).
vg index -x FivePsae_PAO1_PA14.xg FivePsae_PAO1_PA14.vg
vg viz -x FivePsae_PAO1_PA14.xg -o ../pics/FivePsae_PAO1_PA14.svg
I have an svg file this time, but it doesn’t seem to want to load for me, or it’s empty again, as it is rather small…
conda activate python3.7
vg view FivePsae_PAO1_PA14.vg > FivePsae_PAO1_PA14.gfa
odgi build -g FivePsae_PAO1_PA14.gfa -o - | odgi sort -i - -o FivePsae_PAO1_PA14.og
odgi viz -i FivePsae_PAO1_PA14.og -o ../pics/FivePsae_PAO1_PA14.png -x 2000 -R
The conversion to gfa format gives me the same warnings that the vg find | vg view
command showed before, which is fine. I wonder if there is a vg prune
command, for example, which would remove invalid edges (I can’t find an option that looks like it does that in vg prune
or vg mod
)…
The odgi commands ran through smoothly, but the result is not what I had hoped for:
odgi visualisation of graph reduced to PAO1 and PA14
vg paths -v FivePsae_PAO1_PA14.vg -L
refseq|NZ_CP020659.1|chromosome
refseq|NC_002516.2|chromosome
refseq|NC_011770|chromosome
refseq|NC_008463.1|chromosome
refseq|NC_009656.1|chromosome
I might have reduced the nodes of the graph, but I didn’t remove the paths I don’t want to see right now. There is a keep-path
option in vg mod
, that only works for one path, not for two.
Visualisation with IVG
Let’s try a completely new approach - with a local copy of IVG. To be able to see my data in the browser, I have to change the “dataPath” in sequenceTubeMaps/src/config.json to the correct folder ("/data3/genome_graphs/CPANG/playground/day3/references/"). Then I can start the server and see if it works:
cd /data3/genome_graphs/sequenceTubeMap
npm run serve
It works! Well, until it dies… I have to figure out how to best use it without crashing my local browser. At least it didn’t kill the server.
The whole thing crashed because I was looking at the annotated graph. Using FivePsae_new.xg with “only” the five reference paths is much easier. I can easily have the graph start at position 722096 (but in reference to what?) and have a look at the references and their differences. This will not help me finding inversions, though, since IVG doesn’t visualise the orientation of the nodes. It’s still cool that this works, though, so let’s figure out how to best navigate this thing (hopefully without needing the annotation).
Using PAO1 (NC_002516.2) as reference, I can navigate to nucleotide position 483, which is the beginning of dnaA, the first annotated gene, on the plus strand and find this gene at exactly this position. This core gene is very conserved, with almost exclusively SNPs in PA7.
oprD, on the negative strand, ends at position 1043983 and starts at 1045314, and is much harder to find. I don’t know if I should look for the coding sequence as it is listed on pseudomonas.com or for the reverse complement, and I can’t find either of them. Additionally, the graph gets really loopy inside the oprD gene - did I see that before?
Pangenome graph section of the possible start/end of oprD
Since the sub-graph for oprD I already constructed is also in the mounted folder for IVG, I can easily load it for comparison - and, yes, it is the right region and it did look like that before.
Pangenome sub-graph of the possible start/end of oprD
This graph has more paths, since the oprD annotations are included as well. The problem I have with this is that it doesn’t really fit with the official position of oprD in the PAO1 genome. The loops, which look like they are inside the gene, are actually outside of it (the annotated paths do not touch that node). The oprD annotation paths start - well, end, actually - either at position 1044045 (PA14, PA7, and LESB58) or 1044044 (PAO1 and PAK) with AATG or ATG, respectively. This is the view I get when I choose PAO1 as the reference.
Choosing the PAO1 annotation of oprD gives a different picture, but shows neither beginning or end of the actual gene (reverse complement or not) at the beginning or end of the sub-graph.
I don’t know how IVG sorts the nodes it is asked to display, and I’m also a little concerned now about the extraction of nodes for certain paths, since the sub-graph I generated based on the PAO1 oprD shows only a little more than 120 nucleotides, even though the gene should be over 1 kb in size. Trying to load more with the “length” argument leads only to a blank page…
In PA14, oprD is on the plus strand, I’ll check and see how well I can identify that in the complete graph without annotation. Well, this works perfectly. The ATG is exactly at the position it should be (4606220) and the end of the gene (4607551) is located at the reverse AATG, shortly before the loop around the C shown above.
In conclusion, it seems to be easy to find genes located on the plus strand, but the minus strand is a little difficult. While entering the end position (smaller number) of a gene on the negative strand gives me the end of the gene on the graph, using the true start position strands me somewhere in the graph where the gene is not. At least that’s what it looks like, BUT
position | strain | node ID |
---|---|---|
start | PAO1 | 495163 |
end | PAO1 | 495478 |
start | PA14 | 495161 |
end | PA14 | 495478 |
PA14 and PAO1 are actually not far apart when looking at node IDs. I think the nucleotide positions of the reference are slightly off, probably due to the negative strand. My best guess is that the nucleotide positions are numbered continuously upwards, even though the nodes are sometimes traversed from the other direction, making it more difficult to navigate the graph with IVG.
Example: If I enter the start position of oprD (1045314) in IVG, I see the same region of the graph I see when entering the start position of oprD in PA14. The gene being on the plus strand in PA14, I expect to have to scroll right to see all of it (nucleotide positions/numbers increase). In PAO1, the position of the end of the gene has a smaller number than the start, due to the gene being located on the minus strand, so I would expect to have to scroll left, since that is also what the scale at the top of the visualisation indicates. This is not how it works, though, since the graph still has the same orientation it has when I look at PA14, so I have to scroll right and lose the nucleotide positions of PAO1. To add to the confusion, the position does not fit with the actual start of the gene, but here I can’t even guess why.
Pangenome graph extract of the start of oprD, using PAO1 as reference - red is the true start of the gene, black is the one indicated by nucleotide position
Well, in any case I don’t think IVG can help with finding genomic inversions, since the display is always relative to a reference. Before I continue to tackle that question, I’ll have a quick look at the oprD sub-graph extraction which doesn’t look so good now that I’ve had an even closer look than before. Did I really extract the whole gene?
The oprD question
For reference, this is the code I used to find the relevant nodes:
vg paths -x FivePsaeAnnotAll.xg -X -Q "oprD" > FivePsaeAnnotAlloprD.gam
vg view -a FivePsaeAnnotAlloprD.gam | jq -r '([.path.mapping[].position.node_id | unique | tostring] | join("\n"))' > oprDAll_nodes.txt
vg find -N oprDAll_nodes.txt -x FivePsaeAnnotAll.xg -c 10 | vg view -v - > FivePsaeAnnotAlloprD_nodes.vg
vg index -x FivePsaeAnnotAlloprD_nodes.xg FivePsaeAnnotAlloprD_nodes.vg
vg viz -x FivePsaeAnnotAlloprD_nodes.xg -o ../pics/FivePsaeAnnotAlloprD_nodes.svg
Comparing the visualisation in IVG with the list of nodes (oprDAll_nodes.txt) it looks like the graph is in order of the nodes in the file, and not necessarily in the order of nodes in the graph(?). Additionally, node 495161 (containing the start of the gene) is not included in the file. I assume that nodes which are not completely part of a path are not included?
vg paths -x FivePsaeAnnotAll.xg -F -Q "oprD"
warning [libhandlegraph]: Serialized handle graph does not appear to match deserialzation type.
warning [libhandlegraph]: It is either an old version or in the wrong format.
warning [libhandlegraph]: Attempting to load it anyway. Future releases will reject it!
warning:[XG] Loading an out-of-date XG format.For better performance over repeated loads, consider recreating this XG index.
>oprD_PA0958
ATGAAAGTGATGAAGTGGAGCGCCATTGCACTGGCGGTTTCCGCAGGTAGCACTCAGTTCGCCGTGGCCGACGCATTCGT
CAGCGATCAGGCCGAAGCGAAGGGGTTCATCGAAGACAGCAGCCTCGACCTGCTGCTCCGCAACTACTATTTCAACCGTG
ACGGCAAGAGCGGCAGCGGGGACCGCGTCGACTGGACCCAAGGCTTCCTCACCACCTATGAATCCGGCTTCACCCAAGGC
ACTGTGGGCTTCGGCGTCGATGCCTTCGGCTACCTGGGCCTGAAGCTCGACGGCACCTCCGACAAGACCGGCACCGGCAA
CCTGCCGGTGATGAACGACGGCAAGCCGCGCGATGACTACAGCCGCGCCGGCGGCGCCGTGAAGGTGCGCATCTCCAAGA
CCATGCTGAAGTGGGGCGAGATGCAACCGACCGCCCCGGTCTTCGCCGCTGGCGGCAGCCGCCTGTTCCCGCAGACCGCG
ACCGGCTTCCAGCTGCAGAGCAGCGAATTCGAAGGGCTCGACCTCGAGGCAGGCCACTTCACCGAGGGCAAGGAGCCGAC
CACCGTCAAATCGCGTGGCGAACTCTATGCCACCTACGCAGGCGAGACCGCCAAGAGCGCCGATTTCATTGGGGGCCGCT
ACGCAATCACCGATAACCTCAGCGCCTCCCTGTACGGCGCCGAACTCGAAGACATCTATCGCCAGTATTACCTGAACAGC
AACTACACCATCCCACTGGCATCCGACCAATCGCTGGGCTTCGATTTCAACATCTACCGCACAAACGATGAAGGCAAGGC
CAAGGCCGGCGACATCAGCAACACCACTTGGTCCCTGGCGGCAGCCTACACTCTGGATGCGCACACTTTCACCTTGGCCT
ACCAGAAGGTCCATGGCGATCAGCCGTTTGATTATATCGGCTTCGGCCGCAACGGCTCTGGCGCAGGTGGCGACTCGATT
TTCCTCGCCAACTCTGTCCAGTACTCCGACTTCAACGGCCCTGGCGAGAAATCCTGGCAGGCTCGCTACGACCTGAACCT
AGCCTCCTATGGCGTTCCCGGCCTGACTTTCATGGTCCGCTATATCAATGGCAAGGACATCGATGGCACCAAGATGTCTG
ACAACAACGTCGGCTATAAGAACTACGGCTACGGCGAGGATGGCAAGCACCACGAAACCAACCTCGAAGCCAAGTACGTG
GTCCAGTCCGGTCCGGCCAAGGACCTGTCGTTCCGCATCCGCCAGGCCTGGCACCGTGCCAACGCCGACCAGGGCGAAGG
CGACCAGAACGAGTTCCGCCTGATCGTCGACTATCCGCTGTCGATCCTGTA
>oprD_PA14_51880
TGAAAGTGATGAAGTGGAGCGCCATTGCACTGGCGGTTTCCGCAGGTAGCACTCAGTTCGCCGTGGCCGACGCATTCGTC
AGCGATCAAGCCGAAGCGAAGGGGTTCATCGAAGACAGCAGCCTCGACCTGCTGCTCCGCAACTACTATTTCAACCGTGA
CGGCAAGAGCGGCAGCGGGGACCGCGTCGACTGGACCCAAGGCTTCCTCACCACCTATGAATCCGGCTTCACCCAAGGCA
CCGTCGGCTTCGGCGTCGATGCCTTCGGCTACCTCGGTCTGAAGCTCGACGGCACCTCCGACAAGAGCGGTACCGGCAAC
CTGCCAGTGATGAACGACGGCACGCCCCGTGACGACTACAGCCGCGCCGGTGGCGCCGTGAAGGTACGCATCTCCAAGAC
CATGTTGAAGTGGGGCGAGATGCAGCCGACCGCTCCGGTCTTCGCCGCCGGCGGCAGCCGCCTGTTCCCGCAGACCGCGA
CCGGCTTCCAACTGCAGAGCAGCGAACTCGAAGGGCTCGATCTCGAAGCGGGCCACTTCACCGAAGGCAAGCAGGGCACC
ACCACCAAGTCGCGCGGCGAACTCTACGCAACCTATGCAGGCGAGACCGCCAAGAGCGCCGATTTCATTGGGGGCCGCTA
CGCAATCACCGATAACCTCAGCGCCTCCCTGTACGGTGCTGAACTCGAAGACATCTATCGTCAGTATTACCTGAACAGCA
ACTACACCATCCCACTGGCATCCGACCAATCGCTGGGCTTCGATTTCAACATCTACCGCACAAACGATGAAGGCAAGGCC
AAGGCCGGCGACATCAGCAACACCACTTGGTCCCTGGCGGCAGCCTACACTCTGGATGCGCACACTTTCACCTTGGCCTA
CCAGAAGGTCCATGGCGATCAGCCGTTTGATTATATCGGCTTCGGCGAGAACGGTTCCGGCGGCGGCGGTGACTCGATTT
TCCTCGCCAACTCCGTGCAGTACTCCGACTTCAACGGCCCCGGCGAGAAATCCTGGCAGGCCCGCTACGACCTGAACCTC
GCCTCCTATGGCGTTCCCGGCCTGACTTTCATGGTCCGCTATATCAATGGCAAGGACATCGATGGCACCAAGATGTCTGA
CAACAACGTCGGCTATAAGAACTACGGCTACGGCGAGGACGGCAAGCACCACGAGACCAACCTCGAAGCCAAGTACGTGG
TCCAGTCCGGTCCGGCCAAGGACCTGTCGTTCCGCATCCGCCAGGCCTGGCACCGCGCCAACGCCGACCAGGCCGAAGGC
GACCAGAACGAGTTCCGCCTGATCGTCGACTATCCGCTGTCGATCCTGTAA
>oprD_PALES_43561
TGAAAGTGATGAAGTGGAGCGCCATTGCACTGGCGGTTTCCGCAGGTAGCACTCAGTTCGCCGTGGCCGACGCATTCGTC
AGCGATCAGGCCGAAGCGAAGGGGTTCATCGAAGACAGCAGCCTGAACCTGCTGCTCCGCAACTACTATTTCAACCGTGA
CGGCAAGGAAGGTCGGGGTGATCGCGTCGATTGGACCCAGGGCTTCCTCACCACCTACGAATCCGGCTTCACTCAAGGCA
CCGTGGGCTTCGGCGTCGATGCCTTCGGCTACCTGGGCCTGAAGCTCGACGGCACCTCCGACAAGACCGGCACCGGCAAC
CTGCCGGTGATGAACGACGGCAAGCCGCGCGATGACTACAGCCGCGCCGGCGGCGCCGTGAAGGTGCGCATCTCCAAGAC
CATGCTGAAGTGGGGCGAGATGCAACCGACCGCCCCGGTCTTCGCCGCTGGCGGCAGCCGCCTGTTCCCGCAGACCGCGA
CCGGCTTCCAGCTGCAGAGCAGCGAATTCGAAGGGCTCGACCTCGAGGCAGGCCACTTCACCGAGGGCAAGGAGCCGACC
ACCGTCAAATCGCGCGGTGAGCTCTACGCCACCTACGCCGGCCAGACAGCCAAGAGCGCCGACTTCGCTGGCGGCCGCTA
CGCGATCACCGACAACCTCAGCGCCTCCCTCTATGGCGCAGAGCTGAAAGACATCTATCGCCAGTACTACCTGAACACCA
ACTACACCATCCCGCTGGCCTCCGATCAATCGCTGGGCTTCGACTTCAACATCTACCGCACCACCGACGAAGGCAAGTCC
AAGGCTGGCGACATCAGCAACACCACCTGGTCCCTGGCGGGCGCGTATACCCTGGACGCCCACACCTTCACCCTGGCCTA
CCAGCAGGTGCATGGCGACGAGCCGTTCGACTACATCGGCTTCGGCGGCAACGGTTCCGGCGCCGGCGGCGACTCGATCT
TCCTCGCCAACTCCGTCCAGTACTCCGACTTCAACGGTCCTGGCGAGAAATCCTGGCAGGCCCGCTACGACCTGAACCTG
GCCTCCTACGGCGTTCCTGGCCTGACCTTCATGCTGCGTTACATCAATGGTAAGGACATCGACGGTACCAAGGTCGATTC
CAGCTCCTCCTATGCAGGCCTGTACGGCGAGGACGGCAAGCACCACGAAACCAACCTCGAAGCCAAGTACGTGGTCCAGT
CCGGTCCGGCCAAGGACCTGTCGTTCCGCATCCGCCAGGCCTGGCACCGTGCCAACGCCGACCAGGGCGAAGGCGACCAG
AACGAGTTCCGCCTGATCGTCGACTATCCGCTGTCGATCCTGTAA
>oprD_PSPA7_4550
TGAAAGTGATGAAGTGGAGCGCCATTGCACTGGCGGTTTCCGCGGGTAGCACTCAGTTCGCCGTGGCCGACGCCTTCGTC
AGCGATCAGGCCGAAGCGAAGGGGTTCATCGAAGACAGCAGTCTCGACCTGCTGCTCCGCAACTACTATTTCAACCGCGA
CGGCAAGAGCGGCAGCCGGGACCGTGTCGACTGGACCCAGGGCTTCCTCACCACCTACGAATCCGGCTTCACCCAAGGCA
CCGTCGGCTTCGGCGTGGATGCCTTCGGCTACCTCGGCCTGAAACTCGACGGCACCTCCGACAAGACCGGCACCGGCAAC
CTGCCGGTAATGAACGATGGCAAGCCGCGCGATGACTACAGCCGCGCAGGCGGCGCCCTGAAGGTACGCATTTCCAAGAC
CATGCTGAAATGGGGCGAGATGCAGCCGACCGCACCGGTCTTCGCCGCCGGCGGCAGCCGCCTGTTCCCGCAGACCGCGA
CCGGCTTCCAGTTGCAGAGCAGCGAGTTCGAAGGCCTCGACCTCGAAGCGGGCCATTTCACCGAAGGCAAGCAGGGCACC
ACCACCAAGTCGCGCGGCGAACTCTACGCGACCTATGCGGGCCAGACCGCCAAGAGCGCGGACTTCGCCGGCGGCCGCTA
CGCGATCACCGACAACCTCAGCGCCTCCCTGTATGGCGCCGAACTGAAAGACATCTATCGCCAGTACTACCTGAACACCA
ACTACACCATTCCGCTGGCGGAAGATCAGTCGCTGGGCTTCGACTTCAACATCTACCGCACCACCGACGAAGGCAAGTCC
AAGGCCGGCGACATCAGCAACACCACCTGGTCCCTGGCCGGCGCGTACACCCTGGACGCACACACCTTCACCCTCGCCTA
CCAGCAGGTCCATGGCGACGAGCCGTTCGATTACATCGGCTTCGGCGAGAACAATTCCGGCGGTGGCGGCGACTCGATCT
TCCTCGCCAACTCCATCCAGTACTCCGACTTCAACGGCCCCGGCGAGAAATCCTGGCAGGCCCGCTACGACCTGAACATG
GCTTCGTACGGCGTACCCGGCCTGACCTTCATGGTCCGCTACATCAACGGCAAGGACATCGACGGCACCAAGGTCGACTC
CAGCTCCGCCTACGCCGGCCTGTATGGCGAAGACGGCAAGCACCACGAGACCAACTTCGAAGCCAAGTACGTGGTCCAGT
CCGGCCCCGCCAAGGACCTGTCGTTCCGCATCCGCCAGGCCTGGCACCGCGCCAACGCCGACCAGGGCGAAGGCGACCAG
GACGAGTTCCGCCTGATCGTCGACTATCCGCTGTCGATCCTGTAA
The first nucleotide is missing in PA14_51880 and most other genes as well (except the PAO1 version, where the last nucleotide is missing), so it’s not a whole node that’s being ignored here. Is that a problem of the annotation? An off-by-one error while reading the GFF file? The PAO1 gene is the only one on the negative strand…
The graph is still in the old vg format, I think it might be a good idea to annotate the new graph I created after updating vg.
Annotating a new graph version
The genomes and annotations are still the same, so I’ll just repeat the steps I did before with the new graph.
cd /data3/genome_graphs/CPANG/playground/day3/references/
vg annotate -x FivePsae_new.xg -f annotations/PseudomonasAnnotationAll.gff > FivePsaeAnnotAll_new.gam
vg augment FivePsae_new.vg -i FivePsaeAnnotAll_new.gam > FivePsaeAnnotAll_new.vg
vg index -x FivePsaeAnnotAll_new.xg FivePsaeAnnotAll_new.vg
vg paths -v FivePsaeAnnotAll_new.vg -L | wc -l
Great, I again have 30578 paths in the annotated graph and the whole process was pretty quick.
Now I can again extract the oprD fasta sequences.
vg paths -x FivePsaeAnnotAll_new.xg -F -Q "oprD"
They still look the same as before, so it’s time to look at some other genes. Two of the first genes located on the negative strand in my reference genomes are glyS and glyQ:
vg paths -x FivePsaeAnnotAll_new.xg -F -Q "glyS"
vg paths -x FivePsaeAnnotAll_new.xg -F -Q "glyQ"
Here, the last adenine is missing from the TGA stop codon.
Now, the first gene in the P. aeruginosa genome is dnaA, on the plus strand:
vg paths -x FivePsaeAnnotAll_new.xg -F -Q "dnaA"
Here, the first guanine of the GTG start codon is missing in all sequences. There is an off-by-one error in vg path
, or probably in vg annotate
or vg augment
.
The oprD question pt. II
Back to my original questions about the oprD sub-graph. I know now that vg paths
is in general able to identify the complete sequence of an annotated feature in the graph (except maybe for the first or last position), so why is the first node missing on my list, and the last node as well? I’m guessing it’s because of an incomplete hit. Luckily I know the IDs of the first and last node and can add them to the file in this case, but in general this will be a huge problem.
Next up is my confusion about the node order. My file starts with node 1107791, but it is not present in the new, not annotated graph. Since the annotated graph seems to crash IVG, it’s a little tricky to find the true position of this node… I’ll have to go by the sequence, which is present in all oprD versions.
Oh. It is the actual beginning of the gene. This node is only one nucleotide long and represents the adenosine that is missing in PA14, PA7, and LESB58. Only the oprD annotations PAO1 and PAK touch this node, so the reason why this graph looked so weird is the same off-by-one error I’ve seen with the fasta sequences (together with me using PA14 as reference and just not seeing the first node). Of course like this node 495161, which contains the start of the gene in my not annotated graph, is not present and will never be - it has been split into two different nodes (1107791 and 1107792). The same is true for the end of the gene, where the PAO1 and PAK version are missing the last nucleotide (represented by node 1107787, following node 1107785).
With this knowledge, I’m willing to rely on the rest of the node extraction as well and assume that the whole gene is present (it was in the fasta).
Finding inversions pt. II
After a chat with Erik Garrison, one of the developers of vg and instructor for CPANG19, I want to go back to my search of the genome inversion. As a visual approach he suggested a dot plot - I just so happened to discover vg dotplot
last week, but he also suggested to use dotPlotly, which appears to be an R script. dotPlotly works directly with the minimap2 output and not with the graph, but that’s fine for a first look.
cd /data3/genome_graphs/CPANG/playground/day3/references/
mkdir inversion
cd inversion
git clone git@github.com:tpoorten/dotPlotly.git
Cloning into 'dotPlotly'...
remote: Enumerating objects: 207, done.
remote: Total 207 (delta 0), reused 0 (delta 0), pack-reused 207
Receiving objects: 100% (207/207), 31.31 MiB | 12.54 MiB/s, done.
Resolving deltas: 100% (88/88), done.
I’ll try the example function call (with my alignment file) first:
gunzip ../FivePsae.paf.gz
dotPlotly/pafCoordsDotPlotly.R -i ../FivePsae.paf -o out -s -t -m 500 -q 500000 -k 7 -l
PARAMETERS:
input (-i): ../FivePsae.paf
output (-o): out
minimum query aggregate alignment length (-q): 5e+05
minimum alignment length (-m): 500
plot size (-p): 15
show horizontal lines (-l): TRUE
number of reference chromosomes to keep (-k): 7
show % identity (-s): TRUE
show % identity for on-target alignments only (-t): TRUE
produce interactive plot (-x): TRUE
reference IDs to keep (-r):
Number of alignments: 4923
Number of query sequences: 455
After filtering... Number of alignments: 3779
After filtering... Number of query sequences: 4
Dot plot of the genome alignment
This looks… OK. Here’s the order of the genomes:
top to bottom: NC_011770, NC_002516.2, NC_009656.1, NC_008463.1
left to right: NC_011770 (LESB58), NC_009656.1 (PA7), NC_008463.1 (PA14), NZ_CP020659.1 (PAK), NC_002516.2 (PAO1)
I don’t know why there are empty panels, or why the queries don’t include PAK, as I was assuming this would be an all-against-all comparison. While it is easy to see the actual inversion, I find it hard to interpret the results. This could be due to the options set in the example. They used -s
to turn on colouring based on identity, which is nice, -t
for % identity calculation only for on-target alignments (whatever that means), and -l
to show horizontal lines. Additionally, -m 500
means the minimal alignment length was 500 bp (default is 10,000, so that options seems very useful), -q 500000
sets the minimum query length to be 500,000 bp instead of 40,000, and -k 7
sets the number of sorted reference chromosomes to keep (which normally defaults to all).
The -k
option probably doesn’t make sense, since I only have five genome in there anyway. I’ll stick to the -m 500
option, though, and keep the visualisation options -s
and -l
as well. I’m not entirely sure about -q
, so I’m trying the default instead.
dotPlotly/pafCoordsDotPlotly.R -i ../FivePsae.paf -o FivePsae_dotplotly -s -l -m 500
PARAMETERS:
input (-i): ../FivePsae.paf
output (-o): FivePsae_dotplotly
minimum query aggregate alignment length (-q): 4e+05
minimum alignment length (-m): 500
plot size (-p): 15
show horizontal lines (-l): TRUE
number of reference chromosomes to keep (-k):
show % identity (-s): TRUE
show % identity for on-target alignments only (-t): TRUE
produce interactive plot (-x): TRUE
reference IDs to keep (-r):
Number of alignments: 4923
Number of query sequences: 455
After filtering... Number of alignments: 3779
After filtering... Number of query sequences: 4
Second dot plot of the genome alignment
The numbers of alignments and query sequences stayed the same, and the biggest difference between the two plots is the colour that was added this time. I’m not sure why, though. I thought I’d add more colour by skipping -t
, but in the output to the console it says it’s still showing “% identity for on-target alignments only”.
In any case, the queries and targets are still the same, as is everything else. The problem could be that you usually use two genomes in such a dot plot and I have five. Maybe this is confusing for the R script.
For comparison, what does the vg dotplot
output look like? Here, I’ll have to figure out a way to visualise the table…
vg dotplot -x ../FivePsae_new.xg > FivePsae_new.dotplot
The resulting output is a huge table with the paths/genomes as queries and targets and the positions that are being compared, as wells as an orientation. The files takes some time to generate and is huge, with 162,015,277 lines. I know that the inversion is present between PA14 (NC_008463.1) and PAO1 (NC_002516.2), so maybe to start I can extract all lines which contain a comparison of those two.
grep 'NC_002516.2' FivePsae_new.dotplot | grep 'NC_008463.1' > PA14_PAO1.dotplot
This file is “only” 13,592,596 lines long. Nevertheless, loading it into R and simply plotting the result takes some time. It does look as expected, though:
Dot plot based on vg dotplot
output
Th genomes fit quite well at the start and end of the sequence, but are inverted in the middle. Now that I have the raw data in R
, I can also check at which positions that happens, which will then help me to find the inversion in the graph directly. I can use the orientation to find the inversion quite easily: the switch possibly starts at position 41,180 in PA14, which corresponds to position 2,780,557 in PAO1.
In IVG, this is one of the positions where the five reference paths appear multiple times, so I assume that IVG is not the right tool to look at this region. It does give me the right starting node ID, though: 6004. Maybe a combination of vg find
and vg view
can help.
vg find -x ../FivePsae_new.xg -p 'refseq|NC_008463.1|chromosome':41150-41210 > subgraph_PA14.vg
vg view -dp subgraph_PA14.vg | dot -Tsvg -o subgraph_PA14.svg
graph path 'refseq|NC_002516.2|chromosome' invalid: edge from 6008 end to 6008 end does not exist
[vg view] warning: graph is invalid!
Of course we now have some problems with non-existent edges, that’s fine. It sounds like the node range is about right, though. Here is the resulting SVG file:
Subgraph created by vg find
and vg view
, click to enlarge
As far as the path duplication is concerned, this looks very similar. Visualisation is just not the right tool for the job here.
As a non-visual way to find the inversion, Erik also suggested to look for variants that are caused by the inversion. I couldn’t see anything in the visual approaches, though. Maybe the vg dotplot
output has to be enough, and I just need to find a better way of parsing and analysing it than using R. It does have the disadvantage of creating a huge file, though, even for only five strains. Apropos…
gzip FivePsae_new.dotplot PA14_PAO1.dotplot
Maybe we’ll go back to dotPlotly for a second?
Trying to interpret the dotPlotly plots leads me to the conclusion that PA14, LESB58, and PA7 do not have the inversion, while PAO1 and PAK have it. This sounds weird considering that PAO1 and LESB58 usually cluster together on any phylogenetic tree, but it does make sense when looking at my standard example oprD, which is on the positive strand in PA14, LESB58, and PA7, but on the negative strand in PAO1 (and not annotated in PAK, but my experiments have shown that the PAK sequence behaves similarly to that of PAO1 in that region).
That’s it, then. It is possible to find genomic inversions, either in the alignment or in the graph, even though vg doesn’t seem to be the best tool for the job. Apparently it’s just not build (or tested) for complicated graphs like mine. And that reminds me… I wanted to have a look at variation in intergenic sequences!
Variation in intergenic sequences
In our standard analyses of genomic sequencing data, we mostly focus on gene sequences and sometimes promoter regions, while intergenic sequences in general are not of interest (and they’re usually pretty short in bacteria anyway). In a pangenome graph, though, they can complicate things if they are too diverse. This is a great task for IVG, where I can scroll through the graph, but for context I’m doing it here and not in the IVG post.
Let’s start with a special case - right at the beginning of the genome. Both PA14 and PAO1 have their first annotated gene (dnaA) start at position 483, but what happens before that? The origin of replication should be there somewhere, so the sequences can’t probably be too diverse. Maybe I can use vg find
to find the putative DnaA binding sites to look at.
cd /data3/genome_graphs/CPANG/playground/day3/references
vg find -x FivePsae_new.xg -g FivePsae_mod_new.gcsa -S 'TTATCCACA' -P 'refseq|NC_008463.1|chromosome'
699647:-7
701302:5
55419:-15
69:6
55419:-15
568641:-6
1150077:4
1121287:2
1308318:-1
290741:-7
1290679:27
290741:-7
16407:-0
1239098:-13
1150075:19
16407:-0
705411:-0
705411:-1
701302:5
1165925:17
657796:1
1269425:-12
467724:6
568641:-6
1296528:-0
69:6
1122386:20
1289362:15
652045:21
693505:-3
1221471:-14
702807:2
693505:-3
693505:-3
1165930:-23
1190403:-22
16458:-7
693500:-1
1303899:-4
16171:-9
693500:-1
1191236:5
702897:6
1278026:10
702897:6
1252543:0
16458:-7
1263147:8
55684:25
702897:6
51:3
1156846:21
1139260:22
51:3
705115:3
1232047:-30
1122684:3
290741:-7
1221471:-14
1294473:21
687273:7
290741:-7
1248388:12
485163:-1
687273:7
702659:2
797952:-0
1291363:0
702659:2
56097:-0
56097:-0
1226872:5
698250:0
699178:1
1025142:-0
672464:2
1136329:15
1193738:-11
1226872:5
1225011:-7
1248474:19
1183428:-14
1131078:-7
51:3
1131078:-7
698250:0
1183428:-14
1168424:-22
1248388:12
51:3
1248474:19
470015:0
1216099:-26
589405:0
699178:1
485163:-1
1180011:-14
1240238:22
699647:-7
1170568:12
201393:0
55684:25
705565:-0
705565:-0
1120774:-8
1179050:30
1175354:-0
1116156:15
1172750:-13
83:-9
1265415:-27
I assume that this is a table of kmers, even though I didn’t use the -T
option (since I get the same output when I do use the option). In any case, I don’t know how to work with that, so I’ll just look at the IVG graph instead.
In general, the starting region of the chromosome looks quite conserved in all five strains, with a little more variation in PA7 than in the others. Among the nodes that are shared by all five reference are those containing DnaA binding motifs, like here nodes 51 (TTATCCACA) and 55 (TTTTCCACA):
Node 51 and surroundings, showing two putative DnaA binding sites
Since 51 is also listed above, and 3 is the start of the sequence inside the node, I now assume that the above is a list of nodes containing the sequence I asked vg find
to find. I think the negative numbers signify that the sequence was found in the reverse complement, which I am not even interested in, but good to know.
This specific intergenic region is not very complex, but what about somewhere random in the genome? I’m going to look at multiple places here: inside and outside the genomic inversion region, starting with (PA14) 16609 - 16922. This is the intergenic region between PA14_00150 and PA14_00160, which both encode hypothetical proteins on the negative strand, and have orthologs in the other strains (like PA0014 and PA0015). The intergenic region is 313 bp long in PA14, and 291 bp in PAO1. It is only present in four of the five strains apparently: PA7 is completely missing in the graph in this region.
The most diverse section of this intergenic sequence is this one:
Variation in an intergenic region - blue is PAO1, orange is PA14, green is LESB58, and red is PAK
PA14 differs a little from the other three strains here, shortly before the next gene starts, but it’s not great variation. More in the middle of the intergenic region, we also can see this 109 bp node that is common to all four strains:
Node 2145 in an intergenic region - blue is PAO1, orange is PA14, green is LESB58, and red is PAK
Overall, except for one strain completely missing, this randomly chosen intergenic region is not very complex and easy to display and navigate in graph format. The two genes surrounding this region have orthologs in many strains, including PA7 (PSPA7_0014 and PSPA7_0015), so where is the intergenic region of PA7?
Interestingly, the intergenic region as well as the surrounding genes are completely separated from the other four strains (node 706909 for PA7). They only reunite at node 2185, around PA7 genomic position 17240 (which is the beginning of PSPA7_0017). The separation begins at node 2120, or PA7 genomic position 14640, which is even before PSPA7_0014 starts (at 14713).
The compressed view in IVG helps a little to visualise this region (only showing start and end, though):
Beginning of intergenic region specific for PA7 (shown in green)
End of intergenic region specific for PA7 (shown in green)
OK, so the intergenic regions are completely different, but what about the genes? They are supposed to be orthologs, how come they are separated on the graph? Well, I don’t know how the orthologs on pseudomonas.com are defined, but the sequences are indeed very different (i.e. BLAST can’t align PA14_00150 and PSPA7_0014).
Conclusion: the graph seems to deal pretty well with regions that are completely different in a strain and at least this intergenic region is not very complex.
The next random region I chose is part of the genomic inversion: the region between PA14_22350 (actP) and PA14_22370, which is only 88 bp long (from 1,947,590 to 1,947,678). actP again has orthologs in many other strains, so this region could be present in most or all of the reference strains used in the graph.
Whole short intergenic sequence between two small operons
The image above shows the whole intergenic region plus start/end of the surrounding genes. It’s a very simple graph with all five reference strains that should not cause any problems.
Things get a little more difficult when we’re looking at genes that are specific for single strains. There is, for example, exoU in PA14, which is not present in the other reference strains. Entering coordinates from around this gene in IVG leads only to a white screen and no error messages. A neighbouring gene with orthologs in the other strains is PA14_51510, which I can find in IVG, as well as the PAO1 version PA0988.
Starting with the coordinates of PA0988, I get this graph for the end of the gene:
End of PA0988
PA7 is again completely missing and there are some funny loops in there, but it looks OK. Then we scroll to the left (which should be intergenic) and the problems start:
End of PA0988 and begin of the intergenic region
This is the compressed view to include more of the graph - the two bigger nodes on the right, together with the loops, are the same as in the previous image, just the colours of the strains have changed. The light blue path is one version of PA7, and then we have multiples of all reference strains except PAO1, which I used as reference.
I don’t know what’s going on here, so let’s have a closer look. The light blue PA7 path goes through the nodes on the upper strand as well as through those on the lower strand, which I guess causes this weird double layer. Let’s check out the first two “stacked” nodes (492717 and 188112, to the right of the blue PA7-only node) separately, with PAO1 as reference.
Well, those regions are even more complex when I try to look at them separately - I can’t even squeeze them into nice images. Erik actually discussed problems like this one in the seqwish github, so I guess I could try re-generating my graph with the new options included in this to improve these regions. For now, just in general, it is good to know that this issue is not un-solvable.
Open questions
- Is there another way to visualise a whole genome graph?
- How can I annotate the graph? Is a specific formatting of GFF files required?
- The files just have to strictly follow the specifications, I believe.
- How does
vg viz
work compared tovg view (-d)
? Are the nodes sorted differently, and if so, why? - How does
vg find -N
work? Why are more nodes included than were on the list? - What exactly happens when multiple paths have the same or no name (i.e. annotation of genes with the same name, or no name)?
- Did my pipeline to extract sub-graphs for certain genes (like oprD) really extract and include all nodes of the target gene/region?
- Yes, but there is a problem with an off-by-one error when annotating a graph with a GFF file that leads to either the first (on the plus strand) or last (on the minus strand) nucleotide missing in the annotation paths.