Please find the description of prerequisite files here.
Target loci
A database with target loci can be constructed using the locityper add command.
First of all, locus alleles can be extracted from a pangenome VCF file:
locityper add -d db -r reference.fa -j counts.jf \
-v pangenome.vcf.gz -l locus_name chr:start-end
where reference.fa is the reference genome and counts.jf contains Jellyfish k-mer counts for the same reference (more details).
By default, Locityper considers all available haplotypes, including the reference.
The reference haplotype name is either inferred automatically, or needs to be provided with -g NAME.
Additionally, you can discard some haplotypes with the --leave-out argument.
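For example, a command that fixes the reference haplotype name and excludes some haplotypes might look like the sketch below. The haplotype names are hypothetical, and the exact value format for --leave-out (e.g. the separator) should be checked with locityper add --help:

```
locityper add -d db -r reference.fa -j counts.jf \
    -v pangenome.vcf.gz -l locus_name chr:start-end \
    -g GRCh38 --leave-out HG002.1,HG002.2
```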
Regions can also be supplied in a four-column BED file using the -L argument.
Furthermore, the -l/-L arguments can be used together and repeated multiple times:
locityper add -d db -r reference.fa -j counts.jf \
-v pangenome.vcf.gz -l locus1 chr:start-end \
-L loci2.bed -L loci3.bed -l locus4 chr:start-end
Alternatively, locus alleles can be directly provided using a FASTA file.
locityper add -d db -r reference.fa -j counts.jf \
-l locus_name chr:start-end alleles.fa
Note that you still need to provide the region coordinates in the reference genome.
This is needed to evaluate off-target k-mer counts.
A path to the alleles can also be provided in the fifth column of the input BED file (-L).
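For illustration, a -L input could look like this (hypothetical coordinates, locus names, and paths; the optional fifth column points to an alleles FASTA file):

```
chr6	31000000	31100000	locus1
chr6	32500000	32600000	locus2	alleles/locus2.fa
```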
If you don't know the exact region coordinates, you can align locus alleles to the reference genome with
minimap2 -cx asm20 genome.fa alleles.fa | \
awk '{ printf("%s:%d-%d\n",$6,$8+1,$9) }' | \
sort | uniq -c | sort -k1,1nr | head -n20
This will produce a list of possible reference coordinates ranked by the number of supporting alleles. You can then select the most common coordinates or the largest region at your discretion.
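To see what the pipeline computes, here is a toy run on fabricated PAF-like lines (in minimap2's tab-separated PAF output, fields 6, 8, and 9 are the target name, 0-based start, and end):

```shell
# Two alleles map to the same region, one to a slightly shifted region;
# the region supported by the most alleles is ranked first.
printf 'h1\t5000\t0\t5000\t+\tchr6\t1000000\t31000\t36000\nh2\t5000\t0\t5000\t+\tchr6\t1000000\t31000\t36000\nh3\t5000\t0\t5000\t+\tchr6\t1000000\t31200\t36100\n' |
    awk '{ printf("%s:%d-%d\n", $6, $8 + 1, $9) }' |
    sort | uniq -c | sort -k1,1nr | head -n20
# →   2 chr6:31001-36000
#     1 chr6:31201-36100
```

Note that $8 + 1 converts the 0-based PAF start to the 1-based coordinate expected by chr:start-end regions.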
You can freely add more loci to an existing database using the same commands.
Locus extension
When constructing locus database from a pangenome VCF file,
Locityper tries to extend the locus such that the new boundaries do not overlap any input variants.
This process may fail if there are very long pangenomic bubbles overlapping the locus.
In such cases, we recommend increasing the allowed extension size (-e), manually providing locus sequences, or specifying a bigger input region.
Additionally, after locus extension it may happen that several target loci start overlapping. To check for this, you can run
/path/to/locityper/extra/check_overlaps.py db [--move]
This command will list all overlapping target regions.
With the --move argument, the script will move all redundant loci (completely covered by another target locus) to db/redundant.
Preprocessing WGS dataset
Before locus genotyping can be performed, the WGS dataset needs to be preprocessed. For that, please use
locityper preproc -i reads1.fastq [reads2.fastq] \
-r reference.fa -j counts.jf -o bg/SAMPLE
This command creates the output directory bg/SAMPLE for this specific sample.
If you use long reads, please specify the technology with the --tech argument.
Input files can be in FASTA or FASTQ format, either uncompressed or compressed with gzip, bgzip, or lz4.
During sample preprocessing, Locityper examines read alignments to a long simple region of the reference genome without significant duplications or other structural variations.
Locityper attempts to automatically identify the genome version; however, if this fails, please use -b/--bg-region to provide such a region (preferably ≥3 Mb).
By default, the following regions are used: chr17:72950001-77450000 (CHM13), chr17:72062001-76562000 (GRCh38), and chr17:70060001-74560000 (GRCh37).
You can examine an already preprocessed directory with
locityper preproc --describe -o bg/SAMPLE
BAM/CRAM files
If you already have read mappings to the whole genome, you can use them via
locityper preproc -a aligned.bam -r reference.fa -j counts.jf -o bg/SAMPLE
In some cases, reads are stored in unmapped BAM/CRAM format; you can then use the arguments -a reads.bam --no-index.
For an interleaved paired-end BAM/CRAM file, please use the -^ flag.
Non-interleaved paired-end unmapped BAM/CRAM files are not supported.
In addition, you can provide a list of files using the -I/--in-list argument; please see the description here.
Multiple threads
For compressed input files (including BAM/CRAM), decompression often takes significant time,
which prevents efficient parallelization.
Consequently, it is often more efficient to process multiple samples in parallel, each using a single thread (-@ 1).
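Assuming GNU xargs is available and sample FASTQ files live under a hypothetical reads/ directory, one way to run several single-threaded preprocessing jobs in parallel is sketched below. Here echo stands in for the real call, so the sketch is safe to dry-run; the sample names are made up:

```shell
# Dry run: `echo` prints the three commands instead of executing them.
# Remove `echo` to actually preprocess the samples, 3 at a time.
printf '%s\n' HG00731 HG00732 NA19240 |
    xargs -P 3 -I{} echo locityper preproc -i reads/{}.fq.gz \
        -r reference.fa -j counts.jf -@ 1 -o bg/{}
```

-P 3 runs up to three jobs concurrently, while -@ 1 keeps each Locityper process on a single thread.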
Subsampling
This part was changed in Locityper v0.16.
Locityper allows dataset subsampling (--subsample) during preprocessing as a way to speed up the process.
Depending on the dataset and input coverage, small subsampling factors may produce inaccurate read depth estimates.
Using similar datasets
You can estimate WGS characteristics using an already preprocessed file (-~/--like).
Preprocessing using an existing dataset is significantly faster, but may produce incorrect results if the datasets have noticeably different characteristics.
locityper preproc -i/-a input -~ bg/OTHER_SAMPLE \
-r reference.fa -j counts.jf -o bg/SAMPLE
Please use this feature with care, and only if you are certain that the two datasets are similar (produced with the same sequencing technology and similar library preparation).
This method still requires a couple of minutes to count the number of reads in the dataset.
You can also use the --filesize argument for immediate preprocessing, which simply compares file sizes with a previously processed dataset. Both datasets must have the same file format (for example, fastq.gz) and a similar read name structure; even different read name lengths may lead to incorrect predictions.
Multiple datasets
If you plan to analyze a large number of WGS datasets, we advise first benchmarking the optimal configuration: the number of threads (-@), recruiting threads (--recr-threads), and subsampling rates (--subsample).
Additionally, as the datasets are probably similar, it usually makes sense to use the similar-dataset feature (including the --filesize argument).
Nevertheless, you can additionally fully preprocess several randomly selected datasets and compare the predicted parameters (--describe) with those obtained using a similar dataset.
Genotyping WGS
In order to genotype a dataset, please run
locityper genotype -i/-a/-I input -d db -p bg/SAMPLE -o analysis/SAMPLE
You can limit genotyping to a subset of loci using the --subset-loci argument. Additionally, one can specify genotype priors using --priors.
Please find descriptions of the output files here.
Recruiting reads
Locityper implements quick read recruitment to one or multiple target loci.
This functionality can also be used independently via the flexible locityper recruit module.
If you want to recruit reads for one locus (multiple locus haplotypes allowed), please use the following command:
locityper recruit -i/-a/-I input -s targets.fa -o recruited.fastq
For paired-end reads, the output will be interleaved; however, most tools can process interleaved reads.
Note: if you do not need to know which locus each read was recruited to, you can simply concatenate all haplotype FASTA files and run the command above.
Multiple loci
A similar command can be used for multiple loci.
Suppose there are target haplotypes in target/gene1.fa, target/gene2.fa, and target/gene3.fa, and you want to write recruited reads to output/*.fastq. Then, please run
locityper recruit -i/-a/-I input -s target/{}.fa -o output/{}.fastq
Then, all matching target/*.fa files will be found and processed.
Output files mirror the locus names; for example, reads recruited to gene1 will be placed in output/gene1.fastq.
If you have one input FASTA file but still want to recruit reads to multiple loci, you can use the -S argument:
locityper recruit -i/-a/-I input -S multi_targets.fa -o output/{}.fastq
Sequence names in multi_targets.fa must follow the format locus*anything_else, where each locus can appear multiple times.
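For instance, multi_targets.fa headers could look like this (hypothetical locus and haplotype names, assuming the asterisk is the literal separator between the locus name and the rest):

```
>gene1*hap1
>gene1*hap2
>gene2*hap1
```

Reads recruited to the first two sequences would then end up in output/gene1.fastq.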
Finally, you can provide a two-column file of input FASTA and output FASTQ files with
locityper recruit -i/-a/-I input -l list
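Such a list is expected to contain one input FASTA path and one output FASTQ path per line (hypothetical paths shown; columns separated by whitespace):

```
target/gene1.fa	output/gene1.fastq
target/gene2.fa	output/gene2.fastq
```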
Additional arguments
By default, all (k,w)-minimizers from input haplotypes are collected.
Then, reads with an appropriate number of matching minimizers are recruited.
The minimizer size and match fraction are controlled by the -m and -M arguments, but can also be set via a preset argument (-x) based on your sequencing technology.
Note that a preset overrides any -m, -M, and -c values provided before it.
Input minimizers are not filtered by default. However, you can either manually mask them with Ns in the input haplotypes, or provide a Jellyfish counts file using the -j argument; then, all k-mers that appear too many times (-t) across the genome are automatically masked.
Finally, if you recruit reads from a mapped BAM/CRAM file, consider providing a regions file (-R) to speed up read recruitment.
Then, only reads with primary alignments to these regions are recruited, as well as reads aligned to alternative contigs (--alt-contig) and unmapped reads.
Aligning medium-size sequences
Locityper provides the ability to globally align input haplotypes or other small to medium-size sequences.
For that, you can use locityper align.
Note that this operation is likely slower than minimap2 alignment, but, on the other hand, it produces a global alignment without splitting it into overlapping primary/secondary/supplementary subalignments.
To compute all pairwise alignments, please use
locityper align -i sequences.fasta -o alignments.paf.gz -A
In addition, you can align specific pairs using
locityper align -i sequences.fasta -o alignments.paf.gz \
-p name1,name2 name3,name4 ...
The same effect can be achieved using -P and a two-column file with sequence names.
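Such a file would simply pair sequence names, one pair per line (hypothetical names shown):

```
name1	name2
name3	name4
```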