Input files
Pangenome VCF file
Locityper can automatically extract locus alleles from the pangenome reference, stored in a VCF file with non-overlapping variants. Such pangenome VCF file can be created, for example, with Minigraph-Cactus (opens in a new tab). Alternatively, it can be created using PAV (opens in a new tab).
Existing Minigraph-Cactus VCF file exists for the HPRC cohort, see here (opens in a new tab) (Raw VCF). Specifically for Minigraph-Cactus, VCF file needs to be transformed such that no variants overlap each other using vcfbub (opens in a new tab):
vcfbub -l 0 -i hprc-v1.1-mc-grch38.raw.vcf.gz | bgzip > hprc-v1.1-grch38.vcf.gz
tabix -p vcf hprc-v1.1-grch38.vcf.gzBy default, Locityper forbids overlapping variants, unless --ignore-overl is used.
Nevertheless, ignoring overlapping variants leads to lost information, as only one of the overlapping variants
can be used (Locityper always uses the first one).
Extracting local haplotypes
Since various potential problems with input VCF files (overlapping variants, possibly missing variants, forced locus extension) we now advise to extract local haplotypes directly from pangenome assemblies. In v1.5.5 we added a helper script locityper/extra/extract-haplotypes.sh. Specifically, the script maps reference haplotypes (e.g. gene sequence from CHM13) to FASTA[.gz] assemblies or to the AGC compressed assemblies (opens in a new tab), merges hits and extracts corresponding subsequences:
# Starting BED files with target coordinates (fourth column = locus name):
cat target_coord.bed | while read chrom start end locus extra; do
# * Extract sequence from reference
# * (optional) Use seqtk to convert to uppercase and limit line length to 120
# * Replace first line with the gene/locus name
samtools faidx reference.fa "${chrom}:$((start+1))-${end}" | \
seqtk seq -U -l 120 | \
sed "1c>$locus"
done > targets.fa
path/to/locityper/extra/extract-haplotypes.sh -a assemblies.agc -t targets.fa -o local_haps
# or
path/to/locityper/extra/extract-haplotypes.sh -g assemblies_dir -t targets.fa -o local_haps
mkdir ref_panels
cat target_coord.bed | while read chrom start end locus extra; do
# Combine extracted haplotypes for this locus.
cat local_haps/*/"${locus}.fa.gz" > "ref_panels/${locus}.fa.gz"
# Output BED file (fifth column = FASTA with locus haplotypes).
echo -e "${chrom}\t${start}\t${end}\t${locus}\t${locus}.fa.gz"
done > ref_panels/targets.bed
locityper target ... -L ref_panels/targets.bedNote that multiple instances of extract-haplotypes can be executed in parallel, and they will automatically extract haplotypes from different assemblies.
By default, this script outputs a single longest region for each target and assembly, but this can be changed with -N. You can additionally examine output/*.bed.gz files (fifth column = lengh / target length) or alignment files output/*.paf.gz to verify whether haplotype extraction makes sense or if some parameters must be updated.
Jellyfish k-mer counts
Locityper utilizes k-mer counts across the refernece genome, calculated using Jellyfish (opens in a new tab). You can use the following code to obtain them (for example for k = 25):
jellyfish count --canonical --lower-count 2 --out-counter-len 2 --mer-len 25 \
--threads 8 --size 3G --output counts.jf genome.faList of input files
Locityper preproc, genotype and recruit commands allow multiple input files for a single dataset.
You can specify them by repeating -i/-a arguments:
locityper preproc -i readsA.fq.gz -i readsB.fq.gz ...
locityper preproc -i readsA1.fq.gz readsA2.fq.gz \
-i readsB1.fq.gz readsB2.fq.gz ...
locityper preproc -a readsA.bam -a readsB.bam --no-index ...Input files of different types (-i and -a) are not allowed, as well as multiple mapped BAM/CRAM files.
All of the files must correspond to the same sample, same sequencing technology and should have roughly the same characteristics (read length, error rates).
Please make sure to use the same input files for genotyping as for preprocessing.
Alternatively, you can specify an input list of files with -I list.txt where each line follows the format
<flag> <file> [<file2>].
Specifically, lines can be
- Two paired-end files:
p reads1.fq.gz reads2.fq.gzorp reads*.fq.gz, - Interleaved paired-end file:
pi reads.fq.gz(same as-i reads.fq.gz --interleaved), - Single-end FASTA/Q file:
s reads.fq.gz, - Alignment and mapped BAM/CRAM file:
a alns.bam, can optionally specifybai/craiindex path in third column, - Unmapped BAM/CRAM file:
u reads.bam(same as-a reads.bam --no-index). - Unmapped interleaved BAM/CRAM file:
ui reads.bam(same as-a reads.bam --no-index --interleaved).
Multiple lines are allowed, but all must have the same flag.
Output files
Results in a JSON format
For each locus, Locityper creates a res.json.gz output file with the following information:
genotype: predicted genotype.quality: quality of a predicted genotype, calculated as a Phred-probability of error. If there are many very similar genotypes, quality may be low. We advise users to instead focus on the number of unexplained reads and weighted distance (see below).total_reads: total number of reads, used to identify locus genotype,unexpl_reads: number of reads that map well to some haplotypes, but not to the predicted haplotype,weight_dist: sum distance from the primary prediction to other genotypes. Distances are weighted by the predicted probabilities. Distances between genotypes are approximated using Jaccard distance between minimizer multisets.warnings: optional field that specifies heuristic warnings, raised by Locityper after genotyping.options: Several top genotype predictions, along with their probabilities and likelihood mean and standard deviation.
Converting to a CSV format
In order to convert JSON to a CSV file, you can use the following script:
/path/to/locityper/extra/into_csv.py \
-i analysis/./* -o gts.csvIn this examples, multiple samples were processed, with output folders located at
analysis/sample1, analysis/sample2, etc.
First folder name after . is taken as the sample name.
Output CSV file contains similar information to the JSON file, with exception for the top predicted genotypes.
In addition, output file can be automatically compressed (-o gts.csv.gz).
Filtering Locityper predictions
Currently, we use the following strategy to filter out potentially incorrect Locityper genotypes (any of the statements are true):
- Any warnings present,
- or weighted distance ≥ 30,
- or ≥ 1000 total reads and unexplained reads ≥ 20% of total reads.
Converting to VCF
Locityper predictions can be converted into a VCF file based on an existing pangenome VCF. For that, please run
/path/to/locityper/extra/into_vcf.py -i gts.csv -d db \
-v pangenome.vcf -g GRCh38 -o out-dirNote that Locityper is not a variant caller, and instead predicts overall haplotype structure. Therefore, individual variants may not be accurate, especially short variants.
Converting to FASTA
In addition to a VCF file, you can generate predicted haplotypes in a FASTA file. To do so, please run
/path/to/locityper/extra/into_fasta.py -i gts.csv -d db -o out-dirEach entry in the output files will be described in the following way:
>SAMPLE HAPLOTYPE qual=XX ...