Please download the test data from Zenodo and extract the files with
tar xf test_data.tar.gz
The extracted directory has the following structure:
- hprc.vcf.gz
- loci.bed
- reads/1.fastq.gz
- reads/2.fastq.gz
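Before continuing, it can help to confirm that the archive extracted as expected. The sketch below is not part of Locityper: missing_files is a hypothetical helper, and the reads/ prefix for the FASTQ files is an assumption taken from the commands used later in this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Locityper): print every expected file
# that is missing from a directory; return non-zero if any is missing.
missing_files() {
    local dir="$1"; shift
    local rc=0 f
    for f in "$@"; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            rc=1
        fi
    done
    return "$rc"
}

# Assumed layout: FASTQ files under reads/, as in the later commands.
missing_files . hprc.vcf.gz loci.bed reads/1.fastq.gz reads/2.fastq.gz \
    || echo "Some test files were not found; check where the archive was extracted."
```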
hprc.vcf.gz contains a subset of a pangenomic VCF file (see here).
Target loci are described by a four-column file, loci.bed.
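As a quick sanity check of a loci file, the sketch below verifies that every line has four fields and that the start coordinate is smaller than the end coordinate. It assumes the usual BED-style column order (chromosome, start, end, locus name) and whitespace-separated fields. check_loci_bed is a hypothetical helper, and the coordinates in the comment are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report malformed lines in a four-column loci file.
# Assumed columns: chromosome, start, end, locus name (BED-style order).
check_loci_bed() {
    awk '
        /^#/ || NF == 0 { next }    # skip comments and blank lines
        NF != 4 { printf "line %d: expected 4 fields, found %d\n", NR, NF; bad = 1; next }
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { printf "line %d: start/end must be integers\n", NR; bad = 1; next }
        $2 + 0 >= $3 + 0 { printf "line %d: start must be smaller than end\n", NR; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# A well-formed line would look like this (coordinates invented for illustration):
#   chr1    1000    2000    LOCUS_A
if [ -f loci.bed ]; then
    check_loci_bed loci.bed && echo "loci.bed looks well-formed"
fi
```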
In this example, the GRCh38 reference genome is located at genome.fa.
Such a reference genome can be downloaded here (Genome sequence, primary assembly; direct link).
Please index the genome and construct k-mer counts (instructions):
samtools faidx genome.fa
jellyfish count --canonical --lower-count 2 --out-counter-len 2 --mer-len 25 \
--threads 8 --size 3G --output counts.jf genome.fa
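samtools faidx writes a tab-separated index next to the FASTA file (genome.fa.fai), with the sequence name in the first column and the sequence length in the second. As a rough check that indexing worked, the hypothetical helper below counts the indexed sequences and sums their lengths. It is an illustration only and not part of samtools or Locityper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize a FASTA index (.fai) produced by samtools faidx.
# Column 1 is the sequence name, column 2 is the sequence length in bases.
fai_summary() {
    awk -F '\t' '
        { n++; total += $2 }
        END { printf "%d sequences, %.0f bp in total\n", n, total }
    ' "$1"
}

if [ -f genome.fa.fai ]; then
    fai_summary genome.fa.fai
fi
```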
Then, you can create the database of loci using
locityper add -d db -v hprc.vcf.gz -r genome.fa -j counts.jf -L loci.bed
Most importantly, this command will produce locus haplotypes:
- MUC16/haplotypes.fa.gz
- MUC6/haplotypes.fa.gz
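To see how many haplotypes were extracted for each locus, one can count the FASTA headers in each haplotypes.fa.gz. The sketch below searches the database directory with find rather than assuming a particular layout inside db. count_fasta_records is a hypothetical helper, not a Locityper command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count sequences in a gzip-compressed FASTA file
# by counting header lines (those starting with ">").
count_fasta_records() {
    gzip -dc "$1" | grep -c '^>'
}

# Search the database directory (db, as passed to -d) for haplotype files
# and print each path with its number of sequences.
if [ -d db ]; then
    find db -name haplotypes.fa.gz | while read -r path; do
        printf '%s\t%s\n' "$path" "$(count_fasta_records "$path")"
    done
fi
```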
Next, please run WGS dataset preprocessing:
locityper preproc -i reads/{1,2}.fastq.gz -r genome.fa -j counts.jf -o bg
This command will align a fraction of the reads to a part of chromosome 17 in order to estimate the background read depth distribution and error profiles.
Finally, you can run targeted genotyping with
locityper genotype -i reads/{1,2}.fastq.gz -d db -p bg -o gts
This will produce genotype predictions:
- MUC16/res.json.gz
- MUC6/res.json.gz
You can find output file descriptions here. In addition, you can summarize the JSON files into a single CSV file with
/path/to/locityper/extra/into_csv.py -i ./gts -o gts.csv
Note: into_csv takes the first segment after . in the input path (here ./gts) as the sample name, so the sample name in the output CSV file will be gts.
Due to non-deterministic steps within Locityper, results may differ. A sample CSV file may look like this:
sample  locus  genotype             quality  total_reads  unexpl_reads  weight_dist  warnings
gts     MUC6   HG00621.1,HG00621.2  108.8    3738         1             0.00000      *
gts     MUC16  HG00621.1,HG00621.2  585.8    13969        16            0.00000      *
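A summary table like the one above can be filtered with standard tools. The sketch below lists loci whose genotype quality falls below a threshold. It assumes whitespace- or tab-separated columns in the order shown in the sample, with a single header row. low_quality_loci and the threshold of 200 are illustrative choices, not Locityper defaults.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print sample, locus and quality for rows whose
# quality (4th column) is below the given threshold.
# Assumes one header row and the column order shown in the sample table.
low_quality_loci() {
    local file="$1" threshold="$2"
    awk -v thr="$threshold" '
        /^#/ { next }                            # skip comment lines, if any
        !seen_header { seen_header = 1; next }   # skip the header row
        $4 + 0 < thr + 0 { print $1, $2, $4 }
    ' "$file"
}

if [ -f gts.csv ]; then
    low_quality_loci gts.csv 200
fi
```

With the sample rows shown above and a threshold of 200, only the MUC6 row would be reported.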
In total, all steps should take 2–4 minutes.
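The steps of this walkthrough can be collected into one script. The sketch below only repeats the commands shown above. The run wrapper and the DRY_RUN variable are hypothetical additions that let you print the commands without executing them. The pattern reads/{1,2}.fastq.gz is written out as two explicit paths, which is what bash expands it to.

```shell
#!/usr/bin/env bash
# Sketch of the full walkthrough. With DRY_RUN=1 (the default here) the
# commands are only printed; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print the command in dry-run mode, otherwise run it
# and stop at the first failure.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@" || { echo "failed: $*" >&2; exit 1; }
    fi
}

locityper_walkthrough() {
    run samtools faidx genome.fa
    run jellyfish count --canonical --lower-count 2 --out-counter-len 2 --mer-len 25 \
        --threads 8 --size 3G --output counts.jf genome.fa
    run locityper add -d db -v hprc.vcf.gz -r genome.fa -j counts.jf -L loci.bed
    run locityper preproc -i reads/1.fastq.gz reads/2.fastq.gz -r genome.fa -j counts.jf -o bg
    run locityper genotype -i reads/1.fastq.gz reads/2.fastq.gz -d db -p bg -o gts
}

locityper_walkthrough
```

Defaulting to a dry run is a deliberate choice in this sketch, so that the script can be inspected before any files are written.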