Test dataset

Please download the test data from Zenodo and extract the files with tar xf test_data.tar.gz. The extracted directory has the following structure:

    • hprc.vcf.gz
    • loci.bed
    • reads
      • 1.fastq.gz
      • 2.fastq.gz

    hprc.vcf.gz contains a subset of a pangenomic VCF file (see here). Target loci are described by the four-column file loci.bed.
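
As a quick sanity check before building the database, the four-column layout of loci.bed can be verified with a small shell helper. This is only a sketch: the function name check_bed_columns is hypothetical and not part of Locityper, and it assumes whitespace-separated columns with optional comment lines starting with #, which the text above does not state explicitly.

```shell
# Hypothetical helper (not part of Locityper): report lines of a BED file
# that do not have exactly four whitespace-separated columns.
# Exit status is 0 if all non-empty, non-comment lines have four columns,
# and 1 otherwise.
check_bed_columns() {
    awk '
        /^#/ || NF == 0 { next }    # skip comment and blank lines
        NF != 4 {
            printf "line %d: expected 4 columns, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

For example, check_bed_columns loci.bed prints nothing and returns 0 when every line has four columns.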

    In this example, the GRCh38 reference genome is stored in genome.fa. The reference genome can be downloaded here (Genome sequence, primary assembly, direct link). Please index the genome and construct k-mer counts (instructions):

    samtools faidx genome.fa
    jellyfish count --canonical --lower-count 2 --out-counter-len 2 --mer-len 25 \
        --threads 8 --size 3G --output counts.jf genome.fa
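
Before moving on, it can help to confirm that both commands produced their outputs. The sketch below uses a hypothetical helper, require_files, that is not part of Locityper, samtools or jellyfish; it only checks that each named file exists and is non-empty.

```shell
# Hypothetical helper: fail early if any expected file is missing or empty.
# Prints one message per problem file to stderr and returns 1 if any was found.
require_files() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            printf 'missing or empty: %s\n' "$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, require_files genome.fa genome.fa.fai counts.jf checks the reference, the index written by samtools faidx, and the k-mer counts file named in the jellyfish command.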

    Then, you can create the database of loci using

    locityper add -d db -v hprc.vcf.gz -r genome.fa -j counts.jf -L loci.bed

    Most importantly, this command will produce locus haplotypes:

    • MUC16/haplotypes.fa.gz
    • MUC6/haplotypes.fa.gz
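
To see how many haplotypes were extracted for each locus, the FASTA headers in every haplotypes.fa.gz can be counted. The sketch below assumes only that these files sit somewhere under the database directory, so it searches with find instead of relying on a particular layout. The function name count_haplotypes is hypothetical.

```shell
# Hypothetical helper: for each haplotypes.fa.gz under a directory, print
# its path and the number of FASTA records (lines starting with ">"),
# separated by a tab.
count_haplotypes() {
    find "$1" -name 'haplotypes.fa.gz' | sort | while IFS= read -r fa; do
        # grep -c exits non-zero when the count is 0, hence "|| true".
        n=$(gzip -dc "$fa" | grep -c '^>' || true)
        printf '%s\t%s\n' "$fa" "$n"
    done
}
```

For example, count_haplotypes db prints one line per locus found under db.
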
    Next, please run WGS dataset preprocessing:

    locityper preproc -i reads/{1,2}.fastq.gz -r genome.fa -j counts.jf -o bg

    This command aligns a fraction of the reads to a part of chromosome 17 in order to estimate the background read depth distribution and error profiles.

    Finally, you can run targeted genotyping with

    locityper genotype -i reads/{1,2}.fastq.gz -d db -p bg -o gts

    This will produce genotype predictions:

    • MUC16/res.json.gz
    • MUC6/res.json.gz
    You can find output file descriptions here. In addition, you can summarize the JSON files in a single CSV file with

    /path/to/locityper/extra/into_csv.py -i ./gts -o gts.csv

    Note: into_csv takes the first path segment after . in the input path as the sample name, so the sample name in the output CSV file will be gts.

    Due to non-deterministic steps within Locityper, results may differ. An example CSV file may look like this:

    sample  locus  genotype             quality  total_reads  unexpl_reads  weight_dist  warnings
    gts     MUC6   HG00621.1,HG00621.2  108.8    3738         1             0.00000      *
    gts     MUC16  HG00621.1,HG00621.2  585.8    13969        16            0.00000      *
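
The summary table can be filtered from the shell, for instance to list loci whose quality falls below a chosen threshold. The sketch below is hedged on two assumptions: columns are whitespace-separated with a header row naming them, as in the example above, and any lines starting with # are comments. The function name low_quality_loci is hypothetical, and the threshold is the reader's choice, not a recommended value.

```shell
# Hypothetical helper: print sample, locus and quality for rows whose
# "quality" value is below a threshold (second argument).
# Columns are located by header name, not by position.
low_quality_loci() {
    awk -v min="$2" '
        /^#/ { next }                         # skip comment lines, if any
        !seen_header {                        # first remaining line is the header
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("quality" in col)) {
                print "no quality column found" > "/dev/stderr"
                exit 2
            }
            seen_header = 1
            next
        }
        $col["quality"] + 0 < min + 0 {
            print $col["sample"], $col["locus"], $col["quality"]
        }
    ' "$1"
}
```

For example, low_quality_loci gts.csv 200 would print only the MUC6 row for the table shown above.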

    In total, all steps should take 2–4 minutes.