piawka
piawka (pronounced pee-af-kah = leech in Russian) calculates SNP-based population statistics for arbitrary groups of samples directly from bgzipped, tabixed VCF files. It outputs a BED file that can be intersected with any genomic annotation and piped into the companion subcommands for filtering, summarizing, and distance matrix creation.
What piawka does
piawka calc reads a VCF file and, for each genomic window or locus supplied in a BED file, counts alleles per group and per site, then accumulates per-site numerators and denominators for each requested statistic. At the end of each window the accumulated sums are emitted as a single BED row per group (or group pair) per statistic. The companion subcommands transform or aggregate this output for downstream analysis.
Example output
$ cd piawka/examples
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz -b genes.bed -g groups.tsv -s pi,dxy -q \
| head -5 | column -t
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10035093 10035276 AL5G20950 CESiberia_2n LE_2n dxy 0.0071137 460 64664
scaffold_1 10035093 10035276 AL5G20950 PUWS_4n . pi 0.00588993 640 108660
scaffold_1 10035093 10035276 AL5G20950 LE_2n PUWS_4n dxy 0.00881262 1102 125048
scaffold_1 10035093 10035276 AL5G20950 LE_2n . pi 0.00772461 1078 139554
Each output row carries a numerator and denominator in addition to the final value. This is essential for correctly averaging statistics across windows (see Technical details).
Advantages
Indexable BED output
The output is sorted, tab-delimited BED with a standard header. It can be bgzip-compressed & indexed with tabix and fed to other genomics programs with no or little restructuring. The locus field allows labeling results by gene, exon, or any other genomic feature.
Correct handling of missing data
All statistics are computed as sums of per-site numerators and denominators, with additional summarization on top if needed. A site with many missing genotypes contributes a smaller numerator-denominator pair than a site with full coverage. When the output is later averaged over windows or genes with piawka sum, the averaging is automatically missing-data aware — no bias is introduced.
Support for polyploid variant calls
piawka counts individual alleles rather than diploid genotypes. Mixed-ploidy cohorts are handled natively; sample ploidy is detected automatically from the first VCF data line.
Higher data yield: ALT-agnostic SNP retrieval
Many tools check if a site’s REF and ALT alleles are both single nucleotides to identify it as a biallelic SNP. When analysing a subset of samples from a large VCF, many sites biallelic in the subset may be multi-allelic in the full dataset. piawka checks only that the REF allele is a single nucleotide; it then counts only the alleles actually present in each group. This recovers substantially more SNPs for small subsets of large VCF files. See Technical details for a worked example.
Optional support for multiallelic sites
The flexible retrieval of genotypes allows piawka to process sites with more than one alternative allele. Many statistics in piawka are defined in a way that naturally extends to multiallelic sites; this allows to explore genetic patters less accessible to classic tools.
Extensible statistics
piawka ships with 18 statistics covering nucleotide diversity, divergence, differentiation, allele frequency spectra and standard neutrality tests. Any user-defined statistic can be added as a plain AWK file following a simple interface. See Technical details and include/example.awk for further details.
GNU AWK: no installation, fast, low memory
The entire tool is a single AWK/shell script with no compiled dependencies beyond gawk ≥ 5.2 with default extensions, tabix, and bgzip. It runs at competitive speed and uses modest amounts of RAM. Multiprocessing is built in via gawk’s fork() extension.
Installation
# Recommended: conda
conda install -c bioconda piawka
# Alternative: clone and add to PATH
git clone https://github.com/novikovalab/piawka.git
export PATH="$( realpath ./piawka ):${PATH}"
# Dependencies: gawk>=5.2.0, tabix, bgzip
Input and output
Required (for piawka calc):
| Input | Description |
|---|---|
| VCF file | bgzipped (.vcf.gz) and tabix-indexed |
Optional:
| Input | Description |
|---|---|
| Groups file | 2-column TSV: sample ID → group ID OR keywords “unite” (all samples in one group) or “divide” (each sample is a separate group) |
BED file (-b) | Regions to analyse; optionally with locus name in column 4 |
Targets BED (-T) | Small regions for targeted extraction (faster than -b for many small regions); optionally with locus name in column 4 |
Output columns:
| Column | Description |
|---|---|
chr, start, end | BED coordinates |
locus | Region name (from BED col. 4, or . if absent) |
pop1 | Group name (within-group stats) or first group (pairwise) |
pop2 | . for within-group stats; second group for pairwise |
stat | Statistic name |
value | Statistic value |
numerator | Raw numerator (for re-averaging with piawka sum) |
denominator | Raw denominator (for re-averaging with piawka sum) |
Subcommands
| Subcommand | Description |
|---|---|
calc | Calculate within- and between-group statistics from a VCF file |
dist | Convert pairwise calc output to a PHYLIP or NEXUS distance matrix |
filt | Filter calc output rows using AWK expressions on statistic values |
list | List all statistics available for piawka calc |
sum | Re-summarize calc output across custom regions or super-groups |
win | Partition a VCF or genome into windows for calc |
Citation
If you use piawka, please cite:
Scott et al. (2025) Mol. Biol. Evol. https://doi.org/10.1093/molbev/msaf153