Subcommands reference
[!WARNING] Code examples here are often piped to
headto avoid terminal cluttering. Because of that, VCF processing ends abruptly and some error messages may pop up in addition to the outputs shown here. This is not to be worried about.
Contents
calc
Calculate within- and between-group population statistics from a VCF file.
Usage: piawka calc [OPTIONS] -v file.vcf.gz [-g groups.tsv] [-b regions.bed]
Calculate statistics within & between groups of samples in a VCF file.
The only mandatory argument is the VCF file (--vcf). Regions (--bed) and sample grouping (--groups) are recommended.
Default --stats are pi,dxy; other can be provided as a spaceless comma-seprarted list (check `piawka list` for an overview).
EXAMPLE:
piawka calc [OPTIONS] -v file.tsv -b windows.bed -g ( groups.tsv | unite | divide ) > file.bed
Options:
-1, --persite output values for each site
-b, --bed <arg> BED file with regions to be analyzed
-d, --dependencies output dependencies stats as well (best for piping to
`piawka sum`)
-g, --groups <arg> either 2-columns sample / group table or
keywords "unite" (all samples in one group) or "divide"
(each sample is a separate group); defaults to "unite"
-j, --jobs <arg> number of parallel jobs to run
-m, --mult use populations with multiple alleles at a site
-q, --quiet do not output progress and warning messages
-R, --rand <arg> randomly use this share of sites, 0.0-1.0
-s, --stats <arg> stats to calculate, comma-separated with no spaces as in
"pi,dxy,fst"; see `piawka list`
-T, --targets <arg> BED file with targets (faster for numerous small regions)
-v, --vcf <arg> gzipped and tabixed VCF file
-h, --help show this help message and exit
Description
piawka calc is the core subcommand. It streams the VCF through tabix region queries and, for every site within each requested region, counts alleles per group. Per-site numerators and denominators are accumulated for each requested statistic, then a single summary row is printed per region per group (pair).
Groups can be specified three ways:
| Value | Behaviour |
|---|---|
groups.tsv | 2-column TSV mapping sample IDs to group names |
unite | All VCF samples form one group named all_samples |
divide | Each sample is its own group; all pairwise statistics are computed |
When --groups is omitted, unite is assumed.
Regions can be provided via:
-b / --bed: a BED file with one row per region; column 4 sets the locus name in the output.-T / --targets: a BED file with small targets (e.g., CDS coordinates).piawkamerges nearby targets into larger tabix queries automatically.- Neither: the VCF is processed as a single window per chromosome.
Parallelism: with -j N, piawka calc forks N child processes. Each child reads from a shared FIFO buffer of region lines; results are written to temporary files and assembled in order by the parent before printing. See Technical details.
Output: BED format with 10 columns:
#chr start end locus pop1 pop2 stat value numerator denominator
pop2 is . for within-group statistics.
Examples
Basic run — pi and dxy per gene:
$ cd piawka/examples
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-b genes.bed -g groups.tsv -s pi,dxy -q \
| head -6 | column -t
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10035093 10035276 AL5G20950 CESiberia_2n LE_2n dxy 0.0071137 460 64664
scaffold_1 10035093 10035276 AL5G20950 PUWS_4n . pi 0.00588993 640 108660
scaffold_1 10035093 10035276 AL5G20950 LE_2n PUWS_4n dxy 0.00881262 1102 125048
scaffold_1 10035093 10035276 AL5G20950 LE_2n . pi 0.00772461 1078 139554
scaffold_1 10035093 10035276 AL5G20950 CESiberia_4n UKScandinavia_2n dxy 0.00722154 236 32680
Per-site output:
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-g groups.tsv -s pi --persite -q \
-b genes.bed | head -4 | column -t
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10035093 10035094 AL5G20950 PUWS_4n . pi 0 0 756
scaffold_1 10035093 10035094 AL5G20950 LE_2n . pi 0 0 992
scaffold_1 10035093 10035094 AL5G20950 UKScandinavia_2n . pi 0 0 90
All samples as one group (default or with -g unite):
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-b genes.bed -s pi -q | head -3 | column -t
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10035093 10035276 AL5G20950 all_samples . pi 0.00607559 5592 920404
scaffold_1 10035511 10036000 AL2G11890 all_samples . pi 0.0167103 36136 2162494
Parallel jobs with Fst and multiple stats:
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-b genes.bed -g groups.tsv -s pi,dxy,fst -j 4 -q \
| head -4 | column -t
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10035093 10035276 AL5G20950 CESiberia_2n LE_2n dxy 0.0071137 460 64664
scaffold_1 10035093 10035276 AL5G20950 CESiberia_2n LE_2n fst 0.130414 0.137013 1.0506
scaffold_1 10035093 10035276 AL5G20950 PUWS_4n . pi 0.00588993 640 108660
dist
Convert pairwise piawka calc output to a PHYLIP or NEXUS distance matrix.
Usage: piawka dist [OPTIONS] file.bed
Convert `piawka calc` pairwise distance values as a PHYLIP/NEXUS matrix.
Input is piawka output with some pairwise statistics in.
By default, dxy is used. Can be changed to other statistic (fst, rho etc.)
If many values per sample pair are present, weighted average is taken.
EXAMPLE:
piawka dist [OPTIONS] file.bed > distmat.phy
Options:
-s, --stat <name> piawka pairwise stat to use as distance (default: dxy)
-n, --nexus write matrix in NEXUS format instead of PHYLIP
-h, --help show this help message and exit
Description
piawka dist reads the output of piawka calc and builds a pairwise distance matrix. If multiple rows exist for the same population pair (e.g., multiple windows), a weighted average is computed using the denominators. The matrix can be output in PHYLIP or Nexus format.
Only the pairwise statistic specified with -s is used; within-group rows are ignored. Diagonal entries are set to 0.
Example
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
groups.tsv -s dxy -q \
| piawka dist - | column -t
4
PUWS_4n 0.000000 0.010746 0.014952 0.010989
LE_2n 0.010746 0.000000 0.015055 0.010225
UKScandinavia_2n 0.014952 0.015055 0.000000 0.015170
CESiberia_2n 0.010989 0.010225 0.015170 0.000000
NEXUS output:
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
groups.tsv -s dxy -q \
| piawka dist --nexus -
#NEXUS
BEGIN TAXA;
DIMENSIONS ntax=4;
TAXLABELS
PUWS_4n LE_2n UKScandinavia_2n CESiberia_2n ;
END;
BEGIN DISTANCES;
DIMENSIONS ntax=4;
FORMAT TRIANGLE = BOTH LABELS = LEFT;
Matrix
PUWS_4n 0.000000 0.010746 0.014952 0.010989
LE_2n 0.010746 0.000000 0.015055 0.010225
UKScandinavia_2n 0.014952 0.015055 0.000000 0.015170
CESiberia_2n 0.010989 0.010225 0.015170 0.000000
;
END;
filt
Filter piawka calc output using AWK expressions on statistic values and field names.
Usage: piawka filt [OPTIONS] [file.bed ...]
Filter `piawka` output by values of calculated statistics.
BED entries for groups that satisfy the expression fully are kept.
Expressions can include arithmetic, comparison and logical operators,
numbers, statistics names and some field names (chr,start,end,locus,pop1,pop2).
Takes piawka output file(-s) on stdin and an AWK-compatible expression.
EXAMPLE:
piawka filt -e 'pi > 0.01 && dxy > 0.01' file.bed > file_filt.bed
Options:
-e, --expr <expr> AWK expression to evaluate (default: 1, passes all lines)
-h, --help show this help message and exit
Description
piawka filt pivots the long-format piawka calc output to a wide row per (region, group pair) where all available statistics are present as columns and evaluates the expression. Rows where the expression is true are emitted back in long format. Available variables in the expression:
- Field names:
chr,start,end,locus,pop1,pop2 - Statistic names: any stat name present in the input (e.g.,
pi,dxy,fst,miss) - Standard AWK operators:
>,<,==,!=,&&,||,!
When a statistic is not present for a particular row, its value is empty (evaluates as 0 in numeric context).
Examples
Keep only regions where pi > 0.01 in both groups of a pair:
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-b genes.bed -g groups.tsv -s pi,dxy -q \
| piawka filt -e 'pi > 0.01' \
| head -6 | column -t
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10035511 10036000 AL2G11890 CESiberia_2n LE_2n dxy 0.0218687 3508 160412
scaffold_1 10035511 10036000 AL2G11890 LE_2n . pi 0.0150035 4716 314326
scaffold_1 10035511 10036000 AL2G11890 CESiberia_2n . pi 0.0253604 1988 78390
scaffold_1 10037631 10039653 AL1G31870 CESiberia_2n LE_2n dxy 0.0173026 9260 535180
scaffold_1 10037631 10039653 AL1G31870 LE_2n PUWS_4n dxy 0.014988 13788 919936
Keep sites with no missing data:
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-g groups.tsv -s pi,miss --persite -q \
| piawka filt -e 'miss == 0' \
| head -6 | column -t
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10000000 10000001 scaffold_1:10000001 LE_2n . miss 0 0 34
scaffold_1 10000000 10000001 scaffold_1:10000001 LE_2n . pi 0 0 1122
scaffold_1 10000000 10000001 scaffold_1:10000001 CESiberia_2n . miss 0 0 16
scaffold_1 10000000 10000001 scaffold_1:10000001 CESiberia_2n . pi 0 0 240
scaffold_1 10000000 10000001 scaffold_1:10000001 PUWS_4n . miss 0 0 28
Keep sites that are fixed differences between two populations (privately differentially fixed):
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-g groups.tsv -s pi,dxy --persite -q \
| piawka filt -e 'pi == 0 && dxy > 0' | head -6 | column -t
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10000444 10000445 scaffold_1:10000445 LE_2n UKScandinavia_2n dxy 1 340 340
scaffold_1 10000444 10000445 scaffold_1:10000445 CESiberia_2n UKScandinavia_2n dxy 1 160 160
scaffold_1 10000513 10000514 scaffold_1:10000514 LE_2n UKScandinavia_2n dxy 1 340 340
scaffold_1 10000513 10000514 scaffold_1:10000514 CESiberia_2n UKScandinavia_2n dxy 1 160 160
scaffold_1 10000950 10000951 scaffold_1:10000951 LE_2n PUWS_4n dxy 1 8 8
list
List all statistics available for piawka calc.
Usage: piawka list [OPTIONS]
List all statistics available for calculation with `piawka calc`.
Takes no input.
EXAMPLE:
piawka list [OPTIONS]
Options:
-d, --dependencies output helper/dependency stats as well
-h, --help show this help message and exit
Description
piawka list prints all available statistics with their descriptions, scope (within/between groups), and any dependencies on other statistics. By default, helper statistics (prefixed with helper:) are hidden; use -d to show them.
sum
Re-summarize piawka calc output across windows, genes, or any other grouping.
Usage: piawka sum [OPTIONS] [file.bed ...]
Summarize `piawka calc` results counted over several loci.
If dependencies are not given (see `piawka list`), defaults to sum(numerator)/sum(denominator).
It only takes the output file(-s) of `piawka calc` passed over stdin.
EXAMPLE:
piawka sum [OPTIONS] file.bed > file_sum.bed
Options:
-b, --bed <file> summarize stats by regions in a BED file
-g, --groups <file> group file to average stats across individuals/subgroups
-i, --ignore-chr summarize across all chromosomes using only the locus field to match
-I, --ignore-locus ignore locus name (merge rows from all loci)
-s, --stats <list> stats to summarize (default: all stats found in input)
-h, --help show this help message and exit
Description
piawka sum re-averages previously computed statistics while preserving numerical correctness. It accumulates numerators and denominators across rows that share the same chromosome + locus, then recomputes the value as sum(numerator) / sum(denominator). For statistics with finalize_* functions (e.g., Tajima’s D, Watterson’s theta), it re-runs the finalization step if dependency statistics are present in the input.
Use cases:
- Piping from
piawka calc: when-bis not given tocalc, it outputs results in small arbitrary windows which are automatically piped tosumfor chromosome-wide results. - Merging per-window results across sliding windows or chromosomes.
- Re-grouping individuals into higher-level groups with
-g. - Summarizing by custom regions with
-b.
Examples
Average per-region stats across the entire scaffolds:
Here we summarize cluster-wise pi values along the entire VCF file.
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-g groups.tsv -s pi -b genes.bed -q \
| piawka sum --ignore-locus | column -t
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10035093 10496929 . PUWS_4n . pi 0.00988641 1217900 123189256
scaffold_1 10035093 10496929 . LE_2n . pi 0.00796761 1464568 183815248
scaffold_1 10035093 10496929 . UKScandinavia_2n . pi 0.0077786 115084 14794952
scaffold_1 10035093 10496929 . CESiberia_2n . pi 0.00711766 282906 39747074
Re-group individuals into higher-level groups:
The following is a way to get average genic heterozygosity across the same groups.
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-g divide -s pi -b genes.bed -q \
| piawka sum --ignore-locus -g groups.tsv | column -t
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10035093 10496929 . PUWS_4n . pi 0.00796982 112704 14141352
scaffold_1 10035093 10496929 . LE_2n . pi 0.00734821 42128 5733096
scaffold_1 10035093 10496929 . UKScandinavia_2n . pi 0.00581118 9748 1677456
scaffold_1 10035093 10496929 . CESiberia_2n . pi 0.00695219 18812 2705910
win
Partition a VCF or genome into BED windows for use with piawka calc.
Usage: piawka win [OPTIONS] -v file.vcf.gz | -T targets.bed
Chunk a VCF file into windows for parallel processing or sub-chromosome resolution.
Default (slow): output windows containing --lines [100000] VCF lines each.
Alternatively, output windows of --window-size at --step
(requires --vcf with ##contig header lines or a --fai file).
If --targets are supplied instead of --vcf, produces bigger windows with multiple targets in.
EXAMPLE:
piawka win [OPTIONS] -v file.vcf.gz | -T targets.bed > win.bed
Options:
-f, --fai <file> FAI/TSV file with chr names and sizes (cols 1–2)
-l, --lines <n> VCF lines per window (default: 100000)
-n, --number enumerate windows in column 4
-s, --step <bp> step size for sliding windows (requires --window-size)
-T, --targets <file> BED file with small regions to aggregate into larger queries
-v, --vcf <file> bgzipped and tabix-indexed VCF file
-w, --window-size <bp> window size in bp (requires ##contig lines in VCF or --fai)
-h, --help show this help message and exit
Description
piawka win produces BED windows that are then passed as the -b argument to piawka calc. There are three modes:
| Mode | Required input | How windows are defined |
|---|---|---|
| VCF lines | --vcf | Fixed number of VCF lines per window |
| Fixed bp | --vcf with ##contig headers, or --fai | Fixed base-pair width; optionally sliding with --step |
| Target aggregation | --targets | Nearby small targets merged into ~1 kb blocks |
VCF-lines mode (default) is the safest: it guarantees every window has approximately the same number of variant sites regardless of the recombination or variant density landscape.
Fixed-bp mode with --window-size produces equal-length windows, useful for population-genetic scans where physical scale matters. Requires either ##contig lines in the VCF header or a FAI file.
Target aggregation with --targets is used internally by piawka calc -T to reduce the number of tabix queries when many small regions (e.g., single sites) are provided.
Examples
100 kb fixed windows from VCF header contigs:
$ piawka win -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-w 100000 | head -4
scaffold_1 0 100000
scaffold_1 100000 200000
scaffold_1 200000 300000
scaffold_1 300000 400000
Sliding 50 kb windows with 25 kb step:
$ piawka win -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-w 50000 -s 25000 | head -4
scaffold_1 0 50000
scaffold_1 25000 75000
scaffold_1 50000 100000
scaffold_1 75000 125000
Aggregate exon targets for faster queries:
$ piawka win -T degenotate/fourfolds.bed | head -4
scaffold_1 10035268 10035948
scaffold_1 10037748 10039496
scaffold_1 10040888 10041816
scaffold_1 10048561 10054581