Cookbook
Worked examples of common piawka workflows. All commands use the test data in the examples/ directory of the repository.
Contents
Site frequency spectrum (SFS)
The site frequency spectrum describes the distribution of allele frequencies across SNP sites. To compute it per site, use --persite with the daf (derived allele frequency) statistic.
Unfolded SFS: per-site DAF with missingness filter
$ cd piawka/examples
# Compute per-site DAF and missingness for all groups
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-g groups.tsv -s daf,miss --persite \
> persite.bed
# Keep only sites with no missing data, then inspect DAF distribution
$ piawka filt -e 'miss == 0 && pop1=="LE_2n"' persite.bed \
| grep daf \
| head -6 | column -t
scaffold_1 10000000 10000001 scaffold_1:10000001 LE_2n . daf 0 0 34
scaffold_1 10000001 10000002 scaffold_1:10000002 LE_2n . daf 0 0 34
scaffold_1 10000002 10000003 scaffold_1:10000003 LE_2n . daf 0 0 34
scaffold_1 10000003 10000004 scaffold_1:10000004 LE_2n . daf 0 0 34
scaffold_1 10000004 10000005 scaffold_1:10000005 LE_2n . daf 0 0 34
scaffold_1 10000005 10000006 scaffold_1:10000006 LE_2n . daf 0 0 34
The numerator column gives the raw derived allele count; the value gives the allele frequency as a fraction of 1. Both can be used directly to build an SFS for downstream tools.
Folded SFS: MAF
You can use maf instead of daf for a folded SFS that does not require a polarized VCF. The following example also suggests a way to count the allele frequencies:
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-g groups.tsv -s maf,miss --persite -q \
| piawka filt -e 'miss == 0 && pop1=="LE_2n"' \
| awk '$7=="maf"{print $9}' | sort -n | uniq -c | head -10
2302 1
875 2
517 3
564 4
550 5
368 6
293 7
300 8
311 9
297 10
pi0 / pi4 — synonymous vs. 0-fold nucleotide diversity
A classic neutrality test compares nucleotide diversity at synonymous (4-fold degenerate) sites (π₄) to diversity at 0-fold degenerate sites (π₀). The ratio π₀/π₄ reflects purifying selection against nonsynonymous mutations.
Using degenotate output
The examples/degenotate/ folder degeneracy classification files generated by degenotate. fourfolds.bed and zerofolds.bed were prepared by filtering degeneracy-all-sites.bed.
$ ls degenotate/
degeneracy-all-sites.bed fourfolds.bed
piawka_fourfolds.tsv piawka_zerofolds.tsv
zerofolds.bed
Two-BED approach: pi by gene
# pi at 4-fold sites, summarized per gene
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-g groups.tsv -s pi \
-T degenotate/fourfolds.bed \
-b genes.bed \
> pi4_genes.bed
# pi at 0-fold sites, summarized per gene
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-g groups.tsv -s pi \
-T degenotate/zerofolds.bed \
-b genes.bed \
> pi0_genes.bed
# In this case all groups have data for all genes so we can compare files line by line without joining by pop1 and locus fields
$ paste pi0_genes.bed pi4_genes.bed | awk 'NR>1 && $18>0 { print $4 , $5 , "pi0/pi4" , $8/$18 }' | column -t | head
AL5G20950 LE_2n pi0/pi4 0.572241
AL5G20950 CESiberia_2n pi0/pi4 0
AL2G11890 PUWS_4n pi0/pi4 1.00032
AL2G11890 LE_2n pi0/pi4 0.918975
AL2G11890 UKScandinavia_2n pi0/pi4 0.51226
AL2G11890 CESiberia_2n pi0/pi4 0.708472
AL1G31870 PUWS_4n pi0/pi4 0.242626
AL1G31870 LE_2n pi0/pi4 0.244054
AL1G31870 UKScandinavia_2n pi0/pi4 0.793963
AL1G31870 CESiberia_2n pi0/pi4 0.353163
Distance matrix
A pairwise distance matrix is essential for phylogenetic and clustering analysis. Use piawka dist to convert piawka calc output into PHYLIP or NEXUS format.
PHYLIP matrix (default)
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-g groups.tsv -s dxy -q \
| piawka dist - | column -t
4
PUWS_4n 0.000000 0.010746 0.014952 0.010989
LE_2n 0.010746 0.000000 0.015055 0.010225
UKScandinavia_2n 0.014952 0.015055 0.000000 0.015170
CESiberia_2n 0.010989 0.010225 0.015170 0.000000
Using Fst as a distance
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-g groups.tsv -s fst -q \
| piawka dist -s fst - | column -t
4
PUWS_4n 0.000000 0.170518 0.365378 0.187935
LE_2n 0.170518 0.000000 0.424453 0.194423
UKScandinavia_2n 0.365378 0.424453 0.000000 0.424824
CESiberia_2n 0.187935 0.194423 0.424824 0.000000
Note: Fst is not a proper distance metric (it does not satisfy the triangle inequality), but it can still be useful for exploratory clustering.
Heterozygosity and within-individual diversity
Individual heterozygosity is equivalent to nucleotide diversity measured within a single diploid (or polyploid) individual — that is, the probability that the two alleles at a site differ.
Per-individual heterozygosity (divide mode)
Use --groups divide to treat each sample as its own group:
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-b genes.bed -g divide -s pi -q \
| head -4 | column -t
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10035093 10035276 AL5G20950 WS_1.7-1_4n . pi 0.00462963 8 1728
scaffold_1 10035093 10035276 AL5G20950 MW0079491 . pi 0.00694444 2 288
scaffold_1 10035093 10035276 AL5G20950 TE_4.5-2 . pi 0.013986 4 286
Average heterozygosity per group
To compute the average heterozygosity across all individuals in a group, pipe the per-individual per-site output to piawka sum -g. Note that the output values are different from running piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz -b genes.bed -g groups.tsv -s pi directly: what we get here is not pi = expected heterozygosity, but average per-individual pi (= observed heterozygosity in diploid samples). That means e.g. that “pi == 1” will be found in sites where every individual is heterozygous.
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-b genes.bed -g divide -s pi --persite -q \
| piawka sum -g groups.tsv \
| head -4 | column -t
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10035093 10035276 AL5G20950 PUWS_4n . pi 0.005627 68 12084
scaffold_1 10035093 10035276 AL5G20950 LE_2n . pi 0.008257 38 4602
scaffold_1 10035093 10035276 AL5G20950 UKScandinavia_2n . pi 0.001897 2 1054
Segregating and differentially fixed sites
pi and dxy together classify sites into biologically meaningful categories:
| $\pi_i$ | $\pi_j$ | $d_{XY}$ | Interpretation |
|---|---|---|---|
| > 0 | = 0 | > 0 | Privately segregating in group i |
| = 0 | > 0 | > 0 | Privately segregating in group j |
| = 0 | = 0 | = 1 | Fixed difference between groups |
| > 0 | > 0 | > 0 | Shared polymorphism (or partially shared) |
| = 0 | = 0 | = 0 | Monomorphic across both groups |
Use piawka filt to extract each category from --persite output.
Fixed differences (differentially fixed sites)
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
-g groups.tsv -s pi,dxy --persite \
| piawka filt -e 'pi == "0" && dxy > 0' \
| head -4 | column -t
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10000444 10000445 scaffold_1:10000445 LE_2n UKScandinavia_2n dxy 1 340 340
scaffold_1 10000444 10000445 scaffold_1:10000445 CESiberia_2n UKScandinavia_2n dxy 1 160 160
scaffold_1 10000513 10000514 scaffold_1:10000514 LE_2n UKScandinavia_2n dxy 1 340 340
Privately segregating sites in one group
Any site with pi>0 in one population and pi==0 in all other populations is privately segregating in it.
Shared polymorphisms — a caveat
Sites with pi > 0 in both groups and dxy > 0 appear to be shared polymorphisms, but this classification is unreliable for multiallelic sites:
Example site: Group A → T/A (pi > 0)
Group B → T/G (pi > 0)
dxy for T vs. A and T vs. G → dxy > 0
Here both groups are polymorphic and dxy > 0, but the groups share the same REF allele T while carrying different ALT alleles — not a shared polymorphism in the biological sense. Filter out multiallelic sites e.g. with bcftools view -M2 file.vcf.gz to restrict the analysis to biallelic comparisons at the cost of losing some informative sites.