Cookbook

Worked examples of common piawka workflows. All commands use the test data in the examples/ directory of the repository.

Contents

Site frequency spectrum (SFS)
1. Unfolded SFS: per-site DAF with missingness filter
2. Folded SFS: MAF
pi0 / pi4 — synonymous vs. 0-fold nucleotide diversity
1. Using degenotate output
2. Two-BED approach: pi by gene
Distance matrix
1. PHYLIP matrix (default)
2. Using Fst as a distance
Heterozygosity and within-individual diversity
1. Per-individual heterozygosity (divide mode)
2. Average heterozygosity per group
Segregating and differentially fixed sites

Site frequency spectrum (SFS)

The site frequency spectrum describes the distribution of allele frequencies across SNP sites. To compute it per site, use --persite with the daf (derived allele frequency) statistic.

Unfolded SFS: per-site DAF with missingness filter

$ cd piawka/examples

# Compute per-site DAF and missingness for all groups
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
              -g groups.tsv -s daf,miss --persite \
  > persite.bed

# Keep only sites with no missing data, then inspect DAF distribution
$ piawka filt -e 'miss == 0 && pop1=="LE_2n"' persite.bed \
  | grep daf \
  | head -6 | column -t
scaffold_1  10000000  10000001  scaffold_1:10000001  LE_2n  .  daf  0  0  34
scaffold_1  10000001  10000002  scaffold_1:10000002  LE_2n  .  daf  0  0  34
scaffold_1  10000002  10000003  scaffold_1:10000003  LE_2n  .  daf  0  0  34
scaffold_1  10000003  10000004  scaffold_1:10000004  LE_2n  .  daf  0  0  34
scaffold_1  10000004  10000005  scaffold_1:10000005  LE_2n  .  daf  0  0  34
scaffold_1  10000005  10000006  scaffold_1:10000006  LE_2n  .  daf  0  0  34

The numerator column gives the raw derived allele count; the value gives the allele frequency as a fraction of 1. Both can be used directly to build an SFS for downstream tools.

Folded SFS: MAF

You can use maf instead of daf for a folded SFS that does not require a polarized VCF. The following example also suggests a way to count the allele frequencies:

$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
              -g groups.tsv -s maf,miss --persite -q \
  | piawka filt -e 'miss == 0 && pop1=="LE_2n"' \
  | awk '$7=="maf"{print $9}' | sort -n | uniq -c | head -10
1
2
3
4
5
6
7
8
9
10

pi0 / pi4 — synonymous vs. 0-fold nucleotide diversity

A classic neutrality test compares nucleotide diversity at synonymous (4-fold degenerate) sites (π₄) to diversity at 0-fold degenerate sites (π₀). The ratio π₀/π₄ reflects purifying selection against nonsynonymous mutations.

Using degenotate output

The examples/degenotate/ folder degeneracy classification files generated by degenotate. fourfolds.bed and zerofolds.bed were prepared by filtering degeneracy-all-sites.bed.

$ ls degenotate/
degeneracy-all-sites.bed   fourfolds.bed
piawka_fourfolds.tsv       piawka_zerofolds.tsv
zerofolds.bed

Two-BED approach: pi by gene

# pi at 4-fold sites, summarized per gene
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
              -g groups.tsv -s pi \
              -T degenotate/fourfolds.bed \
              -b genes.bed \
  > pi4_genes.bed

# pi at 0-fold sites, summarized per gene
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
              -g groups.tsv -s pi \
              -T degenotate/zerofolds.bed \
              -b genes.bed \
  > pi0_genes.bed

# In this case all groups have data for all genes so we can compare files line by line without joining by pop1 and locus fields
$ paste pi0_genes.bed pi4_genes.bed | awk 'NR>1 && $18>0 { print $4 , $5 , "pi0/pi4" , $8/$18 }' | column -t | head
AL5G20950  LE_2n             pi0/pi4  0.572241
AL5G20950  CESiberia_2n      pi0/pi4  0
AL2G11890  PUWS_4n           pi0/pi4  1.00032
AL2G11890  LE_2n             pi0/pi4  0.918975
AL2G11890  UKScandinavia_2n  pi0/pi4  0.51226
AL2G11890  CESiberia_2n      pi0/pi4  0.708472
AL1G31870  PUWS_4n           pi0/pi4  0.242626
AL1G31870  LE_2n             pi0/pi4  0.244054
AL1G31870  UKScandinavia_2n  pi0/pi4  0.793963
AL1G31870  CESiberia_2n      pi0/pi4  0.353163

Distance matrix

A pairwise distance matrix is essential for phylogenetic and clustering analysis. Use piawka dist to convert piawka calc output into PHYLIP or NEXUS format.

PHYLIP matrix (default)

$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
              -g groups.tsv -s dxy -q \
  | piawka dist - | column -t
4
PUWS_4n           0.000000  0.010746  0.014952  0.010989
LE_2n             0.010746  0.000000  0.015055  0.010225
UKScandinavia_2n  0.014952  0.015055  0.000000  0.015170
CESiberia_2n      0.010989  0.010225  0.015170  0.000000

Using Fst as a distance

$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
              -g groups.tsv -s fst -q \
  | piawka dist -s fst - | column -t
4
PUWS_4n           0.000000  0.170518  0.365378  0.187935
LE_2n             0.170518  0.000000  0.424453  0.194423
UKScandinavia_2n  0.365378  0.424453  0.000000  0.424824
CESiberia_2n      0.187935  0.194423  0.424824  0.000000

Note: Fst is not a proper distance metric (it does not satisfy the triangle inequality), but it can still be useful for exploratory clustering.

Heterozygosity and within-individual diversity

Individual heterozygosity is equivalent to nucleotide diversity measured within a single diploid (or polyploid) individual — that is, the probability that the two alleles at a site differ.

Per-individual heterozygosity (divide mode)

Use --groups divide to treat each sample as its own group:

$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
              -b genes.bed -g divide -s pi -q \
  | head -4 | column -t
#chr        start     end       locus      pop1         pop2  stat  value       numerator  denominator
scaffold_1  10035093  10035276  AL5G20950  WS_1.7-1_4n  .     pi    0.00462963  8          1728
scaffold_1  10035093  10035276  AL5G20950  MW0079491    .     pi    0.00694444  2          288
scaffold_1  10035093  10035276  AL5G20950  TE_4.5-2     .     pi    0.013986    4          286

Average heterozygosity per group

To compute the average heterozygosity across all individuals in a group, pipe the per-individual per-site output to piawka sum -g. Note that the output values are different from running piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz -b genes.bed -g groups.tsv -s pi directly: what we get here is not pi = expected heterozygosity, but average per-individual pi (= observed heterozygosity in diploid samples). That means e.g. that “pi == 1” will be found in sites where every individual is heterozygous.

$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
              -b genes.bed -g divide -s pi --persite -q \
  | piawka sum -g groups.tsv \
  | head -4 | column -t
#chr        start     end       locus      pop1              pop2  stat  value     numerator  denominator
scaffold_1  10035093  10035276  AL5G20950  PUWS_4n           .     pi    0.005627  68         12084
scaffold_1  10035093  10035276  AL5G20950  LE_2n             .     pi    0.008257  38         4602
scaffold_1  10035093  10035276  AL5G20950  UKScandinavia_2n  .     pi    0.001897  2          1054

Segregating and differentially fixed sites

pi and dxy together classify sites into biologically meaningful categories:

$\pi_i$	$\pi_j$	$d_{XY}$	Interpretation
> 0	= 0	> 0	Privately segregating in group i
= 0	> 0	> 0	Privately segregating in group j
= 0	= 0	= 1	Fixed difference between groups
> 0	> 0	> 0	Shared polymorphism (or partially shared)
= 0	= 0	= 0	Monomorphic across both groups

Use piawka filt to extract each category from --persite output.

Fixed differences (differentially fixed sites)

$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz \
              -g groups.tsv -s pi,dxy --persite \
  | piawka filt -e 'pi == "0" && dxy > 0' \
  | head -4 | column -t
#chr        start     end       locus                pop1          pop2              stat  value  numerator  denominator
scaffold_1  10000444  10000445  scaffold_1:10000445  LE_2n         UKScandinavia_2n  dxy   1      340        340
scaffold_1  10000444  10000445  scaffold_1:10000445  CESiberia_2n  UKScandinavia_2n  dxy   1      160        160
scaffold_1  10000513  10000514  scaffold_1:10000514  LE_2n         UKScandinavia_2n  dxy   1      340        340

Privately segregating sites in one group

Any site with pi>0 in one population and pi==0 in all other populations is privately segregating in it.

Shared polymorphisms — a caveat

Sites with pi > 0 in both groups and dxy > 0 appear to be shared polymorphisms, but this classification is unreliable for multiallelic sites:

Example site:  Group A  →  T/A  (pi > 0)
               Group B  →  T/G  (pi > 0)
               dxy for T vs. A and T vs. G → dxy > 0

Here both groups are polymorphic and dxy > 0, but the groups share the same REF allele T while carrying different ALT alleles — not a shared polymorphism in the biological sense. Filter out multiallelic sites e.g. with bcftools view -M2 file.vcf.gz to restrict the analysis to biallelic comparisons at the cost of losing some informative sites.