Best practices

Doing a genome-wide association study?

If you want to evaluate the effect of individual genomic variants on a phenotype of interest, while accounting for possible confounders, you will want to run a genome-wide association study. This will give a p-value for every genomic variant, comparing the alternative hypothesis that the variant does have an effect on phenotype (has an effect size \(\beta > 0\)) with the null hypothesis that the variant has no effect.

For this mode you will need at least three input files:

Genetic variants (--kmers, --vcf or --pres).
A phenotype (--phenotypes).
A representation of the population structure (--distances or --similarity).

For a starting point, have a look at GWAS tutorial.

Current ‘best-practice’ GWAS recommendations:

Use the --lmm mode.
Use a phylogeny to generate the --similarity matrix.
Use unitigs as the input, provided with the --kmers option. End-to-end analysis is identical to k-mers.
If you have covariates, provide them with –covariates and –use-covariates.

Once this works, you may also wish to also add the following extra analyses:

A burden test with --vcf and --burden.
Tests of other forms of variation (genes, structural variants from panaroo.
Extract lineage effects with --lineage.

Trying to predict a phenotype from genetics?

If you want to predict a phenotype in new samples where it is unmeasured, or look at the power of genetic variants to predict a phenotype, you’ll want to use a whole-genome model.

You will need:

Genetic variants (--kmers, --vcf or --pres).
A phenotype (--phenotypes).

A good starting place is to read Prediction tutorial.

Current ‘best-practice’ prediction recommendations:

Use --wg enet --save-vars and --wg enet --load-vars to save time in future runs.
Use unitigs, if you can.
For large variant sets, use a small number of --cpu to keep memory use manageable.
Divide the population into strains with PopPUNK and use these definitions with --lineage-clusters and --sequence-reweighting.
Turn the correlation filter off with --cor-filter 0.

Trying to calculate heritability?

If you want an estimate of what proportion of the phenotype variance can be explained by genomic variation, known as the heritability \(h^2\), you can use either of the above modes to do this.

With --lmm an estimate for \(h^2\) will be printed to stderr, based on the GCTA model (all variants affect the phenotype, with normally distributed effect sizes).

With --wg enet and estimate for \(h^2\) will also be printed to stderr, based on the average prediction accuracy \(R^2\) in held-out samples during cross-validation.

For a comparison of these approaches, see:

Lees, John A., Mai, T. T., et al. Improved inference and prediction of bacterial genotype-phenotype associations using interpretable pangenome-spanning regressions. (2020)

Preprint: https://doi.org/10.1101/852426