Best practices

Doing a genome-wide association study?

If you want to evaluate the effect of individual genomic variants on a phenotype of interest, while accounting for possible confounders, you will want to run a genome-wide association study. This will give a p-value for every genomic variant, comparing the alternative hypothesis that the variant does have an effect on phenotype (has an effect size \(\beta > 0\)) with the null hypothesis that the variant has no effect.

For this mode you will need at least three input files:

  • Genetic variants (--kmers, --vcf or --pres).

  • A phenotype (--phenotypes).

  • A representation of the population structure (--distances or --similarity).

For a starting point, have a look at GWAS tutorial.

Current ‘best-practice’ GWAS recommendations:

  • Use the --lmm mode.

  • Use a phylogeny to generate the --similarity matrix.

  • Use unitigs as the input, provided with the --kmers option. End-to-end analysis is identical to k-mers.

  • If you have covariates, provide them with –covariates and –use-covariates.

Once this works, you may also wish to also add the following extra analyses:

  • A burden test with --vcf and --burden.

  • Tests of other forms of variation (genes, structural variants from panaroo.

  • Extract lineage effects with --lineage.

Trying to predict a phenotype from genetics?

If you want to predict a phenotype in new samples where it is unmeasured, or look at the power of genetic variants to predict a phenotype, you’ll want to use a whole-genome model.

You will need:

  • Genetic variants (--kmers, --vcf or --pres).

  • A phenotype (--phenotypes).

A good starting place is to read Prediction tutorial.

Current ‘best-practice’ prediction recommendations:

  • Use --wg enet --save-vars and --wg enet --load-vars to save time in future runs.

  • Use unitigs, if you can.

  • For large variant sets, use a small number of --cpu to keep memory use manageable.

  • Divide the population into strains with PopPUNK and use these definitions with --lineage-clusters and --sequence-reweighting.

  • Turn the correlation filter off with --cor-filter 0.

Trying to calculate heritability?

If you want an estimate of what proportion of the phenotype variance can be explained by genomic variation, known as the heritability \(h^2\), you can use either of the above modes to do this.

With --lmm an estimate for \(h^2\) will be printed to stderr, based on the GCTA model (all variants affect the phenotype, with normally distributed effect sizes).

With --wg enet and estimate for \(h^2\) will also be printed to stderr, based on the average prediction accuracy \(R^2\) in held-out samples during cross-validation.

For a comparison of these approaches, see:

Lees, John A., Mai, T. T., et al. Improved inference and prediction of bacterial genotype-phenotype associations using interpretable pangenome-spanning regressions. (2020)