Reference documentation

input.py

Functions to read data into pyseer and iterate over instances

pyseer.input.file_hash(filename)[source]

Calculates the hash of an entire file on disk

Parameters:

filename (str) – Location of file on disk

Returns:

hash (str)

SHA256 checksum

pyseer.input.hash_pattern(k)[source]

Calculates the hash of a presence/absence vector

Parameters:

k (numpy.array) – Variant presence/absence binary vector (n, 1)

Returns:

hash (byte)

Hashed pattern

pyseer.input.iter_variants(p, m, cov, var_type, burden, burden_regions, infile, all_strains, sample_order, lineage_effects, lineage_clusters, min_af, max_af, max_missing, filter_pvalue, lrt_pvalue, null_fit, firth_null, uncompressed, continuous)[source]

Make an iterable to pass single variants to fixed effects regression

Parameters:
  • p (pandas.DataFrame) – Phenotype vector (n, 1)

  • m (numpy.array) – Population structure matrix (n, k)

  • cov (pandas.DataFrame) – Covariates matrix (n, m)

  • var_type (str) – Variants type (one of: kmers, vcf or Rtab)

  • burden (bool) – Whether to slice a vcf file by burden regions

  • burden_regions (collections.deque) – Burden regions to slice the vcf with

  • infile (opened file) – Handle to opened variant file

  • all_strains (set-like) – All sample labels that should be present

  • sample_order – Sampes order to interpret each Rtab line

  • lineage_effects (bool) – Whether to fit lineage effects

  • clusters (lineage) – Lineage clusters indexes

  • min_af (float) – Minimum allele frequency (inclusive)

  • max_af (float) – maximum allele frequency (inclusive)

  • max_missing (float) – maximum missing frequency

  • filter_pvalue (float) – Pre-filtering p-value threshold

  • lrt_pvalue (float) – Filtering p-value threshold

  • null_fit (float or statsmodels.regression.linear_model.RegressionResultsWrapper) – Null-fit likelihood (binary) or model (continuous)

  • firth_null (float) – Firth regression likelihood

  • uncompressed (bool) – Whether the kmers file is uncompressed

  • continuous (bool) – Whether the phenotype is continuous or not

Returns:

var_name (str)

Variant name

v (numpy.array)

Phenotypes vector (n, 1)

k (numpy.array)

Variant presence/absence vector (n, 1)

m (numpy.array)

Population structure matrix (n, k)

c (numpy.array)

Covariates matrix (n, m)

af (float)

Allele frequency

pattern (bytes)

Variant hash

lineage_effects (bool)

Whether to fit lineage effects

lineage clusters (list)

Lineage clusters indexes

filter_pvalue (float)

Pre-filtering p-value threshold

lrt_pvalue (float)

Filtering p-value threshold

null_fit (float or statsmodels.regression.linear_model.RegressionResultsWrapper)

Null-fit likelihood (binary) or model (continuous)

firth_null (float)

Firth regression likelihood

kstrains (iterable)

Sample labels with the variant

nkstrains (iterable)

Sample labels without the variant

continuous (bool)

Whether the phenotype is continuous or not

pyseer.input.iter_variants_lmm(variant_iter, lmm, h2, lineage, lineage_clusters, covariates, continuous, filter_pvalue, lrt_pvalue)[source]

Make an iterable to pass single variants to fixed effects regression

pyseer.input.load_burden(infile, burden_regions)[source]

Load burden regions for VCF analysis

Parameters:
  • infile (str) – Input file for burden regions

  • burden_regions (list) – List to be filled in-place

pyseer.input.load_covariates(infile, covariates, p)[source]

Load and encode a covariates matrix

Parameters:
  • infile (str) – Input file for the covariates matrix

  • covariates (iterable or None) – List of string indicating which columns to use and their interpretation. Example: 2q indicates that the second column from the file is a quantitative variable, 2 indicates that that same column is categorical. If None, the matrix is loaded but nothing is done with it.

  • p (pandas.Series) – Phenotypes vector (n, 1)

Returns:

cov (pandas.DataFrame)

Covariance matrix (n, m)

pyseer.input.load_lineage(infile, p)[source]

Load custom lineage clusters definitions

Parameters:
  • infile (str) – Input file for lineage clusters

  • p (pandas.Series) – Phenotypes vector (n, 1)

Returns:

result (tuple of (numpy.array, list))

Lineage binary matrix and cluster labels

pyseer.input.load_phenotypes(infile, column)[source]

Load phenotypes vector

Parameters:
  • infile (str) – Matrix input file

  • column (str or None) – Phenotype column name or None to pick the last column

Returns:

p (pandas.Series)

Phenotype vector (n, 1)

pyseer.input.load_structure(infile, p, max_dimensions, mds_type='classic', n_cpus=1, seed=None)[source]

Load population structure and apply multidimensional scaling

Parameters:
  • infile (str) – Population structure (distance matrix) input file

  • p (pandas.Series) – Phenotype vector (n, 1)

  • max_dimensions (int) – Maximum dimensions to consider when applying metric or non-metric MDS

  • mds_type (str) – MDS algorithm to apply. One of classic, metric or non-metric. Any other input will trigger the metric MDS

  • n_cpus (int) – Number of CPUs to be used for the metric or non-metric MDS

  • seed (int or None) – Random seed for metric or non-metric MDS, None if not required

Returns:

m (pandas.DataFrame)

Population structure after MDS (n, m)

pyseer.input.load_var_block(var_type, p, burden, burden_regions, infile, all_strains, sample_order, min_af, max_af, max_missing, uncompressed, block_size)[source]

Make in iterable to load blocks of variants for LMM

Parameters:
  • var_type (str) – Variants type (one of: kmers, vcf or Rtab)

  • p (pandas.DataFrame) – Phenotype vector (n, 1)

  • burden (bool) – Whether to slice a vcf file by burden regions

  • burden_regions (collections.deque) – Burden regions to slice the vcf with

  • infile (opened file) – Handle to opened variant file

  • all_strains (set-like) – All sample labels that should be present

  • sample_order – Sampes order to interpret each Rtab line

  • min_af (float) – Minimum allele frequency (inclusive)

  • max_af (float) – maximum allele frequency (inclusive)

  • max_missing (float) – maximum missing frequency

  • uncompressed (bool) – Whether the kmers file is uncompressed

  • block_size (int) – How many variants to be loaded at once

Returns:

variants (iterable)

A collection of pyseer.classes.LMM objects describing the loaded variants (n,)

variant_mat (numpy.array)

Variant bloack presence/absence matrix (n, block_size)

eof (bool)

Whether we are at the end of the file

pyseer.input.open_variant_file(var_type, var_file, burden_file, burden_regions, uncompressed)[source]

Open a variant file for use as an iterable

Parameters:
  • var_type (str) – Type of variants file (kmers, vcf, Rtab)

  • var_file (str) – Location of file

  • burden_file (str) – File containing regions to group burden tests

  • burden_regions (list) – List of burden regions to be filled in-place

  • uncompressed (bool) – True if kmer file is not gzipped

pyseer.input.read_variant(infile, p, var_type, burden, burden_regions, uncompressed, all_strains, sample_order, keep_list=None, noparse=False)[source]

Read input line and parse depending on input file type

Return a variant name and pres/abs vector

Parameters:
  • infile (opened file) – Handle to opened variant file

  • p (pandas.Series) – Phenotypes vector (n, 1)

  • var_type (str) – Variants type (one of: kmers, vcf or Rtab)

  • burden (bool) – Whether to slice a vcf file by burden regions

  • burden_regions (collections.deque) – Burden regions to slice the vcf with

  • uncompressed (bool) – Whether the kmers file is uncompressed

  • all_strains (set-like) – All sample labels that should be present

  • sample_order – Samples order to interpret each Rtab line

  • keep_list (dict) –

    Variant names to properly read, any other will return None

    (default = None)

  • noparse (bool) –

    Set True to skip line without parsing and return None, irrespective of presence in skip_list

    (default = False)

Returns:

eof (bool)

Whether we are at the end of the file

k (numpy.array)

Variant presence/absence vector

var_name (str)

Variant name

kstrains (list)

Samples in which the variant is present

nkstrains (list)

Samples in which the variant is absent

af (float)

Allele frequency

missing (float)

Missing frequency

pyseer.input.read_vcf_var(variant, d, keep_list=None)[source]

Parses vcf variants from pysam

Returns None if filtered variant. Mutates passed dictionary d

Parameters:
  • variant (pysam.libcbcf.VariantRecord) – Variant to be parsed

  • d (dict) – Dictionary to be populated in-place

  • keep_list (list) – List of variants to read

model.py

Original SEER model (fixed effects) implementations

pyseer.model.firth_likelihood(beta, logit)[source]

Convenience function to calculate likelihood of Firth regression

Parameters:
  • beta (numpy.array) – (n, 1)

  • logit (statsmodels.discrete.discrete_model.Logit) – Logistic model

Returns:

likelihood (float)

Firth likelihood

pyseer.model.fit_firth(logit_model, start_vec, X, y, step_limit=1000, convergence_limit=0.0001)[source]

Do firth regression

Parameters:
  • logit (statsmodels.discrete.discrete_model.Logit) – Logistic model

  • start_vec (numpy.array) – Pre-initialized vector to speed-up convergence (n, 1)

  • X (numpy.array) – (n, m)

  • y (numpy.array) – (n, )

  • step_limit (int) – Maximum number of iterations

  • convergence_limit (float) – Convergence tolerance

Returns:

intercept (float)

Intercept

kbeta (float)

Variant beta

beta (iterable)

Covariates betas (n-2)

bse (float)

Beta std-err

fitll (float or None)

Likelihood of fit or None if could not fit

pyseer.model.fit_lineage_effect(lin, c, k)[source]

Fits the model k ~ Wa using binomial error with logit link. W are the lineages (either a projection of samples, or cluster indicators) and covariates. Returns the index of the most significant lineage

Parameters:
  • lin (numpy.array) – Population structure matrix or lineage association binary matrix (n, k)

  • c (numpy.array) – Covariants matrix (n, j)

  • k (numpy.array) – Variant presence-absence vector (n, 1)

Returns:

max_lineage (int or None)

Index of the most significant lineage or None is could not fit

pyseer.model.fit_null(p, m, cov, continuous, firth=False)[source]

Fit the null model i.e. regression without k-mer

y ~ Wa

Returns log-likelihood

Parameters:
  • p (numpy.array) – Phenotypes vector (n, 1)

  • m (numpy.array) – Population structure matrix (n, k)

  • cov (pandas.DataFrame) – Covariants dataframe (n, j)

  • continous (bool) – Whether phenotypes are continuous or binary

  • firth (bool) – For binary phenotypes whether to use firth regression

Returns:

null_res (statsmodels.regression.linear_model.RegressionResultsWrapper or float or None)

Fitted model or log-likelihood (if firth) or None if could not fit

pyseer.model.fixed_effects_regression(variant, p, k, m, c, af, pattern, lineage_effects, lin, pret, lrtt, null_res, null_firth, kstrains, nkstrains, continuous)[source]

Fits the model y ~ Xb + Wa using either binomial error with logit link (binary traits) or Gaussian error (continuous traits)

  • y is the phenotype

  • X is the variant presence/absence (fixed effects)

  • W are covariate fixed effects, including population structure

  • a and b are slopes to be fitted

Parameters:
  • variant (str) – Variant identifier

  • p (numpy.array) – Phenotype vector (binary or continuous) (n, 1)

  • k (numpy.array) – Variant presence/absence vector (n, 1)

  • m (numpy.array) – Population structure matrix (n, m)

  • c (numpy.array) – Covariants matrix (n, j)

  • af (float) – Allele frequency

  • pattern (str) – Variant hashed pattern

  • lineage_effects (bool) – Whether to fit lineages or not

  • lin (numpy.array) – Lineages matrix (n, k)

  • pret (float) – Pre-filtering p-value threshold

  • lrtt (float) – Post-fitting p-value threshold

  • null_res (float or statsmodels.regression.linear_model.RegressionResultsWrapper) – Null-fit likelihood (binary) or model (continuous)

  • null_firth (float) – Firth regression likelihood

  • kstrains (iterable) – Sample labels with the variant

  • nkstrains (iterable) – Sample labels without the variant

  • continuous (bool) – Whether the phenotype is continuous or not

Returns:

result (pyseer.classes.Seer)

Results container

pyseer.model.pre_filtering(p, k, continuous)[source]

Calculate a naive p-value from a chisq test (binary phenotype) or a t-test (continuous phenotype) which is not adjusted for population structure

Parameters:
  • p (numpy.array) – Phenotypes vector (n, 1)

  • k (numpy.array) – Variant presence-absence vector (n, 1)

  • continous (bool) – Whether phenotypes are continuous or binary

Returns:

prep (float)

Naive p-value

bad_chisq (boolean)

Whether the chisq test had small values in the contingency table

lmm.py

LMM interface implementations

pyseer.lmm.fit_lmm(lmm, h2, variants, variant_mat, lineage_effects, lineage_clusters, covariates, continuous, filter_pvalue, lrt_pvalue)[source]

Fits LMM and returns LMM tuples for printing

Parameters:
  • lmm (pyseer.fastlmm.lmm_cov.LMM) – Initialised LMM model

  • h2 (float) – Trait’s variance explained by covariates

  • variants (iterable) – Tuples with variant object, phenotype vector and variant vector (pyseer.classes.LMM, numpy.array, numpy.array)

  • variant_mat (numpy.array) – Variants presence absence matrix (n, k)

  • lineage_effects (bool) – Whether to fit lineage effects

  • lineage_clusters (numpy.array) – Population structure matrix or lineage association binary matrix (n, k)

  • covariates (numpy.array) – Covariates matrix (n, m)

  • continuous (bool) – Whether the phenotype is continuous

  • filter_pvalue (float) – Pre-filtering p-value threshold

  • lrt_pvalue (float) – Post-fitting p-value threshold

Returns:

all_variants (iterable)

All variant objects fitted or filtered

pyseer.lmm.fit_lmm_block(lmm, h2, variant_block)[source]

Actually fits the LMM to numpy variant array see map/reduce section of _internal_single in fastlmm.association.single_snp

Parameters:
  • lmm (pyseer.fastlmm.lmm_cov.LMM) – Initialised LMM model

  • h2 (float) – Trait’s variance explained by covariates

  • variant_block (numpy.array) – Variants presence absence matrix (n, k)

Returns:

lmm_results (dict)

LMM results for this variants block

pyseer.lmm.initialise_lmm(p, cov, K_in, lmm_cache_in=None, lmm_cache_out=None, lineage_samples=None)[source]

Initialises LMM using the similarity matrix see _internal_single in fastlmm.association.single_snp

Parameters:
  • p (pandas.Series) – Phenotypes vector (n, 1)

  • cov (pandas.DataFrame) – Covariance matrix (n, m)

  • K_in (str) – Similarity matrix filename

  • lmm_chache_in (str or None) – Filename for an input LMM cache, None if it has to be computed

  • lmm_chache_out (str or None) – Filename to save the LMM cache, None otherwise.

  • lineage_samples (list or None) – Sample names used for lineage (must match K_in)

Returns:

p (pandas.Series)

Phenotype vector with the samples present in the similarity matrix

lmm (pyseer.fastlmm.lmm_cov.LMM)

Initialised LMM model

h2 (float)

Trait’s variance explained by covariates

utils.py

Utilities

pyseer.utils.format_output(item, lineage_dict=None, model='seer', print_samples=False)[source]

Format results for a variant for stdout printing

Parameters:
  • item (pyseer.classes.Seer or pyseer.classes.LMM) – Variant results container

  • lineage_dict (list) – Lineage labels

  • model (str) – The model used

  • print_samples (bool) – Whether to add the samples list to the putput

Returns:

out (str)

Tab-delimited string to be printed

pyseer.utils.set_env(**environ)[source]

Temporarily set the process environment variables.

>>> with set_env(PLUGINS_DIR=u'test/plugins'):
...   "PLUGINS_DIR" in os.environ
True
>>> "PLUGINS_DIR" in os.environ
False

cmdscale.py

Function to perform classical MDS

pyseer.cmdscale.cmdscale(D)[source]

Classical multidimensional scaling (MDS)

Parameters:

D (numpy.array) – Symmetric distance matrix (n, n)

Returns:

Y (numpy.array)

Configuration matrix (n, p). Each column represents a dimension. Only the p dimensions corresponding to positive eigenvalues of B are returned. Note that each dimension is only determined up to an overall sign, corresponding to a reflection.

e (numpy.array)

Eigenvalues of B (n, 1)

enet.py

Elastic net model implementations

pyseer.enet.correlation_filter(p, all_vars, quantile_filter=0.25)[source]

Calculates correlations between phenotype and variants, giving those that are above the specified quantile

Parameters:
  • p (pandas.DataFrame) – Phenotype vector (n, 1)

  • all_vars (scipy.sparse.csr_matrix) – Narrow sparse matrix representation of all variants to fit to (rows = variants, columns = samples)

  • quantile_filter (float) –

    The quantile to discard at e.g. 0.25, retain top 75%

    [default = 0.25]

Returns:

cor_filter (numpy.array)

The indices of variants passing the filter

pyseer.enet.enet_predict(enet_fit, variants, continuous, responses=None)[source]

Use a fitted elastic net model to make predictions about new observations. Returns accuracy if true responses known

Parameters:
  • enet_fit (cvglmnet) – An elastic net model fitted using cvglmnet or similar

  • variants (scipy.sparse.csc_matrix) – Wide sparse matrix representation of all variants to predict with (rows = samples, columns = variants)

  • continuous (bool) – True if a continuous phenotype, False if a binary phenotype

  • responses (np.array) –

    True phenotypes to calculate R^2 with

    [default = None]

Returns:

preds (numpy.array)

Predicted phenotype for each input sample in variants

R2 (float)

Variance explained by model (or None if true labels not provided).

pyseer.enet.find_enet_selected(enet_betas, var_indices, p, c, var_type, fit_seer, burden, burden_regions, infile, all_strains, sample_order, continuous, find_lineage, lin, uncompressed)[source]

Read through the variant input file again, yielding just those variants which had a non-zero slope for printing

Parameters:
  • enet_betas (numpy.array) – Fitted slopes of intercept, covariants and variants from elastic net

  • var_indices (list) – The 0-indexed locations (in the original file) of variants represented in enet_betas

  • p (pandas.DataFrame) – Phenotype vector (n, 1)

  • c (numpy.array) – Covariate matrix (n, j)

  • var_type (str) – Variants type (one of: kmers, vcf or Rtab)

  • (tuple (fit_seer) – m, null_model, null_firth) Distance projection and null models required to fit fixed effect regression

  • burden (bool) – Whether to slice a vcf file by burden regions

  • burden_regions (collections.deque) – Burden regions to slice the vcf with

  • infile (opened file) – Handle to opened variant file

  • all_strains (set-like) – All sample labels that should be present

  • sample_order – Sample order to interpret each Rtab line

  • continuous (bool) – Is phenotype/fit continuous?

  • lineage_effects (bool) – Whether to fit lineages or not

  • lin (numpy.array) – Lineages matrix (n, k)

  • uncompressed (bool) – Whether the kmers file is uncompressed

Returns:

variant (var_obj.Enet)

Iterable of processed variants for printing

pyseer.enet.fit_enet(p, variants, covariates, weights, continuous, alpha, lineage_dict=None, fold_ids=None, n_folds=10, n_cpus=1)[source]

Fit an elastic net model to a set of variants. Prints information about model fit and prediction quality to STDERR

Parameters:
  • p (pandas.DataFrame) – Phenotype vector (n, 1)

  • variants (scipy.sparse.csc_matrix) – Wide sparse matrix representation of all variants to fit to (rows = samples, columns = variants)

  • covariates (pandas.DataFrame) – Covariate matrix (n, j)

  • weights (np.array) – Vector of sample weights (n, 1)

  • continuous (bool) – If True fit a Gaussian error model, otherwise Bionomial error

  • alpha (float) – Between 0-1, sets the mix between ridge regression and lasso regression

  • lineage_dict (list) –

    Names of lineages, indices corrsponding to fold_ids

    [default = None]

  • fold_ids (list) –

    Index of fold assignment for cross-validation, from 0 to 1-n_folds

    [default = None]

  • n_folds (int) –

    Number of folds in cross-validation

    [default = 10]

  • n_cpus (int) –

    Number of processes to use in cross-validation Set to -1 to use all available

    [default = 1]

Returns:

betas (numpy.array)

The fitted betas (slopes) for each variant

pyseer.enet.load_all_vars(var_type, p, burden, burden_regions, infile, all_strains, sample_order, min_af, max_af, max_missing, uncompressed)[source]

Load all variants in the input file into a sparse matrix representation

Parameters:
  • var_type (str) – Variants type (one of: kmers, vcf or Rtab)

  • p (pandas.DataFrame) – Phenotype vector (n, 1)

  • burden (bool) – Whether to slice a vcf file by burden regions

  • burden_regions (collections.deque) – Burden regions to slice the vcf with

  • infile (opened file) – Handle to opened variant file

  • all_strains (set-like) – All sample labels that should be present

  • sample_order – Sampes order to interpret each Rtab line

  • min_af (float) – Minimum allele frequency (inclusive)

  • max_af (float) – maximum allele frequency (inclusive)

  • max_missing (float) – maximum missing frequency

  • uncompressed (bool) – Whether the kmers file is uncompressed

Returns:

variants (scipy.sparse.csr_matrix)

A sparse matrix representation of all variants in the input

selected_vars (list)

0-Indices of variants in the input file in variants (which passed AF filtering)

var_idx (int)

The number of read variants (number of rows of variants)

pyseer.enet.write_lineage_predictions(true_values, predictions, fold_ids, lineage_dict, continuous, stderr_print=True)[source]

Writes prediction ability stratified by lineage to stderr

Parameters:
  • true_values (np.array) – Observed values of phenotype

  • predictions (np.array) – Predicted phenotype values

  • lineage_dict (list) – Names of lineages, indices corrsponding to fold_ids

  • fold_ids (list) – Index of fold assignment for cross-validation, from 0 to 1-n_folds

  • continuous (bool) – True if a continuous phenotype, False if a binary phenotype

  • stderr_print (bool) –

    Print output to stderr

    [default = True]

Returns:

R2_vals (list)

R2 values for each fold

confusion (list)

Tuple of tn, fp, fn, tp for each fold