Reference documentation

input.py

Functions to read data into pyseer and iterate over instances

pyseer.input.file_hash(filename)[source]

Calculates the hash of an entire file on disk

Parameters:filename (str) – Location of file on disk
Returns:
hash (str)
SHA256 checksum
pyseer.input.hash_pattern(k)[source]

Calculates the hash of a presence/absence vector

Parameters:k (numpy.array) – Variant presence/absence binary vector (n, 1)
Returns:
hash (byte)
Hashed pattern
pyseer.input.iter_variants(p, m, cov, var_type, burden, burden_regions, infile, all_strains, sample_order, lineage_effects, lineage_clusters, min_af, max_af, max_missing, filter_pvalue, lrt_pvalue, null_fit, firth_null, uncompressed, continuous)[source]

Make an iterable to pass single variants to fixed effects regression

Parameters:
  • p (pandas.DataFrame) – Phenotype vector (n, 1)
  • m (numpy.array) – Population structure matrix (n, k)
  • cov (pandas.DataFrame) – Covariates matrix (n, m)
  • var_type (str) – Variants type (one of: kmers, vcf or Rtab)
  • burden (bool) – Whether to slice a vcf file by burden regions
  • burden_regions (collections.deque) – Burden regions to slice the vcf with
  • infile (opened file) – Handle to opened variant file
  • all_strains (set-like) – All sample labels that should be present
  • sample_order – Sampes order to interpret each Rtab line
  • lineage_effects (bool) – Whether to fit lineage effects
  • clusters (lineage) – Lineage clusters indexes
  • min_af (float) – Minimum allele frequency (inclusive)
  • max_af (float) – maximum allele frequency (inclusive)
  • max_missing (float) – maximum missing frequency
  • filter_pvalue (float) – Pre-filtering p-value threshold
  • lrt_pvalue (float) – Filtering p-value threshold
  • null_fit (float or statsmodels.regression.linear_model.RegressionResultsWrapper) – Null-fit likelihood (binary) or model (continuous)
  • firth_null (float) – Firth regression likelihood
  • uncompressed (bool) – Whether the kmers file is uncompressed
  • continuous (bool) – Whether the phenotype is continuous or not
Returns:

var_name (str)

Variant name

v (numpy.array)

Phenotypes vector (n, 1)

k (numpy.array)

Variant presence/absence vector (n, 1)

m (numpy.array)

Population structure matrix (n, k)

c (numpy.array)

Covariates matrix (n, m)

af (float)

Allele frequency

pattern (bytes)

Variant hash

lineage_effects (bool)

Whether to fit lineage effects

lineage clusters (list)

Lineage clusters indexes

filter_pvalue (float)

Pre-filtering p-value threshold

lrt_pvalue (float)

Filtering p-value threshold

null_fit (float or statsmodels.regression.linear_model.RegressionResultsWrapper)

Null-fit likelihood (binary) or model (continuous)

firth_null (float)

Firth regression likelihood

kstrains (iterable)

Sample labels with the variant

nkstrains (iterable)

Sample labels without the variant

continuous (bool)

Whether the phenotype is continuous or not

pyseer.input.iter_variants_lmm(variant_iter, lmm, h2, lineage, lineage_clusters, covariates, continuous, filter_pvalue, lrt_pvalue)[source]

Make an iterable to pass single variants to fixed effects regression

pyseer.input.load_burden(infile, burden_regions)[source]

Load burden regions for VCF analysis

Parameters:
  • infile (str) – Input file for burden regions
  • burden_regions (list) – List to be filled in-place
pyseer.input.load_covariates(infile, covariates, p)[source]

Load and encode a covariates matrix

Parameters:
  • infile (str) – Input file for the covariates matrix
  • covariates (iterable or None) – List of string indicating which columns to use and their interpretation. Example: 2q indicates that the second column from the file is a quantitative variable, 2 indicates that that same column is categorical. If None, the matrix is loaded but nothing is done with it.
  • p (pandas.Series) – Phenotypes vector (n, 1)
Returns:

cov (pandas.DataFrame)

Covariance matrix (n, m)

pyseer.input.load_lineage(infile, p)[source]

Load custom lineage clusters definitions

Parameters:
  • infile (str) – Input file for lineage clusters
  • p (pandas.Series) – Phenotypes vector (n, 1)
Returns:

result (tuple of (numpy.array, list))

Lineage binary matrix and cluster labels

pyseer.input.load_phenotypes(infile, column)[source]

Load phenotypes vector

Parameters:
  • infile (str) – Matrix input file
  • column (str or None) – Phenotype column name or None to pick the last column
Returns:

p (pandas.Series)

Phenotype vector (n, 1)

pyseer.input.load_structure(infile, p, max_dimensions, mds_type='classic', n_cpus=1, seed=None)[source]

Load population structure and apply multidimensional scaling

Parameters:
  • infile (str) – Population structure (distance matrix) input file
  • p (pandas.Series) – Phenotype vector (n, 1)
  • max_dimensions (int) – Maximum dimensions to consider when applying metric or non-metric MDS
  • mds_type (str) – MDS algorithm to apply. One of classic, metric or non-metric. Any other input will trigger the metric MDS
  • n_cpus (int) – Number of CPUs to be used for the metric or non-metric MDS
  • seed (int or None) – Random seed for metric or non-metric MDS, None if not required
Returns:

m (pandas.DataFrame)

Population structure after MDS (n, m)

pyseer.input.load_var_block(var_type, p, burden, burden_regions, infile, all_strains, sample_order, min_af, max_af, max_missing, uncompressed, block_size)[source]

Make in iterable to load blocks of variants for LMM

Parameters:
  • var_type (str) – Variants type (one of: kmers, vcf or Rtab)
  • p (pandas.DataFrame) – Phenotype vector (n, 1)
  • burden (bool) – Whether to slice a vcf file by burden regions
  • burden_regions (collections.deque) – Burden regions to slice the vcf with
  • infile (opened file) – Handle to opened variant file
  • all_strains (set-like) – All sample labels that should be present
  • sample_order – Sampes order to interpret each Rtab line
  • min_af (float) – Minimum allele frequency (inclusive)
  • max_af (float) – maximum allele frequency (inclusive)
  • max_missing (float) – maximum missing frequency
  • uncompressed (bool) – Whether the kmers file is uncompressed
  • block_size (int) – How many variants to be loaded at once
Returns:

variants (iterable)

A collection of pyseer.classes.LMM objects describing the loaded variants (n,)

variant_mat (numpy.array)

Variant bloack presence/absence matrix (n, block_size)

eof (bool)

Whether we are at the end of the file

pyseer.input.open_variant_file(var_type, var_file, burden_file, burden_regions, uncompressed)[source]

Open a variant file for use as an iterable

Parameters:
  • var_type (str) – Type of variants file (kmers, vcf, Rtab)
  • var_file (str) – Location of file
  • burden_file (str) – File containing regions to group burden tests
  • burden_regions (list) – List of burden regions to be filled in-place
  • uncompressed (bool) – True if kmer file is not gzipped
pyseer.input.read_variant(infile, p, var_type, burden, burden_regions, uncompressed, all_strains, sample_order, keep_list=None, noparse=False)[source]

Read input line and parse depending on input file type

Return a variant name and pres/abs vector

Parameters:
  • infile (opened file) – Handle to opened variant file
  • p (pandas.Series) – Phenotypes vector (n, 1)
  • var_type (str) – Variants type (one of: kmers, vcf or Rtab)
  • burden (bool) – Whether to slice a vcf file by burden regions
  • burden_regions (collections.deque) – Burden regions to slice the vcf with
  • uncompressed (bool) – Whether the kmers file is uncompressed
  • all_strains (set-like) – All sample labels that should be present
  • sample_order – Samples order to interpret each Rtab line
  • keep_list (dict) –

    Variant names to properly read, any other will return None

    (default = None)

  • noparse (bool) –

    Set True to skip line without parsing and return None, irrespective of presence in skip_list

    (default = False)

Returns:

eof (bool)

Whether we are at the end of the file

k (numpy.array)

Variant presence/absence vector

var_name (str)

Variant name

kstrains (list)

Samples in which the variant is present

nkstrains (list)

Samples in which the variant is absent

af (float)

Allele frequency

missing (float)

Missing frequency

pyseer.input.read_vcf_var(variant, d, keep_list=None)[source]

Parses vcf variants from pysam

Returns None if filtered variant. Mutates passed dictionary d

Parameters:
  • variant (pysam.libcbcf.VariantRecord) – Variant to be parsed
  • d (dict) – Dictionary to be populated in-place
  • keep_list (list) – List of variants to read

model.py

Original SEER model (fixed effects) implementations

pyseer.model.firth_likelihood(beta, logit)[source]

Convenience function to calculate likelihood of Firth regression

Parameters:
  • beta (numpy.array) – (n, 1)
  • logit (statsmodels.discrete.discrete_model.Logit) – Logistic model
Returns:

likelihood (float)

Firth likelihood

pyseer.model.fit_firth(logit_model, start_vec, X, y, step_limit=1000, convergence_limit=0.0001)[source]

Do firth regression

Parameters:
  • logit (statsmodels.discrete.discrete_model.Logit) – Logistic model
  • start_vec (numpy.array) – Pre-initialized vector to speed-up convergence (n, 1)
  • X (numpy.array) – (n, m)
  • y (numpy.array) – (n, )
  • step_limit (int) – Maximum number of iterations
  • convergence_limit (float) – Convergence tolerance
Returns:

intercept (float)

Intercept

kbeta (float)

Variant beta

beta (iterable)

Covariates betas (n-2)

bse (float)

Beta std-err

fitll (float or None)

Likelihood of fit or None if could not fit

pyseer.model.fit_lineage_effect(lin, c, k)[source]

Fits the model k ~ Wa using binomial error with logit link. W are the lineages (either a projection of samples, or cluster indicators) and covariates. Returns the index of the most significant lineage

Parameters:
  • lin (numpy.array) – Population structure matrix or lineage association binary matrix (n, k)
  • c (numpy.array) – Covariants matrix (n, j)
  • k (numpy.array) – Variant presence-absence vector (n, 1)
Returns:

max_lineage (int or None)

Index of the most significant lineage or None is could not fit

pyseer.model.fit_null(p, m, cov, continuous, firth=False)[source]

Fit the null model i.e. regression without k-mer

y ~ Wa

Returns log-likelihood

Parameters:
  • p (numpy.array) – Phenotypes vector (n, 1)
  • m (numpy.array) – Population structure matrix (n, k)
  • cov (pandas.DataFrame) – Covariants dataframe (n, j)
  • continous (bool) – Whether phenotypes are continuous or binary
  • firth (bool) – For binary phenotypes whether to use firth regression
Returns:

null_res (statsmodels.regression.linear_model.RegressionResultsWrapper or float or None)

Fitted model or log-likelihood (if firth) or None if could not fit

pyseer.model.fixed_effects_regression(variant, p, k, m, c, af, pattern, lineage_effects, lin, pret, lrtt, null_res, null_firth, kstrains, nkstrains, continuous)[source]

Fits the model y ~ Xb + Wa using either binomial error with logit link (binary traits) or Gaussian error (continuous traits)

  • y is the phenotype
  • X is the variant presence/absence (fixed effects)
  • W are covariate fixed effects, including population structure
  • a and b are slopes to be fitted
Parameters:
  • variant (str) – Variant identifier
  • p (numpy.array) – Phenotype vector (binary or continuous) (n, 1)
  • k (numpy.array) – Variant presence/absence vector (n, 1)
  • m (numpy.array) – Population structure matrix (n, m)
  • c (numpy.array) – Covariants matrix (n, j)
  • af (float) – Allele frequency
  • pattern (str) – Variant hashed pattern
  • lineage_effects (bool) – Whether to fit lineages or not
  • lin (numpy.array) – Lineages matrix (n, k)
  • pret (float) – Pre-filtering p-value threshold
  • lrtt (float) – Post-fitting p-value threshold
  • null_res (float or statsmodels.regression.linear_model.RegressionResultsWrapper) – Null-fit likelihood (binary) or model (continuous)
  • null_firth (float) – Firth regression likelihood
  • kstrains (iterable) – Sample labels with the variant
  • nkstrains (iterable) – Sample labels without the variant
  • continuous (bool) – Whether the phenotype is continuous or not
Returns:

result (pyseer.classes.Seer)

Results container

pyseer.model.pre_filtering(p, k, continuous)[source]

Calculate a naive p-value from a chisq test (binary phenotype) or a t-test (continuous phenotype) which is not adjusted for population structure

Parameters:
  • p (numpy.array) – Phenotypes vector (n, 1)
  • k (numpy.array) – Variant presence-absence vector (n, 1)
  • continous (bool) – Whether phenotypes are continuous or binary
Returns:

prep (float)

Naive p-value

bad_chisq (boolean)

Whether the chisq test had small values in the contingency table

lmm.py

LMM interface implementations

pyseer.lmm.fit_lmm(lmm, h2, variants, variant_mat, lineage_effects, lineage_clusters, covariates, continuous, filter_pvalue, lrt_pvalue)[source]

Fits LMM and returns LMM tuples for printing

Parameters:
  • lmm (pyseer.fastlmm.lmm_cov.LMM) – Initialised LMM model
  • h2 (float) – Trait’s variance explained by covariates
  • variants (iterable) – Tuples with variant object, phenotype vector and variant vector (pyseer.classes.LMM, numpy.array, numpy.array)
  • variant_mat (numpy.array) – Variants presence absence matrix (n, k)
  • lineage_effects (bool) – Whether to fit lineage effects
  • lineage_clusters (numpy.array) – Population structure matrix or lineage association binary matrix (n, k)
  • covariates (numpy.array) – Covariates matrix (n, m)
  • continuous (bool) – Whether the phenotype is continuous
  • filter_pvalue (float) – Pre-filtering p-value threshold
  • lrt_pvalue (float) – Post-fitting p-value threshold
Returns:

all_variants (iterable)

All variant objects fitted or filtered

pyseer.lmm.fit_lmm_block(lmm, h2, variant_block)[source]

Actually fits the LMM to numpy variant array see map/reduce section of _internal_single in fastlmm.association.single_snp

Parameters:
  • lmm (pyseer.fastlmm.lmm_cov.LMM) – Initialised LMM model
  • h2 (float) – Trait’s variance explained by covariates
  • variant_block (numpy.array) – Variants presence absence matrix (n, k)
Returns:

lmm_results (dict)

LMM results for this variants block

pyseer.lmm.initialise_lmm(p, cov, K_in, lmm_cache_in=None, lmm_cache_out=None, lineage_samples=None)[source]

Initialises LMM using the similarity matrix see _internal_single in fastlmm.association.single_snp

Parameters:
  • p (pandas.Series) – Phenotypes vector (n, 1)
  • cov (pandas.DataFrame) – Covariance matrix (n, m)
  • K_in (str) – Similarity matrix filename
  • lmm_chache_in (str or None) – Filename for an input LMM cache, None if it has to be computed
  • lmm_chache_out (str or None) – Filename to save the LMM cache, None otherwise.
  • lineage_samples (list or None) – Sample names used for lineage (must match K_in)
Returns:

p (pandas.Series)

Phenotype vector with the samples present in the similarity matrix

lmm (pyseer.fastlmm.lmm_cov.LMM)

Initialised LMM model

h2 (float)

Trait’s variance explained by covariates

utils.py

Utilities

pyseer.utils.format_output(item, lineage_dict=None, model='seer', print_samples=False)[source]

Format results for a variant for stdout printing

Parameters:
  • item (pyseer.classes.Seer or pyseer.classes.LMM) – Variant results container
  • lineage_dict (list) – Lineage labels
  • model (str) – The model used
  • print_samples (bool) – Whether to add the samples list to the putput
Returns:

out (str)

Tab-delimited string to be printed

pyseer.utils.set_env(**environ)[source]

Temporarily set the process environment variables.

>>> with set_env(PLUGINS_DIR=u'test/plugins'):
...   "PLUGINS_DIR" in os.environ
True
>>> "PLUGINS_DIR" in os.environ
False

cmdscale.py

Function to perform classical MDS

pyseer.cmdscale.cmdscale(D)[source]

Classical multidimensional scaling (MDS)

Parameters:D (numpy.array) – Symmetric distance matrix (n, n)
Returns:
Y (numpy.array)
Configuration matrix (n, p). Each column represents a dimension. Only the p dimensions corresponding to positive eigenvalues of B are returned. Note that each dimension is only determined up to an overall sign, corresponding to a reflection.
e (numpy.array)
Eigenvalues of B (n, 1)