Reference documentation

input.py

Functions to read data into pyseer and iterate over instances

pyseer.input.file_hash(filename)[source]

Calculates the hash of an entire file on disk

Parameters:

filename (str) – Location of file on disk

Returns:

hash (str): SHA256 checksum

pyseer.input.hash_pattern(k)[source]

Calculates the hash of a presence/absence vector

Parameters:

k (numpy.array) – Variant presence/absence binary vector (n, 1)

Returns:

hash (byte): Hashed pattern

pyseer.input.iter_variants(p, m, cov, var_type, burden, burden_regions, infile, all_strains, sample_order, lineage_effects, lineage_clusters, min_af, max_af, max_missing, filter_pvalue, lrt_pvalue, null_fit, firth_null, uncompressed, continuous)[source]

Make an iterable to pass single variants to fixed effects regression

Parameters:

p (pandas.DataFrame) – Phenotype vector (n, 1)
m (numpy.array) – Population structure matrix (n, k)
cov (pandas.DataFrame) – Covariates matrix (n, m)
var_type (str) – Variants type (one of: kmers, vcf or Rtab)
burden (bool) – Whether to slice a vcf file by burden regions
burden_regions (collections.deque) – Burden regions to slice the vcf with
infile (opened file) – Handle to opened variant file
all_strains (set-like) – All sample labels that should be present
sample_order – Sampes order to interpret each Rtab line
lineage_effects (bool) – Whether to fit lineage effects
clusters (lineage) – Lineage clusters indexes
min_af (float) – Minimum allele frequency (inclusive)
max_af (float) – maximum allele frequency (inclusive)
max_missing (float) – maximum missing frequency
filter_pvalue (float) – Pre-filtering p-value threshold
lrt_pvalue (float) – Filtering p-value threshold
null_fit (float or statsmodels.regression.linear_model.RegressionResultsWrapper) – Null-fit likelihood (binary) or model (continuous)
firth_null (float) – Firth regression likelihood
uncompressed (bool) – Whether the kmers file is uncompressed
continuous (bool) – Whether the phenotype is continuous or not

Returns:

var_name (str): Variant name
v (numpy.array): Phenotypes vector (n, 1)
k (numpy.array): Variant presence/absence vector (n, 1)
m (numpy.array): Population structure matrix (n, k)
c (numpy.array): Covariates matrix (n, m)
af (float): Allele frequency
pattern (bytes): Variant hash
lineage_effects (bool): Whether to fit lineage effects
lineage clusters (list): Lineage clusters indexes
filter_pvalue (float): Pre-filtering p-value threshold
lrt_pvalue (float): Filtering p-value threshold
null_fit (float or statsmodels.regression.linear_model.RegressionResultsWrapper): Null-fit likelihood (binary) or model (continuous)
firth_null (float): Firth regression likelihood
kstrains (iterable): Sample labels with the variant
nkstrains (iterable): Sample labels without the variant
continuous (bool): Whether the phenotype is continuous or not

pyseer.input.iter_variants_lmm(variant_iter, lmm, h2, lineage, lineage_clusters, covariates, continuous, filter_pvalue, lrt_pvalue)[source]: Make an iterable to pass single variants to fixed effects regression

pyseer.input.load_burden(infile, burden_regions)[source]

Load burden regions for VCF analysis

Parameters:

infile (str) – Input file for burden regions
burden_regions (list) – List to be filled in-place

pyseer.input.load_covariates(infile, covariates, p)[source]

Load and encode a covariates matrix

Parameters:

infile (str) – Input file for the covariates matrix
covariates (iterable or None) – List of string indicating which columns to use and their interpretation. Example: 2q indicates that the second column from the file is a quantitative variable, 2 indicates that that same column is categorical. If None, the matrix is loaded but nothing is done with it.
p (pandas.Series) – Phenotypes vector (n, 1)

Returns:

cov (pandas.DataFrame): Covariance matrix (n, m)

pyseer.input.load_lineage(infile, p)[source]

Load custom lineage clusters definitions

Parameters:

infile (str) – Input file for lineage clusters
p (pandas.Series) – Phenotypes vector (n, 1)

Returns:

result (tuple of (numpy.array, list)): Lineage binary matrix and cluster labels

pyseer.input.load_phenotypes(infile, column)[source]

Load phenotypes vector

Parameters:

infile (str) – Matrix input file
column (str or None) – Phenotype column name or None to pick the last column

Returns:

p (pandas.Series): Phenotype vector (n, 1)

pyseer.input.load_structure(infile, p, max_dimensions, mds_type='classic', n_cpus=1, seed=None)[source]

Load population structure and apply multidimensional scaling

Parameters:

infile (str) – Population structure (distance matrix) input file
p (pandas.Series) – Phenotype vector (n, 1)
max_dimensions (int) – Maximum dimensions to consider when applying metric or non-metric MDS
mds_type (str) – MDS algorithm to apply. One of classic, metric or non-metric. Any other input will trigger the metric MDS
n_cpus (int) – Number of CPUs to be used for the metric or non-metric MDS
seed (int or None) – Random seed for metric or non-metric MDS, None if not required

Returns:

m (pandas.DataFrame): Population structure after MDS (n, m)

pyseer.input.load_var_block(var_type, p, burden, burden_regions, infile, all_strains, sample_order, min_af, max_af, max_missing, uncompressed, block_size)[source]

Make in iterable to load blocks of variants for LMM

Parameters:

var_type (str) – Variants type (one of: kmers, vcf or Rtab)
p (pandas.DataFrame) – Phenotype vector (n, 1)
burden (bool) – Whether to slice a vcf file by burden regions
burden_regions (collections.deque) – Burden regions to slice the vcf with
infile (opened file) – Handle to opened variant file
all_strains (set-like) – All sample labels that should be present
sample_order – Sampes order to interpret each Rtab line
min_af (float) – Minimum allele frequency (inclusive)
max_af (float) – maximum allele frequency (inclusive)
max_missing (float) – maximum missing frequency
uncompressed (bool) – Whether the kmers file is uncompressed
block_size (int) – How many variants to be loaded at once

Returns:

variants (iterable): A collection of pyseer.classes.LMM objects describing the loaded variants (n,)
variant_mat (numpy.array): Variant bloack presence/absence matrix (n, block_size)
eof (bool): Whether we are at the end of the file

pyseer.input.open_variant_file(var_type, var_file, burden_file, burden_regions, uncompressed)[source]

Open a variant file for use as an iterable

Parameters:

var_type (str) – Type of variants file (kmers, vcf, Rtab)
var_file (str) – Location of file
burden_file (str) – File containing regions to group burden tests
burden_regions (list) – List of burden regions to be filled in-place
uncompressed (bool) – True if kmer file is not gzipped

pyseer.input.read_variant(infile, p, var_type, burden, burden_regions, uncompressed, all_strains, sample_order, keep_list=None, noparse=False)[source]

Read input line and parse depending on input file type

Return a variant name and pres/abs vector

Parameters:

infile (opened file) – Handle to opened variant file
p (pandas.Series) – Phenotypes vector (n, 1)
var_type (str) – Variants type (one of: kmers, vcf or Rtab)
burden (bool) – Whether to slice a vcf file by burden regions
burden_regions (collections.deque) – Burden regions to slice the vcf with
uncompressed (bool) – Whether the kmers file is uncompressed
all_strains (set-like) – All sample labels that should be present
sample_order – Samples order to interpret each Rtab line
keep_list (dict) –
Variant names to properly read, any other will return None

(default = None)
noparse (bool) –
Set True to skip line without parsing and return None, irrespective of presence in skip_list

(default = False)

Returns:

eof (bool): Whether we are at the end of the file
k (numpy.array): Variant presence/absence vector
var_name (str): Variant name
kstrains (list): Samples in which the variant is present
nkstrains (list): Samples in which the variant is absent
af (float): Allele frequency
missing (float): Missing frequency

pyseer.input.read_vcf_var(variant, d, keep_list=None)[source]

Parses vcf variants from pysam

Returns None if filtered variant. Mutates passed dictionary d

Parameters:

variant (pysam.libcbcf.VariantRecord) – Variant to be parsed
d (dict) – Dictionary to be populated in-place
keep_list (list) – List of variants to read

model.py

Original SEER model (fixed effects) implementations

pyseer.model.firth_likelihood(beta, logit)[source]

Convenience function to calculate likelihood of Firth regression

Parameters:

beta (numpy.array) – (n, 1)
logit (statsmodels.discrete.discrete_model.Logit) – Logistic model

Returns:

likelihood (float): Firth likelihood

pyseer.model.fit_firth(logit_model, start_vec, X, y, step_limit=1000, convergence_limit=0.0001)[source]

Do firth regression

Parameters:

logit (statsmodels.discrete.discrete_model.Logit) – Logistic model
start_vec (numpy.array) – Pre-initialized vector to speed-up convergence (n, 1)
X (numpy.array) – (n, m)
y (numpy.array) – (n, )
step_limit (int) – Maximum number of iterations
convergence_limit (float) – Convergence tolerance

Returns:

intercept (float): Intercept
kbeta (float): Variant beta
beta (iterable): Covariates betas (n-2)
bse (float): Beta std-err
fitll (float or None): Likelihood of fit or None if could not fit

pyseer.model.fit_lineage_effect(lin, c, k)[source]

Fits the model k ~ Wa using binomial error with logit link. W are the lineages (either a projection of samples, or cluster indicators) and covariates. Returns the index of the most significant lineage

Parameters:

lin (numpy.array) – Population structure matrix or lineage association binary matrix (n, k)
c (numpy.array) – Covariants matrix (n, j)
k (numpy.array) – Variant presence-absence vector (n, 1)

Returns:

max_lineage (int or None): Index of the most significant lineage or None is could not fit

pyseer.model.fit_null(p, m, cov, continuous, firth=False)[source]

Fit the null model i.e. regression without k-mer

y ~ Wa

Returns log-likelihood

Parameters:

p (numpy.array) – Phenotypes vector (n, 1)
m (numpy.array) – Population structure matrix (n, k)
cov (pandas.DataFrame) – Covariants dataframe (n, j)
continous (bool) – Whether phenotypes are continuous or binary
firth (bool) – For binary phenotypes whether to use firth regression

Returns:

null_res (statsmodels.regression.linear_model.RegressionResultsWrapper or float or None): Fitted model or log-likelihood (if firth) or None if could not fit

pyseer.model.fixed_effects_regression(variant, p, k, m, c, af, pattern, lineage_effects, lin, pret, lrtt, null_res, null_firth, kstrains, nkstrains, continuous)[source]

Fits the model y ~ Xb + Wa using either binomial error with logit link (binary traits) or Gaussian error (continuous traits)

y is the phenotype
X is the variant presence/absence (fixed effects)
W are covariate fixed effects, including population structure
a and b are slopes to be fitted

Parameters:

variant (str) – Variant identifier
p (numpy.array) – Phenotype vector (binary or continuous) (n, 1)
k (numpy.array) – Variant presence/absence vector (n, 1)
m (numpy.array) – Population structure matrix (n, m)
c (numpy.array) – Covariants matrix (n, j)
af (float) – Allele frequency
pattern (str) – Variant hashed pattern
lineage_effects (bool) – Whether to fit lineages or not
lin (numpy.array) – Lineages matrix (n, k)
pret (float) – Pre-filtering p-value threshold
lrtt (float) – Post-fitting p-value threshold
null_res (float or statsmodels.regression.linear_model.RegressionResultsWrapper) – Null-fit likelihood (binary) or model (continuous)
null_firth (float) – Firth regression likelihood
kstrains (iterable) – Sample labels with the variant
nkstrains (iterable) – Sample labels without the variant
continuous (bool) – Whether the phenotype is continuous or not

Returns:

result (pyseer.classes.Seer): Results container

pyseer.model.pre_filtering(p, k, continuous)[source]

Calculate a naive p-value from a chisq test (binary phenotype) or a t-test (continuous phenotype) which is not adjusted for population structure

Parameters:

p (numpy.array) – Phenotypes vector (n, 1)
k (numpy.array) – Variant presence-absence vector (n, 1)
continous (bool) – Whether phenotypes are continuous or binary

Returns:

prep (float): Naive p-value
bad_chisq (boolean): Whether the chisq test had small values in the contingency table

lmm.py

LMM interface implementations

pyseer.lmm.fit_lmm(lmm, h2, variants, variant_mat, lineage_effects, lineage_clusters, covariates, continuous, filter_pvalue, lrt_pvalue)[source]

Fits LMM and returns LMM tuples for printing

Parameters:

lmm (pyseer.fastlmm.lmm_cov.LMM) – Initialised LMM model
h2 (float) – Trait’s variance explained by covariates
variants (iterable) – Tuples with variant object, phenotype vector and variant vector (pyseer.classes.LMM, numpy.array, numpy.array)
variant_mat (numpy.array) – Variants presence absence matrix (n, k)
lineage_effects (bool) – Whether to fit lineage effects
lineage_clusters (numpy.array) – Population structure matrix or lineage association binary matrix (n, k)
covariates (numpy.array) – Covariates matrix (n, m)
continuous (bool) – Whether the phenotype is continuous
filter_pvalue (float) – Pre-filtering p-value threshold
lrt_pvalue (float) – Post-fitting p-value threshold

Returns:

all_variants (iterable): All variant objects fitted or filtered

pyseer.lmm.fit_lmm_block(lmm, h2, variant_block)[source]

Actually fits the LMM to numpy variant array see map/reduce section of _internal_single in fastlmm.association.single_snp

Parameters:

lmm (pyseer.fastlmm.lmm_cov.LMM) – Initialised LMM model
h2 (float) – Trait’s variance explained by covariates
variant_block (numpy.array) – Variants presence absence matrix (n, k)

Returns:

lmm_results (dict): LMM results for this variants block

pyseer.lmm.initialise_lmm(p, cov, K_in, lmm_cache_in=None, lmm_cache_out=None, lineage_samples=None)[source]

Initialises LMM using the similarity matrix see _internal_single in fastlmm.association.single_snp

Parameters:

p (pandas.Series) – Phenotypes vector (n, 1)
cov (pandas.DataFrame) – Covariance matrix (n, m)
K_in (str) – Similarity matrix filename
lmm_chache_in (str or None) – Filename for an input LMM cache, None if it has to be computed
lmm_chache_out (str or None) – Filename to save the LMM cache, None otherwise.
lineage_samples (list or None) – Sample names used for lineage (must match K_in)

Returns:

p (pandas.Series): Phenotype vector with the samples present in the similarity matrix
lmm (pyseer.fastlmm.lmm_cov.LMM): Initialised LMM model
h2 (float): Trait’s variance explained by covariates

utils.py

Utilities

pyseer.utils.format_output(item, lineage_dict=None, model='seer', print_samples=False)[source]

Format results for a variant for stdout printing

Parameters:

item (pyseer.classes.Seer or pyseer.classes.LMM) – Variant results container
lineage_dict (list) – Lineage labels
model (str) – The model used
print_samples (bool) – Whether to add the samples list to the putput

Returns:

out (str): Tab-delimited string to be printed

pyseer.utils.set_env(**environ)[source]

Temporarily set the process environment variables.

>>> with set_env(PLUGINS_DIR=u'test/plugins'):
...   "PLUGINS_DIR" in os.environ
True

>>> "PLUGINS_DIR" in os.environ
False

cmdscale.py

Function to perform classical MDS

pyseer.cmdscale.cmdscale(D)[source]

Classical multidimensional scaling (MDS)

Parameters:

D (numpy.array) – Symmetric distance matrix (n, n)

Returns:

Y (numpy.array): Configuration matrix (n, p). Each column represents a dimension. Only the p dimensions corresponding to positive eigenvalues of B are returned. Note that each dimension is only determined up to an overall sign, corresponding to a reflection.
e (numpy.array): Eigenvalues of B (n, 1)

enet.py

Elastic net model implementations

pyseer.enet.correlation_filter(p, all_vars, quantile_filter=0.25)[source]

Calculates correlations between phenotype and variants, giving those that are above the specified quantile

Parameters:

p (pandas.DataFrame) – Phenotype vector (n, 1)
all_vars (scipy.sparse.csr_matrix) – Narrow sparse matrix representation of all variants to fit to (rows = variants, columns = samples)
quantile_filter (float) –
The quantile to discard at e.g. 0.25, retain top 75%

[default = 0.25]

Returns:

cor_filter (numpy.array): The indices of variants passing the filter

pyseer.enet.enet_predict(enet_fit, variants, continuous, responses=None)[source]

Use a fitted elastic net model to make predictions about new observations. Returns accuracy if true responses known

Parameters:

enet_fit (cvglmnet) – An elastic net model fitted using cvglmnet or similar
variants (scipy.sparse.csc_matrix) – Wide sparse matrix representation of all variants to predict with (rows = samples, columns = variants)
continuous (bool) – True if a continuous phenotype, False if a binary phenotype
responses (np.array) –
True phenotypes to calculate R^2 with

[default = None]

Returns:

preds (numpy.array): Predicted phenotype for each input sample in variants
R2 (float): Variance explained by model (or None if true labels not provided).

pyseer.enet.find_enet_selected(enet_betas, var_indices, p, c, var_type, fit_seer, burden, burden_regions, infile, all_strains, sample_order, continuous, find_lineage, lin, uncompressed)[source]

Read through the variant input file again, yielding just those variants which had a non-zero slope for printing

Parameters:

enet_betas (numpy.array) – Fitted slopes of intercept, covariants and variants from elastic net
var_indices (list) – The 0-indexed locations (in the original file) of variants represented in enet_betas
p (pandas.DataFrame) – Phenotype vector (n, 1)
c (numpy.array) – Covariate matrix (n, j)
var_type (str) – Variants type (one of: kmers, vcf or Rtab)
(tuple (fit_seer) – m, null_model, null_firth) Distance projection and null models required to fit fixed effect regression
burden (bool) – Whether to slice a vcf file by burden regions
burden_regions (collections.deque) – Burden regions to slice the vcf with
infile (opened file) – Handle to opened variant file
all_strains (set-like) – All sample labels that should be present
sample_order – Sample order to interpret each Rtab line
continuous (bool) – Is phenotype/fit continuous?
lineage_effects (bool) – Whether to fit lineages or not
lin (numpy.array) – Lineages matrix (n, k)
uncompressed (bool) – Whether the kmers file is uncompressed

Returns:

variant (var_obj.Enet): Iterable of processed variants for printing

pyseer.enet.fit_enet(p, variants, covariates, weights, continuous, alpha, lineage_dict=None, fold_ids=None, n_folds=10, n_cpus=1)[source]

Fit an elastic net model to a set of variants. Prints information about model fit and prediction quality to STDERR

Parameters:

p (pandas.DataFrame) – Phenotype vector (n, 1)
variants (scipy.sparse.csc_matrix) – Wide sparse matrix representation of all variants to fit to (rows = samples, columns = variants)
covariates (pandas.DataFrame) – Covariate matrix (n, j)
weights (np.array) – Vector of sample weights (n, 1)
continuous (bool) – If True fit a Gaussian error model, otherwise Bionomial error
alpha (float) – Between 0-1, sets the mix between ridge regression and lasso regression
lineage_dict (list) –
Names of lineages, indices corrsponding to fold_ids

[default = None]
fold_ids (list) –
Index of fold assignment for cross-validation, from 0 to 1-n_folds

[default = None]
n_folds (int) –
Number of folds in cross-validation

[default = 10]
n_cpus (int) –
Number of processes to use in cross-validation Set to -1 to use all available

[default = 1]

Returns:

betas (numpy.array): The fitted betas (slopes) for each variant

pyseer.enet.load_all_vars(var_type, p, burden, burden_regions, infile, all_strains, sample_order, min_af, max_af, max_missing, uncompressed)[source]

Load all variants in the input file into a sparse matrix representation

Parameters:

var_type (str) – Variants type (one of: kmers, vcf or Rtab)
p (pandas.DataFrame) – Phenotype vector (n, 1)
burden (bool) – Whether to slice a vcf file by burden regions
burden_regions (collections.deque) – Burden regions to slice the vcf with
infile (opened file) – Handle to opened variant file
all_strains (set-like) – All sample labels that should be present
sample_order – Sampes order to interpret each Rtab line
min_af (float) – Minimum allele frequency (inclusive)
max_af (float) – maximum allele frequency (inclusive)
max_missing (float) – maximum missing frequency
uncompressed (bool) – Whether the kmers file is uncompressed

Returns:

variants (scipy.sparse.csr_matrix): A sparse matrix representation of all variants in the input
selected_vars (list): 0-Indices of variants in the input file in variants (which passed AF filtering)
var_idx (int): The number of read variants (number of rows of variants)

pyseer.enet.write_lineage_predictions(true_values, predictions, fold_ids, lineage_dict, continuous, stderr_print=True)[source]

Writes prediction ability stratified by lineage to stderr

Parameters:

true_values (np.array) – Observed values of phenotype
predictions (np.array) – Predicted phenotype values
lineage_dict (list) – Names of lineages, indices corrsponding to fold_ids
fold_ids (list) – Index of fold assignment for cross-validation, from 0 to 1-n_folds
continuous (bool) – True if a continuous phenotype, False if a binary phenotype
stderr_print (bool) –
Print output to stderr

[default = True]

Returns:

R2_vals (list): R2 values for each fold
confusion (list): Tuple of tn, fp, fn, tp for each fold