Reference documentation
input.py
Functions to read data into pyseer and iterate over instances
- pyseer.input.file_hash(filename)[source]
Calculates the hash of an entire file on disk
- Parameters:
filename (str) – Location of file on disk
- Returns:
- hash (str)
SHA256 checksum
- pyseer.input.hash_pattern(k)[source]
Calculates the hash of a presence/absence vector
- Parameters:
k (numpy.array) – Variant presence/absence binary vector (n, 1)
- Returns:
- hash (byte)
Hashed pattern
- pyseer.input.iter_variants(p, m, cov, var_type, burden, burden_regions, infile, all_strains, sample_order, lineage_effects, lineage_clusters, min_af, max_af, max_missing, filter_pvalue, lrt_pvalue, null_fit, firth_null, uncompressed, continuous)[source]
Make an iterable to pass single variants to fixed effects regression
- Parameters:
p (pandas.DataFrame) – Phenotype vector (n, 1)
m (numpy.array) – Population structure matrix (n, k)
cov (pandas.DataFrame) – Covariates matrix (n, m)
var_type (str) – Variants type (one of: kmers, vcf or Rtab)
burden (bool) – Whether to slice a vcf file by burden regions
burden_regions (collections.deque) – Burden regions to slice the vcf with
infile (opened file) – Handle to opened variant file
all_strains (set-like) – All sample labels that should be present
sample_order – Sampes order to interpret each Rtab line
lineage_effects (bool) – Whether to fit lineage effects
clusters (lineage) – Lineage clusters indexes
min_af (float) – Minimum allele frequency (inclusive)
max_af (float) – maximum allele frequency (inclusive)
max_missing (float) – maximum missing frequency
filter_pvalue (float) – Pre-filtering p-value threshold
lrt_pvalue (float) – Filtering p-value threshold
null_fit (float or statsmodels.regression.linear_model.RegressionResultsWrapper) – Null-fit likelihood (binary) or model (continuous)
firth_null (float) – Firth regression likelihood
uncompressed (bool) – Whether the kmers file is uncompressed
continuous (bool) – Whether the phenotype is continuous or not
- Returns:
- var_name (str)
Variant name
- v (numpy.array)
Phenotypes vector (n, 1)
- k (numpy.array)
Variant presence/absence vector (n, 1)
- m (numpy.array)
Population structure matrix (n, k)
- c (numpy.array)
Covariates matrix (n, m)
- af (float)
Allele frequency
- pattern (bytes)
Variant hash
- lineage_effects (bool)
Whether to fit lineage effects
- lineage clusters (list)
Lineage clusters indexes
- filter_pvalue (float)
Pre-filtering p-value threshold
- lrt_pvalue (float)
Filtering p-value threshold
- null_fit (float or statsmodels.regression.linear_model.RegressionResultsWrapper)
Null-fit likelihood (binary) or model (continuous)
- firth_null (float)
Firth regression likelihood
- kstrains (iterable)
Sample labels with the variant
- nkstrains (iterable)
Sample labels without the variant
- continuous (bool)
Whether the phenotype is continuous or not
- pyseer.input.iter_variants_lmm(variant_iter, lmm, h2, lineage, lineage_clusters, covariates, continuous, filter_pvalue, lrt_pvalue)[source]
Make an iterable to pass single variants to fixed effects regression
- pyseer.input.load_burden(infile, burden_regions)[source]
Load burden regions for VCF analysis
- Parameters:
infile (str) – Input file for burden regions
burden_regions (list) – List to be filled in-place
- pyseer.input.load_covariates(infile, covariates, p)[source]
Load and encode a covariates matrix
- Parameters:
infile (str) – Input file for the covariates matrix
covariates (iterable or None) – List of string indicating which columns to use and their interpretation. Example: 2q indicates that the second column from the file is a quantitative variable, 2 indicates that that same column is categorical. If None, the matrix is loaded but nothing is done with it.
p (pandas.Series) – Phenotypes vector (n, 1)
- Returns:
- cov (pandas.DataFrame)
Covariance matrix (n, m)
- pyseer.input.load_lineage(infile, p)[source]
Load custom lineage clusters definitions
- Parameters:
infile (str) – Input file for lineage clusters
p (pandas.Series) – Phenotypes vector (n, 1)
- Returns:
- result (tuple of (numpy.array, list))
Lineage binary matrix and cluster labels
- pyseer.input.load_phenotypes(infile, column)[source]
Load phenotypes vector
- Parameters:
infile (str) – Matrix input file
column (str or None) – Phenotype column name or None to pick the last column
- Returns:
- p (pandas.Series)
Phenotype vector (n, 1)
- pyseer.input.load_structure(infile, p, max_dimensions, mds_type='classic', n_cpus=1, seed=None)[source]
Load population structure and apply multidimensional scaling
- Parameters:
infile (str) – Population structure (distance matrix) input file
p (pandas.Series) – Phenotype vector (n, 1)
max_dimensions (int) – Maximum dimensions to consider when applying metric or non-metric MDS
mds_type (str) – MDS algorithm to apply. One of classic, metric or non-metric. Any other input will trigger the metric MDS
n_cpus (int) – Number of CPUs to be used for the metric or non-metric MDS
seed (int or None) – Random seed for metric or non-metric MDS, None if not required
- Returns:
- m (pandas.DataFrame)
Population structure after MDS (n, m)
- pyseer.input.load_var_block(var_type, p, burden, burden_regions, infile, all_strains, sample_order, min_af, max_af, max_missing, uncompressed, block_size)[source]
Make in iterable to load blocks of variants for LMM
- Parameters:
var_type (str) – Variants type (one of: kmers, vcf or Rtab)
p (pandas.DataFrame) – Phenotype vector (n, 1)
burden (bool) – Whether to slice a vcf file by burden regions
burden_regions (collections.deque) – Burden regions to slice the vcf with
infile (opened file) – Handle to opened variant file
all_strains (set-like) – All sample labels that should be present
sample_order – Sampes order to interpret each Rtab line
min_af (float) – Minimum allele frequency (inclusive)
max_af (float) – maximum allele frequency (inclusive)
max_missing (float) – maximum missing frequency
uncompressed (bool) – Whether the kmers file is uncompressed
block_size (int) – How many variants to be loaded at once
- Returns:
- variants (iterable)
A collection of pyseer.classes.LMM objects describing the loaded variants (n,)
- variant_mat (numpy.array)
Variant bloack presence/absence matrix (n, block_size)
- eof (bool)
Whether we are at the end of the file
- pyseer.input.open_variant_file(var_type, var_file, burden_file, burden_regions, uncompressed)[source]
Open a variant file for use as an iterable
- Parameters:
var_type (str) – Type of variants file (kmers, vcf, Rtab)
var_file (str) – Location of file
burden_file (str) – File containing regions to group burden tests
burden_regions (list) – List of burden regions to be filled in-place
uncompressed (bool) – True if kmer file is not gzipped
- pyseer.input.read_variant(infile, p, var_type, burden, burden_regions, uncompressed, all_strains, sample_order, keep_list=None, noparse=False)[source]
Read input line and parse depending on input file type
Return a variant name and pres/abs vector
- Parameters:
infile (opened file) – Handle to opened variant file
p (pandas.Series) – Phenotypes vector (n, 1)
var_type (str) – Variants type (one of: kmers, vcf or Rtab)
burden (bool) – Whether to slice a vcf file by burden regions
burden_regions (collections.deque) – Burden regions to slice the vcf with
uncompressed (bool) – Whether the kmers file is uncompressed
all_strains (set-like) – All sample labels that should be present
sample_order – Samples order to interpret each Rtab line
keep_list (dict) –
Variant names to properly read, any other will return None
(default = None)
noparse (bool) –
Set True to skip line without parsing and return None, irrespective of presence in skip_list
(default = False)
- Returns:
- eof (bool)
Whether we are at the end of the file
- k (numpy.array)
Variant presence/absence vector
- var_name (str)
Variant name
- kstrains (list)
Samples in which the variant is present
- nkstrains (list)
Samples in which the variant is absent
- af (float)
Allele frequency
- missing (float)
Missing frequency
- pyseer.input.read_vcf_var(variant, d, keep_list=None)[source]
Parses vcf variants from pysam
Returns None if filtered variant. Mutates passed dictionary d
- Parameters:
variant (pysam.libcbcf.VariantRecord) – Variant to be parsed
d (dict) – Dictionary to be populated in-place
keep_list (list) – List of variants to read
model.py
Original SEER model (fixed effects) implementations
- pyseer.model.firth_likelihood(beta, logit)[source]
Convenience function to calculate likelihood of Firth regression
- Parameters:
beta (numpy.array) – (n, 1)
logit (statsmodels.discrete.discrete_model.Logit) – Logistic model
- Returns:
- likelihood (float)
Firth likelihood
- pyseer.model.fit_firth(logit_model, start_vec, X, y, step_limit=1000, convergence_limit=0.0001)[source]
Do firth regression
- Parameters:
logit (statsmodels.discrete.discrete_model.Logit) – Logistic model
start_vec (numpy.array) – Pre-initialized vector to speed-up convergence (n, 1)
X (numpy.array) – (n, m)
y (numpy.array) – (n, )
step_limit (int) – Maximum number of iterations
convergence_limit (float) – Convergence tolerance
- Returns:
- intercept (float)
Intercept
- kbeta (float)
Variant beta
- beta (iterable)
Covariates betas (n-2)
- bse (float)
Beta std-err
- fitll (float or None)
Likelihood of fit or None if could not fit
- pyseer.model.fit_lineage_effect(lin, c, k)[source]
Fits the model k ~ Wa using binomial error with logit link. W are the lineages (either a projection of samples, or cluster indicators) and covariates. Returns the index of the most significant lineage
- Parameters:
lin (numpy.array) – Population structure matrix or lineage association binary matrix (n, k)
c (numpy.array) – Covariants matrix (n, j)
k (numpy.array) – Variant presence-absence vector (n, 1)
- Returns:
- max_lineage (int or None)
Index of the most significant lineage or None is could not fit
- pyseer.model.fit_null(p, m, cov, continuous, firth=False)[source]
Fit the null model i.e. regression without k-mer
y ~ Wa
Returns log-likelihood
- Parameters:
p (numpy.array) – Phenotypes vector (n, 1)
m (numpy.array) – Population structure matrix (n, k)
cov (pandas.DataFrame) – Covariants dataframe (n, j)
continous (bool) – Whether phenotypes are continuous or binary
firth (bool) – For binary phenotypes whether to use firth regression
- Returns:
- null_res (statsmodels.regression.linear_model.RegressionResultsWrapper or float or None)
Fitted model or log-likelihood (if firth) or None if could not fit
- pyseer.model.fixed_effects_regression(variant, p, k, m, c, af, pattern, lineage_effects, lin, pret, lrtt, null_res, null_firth, kstrains, nkstrains, continuous)[source]
Fits the model y ~ Xb + Wa using either binomial error with logit link (binary traits) or Gaussian error (continuous traits)
y is the phenotype
X is the variant presence/absence (fixed effects)
W are covariate fixed effects, including population structure
a and b are slopes to be fitted
- Parameters:
variant (str) – Variant identifier
p (numpy.array) – Phenotype vector (binary or continuous) (n, 1)
k (numpy.array) – Variant presence/absence vector (n, 1)
m (numpy.array) – Population structure matrix (n, m)
c (numpy.array) – Covariants matrix (n, j)
af (float) – Allele frequency
pattern (str) – Variant hashed pattern
lineage_effects (bool) – Whether to fit lineages or not
lin (numpy.array) – Lineages matrix (n, k)
pret (float) – Pre-filtering p-value threshold
lrtt (float) – Post-fitting p-value threshold
null_res (float or statsmodels.regression.linear_model.RegressionResultsWrapper) – Null-fit likelihood (binary) or model (continuous)
null_firth (float) – Firth regression likelihood
kstrains (iterable) – Sample labels with the variant
nkstrains (iterable) – Sample labels without the variant
continuous (bool) – Whether the phenotype is continuous or not
- Returns:
- result (pyseer.classes.Seer)
Results container
- pyseer.model.pre_filtering(p, k, continuous)[source]
Calculate a naive p-value from a chisq test (binary phenotype) or a t-test (continuous phenotype) which is not adjusted for population structure
- Parameters:
p (numpy.array) – Phenotypes vector (n, 1)
k (numpy.array) – Variant presence-absence vector (n, 1)
continous (bool) – Whether phenotypes are continuous or binary
- Returns:
- prep (float)
Naive p-value
- bad_chisq (boolean)
Whether the chisq test had small values in the contingency table
lmm.py
LMM interface implementations
- pyseer.lmm.fit_lmm(lmm, h2, variants, variant_mat, lineage_effects, lineage_clusters, covariates, continuous, filter_pvalue, lrt_pvalue)[source]
Fits LMM and returns LMM tuples for printing
- Parameters:
lmm (pyseer.fastlmm.lmm_cov.LMM) – Initialised LMM model
h2 (float) – Trait’s variance explained by covariates
variants (iterable) – Tuples with variant object, phenotype vector and variant vector (pyseer.classes.LMM, numpy.array, numpy.array)
variant_mat (numpy.array) – Variants presence absence matrix (n, k)
lineage_effects (bool) – Whether to fit lineage effects
lineage_clusters (numpy.array) – Population structure matrix or lineage association binary matrix (n, k)
covariates (numpy.array) – Covariates matrix (n, m)
continuous (bool) – Whether the phenotype is continuous
filter_pvalue (float) – Pre-filtering p-value threshold
lrt_pvalue (float) – Post-fitting p-value threshold
- Returns:
- all_variants (iterable)
All variant objects fitted or filtered
- pyseer.lmm.fit_lmm_block(lmm, h2, variant_block)[source]
Actually fits the LMM to numpy variant array see map/reduce section of _internal_single in fastlmm.association.single_snp
- Parameters:
lmm (pyseer.fastlmm.lmm_cov.LMM) – Initialised LMM model
h2 (float) – Trait’s variance explained by covariates
variant_block (numpy.array) – Variants presence absence matrix (n, k)
- Returns:
- lmm_results (dict)
LMM results for this variants block
- pyseer.lmm.initialise_lmm(p, cov, K_in, lmm_cache_in=None, lmm_cache_out=None, lineage_samples=None)[source]
Initialises LMM using the similarity matrix see _internal_single in fastlmm.association.single_snp
- Parameters:
p (pandas.Series) – Phenotypes vector (n, 1)
cov (pandas.DataFrame) – Covariance matrix (n, m)
K_in (str) – Similarity matrix filename
lmm_chache_in (str or None) – Filename for an input LMM cache, None if it has to be computed
lmm_chache_out (str or None) – Filename to save the LMM cache, None otherwise.
lineage_samples (list or None) – Sample names used for lineage (must match K_in)
- Returns:
- p (pandas.Series)
Phenotype vector with the samples present in the similarity matrix
- lmm (pyseer.fastlmm.lmm_cov.LMM)
Initialised LMM model
- h2 (float)
Trait’s variance explained by covariates
utils.py
Utilities
- pyseer.utils.format_output(item, lineage_dict=None, model='seer', print_samples=False)[source]
Format results for a variant for stdout printing
- Parameters:
item (pyseer.classes.Seer or pyseer.classes.LMM) – Variant results container
lineage_dict (list) – Lineage labels
model (str) – The model used
print_samples (bool) – Whether to add the samples list to the putput
- Returns:
- out (str)
Tab-delimited string to be printed
cmdscale.py
Function to perform classical MDS
- pyseer.cmdscale.cmdscale(D)[source]
Classical multidimensional scaling (MDS)
- Parameters:
D (numpy.array) – Symmetric distance matrix (n, n)
- Returns:
- Y (numpy.array)
Configuration matrix (n, p). Each column represents a dimension. Only the p dimensions corresponding to positive eigenvalues of B are returned. Note that each dimension is only determined up to an overall sign, corresponding to a reflection.
- e (numpy.array)
Eigenvalues of B (n, 1)
enet.py
Elastic net model implementations
- pyseer.enet.correlation_filter(p, all_vars, quantile_filter=0.25)[source]
Calculates correlations between phenotype and variants, giving those that are above the specified quantile
- Parameters:
p (pandas.DataFrame) – Phenotype vector (n, 1)
all_vars (scipy.sparse.csr_matrix) – Narrow sparse matrix representation of all variants to fit to (rows = variants, columns = samples)
quantile_filter (float) –
The quantile to discard at e.g. 0.25, retain top 75%
[default = 0.25]
- Returns:
- cor_filter (numpy.array)
The indices of variants passing the filter
- pyseer.enet.enet_predict(enet_fit, variants, continuous, responses=None)[source]
Use a fitted elastic net model to make predictions about new observations. Returns accuracy if true responses known
- Parameters:
enet_fit (cvglmnet) – An elastic net model fitted using cvglmnet or similar
variants (scipy.sparse.csc_matrix) – Wide sparse matrix representation of all variants to predict with (rows = samples, columns = variants)
continuous (bool) – True if a continuous phenotype, False if a binary phenotype
responses (np.array) –
True phenotypes to calculate R^2 with
[default = None]
- Returns:
- preds (numpy.array)
Predicted phenotype for each input sample in variants
- R2 (float)
Variance explained by model (or None if true labels not provided).
- pyseer.enet.find_enet_selected(enet_betas, var_indices, p, c, var_type, fit_seer, burden, burden_regions, infile, all_strains, sample_order, continuous, find_lineage, lin, uncompressed)[source]
Read through the variant input file again, yielding just those variants which had a non-zero slope for printing
- Parameters:
enet_betas (numpy.array) – Fitted slopes of intercept, covariants and variants from elastic net
var_indices (list) – The 0-indexed locations (in the original file) of variants represented in enet_betas
p (pandas.DataFrame) – Phenotype vector (n, 1)
c (numpy.array) – Covariate matrix (n, j)
var_type (str) – Variants type (one of: kmers, vcf or Rtab)
(tuple (fit_seer) – m, null_model, null_firth) Distance projection and null models required to fit fixed effect regression
burden (bool) – Whether to slice a vcf file by burden regions
burden_regions (collections.deque) – Burden regions to slice the vcf with
infile (opened file) – Handle to opened variant file
all_strains (set-like) – All sample labels that should be present
sample_order – Sample order to interpret each Rtab line
continuous (bool) – Is phenotype/fit continuous?
lineage_effects (bool) – Whether to fit lineages or not
lin (numpy.array) – Lineages matrix (n, k)
uncompressed (bool) – Whether the kmers file is uncompressed
- Returns:
- variant (var_obj.Enet)
Iterable of processed variants for printing
- pyseer.enet.fit_enet(p, variants, covariates, weights, continuous, alpha, lineage_dict=None, fold_ids=None, n_folds=10, n_cpus=1)[source]
Fit an elastic net model to a set of variants. Prints information about model fit and prediction quality to STDERR
- Parameters:
p (pandas.DataFrame) – Phenotype vector (n, 1)
variants (scipy.sparse.csc_matrix) – Wide sparse matrix representation of all variants to fit to (rows = samples, columns = variants)
covariates (pandas.DataFrame) – Covariate matrix (n, j)
weights (np.array) – Vector of sample weights (n, 1)
continuous (bool) – If True fit a Gaussian error model, otherwise Bionomial error
alpha (float) – Between 0-1, sets the mix between ridge regression and lasso regression
lineage_dict (list) –
Names of lineages, indices corrsponding to fold_ids
[default = None]
fold_ids (list) –
Index of fold assignment for cross-validation, from 0 to 1-n_folds
[default = None]
n_folds (int) –
Number of folds in cross-validation
[default = 10]
n_cpus (int) –
Number of processes to use in cross-validation Set to -1 to use all available
[default = 1]
- Returns:
- betas (numpy.array)
The fitted betas (slopes) for each variant
- pyseer.enet.load_all_vars(var_type, p, burden, burden_regions, infile, all_strains, sample_order, min_af, max_af, max_missing, uncompressed)[source]
Load all variants in the input file into a sparse matrix representation
- Parameters:
var_type (str) – Variants type (one of: kmers, vcf or Rtab)
p (pandas.DataFrame) – Phenotype vector (n, 1)
burden (bool) – Whether to slice a vcf file by burden regions
burden_regions (collections.deque) – Burden regions to slice the vcf with
infile (opened file) – Handle to opened variant file
all_strains (set-like) – All sample labels that should be present
sample_order – Sampes order to interpret each Rtab line
min_af (float) – Minimum allele frequency (inclusive)
max_af (float) – maximum allele frequency (inclusive)
max_missing (float) – maximum missing frequency
uncompressed (bool) – Whether the kmers file is uncompressed
- Returns:
- variants (scipy.sparse.csr_matrix)
A sparse matrix representation of all variants in the input
- selected_vars (list)
0-Indices of variants in the input file in variants (which passed AF filtering)
- var_idx (int)
The number of read variants (number of rows of variants)
- pyseer.enet.write_lineage_predictions(true_values, predictions, fold_ids, lineage_dict, continuous, stderr_print=True)[source]
Writes prediction ability stratified by lineage to stderr
- Parameters:
true_values (np.array) – Observed values of phenotype
predictions (np.array) – Predicted phenotype values
lineage_dict (list) – Names of lineages, indices corrsponding to fold_ids
fold_ids (list) – Index of fold assignment for cross-validation, from 0 to 1-n_folds
continuous (bool) – True if a continuous phenotype, False if a binary phenotype
stderr_print (bool) –
Print output to stderr
[default = True]
- Returns:
- R2_vals (list)
R2 values for each fold
- confusion (list)
Tuple of tn, fp, fn, tp for each fold