pyproteome.data_sets package¶
-
class
pyproteome.data_sets.
DataSet
(name='', psms=None, search_name=None, channels=None, groups=None, cmp_groups=None, fix_channel_names=True, dropna=False, pick_best_psm=True, constand_norm=False, merge_duplicates=True, filter_bad=True, check_raw=True, skip_load=False, skip_logging=False)[source]¶ Bases:
object
Class that encompasses a proteomics data set. Data sets can be initialized by calling this class’s constructor directly, or by using load_all_data().
Includes peptide-spectrum matches, quantification info, and mappings between channels, samples, and sample groups.
Data sets are automatically loaded, filtered, and merged by default. See DEFAULT_FILTER_BAD for default filtering parameters. See DataSet.merge_duplicates() for info on how multiple peptide-spectrum matches are integrated together.
Attributes: - search_name : str, optional
Name of the search file this data set was loaded from.
- psms :
pandas.DataFrame
Contains at least ‘Proteins’, ‘Sequence’, and ‘Modifications’ columns as well as any quantification data.
- channels : dict of str, str
Maps label channel to sample name.
- groups : dict of str, list of str
Maps groups to list of sample names. The primary group is considered as the first in this sequence.
- cmp_groups : list of list of str
List of groups that are being compared.
- name : str
Name of this data set.
- levels : dict of str, float
Peptide levels used for normalization.
- intra_normalized : bool
Indicates if the data set has been normalized within a TMT-plex analysis.
- inter_normalized : bool
Indicates if the data set has been normalized for comparison across TMT-plex analyses.
- sets : int
Number of sets merged into this data set.
-
accessions
¶ Get all UniProt accessions occurring in this data set.
Returns: - list of str
Examples
>>> ds.accessions
['P42227', 'Q920G3', 'Q9ES52']
-
add_peptide
(inserts)[source]¶ Manually add a peptide or list of peptides to a data set.
Parameters: - inserts : dict or list of dict
Examples
This example demonstrates how to insert a peptide that was manually validated by the user:
from pyproteome import data_sets

prots = data_sets.protein.Proteins(
    proteins=(
        data_sets.protein.Protein(
            accession='Q920G3',
            gene='Siglec5',
            description='Sialic acid-binding Ig-like lectin 5',
            full_sequence=(
                'MRWAWLLPLLWAGCLATDGYSLSVTGSVTVQEGLCVFVACQVQYPNSKGPVFGYWFREGA'
                'NIFSGSPVATNDPQRSVLKEAQGRFYLMGKENSHNCSLDIRDAQKIDTGTYFFRLDGSVK'
                'YSFQKSMLSVLVIALTEVPNIQVTSTLVSGNSTKLLCSVPWACEQGTPPIFSWMSSALTS'
                'LGHRTTLSSELNLTPRPQDNGTNLTCQVNLPGTGVTVERTQQLSVIYAPQKMTIRVSWGD'
                'DTGTKVLQSGASLQIQEGESLSLVCMADSNPPAVLSWERPTQKPFQLSTPAELQLPRAEL'
                'EDQGKYICQAQNSQGAQTASVSLSIRSLLQLLGPSCSFEGQGLHCSCSSRAWPAPSLRWR'
                'LGEGVLEGNSSNGSFTVKSSSAGQWANSSLILSMEFSSNHRLSCEAWSDNRVQRATILLV'
                'SGPKVSQAGKSETSRGTVLGAIWGAGLMALLAVCLCLIFFTVKVLRKKSALKVAATKGNH'
                'LAKNPASTINSASITSSNIALGYPIQGHLNEPGSQTQKEQPPLATVPDTQKDEPELHYAS'
                'LSFQGPMPPKPQNTEAMKSVYTEIKIHKC'
            ),
        ),
    ),
)

# The lowercase 'y' marks the phosphorylated tyrosine.
seq = data_sets.sequence.extract_sequence(prots, 'SVyTEIK')

mods = data_sets.modification.Modifications(
    mods=[
        data_sets.modification.Modification(
            rel_pos=0,
            mod_type='TMT10plex',
            nterm=True,
            sequence=seq,
        ),
        data_sets.modification.Modification(
            rel_pos=2,
            mod_type='Phospho',
            sequence=seq,
        ),
        data_sets.modification.Modification(
            rel_pos=6,
            mod_type='TMT10plex',
            sequence=seq,
        ),
    ],
)
seq.modifications = mods

ckh_sigf1_py_insert = {
    'Proteins': prots,
    'Sequence': seq,
    'Modifications': mods,
    '126': 1.46e4,
    '127N': 2.18e4,
    '127C': 1.88e4,
    '128N': 4.66e3,
    '128C': 6.70e3,
    '129N': 7.88e3,
    '129C': 1.03e4,
    '130N': 7.28e3,
    '130C': 2.98e3,
    '131': 6.01e3,
    'Validated': True,
    'First Scan': {23074},
    'Raw Paths': {'2019-04-24-CKp25-SiglecF-1-py-SpinCol-col189.raw'},
    'Scan Paths': {'CK-7wk-H1-pY'},
    'IonScore': 30,
    'Isolation Interference': 0,
}

# ds is an existing DataSet.
ds.add_peptide([ckh_sigf1_py_insert])
-
check_raw
()[source]¶ Checks that all raw files referenced in search data can be found in pyproteome.paths.MS_RAW_DIR.
Returns: - found_all : bool
-
data
¶ Get the quantification data for all samples and peptides in a data set.
Returns: - df :
pandas.DataFrame
-
dropna
(columns=None, how=None, thresh=None, groups=None, inplace=False)[source]¶ Drop any rows with NaN values.
Parameters: - columns : list of str, optional
- how : str, optional
- groups : list of str, optional
Only drop rows with NaN in columns within groups.
- inplace : bool, optional
Returns: - ds :
DataSet
-
filter
(filters=None, inplace=False, **kwargs)[source]¶ Filters a data set.
Parameters: - filters : list of dict or dict, optional
List of filters to apply to data set. Filters are also pulled from kwargs (see below).
- inplace : bool, optional
Perform the filter on self, otherwise create a copy and return the new object.
Returns: - ds :
DataSet
Notes
These parameters filter your data set to only include peptides that match a given attribute. For example:
>>> data.filter(mod='Y', p=0.01, fold=2)
This function accepts filters both through the filters argument and through Python keyword arguments. The following three calls are equivalent:
>>> data.filter(p=0.01)
>>> data.filter([{'p': 0.01}])
>>> data.filter({'p': 0.01})
Filter parameters can be any of the following:
Name             Description
series           Use a pandas series (data.psms[series]).
fn               Use data.psms.apply(fn).
group_a          Calculate p / fold change values from group_a.
group_b          Calculate p / fold change values from group_b.
ambiguous        Include peptides with ambiguous PTMs if true, filter them out if false.
confidence       Discoverer’s peptide confidence (High|Medium|Low).
ion_score        MASCOT’s ion score.
isolation        Discoverer’s isolation interference.
missed_cleavage  Missed cleavages <= cutoff.
median_quant     Median quantification signal >= cutoff.
median_cv        Median coefficient of variation <= cutoff.
p                p-value < cutoff.
q                q-value < cutoff.
asym_fold        Change > cutoff if cutoff > 1 else Change < cutoff.
fold             Change > cutoff or Change < 1 / cutoff.
motif            Filter for motif.
protein          Filter for protein or list of proteins.
accession        Filter for UniProt accession or list of UniProt accessions.
sequence         Filter for sequence or list of sequences.
mod              Filter for modifications.
only_validated   Use rows validated by CAMV.
any              Use rows that match any filter.
inverse          Use all rows that are rejected by a filter.
rename           Change the new data set’s name to a new value.
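As a rough illustration of the fold and p cutoffs listed above, the following standalone sketch applies the same comparisons to plain dictionaries. The helper names passes_fold and passes_p and the sample rows are hypothetical and are not part of pyproteome; the sketch also assumes that multiple filters are combined with a logical AND unless the any option is used.

```python
def passes_fold(change, cutoff):
    """Mirror the documented `fold` rule: Change > cutoff or Change < 1 / cutoff."""
    return change > cutoff or change < 1 / cutoff


def passes_p(p_value, cutoff):
    """Mirror the documented `p` rule: p-value < cutoff."""
    return p_value < cutoff


# Hypothetical peptide rows with precomputed fold change and p-value.
rows = [
    {"peptide": "A", "Fold Change": 2.5, "p-value": 0.004},
    {"peptide": "B", "Fold Change": 0.4, "p-value": 0.030},
    {"peptide": "C", "Fold Change": 1.1, "p-value": 0.001},
    {"peptide": "D", "Fold Change": 0.3, "p-value": 0.002},
]

# Roughly what data.filter(p=0.01, fold=2) is described as selecting.
kept = [
    row["peptide"]
    for row in rows
    if passes_fold(row["Fold Change"], 2) and passes_p(row["p-value"], 0.01)
]
print(kept)
```

Peptides A and D pass both cutoffs; B fails the p-value cutoff and C fails the fold change cutoff.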
-
fix_channel_names
()[source]¶ Correct quantification channel names to those present in the search file.
For example, from 130_C to 130C (or vice versa).
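The two naming styles can be converted with a small regular expression. This is only a sketch of the renaming idea; the helper names are hypothetical and the pattern assumes channel names made of digits followed by an optional N or C suffix, which may not cover every label format that pyproteome handles.

```python
import re


def strip_channel_underscore(name):
    """Convert a name such as '130_C' to '130C'; other names are unchanged."""
    return re.sub(r"^(\d+)_([NC])$", r"\1\2", name)


def add_channel_underscore(name):
    """Convert a name such as '130C' to '130_C'; other names are unchanged."""
    return re.sub(r"^(\d+)([NC])$", r"\1_\2", name)


print(strip_channel_underscore("130_C"))
print(add_channel_underscore("130C"))
```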
-
genes
¶ Get all UniProt gene names occurring in this data set.
Returns: - list of str
-
get_data
(groups=None, mods=None, short_name=False)[source]¶ Get the quantification data for all samples and peptides in a data set, with more customizable options.
Parameters: - groups : list of str, optional
Select samples from a given list of groups, otherwise select all samples.
- mods : str or list of str, optional
Option passed to modification.Modification.__str__().
- short_name : bool, optional
Use the short abbreviation of a gene name (using pyproteome.utils.get_name()). Otherwise use the long version.
Returns: - df :
pandas.DataFrame
-
get_groups
(group_a=None, group_b=None)[source]¶ Get channels associated with two groups.
Parameters: - group_a : str or list of str, optional
- group_b : str or list of str, optional
Returns: - samples : list of str
- labels : list of str
- groups : tuple of (str or list of str)
-
get_samples
(groups=None)[source]¶ Get a list of sample names in this data set.
Parameters: - groups : list of (list of str), optional
Returns: - list of str
-
intensity_data
¶ Get the quantification data for all samples and peptides in a data set.
Parameters: - norm_cmp : bool, optional
Returns: - df :
pandas.DataFrame
-
inter_normalize
(norm_channels=None, other=None, inplace=False)[source]¶ Normalize runs to one channel for inter-run comparisons.
Parameters: - other :
DataSet
, optional Second data set to normalize quantification values against, using common normalization channels.
- norm_channels : list of str, optional
Normalization channels to use for cross-run normalization.
- inplace : bool, optional
Modify this data set in place.
Returns: - ds :
DataSet
- other :
-
log_stats
()[source]¶ Log statistics information about peptides contained in the data set. This information includes total numbers, phospho-specificity, modification ambiguity, completeness of labeling, and missed cleavage counts.
-
merge_duplicates
(inplace=False)[source]¶ Merge together all duplicate peptides. New quantification values are calculated from a weighted sum of each channel’s values.
Parameters: - inplace : bool, optional
Modify the data set in place, otherwise create a copy and return the new object.
Returns: - ds :
DataSet
-
merge_subsequences
(inplace=False)[source]¶ Merges peptides that are a subsequence of another peptide (e.g. SVYTEIKIHK + SVYTEIK -> SVYTEIK).
Only merges peptides that contain the same set of modifications and that map to the same protein(s).
Parameters: - inplace : bool, optional
Returns: - ds :
DataSet
-
norm_cmp_groups
(cmp_groups, ctrl_groups=None, inplace=False)[source]¶ Normalize between groups in a list. This can be used to compare data sets that have comparable control groups.
Channels within each list of groups are normalized to the mean of the group’s channels.
Parameters: - cmp_groups : list of list of str
List of groups to be normalized to each other, e.g. [(‘CK Hip’, ‘CK-p25 Hip’), (‘CK Cortex’, ‘CK-p25 Cortex’)]
- ctrl_groups : list of str, optional
List of groups to use for baseline normalization. If not set, the first group from each comparison will be used.
- inplace : bool, optional
Modify the data set in place, otherwise create a copy and return the new object.
Returns: - ds :
DataSet
Examples
>>> channels = ['a', 'b', 'c', 'd']
>>> groups = {i: [i] for i in channels}
>>> ds = data_sets.DataSet(channels=channels, groups=groups)
>>> ds.add_peptide({'a': 1000, 'b': 500, 'c': 100, 'd': 25})
>>> ds = ds.norm_cmp_groups([['a', 'b'], ['c', 'd']])
>>> ds.data
{'a': 1, 'b': 0.5, 'c': 1, 'd': .25}
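The arithmetic behind this example can be reproduced without pyproteome. The sketch below assumes that every channel in a comparison is divided by the mean of the control group's channels, with the first group in each comparison acting as the control when ctrl_groups is not given. The function norm_to_control is a hypothetical helper, not the library's implementation.

```python
from statistics import mean


def norm_to_control(values, groups, cmp_groups):
    """Divide each channel by the mean of its comparison's control channels.

    values: channel name -> quantification value
    groups: group name -> list of channel names
    cmp_groups: list of comparisons, each a list of group names
    """
    normalized = {}
    for comparison in cmp_groups:
        # Assumption: the first group in each comparison is the control.
        ctrl_channels = groups[comparison[0]]
        baseline = mean(values[ch] for ch in ctrl_channels)
        for group in comparison:
            for ch in groups[group]:
                normalized[ch] = values[ch] / baseline
    return normalized


values = {"a": 1000, "b": 500, "c": 100, "d": 25}
groups = {ch: [ch] for ch in values}
result = norm_to_control(values, groups, [["a", "b"], ["c", "d"]])
print(result)
```

Under these assumptions the result matches the values shown in the doctest: 'a' and 'c' become 1, 'b' becomes 0.5, and 'd' becomes 0.25.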
-
normalize
(lvls, inplace=False)[source]¶ Normalize channels to given levels for intra-run comparisons.
Divides all channel values by a given level.
Parameters: - lvls : dict of str, float
Mapping of channel names to normalized levels. All quantification values for each channel are divided by the normalization factor.
- inplace : bool, optional
Modify this data set in place.
Returns: - ds :
DataSet
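A minimal sketch of the division described above, using plain dictionaries rather than a DataSet. The helper normalize_channels and the sample values are hypothetical; the real method operates on the psms DataFrame.

```python
def normalize_channels(rows, lvls):
    """Divide each channel's value by that channel's normalization level."""
    return [{ch: row[ch] / lvls[ch] for ch in lvls} for row in rows]


# Two hypothetical peptides quantified in two channels.
rows = [
    {"126": 1000.0, "127": 2400.0},
    {"126": 500.0, "127": 600.0},
]
# Channel '127' was loaded at twice the level of channel '126'.
lvls = {"126": 1.0, "127": 2.0}

normalized = normalize_channels(rows, lvls)
print(normalized)
```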
-
phosphosites
¶ Get a list of all unique phosphosites identified in a data set.
Returns: - list of str
Examples
>>> ds.phosphosites
['Siglec5 pY561', 'Stat3 pY705', 'Inpp5d pY868']
-
rename_channels
(inplace=False)[source]¶ Rename all column names for quantification channels to sample names (e.g. ‘126’ => ‘Mouse #1’).
Parameters: - inplace : bool, optional
Returns: - ds :
DataSet
-
samples
¶ Get a list of sample names in this data set.
Returns: - list of str
-
shape
¶ Get the size of a data set in (rows, columns) format.
Returns: - shape : tuple of (int, int)
-
update_group_changes
(group_a=None, group_b=None)[source]¶ Update a DataSet’s fold change and p-value for each peptide using the given two-group comparison.
Values are calculated based on changes between group_a and group_b. p-values are calculated as a 2-sample t-test.
Parameters: - psms :
pandas.DataFrame
- group_a : str or list of str, optional
Single or multiple groups to use for fold change numerator.
- group_b : str or list of str, optional
Single or multiple groups to use for fold change denominator.
-
pyproteome.data_sets.
load_all_data
(chan_mapping=None, group_mapping=None, loaded_fn=None, norm_mapping=None, merge_mapping=None, merged_fn=None, kw_mapping=None, merge_only=True, replace_norm=True, **kwargs)[source]¶ Load, normalize, and merge all data sets found in pyproteome.paths.MS_SEARCHED_DIR.
Parameters: - chan_mapping : dict, optional
- group_mapping : dict, optional
- loaded_fn : func, optional
- norm_mapping : dict, optional
- merge_mapping : dict, optional
- merged_fn : func, optional
- kw_mapping : dict of (str, dict)
- merge_only : bool, optional
- replace_norm : bool, optional
If true, only keep the normalized version of a data set. Otherwise return both normalized and unnormalized versions.
- kwargs : dict
Any extra arguments are passed directly to
DataSet
during initialization.
Returns: - datas : dict of str,
DataSet
Examples
This example demonstrates how to automatically load, filter, normalize, and merge together several data sets:
from collections import OrderedDict

from pyproteome import data_sets

ckh_channels = OrderedDict(
    [
        ('3130 CK Hip', '126'),
        ('3131 CK-p25 Hip', '127'),
        ('3145 CK-p25 Hip', '128'),
        ('3146 CK-p25 Hip', '129'),
        ('3148 CK Hip', '130'),
        ('3157 CK Hip', '131'),
    ]
)
ckx_channels = OrderedDict(
    [
        ('3130 CK Cortex', '126'),
        ('3131 CK-p25 Cortex', '127'),
        ('3145 CK-p25 Cortex', '128'),
        ('3146 CK-p25 Cortex', '129'),
        ('3148 CK Cortex', '130'),
        ('3157 CK Cortex', '131'),
    ]
)
ckp25_groups = OrderedDict(
    [
        (
            'CK',
            [
                '3130 CK Hip',
                '3148 CK Hip',
                '3157 CK Hip',
                '3130 CK Cortex',
                '3148 CK Cortex',
                '3157 CK Cortex',
            ],
        ),
        (
            'CK-p25',
            [
                '3131 CK-p25 Hip',
                '3145 CK-p25 Hip',
                '3146 CK-p25 Hip',
                '3131 CK-p25 Cortex',
                '3145 CK-p25 Cortex',
                '3146 CK-p25 Cortex',
            ],
        ),
    ]
)

# With search data located as follows:
#   Searched/
#       CK-H1-pY.msf
#       CK-H1-pST.msf
#       CK-H1-Global.msf
#       CK-X1-pY.msf
#       CK-X1-pST.msf
#       CK-X1-Global.msf

# Load each data set, normalized to its respective global proteome analysis:
datas = data_sets.load_all_data(
    chan_mapping={
        'CK-H': ckh_channels,
        'CK-X': ckx_channels,
    },
    # Normalize pY, pST, and Global runs to each sample's global data
    norm_mapping={
        'CK-H1': 'CK-H1-Global',
        'CK-X1': 'CK-X1-Global',
    },
    # Merge together normalized hippocampus and cortex runs
    merge_mapping={
        'CK Hip': ['CK-H1-pY', 'CK-H1-pST', 'CK-H1-Global'],
        'CK Cortex': ['CK-X1-pY', 'CK-X1-pST', 'CK-X1-Global'],
        'CK All': ['CK Hip', 'CK Cortex'],
    },
    groups=ckp25_groups,
)

# Alternatively, load each data set, using CONSTANd normalization:
data_sets.constand.DEFAULT_CONSTAND_COL = 'kde'
datas = data_sets.load_all_data(
    chan_mapping={
        'CK-H': ckh_channels,
        'CK-X': ckx_channels,
    },
    norm_mapping='constand',
    # Merge together normalized hippocampus and cortex runs
    merge_mapping={
        'CK Hip': ['CK-H1-pY', 'CK-H1-pST', 'CK-H1-Global'],
        'CK Cortex': ['CK-X1-pY', 'CK-X1-pST', 'CK-X1-Global'],
        'CK All': ['CK Hip', 'CK Cortex'],
    },
    groups=ckp25_groups,
)
-
pyproteome.data_sets.
norm_all_data
(datas, norm_mapping, replace_norm=True, inplace=True)[source]¶ Normalize all data sets.
Parameters: - datas : dict of (str,
DataSet
) - norm_mapping : dict of (str, str)
- replace_norm : bool, optional
- inplace : bool, optional
Modify datas object inplace.
Returns: - datas : dict of (str,
DataSet
) - mapped_names : dict of (str, str)
-
pyproteome.data_sets.
merge_all_data
(datas, merge_mapping, mapped_names=None, merged_fn=None, inplace=True)[source]¶ Merge together multiple data sets.
Parameters: - datas : dict of (str,
DataSet
) - merge_mapping : dict of (str, list of str)
- mapped_names : dict of (str, str), optional
- merged_fn : func, optional
- inplace : bool, optional
Modify datas object inplace.
Returns: - datas : dict of (str,
DataSet
)
-
pyproteome.data_sets.
merge_data
(data_sets, name=None, norm_channels=None, merge_duplicates=True)[source]¶ Merge a list of data sets together.
Parameters: - data_sets : list of
DataSet
- name : str, optional
- norm_channels : dict of (str, str)
- merge_duplicates : bool, optional
Returns: - ds :
DataSet
-
pyproteome.data_sets.
merge_proteins
(ds, inplace=False, fn=None)[source]¶ Merge together all peptides mapped to the same protein. Maintains the first available peptide and calculates the median quantification value for each protein across all of its peptides.
Parameters: - ds :
DataSet
- inplace : bool, optional
Returns: - ds :
DataSet
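The per-protein median described above can be sketched with the standard library. The helper median_by_protein, the dictionary layout, and the sample values are hypothetical; the real function operates on a DataSet and also keeps the first available peptide's metadata.

```python
from collections import defaultdict
from statistics import median


def median_by_protein(peptides, channels):
    """Collapse peptide rows into one row per protein using channel medians."""
    grouped = defaultdict(list)
    for pep in peptides:
        grouped[pep["protein"]].append(pep)

    merged = {}
    for protein, peps in grouped.items():
        merged[protein] = {ch: median(p[ch] for p in peps) for ch in channels}
    return merged


# Hypothetical peptide-level quantification values.
peptides = [
    {"protein": "Stat3", "126": 10.0, "127": 5.0},
    {"protein": "Stat3", "126": 20.0, "127": 7.0},
    {"protein": "Stat3", "126": 40.0, "127": 6.0},
    {"protein": "Siglec5", "126": 3.0, "127": 9.0},
]

protein_level = median_by_protein(peptides, ["126", "127"])
print(protein_level)
```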
-
pyproteome.data_sets.
update_correlation
(ds, corr, metric='spearman', min_periods=5)[source]¶ Update a table’s Fold-Change and p-value columns.
Values are calculated based on changes between group_a and group_b.
Parameters: - ds :
DataSet
- corr :
pandas.Series
- metric : str, optional
- min_periods : int, optional
Returns: - ds :
DataSet
-
class
pyproteome.data_sets.
Modification
(rel_pos=0, mod_type='', sequence=None, nterm=False, cterm=False)[source]¶ Bases:
object
Contains information for a single peptide modification.
Attributes: - rel_pos : int
The relative position of a modification in a peptide sequence (0-indexed).
- mod_type : str
A short name for this type of modification (e.g. ‘Phospho’, ‘Carbamidomethyl’, ‘Oxidation’, ‘TMT6’, ‘TMT10’).
- nterm : bool
Boolean indicator of whether this modification is applied to the peptide N-terminus.
- cterm : bool
Boolean indicator of whether this modification is applied to the peptide C-terminus.
-
abs_pos
¶ The absolute positions of this modification in the full sequence of each mapped protein (0-indexed).
Returns: - tuple of int
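The relationship between relative and absolute positions can be illustrated with plain strings. This sketch assumes the absolute position is the peptide's start offset within the protein plus the modification's rel_pos, both 0-indexed; the short protein sequence is a hypothetical fragment used only for illustration.

```python
# Hypothetical protein fragment containing the peptide of interest.
protein_sequence = "MKSVYTEIKIHKC"
peptide = "SVYTEIK"
mod_rel_pos = 2  # 0-indexed position of the modification within the peptide

# 0-indexed offset of the peptide within the protein.
peptide_start = protein_sequence.find(peptide)

# Absolute 0-indexed position of the modified residue within the protein.
abs_pos = peptide_start + mod_rel_pos

# One-letter code of the modified residue.
letter = peptide[mod_rel_pos]

print(peptide_start, abs_pos, letter)
```

A peptide that maps to several proteins would yield one such position per mapped protein, which is why the property returns a tuple.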
-
copy
()[source]¶ Creates a copy of a modification. Does not copy the underlying sequence object.
Returns: - mod :
Modification
-
display_mod_type
()[source]¶ Return the mod_type in an abbreviated form (e.g. ‘p’ for ‘Phospho’).
Returns: - abbrev : str
-
exact
¶ Indicates whether each peptide-protein mapping for this modification is an exact or partial match.
Returns: - exact : tuple of bool
-
letter
¶ This modification’s one-letter amino acid code (e.g. ‘Y’), or ‘N-term’ / ‘C-term’ for terminal modifications.
Returns: - letter : str
-
class
pyproteome.data_sets.
Modifications
(mods=None)[source]¶ Bases:
object
A list of modifications.
Wraps the Modification objects and provides several utility functions.
Attributes: - mods : list of
Modification
-
copy
()[source]¶ Creates a copy of a set of modifications. Does not copy the underlying sequence object.
Returns: - mods :
Modifications
-
get_mods
(letter_mod_types)[source]¶ Filter the list of modifications.
Only keeps modifications with a given letter, mod_type, or both.
Parameters: - letter_mod_types : list of tuple of str, str
Returns: - mods :
Modifications
Examples
>>> from pyproteome.sequence import Sequence
>>> from pyproteome.modification import Modification, Modifications
>>> s = Sequence(pep_seq='SVYTEIK')
>>> m = Modifications(
...     [
...         Modification(mod_type='TMT', nterm=True, sequence=s),
...         Modification(mod_type='Phospho', rel_pos=2, sequence=s),
...         Modification(mod_type='TMT', rel_pos=6, sequence=s),
...     ]
... )
>>> m.get_mods('TMT')
['TMT A0', 'TMT K6']
>>> m.get_mods('Phospho')
['pY2']
>>> m.get_mods('Y')
['pY2']
>>> m.get_mods('S')
[]
>>> m.get_mods([('Y', 'Phospho')])
['pY2']
>>> m.get_mods([('S', 'Phospho')])
[]
-
skip_labels
()[source]¶ Get modifications, skipping over any that are peptide labels.
Returns: - mods : list of
Modification
-
class
pyproteome.data_sets.
Protein
(accession=None, gene=None, description=None, full_sequence=None)[source]¶ Bases:
object
Contains information about a single protein.
Attributes: - accession : str
The UniProt accession (e.g. ‘P40763’).
- gene : str
The UniProt gene name (e.g. ‘STAT3’).
- description : str
A brief description of the protein (e.g. ‘Signal transducer and activator of transcription 3’).
- full_sequence : str
The full sequence of the protein.
-
class
pyproteome.data_sets.
Proteins
(proteins=None)[source]¶ Bases:
object
Wraps a list of proteins.
Attributes: - proteins : tuple of
Protein
List of proteins to which a peptide sequence is mapped.
-
accessions
¶ List of UniProt accessions for a group of proteins.
Returns: - tuple of str
-
descriptions
¶ List of protein descriptions for a group of proteins.
Returns: - tuple of str
-
genes
¶ List of UniProt gene names for a group of proteins.
Returns: - tuple of str
-
class
pyproteome.data_sets.
Sequence
(pep_seq='', protein_matches=None, modifications=None)[source]¶ Bases:
object
Contains information about a sequence and which proteins it matches to.
Attributes: - pep_seq : str
Peptide sequence, in 1-letter amino acid code.
- protein_matches : list of
ProteinMatch
Object mapping all proteins that a peptide sequence matches.
- modifications :
modification.Modifications
Object listing all post-translational modifications identified on a peptide.
-
is_labeled
¶ Checks whether a sequence is modified on any residue with a quantification label.
Returns: - is_labeled : bool
-
is_underlabeled
¶ Checks whether a sequence is modified with quantification labels on fewer than all expected residues.
Returns: - is_underlabeled : bool
-
class
pyproteome.data_sets.
ProteinMatch
(protein, rel_pos, exact)[source]¶ Bases:
object
Contains information about how a peptide sequence maps onto a protein.
Attributes: - protein :
protein.Protein
Protein object.
- rel_pos : int
Relative position of the peptide start within the protein sequence.
- exact : bool
Indicates whether a peptide sequence exactly matches its protein sequence.
-
pyproteome.data_sets.
extract_sequence
(proteins, sequence_string)[source]¶ Extract a Sequence object from a list of proteins and sequence string.
Does not set the Sequence.modifications attribute.
Parameters: - proteins : list of
protein.Protein
- sequence_string : str
Returns: - seqs : list of
Sequence
pyproteome.data_sets.constand module¶
This module provides functionality for manipulating proteomics data sets.
Functionality includes merging data sets and interfacing with attributes in a structured format.
-
pyproteome.data_sets.constand.
CONSTAND_METHODS
= {'kde': <function <lambda>>, 'mean': numpy.nanmean, 'median': numpy.nanmedian}¶ Methods to estimate the center of a data set’s rows or columns.
One of [‘mean’, ‘median’, ‘kde’]. ‘mean’ applies numpy.nanmean(), ‘median’ applies numpy.nanmedian(), and ‘kde’ applies levels.kde_max().
-
pyproteome.data_sets.constand.
constand
(ds, inplace=False, n_iters=25, tol=1e-05, row_method=None, col_method=None)[source]¶ Normalize channels for intra-run comparisons. Iteratively fits the matrix of quantification values such that each row and column are centered around a calculated value. Uses row means and column median values for centering by default. See constand.CONSTAND_METHODS for other options.
Parameters: - ds :
data_sets.DataSet
Data set to apply CONSTANd normalization on.
- inplace : bool, optional
Modify this data set in place.
- n_iters : int, optional
Max number of normalization iterations. Rows are normalized on the odd step and columns are normalized on the even step.
- tol : float, optional
Minimum error tolerance to use to end iterations early.
- row_method : str, optional
Row normalization method to use. Default value is ‘mean’.
- col_method : str, optional
Column normalization method to use. Default value is ‘median’.
Returns: - ds :
data_sets.DataSet
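The alternating row and column centering described above can be sketched in pure Python. This is an illustrative sketch only, not the library's implementation: the function constand_sketch is hypothetical, it assumes rows and columns are rescaled so that their centers approach 1, and it ignores missing values, which the NaN-aware numpy functions listed in CONSTAND_METHODS would handle.

```python
from statistics import mean, median


def constand_sketch(matrix, n_iters=25, tol=1e-5, row_center=mean, col_center=median):
    """Alternately rescale rows and columns so that their centers approach 1."""
    data = [list(row) for row in matrix]
    for step in range(1, n_iters + 1):
        if step % 2 == 1:
            # Odd step: divide every row by its center.
            data = [[v / row_center(row) for v in row] for row in data]
            # Error: how far the column centers still are from 1.
            error = max(abs(col_center(col) - 1) for col in zip(*data))
        else:
            # Even step: divide every column by its center.
            centers = [col_center(col) for col in zip(*data)]
            data = [[v / c for v, c in zip(row, centers)] for row in data]
            # Error: how far the row centers still are from 1.
            error = max(abs(row_center(row) - 1) for row in data)
        if error < tol:
            # Stop early once both rows and columns are centered within tol.
            break
    return data


# Hypothetical quantification matrix: rows are peptides, columns are channels.
quant = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 7.0],
    [2.0, 9.0, 4.0],
]
balanced = constand_sketch(
    quant, n_iters=200, tol=1e-9, row_center=mean, col_center=mean
)
for row in balanced:
    print([round(v, 4) for v in row])
```

The usage example centers both rows and columns on their means because that variant is known to converge for strictly positive matrices; the defaults mirror the documented row mean and column median combination.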
pyproteome.data_sets.data_set module¶
This module provides functionality for manipulating proteomics data sets.
Functionality includes merging data sets and interfacing with attributes in a structured format.
-
pyproteome.data_sets.data_set.
DATA_SET_COLS
= ['Proteins', 'Sequence', 'Modifications', 'Validated', 'Confidence Level', 'Ion Score', 'q-value', 'Isolation Interference', 'Missed Cleavages', 'Ambiguous', 'Charges', 'Masses', 'RTs', 'Intensities', 'Raw Paths', 'Scan Paths', 'Scan', 'Fold Change', 'p-value']¶ Columns available in DataSet.psms.
Note that this does not include columns for quantification or weights.
-
pyproteome.data_sets.data_set.
DEFAULT_FILTER_BAD
= {'ion_score': 15, 'isolation': 30, 'median_quant': 1500.0, 'q': 0.01}¶ Default parameters for filtering data sets.
Selects all ions with an ion score > 15, isolation interference < 30, median quantification signal > 1.5e3, and optional false-discovery q-value < 0.01.
-
class
pyproteome.data_sets.data_set.
DataSet
(name='', psms=None, search_name=None, channels=None, groups=None, cmp_groups=None, fix_channel_names=True, dropna=False, pick_best_psm=True, constand_norm=False, merge_duplicates=True, filter_bad=True, check_raw=True, skip_load=False, skip_logging=False)[source]¶ Bases:
object
Class that encompasses a proteomics data set. Data sets can be initialized by calling this class’s constructor directly, or by using load_all_data().
Includes peptide-spectrum matches, quantification info, and mappings between channels, samples, and sample groups.
Data sets are automatically loaded, filtered, and merged by default. See DEFAULT_FILTER_BAD for default filtering parameters. See DataSet.merge_duplicates() for info on how multiple peptide-spectrum matches are integrated together.
Attributes: - search_name : str, optional
Name of the search file this data set was loaded from.
- psms :
pandas.DataFrame
Contains at least ‘Proteins’, ‘Sequence’, and ‘Modifications’ columns as well as any quantification data.
- channels : dict of str, str
Maps label channel to sample name.
- groups : dict of str, list of str
Maps groups to list of sample names. The primary group is considered as the first in this sequence.
- cmp_groups : list of list of str
List of groups that are being compared.
- name : str
Name of this data set.
- levels : dict of str, float
Peptide levels used for normalization.
- intra_normalized : bool
Indicates if the data set has been normalized within a TMT-plex analysis.
- inter_normalized : bool
Indicates if the data set has been normalized for comparison across TMT-plex analyses.
- sets : int
Number of sets merged into this data set.
-
accessions
¶ Get all UniProt accessions occurring in this data set.
Returns: - list of str
Examples
>>> ds.accessions
['P42227', 'Q920G3', 'Q9ES52']
-
add_peptide
(inserts)[source]¶ Manually add a peptide or list of peptides to a data set.
Parameters: - inserts : dict or list of dict
Examples
This example demonstrates how to insert a peptide that was manually validated by the user:
from pyproteome import data_sets

prots = data_sets.protein.Proteins(
    proteins=(
        data_sets.protein.Protein(
            accession='Q920G3',
            gene='Siglec5',
            description='Sialic acid-binding Ig-like lectin 5',
            full_sequence=(
                'MRWAWLLPLLWAGCLATDGYSLSVTGSVTVQEGLCVFVACQVQYPNSKGPVFGYWFREGA'
                'NIFSGSPVATNDPQRSVLKEAQGRFYLMGKENSHNCSLDIRDAQKIDTGTYFFRLDGSVK'
                'YSFQKSMLSVLVIALTEVPNIQVTSTLVSGNSTKLLCSVPWACEQGTPPIFSWMSSALTS'
                'LGHRTTLSSELNLTPRPQDNGTNLTCQVNLPGTGVTVERTQQLSVIYAPQKMTIRVSWGD'
                'DTGTKVLQSGASLQIQEGESLSLVCMADSNPPAVLSWERPTQKPFQLSTPAELQLPRAEL'
                'EDQGKYICQAQNSQGAQTASVSLSIRSLLQLLGPSCSFEGQGLHCSCSSRAWPAPSLRWR'
                'LGEGVLEGNSSNGSFTVKSSSAGQWANSSLILSMEFSSNHRLSCEAWSDNRVQRATILLV'
                'SGPKVSQAGKSETSRGTVLGAIWGAGLMALLAVCLCLIFFTVKVLRKKSALKVAATKGNH'
                'LAKNPASTINSASITSSNIALGYPIQGHLNEPGSQTQKEQPPLATVPDTQKDEPELHYAS'
                'LSFQGPMPPKPQNTEAMKSVYTEIKIHKC'
            ),
        ),
    ),
)

# The lowercase 'y' marks the phosphorylated tyrosine.
seq = data_sets.sequence.extract_sequence(prots, 'SVyTEIK')

mods = data_sets.modification.Modifications(
    mods=[
        data_sets.modification.Modification(
            rel_pos=0,
            mod_type='TMT10plex',
            nterm=True,
            sequence=seq,
        ),
        data_sets.modification.Modification(
            rel_pos=2,
            mod_type='Phospho',
            sequence=seq,
        ),
        data_sets.modification.Modification(
            rel_pos=6,
            mod_type='TMT10plex',
            sequence=seq,
        ),
    ],
)
seq.modifications = mods

ckh_sigf1_py_insert = {
    'Proteins': prots,
    'Sequence': seq,
    'Modifications': mods,
    '126': 1.46e4,
    '127N': 2.18e4,
    '127C': 1.88e4,
    '128N': 4.66e3,
    '128C': 6.70e3,
    '129N': 7.88e3,
    '129C': 1.03e4,
    '130N': 7.28e3,
    '130C': 2.98e3,
    '131': 6.01e3,
    'Validated': True,
    'First Scan': {23074},
    'Raw Paths': {'2019-04-24-CKp25-SiglecF-1-py-SpinCol-col189.raw'},
    'Scan Paths': {'CK-7wk-H1-pY'},
    'IonScore': 30,
    'Isolation Interference': 0,
}

# ds is an existing DataSet.
ds.add_peptide([ckh_sigf1_py_insert])
-
check_raw
()[source]¶ Checks that all raw files referenced in search data can be found in pyproteome.paths.MS_RAW_DIR.
Returns: - found_all : bool
-
data
¶ Get the quantification data for all samples and peptides in a data set.
Returns: - df :
pandas.DataFrame
-
dropna
(columns=None, how=None, thresh=None, groups=None, inplace=False)[source]¶ Drop any rows with NaN values.
Parameters: - columns : list of str, optional
- how : str, optional
- groups : list of str, optional
Only drop rows with NaN in columns within groups.
- inplace : bool, optional
Returns: - ds :
DataSet
-
filter
(filters=None, inplace=False, **kwargs)[source]¶ Filters a data set.
Parameters: - filters : list of dict or dict, optional
List of filters to apply to data set. Filters are also pulled from kwargs (see below).
- inplace : bool, optional
Perform the filter on self, otherwise create a copy and return the new object.
Returns: - ds :
DataSet
Notes
These parameters filter your data set to only include peptides that match a given attribute. For example:
>>> data.filter(mod='Y', p=0.01, fold=2)
This function accepts filters both through the filters argument and through Python keyword arguments. The following three calls are equivalent:
>>> data.filter(p=0.01)
>>> data.filter([{'p': 0.01}])
>>> data.filter({'p': 0.01})
Filter parameters can be any of the following:
Name             Description
series           Use a pandas series (data.psms[series]).
fn               Use data.psms.apply(fn).
group_a          Calculate p / fold change values from group_a.
group_b          Calculate p / fold change values from group_b.
ambiguous        Include peptides with ambiguous PTMs if true, filter them out if false.
confidence       Discoverer’s peptide confidence (High|Medium|Low).
ion_score        MASCOT’s ion score.
isolation        Discoverer’s isolation interference.
missed_cleavage  Missed cleavages <= cutoff.
median_quant     Median quantification signal >= cutoff.
median_cv        Median coefficient of variation <= cutoff.
p                p-value < cutoff.
q                q-value < cutoff.
asym_fold        Change > cutoff if cutoff > 1 else Change < cutoff.
fold             Change > cutoff or Change < 1 / cutoff.
motif            Filter for motif.
protein          Filter for protein or list of proteins.
accession        Filter for UniProt accession or list of UniProt accessions.
sequence         Filter for sequence or list of sequences.
mod              Filter for modifications.
only_validated   Use rows validated by CAMV.
any              Use rows that match any filter.
inverse          Use all rows that are rejected by a filter.
rename           Change the new data set’s name to a new value.
-
fix_channel_names
()[source]¶ Correct quantification channel names to those present in the search file.
For example, from 130_C to 130C (or vice versa).
-
genes
¶ Get all UniProt gene names occurring in this data set.
Returns: - list of str
-
get_data
(groups=None, mods=None, short_name=False)[source]¶ Get the quantification data for all samples and peptides in a data set, with more customizable options.
Parameters: - groups : list of str, optional
Select samples from a given list of groups, otherwise select all samples.
- mods : str or list of str, optional
Option passed to modification.Modification.__str__().
- short_name : bool, optional
Use the short abbreviation of a gene name (using pyproteome.utils.get_name()). Otherwise use the long version.
Returns: - df :
pandas.DataFrame
-
get_groups
(group_a=None, group_b=None)[source]¶ Get channels associated with two groups.
Parameters: - group_a : str or list of str, optional
- group_b : str or list of str, optional
Returns: - samples : list of str
- labels : list of str
- groups : tuple of (str or list of str)
-
get_samples
(groups=None)[source]¶ Get a list of sample names in this data set.
Parameters: - groups : list of (list of str), optional
Returns: - list of str
-
intensity_data
¶ Get the quantification data for all samples and peptides in a data set.
Parameters: - norm_cmp : bool, optional
Returns: - df :
pandas.DataFrame
-
inter_normalize
(norm_channels=None, other=None, inplace=False)[source]¶ Normalize runs to one channel for inter-run comparisons.
Parameters: - other :
DataSet
, optional Second data set to normalize quantification values against, using common normalization channels.
- norm_channels : list of str, optional
Normalization channels to use for cross-run normalization.
- inplace : bool, optional
Modify this data set in place.
Returns: - ds :
DataSet
-
log_stats
()[source]¶ Log statistics information about peptides contained in the data set. This information includes total numbers, phospho-specificity, modification ambiguity, completeness of labeling, and missed cleavage counts.
-
merge_duplicates
(inplace=False)[source]¶ Merge together all duplicate peptides. New quantification values are calculated from a weighted sum of each channel’s values.
Parameters: - inplace : bool, optional
Modify the data set in place, otherwise create a copy and return the new object.
Returns: - ds :
DataSet
-
merge_subsequences
(inplace=False)[source]¶ Merges peptides that are a subsequence of another peptide. (i.e. SVYTEIKIHK + SVYTEIK -> SVYTEIK)
Only merges peptides that contain the same set of modifications and that map to the same protein(s).
Parameters: - inplace : bool, optional
Returns: - ds :
DataSet
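A minimal sketch of the subsequence idea, using plain strings and a single intensity per peptide. It ignores the modification and protein checks described above, and it assumes that merged intensities are simply summed, which the documentation does not state. The function shares the method's name only for readability.

```python
def merge_subsequences(peptides):
    """Collapse peptides onto the shortest peptide they contain.

    `peptides` maps sequence -> intensity (a stand-in for real
    quantification data).
    """
    merged = {}
    # Visit short sequences first so they become the merge targets.
    for seq in sorted(peptides, key=len):
        target = next((kept for kept in merged if kept in seq), seq)
        merged[target] = merged.get(target, 0.0) + peptides[seq]
    return merged


print(merge_subsequences({"SVYTEIKIHK": 10.0, "SVYTEIK": 5.0, "AAGK": 1.0}))
# {'AAGK': 1.0, 'SVYTEIK': 15.0}
```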
-
norm_cmp_groups
(cmp_groups, ctrl_groups=None, inplace=False)[source]¶ Normalize between groups in a list. This can be used to compare data sets that have comparable control groups.
Channels within each list of groups are normalized to the mean of the group’s channels.
Parameters: - cmp_groups : list of list of str
List of groups to be normalized to each other. i.e. [(‘CK Hip’, ‘CK-p25 Hip’), (‘CK Cortex’, ‘CK-p25 Cortex’)]
- ctrl_groups : list of str, optional
List of groups to use for baseline normalization. If not set, the first group from each comparison will be used.
- inplace : bool, optional
Modify the data set in place, otherwise create a copy and return the new object.
Returns: - ds :
DataSet
Examples
>>> channels = ['a', 'b', 'c', 'd']
>>> groups = {i: [i] for i in channels}
>>> ds = data_sets.DataSet(channels=channels, groups=groups)
>>> ds.add_peptide({'a': 1000, 'b': 500, 'c': 100, 'd': 25})
>>> ds = ds.norm_cmp_groups([['a', 'b'], ['c', 'd']])
>>> ds.data
{'a': 1, 'b': 0.5, 'c': 1, 'd': .25}
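The arithmetic behind this example can be reproduced without pyproteome. The sketch below assumes the default behaviour described for ctrl_groups (the first group of each comparison is the baseline) and uses a plain dictionary in place of a DataSet; it is an illustration, not the library's implementation.

```python
from statistics import mean


def norm_cmp_groups(values, cmp_groups, groups):
    """Divide each channel by the mean of its comparison's control group.

    The first group in each comparison is treated as the control.
    """
    normalized = dict(values)
    for comparison in cmp_groups:
        ctrl_mean = mean(values[chan] for chan in groups[comparison[0]])
        for group in comparison:
            for chan in groups[group]:
                normalized[chan] = values[chan] / ctrl_mean
    return normalized


channels = ["a", "b", "c", "d"]
groups = {name: [name] for name in channels}
values = {"a": 1000, "b": 500, "c": 100, "d": 25}
print(norm_cmp_groups(values, [["a", "b"], ["c", "d"]], groups))
# {'a': 1.0, 'b': 0.5, 'c': 1.0, 'd': 0.25}
```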
-
normalize
(lvls, inplace=False)[source]¶ Normalize channels to given levels for intra-run comparisons.
Divides all channel values by a given level.
Parameters: - lvls : dict of str, float
Mapping of channel names to normalized levels. All quantification values for each channel are divided by the normalization factor.
- inplace : bool, optional
Modify this data set in place.
Returns: - ds :
DataSet
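As a rough illustration of the division described above, the following sketch applies per-channel levels to one peptide's values. The function `normalize_values` and the numbers are made up for the example.

```python
def normalize_values(values, lvls):
    """Divide each channel's value by that channel's normalization level."""
    return {chan: val / lvls[chan] for chan, val in values.items()}


# Hypothetical quantification values and levels for two channels.
values = {"126": 200.0, "127": 330.0}
lvls = {"126": 2.0, "127": 0.5}
print(normalize_values(values, lvls))  # {'126': 100.0, '127': 660.0}
```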
-
phosphosites
¶ Get a list of all unique phosphosites identified in a data set.
Returns: - list of str
Examples
>>> ds.phosphosites
['Siglec5 pY561', 'Stat3 pY705', 'Inpp5d pY868']
-
rename_channels
(inplace=False)[source]¶ Rename all column names for quantification channels to sample names. (i.e. ‘126’ => ‘Mouse #1’).
Parameters: - inplace : bool, optional
Returns: - ds :
DataSet
-
samples
¶ Get a list of sample names in this data set.
Returns: - list of str
-
shape
¶ Get the size of a data set in (rows, columns) format.
Returns: - shape : tuple of (int, int)
-
update_group_changes
(group_a=None, group_b=None)[source]¶ Update a DataSet’s fold change and p-value for each peptide using the given two-group comparison.
Values are calculated based on changes between group_a and group_b. p-values are calculated using a two-sample t-test.
Parameters: - psms :
pandas.DataFrame
- group_a : str or list of str, optional
Single or multiple groups to use for fold change numerator.
- group_b : str or list of str, optional
Single or multiple groups to use for fold change denominator.
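For a single peptide, the two quantities can be sketched with the standard library alone. This example assumes the fold change is the ratio of group means (group_a over group_b, matching the numerator and denominator roles above) and computes a Welch-style t statistic; whether pyproteome uses the equal-variance or unequal-variance form is not stated here. Turning the statistic into a p-value needs the t distribution, for example via scipy.stats.ttest_ind.

```python
from math import sqrt
from statistics import mean, variance


def fold_change_and_t(group_a, group_b):
    """Return (fold change, Welch t statistic) for two lists of intensities."""
    fold_change = mean(group_a) / mean(group_b)
    # Welch's t statistic: difference of means over its standard error.
    std_err = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    t_stat = (mean(group_a) - mean(group_b)) / std_err
    return fold_change, t_stat


# Hypothetical intensities for one peptide in two groups of three samples.
fc, t = fold_change_and_t([4.0, 5.0, 6.0], [1.0, 2.0, 3.0])
print(fc, round(t, 3))  # 2.5 3.674
```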
-
pyproteome.data_sets.data_set.
load_all_data
(chan_mapping=None, group_mapping=None, loaded_fn=None, norm_mapping=None, merge_mapping=None, merged_fn=None, kw_mapping=None, merge_only=True, replace_norm=True, **kwargs)[source]¶ Load, normalize, and merge all data sets found in
pyproteome.paths.MS_SEARCHED_DIR
.Parameters: - chan_mapping : dict, optional
- group_mapping : dict, optional
- loaded_fn : func, optional
- norm_mapping : dict, optional
- merge_mapping : dict, optional
- merged_fn : func, optional
- kw_mapping : dict of (str, dict)
- merge_only : bool, optional
- replace_norm : bool, optional
If true, only keep the normalized version of a data set. Otherwise return both normalized and unnormalized version.
- kwargs : dict
Any extra arguments are passed directly to
DataSet
during initialization.
Returns: - datas : dict of str,
DataSet
Examples
This example demonstrates how to automatically load, filter, normalize, and merge together several data sets:
from collections import OrderedDict

from pyproteome import data_sets

ckh_channels = OrderedDict(
    [
        ('3130 CK Hip', '126'),
        ('3131 CK-p25 Hip', '127'),
        ('3145 CK-p25 Hip', '128'),
        ('3146 CK-p25 Hip', '129'),
        ('3148 CK Hip', '130'),
        ('3157 CK Hip', '131'),
    ]
)
ckx_channels = OrderedDict(
    [
        ('3130 CK Cortex', '126'),
        ('3131 CK-p25 Cortex', '127'),
        ('3145 CK-p25 Cortex', '128'),
        ('3146 CK-p25 Cortex', '129'),
        ('3148 CK Cortex', '130'),
        ('3157 CK Cortex', '131'),
    ]
)
ckp25_groups = OrderedDict(
    [
        (
            'CK',
            [
                '3130 CK Hip',
                '3148 CK Hip',
                '3157 CK Hip',
                '3130 CK Cortex',
                '3148 CK Cortex',
                '3157 CK Cortex',
            ],
        ),
        (
            'CK-p25',
            [
                '3131 CK-p25 Hip',
                '3145 CK-p25 Hip',
                '3146 CK-p25 Hip',
                '3131 CK-p25 Cortex',
                '3145 CK-p25 Cortex',
                '3146 CK-p25 Cortex',
            ],
        ),
    ]
)

# With search data located as follows:
#   Searched/
#       CK-H1-pY.msf
#       CK-H1-pST.msf
#       CK-H1-Global.msf
#       CK-X1-pY.msf
#       CK-X1-pST.msf
#       CK-X1-Global.msf

# Load each data set, normalized to its respective global proteome analysis:
datas = data_sets.load_all_data(
    chan_mapping={
        'CK-H': ckh_channels,
        'CK-X': ckx_channels,
    },
    # Normalize pY, pST, and Global runs to each sample's global data
    norm_mapping={
        'CK-H1': 'CK-H1-Global',
        'CK-X1': 'CK-X1-Global',
    },
    # Merge together normalized hippocampus and cortex runs
    merge_mapping={
        'CK Hip': ['CK-H1-pY', 'CK-H1-pST', 'CK-H1-Global'],
        'CK Cortex': ['CK-X1-pY', 'CK-X1-pST', 'CK-X1-Global'],
        'CK All': ['CK Hip', 'CK Cortex'],
    },
    groups=ckp25_groups,
)

# Alternatively, load each data set, using CONSTANd normalization:
data_sets.constand.DEFAULT_CONSTAND_COL = 'kde'
datas = data_sets.load_all_data(
    chan_mapping={
        'CK-H': ckh_channels,
        'CK-X': ckx_channels,
    },
    norm_mapping='constand',
    # Merge together normalized hippocampus and cortex runs
    merge_mapping={
        'CK Hip': ['CK-H1-pY', 'CK-H1-pST', 'CK-H1-Global'],
        'CK Cortex': ['CK-X1-pY', 'CK-X1-pST', 'CK-X1-Global'],
        'CK All': ['CK Hip', 'CK Cortex'],
    },
    groups=ckp25_groups,
)
-
pyproteome.data_sets.data_set.
merge_all_data
(datas, merge_mapping, mapped_names=None, merged_fn=None, inplace=True)[source]¶ Merge together multiple data sets.
Parameters: - datas : dict of (str,
DataSet
) - merge_mapping : dict of (str, list of str)
- mapped_names : dict of (str, str), optional
- merged_fn : func, optional
- inplace : bool, optional
Modify datas object inplace.
Returns: - datas : dict of (str,
DataSet
)
-
pyproteome.data_sets.data_set.
merge_data
(data_sets, name=None, norm_channels=None, merge_duplicates=True)[source]¶ Merge a list of data sets together.
Parameters: - data_sets : list of
DataSet
- name : str, optional
- norm_channels : dict of (str, str)
- merge_duplicates : bool, optional
Returns: - ds :
DataSet
-
pyproteome.data_sets.data_set.
merge_proteins
(ds, inplace=False, fn=None)[source]¶ Merge together all peptides mapped to the same protein. Maintains the first available peptide and calculates the median quantification value for each protein across all of its peptides.
Parameters: - ds :
DataSet
- inplace : bool, optional
Returns: - ds :
DataSet
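The protein-level merge described above (keep the first peptide, take the median across peptides) can be sketched on plain tuples. The row layout and the function `merge_by_protein` are hypothetical stand-ins for the DataSet's table, and a single intensity replaces the per-channel values.

```python
from statistics import median


def merge_by_protein(rows):
    """Collapse (protein, sequence, intensity) rows to one entry per protein.

    Keeps the first sequence seen for each protein and the median intensity.
    """
    merged = {}
    for protein, sequence, intensity in rows:
        entry = merged.setdefault(protein, {"sequence": sequence, "values": []})
        entry["values"].append(intensity)
    return {
        protein: (entry["sequence"], median(entry["values"]))
        for protein, entry in merged.items()
    }


# Hypothetical peptide rows; the sequences are placeholders.
rows = [
    ("Stat3", "PEPTIDEA", 1.0),
    ("Stat3", "PEPTIDEB", 3.0),
    ("Stat3", "PEPTIDEC", 10.0),
    ("Inpp5d", "PEPTIDED", 2.0),
]
print(merge_by_protein(rows))
# {'Stat3': ('PEPTIDEA', 3.0), 'Inpp5d': ('PEPTIDED', 2.0)}
```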
-
pyproteome.data_sets.data_set.
norm_all_data
(datas, norm_mapping, replace_norm=True, inplace=True)[source]¶ Normalize all data sets.
Parameters: - datas : dict of (str,
DataSet
) - norm_mapping : dict of (str, str)
- replace_norm : bool, optional
- inplace : bool, optional
Modify datas object inplace.
Returns: - datas : dict of (str,
DataSet
) - mapped_names : dict of (str, str)
-
pyproteome.data_sets.data_set.
update_correlation
(ds, corr, metric='spearman', min_periods=5)[source]¶ Update a table’s Fold Change and p-value columns.
Values are calculated based on changes between group_a and group_b.
Parameters: - ds :
DataSet
- corr :
pandas.Series
- metric : str, optional
- min_periods : int, optional
Returns: - ds :
DataSet
pyproteome.data_sets.modification module¶
This module provides functionality for post-translational modifications.
Wraps modifications in a structured class and allows filtering of modifications by amino acid and modification type.
-
pyproteome.data_sets.modification.
LABEL_NAME_TARGETS
= ('TMT', 'ITRAQ', 'plex')¶ Substrings used to identify and import novel label names from .msf files.
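A substring check of this kind might look like the sketch below. The helper `looks_like_label` is hypothetical, and the case-sensitive comparison is an assumption rather than documented behaviour.

```python
LABEL_NAME_TARGETS = ("TMT", "ITRAQ", "plex")


def looks_like_label(mod_name):
    """Return True if a modification name contains any label substring."""
    return any(target in mod_name for target in LABEL_NAME_TARGETS)


print(looks_like_label("TMT6plex"))  # True
print(looks_like_label("Phospho"))   # False
```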
-
pyproteome.data_sets.modification.
MERGE_UNDERLABELED
= True¶ Merge peptides that have saturated TMT labeling with peptides that are underlabeled.
-
class
pyproteome.data_sets.modification.
Modification
(rel_pos=0, mod_type='', sequence=None, nterm=False, cterm=False)[source]¶ Bases:
object
Contains information for a single peptide modification.
Attributes: - rel_pos : int
The relative position of a modification in a peptide sequence (0-indexed).
- mod_type : str
A short name for this type of modification (i.e. ‘Phospho’, ‘Carbamidomethyl’, ‘Oxidation’, ‘TMT6’, ‘TMT10’)
- nterm : bool
Boolean indicator of whether this modification is applied to the peptide N-terminus.
- cterm : bool
Boolean indicator of whether this modification is applied to the peptide C-terminus.
-
abs_pos
¶ The absolute positions of this modification in the full sequence of each mapped protein (0-indexed).
Returns: - tuple of int
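Assuming the absolute position is the peptide's start offset within each protein plus the modification's relative position (both 0-indexed), the calculation reduces to one addition per mapped protein. The function name and the offsets below are illustrative only.

```python
def abs_positions(mod_rel_pos, peptide_starts):
    """Offset a modification's peptide position by each protein start index."""
    return tuple(start + mod_rel_pos for start in peptide_starts)


# A peptide assumed to start at index 100 in one protein and 350 in another,
# carrying a modification on its third residue (rel_pos=2).
print(abs_positions(2, (100, 350)))  # (102, 352)
```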
-
copy
()[source]¶ Creates a copy of a modification. Does not copy the underlying sequence object.
Returns: - mod :
Modification
-
display_mod_type
()[source]¶ Return the mod_type in an abbreviated form (i.e. ‘p’ for ‘Phospho’)
Returns: - abbrev : str
-
exact
¶ Indicates whether each peptide-protein mapping for this modification is an exact or partial match.
Returns: - exact : tuple of bool
-
letter
¶ This modification’s one-letter amino acid code (i.e. ‘Y’), or ‘N-term’ / ‘C-term’ for terminal modifications.
Returns: - letter : str
-
class
pyproteome.data_sets.modification.
Modifications
(mods=None)[source]¶ Bases:
object
A list of modifications.
Wraps the Modification objects and provides several utility functions.
Attributes: - mods : list of
Modification
-
copy
()[source]¶ Creates a copy of a set of modifications. Does not copy the underlying sequence object.
Returns: - mods :
Modifications
-
get_mods
(letter_mod_types)[source]¶ Filter the list of modifications.
Only keeps modifications with a given letter, mod_type, or both.
Parameters: - letter_mod_types : list of tuple of str, str
Returns: - mods :
Modifications
Examples
>>> from pyproteome.data_sets.sequence import Sequence
>>> from pyproteome.data_sets.modification import Modification, Modifications
>>> s = Sequence(pep_seq='SVYTEIK')
>>> m = Modifications(
...     [
...         Modification(mod_type='TMT', nterm=True, sequence=s),
...         Modification(mod_type='Phospho', rel_pos=2, sequence=s),
...         Modification(mod_type='TMT', rel_pos=6, sequence=s),
...     ]
... )
>>> m.get_mods('TMT')
['TMT A0', 'TMT K6']
>>> m.get_mods('Phospho')
['pY2']
>>> m.get_mods('Y')
['pY2']
>>> m.get_mods('S')
[]
>>> m.get_mods([('Y', 'Phospho')])
['pY2']
>>> m.get_mods([('S', 'Phospho')])
[]
-
skip_labels
()[source]¶ Get modifications, skipping over any that are peptide labels.
Returns: - mods : list of
Modification
-
pyproteome.data_sets.modification.
allowed_mod_type
(mod, any_letter=None, any_mod=None, letter_mod=None)[source]¶ Check if a modification is of a given type.
Filters by letter, mod_type, or both.
Parameters: - mod :
Modification
- any_letter : set of str
- any_mod : set of str
- letter_mod : set of tuple of str, str
Returns: - is_type : bool
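The three filters can be read as "match by letter, by modification type, or by an exact (letter, type) pair". The sketch below implements that reading with a namedtuple standing in for Modification; `mod_is_allowed` is a hypothetical name, and the result when all filters are None (False here) is an assumption.

```python
from collections import namedtuple

# Stand-in for a Modification, exposing only the two fields used for filtering.
Mod = namedtuple("Mod", ["letter", "mod_type"])


def mod_is_allowed(mod, any_letter=None, any_mod=None, letter_mod=None):
    """Return True if `mod` matches any of the supplied filters."""
    return bool(
        (any_letter is not None and mod.letter in any_letter)
        or (any_mod is not None and mod.mod_type in any_mod)
        or (letter_mod is not None and (mod.letter, mod.mod_type) in letter_mod)
    )


py = Mod(letter="Y", mod_type="Phospho")
print(mod_is_allowed(py, any_letter={"Y"}))                # True
print(mod_is_allowed(py, any_mod={"TMT"}))                 # False
print(mod_is_allowed(py, letter_mod={("S", "Phospho")}))   # False
```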
pyproteome.data_sets.protein module¶
This module provides functionality for interfacing with protein data.
-
class
pyproteome.data_sets.protein.
Protein
(accession=None, gene=None, description=None, full_sequence=None)[source]¶ Bases:
object
Contains information about a single protein.
Attributes: - accession : str
The UniProt accession (i.e. ‘P40763’).
- gene : str
The UniProt gene name (i.e. ‘STAT3’).
- description : str
A brief description of the protein (i.e. ‘Signal transducer and activator of transcription 3’).
- full_sequence : str
The full sequence of the protein.
-
class
pyproteome.data_sets.protein.
Proteins
(proteins=None)[source]¶ Bases:
object
Wraps a list of proteins.
Attributes: - proteins : tuple of
Protein
List of proteins to which a peptide sequence is mapped.
-
accessions
¶ List of UniProt accessions for a group of proteins.
Returns: - tuple of str
-
descriptions
¶ List of protein descriptions for a group of proteins.
Returns: - tuple of str
-
genes
¶ List of UniProt gene names for a group of proteins.
Returns: - tuple of str
pyproteome.data_sets.sequence module¶
This module provides functionality for manipulating sequences.
-
class
pyproteome.data_sets.sequence.
ProteinMatch
(protein, rel_pos, exact)[source]¶ Bases:
object
Contains information about how a peptide sequence maps onto a protein.
Attributes: - protein :
protein.Protein
Protein object.
- rel_pos : int
Relative position of the peptide start within the protein sequence.
- exact : bool
Indicates whether a peptide sequence exactly matches its protein sequence.
-
class
pyproteome.data_sets.sequence.
Sequence
(pep_seq='', protein_matches=None, modifications=None)[source]¶ Bases:
object
Contains information about a sequence and which proteins it matches to.
Attributes: - pep_seq : str
Peptide sequence, in one-letter amino acid code.
- protein_matches : list of
ProteinMatch
Object mapping all proteins that a peptide sequence matches.
- modifications :
modification.Modifications
Object listing all post-translational modifications identified on a peptide.
-
is_labeled
¶ Checks whether a sequence is modified on any residue with a quantification label.
Returns: - is_labeled : bool
-
is_underlabeled
¶ Checks whether a sequence is modified with quantification labels on fewer than all expected residues.
Returns: - is_underlabeled : bool
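To make the two labeling checks concrete, the sketch below assumes an amine-reactive label such as TMT, which is expected on the peptide N-terminus and on every lysine (K). It takes "fewer than all expected residues" literally, so an entirely unlabeled peptide also counts as underlabeled. The function `label_status` and its arguments are hypothetical.

```python
def label_status(pep_seq, labeled_positions, nterm_labeled):
    """Return (is_labeled, is_underlabeled) for one peptide.

    `labeled_positions` holds 0-indexed residues carrying a label;
    `nterm_labeled` says whether the N-terminus is labeled.
    """
    expected = {i for i, aa in enumerate(pep_seq) if aa == "K"}
    found = set(labeled_positions) & expected
    is_labeled = nterm_labeled or bool(found)
    fully_labeled = nterm_labeled and found == expected
    return is_labeled, not fully_labeled


print(label_status("SVYTEIK", {6}, nterm_labeled=True))    # (True, False)
print(label_status("SVYTEIK", set(), nterm_labeled=True))  # (True, True)
```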
-
pyproteome.data_sets.sequence.
extract_sequence
(proteins, sequence_string)[source]¶ Extract a Sequence object from a list of proteins and a sequence string.
Does not set the Sequence.modifications attribute.
Parameters: - proteins : list of
protein.Protein
- sequence_string : str
Returns: - seqs : list of
Sequence
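The mapping step can be illustrated by locating a peptide string inside each protein's full sequence. This is a sketch under assumptions, not the library's code: the namedtuples stand in for Protein and ProteinMatch, the peptide is upper-cased on the assumption that search output may mark modified residues in lowercase, and a missing peptide is reported with rel_pos None and exact False.

```python
from collections import namedtuple

# Minimal stand-ins for protein.Protein and sequence.ProteinMatch.
Protein = namedtuple("Protein", ["accession", "full_sequence"])
Match = namedtuple("Match", ["accession", "rel_pos", "exact"])


def find_matches(proteins, sequence_string):
    """Locate a peptide within each protein's full sequence (0-indexed)."""
    pep_seq = sequence_string.upper()
    matches = []
    for protein in proteins:
        pos = protein.full_sequence.find(pep_seq)
        found = pos >= 0
        matches.append(Match(protein.accession, pos if found else None, found))
    return matches


# Hypothetical accessions and sequences.
proteins = [Protein("P1", "MAASVYTEIKIHK"), Protein("P2", "MKKLLP")]
for match in find_matches(proteins, "SVyTEIK"):
    print(match)
# Match(accession='P1', rel_pos=3, exact=True)
# Match(accession='P2', rel_pos=None, exact=False)
```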