pyproteome.data_sets package

class pyproteome.data_sets.DataSet(name='', psms=None, search_name=None, channels=None, groups=None, cmp_groups=None, fix_channel_names=True, dropna=False, pick_best_psm=True, constand_norm=False, merge_duplicates=True, filter_bad=True, check_raw=True, skip_load=False, skip_logging=False)[source]

Bases: object

Class that encompasses a proteomics data set. Data sets can be initialized by calling this class’s constructor directly, or using load_all_data().

Includes peptide-spectrum matches, quantification info, and mappings between channels, samples, and sample groups.

Data sets are automatically loaded, filtered, and merged by default. See DEFAULT_FILTER_BAD for default filtering parameters. See DataSet.merge_duplicates() for info on how multiple peptide-spectrum matches are integrated together.

Attributes:
search_name : str, optional

Name of the search file this data set was loaded from.

psms : pandas.DataFrame

Contains at least ‘Proteins’, ‘Sequence’, and ‘Modifications’ columns as well as any quantification data.

channels : dict of str, str

Maps label channel to sample name.

groups : dict of str, list of str

Maps group names to lists of sample names. The first group in this mapping is treated as the primary group.

cmp_groups : list of list of str

List of groups that are being compared.

name : str

Name of this data set.

levels : dict of str, float

Peptide levels used for normalization.

intra_normalized : bool

Indicates if the data set has been normalized within a TMT-plex analysis.

inter_normalized : bool

Indicates if the data set has been normalized for comparison across TMT-plex analyses.

sets : int

Number of sets merged into this data set.

accessions

Get all UniProt accessions occurring in this data set.

Returns:
list of str

Examples

>>> ds.accessions
['P42227', 'Q920G3', 'Q9ES52']
add_peptide(inserts)[source]

Manually add a peptide or list of peptides to a data set.

Parameters:
inserts : dict or list of dict

Examples

This example demonstrates how to manually insert a peptide that was validated by the user:

prots = data_sets.protein.Proteins(
    proteins=(
        data_sets.protein.Protein(
            accession='Q920G3',
            gene='Siglec5',
            description='Sialic acid-binding Ig-like lectin 5',
            full_sequence=(
                'MRWAWLLPLLWAGCLATDGYSLSVTGSVTVQEGLCVFVACQVQYPNSKGPVFGYWFREGA'
                'NIFSGSPVATNDPQRSVLKEAQGRFYLMGKENSHNCSLDIRDAQKIDTGTYFFRLDGSVK'
                'YSFQKSMLSVLVIALTEVPNIQVTSTLVSGNSTKLLCSVPWACEQGTPPIFSWMSSALTS'
                'LGHRTTLSSELNLTPRPQDNGTNLTCQVNLPGTGVTVERTQQLSVIYAPQKMTIRVSWGD'
                'DTGTKVLQSGASLQIQEGESLSLVCMADSNPPAVLSWERPTQKPFQLSTPAELQLPRAEL'
                'EDQGKYICQAQNSQGAQTASVSLSIRSLLQLLGPSCSFEGQGLHCSCSSRAWPAPSLRWR'
                'LGEGVLEGNSSNGSFTVKSSSAGQWANSSLILSMEFSSNHRLSCEAWSDNRVQRATILLV'
                'SGPKVSQAGKSETSRGTVLGAIWGAGLMALLAVCLCLIFFTVKVLRKKSALKVAATKGNH'
                'LAKNPASTINSASITSSNIALGYPIQGHLNEPGSQTQKEQPPLATVPDTQKDEPELHYAS'
                'LSFQGPMPPKPQNTEAMKSVYTEIKIHKC'
            ),
        ),
    ),
)
seq = data_sets.sequence.extract_sequence(prots, 'SVyTEIK')

mods = data_sets.modification.Modifications(
    mods=[
        data_sets.modification.Modification(
            rel_pos=0,
            mod_type='TMT10plex',
            nterm=True,
            sequence=seq,
        ),
        data_sets.modification.Modification(
            rel_pos=2,
            mod_type='Phospho',
            sequence=seq,
        ),
        data_sets.modification.Modification(
            rel_pos=6,
            mod_type='TMT10plex',
            sequence=seq,
        ),
    ],
)

seq.modifications = mods

ckh_sigf1_py_insert = {
    'Proteins': prots,
    'Sequence': seq,
    'Modifications': mods,
    '126':  1.46e4,
    '127N': 2.18e4,
    '127C': 1.88e4,
    '128N': 4.66e3,
    '128C': 6.70e3,
    '129N': 7.88e3,
    '129C': 1.03e4,
    '130N': 7.28e3,
    '130C': 2.98e3,
    '131':  6.01e3,
    'Validated': True,
    'First Scan': {23074},
    'Raw Paths': {'2019-04-24-CKp25-SiglecF-1-py-SpinCol-col189.raw'},
    'Scan Paths': {'CK-7wk-H1-pY'},
    'IonScore': 30,
    'Isolation Interference': 0,
}
ds.add_peptide([ckh_sigf1_py_insert])
check_raw()[source]

Checks that all raw files referenced in search data can be found in pyproteome.paths.MS_RAW_DIR.

Returns:
found_all : bool
copy()[source]

Make a copy of self.

Returns:
ds : DataSet
data

Get the quantification data for all samples and peptides in a data set.

Returns:
df : pandas.DataFrame
dropna(columns=None, how=None, thresh=None, groups=None, inplace=False)[source]

Drop any channels with NaN values.

Parameters:
columns : list of str, optional
how : str, optional
groups : list of str, optional

Only drop rows with NaN in columns within groups.

inplace : bool, optional
Returns:
ds : DataSet
filter(filters=None, inplace=False, **kwargs)[source]

Filters a data set.

Parameters:
filters : list of dict or dict, optional

List of filters to apply to data set. Filters are also pulled from kwargs (see below).

inplace : bool, optional

Perform the filter on self, otherwise create a copy and return the new object.

Returns:
ds : DataSet

Notes

These parameters filter your data set to only include peptides that match a given attribute. For example:

>>> data.filter(mod='Y', p=0.01, fold=2)

This function accepts filters through both the filters argument and Python keyword arguments. The following three calls are all equivalent:

>>> data.filter(p=0.01)
>>> data.filter([{'p': 0.01}])
>>> data.filter({'p': 0.01})

Filter parameters can be one of any below:

Name             Description
series           Use a pandas series (data.psms[series]).
fn               Use data.psms.apply(fn).
group_a          Calculate p / fold change values from group_a.
group_b          Calculate p / fold change values from group_b.
ambiguous        Include peptides with ambiguous PTMs if true, filter them out if false.
confidence       Discoverer’s peptide confidence (High|Medium|Low).
ion_score        MASCOT’s ion score.
isolation        Discoverer’s isolation interference.
missed_cleavage  Missed cleavages <= cutoff.
median_quant     Median quantification signal >= cutoff.
median_cv        Median coefficient of variation <= cutoff.
p                p-value < cutoff.
q                q-value < cutoff.
asym_fold        Change > cutoff if cutoff > 1 else Change < cutoff.
fold             Change > cutoff or Change < 1 / cutoff.
motif            Filter for motif.
protein          Filter for protein or list of proteins.
accession        Filter for UniProt accession or list of UniProt accessions.
sequence         Filter for sequence or list of sequences.
mod              Filter for modifications.
only_validated   Use rows validated by CAMV.
any              Use rows that match any filter.
inverse          Use all rows that are rejected by a filter.
rename           Change the new data set’s name to a new value.
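
As a rough illustration of the notes above, the following sketch shows one way the three equivalent call forms could be reduced to a single list of filter dicts, and how the fold cutoff reads. The helpers normalize_filters and passes_fold are hypothetical names used only for this example; they are not part of pyproteome, and the real method may handle its arguments differently.

```python
def normalize_filters(filters=None, **kwargs):
    """Collect filters given as a dict, a list of dicts, or keyword arguments."""
    if filters is None:
        filters = []
    elif isinstance(filters, dict):
        filters = [filters]
    else:
        filters = list(filters)
    # Assumption: keyword arguments are treated as one additional filter dict.
    if kwargs:
        filters = filters + [kwargs]
    return filters


def passes_fold(change, cutoff):
    """Keep a peptide that changes by more than `cutoff`-fold in either direction."""
    return change > cutoff or change < 1 / cutoff


# The three call forms from the notes reduce to the same filter list.
print(normalize_filters(p=0.01))
print(normalize_filters([{'p': 0.01}]))
print(normalize_filters({'p': 0.01}))

# With fold=2, a 2.5-fold increase and a 0.4-fold decrease pass; 1.5-fold does not.
print(passes_fold(2.5, 2), passes_fold(0.4, 2), passes_fold(1.5, 2))
```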
fix_channel_names()[source]

Correct quantification channel names to those present in the search file.

e.g. from 130_C to 130C (or vice versa).

genes

Get all UniProt gene names occurring in this data set.

Returns:
list of str
get_data(groups=None, mods=None, short_name=False)[source]

Get the quantification data for all samples and peptides in a data set, with more customizable options.

Parameters:
groups : list of str, optional

Select samples from a given list of groups, otherwise select all samples.

mods : str or list of str, optional

Option passed to modification.Modification.__str__().

short_name : bool, optional

Use the short abbreviation of a gene name (using pyproteome.utils.get_name()). Otherwise use the long version.

Returns:
df : pandas.DataFrame
get_groups(group_a=None, group_b=None)[source]

Get channels associated with two groups.

Parameters:
group_a : str or list of str, optional
group_b : str or list of str, optional
Returns:
samples : list of str
labels : list of str
groups : tuple of (str or list of str)
get_samples(groups=None)[source]

Get a list of sample names in this data set.

Parameters:
groups : list of (list of str), optional
Returns:
list of str
intensity_data

Get the quantification data for all samples and peptides in a data set.

Parameters:
norm_cmp : bool, optional
Returns:
df : pandas.DataFrame
inter_normalize(norm_channels=None, other=None, inplace=False)[source]

Normalize runs to one channel for inter-run comparisons.

Parameters:
other : DataSet, optional

Second data set to normalize quantification values against, using common normalization channels.

norm_channels : list of str, optional

Normalization channels to use for cross-run normalization.

inplace : bool, optional

Modify this data set in place.

Returns:
ds : DataSet
log_stats()[source]

Log statistics information about peptides contained in the data set. This information includes total numbers, phospho-specificity, modification ambiguity, completeness of labeling, and missed cleavage counts.

merge_duplicates(inplace=False)[source]

Merge together all duplicate peptides. New quantification values are calculated from a weighted sum of each channel’s values.

Parameters:
inplace : bool, optional

Modify the data set in place, otherwise create a copy and return the new object.

Returns:
ds : DataSet
merge_subsequences(inplace=False)[source]

Merges peptides that are a subsequence of another peptide (e.g. SVYTEIKIHK + SVYTEIK -> SVYTEIK).

Only merges peptides that contain the same set of modifications and that map to the same protein(s).

Parameters:
inplace : bool, optional
Returns:
ds : DataSet
norm_cmp_groups(cmp_groups, ctrl_groups=None, inplace=False)[source]

Normalize between groups in a list. This can be used to compare data sets that have comparable control groups.

Channels within each list of groups are normalized to the mean of the group’s channels.

Parameters:
cmp_groups : list of list of str

List of groups to be normalized to each other, e.g. [('CK Hip', 'CK-p25 Hip'), ('CK Cortex', 'CK-p25 Cortex')]

ctrl_groups : list of str, optional

List of groups to use for baseline normalization. If not set, the first group from each comparison will be used.

inplace : bool, optional

Modify the data set in place, otherwise create a copy and return the new object.

Returns:
ds : DataSet

Examples

>>> channels = ['a', 'b', 'c', 'd']
>>> groups = {i: [i] for i in channels}
>>> ds = data_sets.DataSet(channels=channels, groups=groups)
>>> ds.add_peptide({'a': 1000, 'b': 500, 'c': 100, 'd': 25})
>>> ds = ds.norm_cmp_groups([['a', 'b'], ['c', 'd']])
>>> ds.data
{'a': 1, 'b': 0.5, 'c': 1, 'd': .25}
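
The arithmetic behind the example above can be checked by hand. This is a minimal sketch in plain Python, assuming that the control group of each comparison is its first group (the documented default for ctrl_groups) and that every channel in a comparison is divided by the mean of the control group's channels. The function norm_to_control is a hypothetical helper, not a pyproteome function.

```python
from statistics import mean


def norm_to_control(values, groups, cmp_groups):
    """Divide each channel by the mean of its comparison's first (control) group."""
    normed = {}
    for comparison in cmp_groups:
        ctrl_channels = groups[comparison[0]]
        baseline = mean(values[chan] for chan in ctrl_channels)
        for group in comparison:
            for chan in groups[group]:
                normed[chan] = values[chan] / baseline
    return normed


channels = ['a', 'b', 'c', 'd']
groups = {i: [i] for i in channels}
values = {'a': 1000, 'b': 500, 'c': 100, 'd': 25}

print(norm_to_control(values, groups, [['a', 'b'], ['c', 'd']]))
# {'a': 1.0, 'b': 0.5, 'c': 1.0, 'd': 0.25}
```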
normalize(lvls, inplace=False)[source]

Normalize channels to given levels for intra-run comparisons.

Divides all channel values by a given level.

Parameters:
lvls : dict of str, float

Mapping of channel names to normalized levels. All quantification values for each channel are divided by the normalization factor.

inplace : bool, optional

Modify this data set in place.

Returns:
ds : DataSet
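
The division described above can be sketched on plain dictionaries. The helper normalize_channels and the channel levels below are made up for illustration; the real method operates on the DataSet's psms table.

```python
def normalize_channels(rows, lvls):
    """Divide every quantification value by its channel's normalization level."""
    return [
        {chan: val / lvls[chan] for chan, val in row.items()}
        for row in rows
    ]


# Two peptides quantified in two channels; channel '127' was loaded 1.25x higher.
rows = [
    {'126': 1000.0, '127': 2500.0},
    {'126': 500.0, '127': 750.0},
]
lvls = {'126': 1.0, '127': 1.25}

print(normalize_channels(rows, lvls))
# [{'126': 1000.0, '127': 2000.0}, {'126': 500.0, '127': 600.0}]
```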
phosphosites

Get a list of all unique phosphosites identified in a data set.

Returns:
list of str

Examples

>>> ds.phosphosites
['Siglec5 pY561', 'Stat3 pY705', 'Inpp5d pY868']
rename_channels(inplace=False)[source]

Rename all column names for quantification channels to sample names (e.g. ‘126’ => ‘Mouse #1’).

Parameters:
inplace : bool, optional
Returns:
ds : DataSet
samples

Get a list of sample names in this data set.

Returns:
list of str
shape

Get the size of a data set in (rows, columns) format.

Returns:
shape : tuple of (int, int)
update_group_changes(group_a=None, group_b=None)[source]

Update a DataSet’s Fold Change and p-value for each peptide using the given two-group comparison.

Values are calculated based on changes between group_a and group_b. p-values are calculated using a 2-sample t-test.

Parameters:
psms : pandas.DataFrame
group_a : str or list of str, optional

Single or multiple groups to use for fold change numerator.

group_b : str or list of str, optional

Single or multiple groups to use for fold change denominator.
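
The two quantities can be illustrated for a single peptide as follows. This sketch assumes the fold change is the ratio of the group means and that the t-test uses a pooled (equal) variance; the documentation above does not state either detail, so both are assumptions. Converting the statistic into a p-value needs the t distribution (for example via scipy.stats.ttest_ind) and is left out to keep the example dependency-free.

```python
from math import sqrt
from statistics import mean, variance


def fold_change(group_a, group_b):
    """Ratio of the mean signal in group_a (numerator) to group_b (denominator)."""
    return mean(group_a) / mean(group_b)


def t_statistic(group_a, group_b):
    """Student's 2-sample t statistic with pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))


# Quantification values for one peptide across three samples per group.
group_a = [2.0, 4.0, 6.0]
group_b = [1.0, 2.0, 3.0]

print(fold_change(group_a, group_b))
print(round(t_statistic(group_a, group_b), 3))
```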

pyproteome.data_sets.load_all_data(chan_mapping=None, group_mapping=None, loaded_fn=None, norm_mapping=None, merge_mapping=None, merged_fn=None, kw_mapping=None, merge_only=True, replace_norm=True, **kwargs)[source]

Load, normalize, and merge all data sets found in pyproteome.paths.MS_SEARCHED_DIR.

Parameters:
chan_mapping : dict, optional
group_mapping : dict, optional
loaded_fn : func, optional
norm_mapping : dict, optional
merge_mapping : dict, optional
merged_fn : func, optional
kw_mapping : dict of (str, dict)
merge_only : bool, optional
replace_norm : bool, optional

If true, only keep the normalized version of a data set. Otherwise return both normalized and unnormalized version.

kwargs : dict

Any extra arguments are passed directly to DataSet during initialization.

Returns:
datas : dict of str, DataSet

Examples

This example demonstrates how to automatically load, filter, normalize, and merge together several data sets:

from collections import OrderedDict

from pyproteome import data_sets

ckh_channels = OrderedDict(
    [
        ('3130 CK Hip',     '126'),
        ('3131 CK-p25 Hip', '127'),
        ('3145 CK-p25 Hip', '128'),
        ('3146 CK-p25 Hip', '129'),
        ('3148 CK Hip',     '130'),
        ('3157 CK Hip',     '131'),
    ]
)
ckx_channels = OrderedDict(
    [
        ('3130 CK Cortex',     '126'),
        ('3131 CK-p25 Cortex', '127'),
        ('3145 CK-p25 Cortex', '128'),
        ('3146 CK-p25 Cortex', '129'),
        ('3148 CK Cortex',     '130'),
        ('3157 CK Cortex',     '131'),
    ]
)
ckp25_groups = OrderedDict(
    [
        (
            'CK',
            [
                '3130 CK Hip',
                '3148 CK Hip',
                '3157 CK Hip',
                '3130 CK Cortex',
                '3148 CK Cortex',
                '3157 CK Cortex',
            ],
        ),
        (
            'CK-p25',
            [
                '3131 CK-p25 Hip',
                '3145 CK-p25 Hip',
                '3146 CK-p25 Hip',
                '3131 CK-p25 Cortex',
                '3145 CK-p25 Cortex',
                '3146 CK-p25 Cortex',
            ],
        ),
    ]
)
# With search data located as follows:
#   Searched/
#       CK-H1-pY.msf
#       CK-H1-pST.msf
#       CK-H1-Global.msf
#       CK-X1-pY.msf
#       CK-X1-pST.msf
#       CK-X1-Global.msf
# Load each data set, normalized to its respective global proteome analysis:
datas = data_sets.load_all_data(
    chan_mapping={
        'CK-H': ckh_channels,
        'CK-X': ckx_channels,
    },
    # Normalize pY, pST, and Global runs to each sample's global data
    norm_mapping={
        'CK-H1': 'CK-H1-Global',
        'CK-X1': 'CK-X1-Global',
    },
    # Merge together normalized hippocampus and cortex runs
    merge_mapping={
        'CK Hip': ['CK-H1-pY', 'CK-H1-pST', 'CK-H1-Global'],
        'CK Cortex': ['CK-X1-pY', 'CK-X1-pST', 'CK-X1-Global'],
        'CK All': ['CK Hip', 'CK Cortex'],
    },
    groups=ckp25_groups,
)

# Alternatively, load each data set, using CONSTANd normalization:
data_sets.constand.DEFAULT_CONSTAND_COL = 'kde'
datas = data_sets.load_all_data(
    chan_mapping={
        'CK-H': ckh_channels,
        'CK-X': ckx_channels,
    },
    norm_mapping='constand',
    # Merge together normalized hippocampus and cortex runs
    merge_mapping={
        'CK Hip': ['CK-H1-pY', 'CK-H1-pST', 'CK-H1-Global'],
        'CK Cortex': ['CK-X1-pY', 'CK-X1-pST', 'CK-X1-Global'],
        'CK All': ['CK Hip', 'CK Cortex'],
    },
    groups=ckp25_groups,
)
pyproteome.data_sets.norm_all_data(datas, norm_mapping, replace_norm=True, inplace=True)[source]

Normalize all data sets.

Parameters:
datas : dict of (str, DataSet)
norm_mapping : dict of (str, str)
replace_norm : bool, optional
inplace : bool, optional

Modify datas object inplace.

Returns:
datas : dict of (str, DataSet)
mapped_names : dict of (str, str)
pyproteome.data_sets.merge_all_data(datas, merge_mapping, mapped_names=None, merged_fn=None, inplace=True)[source]

Merge together multiple data sets.

Parameters:
datas : dict of (str, DataSet)
merge_mapping : dict of (str, list of str)
mapped_names : dict of (str, str), optional
merged_fn : func, optional
inplace : bool, optional

Modify datas object inplace.

Returns:
datas : dict of (str, DataSet)
pyproteome.data_sets.merge_data(data_sets, name=None, norm_channels=None, merge_duplicates=True)[source]

Merge a list of data sets together.

Parameters:
data_sets : list of DataSet
name : str, optional
norm_channels : dict of (str, str)
merge_duplicates : bool, optional
Returns:
ds : DataSet
pyproteome.data_sets.merge_proteins(ds, inplace=False, fn=None)[source]

Merge together all peptides mapped to the same protein. Maintains the first available peptide and calculates the median quantification value for each protein across all of its peptides.

Parameters:
ds : DataSet
inplace : bool, optional
Returns:
ds : DataSet
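
The median step can be sketched on plain dictionaries. The helper protein_medians and the 'protein' / 'quant' keys are hypothetical and chosen only for this example; the real function works on DataSet objects and also keeps the first available peptide for each protein.

```python
from collections import defaultdict
from statistics import median


def protein_medians(peptides):
    """Group peptide rows by protein and take the per-channel median."""
    by_protein = defaultdict(list)
    for pep in peptides:
        by_protein[pep['protein']].append(pep['quant'])
    return {
        protein: {
            chan: median(quant[chan] for quant in quants)
            for chan in quants[0]
        }
        for protein, quants in by_protein.items()
    }


peptides = [
    {'protein': 'Stat3', 'quant': {'126': 1.0, '127': 2.0}},
    {'protein': 'Stat3', 'quant': {'126': 3.0, '127': 4.0}},
    {'protein': 'Stat3', 'quant': {'126': 5.0, '127': 9.0}},
    {'protein': 'Inpp5d', 'quant': {'126': 2.0, '127': 2.0}},
]

print(protein_medians(peptides))
# {'Stat3': {'126': 3.0, '127': 4.0}, 'Inpp5d': {'126': 2.0, '127': 2.0}}
```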
pyproteome.data_sets.update_correlation(ds, corr, metric='spearman', min_periods=5)[source]

Update a table’s Fold-Change and p-value columns.

Values are calculated based on changes between group_a and group_b.

Parameters:
ds : DataSet
corr : pandas.Series
metric : str, optional
min_periods : int, optional
Returns:
ds : DataSet
class pyproteome.data_sets.Modification(rel_pos=0, mod_type='', sequence=None, nterm=False, cterm=False)[source]

Bases: object

Contains information for a single peptide modification.

Attributes:
rel_pos : int

The relative position of a modification in a peptide sequence (0-indexed).

mod_type : str

A short name for this type of modification (e.g. ‘Phospho’, ‘Carbamidomethyl’, ‘Oxidation’, ‘TMT6’, ‘TMT10’).

nterm : bool

Boolean indicator of whether this modification is applied to the peptide N-terminus.

cterm : bool

Boolean indicator of whether this modification is applied to the peptide C-terminus.

abs_pos

The absolute positions of this modification in the full sequence of each mapped protein (0-indexed).

Returns:
tuple of int
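
To make the 0-indexed positions concrete, the sketch below assumes an absolute position is the peptide's start within the protein plus the modification's rel_pos. That relationship is an assumption for illustration, and the short protein string is a toy stand-in (the C-terminal end of the Siglec5 sequence shown in the add_peptide example), not a full protein.

```python
# Toy stand-in for a protein's full sequence.
protein_seq = 'LSFQGPMPPKPQNTEAMKSVYTEIKIHKC'
peptide = 'SVYTEIK'

# Start of the peptide within the protein (0-indexed).
match_pos = protein_seq.find(peptide)

# Phosphorylated tyrosine within the peptide (0-indexed), as in rel_pos=2.
mod_rel_pos = 2

# Assumed relationship: absolute position = peptide start + relative position.
abs_pos = match_pos + mod_rel_pos

print(match_pos, abs_pos, protein_seq[abs_pos])
# 18 20 Y
```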
copy()[source]

Creates a copy of a modification. Does not copy the underlying sequence object.

Returns:
mod : Modification
display_mod_type()[source]

Return the mod_type in an abbreviated form (e.g. ‘p’ for ‘Phospho’).

Returns:
abbrev : str
exact

Indicates whether each peptide-protein mapping for this modification is an exact or partial match.

Returns:
exact : tuple of bool
letter

This modification’s one-letter amino acid code (e.g. ‘Y’), or ‘N-term’ / ‘C-term’ for terminal modifications.

Returns:
letter : str
to_tuple()[source]
class pyproteome.data_sets.Modifications(mods=None)[source]

Bases: object

A list of modifications.

Wraps the Modification objects and provides several utility functions.

Attributes:
mods : list of Modification
copy()[source]

Creates a copy of a set of modifications. Does not copy the underlying sequence object.

Returns:
mods : Modifications
get_mods(letter_mod_types)[source]

Filter the list of modifications.

Only keeps modifications with a given letter, mod_type, or both.

Parameters:
letter_mod_types : list of tuple of str, str
Returns:
mods : Modifications

Examples

>>> from pyproteome.data_sets.sequence import Sequence
>>> from pyproteome.data_sets.modification import Modification, Modifications
>>> s = Sequence(pep_seq='SVYTEIK')
>>> m = Modifications(
...     [
...         Modification(mod_type='TMT', nterm=True, sequence=s),
...         Modification(mod_type='Phospho', rel_pos=2, sequence=s),
...         Modification(mod_type='TMT', rel_pos=6, sequence=s),
...     ]
... )
>>> m.get_mods('TMT')
['TMT A0', 'TMT K6']
>>> m.get_mods('Phospho')
['pY2']
>>> m.get_mods('Y')
['pY2']
>>> m.get_mods('S')
[]
>>> m.get_mods([('Y', 'Phospho')])
['pY2']
>>> m.get_mods([('S', 'Phospho')])
[]
skip_labels()[source]

Get modifications, skipping over any that are peptide labels.

Returns:
mods : list of Modification
class pyproteome.data_sets.Protein(accession=None, gene=None, description=None, full_sequence=None)[source]

Bases: object

Contains information about a single protein.

Attributes:
accession : str

The UniProt accession (e.g. ‘P40763’).

gene : str

The UniProt gene name (e.g. ‘STAT3’).

description : str

A brief description of the protein (e.g. ‘Signal transducer and activator of transcription 3’).

full_sequence : str

The full sequence of the protein.

class pyproteome.data_sets.Proteins(proteins=None)[source]

Bases: object

Wraps a list of proteins.

Attributes:
proteins : tuple of Protein

List of proteins to which a peptide sequence is mapped.

accessions

List of UniProt accessions for a group of proteins.

Returns:
tuple of str
descriptions

List of protein descriptions for a group of proteins.

Returns:
tuple of str
genes

List of UniProt gene names for a group of proteins.

Returns:
tuple of str
class pyproteome.data_sets.Sequence(pep_seq='', protein_matches=None, modifications=None)[source]

Bases: object

Contains information about a sequence and which proteins it matches to.

Attributes:
pep_seq : str

Peptide sequence, in 1-letter amino acid code.

protein_matches : list of ProteinMatch

Object mapping all proteins that a peptide sequence matches.

modifications : modification.Modifications

Object listing all post-translational modifications identified on a peptide.

is_labeled

Checks whether a sequence is modified on any residue with a quantification label.

Returns:
is_labeled : bool
is_underlabeled

Checks whether a sequence is modified with quantification labels on fewer than all expected residues.

Returns:
is_underlabeled : bool
to_tuple()[source]
class pyproteome.data_sets.ProteinMatch(protein, rel_pos, exact)[source]

Bases: object

Contains information about how a peptide sequence maps onto a protein.

Attributes:
protein : protein.Protein

Protein object.

rel_pos : int

Relative position of the peptide start within the protein sequence.

exact : bool

Indicates whether a peptide sequence exactly matches its protein sequence.

to_tuple()[source]
pyproteome.data_sets.extract_sequence(proteins, sequence_string)[source]

Extract a Sequence object from a list of proteins and sequence string.

Does not set the Sequence.modifications attribute.

Parameters:
proteins : list of protein.Protein
sequence_string : str
Returns:
seqs : list of Sequence

pyproteome.data_sets.constand module

This module provides functionality for manipulating proteomics data sets.

Functionality includes merging data sets and interfacing with attributes in a structured format.

pyproteome.data_sets.constand.CONSTAND_METHODS = {'kde': <function <lambda>>, 'mean': <function nanmean>, 'median': <function nanmedian>}

Methods to estimate the center of a data set’s rows or columns.

One of [‘mean’, ‘median’, ‘kde’]: ‘mean’ applies numpy.nanmean(), ‘median’ applies numpy.nanmedian(), and ‘kde’ applies levels.kde_max().

pyproteome.data_sets.constand.constand(ds, inplace=False, n_iters=25, tol=1e-05, row_method=None, col_method=None)[source]

Normalize channels for intra-run comparisons. Iteratively fits the matrix of quantification values such that each row and column are centered around a calculated value. Uses row means and column median values for centering by default. See constand.CONSTAND_METHODS for other options.

Parameters:
ds : data_sets.DataSet

Data set to apply CONSTANd normalization on.

inplace : bool, optional

Modify this data set in place.

n_iters : int, optional

Max number of normalization iterations. Rows are normalized on the odd step and columns are normalized on the even step.

tol : float, optional

Minimum error tolerance to use to end iterations early.

row_method : str, optional

Row normalization method to use. Default value is ‘mean’.

col_method : str, optional

Column normalization method to use. Default value is ‘median’.

Returns:
ds : data_sets.DataSet
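
The iterative fitting described above can be sketched in plain Python. This is not the library's implementation: it assumes rows and columns are rescaled by division so that their centers approach 1, it covers only the default ‘mean’ (rows) and ‘median’ (columns) methods, and it ignores missing values. Each loop pass performs one row step followed by one column step. The name constand_sketch is hypothetical.

```python
from statistics import mean, median


def constand_sketch(matrix, n_iters=25, tol=1e-5):
    """Alternately rescale rows by their mean and columns by their median."""
    data = [list(row) for row in matrix]
    for _ in range(n_iters):
        # Row step: divide each row by its mean so row means become 1.
        row_centers = [mean(row) for row in data]
        data = [
            [val / center for val in row]
            for row, center in zip(data, row_centers)
        ]
        # Column step: divide each column by its median so column medians become 1.
        col_centers = [median(col) for col in zip(*data)]
        data = [
            [val / center for val, center in zip(row, col_centers)]
            for row in data
        ]
        # Stop early once the column step no longer disturbs the row means.
        if max(abs(mean(row) - 1) for row in data) < tol:
            break
    return data


# Rows that differ only by a scale factor collapse to (approximately) all ones.
matrix = [[1.0, 2.0, 4.0], [2.0, 4.0, 8.0], [10.0, 20.0, 40.0]]
normed = constand_sketch(matrix)
print(normed)
```

Alternating row and column steps is what lets both constraints be approached together: fixing the rows shifts the column centers slightly and vice versa, so the loop repeats until the remaining error falls below tol or n_iters is reached.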

pyproteome.data_sets.data_set module

This module provides functionality for manipulating proteomics data sets.

Functionality includes merging data sets and interfacing with attributes in a structured format.

pyproteome.data_sets.data_set.DATA_SET_COLS = ['Proteins', 'Sequence', 'Modifications', 'Validated', 'Confidence Level', 'Ion Score', 'q-value', 'Isolation Interference', 'Missed Cleavages', 'Ambiguous', 'Charges', 'Masses', 'RTs', 'Intensities', 'Raw Paths', 'Scan Paths', 'Scan', 'Fold Change', 'p-value']

Columns available in DataSet.psms.

Note that this does not include columns for quantification or weights.

pyproteome.data_sets.data_set.DEFAULT_FILTER_BAD = {'ion_score': 15, 'isolation': 30, 'median_quant': 1500.0, 'q': 0.01}

Default parameters for filtering data sets.

Selects all ions with an ion score > 15, isolation interference < 30, median quantification signal > 1.5e3, and optional false-discovery q-value < 0.01.
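
As a rough sketch of how these cutoffs read, the snippet below applies the values from the DEFAULT_FILTER_BAD dictionary to a peptide-spectrum match stored as a plain dict. The function passes_default_filter and the 'median_quant' key are hypothetical; the other keys follow DATA_SET_COLS. The strictness of each comparison is an assumption.

```python
DEFAULT_FILTER_BAD = {
    'ion_score': 15,
    'isolation': 30,
    'median_quant': 1500.0,
    'q': 0.01,
}


def passes_default_filter(psm, cutoffs=DEFAULT_FILTER_BAD):
    """Return True if a PSM (a plain dict here) clears every default cutoff."""
    q_value = psm.get('q-value')
    return (
        psm['Ion Score'] > cutoffs['ion_score']
        and psm['Isolation Interference'] < cutoffs['isolation']
        and psm['median_quant'] > cutoffs['median_quant']
        # The q-value check is optional: skip it when no q-value is available.
        and (q_value is None or q_value < cutoffs['q'])
    )


good = {'Ion Score': 30, 'Isolation Interference': 0, 'median_quant': 7000.0}
low_score = {'Ion Score': 10, 'Isolation Interference': 0, 'median_quant': 7000.0}
high_q = dict(good, **{'q-value': 0.05})

print(passes_default_filter(good))       # True
print(passes_default_filter(low_score))  # False
print(passes_default_filter(high_q))     # False
```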

class pyproteome.data_sets.data_set.DataSet(name='', psms=None, search_name=None, channels=None, groups=None, cmp_groups=None, fix_channel_names=True, dropna=False, pick_best_psm=True, constand_norm=False, merge_duplicates=True, filter_bad=True, check_raw=True, skip_load=False, skip_logging=False)[source]

Bases: object

Class that encompasses a proteomics data set. Data sets can be initialized by calling this class’s constructor directly, or using load_all_data().

Includes peptide-spectrum matches, quantification info, and mappings between channels, samples, and sample groups.

Data sets are automatically loaded, filtered, and merged by default. See DEFAULT_FILTER_BAD for default filtering parameters. See DataSet.merge_duplicates() for info on how multiple peptide-spectrum matches are integrated together.

Attributes:
search_name : str, optional

Name of the search file this data set was loaded from.

psms : pandas.DataFrame

Contains at least ‘Proteins’, ‘Sequence’, and ‘Modifications’ columns as well as any quantification data.

channels : dict of str, str

Maps label channel to sample name.

groups : dict of str, list of str

Maps group names to lists of sample names. The first group in this mapping is treated as the primary group.

cmp_groups : list of list of str

List of groups that are being compared.

name : str

Name of this data set.

levels : dict of str, float

Peptide levels used for normalization.

intra_normalized : bool

Indicates if the data set has been normalized within a TMT-plex analysis.

inter_normalized : bool

Indicates if the data set has been normalized for comparison across TMT-plex analyses.

sets : int

Number of sets merged into this data set.

accessions

Get all UniProt accessions occurring in this data set.

Returns:
list of str

Examples

>>> ds.accessions
['P42227', 'Q920G3', 'Q9ES52']
add_peptide(inserts)[source]

Manually add a peptide or list of peptides to a data set.

Parameters:
inserts : dict or list of dict

Examples

This example demonstrates how to manually insert a peptide that was validated by the user:

prots = data_sets.protein.Proteins(
    proteins=(
        data_sets.protein.Protein(
            accession='Q920G3',
            gene='Siglec5',
            description='Sialic acid-binding Ig-like lectin 5',
            full_sequence=(
                'MRWAWLLPLLWAGCLATDGYSLSVTGSVTVQEGLCVFVACQVQYPNSKGPVFGYWFREGA'
                'NIFSGSPVATNDPQRSVLKEAQGRFYLMGKENSHNCSLDIRDAQKIDTGTYFFRLDGSVK'
                'YSFQKSMLSVLVIALTEVPNIQVTSTLVSGNSTKLLCSVPWACEQGTPPIFSWMSSALTS'
                'LGHRTTLSSELNLTPRPQDNGTNLTCQVNLPGTGVTVERTQQLSVIYAPQKMTIRVSWGD'
                'DTGTKVLQSGASLQIQEGESLSLVCMADSNPPAVLSWERPTQKPFQLSTPAELQLPRAEL'
                'EDQGKYICQAQNSQGAQTASVSLSIRSLLQLLGPSCSFEGQGLHCSCSSRAWPAPSLRWR'
                'LGEGVLEGNSSNGSFTVKSSSAGQWANSSLILSMEFSSNHRLSCEAWSDNRVQRATILLV'
                'SGPKVSQAGKSETSRGTVLGAIWGAGLMALLAVCLCLIFFTVKVLRKKSALKVAATKGNH'
                'LAKNPASTINSASITSSNIALGYPIQGHLNEPGSQTQKEQPPLATVPDTQKDEPELHYAS'
                'LSFQGPMPPKPQNTEAMKSVYTEIKIHKC'
            ),
        ),
    ),
)
seq = data_sets.sequence.extract_sequence(prots, 'SVyTEIK')

mods = data_sets.modification.Modifications(
    mods=[
        data_sets.modification.Modification(
            rel_pos=0,
            mod_type='TMT10plex',
            nterm=True,
            sequence=seq,
        ),
        data_sets.modification.Modification(
            rel_pos=2,
            mod_type='Phospho',
            sequence=seq,
        ),
        data_sets.modification.Modification(
            rel_pos=6,
            mod_type='TMT10plex',
            sequence=seq,
        ),
    ],
)

seq.modifications = mods

ckh_sigf1_py_insert = {
    'Proteins': prots,
    'Sequence': seq,
    'Modifications': mods,
    '126':  1.46e4,
    '127N': 2.18e4,
    '127C': 1.88e4,
    '128N': 4.66e3,
    '128C': 6.70e3,
    '129N': 7.88e3,
    '129C': 1.03e4,
    '130N': 7.28e3,
    '130C': 2.98e3,
    '131':  6.01e3,
    'Validated': True,
    'First Scan': {23074},
    'Raw Paths': {'2019-04-24-CKp25-SiglecF-1-py-SpinCol-col189.raw'},
    'Scan Paths': {'CK-7wk-H1-pY'},
    'IonScore': 30,
    'Isolation Interference': 0,
}
ds.add_peptide([ckh_sigf1_py_insert])
check_raw()[source]

Checks that all raw files referenced in search data can be found in pyproteome.paths.MS_RAW_DIR.

Returns:
found_all : bool
copy()[source]

Make a copy of self.

Returns:
ds : DataSet
data

Get the quantification data for all samples and peptides in a data set.

Returns:
df : pandas.DataFrame
dropna(columns=None, how=None, thresh=None, groups=None, inplace=False)[source]

Drop any channels with NaN values.

Parameters:
columns : list of str, optional
how : str, optional
groups : list of str, optional

Only drop rows with NaN in columns within groups.

inplace : bool, optional
Returns:
ds : DataSet
filter(filters=None, inplace=False, **kwargs)[source]

Filters a data set.

Parameters:
filters : list of dict or dict, optional

List of filters to apply to data set. Filters are also pulled from kwargs (see below).

inplace : bool, optional

Perform the filter on self, otherwise create a copy and return the new object.

Returns:
ds : DataSet

Notes

These parameters filter your data set to only include peptides that match a given attribute. For example:

>>> data.filter(mod='Y', p=0.01, fold=2)

This function accepts filters through both the filters argument and Python keyword arguments. The following three calls are all equivalent:

>>> data.filter(p=0.01)
>>> data.filter([{'p': 0.01}])
>>> data.filter({'p': 0.01})

Filter parameters can be one of any below:

Name             Description
series           Use a pandas series (data.psms[series]).
fn               Use data.psms.apply(fn).
group_a          Calculate p / fold change values from group_a.
group_b          Calculate p / fold change values from group_b.
ambiguous        Include peptides with ambiguous PTMs if true, filter them out if false.
confidence       Discoverer’s peptide confidence (High|Medium|Low).
ion_score        MASCOT’s ion score.
isolation        Discoverer’s isolation interference.
missed_cleavage  Missed cleavages <= cutoff.
median_quant     Median quantification signal >= cutoff.
median_cv        Median coefficient of variation <= cutoff.
p                p-value < cutoff.
q                q-value < cutoff.
asym_fold        Change > cutoff if cutoff > 1 else Change < cutoff.
fold             Change > cutoff or Change < 1 / cutoff.
motif            Filter for motif.
protein          Filter for protein or list of proteins.
accession        Filter for UniProt accession or list of UniProt accessions.
sequence         Filter for sequence or list of sequences.
mod              Filter for modifications.
only_validated   Use rows validated by CAMV.
any              Use rows that match any filter.
inverse          Use all rows that are rejected by a filter.
rename           Change the new data set’s name to a new value.
fix_channel_names()[source]

Correct quantification channel names to those present in the search file.

e.g. from 130_C to 130C (or vice versa).

genes

Get all UniProt gene names occurring in this data set.

Returns:
list of str
get_data(groups=None, mods=None, short_name=False)[source]

Get the quantification data for all samples and peptides in a data set, with more customizable options.

Parameters:
groups : list of str, optional

Select samples from a given list of groups, otherwise select all samples.

mods : str or list of str, optional

Option passed to modification.Modification.__str__().

short_name : bool, optional

Use the short abbreviation of a gene name (using pyproteome.utils.get_name()). Otherwise use the long version.

Returns:
df : pandas.DataFrame
get_groups(group_a=None, group_b=None)[source]

Get channels associated with two groups.

Parameters:
group_a : str or list of str, optional
group_b : str or list of str, optional
Returns:
samples : list of str
labels : list of str
groups : tuple of (str or list of str)
get_samples(groups=None)[source]

Get a list of sample names in this data set.

Parameters:
groups : list of (list of str), optional
Returns:
list of str
intensity_data

Get the quantification data for all samples and peptides in a data set.

Parameters:
norm_cmp : bool, optional
Returns:
df : pandas.DataFrame
inter_normalize(norm_channels=None, other=None, inplace=False)[source]

Normalize runs to one channel for inter-run comparisons.

Parameters:
other : DataSet, optional

Second data set to normalize quantification values against, using common normalization channels.

norm_channels : list of str, optional

Normalization channels to use for cross-run normalization.

inplace : bool, optional

Modify this data set in place.

Returns:
ds : DataSet
log_stats()[source]

Log summary statistics about the peptides contained in the data set. This information includes total numbers, phospho-specificity, modification ambiguity, completeness of labeling, and missed cleavage counts.

merge_duplicates(inplace=False)[source]

Merge together all duplicate peptides. New quantification values are calculated from a weighted sum of each channel’s values.

Parameters:
inplace : bool, optional

Modify the data set in place, otherwise create a copy and return the new object.

Returns:
ds : DataSet
merge_subsequences(inplace=False)[source]

Merges peptides that are a subsequence of another peptide (e.g. SVYTEIKIHK + SVYTEIK -> SVYTEIK).

Only merges peptides that contain the same set of modifications and that map to the same protein(s).

Parameters:
inplace : bool, optional
Returns:
ds : DataSet
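
The matching rule above (same modifications, same proteins, one sequence contained in the other) can be sketched in plain Python. Everything here is an assumption for illustration: find_subsequence_pairs is a hypothetical helper, each peptide is represented as a (sequence, modifications, proteins) tuple, and 'PROT1' is a placeholder accession. How the quantification values of a matched pair are then combined is not shown.

```python
def find_subsequence_pairs(peptides):
    """Return (longer, shorter) sequence pairs that are merge candidates.

    Each peptide is a hypothetical (sequence, mods, proteins) tuple.
    """
    pairs = []
    for long_seq, long_mods, long_prots in peptides:
        for short_seq, short_mods, short_prots in peptides:
            if (
                short_seq != long_seq
                and short_seq in long_seq        # contained in the longer one
                and short_mods == long_mods      # same set of modifications
                and short_prots == long_prots    # same protein mapping
            ):
                pairs.append((long_seq, short_seq))
    return pairs


peptides = [
    ('SVYTEIKIHK', frozenset({'pY'}), frozenset({'PROT1'})),
    ('SVYTEIK', frozenset({'pY'}), frozenset({'PROT1'})),
    ('SVYTEIK', frozenset(), frozenset({'PROT1'})),
]
print(find_subsequence_pairs(peptides))
```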
norm_cmp_groups(cmp_groups, ctrl_groups=None, inplace=False)[source]

Normalize between groups in a list. This can be used to compare data sets that have comparable control groups.

Channels within each list of groups are normalized to the mean of the group’s channels.

Parameters:
cmp_groups : list of list of str

List of groups to be normalized to each other, e.g. [('CK Hip', 'CK-p25 Hip'), ('CK Cortex', 'CK-p25 Cortex')]

ctrl_groups : list of str, optional

List of groups to use for baseline normalization. If not set, the first group from each comparison will be used.

inplace : bool, optional

Modify the data set in place, otherwise create a copy and return the new object.

Returns:
ds : DataSet

Examples

>>> channels = ['a', 'b', 'c', 'd']
>>> groups = {i: [i] for i in channels}
>>> ds = data_sets.DataSet(channels=channels, groups=groups)
>>> ds.add_peptide({'a': 1000, 'b': 500, 'c': 100, 'd': 25})
>>> ds = ds.norm_cmp_groups([['a', 'b'], ['c', 'd']])
>>> ds.data
{'a': 1, 'b': 0.5, 'c': 1, 'd': .25}
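
The arithmetic behind the example above can be reproduced without pyproteome. This is a minimal sketch, assuming the baseline for each comparison is the mean of the first (control) group's channels; norm_cmp_groups_sketch is a hypothetical name, and a single peptide is represented as a dict of channel values.

```python
from statistics import mean


def norm_cmp_groups_sketch(values, groups, cmp_groups):
    """Divide each channel by the mean of its comparison's control group.

    values: dict of channel -> quantification value (one peptide).
    groups: dict of group name -> list of channels.
    cmp_groups: list of lists of group names; the first is the control.
    """
    normed = {}
    for comparison in cmp_groups:
        ctrl_channels = groups[comparison[0]]
        baseline = mean(values[ch] for ch in ctrl_channels)
        for group in comparison:
            for ch in groups[group]:
                normed[ch] = values[ch] / baseline
    return normed


values = {'a': 1000, 'b': 500, 'c': 100, 'd': 25}
groups = {ch: [ch] for ch in values}
normed = norm_cmp_groups_sketch(values, groups, [['a', 'b'], ['c', 'd']])
print(normed)
```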
normalize(lvls, inplace=False)[source]

Normalize channels to given levels for intra-run comparisons.

Divides all channel values by a given level.

Parameters:
lvls : dict of str, float

Mapping of channel names to normalized levels. All quantification values for each channel are divided by the normalization factor.

inplace : bool, optional

Modify this data set in place.

Returns:
ds : DataSet
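
The division described above is simple enough to show on plain dicts. In this sketch, normalize_sketch is a hypothetical stand-in that operates on a list of rows instead of a DataSet, and the level values are made up for illustration.

```python
def normalize_sketch(rows, lvls):
    """Divide every channel value by that channel's normalization level.

    rows: list of dicts mapping channel name -> quantification value.
    lvls: dict mapping channel name -> normalization level.
    """
    return [
        {ch: val / lvls[ch] for ch, val in row.items()}
        for row in rows
    ]


rows = [
    {'126': 100.0, '127': 220.0},
    {'126': 50.0, '127': 110.0},
]
lvls = {'126': 1.0, '127': 2.0}
print(normalize_sketch(rows, lvls))
```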
phosphosites

Get a list of all unique phosphosites identified in a data set.

Returns:
list of str

Examples

>>> ds.phosphosites
['Siglec5 pY561', 'Stat3 pY705', 'Inpp5d pY868']
rename_channels(inplace=False)[source]

Rename all column names for quantification channels to sample names (e.g. ‘126’ => ‘Mouse #1’).

Parameters:
inplace : bool, optional
Returns:
ds : DataSet
samples

Get a list of sample names in this data set.

Returns:
list of str
shape

Get the size of a data set in (rows, columns) format.

Returns:
shape : tuple of (int, int)
update_group_changes(group_a=None, group_b=None)[source]

Update a DataSet’s Fold Change and p-value for each peptide using the given two-group comparison.

Values are calculated based on changes between group_a and group_b. p-values are calculated using a two-sample t-test.

Parameters:
psms : pandas.DataFrame
group_a : str or list of str, optional

Single or multiple groups to use for fold change numerator.

group_b : str or list of str, optional

Single or multiple groups to use for fold change denominator.
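
To make the two quantities concrete, the sketch below computes a fold change and a two-sample t statistic for one peptide in pure Python. It rests on assumptions not stated in the documentation: the fold change is taken as the ratio of group means, and Welch's unequal-variance form of the t statistic is used. A p-value would then follow from the t distribution, for example via scipy.stats.ttest_ind. The helper names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def fold_change(group_a, group_b):
    """Ratio of mean intensities: group_a (numerator) over group_b."""
    return mean(group_a) / mean(group_b)


def welch_t(group_a, group_b):
    """Welch's two-sample t statistic (unequal variances assumed)."""
    se = sqrt(
        variance(group_a) / len(group_a)
        + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / se


a_values = [2.0, 4.0, 6.0]
b_values = [1.0, 2.0, 3.0]
print(fold_change(a_values, b_values))
print(round(welch_t(a_values, b_values), 3))
```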

pyproteome.data_sets.data_set.load_all_data(chan_mapping=None, group_mapping=None, loaded_fn=None, norm_mapping=None, merge_mapping=None, merged_fn=None, kw_mapping=None, merge_only=True, replace_norm=True, **kwargs)[source]

Load, normalize, and merge all data sets found in pyproteome.paths.MS_SEARCHED_DIR.

Parameters:
chan_mapping : dict, optional
group_mapping : dict, optional
loaded_fn : func, optional
norm_mapping : dict, optional
merge_mapping : dict, optional
merged_fn : func, optional
kw_mapping : dict of (str, dict)
merge_only : bool, optional
replace_norm : bool, optional

If true, only keep the normalized version of a data set. Otherwise return both normalized and unnormalized version.

kwargs : dict

Any extra arguments are passed directly to DataSet during initialization.

Returns:
datas : dict of str, DataSet

Examples

This example demonstrates how to automatically load, filter, normalize, and merge together several data sets:

from collections import OrderedDict

from pyproteome import data_sets

ckh_channels = OrderedDict(
    [
        ('3130 CK Hip',     '126'),
        ('3131 CK-p25 Hip', '127'),
        ('3145 CK-p25 Hip', '128'),
        ('3146 CK-p25 Hip', '129'),
        ('3148 CK Hip',     '130'),
        ('3157 CK Hip',     '131'),
    ]
)
ckx_channels = OrderedDict(
    [
        ('3130 CK Cortex',     '126'),
        ('3131 CK-p25 Cortex', '127'),
        ('3145 CK-p25 Cortex', '128'),
        ('3146 CK-p25 Cortex', '129'),
        ('3148 CK Cortex',     '130'),
        ('3157 CK Cortex',     '131'),
    ]
)
ckp25_groups = OrderedDict(
    [
        (
            'CK',
            [
                '3130 CK Hip',
                '3148 CK Hip',
                '3157 CK Hip',
                '3130 CK Cortex',
                '3148 CK Cortex',
                '3157 CK Cortex',
            ],
        ),
        (
            'CK-p25',
            [
                '3131 CK-p25 Hip',
                '3145 CK-p25 Hip',
                '3146 CK-p25 Hip',
                '3131 CK-p25 Cortex',
                '3145 CK-p25 Cortex',
                '3146 CK-p25 Cortex',
            ],
        ),
    ]
)
# With search data located as follows:
#   Searched/
#       CK-H1-pY.msf
#       CK-H1-pST.msf
#       CK-H1-Global.msf
#       CK-X1-pY.msf
#       CK-X1-pST.msf
#       CK-X1-Global.msf
# Load each data set, normalized to its respective global proteome analysis:
datas = data_sets.load_all_data(
    chan_mapping={
        'CK-H': ckh_channels,
        'CK-X': ckx_channels,
    },
    # Normalize pY, pST, and Global runs to each sample's global data
    norm_mapping={
        'CK-H1': 'CK-H1-Global',
        'CK-X1': 'CK-X1-Global',
    },
    # Merge together normalized hippocampus and cortex runs
    merge_mapping={
        'CK Hip': ['CK-H1-pY', 'CK-H1-pST', 'CK-H1-Global'],
        'CK Cortex': ['CK-X1-pY', 'CK-X1-pST', 'CK-X1-Global'],
        'CK All': ['CK Hip', 'CK Cortex'],
    },
    groups=ckp25_groups,
)

# Alternatively, load each data set, using CONSTANd normalization:
data_sets.constand.DEFAULT_CONSTAND_COL = 'kde'
datas = data_sets.load_all_data(
    chan_mapping={
        'CK-H': ckh_channels,
        'CK-X': ckx_channels,
    },
    norm_mapping='constand',
    # Merge together normalized hippocampus and cortex runs
    merge_mapping={
        'CK Hip': ['CK-H1-pY', 'CK-H1-pST', 'CK-H1-Global'],
        'CK Cortex': ['CK-X1-pY', 'CK-X1-pST', 'CK-X1-Global'],
        'CK All': ['CK Hip', 'CK Cortex'],
    },
    groups=ckp25_groups,
)
pyproteome.data_sets.data_set.merge_all_data(datas, merge_mapping, mapped_names=None, merged_fn=None, inplace=True)[source]

Merge together multiple data sets.

Parameters:
datas : dict of (str, DataSet)
merge_mapping : dict of (str, list of str)
mapped_names : dict of (str, str), optional
merged_fn : func, optional
inplace : bool, optional

Modify the datas object in place.

Returns:
datas : dict of (str, DataSet)
pyproteome.data_sets.data_set.merge_data(data_sets, name=None, norm_channels=None, merge_duplicates=True)[source]

Merge a list of data sets together.

Parameters:
data_sets : list of DataSet
name : str, optional
norm_channels : dict of (str, str)
merge_duplicates : bool, optional
Returns:
ds : DataSet
pyproteome.data_sets.data_set.merge_proteins(ds, inplace=False, fn=None)[source]

Merge together all peptides mapped to the same protein. Maintains the first available peptide and calculates the median quantification value for each protein across all of its peptides.

Parameters:
ds : DataSet
inplace : bool, optional
Returns:
ds : DataSet
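
The per-protein summary described above (keep the first peptide, take the median across peptides) can be sketched with the standard library. The helper merge_proteins_sketch and the (protein, sequence, value) tuples are hypothetical simplifications with a single quantification value per peptide; the peptide sequences other than SVYTEIK are placeholders.

```python
from collections import defaultdict
from statistics import median


def merge_proteins_sketch(peptides):
    """Collapse peptides to one entry per protein.

    peptides: list of (protein, sequence, value) tuples.
    Returns a dict: protein -> (first sequence seen, median value).
    """
    by_protein = defaultdict(list)
    for protein, sequence, value in peptides:
        by_protein[protein].append((sequence, value))

    merged = {}
    for protein, entries in by_protein.items():
        first_sequence = entries[0][0]
        merged[protein] = (first_sequence, median(v for _, v in entries))
    return merged


peptides = [
    ('Stat3', 'SVYTEIK', 1.0),
    ('Stat3', 'PEPTIDEAK', 3.0),
    ('Stat3', 'PEPTIDEBK', 2.0),
    ('Inpp5d', 'PEPTIDECK', 4.0),
]
print(merge_proteins_sketch(peptides))
```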
pyproteome.data_sets.data_set.norm_all_data(datas, norm_mapping, replace_norm=True, inplace=True)[source]

Normalize all data sets.

Parameters:
datas : dict of (str, DataSet)
norm_mapping : dict of (str, str)
replace_norm : bool, optional
inplace : bool, optional

Modify the datas object in place.

Returns:
datas : dict of (str, DataSet)
mapped_names : dict of (str, str)
pyproteome.data_sets.data_set.update_correlation(ds, corr, metric='spearman', min_periods=5)[source]

Update a table’s Fold-Change and p-value columns.

Values are calculated based on changes between group_a and group_b.

Parameters:
ds : DataSet
corr : pandas.Series
metric : str, optional
min_periods : int, optional
Returns:
ds : DataSet
pyproteome.data_sets.data_set.update_pairwise_corr(ds, x_val, y_val, inplace=False)[source]

pyproteome.data_sets.modification module

This module provides functionality for post-translational modifications.

Wraps modifications in a structured class and allows filtering of modifications by amino acid and modification type.

pyproteome.data_sets.modification.LABEL_NAME_TARGETS = ('TMT', 'ITRAQ', 'plex')

Substrings used to identify and import novel label names from .msf files.

pyproteome.data_sets.modification.MERGE_UNDERLABELED = True

Merge peptides that have saturated TMT labeling with peptides that are underlabeled.

class pyproteome.data_sets.modification.Modification(rel_pos=0, mod_type='', sequence=None, nterm=False, cterm=False)[source]

Bases: object

Contains information for a single peptide modification.

Attributes:
rel_pos : int

The relative position of a modification in a peptide sequence (0-indexed).

mod_type : str

A short name for this type of modification (e.g. ‘Phospho’, ‘Carbamidomethyl’, ‘Oxidation’, ‘TMT6’, ‘TMT10’).

nterm : bool

Boolean indicator of whether this modification is applied to the peptide N-terminus.

cterm : bool

Boolean indicator of whether this modification is applied to the peptide C-terminus.

abs_pos

The absolute positions of this modification in the full sequence of each mapped protein (0-indexed).

Returns:
tuple of int
copy()[source]

Creates a copy of a modification. Does not copy the underlying sequence object.

Returns:
mod : Modification
display_mod_type()[source]

Return the mod_type in an abbreviated form (e.g. ‘p’ for ‘Phospho’).

Returns:
abbrev : str
exact

Indicates whether each peptide-protein mapping for this modification is an exact or partial match.

Returns:
exact : tuple of bool
letter

This modification’s one-letter amino acid code (e.g. ‘Y’), or ‘N-term’ / ‘C-term’ for terminal modifications.

Returns:
letter : str
to_tuple()[source]
class pyproteome.data_sets.modification.Modifications(mods=None)[source]

Bases: object

A list of modifications.

Wraps the Modification objects and provides several utility functions.

Attributes:
mods : list of Modification
copy()[source]

Creates a copy of a set of modifications. Does not copy the underlying sequence object.

Returns:
mods : Modifications
get_mods(letter_mod_types)[source]

Filter the list of modifications.

Only keeps modifications with a given letter, mod_type, or both.

Parameters:
letter_mod_types : list of tuple of str, str
Returns:
mods : Modifications

Examples

>>> from pyproteome.data_sets.sequence import Sequence
>>> from pyproteome.data_sets.modification import Modification, Modifications
>>> s = Sequence(pep_seq='SVYTEIK')
>>> m = Modifications(
...     [
...         Modification(mod_type='TMT', nterm=True, sequence=s),
...         Modification(mod_type='Phospho', rel_pos=2, sequence=s),
...         Modification(mod_type='TMT', rel_pos=6, sequence=s),
...     ]
... )
>>> m.get_mods('TMT')
['TMT A0', 'TMT K6']
>>> m.get_mods('Phospho')
['pY2']
>>> m.get_mods('Y')
['pY2']
>>> m.get_mods('S')
[]
>>> m.get_mods([('Y', 'Phospho')])
['pY2']
>>> m.get_mods([('S', 'Phospho')])
[]
skip_labels()[source]

Get modifications, skipping over any that are peptide labels.

Returns:
mods : list of Modification
pyproteome.data_sets.modification.allowed_mod_type(mod, any_letter=None, any_mod=None, letter_mod=None)[source]

Check if a modification is of a given type.

Filters by letter, mod_type, or both.

Parameters:
mod : Modification
any_letter : set of str
any_mod : set of str
letter_mod : set of tuple of str, str
Returns:
is_type : bool
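
The three filter arguments can be read as three independent membership tests. The sketch below is an assumption about how they might combine (a modification passes if any one test matches) and uses a hypothetical Mod namedtuple in place of the real Modification class.

```python
from collections import namedtuple

# Hypothetical stand-in for a Modification: residue letter plus mod type.
Mod = namedtuple('Mod', ['letter', 'mod_type'])


def allowed_mod_type_sketch(mod, any_letter=None, any_mod=None,
                            letter_mod=None):
    """Return True if the modification matches any supplied filter."""
    return bool(
        (any_letter is not None and mod.letter in any_letter)
        or (any_mod is not None and mod.mod_type in any_mod)
        or (letter_mod is not None
            and (mod.letter, mod.mod_type) in letter_mod)
    )


py = Mod(letter='Y', mod_type='Phospho')
print(allowed_mod_type_sketch(py, any_letter={'Y'}))
print(allowed_mod_type_sketch(py, letter_mod={('S', 'Phospho')}))
```

In this sketch a call with no filters at all returns False; the library's behavior in that case is not documented here.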

pyproteome.data_sets.protein module

This module provides functionality for interfacing with protein data.

class pyproteome.data_sets.protein.Protein(accession=None, gene=None, description=None, full_sequence=None)[source]

Bases: object

Contains information about a single protein.

Attributes:
accession : str

The UniProt accession (e.g. ‘P40763’).

gene : str

The UniProt gene name (e.g. ‘STAT3’).

description : str

A brief description of the protein (e.g. ‘Signal transducer and activator of transcription 3’).

full_sequence : str

The full sequence of the protein.

class pyproteome.data_sets.protein.Proteins(proteins=None)[source]

Bases: object

Wraps a list of proteins.

Attributes:
proteins : tuple of Protein

List of proteins to which a peptide sequence is mapped.

accessions

List of UniProt accessions for a group of proteins.

Returns:
tuple of str
descriptions

List of protein descriptions for a group of proteins.

Returns:
tuple of str
genes

List of UniProt gene names for a group of proteins.

Returns:
tuple of str

pyproteome.data_sets.sequence module

This module provides functionality for manipulating sequences.

class pyproteome.data_sets.sequence.ProteinMatch(protein, rel_pos, exact)[source]

Bases: object

Contains information about how a peptide sequence maps onto a protein.

Attributes:
protein : protein.Protein

Protein object.

rel_pos : int

Relative position of the peptide start within the protein sequence.

exact : bool

Indicates whether a peptide sequence exactly matches its protein sequence.

to_tuple()[source]
class pyproteome.data_sets.sequence.Sequence(pep_seq='', protein_matches=None, modifications=None)[source]

Bases: object

Contains information about a sequence and which proteins it matches to.

Attributes:
pep_seq : str

Peptide sequence, in 1-letter amino acid code.

protein_matches : list of ProteinMatch

Object mapping all proteins that a peptide sequence matches.

modifications : modification.Modifications

Object listing all post-translational modifications identified on a peptide.

is_labeled

Checks whether a sequence is modified on any residue with a quantification label.

Returns:
is_labeled : bool
is_underlabeled

Checks whether a sequence is modified with quantification labels on fewer than all expected residues.

Returns:
is_underlabeled : bool
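
One way to picture the two checks is to count label sites. The sketch below assumes an amine-reactive label such as TMT, which reacts with the peptide N-terminus and with lysine side chains, so the expected number of labels is one plus the number of lysines. The function names are hypothetical and the real properties may apply different rules.

```python
def expected_label_sites(pep_seq):
    """N-terminus plus every lysine, assuming an amine-reactive label."""
    return 1 + pep_seq.upper().count('K')


def is_labeled_sketch(label_count):
    """True if at least one quantification label is present."""
    return label_count > 0


def is_underlabeled_sketch(pep_seq, label_count):
    """True if fewer labels are present than expected sites."""
    return label_count < expected_label_sites(pep_seq)


# SVYTEIK has one lysine, so two labels are expected (N-term + K).
print(is_underlabeled_sketch('SVYTEIK', 2))
print(is_underlabeled_sketch('SVYTEIK', 1))
```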
to_tuple()[source]
pyproteome.data_sets.sequence.extract_sequence(proteins, sequence_string)[source]

Extract a Sequence object from a list of proteins and a sequence string.

Does not set the Sequence.modifications attribute.

Parameters:
proteins : list of protein.Protein
sequence_string : str
Returns:
seqs : list of Sequence
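
The mapping step can be illustrated by locating a peptide inside each protein's full sequence. This is only a sketch under stated assumptions: match_peptide is a hypothetical helper, proteins are given as a dict of placeholder accession to full sequence, and a match is called exact when the peptide occurs verbatim in the protein.

```python
def match_peptide(pep_seq, proteins):
    """Find where a peptide starts within each protein sequence.

    proteins: dict mapping accession -> full protein sequence.
    Returns a list of (accession, rel_pos, exact) tuples, where rel_pos
    is the 0-indexed start position, or None if the peptide is absent.
    """
    matches = []
    for accession, full_sequence in proteins.items():
        rel_pos = full_sequence.find(pep_seq.upper())
        exact = rel_pos >= 0
        matches.append((accession, rel_pos if exact else None, exact))
    return matches


proteins = {
    'PROT1': 'MAAASVYTEIKGG',
    'PROT2': 'MKKK',
}
print(match_peptide('SVYTEIK', proteins))
```

Under the same assumptions, the absolute position of a modification would be the peptide's start position plus the modification's relative position within the peptide.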