proteometer.lip

Attributes

Functions

filter_contaminants_reverse_pept(→ pandas.DataFrame)

Filters out contaminants and reverse hits from a peptide DataFrame.

filter_contaminants_reverse_prot(→ pandas.DataFrame)

Filters out contaminants and reverse hits from a protein DataFrame.

filtering_protein_based_on_peptide_number(...)

Filters proteins based on the minimum number of peptides.

get_clean_peptides(→ pandas.DataFrame)

Cleans peptide sequences by removing modifications and returns a DataFrame with cleaned peptides.

get_tryptic_types(→ pandas.DataFrame)

Analyzes the tryptic pattern of peptides and classifies them as tryptic, semi-tryptic, or non-tryptic.

select_tryptic_pattern(→ pandas.DataFrame)

Selects peptides based on their digestion pattern.

analyze_tryptic_pattern(→ pandas.DataFrame)

Analyzes tryptic patterns and calculates statistics for peptides.

rollup_to_lytic_site(→ pandas.DataFrame)

Converts the double-peptide data frame to a site-level data frame.

rollup_single_protein_to_lytic_site(→ pandas.DataFrame)

Rolls up peptide-level limited proteolysis data to lytic sites.

select_lytic_sites(→ pandas.DataFrame)

Selects lytic sites based on the specified site type.

delta_prok_site(→ pandas.DataFrame)

Computes exposure values for each lytic (ProK) site.

Module Contents

proteometer.lip.AggDictFloat
proteometer.lip.filter_contaminants_reverse_pept(df: pandas.DataFrame, search_tool: Literal['maxquant', 'msfragger', 'fragpipe'], protein_id_col_pept: str, uniprot_col: str) → pandas.DataFrame

Filters out contaminants and reverse hits from a peptide DataFrame.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing peptide data.

  • search_tool (Literal["maxquant", "msfragger", "fragpipe"]) – The search tool used for data generation.

  • protein_id_col_pept (str) – Column name containing protein IDs in the peptide DataFrame.

  • uniprot_col (str) – Column name to store UniProt IDs.

Returns:

Filtered DataFrame with contaminants and reverse hits removed.

Return type:

pd.DataFrame
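
The filtering idea can be illustrated without the library. The following is a minimal sketch in plain Python (lists of dicts instead of a DataFrame, to stay dependency-free), not the proteometer implementation. It assumes MaxQuant-style `CON__` and `REV__` ID prefixes, semicolon-separated protein groups, and UniProt FASTA-style IDs such as `sp|P12345|AATM_RABIT`; the helper names and the rule "drop a row if any ID in the group is flagged" are hypothetical. Other search tools use different prefixes, so the prefixes are a parameter.

```python
def uniprot_accession(protein_id: str) -> str:
    """Extract the accession from 'db|ACCESSION|NAME'; otherwise return the ID unchanged."""
    parts = protein_id.split("|")
    return parts[1] if len(parts) >= 3 else protein_id


def drop_contaminants_and_reverse(rows, id_key, uniprot_key, prefixes=("CON__", "REV__")):
    """Keep rows whose protein IDs carry none of the given prefixes (assumed convention)."""
    kept = []
    for row in rows:
        ids = row[id_key].split(";")
        # str.startswith accepts a tuple of prefixes
        if any(pid.startswith(tuple(prefixes)) for pid in ids):
            continue
        # Store the accession of the first ID in a new key, mirroring uniprot_col
        kept.append({**row, uniprot_key: uniprot_accession(ids[0])})
    return kept


rows = [
    {"Proteins": "sp|P12345|AATM_RABIT"},
    {"Proteins": "CON__P00761"},
    {"Proteins": "REV__sp|Q9Y6K9|NEMO_HUMAN"},
    {"Proteins": "P04637;CON__P02769"},
]
filtered = drop_contaminants_and_reverse(rows, "Proteins", "UniProt")
```

In this sketch only the first row survives, and it gains a `UniProt` entry holding `P12345`.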

proteometer.lip.filter_contaminants_reverse_prot(df: pandas.DataFrame, search_tool: Literal['maxquant', 'msfragger', 'fragpipe'], protein_id_col_prot: str, uniprot_col: str) → pandas.DataFrame

Filters out contaminants and reverse hits from a protein DataFrame.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing protein data.

  • search_tool (Literal["maxquant", "msfragger", "fragpipe"]) – The search tool used for data generation.

  • protein_id_col_prot (str) – Column name containing protein IDs in the protein DataFrame.

  • uniprot_col (str) – Column name to store UniProt IDs.

Returns:

Filtered DataFrame with contaminants and reverse hits removed.

Return type:

pd.DataFrame

proteometer.lip.filtering_protein_based_on_peptide_number(df2filter: pandas.DataFrame, peptide_counts_col: str, search_tool: Literal['maxquant', 'msfragger', 'fragpipe'], min_pept_count: int = 2) → pandas.DataFrame

Filters proteins based on the minimum number of peptides.

Parameters:
  • df2filter (pd.DataFrame) – Input DataFrame containing proteomics data.

  • peptide_counts_col (str) – Column name containing peptide counts.

  • search_tool (Literal["maxquant", "msfragger", "fragpipe"]) – The search tool used for data generation.

  • min_pept_count (int, optional) – Minimum number of peptides required. Defaults to 2.

Returns:

Filtered DataFrame with proteins having at least min_pept_count peptides.

Return type:

pd.DataFrame

proteometer.lip.get_clean_peptides(pept_df: pandas.DataFrame, peptide_col: str, clean_pept_col: str = 'clean_pept') → pandas.DataFrame

Cleans peptide sequences by removing modifications and returns a DataFrame with cleaned peptides.

Parameters:
  • pept_df (pd.DataFrame) – Input DataFrame containing peptide data.

  • peptide_col (str) – Column name containing peptide sequences.

  • clean_pept_col (str, optional) – Column name to store cleaned peptide sequences. Defaults to “clean_pept”.

Returns:

DataFrame with an additional column for cleaned peptide sequences.

Return type:

pd.DataFrame
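
As a rough illustration of what "removing modifications" can mean, the sketch below strips bracketed or parenthesized annotations and then keeps only uppercase residue letters. The function name and the regular expressions are assumptions for illustration; the library's actual cleaning rules may differ, and nested parentheses (as in some MaxQuant modified sequences) are not handled here.

```python
import re


def clean_peptide(seq: str) -> str:
    """Return the bare amino-acid sequence of an annotated peptide string."""
    # Remove annotations such as [+79.966] or (ox); non-nested only
    without_annotations = re.sub(r"\[[^\]]*\]|\([^)]*\)", "", seq)
    # Drop everything that is not an uppercase residue letter (underscores, 'n', digits, ...)
    return re.sub(r"[^A-Z]", "", without_annotations)


cleaned = [clean_peptide(s) for s in ("PEPT[+79.966]IDEM(ox)K", "_AAM(ox)K_", "n[43]AAK")]
```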

proteometer.lip.get_tryptic_types(pept_df: pandas.DataFrame, prot_seq: str, peptide_col: str, clean_pept_col: str = 'clean_pept') → pandas.DataFrame

Analyzes the tryptic pattern of peptides and classifies them as tryptic, semi-tryptic, or non-tryptic.

Parameters:
  • pept_df (pd.DataFrame) – Input DataFrame containing peptide data.

  • prot_seq (str) – Protein sequence to analyze against.

  • peptide_col (str) – Column name containing peptide sequences.

  • clean_pept_col (str, optional) – Column name for cleaned peptide sequences. Defaults to “clean_pept”.

Returns:

DataFrame with additional columns for peptide start, end, and type.

Return type:

pd.DataFrame
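
The classification can be sketched for a single peptide as follows. This is a simplified illustration under stated assumptions, not the library's code: a terminus counts as tryptic if it follows K or R (or coincides with a protein terminus), the proline rule is ignored, only the first occurrence of the peptide in the protein is used, and the function name, the 1-based inclusive positions, and the `"unmatched"` label are hypothetical.

```python
def classify_peptide(peptide: str, protein: str) -> tuple[int, int, str]:
    """Return (start, end, type) with 1-based, inclusive positions."""
    idx = protein.find(peptide) if peptide else -1
    if idx < 0:
        return (-1, -1, "unmatched")
    start, end = idx + 1, idx + len(peptide)
    # N-terminus: peptide starts the protein or follows K/R
    n_term_ok = start == 1 or protein[idx - 1] in "KR"
    # C-terminus: peptide ends the protein or ends in K/R
    c_term_ok = end == len(protein) or peptide[-1] in "KR"
    n_tryptic_ends = int(n_term_ok) + int(c_term_ok)
    return (start, end, ("non-tryptic", "semi-tryptic", "tryptic")[n_tryptic_ends])


protein = "MAKTEDRLLSKPEG"
results = {p: classify_peptide(p, protein) for p in ("TEDR", "EDR", "ED", "PEG")}
```

With this toy sequence, `TEDR` has two tryptic ends, `EDR` has one, and `ED` has none; `PEG` counts as tryptic because its C-terminus is the protein C-terminus.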

proteometer.lip.select_tryptic_pattern(pept_df: pandas.DataFrame, prot_seq: str, tryptic_pattern: str = 'all', peptide_col: str = 'Sequence', clean_pept_col: str = 'clean_pept') → pandas.DataFrame

Selects peptides based on their digestion pattern.

Parameters:
  • pept_df (pd.DataFrame) – Input DataFrame containing peptide data.

  • prot_seq (str) – Protein sequence to analyze against.

  • tryptic_pattern (str, optional) – Digestion pattern to filter peptides. Must be one of: all, any-tryptic, tryptic, semi-tryptic, non-tryptic. Defaults to “all”.

  • peptide_col (str, optional) – Column name containing peptide sequences. Defaults to “Sequence”.

  • clean_pept_col (str, optional) – Column name for cleaned peptide sequences. Defaults to “clean_pept”.

Returns:

Filtered DataFrame with peptides matching the specified digestion pattern.

Return type:

pd.DataFrame
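
A pattern filter of this kind can be sketched as a lookup from pattern name to the set of accepted peptide types. The mapping below is an assumption: in particular, reading "any-tryptic" as "tryptic or semi-tryptic" is a guess, and the type labels and the `pept_type` key are illustrative rather than confirmed by the library.

```python
# Hypothetical mapping from pattern name to accepted peptide-type labels
PATTERN_TO_TYPES = {
    "all": {"tryptic", "semi-tryptic", "non-tryptic"},
    "any-tryptic": {"tryptic", "semi-tryptic"},  # assumed meaning
    "tryptic": {"tryptic"},
    "semi-tryptic": {"semi-tryptic"},
    "non-tryptic": {"non-tryptic"},
}


def select_by_pattern(rows, pattern="all", type_key="pept_type"):
    """Keep rows whose peptide type is accepted by the requested pattern."""
    if pattern not in PATTERN_TO_TYPES:
        raise ValueError(f"tryptic_pattern must be one of {sorted(PATTERN_TO_TYPES)}")
    allowed = PATTERN_TO_TYPES[pattern]
    return [row for row in rows if row[type_key] in allowed]


peptides = [
    {"Sequence": "TEDR", "pept_type": "tryptic"},
    {"Sequence": "EDR", "pept_type": "semi-tryptic"},
    {"Sequence": "ED", "pept_type": "non-tryptic"},
]
any_tryptic = select_by_pattern(peptides, "any-tryptic")
```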

proteometer.lip.analyze_tryptic_pattern(protein: pandas.DataFrame, sequence: str, pairwise_ttest_groups: collections.abc.Iterable[proteometer.stats.TTestGroup], peptide_col: str, description: str = '', anova_type: str = '[Group]', keep_non_tryptic: bool = True, id_separator: str = '@', sig_type: str = 'pval', sig_thr: float = 0.05) → pandas.DataFrame

Analyzes tryptic patterns and calculates statistics for peptides.

Parameters:
  • protein (pd.DataFrame) – Input DataFrame containing proteomics data.

  • sequence (str) – Protein sequence to analyze against.

  • pairwise_ttest_groups (Iterable[TTestGroup]) – Groups for pairwise t-tests.

  • peptide_col (str) – Column name containing peptide sequences.

  • description (str, optional) – Protein description to add to data frame. Defaults to “”.

  • anova_type (str, optional) – Type of ANOVA analysis. Defaults to “[Group]”.

  • keep_non_tryptic (bool, optional) – Whether to keep non-tryptic peptides. Defaults to True.

  • id_separator (str, optional) – Separator for peptide IDs. Defaults to “@”.

  • sig_type (str, optional) – Significance type (e.g., “pval”). Defaults to “pval”.

  • sig_thr (float, optional) – Significance threshold. Defaults to 0.05.

Returns:

DataFrame with analyzed tryptic patterns and statistics.

Return type:

pd.DataFrame

proteometer.lip.rollup_to_lytic_site(double_pept: pandas.DataFrame, prot_seqs: list[proteometer.fasta.SeqRecord], int_cols: collections.abc.Iterable[str], par: proteometer.params.Params) → pandas.DataFrame

Converts the double-peptide data frame to a site-level data frame.

Parameters:
  • double_pept (pd.DataFrame) – The double-peptide data frame.

  • prot_seqs (list[fasta.SeqRecord]) – The list of protein sequences.

  • int_cols (Iterable[str]) – The names of the columns with intensity values.

  • par (Params) – The parameters for limited proteolysis analysis.

Returns:

A data frame with the site-level data.

Return type:

pd.DataFrame

proteometer.lip.rollup_single_protein_to_lytic_site(df: pandas.DataFrame, int_cols: collections.abc.Iterable[str], uniprot_col: str, sequence: str, residue_col: str = 'Residue', description: str = '', tryptic_pattern: str = 'all', peptide_col: str = 'Sequence', clean_pept_col: str = 'clean_pept', id_separator: str = '@', id_col: str = 'id', pept_type_col: str = 'pept_type', site_col: str = 'Site', pos_col: str = 'Pos', multiply_rollup_counts: bool = True, ignore_NA: bool = True, alternative_protease: str = 'ProK', rollup_func: Literal['median', 'mean', 'sum'] = 'sum') → pandas.DataFrame

Rolls up peptide-level limited proteolysis data to lytic sites.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing peptide data.

  • int_cols (Iterable[str]) – Columns with intensity values to aggregate.

  • uniprot_col (str) – Column name for UniProt IDs.

  • sequence (str) – Protein sequence to analyze against.

  • residue_col (str, optional) – Column name for lytic residues. Defaults to “Residue”.

  • description (str, optional) – Protein description to add to data frame. Defaults to “”.

  • tryptic_pattern (str, optional) – Digestion pattern to filter peptides. Defaults to “all”.

  • peptide_col (str, optional) – Column name containing peptide sequences. Defaults to “Sequence”.

  • clean_pept_col (str, optional) – Column name for cleaned peptide sequences. Defaults to “clean_pept”.

  • id_separator (str, optional) – Separator for IDs. Defaults to “@”.

  • id_col (str, optional) – Column name for IDs. Defaults to “id”.

  • pept_type_col (str, optional) – Column name for peptide types. Defaults to “pept_type”.

  • site_col (str, optional) – Column name for lytic sites. Defaults to “Site”.

  • pos_col (str, optional) – Column name for positions. Defaults to “Pos”.

  • multiply_rollup_counts (bool, optional) – Whether to multiply rollup counts. Defaults to True.

  • ignore_NA (bool, optional) – Whether to ignore NA values. Defaults to True.

  • alternative_protease (str, optional) – Name of the alternative protease. Defaults to “ProK”.

  • rollup_func (Literal["median", "mean", "sum"], optional) – Aggregation function. Defaults to “sum”.

Returns:

DataFrame with rolled-up lytic site data and aggregated statistics.

Return type:

pd.DataFrame
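
The core of a peptide-to-site rollup can be sketched in a few lines. This is an illustration under assumptions, not the library's algorithm: each peptide contributes up to two cleavage sites (the residue just before its N-terminus and its own C-terminal residue), a site is labelled as residue plus 1-based position (for example `K3`), protein termini are not counted as sites, and intensities are aggregated as given. Log-space handling, `multiply_rollup_counts`, and NA handling are deliberately left out, and all names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

ROLLUP = {"sum": sum, "mean": mean, "median": median}


def lytic_sites(peptide: str, protein: str) -> list[tuple[int, str]]:
    """Return (1-based position, residue) for the cleavage sites flanking a peptide."""
    idx = protein.find(peptide) if peptide else -1
    if idx < 0:
        return []
    end = idx + len(peptide)  # 1-based position of the peptide's last residue
    sites = []
    if idx > 0:
        # Residue preceding the peptide; its 1-based position equals idx
        sites.append((idx, protein[idx - 1]))
    if end < len(protein):
        sites.append((end, protein[end - 1]))
    return sites


def rollup_to_sites(intensities: dict[str, float], protein: str, how: str = "sum") -> dict[str, float]:
    """Aggregate peptide intensities over the sites they share."""
    grouped = defaultdict(list)
    for peptide, value in intensities.items():
        for pos, residue in lytic_sites(peptide, protein):
            grouped[f"{residue}{pos}"].append(value)
    aggregate = ROLLUP[how]
    return {site: aggregate(values) for site, values in grouped.items()}


protein = "MAKTEDRLLSKPEG"
site_sums = rollup_to_sites({"TEDR": 10.0, "TED": 4.0, "LLSK": 6.0}, protein)
```

Here `TEDR` and `TED` both start after `K3`, so their intensities are combined at that site, while `D6` receives only the `TED` intensity.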

proteometer.lip.select_lytic_sites(site_df: pandas.DataFrame, site_type: str = 'prok', site_type_col: str = 'Lytic site type') → pandas.DataFrame

Selects lytic sites based on the specified site type.

Parameters:
  • site_df (pd.DataFrame) – Input DataFrame containing lytic site data.

  • site_type (str, optional) – Type of lytic site to select. Defaults to “prok”.

  • site_type_col (str, optional) – Column name for lytic site types. Defaults to “Lytic site type”.

Returns:

Filtered DataFrame with selected lytic sites.

Return type:

pd.DataFrame

proteometer.lip.delta_prok_site(peptide_df: pandas.DataFrame, site_df: pandas.DataFrame, int_cols: list[str], site_type_col: str = 'Type', site_protein_col: str = 'Protein', pept_protein_col: str = 'Protein', protein_length_col: str = 'Protein length', site_pept_col: str = 'Peptide', pept_pept_col: str = 'Peptide', position_col: str = 'Pos', pept_start_col: str = 'pept_start', pept_end_col: str = 'pept_end', rollup_method: Literal['median', 'mean', 'sum'] = 'median') → pandas.DataFrame

Computes exposure values for each lytic (ProK) site.

This is computed as the average log intensity of peptides for which the site is a lytic site minus the average log intensity of peptides that contain the site in their sequence. The averaging function is determined by the rollup_method parameter.

Parameters:
  • peptide_df (pd.DataFrame) – DataFrame containing peptide data.

  • site_df (pd.DataFrame) – DataFrame containing lytic site data.

  • int_cols (list[str]) – List of columns to aggregate.

  • site_type_col (str, optional) – Column name for lytic site types. Defaults to “Type”.

  • site_protein_col (str, optional) – Column name for protein IDs in the lytic site DataFrame. Defaults to “Protein”.

  • pept_protein_col (str, optional) – Column name for protein IDs in the peptide DataFrame. Defaults to “Protein”.

  • protein_length_col (str, optional) – Column name for protein lengths. Defaults to “Protein length”.

  • site_pept_col (str, optional) – Column name for peptides in the lytic site DataFrame. Defaults to “Peptide”.

  • pept_pept_col (str, optional) – Column name for peptides in the peptide DataFrame. Defaults to “Peptide”.

  • position_col (str, optional) – Column name for positions in the lytic site DataFrame. Defaults to “Pos”.

  • pept_start_col (str, optional) – Column name for start positions in the peptide DataFrame. Defaults to “pept_start”.

  • pept_end_col (str, optional) – Column name for end positions in the peptide DataFrame. Defaults to “pept_end”.

  • rollup_method (Literal["median", "mean", "sum"], optional) – Aggregation method to use. Defaults to “median”. The “sum” is done in linear space.

Returns:

DataFrame with delta values for each lytic site.

Return type:

pd.DataFrame
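
The delta described above can be sketched for a single site and a single sample. This is an illustration only, with several assumptions that the documentation does not confirm: intensities are log2 values, a peptide is "cleaved at" a site when it ends at that position or starts right after it, a peptide "contains" the site when it spans the bond after that position, and `None` is returned when either group is empty. The function names and the `(start, end, log_intensity)` tuple layout are hypothetical.

```python
import math
from statistics import mean, median


def _aggregate(log_values, method):
    """Combine log2 intensities; 'sum' adds in linear space and converts back to log2."""
    if method == "sum":
        return math.log2(sum(2.0 ** v for v in log_values))
    return {"mean": mean, "median": median}[method](log_values)


def delta_site(peptides, pos, method="median"):
    """peptides: iterable of (start, end, log2_intensity), 1-based inclusive positions."""
    peptides = list(peptides)
    # Peptides produced by cleavage after residue `pos`
    cleaved = [x for start, end, x in peptides if end == pos or start - 1 == pos]
    # Peptides in which the bond after residue `pos` is intact
    spanning = [x for start, end, x in peptides if start <= pos < end]
    if not cleaved or not spanning:
        return None
    return _aggregate(cleaved, method) - _aggregate(spanning, method)


peptides = [(4, 6, 10.0), (4, 7, 12.0), (7, 11, 8.0)]
delta_median = delta_site(peptides, pos=6)
delta_sum = delta_site(peptides, pos=6, method="sum")
```

For position 6 in this toy data, two peptides are cleaved at the site (log2 intensities 10 and 8) and one spans it (12), so the median-based delta is negative, indicating that the intact form dominates.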