Welcome to DancePartner!¶
DancePartner is a Python package for mining multi-omics relationship networks from literature and databases. Though DancePartner may be organized into an intuitive pipeline, it should be thought of as a toolbox of functions for building multi-omics networks for various needs, whether those networks are derived only from literature, only from databases, or from a mix of the two. We welcome additions to our package and are happy to collaborate with anyone willing to add code to this framework. For simplicity, we will present the code in three different pipelines:
A. Mining from Literature¶
The literature mining pipeline may be summarized in 5 key steps: 1. Pulling Publications, 2. Identifying Entities, 3. Extracting Relationships, 4. Collapsing Synonyms, and 5. Construct Network Table.
1. Pulling Publications¶
Publications may be pulled from any of three databases: PubMed, Scopus, and OSTI. There is also a function to deduplicate papers across databases; it returns a table that may be passed to the paper-pulling function.
Documentation¶
- class DancePartner.pull_papers.pull_papers(output_directory: str, pubmed_ids: list[str] | None = None, scopus_ids: list[str] | None = None, osti_ids: list[str] | None = None, deduped_table: DataFrame | None = None, type: str = 'both', include_summary_file: bool = True, tarball_path: str | None = None, scopus_api_key: str | None = None)¶
Given a list of IDs referencing a literature database, pull available text, prioritizing full text whenever available, then titles and abstracts. A summary file of what was pulled is also generated.
- Parameters:
output_directory (str) – A string indicating the directory path for where to write results.
pubmed_ids (list[str] | None) – A list of PubMed IDs as strings. Only one of pubmed_ids, scopus_ids, or osti_ids can be provided. To use multiple databases, supply a deduped_table instead.
scopus_ids (list[str] | None) – A list of DOIs. Only one of pubmed_ids, scopus_ids, or osti_ids can be provided. To use multiple databases, supply a deduped_table instead.
osti_ids (list[str] | None) – A list of OSTI IDs. Only one of pubmed_ids, scopus_ids, or osti_ids can be provided. To use multiple databases, supply a deduped_table instead.
deduped_table (pandas.core.frame.DataFrame | None) – A pandas DataFrame of deduplicated papers from deduplicate_papers. pubmed_ids, scopus_ids, and osti_ids should all be None when using a deduped table.
type (str) – Either “full text” to pull only full text, “abstract” to pull only abstracts, or “both” to prioritize full text and fall back to abstracts when full text is unavailable.
include_summary_file (bool) – A boolean where True will write a summary .txt file describing the number of papers found from each pull_ranking method.
tarball_path (str | None) – An optional string for the path where to write the (large) tarball files to. Can also be used to specify a tarball path where a previous function run may have saved articles to, which can reduce run time.
scopus_api_key (str | None) – A string API key for Scopus (Elsevier). Only needed when pulling papers from Scopus. See https://dev.elsevier.com/.
- Return type:
Extracted papers as text files in folders, with a summary file
PubMed Example¶
PubMed requires a list of PubMed IDs called PMIDs. To obtain PMIDs, simply enter a query into the search bar of PubMed, click “Save”, select “All results”, and output the format as “PMID”.
pull_papers(pubmed_ids = ["9851916"], output_directory = "papers")
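Because pubmed_ids is documented as a list of strings, a saved PMID export can be read into that form before calling pull_papers. The sketch below assumes the export is a plain-text file with one PMID per line; the helper name read_pmids and the file name in the comment are hypothetical and not part of DancePartner.

```python
from pathlib import Path


def read_pmids(path):
    """Read one PMID per line, skipping blank lines and repeats, keeping order."""
    seen = set()
    pmids = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        pmid = line.strip()
        if pmid and pmid not in seen:
            seen.add(pmid)
            pmids.append(pmid)
    return pmids


# Hypothetical usage, following the documented pull_papers signature:
# pull_papers(pubmed_ids = read_pmids("pmid_export.txt"), output_directory = "papers")
```

Keeping the IDs as strings avoids passing integers where the signature expects list[str].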
Scopus Example¶
Scopus uses DOIs to identify papers. To obtain these DOIs, enter a query into the search bar, click “Export”, select the desired format, select all documents, and then export at least the DOI column. Scopus also requires an API key, passed as a string. See https://dev.elsevier.com/.
# Save the Scopus key as a variable (placeholder value; substitute your own key)
scopus_api_key = "YOUR_SCOPUS_API_KEY"
pull_papers(
scopus_ids = ["10.1186/s40168-021-01035-8", "10.1002/bit.26296", "10.1002/pmic.200300397", "10.1074/mcp.M115.057117"],
output_directory = "scopus_papers",
scopus_api_key = scopus_api_key
)
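The DOI list for scopus_ids can be collected from the exported file rather than typed by hand. This is a minimal sketch that assumes the export is a CSV whose DOI column is headed "DOI"; the helper name read_dois and the in-memory example rows are illustrative only.

```python
import csv
import io


def read_dois(csv_file, column="DOI"):
    """Collect the non-empty values of the DOI column from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return [row[column].strip() for row in reader if (row.get(column) or "").strip()]


# A tiny in-memory stand-in for an export file (titles are placeholders).
example = io.StringIO(
    "Title,DOI\n"
    "Paper one,10.1186/s40168-021-01035-8\n"
    "Paper without a DOI,\n"
    "Paper two,10.1002/bit.26296\n"
)
scopus_ids = read_dois(example)
print(scopus_ids)
```

With a real export, open the file with open(path, newline="") and pass the file object to read_dois.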
OSTI Example¶
To obtain OSTI IDs, enter a query and click “Save Results”, and the resulting file will contain the OSTI IDs.
pull_papers(
osti_ids = ["2229172", "1629838", "1766618", "1379914"],
output_directory = "osti_papers"
)
Deduplication¶
To deduplicate papers across databases, first export the search results from each database as follows:
PubMed: Enter the query, hit search, hit save, and select “All results” and “csv”.
Scopus: Enter the query, hit search, hit export, select “CSV”, and keep all defaults checked.
OSTI: Enter the query, hit search, and save results as a “CSV”.
- class DancePartner.deduplicate_papers.deduplicate_papers(pubmed_path: str | None = None, scopus_path: str | None = None, osti_path: str | None = None)¶
Deduplicate papers across databases.
- Parameters:
pubmed_path (str | None) – The path to the PubMed export of paper information. To obtain, enter the query, hit search, hit save, select “All results” and “csv”.
scopus_path (str | None) – The path to the Scopus export of paper information. To obtain, enter the query, hit search, hit export, select “CSV” and keep all defaults checked.
osti_path (str | None) – The path to the OSTI export of paper information. To obtain, enter the query, hit search, and save results as a “CSV”.
- Return type:
A table with deduplicated papers
# Save the results as a deduplicated table
deduped_table = deduplicate_papers(pubmed_path, scopus_path, osti_path)
# Then pull publications using the deduped table. Use the saved scopus_api_key
pull_papers(deduped_table = deduped_table, output_directory = "deduped_example", scopus_api_key = scopus_api_key)
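To illustrate what deduplicating across databases involves, here is a small sketch that is not DancePartner's implementation. It assumes each paper is a dictionary with hypothetical "source", "doi", and "title" keys, and treats two records as the same paper when their lowercased DOIs match, falling back to a normalized title when no DOI is present.

```python
import re


def paper_key(record):
    """Prefer a lowercased DOI; otherwise use the title without case or punctuation."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    title = re.sub(r"[^a-z0-9]+", " ", (record.get("title") or "").lower()).strip()
    return ("title", title)


def deduplicate(records):
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = paper_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"source": "pubmed", "doi": "10.1002/BIT.26296", "title": "Placeholder title"},
    {"source": "scopus", "doi": "10.1002/bit.26296", "title": "Placeholder title"},
    {"source": "osti", "doi": "", "title": "Another Placeholder!"},
    {"source": "pubmed", "doi": None, "title": "another placeholder"},
]
unique = deduplicate(records)
print([r["source"] for r in unique])
```

One limitation of this simple key: a paper that carries a DOI in one export and none in another would not be matched.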
2. Identifying Entities¶
First, biological terms (entities) need to be defined and found in articles. One option is to use ScispaCy; the other, which is the recommended approach, is to use synonym files. To install ScispaCy, see the “Optional ScispaCy Model” section of the README.
ScispaCy¶
- class DancePartner.extract_terms.extract_terms_scispacy(paper_directory: str, omes_folder: str, tags: list[str] = ['GENE_OR_GENE_PRODUCT', 'SIMPLE_CHEMICAL', 'AMINO_ACID'], additional_stop_words: list[str] | None = None, min_length: int = 3, max_length: int = 100, verbose: bool = False)¶
Extract terms from papers
- Parameters:
paper_directory (str) – Directory to papers in txt format. Subdirectories are searched, and .gz and output_summary.txt files are ignored.
omes_folder (str) – Path to the omes folder where “stop_words_english.txt” is stored. Required.
tags (list[str]) – A list of tags from the en_ner_bionlp13cg_md model. See https://allenai.github.io/scispacy/
additional_stop_words (list[str] | None) – Add more words to be removed from consideration. Default is None.
min_length (int) – The minimum number of non-whitespace characters required. Default is 3.
max_length (int) – The maximum number of characters allowed. Default is 100.
verbose (bool) – Indicate whether a message should be printed as each file is processed. Default is False.
- Return type:
A list of the unique terms found in the papers, each as a string
# Extracting terms requires a path to the papers and a path to the omes folder
extract_terms_scispacy(paper_directory = paper_directory, omes_folder = "../omes")
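The stop-word and length parameters can be pictured as a filter applied to each candidate term. The sketch below is an assumption about how such a filter might behave, not the package's code: min_length counts non-whitespace characters, max_length counts all characters, and the small stop_words set stands in for stop_words_english.txt. The function name keep_term is hypothetical.

```python
def keep_term(term, stop_words, min_length=3, max_length=100):
    """Return True if a candidate term passes the stop-word and length filters."""
    cleaned = term.strip()
    if cleaned.lower() in stop_words:
        return False
    non_whitespace = len("".join(cleaned.split()))
    return non_whitespace >= min_length and len(cleaned) <= max_length


stop_words = {"the", "and", "with"}  # stand-in for stop_words_english.txt
candidates = ["ATP", "the", "p5", "ATP synthase", "and", "ATP"]
terms = sorted({t for t in candidates if keep_term(t, stop_words)})
print(terms)
```

Collecting the survivors in a set before sorting yields the unique terms.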
Synonym Files¶
First, pull a proteome.
- class DancePartner.pull_ome.pull_proteome(proteome_id: str, output_directory: str)¶
Function that pulls a proteome and its synonyms for a species.
- Parameters:
proteome_id (str) – Search for a proteome ID here: https://www.uniprot.org/proteomes/. It starts with “UP”
output_directory (str) – Path specifying where to write the result within the current directory.
- Return type:
Protein IDs and their synonyms in a text file
# Pull a proteome (protein and its synonyms) and place it in the omes folder
pull_proteome(proteome_id = "UP000001940", output_directory = "../omes")
List Synonyms¶
Then, list all synonyms.
- class DancePartner.create_synonym_table.list_synonyms(omes_folder: str, proteome_filename: str, min_length: int = 3)¶
List all possible synonyms to match
- Parameters:
omes_folder (str) – Path to the omes folder. Required.
proteome_filename (str) – Name of the proteome file within the omes folder. Use the full file name. Required.
min_length (int) – Minimum number of characters in a term. Default is 3.
- Return type:
A list of synonyms to find in papers
# List the omes folder and the proteome to use in the omes folder
list_synonyms("../omes", "UP000001940_proteome.txt")
3. Extracting Relationships¶
First, sentences with terms need to be extracted and formatted for the downstream BERT model.
- class DancePartner.find_terms_in_papers.find_terms_in_papers(paper_directory: str, terms: list[str], output_directory: str | None = None, n_gram_max: int = 3, max_char_length: int = 250, padding: int = 10, verbose: bool = False)¶
This function searches through sentences of papers to extract biomolecule pairs present in each sentence. It utilizes a set-intersection method on the n-grams of the sentences with the already-found biomolecule synonyms.
- Parameters:
paper_directory (str) – A directory path pointing to the list of papers to be parsed through.
terms (list[str]) – List of terms to find in papers
output_directory (str | None) – An optional path to a directory for where to write results to. Otherwise, the function will return the table.
n_gram_max (int) – The maximum n-gram size (number of consecutive words) to consider when combing the papers (e.g. n_gram_max=2 will catch “protein A” but n_gram_max=1 will not). If unsure, use the default.
max_char_length (int) – The maximum number of characters that can be in a segment containing the pair of biomolecules
padding (int) – The amount of padding (in characters) to surround the terms in a segment by at minimum.
verbose (bool) – If True, print status messages
- Return type:
A Pandas DataFrame of the resulting data.
# Supply this function with the directory with the papers, a list of terms, and the output directory
find_terms_in_papers(
paper_directory = paper_directory,
terms = my_terms,
n_gram_max = 5,
padding = 50,
output_directory = output_directory
)
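The set-intersection idea described for find_terms_in_papers can be sketched as follows. This is a simplified illustration under stated assumptions (whitespace tokenization, case-insensitive matching, only commas and periods stripped), not the function's actual code; the helper names ngrams and terms_in_sentence are hypothetical.

```python
def ngrams(tokens, n_max):
    """Return the set of all word n-grams of size 1..n_max in a token list."""
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def terms_in_sentence(sentence, terms, n_gram_max=3):
    """Intersect the sentence's n-grams with the known terms, ignoring case."""
    tokens = sentence.lower().replace(",", " ").replace(".", " ").split()
    lookup = {t.lower(): t for t in terms}
    matches = ngrams(tokens, n_gram_max) & set(lookup)
    return sorted(lookup[m] for m in matches)


sentence = "Protein A binds ATP synthase in the membrane."
terms = ["protein A", "ATP synthase", "ATP", "lipid X"]
print(terms_in_sentence(sentence, terms, n_gram_max=2))
print(terms_in_sentence(sentence, terms, n_gram_max=1))
```

With a maximum n-gram size of 1 only single-word terms can match, which is why two-word terms such as “protein A” need a size of at least 2. Note that this sketch also reports a shorter term nested inside a longer match.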
Next, BERT can be run. Download the BERT model from https://huggingface.co/david-degnan/BioBERT-RE/tree/main and place it in the top-level directory of this repo in a folder called “biobert”. Pull the config.json, pytorch_model.bin, and training_args.bin files.
- class DancePartner.bert_functions.run_bert(input_path: str, model_path: str, output_directory: str, segment_col_name: str, **kwargz)¶
Function to prepare a dataframe to be inputted into the BERT model
- Parameters:
input_path (str) – A path to the CSV file to run the model on. Should be a result of find_terms_in_papers
model_path (str) – A path to the folder containing the BERT model. Put the model in a folder within this directory called biobert. Find the model here: https://huggingface.co/david-degnan/BioBERT-RE/tree/main
output_directory (str) – A path where to write the results to
segment_col_name (str) – The name of the column representing the chunk of text containing the pair of biomolecules.
**kwargz – Any additional arguments to pass to TrainingArguments.
- Return type:
Writes a csv file containing the results of the model.
# Create a variable to the output directory. Make sure the biobert model is pulled and in a location that can be referenced by model_path.
run_bert(
input_path = "/path/to/sentence_biomolecule_pairs.csv",
model_path = "../biobert", # Update to the path of your BERT model if necessary. We recommend placing biobert in the top directory.
output_directory = output_directory,
segment_col_name = "segment",
use_cpu = True
)
4. Collapsing Synonyms¶
- class DancePartner.create_synonym_table.map_synonyms(term_list: list[str], omes_folder: str, proteome_filename: str, add_missing: bool = False, output_directory: bool | None = None)¶
Map synonyms to IDs in the order of lipids, metabolites, and finally gene products.
- Parameters:
term_list (list[str]) – List of terms to map to lipidome, metabolome, and proteome.
omes_folder (str) – Path to the omes folder. Required.
proteome_filename (str) – Name of the proteome file within the omes folder. Use the full file name. Required.
add_missing (bool) – If True, add terms that weren’t mapped to synonyms. Optional.
output_directory (bool | None) – A path to a directory for where to write results to.
- Return type:
A table with the synonym, its ID, and the type (gene product, lipid, metabolite)
# A term list, the path to the omes folder, the name of the proteome file to use, and a path to
# the output directory are all needed
map_synonyms(
term_list = all_found_terms,
omes_folder = "../omes",
proteome_filename = "UP000001940_proteome.txt",
add_missing = True,
output_directory = output_directory
)
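The mapping order (lipids, then metabolites, then gene products) means a term found in more than one ome takes the ID from the earliest one. A minimal sketch of that priority lookup, assuming each ome has already been loaded into a dictionary from lowercase synonym to ID; the function name map_term and all IDs shown are hypothetical placeholders.

```python
def map_term(term, lipids, metabolites, gene_products):
    """Return (ID, type) from the first dictionary containing the term, else None."""
    key = term.lower()
    for ome_type, table in (
        ("lipid", lipids),
        ("metabolite", metabolites),
        ("gene product", gene_products),
    ):
        if key in table:
            return table[key], ome_type
    return None


# Placeholder synonym tables; real tables would come from the omes folder.
lipids = {"cholesterol": "LIPID_1"}
metabolites = {"cholesterol": "MET_1", "atp": "MET_2"}
gene_products = {"atpa": "GENE_1"}

print(map_term("Cholesterol", lipids, metabolites, gene_products))
print(map_term("ATP", lipids, metabolites, gene_products))
print(map_term("unknown term", lipids, metabolites, gene_products))
```

Terms that return None here correspond to the unmapped terms that add_missing = True is documented to keep.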
5. Construct Network Table¶
A network table that lists each edge between nodes can finally be built and visualized in a downstream function.
- class DancePartner.construct_network.build_network_table(BERT_data: DataFrame, synonyms: DataFrame)¶
Build a network table of edges with biomolecule IDs and their synonyms
- Parameters:
BERT_data (pandas.core.frame.DataFrame) – The output table from run_bert() as a pandas DataFrame.
synonyms (pandas.core.frame.DataFrame) – The output table from map_synonyms() as a pandas DataFrame.
- Return type:
A network table of synonyms, IDs, types (gene product, metabolite, lipid), and the source (literature or database)
# Pass the BERT table and synonyms
build_network_table(BERT_data = BERT_Table, synonyms = Synonym_Table)
B. Mining from Databases¶
There are functions to mine relationships from UniProt, WikiPathways, and KEGG. There is a file for relationships in LipidMaps. See the vignettes.
- class DancePartner.pull_relationships.pull_uniprot(species_id: str, output_directory: str | None = None, remove_self_relationships: bool = True, verbose: bool = True)¶
Function that pulls protein-protein and protein-metabolite interactions for a species.
- Parameters:
species_id (str) – The taxon ID for the organism of interest.
output_directory (str | None) – Path specifying where to write the result.
remove_self_relationships (bool) – True to remove any relationships to self, and False to maintain them. Default is True.
verbose (bool) – Whether progress messages should be written or not. Default is True.
- Return type:
A dataframe denoting relationships in 7 columns (Synonym1, ID1, Type1, Synonym2, ID2, Type2, Source)
pull_uniprot(
species_id = "1423",
output_directory = output_directory,
remove_self_relationships = True,
verbose = True
)
- class DancePartner.pull_relationships.pull_wikipathways(species_name: str, species_id: str, omes_folder: str, proteome_filename: str, output_directory: str | None = None, remove_self_relationships: bool = True, verbose: bool = False)¶
Extract relationships from metabolic networks stored in WikiPathways
- Parameters:
species_name (str) – The name for the species. Select species from here: https://www.wikipathways.org/browse/organisms.html. Use proper Genus species format
species_id (str) – The taxon ID for the organism of interest
omes_folder (str) – Path to the omes folder
proteome_filename (str) – Name of the proteome file
output_directory (str | None) – Path specifying where to write the result.
remove_self_relationships (bool) – True to remove any relationships to self, and False to maintain them. Default is True.
verbose (bool) – Whether progress messages should be written or not. Default is False.
- Return type:
A dataframe denoting relationships in 7 columns (Synonym1, ID1, Type1, Synonym2, ID2, Type2, Source)
pull_wikipathways(
species_name = "Bacillus subtilis",
species_id = "1423",
omes_folder = "../omes",
proteome_filename = "UP000001570_proteome.txt",
output_directory= output_directory,
verbose = True
)
- class DancePartner.pull_relationships.pull_kegg(kegg_species_id: str, omes_folder: str, proteome_filename: str, output_directory: str | None = None, flatten_module: bool = False, remove_self_relationships: bool = True, verbose: bool = False)¶
Extract relationships from metabolic networks (modules) stored in KEGG
- Parameters:
kegg_species_id (str) – The KEGG organism code for the species (e.g. “bsu”). Select species from here: https://rest.kegg.jp/list/organism
species_id – The taxon ID for the organism of interest
omes_folder (str) – Path to the omes folder
proteome_filename (str) – Name of the proteome file
output_directory (str | None) – Path specifying where to write the result.
flatten_module (bool) – If True, everything in a module will be considered related to everything else in a module. If False, metabolic relationships as defined by KEGG will be preserved. Default is False.
remove_self_relationships (bool) – True to remove any relationships to self, and False to maintain them. Default is True.
verbose (bool) – Whether progress messages should be written or not. Default is False.
- Return type:
A dataframe denoting relationships in 7 columns (Synonym1, ID1, Type1, Synonym2, ID2, Type2, Source)
pull_kegg(
kegg_species_id = "bsu",
omes_folder = "../omes",
proteome_filename = "UP000001570_proteome.txt",
output_directory = output_directory,
flatten_module = False,
verbose = True
)
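The effect of flatten_module = True, where everything in a module is considered related to everything else in it, can be pictured as generating all unordered pairs of module members. The sketch below is an illustration only; the helper name flatten_members and the member labels are placeholders rather than real KEGG identifiers.

```python
from itertools import combinations


def flatten_members(members):
    """Pair every distinct member of a module with every other member."""
    return list(combinations(sorted(set(members)), 2))


pairs = flatten_members(["m1", "m2", "m3", "m2"])
print(pairs)
```

A module with n distinct members yields n * (n - 1) / 2 pairs, so flattening large modules can add many edges compared with preserving the relationships as defined by KEGG.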
C. Building and Combining Networks¶
Combining Networks¶
Network tables can easily be combined with pd.concat. Networks made with other tools can also be visualized in DancePartner, provided they have the seven required columns: Synonym1, ID1, Type1, Synonym2, ID2, Type2, Source. The synonyms for Term 1 and Term 2 (Synonym1 and Synonym2) can be anything, as can the identifiers (ID1 and ID2). Type1 and Type2 must each be “lipid”, “metabolite”, or “gene product”. Source must be either “literature” or “database”.
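Before handing an externally built table to DancePartner, it may help to check the column and value requirements. The following is a minimal sketch, assuming each row is a plain dictionary (for a DataFrame, df.to_dict("records") produces this shape); the helper name check_row and the example identifiers are hypothetical.

```python
REQUIRED_COLUMNS = ["Synonym1", "ID1", "Type1", "Synonym2", "ID2", "Type2", "Source"]
ALLOWED_TYPES = {"lipid", "metabolite", "gene product"}
ALLOWED_SOURCES = {"literature", "database"}


def check_row(row):
    """Return a list of problems found in one network-table row (empty if none)."""
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in row]
    for col in ("Type1", "Type2"):
        if col in row and row[col] not in ALLOWED_TYPES:
            problems.append(f"{col} has unexpected value: {row[col]!r}")
    if "Source" in row and row["Source"] not in ALLOWED_SOURCES:
        problems.append(f"Source has unexpected value: {row['Source']!r}")
    return problems


good_row = {
    "Synonym1": "ATP", "ID1": "id_a", "Type1": "metabolite",
    "Synonym2": "ATP synthase", "ID2": "id_b", "Type2": "gene product",
    "Source": "literature",
}
print(check_row(good_row))
print(check_row(dict(good_row, Type2="protein")))
```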
Duplicate Removal¶
If using your own network files, you may want to consider deduplication of relationships, as they are symmetric, meaning that ID1 & ID2 = ID2 & ID1 (or a relationship between ATP & ATP synthase is the same as a relationship between ATP synthase and ATP).
- class DancePartner.pull_relationships.remove_relationship_duplicates(network_table: DataFrame, remove_self_relationships: bool = True)¶
Remove all duplicates from a network table.
- Parameters:
network_table (pandas.core.frame.DataFrame) – Output of build_network_table, pull_protein_protein_interactions, etc. Use pd.concat to concatenate multiple tables together.
remove_self_relationships (bool) – True to remove any relationships to self, and False to maintain them. Default is True.
- Return type:
A table with unique interacting biomolecules
remove_relationship_duplicates(
network_table = network_table,
remove_self_relationships = True
)
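Because relationships are symmetric, a duplicate check has to ignore the order of the two IDs. A small sketch of that idea, not the package's implementation, assuming rows are dictionaries with "ID1" and "ID2" keys; the function name and example IDs are hypothetical.

```python
def remove_symmetric_duplicates(rows, remove_self_relationships=True):
    """Keep the first row for each unordered ID pair, optionally dropping self-pairs."""
    seen = set()
    unique = []
    for row in rows:
        if remove_self_relationships and row["ID1"] == row["ID2"]:
            continue
        key = frozenset((row["ID1"], row["ID2"]))  # order-independent pair key
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"ID1": "id_a", "ID2": "id_b"},
    {"ID1": "id_b", "ID2": "id_a"},  # same relationship, reversed
    {"ID1": "id_a", "ID2": "id_a"},  # self relationship
    {"ID1": "id_a", "ID2": "id_c"},
]
print(remove_symmetric_duplicates(rows))
```

Using a frozenset as the key makes (ID1, ID2) and (ID2, ID1) compare equal without sorting.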
Visualize & Calculate Network Metrics¶
- class DancePartner.construct_network.visualize_network(network_table: DataFrame, gene_product_color: str = '#D55E00', metabolite_color: str = '#0072B2', lipid_color: str = '#E69F00', literature_color: str = '#56B4E9', database_color: str = '#000000', node_size: int = 30, edge_weight: int = 4, with_labels: bool = False)¶
Visualize a network table
- Parameters:
network_table (pandas.core.frame.DataFrame) – Output of build_network_table, pull_protein_protein_interactions, etc. Use pd.concat to concatenate multiple tables together.
gene_product_color (str) – Hexadecimal for the gene product node color. Default is #D55E00 (vermillion).
metabolite_color (str) – Hexadecimal for the metabolite node color. Default is #0072B2 (blue).
lipid_color (str) – Hexadecimal for the lipid node color. Default is #E69F00 (orange).
literature_color (str) – Hexadecimal for the literature edge color. Default is #56B4E9 (skyblue).
database_color (str) – Hexadecimal for the database edge color. Default is #000000 (black).
node_size (int) – Size of the nodes. Default is 30.
edge_weight (int) – Weight of the edges. Default is 4.
with_labels (bool) – Whether labels should be included or not. Default is False.
- Return type:
A network object and the visualization of that object
visualize_network(network_table = network_table)
- class DancePartner.construct_network.calculate_network_metrics(network: Graph, metric: str = 'all')¶
Calculate network metrics for the multi-omics network.
- Parameters:
network (networkx.classes.graph.Graph) – The output of visualize_network
metric (str) – Either “number of components”, “average component size”, “degree centrality”, “clustering coefficient”, or “all”. Default is “all”.
- Return type:
Network summary metrics
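To make three of the listed metrics concrete, here is a dependency-free sketch of how they can be computed from an edge list. It is not how calculate_network_metrics is implemented (that function takes a networkx Graph); the helper names and the example edges are illustrative, and the graph is assumed to be undirected with no self-loops.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency mapping from (node, node) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def connected_components(adjacency):
    """Return a list of node sets, one per connected component."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def degree_centrality(adjacency):
    """Degree of each node divided by the number of other nodes."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


edges = [("A", "B"), ("B", "C"), ("D", "E")]
adjacency = build_adjacency(edges)
components = connected_components(adjacency)
number_of_components = len(components)
average_component_size = sum(len(c) for c in components) / number_of_components
centrality = degree_centrality(adjacency)
print(number_of_components, average_component_size, centrality)
```

In this example the network has two components ({A, B, C} and {D, E}), so the average component size is 2.5, and node B, which touches two of the four other nodes, has a degree centrality of 0.5.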