pyRBDome pipeline - Ground Truth data analyses
Introduction
pyRBDome is a package for predicting RNA-binding sites in proteins. It combines multiple distinct RNA-binding prediction algorithms to identify putative RNA-binding amino acids within proteins. The algorithms predict RNA-binding propensity from different aspects, for example using only the protein sequence or a combination of protein sequence and protein structure. pyRBDome then aggregates all the data into easily interpretable pdb and pdf files, which can be used to design mutations for more detailed functional analyses. This repository contains all the code and notebooks used for generating and analysing two human ground truth datasets.
Repo Contents
This repository contains the following content:
- analysed_pdbs: Directory containing all the pdb files and RNA-binding site predictions for each individual Uniprot ID. The subdirectories and the files they contain are explained below:
  - pdb_files: This directory contains ALL the structures of protein-RNA complexes containing a protein with a specific Uniprot ID that were downloaded by notebook 1.0. These pdb files should have a resolution of 5.0 Å or better (≤ 5.0 Å). The file names consist of two parts. For example, in B3Y653_5GMF.pdb, B3Y653 is the Uniprot ID and 5GMF is the rcsb.org structure name.
  - distances_in_b-factor: This directory contains the same structure files as the pdb_files directory, but in these files the distances to RNA (in Å) are included in the b-factor column.
  - filtered_pdb_files: This directory contains cleaned-up versions of the pdb files in the distances_in_b-factor directory, in which only ATOM coordinates were kept.
  - distances_merged: This directory contains: (1) A "merged.pdb" file in which the b-factor column contains the shortest distance to RNA for each amino acid. The distance was calculated for each amino acid in all the structures that we analysed for this Uniprot ID, and only the shortest distance to RNA was included in the merged.pdb file. (2) A "domains.pdb" file, which contains only the coordinates for domains that were identified in the protein. (3) A "plip_merged_all.pdb" file, which contains ALL the PLIP analysis results in the b-factor column. A value of 1 indicates that the amino acid was found to have one specific interaction (e.g. hydrogen bond, hydrophobic) with RNA in the structures. Values larger than one indicate amino acids that were found to interact with RNA in multiple PDB files and/or that form more than one bond with RNA in the available structures. (4) "plip_merged" pdb files containing counts for specific types of interaction (e.g. hydrogen bond, pi-stacking) in the b-factor column.
  - PLIP_analyses: This directory contains ALL the files produced by PLIP for the pdb files in the distances_in_b-factor directory.
  - prediction_results: This directory contains ALL the prediction results from the individual tools in the b-factor columns of pdb files. The "FTMap_docked.pdb" files contain the coordinates for the molecules docked onto the structures by FTMap. The "FTMap_distances.pdb" files contain, in the b-factor column, the distance (in Å) from each amino acid in the pdb file to the docked molecules. The Pymol session files, which allow easy visualisation of the different pdb files in a single Pymol session, are also stored in this directory. The "analysis_results.pdb" files contain all the prediction results in the protein sequence. The "model_predictions.pdb" files contain the RNA-binding probability for each amino acid in the b-factor column, calculated using our XGBoost model that was trained on the GT-Distance ground truth dataset (see the manuscript for more details). The "merged.pdb.zip" files contain the unprocessed PST-PRNA results.
- analysis_results: Contains all the results from the statistical analyses that were performed, the distance analysis of the cross-linked peptides and amino acid sequences, the unprocessed prediction results from RNABindRPlus, HydRa and DisoRDPbind, the Interproscan domain analysis results and fasta files for the individual protein sequences. Additionally, this directory contains all the ROC and precision-recall analyses that were performed with our XGBoost models and the GT-PLIP and GT-Distance ground truth datasets.
- pyRBDome_analyses: This directory contains:
  - ground_truth_notebooks: All the Jupyter notebooks that were used for the analyses, the "settings.yaml" configuration file for running the notebooks, and the xgboost_models directory. The latter contains all the XGBoost models that we generated using all the prediction results with the GT-PLIP ground truth dataset (all_model.xgb), all the prediction results with the GT-Distance ground truth dataset (all_pocked_model.xgb), and various combinations of prediction results.
  - RBSID_human_data.xlsx: Dataset used for testing the pyRBDome pipeline.
  - UP000005640_9696.fasta: Human proteome fasta file used for our analyses.
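Because most of the files described above carry their per-residue values in the b-factor column, they can be read with a few lines of Python. The sketch below is only an illustration and is not part of the pyRBDome package: the function names are hypothetical, the column positions follow the standard fixed-width PDB ATOM record, and the merge step assumes that all structures share the same chain IDs and residue numbering.

```python
from pathlib import Path


def split_pdb_name(filename):
    """Split e.g. 'B3Y653_5GMF.pdb' into (Uniprot ID, rcsb.org structure name)."""
    uniprot_id, pdb_id = Path(filename).stem.split("_", 1)
    return uniprot_id, pdb_id


def residue_bfactors(pdb_lines):
    """Return {(chain, residue_number): b_factor} taken from ATOM records."""
    values = {}
    for line in pdb_lines:
        if line.startswith("ATOM") and len(line) >= 66:
            # Standard PDB columns: chain = 22, residue number = 23-26, b-factor = 61-66
            key = (line[21], int(line[22:26]))
            # Assumption: every atom of a residue carries the same value,
            # so the first atom encountered is enough.
            values.setdefault(key, float(line[60:66]))
    return values


def merge_shortest(per_structure_values):
    """Keep, for every residue, the smallest value seen in any structure."""
    merged = {}
    for values in per_structure_values:
        for key, distance in values.items():
            merged[key] = min(distance, merged.get(key, distance))
    return merged
```

For example, `split_pdb_name("B3Y653_5GMF.pdb")` returns `("B3Y653", "5GMF")`, and `merge_shortest` applied to the dictionaries read from several distances_in_b-factor files mimics how a merged.pdb file keeps only the shortest distance to RNA.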
System Requirements
Hardware Requirements
These data analyses are computationally intensive. We routinely run the jobs on 20-40 processors, so a server with ~64 cores and at least 64 GB of RAM is recommended.
Runtimes vary considerably, depending on how large the sequencing data files are, but a typical run takes several hours to complete.
Software Requirements
OS Requirements
The installer has been tested on Mac OS 12.7 and extensively on Ubuntu versions 12.04 to 23.10. It requires Python 3.9. It is likely to run fine on Python 3.10, but it has not been tested on later versions.
Unfortunately, some of the code in pyRBDome is not compatible with Windows, and we will not be supporting this operating system.
Other dependencies:
These dependencies need to be installed in case you want to run the Jupyter notebooks described in the pyRBDome pipeline.
- Python 3.9
- pyRBDome classes and functions: https://git.ecdf.ed.ac.uk/sgrannem
- ncbi-blast-2.13.0+ (For building AlphaFold models https://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/)
- interproscan-5.61-93.0 (https://interproscan-docs.readthedocs.io/en/latest/HowToRun.html)
- MM-align (https://zhanggroup.org/MM-align/)
- RNABindRPlus-Local (https://github.com/jpetucci/RNABindRPlus-Local)
The pipeline uses Python Selenium to connect to various web servers. For this it installs the Linux version of Chrome and the associated chromedriver on your system. To run the pipeline on Mac OS, you therefore need to download Chrome and chromedriver manually and place them in the Chrome folder contained in the pyRBDome package. Make sure that the chromedriver version is exactly the same as the Chrome version that you downloaded. You then need to modify the config.py file in this package to tell the tool where to find the binaries:
import os

PACKAGE_DIR = os.path.dirname(os.path.abspath(__file__))
# The package's default Chrome path is for Linux. Change it as shown here to run the package on Mac OS.
CHROME_PATH = os.path.join(PACKAGE_DIR, 'Chrome', 'chrome-macos', 'chrome')
# Make sure that chromedriver has exactly the same version as Chrome.
CHROME_DRIVER_PATH = os.path.join(PACKAGE_DIR, 'Chrome', 'chromedriver-macos', 'chromedriver')
# Add the chromedriver directory to PATH so that the driver can be found.
os.environ['PATH'] = f"{os.environ['PATH']}:{os.path.dirname(CHROME_DRIVER_PATH)}"
Then you can install the package as described below.
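Since the Chrome and chromedriver versions must match exactly, it can help to compare them before running the pipeline. The following is a minimal sketch and not part of pyRBDome: the helper names are hypothetical, and it assumes that both binaries print a dotted version number when called with `--version`.

```python
import re


def extract_version(version_output):
    """Return the first dotted version number (e.g. '116.0.5845.96') in a string."""
    match = re.search(r"\d+(?:\.\d+)+", version_output)
    return match.group(0) if match else None


def versions_match(chrome_output, driver_output):
    """True only if both strings contain exactly the same version number."""
    chrome = extract_version(chrome_output)
    driver = extract_version(driver_output)
    return chrome is not None and chrome == driver
```

The two input strings could be obtained with, for example, `subprocess.run([CHROME_DRIVER_PATH, "--version"], capture_output=True, text=True).stdout`, using the paths defined in config.py.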
Installation Guide
To be able to run the pipeline, you need to install the pyRBDome package. We strongly recommend that you install Anaconda first and then create a new Python environment:
conda create -n pyrbdome python=3.9
Then activate this conda environment:
conda activate pyrbdome
To download the package you need to install 'git' and run the following command:
git clone https://git.ecdf.ed.ac.uk/sgrannem/pyrbdome-dev-082023.git
Then go into the directory:
cd pyrbdome-dev-082023
Then run the following command:
pip install -e .
Make sure to include the dot ('.') at the end of this command, otherwise it won't run! This command will also install Jupyter Lab, which you need for running the test notebooks and the pyRBDome data analysis notebooks.
pyRBDome Ground Truth analysis pipeline description
The pipeline that we describe in our manuscript consists of two parts: the pyRBDome code and the pyRBDome notebooks. The former contains all the scripts, functions, and classes that users need in order to execute the Jupyter notebooks on their local machine (https://git.ecdf.ed.ac.uk/sgrannem/pyrbdome-dev-082023). The code has been thoroughly tested on Ubuntu Linux operating systems and can readily be adapted to work on Mac OS (12.7 and above). Details on how to install the packages and run the notebooks can be found in the README files on our repository (https://git.ecdf.ed.ac.uk/sgrannem). For each protein, pyRBDome will automatically run the predictions online or locally, then rename the results and store them in designated directories. Meanwhile, the status of the running processes and the results are stored and updated in an SQLite database. The database helps to avoid resubmitting already analysed pdb files to the prediction servers, and result tables can be exported from it for further analysis (details can be found in pyRBDome/Functions).
Because progress and results are recorded in the SQLite database, the user can keep track of the proteins for which (model) structures have been downloaded and of whether these structures were analysed successfully by each prediction algorithm. Each Jupyter notebook has a unique number, and a detailed description of the analyses that each notebook performs is outlined below. All the notebooks can also be run sequentially in a Unix or Linux terminal using papermill (https://papermill.readthedocs.io), which is installed automatically with the pyRBDome package.
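The idea of using an SQLite database to avoid repeated submissions can be illustrated with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the table name, column names and status values are hypothetical and do not reflect the actual pyRBDome database schema.

```python
import sqlite3


def open_progress_db(path=":memory:"):
    """Open (or create) a database with one row per pdb file and prediction tool."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS progress ("
        "pdb_file TEXT, tool TEXT, status TEXT, "
        "PRIMARY KEY (pdb_file, tool))"
    )
    return con


def record_status(con, pdb_file, tool, status):
    """Insert or update the status of one analysis."""
    con.execute(
        "INSERT OR REPLACE INTO progress VALUES (?, ?, ?)",
        (pdb_file, tool, status),
    )
    con.commit()


def needs_submission(con, pdb_file, tool):
    """True unless this file was already analysed successfully by this tool."""
    row = con.execute(
        "SELECT status FROM progress WHERE pdb_file = ? AND tool = ?",
        (pdb_file, tool),
    ).fetchone()
    return row is None or row[0] != "done"
```

A notebook following this pattern would call `needs_submission` before contacting a web server and `record_status` once the results have been downloaded, so that an interrupted run can be resumed without resubmitting finished jobs.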
1. Finding all available (model) structures for each Uniprot ID.
Available pdb files (≤ 5 Å resolution) associated with the Uniprot IDs listed in the RBS-ID data (Bae et al, 2020) were downloaded from rcsb.org (Berman et al, 2000) using notebook 1.0. Notebook 1.1 subsequently processes all the pdb files and cleans them up by removing any non-essential text from the files.
2. Getting protein domains from Pfam.
After all the pdb files have been downloaded, notebook 1.2 will then use the Interproscan tool (Jones et al, 2014; Blum et al, 2021) to download all the domain information associated with these proteins. Only Pfam domains are considered, and the user needs to install the Interproscan tool separately.
3. Creating peptide control datasets.
Notebook 1.3 takes the protein sequence from each pdb file and digests it in silico with Trypsin and Lys-C to generate a library of all possible peptides that could theoretically be detected by the mass spectrometer for the protein of interest. If cross-linked peptide sequences were provided, notebook 1.4 will generate a library of random peptides that have exactly the same length distribution as the cross-linked peptides but were randomly extracted from the protein sequence.
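The two control datasets can be sketched in a few lines of Python. This is an illustration only, not the code used in notebooks 1.3 and 1.4. It assumes a simplified cleavage rule (cut after every K and R, no missed cleavages, no proline exception), and the function names are hypothetical.

```python
import random


def digest(sequence):
    """Cleave C-terminal to every K and R (simplified Trypsin/Lys-C rule)."""
    peptides, start = [], 0
    for i, amino_acid in enumerate(sequence):
        if amino_acid in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return peptides


def random_peptides(sequence, crosslinked_peptides, seed=0):
    """Draw one random fragment per cross-linked peptide, matching its length."""
    rng = random.Random(seed)
    fragments = []
    for peptide in crosslinked_peptides:
        length = len(peptide)
        if length > len(sequence):
            continue  # cannot draw a fragment longer than the protein
        start = rng.randrange(len(sequence) - length + 1)
        fragments.append(sequence[start:start + length])
    return fragments


print(digest("PSRKDPKYREWHHFL"))  # ['PSR', 'K', 'DPK', 'YR', 'EWHHFL']
```

Drawing one random fragment per cross-linked peptide guarantees that the random library has the same length distribution as the cross-linked peptides.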
4. Performing RNA/ligand-binding sites predictions.
To predict RNA/ligand-binding sites on the proteins under study, we chose six different prediction algorithms, including aaRNA, BindUP, FTMap, RNABindRPlus and DisoRDPbind (Walia et al, 2014; Peng & Kurgan, 2015; Paz et al, 2016; Mehio et al, 2010). The notebooks automatically submit all the pdb files to the respective web servers, download the results, and store the progress they have made with the analyses in the SQLite database. To further increase the performance of the pipeline, we have recently also implemented the PST-PRNA deep learning approach (Li & Liu, 2022) in our notebooks, which predicts putative RNA-binding amino acids using only the surface topology of the proteins in the structures.
5. Mapping the cross-linked amino acid and peptide sequences to the pdb files.
Notebook 3.0 takes the cross-linked, in silico digested and random peptide sequences and maps them to the pdb files. Once the peptides have been mapped, it determines the location of the cross-linked amino acids, if this information was provided. For example, if the peptide sequence “PSRKDPKYREWHHFL” is analysed by this notebook and can be mapped to a pdb file sequence, the notebook reports the start and end residue numbers of the peptide in the pdb file and the chain it was mapped to. For this example, the code returned the following result: 74A_psrkdpkyrewhhfl_88A. This shows that the peptide was mapped to residues 74 to 88 of chain A in the pdb file.
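The label format in this example can be reproduced with a short Python sketch. It is an assumption-laden illustration rather than the notebook's implementation: the function name is hypothetical, the chain is passed in as a list of (residue number, one-letter code) pairs, and gaps in the residue numbering are not handled.

```python
def map_peptide(peptide, chain_id, residues):
    """Return '<start><chain>_<peptide>_<end><chain>' or None if not found.

    residues: list of (residue_number, one_letter_code) in chain order.
    """
    sequence = "".join(code for _, code in residues)
    start = sequence.find(peptide.upper())
    if start == -1:
        return None  # peptide is not present in this chain
    first = residues[start][0]
    last = residues[start + len(peptide) - 1][0]
    return f"{first}{chain_id}_{peptide.lower()}_{last}{chain_id}"


# Toy chain whose first residue is numbered 70
chain_a = list(enumerate("MAAA" + "PSRKDPKYREWHHFL" + "GG", start=70))
print(map_peptide("PSRKDPKYREWHHFL", "A", chain_a))  # 74A_psrkdpkyrewhhfl_88A
```

Using the residue numbers stored in the pdb file, rather than positions in the sequence string, keeps the label consistent with what is displayed in a structure viewer.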
6. Storing the results in pdb files
Notebook 4.0 collects all the prediction results and any domain and mapped peptide/amino acid information and stores the results in the b-factor columns of pdb files. This makes it possible to visualise the results in Pymol or other viewers.
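Writing a per-residue score into the b-factor column amounts to overwriting columns 61-66 of each ATOM record. The sketch below shows the principle; it is not the code from notebook 4.0, the function name is hypothetical, and it assumes standard fixed-width PDB lines.

```python
def set_bfactors(pdb_lines, scores, default=0.0):
    """Return a copy of the lines with per-residue scores in the b-factor column.

    scores: {(chain, residue_number): value}; residues without a score get `default`.
    """
    updated = []
    for line in pdb_lines:
        if line.startswith("ATOM") and len(line) >= 66:
            # Standard PDB columns: chain = 22, residue number = 23-26
            key = (line[21], int(line[22:26]))
            value = scores.get(key, default)
            # Columns 61-66 hold the b-factor, written as a 6-character float
            line = f"{line[:60]}{value:6.2f}{line[66:]}"
        updated.append(line)
    return updated
```

A viewer such as Pymol can then colour the structure by b-factor to display the stored scores.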
7. Gathering all the results and storing the data into large tables.
Notebook 5.0 gathers all the prediction results and the ground truth information and stores them in large tables and in the SQLite database for further downstream analyses. These tables are also exported as CSV files to the analysis_results folder. The notebooks in the pyRBDome analyses of the ground truth datasets also contain extra code that adds the distances to RNA molecules for each amino acid for all protein-RNA structures that were analysed.
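The distance of an amino acid to RNA can be understood as the smallest distance between any of its atoms and any RNA atom. The following is a minimal sketch of that calculation; the function name is hypothetical, and which atoms the notebooks actually include is not specified here.

```python
import math


def shortest_distance(residue_coords, rna_coords):
    """Smallest Euclidean distance (in Å) between two sets of (x, y, z) coordinates."""
    return min(
        math.dist(residue_atom, rna_atom)
        for residue_atom in residue_coords
        for rna_atom in rna_coords
    )


residue = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
rna = [(4.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
print(shortest_distance(residue, rna))  # 3.0
```

This brute-force loop is fine for single residues; for whole structures a spatial index would be faster.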
8. Tripeptide motif analyses
Notebook 6.0 grabs all the RNA-binding regions from the analysed proteins and searches for enriched k-mers/tripeptides in the data.
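A simple way to look for enriched tripeptides is to compare k-mer frequencies in the RNA-binding regions with those in a background set of sequences. The sketch below illustrates this idea only; the scoring scheme (a frequency ratio with a pseudocount) and the function names are assumptions, not the statistics used in notebook 6.0.

```python
from collections import Counter


def count_kmers(sequences, k=3):
    """Count all overlapping k-mers in a list of sequences."""
    counts = Counter()
    for sequence in sequences:
        for i in range(len(sequence) - k + 1):
            counts[sequence[i:i + k]] += 1
    return counts


def kmer_enrichment(binding_regions, background, k=3, pseudocount=1.0):
    """Ratio of k-mer frequency in binding regions versus background."""
    fg = count_kmers(binding_regions, k)
    bg = count_kmers(background, k)
    fg_total = sum(fg.values()) + pseudocount
    bg_total = sum(bg.values()) + pseudocount
    return {
        kmer: ((fg[kmer] + pseudocount) / fg_total)
        / ((bg[kmer] + pseudocount) / bg_total)
        for kmer in fg
    }
```

Ratios well above 1 point to tripeptides that occur more often in the binding regions than in the background; a proper analysis would add a significance test and multiple-testing correction.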
9. Binary classification analyses.
Notebooks 6.1.1 and 6.1.2 process the prediction results so that they can be used for training the XGBoost models. Notebook 6.1.1 uses the GT-PLIP dataset for training and testing, whereas notebook 6.1.2 uses the GT-Distance dataset for this purpose. Given that RNA-interacting amino acids made up less than 5-10% of all amino acids, we undersampled the majority class (i.e. the 0's) in our training data to reduce the imbalance in the dataset. Using Python Scikit-learn and the Optuna hyperparameter optimisation framework (Akiba et al, 2019), we optimised the parameters for our XGBoost models. All the models are available from our repository. This notebook also describes the analyses we performed to test for overfitting. The observation that the validation and training errors were close implies that the model is not significantly overfitting the data.
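Undersampling the majority class can be sketched with the Python standard library alone. This is a simplified illustration, not the code from notebooks 6.1.1 and 6.1.2: the function name is hypothetical, and the 1:1 class ratio after sampling is an assumption, since the ratio used for the published models is not stated here.

```python
import random


def undersample_majority(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    # Keep all minority rows plus an equally sized random sample of the majority
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]
```

Only the training data should be undersampled in this way, so that the test data keep the original class balance.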
10. Predicting RNA-binding residues for proteins using our XGBoost models.
Notebook 6.2 takes all the prediction results available in the large table (produced by notebook 5.0), feeds them to our XGBoost models, and calculates an RNA-binding probability for each amino acid in each protein. These probabilities are then written to PDB files, in which the RNA-binding probability of each amino acid is stored in the b-factor column.
11. Analysis of cross-linked peptide and amino acid sequences
Notebooks 6.3 and 6.4 compare the cross-linking data to the GT-PLIP and GT-Distance ground truth datasets as well as to the predictions from the different tools that pyRBDome employs. Notebook 6.3 determines whether cross-linked peptides and amino acids (where available) are significantly enriched for predicted RNA-binding sites compared to the random peptide datasets and the peptides generated by Trypsin/Lys-C digestion of the protein sequences. Notebook 6.4 performs similar analyses, but here the cross-linking data are compared to the ground truth datasets.
12. Making the pdf and pymol session output files
The series 7 notebooks gather all the prediction and cross-linking information from the pdb files that were produced by notebook 4.0 and place it in a large table that stores the RNA-binding probabilities provided by each algorithm as well as the locations of cross-linked peptides and amino acid residues. The notebooks in the pyRBDome analyses of the ground truth dataset also contain extra code that adds the distances to RNA molecules for each amino acid for all protein-RNA structures that were analysed. Notebook 7.1 takes all the analysis results and, for each protein, produces pdf files summarising all the results in the protein sequence. The score bars in the pdf files indicate the XGBoost RNA-binding probabilities for each amino acid. Notebook 7.2 generates Pymol session files that enable the user to conveniently load all pdb files into a single Pymol session.