Data Science Portfolio
Patient survival analysis by Kaplan-Meier estimates
Summary
During my graduate studies, I contributed to a project evaluating protein targets of potential
therapeutic interest in RET-related cancers. This project combined insights
from several experimental techniques, mainly synthetic lethality assays,
gene expression of CRISPR-Cas9 knock-out cell models, and protein-protein interactions.
To add more information to this study, I took a list of our top genes of interest and evaluated whether
patient data showed a clinical pattern consistent with our synthetic lethality findings. I retrieved
clinical and RNA-seq gene expression data from a breast cancer study (METABRIC), and generated
Kaplan-Meier survival curves for ER+ patients, where RET is not the driver gene, but its
expression can enhance tumor progression.
I divided the dataset into 4 groups according to the expression of RET and one other gene at
a time (low-low, low-high, high-low, and high-high). I generated plots for a large number of
RET-gene pairs, and screened them to find the pairs where only the low-low curve is above the other 3
(that is, higher survival probability for low-low, with the other three curves overlapping, meaning similar survival).
I identified 10 RET-gene candidates that show the behavior of interest in this breast cancer
study (top 3: GRK7, NTHL1, and RUVBL1). These findings were considered in the ongoing project, and
during the process I built tools (Jupyter/Colab Notebooks and a Streamlit app) that automate the generation
of KM plots and make it possible to explore other diseases/studies.
More Context
We obtained a list of 1,015 genes that showed significant synthetic lethality with RET in cell assays (based on fitness scores and p-values). This indicates that cancer cells died when both RET and the partner gene were knocked down, while healthy cells remained unaffected. In addition, this only occurs when the two selected genes are "removed" together, whereas "removing" only one of them or leaving both intact does not kill the cancer cells. To look for a similar pattern in patient data, I retrieved the METABRIC dataset deposited in cBioPortal, filtered it to keep the ER+ patients marked as having died from the disease (cancer), and subdivided the data into four groups for a single gene pair as mentioned above. Survival curves for each of the subdatasets were then calculated using the KaplanMeierFitter class from the lifelines Python library. This process was repeated to create KM plots for each RET-gene pair, where I used one of the ~1k genes that show synthetic lethality. All the resulting plots were reviewed, and the ones showing the behavior of interest (10 of them) were analyzed.
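The grouping and curve estimation described above can be sketched in a few lines of Python. This is a minimal, dependency-free illustration and not the notebook code: it swaps lifelines' KaplanMeierFitter for a hand-written product-limit (Kaplan-Meier) estimator, assumes a median cutoff for low versus high expression (the actual cutoff is not stated here), and runs on made-up toy records with a hypothetical partner gene name, GENE_X.

```python
from statistics import median


def kaplan_meier(durations, events):
    """Product-limit estimate as a list of (time, survival probability).

    durations: follow-up times; events: 1 = death observed, 0 = censored.
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


def split_by_expression(patients, gene_a, gene_b):
    """Assign patients to low-low, low-high, high-low, or high-high.

    ASSUMPTION: "high" means above the cohort median for that gene.
    """
    cut_a = median(p[gene_a] for p in patients)
    cut_b = median(p[gene_b] for p in patients)
    groups = {"low-low": [], "low-high": [], "high-low": [], "high-high": []}
    for p in patients:
        level_a = "high" if p[gene_a] > cut_a else "low"
        level_b = "high" if p[gene_b] > cut_b else "low"
        groups[f"{level_a}-{level_b}"].append(p)
    return groups


# Made-up toy records (NOT METABRIC data); GENE_X is a hypothetical gene name
patients = [
    {"RET": 0.1, "GENE_X": 0.2, "months": 150, "event": 0},
    {"RET": 0.2, "GENE_X": 0.9, "months": 60, "event": 1},
    {"RET": 0.3, "GENE_X": 0.1, "months": 130, "event": 1},
    {"RET": 0.4, "GENE_X": 0.8, "months": 75, "event": 1},
    {"RET": 0.6, "GENE_X": 0.3, "months": 55, "event": 1},
    {"RET": 0.7, "GENE_X": 0.7, "months": 40, "event": 1},
    {"RET": 0.8, "GENE_X": 0.4, "months": 80, "event": 0},
    {"RET": 0.9, "GENE_X": 0.6, "months": 65, "event": 1},
]

groups = split_by_expression(patients, "RET", "GENE_X")
curves = {
    label: kaplan_meier(
        [p["months"] for p in members], [p["event"] for p in members]
    )
    for label, members in groups.items()
}
```

Each entry in `curves` holds the (time, survival probability) steps for one expression group, which is the information a KM plot draws. Repeating the call with a different second gene name gives the curves for the next RET-gene pair.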
Problems
- We needed to retrieve clinical and gene expression data from a breast cancer study.
- We needed to generate over 1,000 plots to screen RET-gene pairs.
- Once we identified the 10 candidates, we needed to do complementary survival analyses to evaluate other features.
Solutions
- I found the METABRIC study, which has survival and gene expression data for over 2,000 patients.
- I made Jupyter notebook tools (see Tool 3) to automate the generation of KM plots, using the same data processing each time and varying only the gene name.
- I generated a Streamlit data app that allows quick and interactive generation of KM plots with user-selected features.
Check the Streamlit tool I made here: Demo_KM_plotter
Searching for cancer-specific protein interactions by Proximity Ligation Assays
Summary
As part of my Ph.D. project, I used PLA assays to identify proteins that
interact only with mutant receptors present in thyroid cancers, and not with normal
receptors present in healthy cells.
For this purpose, I developed a full workflow: from automated imaging, to image processing and
quantification in ImageJ/Fiji. Due to the large number of images acquired, I wrote Python
scripts to automate some steps of the workflow or speed up others.
Using these tools, I was able to identify 2 protein candidates that showed the behavior
of interest (see figure for CRK→). These and future findings using my workflow provide valuable insights
that can be translated into therapeutic strategies.
More Context
Based on research and information provided by collaborators who perform PLA experiments, I carried out several
small-scale tests. However, I found that under our conditions (antibodies, cell lines, treatments) the
variability and background signal were high enough that larger sample sizes would be required (my first tests used <50
images of individual cells). In addition, I observed that confocal resolution was not essential for good results, so I
evaluated other options. Finally, I decided to use the EVOS M7000 imager (Thermo Fisher), which has automated
imaging capabilities and decent resolution of PLA objects at 40x magnification.
Here I illustrate the workflow I generated (→), including the extra steps required due to the large
number of images that I acquired with the EVOS imager.
However, the large number of images brought several complications, since my lab mostly analyzed
small-scale experiments manually, which was very time consuming. Thus, I developed
data solutions for as many stages as I could (all marked with an * symbol before their name).
Problems
- The imager saved the raw images as single-channel, single-slice images instead of hyperstacks. This provided 12 images instead of 1.
- I acquired 300+ hyperstacks/fields of view (FOVs), which was too many for manual review.
- All FOVs required pre-processing before quantification (Z-projection and background subtraction).
- We needed to manually draw and save two .roi files for each cell to be quantified (100-500 per experimental group).
- For the quantification, each big FOV had to be split, and the PLA channel had to be isolated and thresholded. ImageJ/Fiji offers different pre-set thresholding methods, but they produce different results depending on the input images and features.
- The quantification results needed to be reviewed to evaluate whether the thresholding and object detection parameters were appropriate. These sometimes needed to be slightly adjusted for each experiment to produce the best results.
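The first problem in this list, reassembling single-channel, single-slice files into one stack per field of view, can be illustrated with a short Python sketch that groups files by tokens in their names. This is only an illustration and not the ImageJ/Fiji script itself: the file naming pattern, the `group_by_fov` helper, and the split of the 12 files into 3 channels and 4 slices are all assumptions made for the example.

```python
import re
from collections import defaultdict

# HYPOTHETICAL naming pattern, e.g. "plate1_f03_z2_c1.tif";
# the real imager output names may look different.
PATTERN = re.compile(r"_f(?P<fov>\d+)_z(?P<z>\d+)_c(?P<ch>\d+)\.tif$")


def group_by_fov(filenames):
    """Map each FOV number to its files, ordered by (channel, slice)."""
    stacks = defaultdict(list)
    for name in filenames:
        match = PATTERN.search(name)
        if match is None:
            continue  # ignore files that do not follow the pattern
        fov = int(match["fov"])
        stacks[fov].append((int(match["ch"]), int(match["z"]), name))
    return {
        fov: [name for _, _, name in sorted(items)]
        for fov, items in stacks.items()
    }


# Toy file list: 2 FOVs, each with an ASSUMED 3 channels x 4 slices = 12 files
files = [
    f"plate1_f{fov:02d}_z{z}_c{ch}.tif"
    for fov in (1, 2)
    for ch in (1, 2, 3)
    for z in (1, 2, 3, 4)
]
stacks = group_by_fov(files + ["notes.txt"])
```

Each value in `stacks` is the ordered file list for one hyperstack, so a later step can open the files in that order and combine them.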
Solutions
- I wrote a script to merge output images to hyperstacks in ImageJ/Fiji by analyzing the output image file name which contained cues about the FOV, slice, and channel numbers.
- I wrote a script for grid stitching in ImageJ/Fiji, to produce 10 big FOVs per experimental group. This was much more manageable and still easy to manually navigate and review.
- I wrote a script for pre-processing in ImageJ/Fiji, to Z-project and apply background subtraction.
- Due to time constraints, I could not implement a solution to identify the cells/areas of interest and draw the contours for the .roi files. However, I wrote 3 scripts to save and handle the areas in ImageJ/Fiji, which significantly sped up the process.
- I wrote a script for quantification in ImageJ/Fiji that allows the user to select the thresholding method to use (methods panel and/or Find Maxima plug-in). Also, the user can enter the parameters for PLA object detection (size, circularity).
- I made a Streamlit data app to summarize quantification outputs. The app automates the generation of PowerPoint presentations with images and object count results.
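To make the size and circularity parameters mentioned above more concrete, here is a minimal Python sketch of how such limits select objects. It is not the quantification script: the `filter_objects` helper, the object measurements, and the cutoff values are made up for illustration. The circularity formula, 4π × area / perimeter², is the definition ImageJ uses, where 1.0 corresponds to a perfect circle.

```python
import math


def circularity(area, perimeter):
    """ImageJ-style circularity: 4*pi*area / perimeter**2."""
    return 4 * math.pi * area / perimeter ** 2


def filter_objects(objects, min_area, max_area, min_circularity):
    """Keep objects whose area and circularity fall within the limits."""
    return [
        obj
        for obj in objects
        if min_area <= obj["area"] <= max_area
        and circularity(obj["area"], obj["perimeter"]) >= min_circularity
    ]


# Made-up measurements in pixels (NOT experimental data)
objects = [
    {"id": 1, "area": 12.0, "perimeter": 12.5},   # compact spot
    {"id": 2, "area": 12.0, "perimeter": 30.0},   # elongated streak
    {"id": 3, "area": 400.0, "perimeter": 71.0},  # too large for a punctum
    {"id": 4, "area": 2.0, "perimeter": 5.1},     # too small, likely noise
]

# Hypothetical cutoffs; in practice these would be tuned per experiment
kept = filter_objects(objects, min_area=4, max_area=100, min_circularity=0.5)
```

Exposing the limits as parameters, rather than hard-coding them, is what allows small per-experiment adjustments without editing the script.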
Results
- I consolidated my merge, stitch, and pre-process scripts into a single script that takes a few initial parameters and automates these steps by chaining their inputs and outputs (see script).
- The only manual parts left are selecting the areas (cells) of interest and analyzing the results, since my research group uses GraphPad Prism. Everything else has been successfully automated. Most scripts take several input parameters that allow reusability, and crucial parts of the code are flexible enough for fine-tuning (more/fewer channels, slices, FOVs, areas, etc.).
- Approximately 10,000 images of individual cells were analyzed (6 different interactions, 7 experimental groups each).
- CRK was identified as a protein that interacts with mutant RET receptors found in thyroid cancers but not with normal receptors. The box plot below shows the same data as the dot plot above, together with the statistical analysis results. The dotted line is the additional threshold I used in my experiments (20 normalized puncta per cell) to consider an interaction real given the background signal. All groups in this plot have sample sizes in the range of 140-270 (individual cell images).
- More details and data can't be shared at the moment. The final version of this figure as well as the figures for the other protein interactions I evaluated are temporarily restricted as they are part of ongoing projects. Most of my PLA experiments were done in 2022 but I wasn't able to publish them all before I graduated in Sept 2023 (my thesis will be made available to the public in Dec 2028). This restriction is to allow my former research group to use the data in manuscripts for publication.
Check the GitHub page showcasing some of the scripts mentioned here: PLA ImageJ scripts
Check the Streamlit data app I made here: Demo_PLA_PPTX
More about my work
Check my other GitHub pages
My GitHub Stats: