003 - Interactive Kaplan-Meier plotter

Summary

During my graduate studies, I helped two bachelor's students in my research group with a small project analyzing survival data from a breast cancer clinical trial. The aim was to assess whether low expression of two genes at the same time correlated with higher patient survival.

For this purpose, we retrieved publicly available data from cBioPortal and generated Kaplan-Meier survival curves. We focused on the METABRIC dataset, which contains information for over 2,500 patients, and I built tools in Jupyter/Colab notebooks and a Streamlit app (see GIF) to automate the creation of these plots with Python.

At the end of the project, we were able to identify fewer than 10 gene pairs showing the behavior of interest. That information was combined with other data from different techniques (in silico and in vitro) to prioritize further studies evaluating the effect of inhibiting those genes in cancer cell models.



Try the app on → Streamlit Cloud or GitHub Codespaces

Instructions and Sample data (Rename to "clinical.txt" first)

Problem

  • We needed to generate around 100 Kaplan-Meier plots (one for each pair of RET and another gene).
  • Each plot required dividing the dataset into 4 groups to generate 4 survival curves (expression: low-low, low-high, high-low, high-high).
  • The clinical data (survival times and status) and the RNA Seq expression data were in different datasets with different structures, so both had to be pre-processed before we could map the patient IDs.
  • We needed to screen all the plots generated but keep only the ones where the low-low curve was higher than the others, and retrieve relevant data such as CIs and time to 50% survival to complement our analysis.
  • Since each clinical trial reports the data in a different way and not all have RNA Seq data, we chose the best possible option for breast cancer (METABRIC).
  • In order to reuse our code for other breast cancer datasets or even different cancer types, we needed to generalize the workflow as much as possible and make tools for reproducibility and automation.
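
The grouping step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the patient IDs, the gene name `GENE_B`, and all values are invented, and using the median as the low/high cutoff is my assumption here, not necessarily the threshold used in the project.

```python
from statistics import median

# Toy data, invented for illustration (not METABRIC values).
# Clinical table: patient ID -> (survival time in months, 1 = event, 0 = censored)
clinical = {
    "P1": (120.0, 0), "P2": (34.5, 1), "P3": (80.2, 1), "P4": (150.0, 0),
    "P5": (12.3, 1), "P6": (95.0, 0), "P7": (60.0, 1),  # P7 has no expression data
}
# Expression table: gene -> {patient ID -> expression value}
expression = {
    "RET":    {"P1": 0.2, "P2": 1.5, "P3": 0.1, "P4": 0.3, "P5": 2.0, "P6": 1.1},
    "GENE_B": {"P1": 0.4, "P2": 1.9, "P3": 1.7, "P4": 0.1, "P5": 0.2, "P6": 1.3},
}


def split_four_groups(clinical, expression, gene_a, gene_b):
    """Assign each patient present in BOTH tables to one of four groups."""
    values_a, values_b = expression[gene_a], expression[gene_b]
    # Map patient IDs across datasets: keep only IDs found everywhere.
    shared_ids = sorted(set(clinical) & set(values_a) & set(values_b))
    # Assumed cutoff: the median of each gene among the shared patients.
    cut_a = median(values_a[p] for p in shared_ids)
    cut_b = median(values_b[p] for p in shared_ids)
    groups = {"low-low": [], "low-high": [], "high-low": [], "high-high": []}
    for p in shared_ids:
        level_a = "low" if values_a[p] <= cut_a else "high"
        level_b = "low" if values_b[p] <= cut_b else "high"
        groups[f"{level_a}-{level_b}"].append(p)
    return groups


groups = split_four_groups(clinical, expression, "RET", "GENE_B")
for name, ids in groups.items():
    print(name, ids)
```

Each of the four ID lists can then be used to pull survival times and status from the clinical table and fit one curve per group.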

Solution

  • I learned how to use the KaplanMeierFitter class from the lifelines Python library to generate KM plots.
  • I first built a dataset-specific Google Colab notebook to produce batches of 40-50 plots. It only makes 4 groups from the original dataset based on the expression of RET and one other gene, which required manually writing all 40-50 names of the other gene into the code (View tool).
  • Then, I found a way to generalize some steps and created a Jupyter notebook that uses ipywidgets to collect user inputs interactively, allowing dynamic selection of any measured variable to divide the dataset into 2 or more groups and easy re-plotting of the curves (View tool).
  • Finally, I discovered Streamlit and adapted my interactive notebook into a data app (GIF above) that uses a similar approach but has more interactivity, improved outputs, and a better user experience.
  • Although the app works well for several datasets, I noticed high variability in the formatting of clinical trial data, so I keep working on the app to generalize it further.
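
The project relied on lifelines for the actual curves. To show the arithmetic behind a Kaplan-Meier curve and the "time to 50% survival" readout, here is a dependency-free sketch of the product-limit estimate. The function names and the toy durations are my own, chosen for illustration; this is not the code from the notebooks or the app.

```python
def kaplan_meier(durations, events):
    """Product-limit estimate.

    durations: follow-up time for each patient
    events:    1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends exactly at time t.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set.
        at_risk -= leaving
    return curve


def time_to_half_survival(curve):
    """First time at which estimated survival is 50% or lower, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Toy group, invented for illustration.
durations = [5, 8, 12, 12, 20, 30]
events = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(durations, events)
half_time = time_to_half_survival(curve)
print(curve, half_time)
```

Running the same two functions on each of the four expression groups gives four curves that can be compared, for example to check whether the low-low curve stays above the others.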

002 - Automated PowerPoint generator

Summary

During my graduate studies, I performed fluorescence microscopy experiments, acquiring images of cancer cells in vitro. Typical analysis involved co-localization between signals produced by proteins, or object/particle counting.

For Proximity Ligation Assay (PLA) experiments, which evaluate protein-protein interactions, I used an EVOS M7000 cell imager to automate the acquisition of thousands of images. I wrote scripts in Jython (a Python implementation that runs on the Java platform) to automate image processing and analysis in the ImageJ/Fiji software. The outputs are a csv file with the object count for each individual cell and pairs of fluorescence + object mask images (the latter show particles as colored blobs if they met the criteria to be counted).

I designed a tool to consolidate all the outputs for each experimental group into a summary PowerPoint presentation so we could validate the parameters used during the workflow. I automated the creation of slides using the python-pptx library, designing a custom layout and inserting relevant information. I first created the tool as a Google Colab notebook and then as a Streamlit app. This tool helped me visualize outputs for almost 10,000 images, easily compare two quantification methods, and fully optimize the whole workflow.



Streamlit Projects 002 GIF

Try the app on → Streamlit Cloud or GitHub Codespaces

Instructions and Sample data (Rename to "data.zip" first)

Problem

  • Manually inserting, resizing, arranging and labeling all the images is incredibly time-consuming and prone to errors.
  • ImageJ/Fiji is not fully compatible with Python 3 code, so I could not integrate a feasible solution into my other Jython scripts.
  • Each experimental group may be quantified by one method, the other, or both.
  • Depending on the quantification type, the output csv may contain fewer or additional columns.
  • The real image labels are in the csv alongside their counts. However, the fluorescence images are in one subdirectory with "_2" appended to their names, whereas the object mask images are in a different subdirectory with "_1" appended to their names (there may be one or two sets of object masks, one for each quantification method used).
  • Due to the large number of images quantified per experimental group (100-500), we needed an efficient layout, balancing image visibility and number of slides (fewer slides = faster review).
  • Since our research group was planning on doing several more PLA experiments, automation was essential.
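
The naming problem above can be illustrated with a short sketch that reads labels and counts from a csv and derives the expected image paths from the "_2" and "_1" suffixes. The column names (`Label`, `Count`, `Area`), the subdirectory names (`fluorescence`, `masks`), and the `.png` extension are all hypothetical placeholders; only the suffix convention comes from the description.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical csv content; the real column names may differ.
CSV_TEXT = """Label,Count,Area
cell_001,12,340.5
cell_002,7,298.0
"""


def read_counts(csv_text, label_col="Label", count_col="Count"):
    """Read label/count pairs and keep any other columns as optional extras."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        extras = {k: v for k, v in row.items() if k not in (label_col, count_col)}
        records.append(
            {"label": row[label_col], "count": int(row[count_col]), "extra": extras}
        )
    return records


def image_pair_paths(group_dir, label,
                     fluo_subdir="fluorescence", mask_subdir="masks", ext=".png"):
    """Build the (fluorescence, object mask) paths for one csv label.

    Fluorescence images carry a "_2" suffix and object masks a "_1" suffix.
    The subdirectory names and extension are assumptions for this sketch.
    """
    root = PurePosixPath(group_dir)
    fluorescence = root / fluo_subdir / f"{label}_2{ext}"
    mask = root / mask_subdir / f"{label}_1{ext}"
    return fluorescence, mask


records = read_counts(CSV_TEXT)
pairs = [image_pair_paths("group_A", r["label"]) for r in records]
for record, (fluo, mask) in zip(records, pairs):
    print(record["label"], record["count"], fluo, mask)
```

Keeping the extra columns in a separate dictionary is one way to tolerate csv files whose columns vary with the quantification method.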

Solution

  • I manually tested different arrangements of images + labels in rows and columns until I settled on the layout that worked best for the type and amount of data I had (see app info page).
  • I measured and defined each item's coordinates and dove into the python-pptx documentation to figure out how to build that very specific layout (see app info page).
  • I wrote the necessary code to scan a zip file for csv files, read their content, and go back to the root directory of that experimental group to find the pairs of images to insert.
  • A large iterable of names, counts, and image locations is generated and then split into groups of up to 20 items, one group per slide (see app info).
  • I implemented this approach first in a Google Colab notebook (View tool) and then created a Streamlit app (GIF above). The app has the same functionality but better user experience, especially to read additional info on the input/output and the design of the slides.
  • The app allows quick and easy automation, as the user only needs to upload a zip file with as many experimental group folders as desired (with the outputs of my quantification script), and indicate the quantification method in the app.
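
Two of the steps above, scanning a zip file for csv files and splitting the items into groups of up to 20 per slide, can be sketched with the standard library alone. The folder layout of the demo archive and the helper names are assumptions for illustration; in particular, treating the first path component as the experimental group folder is my guess, and the slide creation with python-pptx is left out.

```python
import io
import zipfile
from pathlib import PurePosixPath


def build_demo_zip():
    """Create a tiny in-memory archive with a made-up folder layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("group_A/results.csv", "Label,Count\ncell_001,12\n")
        archive.writestr("group_B/results.csv", "Label,Count\ncell_002,7\n")
        archive.writestr("group_A/fluorescence/cell_001_2.png", b"")
    return buffer.getvalue()


def find_csv_members(zip_bytes):
    """Return (csv path, group folder) for every csv file in the archive.

    Assumption: the first path component is the experimental group folder.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
    return sorted((n, PurePosixPath(n).parts[0]) for n in names)


def chunk(items, size=20):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


csv_members = find_csv_members(build_demo_zip())
slides = chunk(list(range(45)), size=20)  # 45 placeholder items
print(csv_members)
print([len(s) for s in slides])
```

With 100-500 images per experimental group, groups of 20 translate into roughly 5 to 25 slides per group.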

001 - Interactive extraction of RNA Seq data from CCLE/DepMap

Summary

During my graduate studies, I came across the Cancer Cell Line Encyclopedia, which is a project containing information on over 1,800 cell models, including RNA Seq gene expression data (around 20,000 genes).

I created a basic tool as a Google Colab notebook (View tool) to search and retrieve only cell lines of interest (we usually only needed <10). However, years later I noticed that the dataset had been merged with the Achilles project to form the DepMap project. This added a few more cell lines but several more datasets from diverse genomics, proteomics, and metabolomics assays. They also reshaped datasets, reassigned IDs to make all datasets consistent, etc. I adapted my tool to work with the new version (at that time, 23Q2), and generated a similar notebook.

Finally, when I discovered Streamlit I built a data app to replicate my notebook tool. I realized how easy it was to add widgets and interactive plots that would not only extract the data, but also automate basic exploration and visualization of the cell lines and gene expression in a very user-friendly manner.



Streamlit Projects 001 GIF

Try the app on → Streamlit Cloud or GitHub Codespaces

Instructions (Does not require sample data)

Problem

  • The RNA Seq dataset is very large and it no longer has cell line names, as they were changed to Achilles IDs which are encoded in another file.
  • We needed to pre-process both datasets before mapping the IDs, but asking the user to get the required files from the website was confusing and led to errors as the datasets change 2-4 times a year.
  • The notebook tool required the user to have the required files already stored in a specific Google Drive folder (or to have access to a Google account that had them).
  • The notebook tool was only able to search based on cell line name, but sometimes we needed just to explore what models are available for some tissues.
  • The notebook tool only provided a simple view of the search results showing the cell line name followed by tissue, no more information.
  • While the notebook tool provided some degree of automation, it was not easy to de-select cell lines and only gave the raw data for the user to plot or analyze.
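
The ID mapping problem in the first bullets can be sketched as a lookup between two tables. Everything in this example is invented for illustration: the IDs only imitate the general look of Achilles IDs, and the names, tissues, genes, and values are placeholders rather than DepMap data.

```python
# Metadata table: model ID -> descriptive fields (invented values).
models = {
    "ACH-X00001": {"name": "TOY-1", "tissue": "Breast"},
    "ACH-X00002": {"name": "TOY-2", "tissue": "Lung"},
}
# Expression table keyed by model ID: gene -> value (invented values).
expression_by_id = {
    "ACH-X00001": {"GENE_A": 3.2, "GENE_B": 0.0},
    "ACH-X00002": {"GENE_A": 1.1, "GENE_B": 4.7},
    "ACH-X00003": {"GENE_A": 0.5, "GENE_B": 2.2},  # no metadata entry
}


def rename_rows(expression_by_id, models):
    """Re-key expression profiles by cell line name.

    Returns the renamed table plus the IDs that had no metadata entry,
    so that mismatches between files are visible instead of silent.
    """
    named = {}
    unmatched = []
    for model_id, profile in expression_by_id.items():
        info = models.get(model_id)
        if info is None:
            unmatched.append(model_id)
            continue
        named[info["name"]] = profile
    return named, unmatched


named, unmatched = rename_rows(expression_by_id, models)
print(sorted(named), unmatched)
```

Reporting the unmatched IDs is a design choice that helps when the two files come from different releases.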

Solution

  • I set the Streamlit app to automatically download the required files for the current release at the time (23Q2). It takes a minute or two, but the user does not need a Google account or to upload anything to use the app.
  • The pre-processing is tailored to that specific data release and caches the prepared dataframe to improve efficiency.
  • I added a second search mode, so the user can search names of cell lines (or parts of them), and also search by tissue type.
  • The app displays more interactive search results, letting the user check boxes for the cell lines to keep (instead of entering numbers), and it provides the Achilles ID, clean cell line name, tissue type and cancer type.
  • The csv output is the same as the notebook tool's; however, the app has several widgets to preview the selected data.
  • Although it is not perfect, the preview area shows the generated dataset and lets the user easily type in genes of interest to make a bar chart or a heatmap. These visualizations are interactive (plotly) and the user can take snapshots if needed.
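
The two search modes described above (partial cell line name, or tissue type) could look roughly like the following sketch. The metadata is invented, and the name-cleaning rule (drop punctuation and spaces, compare in upper case) is an assumption of mine about what a "clean" name means, not the app's confirmed behavior.

```python
import re

# Invented metadata: model ID -> descriptive fields.
models = {
    "ACH-X00001": {"name": "TOY-1 A", "tissue": "Breast"},
    "ACH-X00002": {"name": "Toy 2", "tissue": "Lung"},
    "ACH-X00003": {"name": "OTHER-9", "tissue": "Breast"},
}


def clean_name(name):
    """Strip punctuation and spaces and upper-case (assumed cleaning rule)."""
    return re.sub(r"[^A-Za-z0-9]", "", name).upper()


def search_models(models, query, mode="name"):
    """Return sorted model IDs matching the query.

    mode="name":   the cleaned query must appear inside the cleaned name
    mode="tissue": the tissue must match exactly, ignoring case
    """
    if mode == "name":
        needle = clean_name(query)
        hits = [mid for mid, info in models.items()
                if needle and needle in clean_name(info["name"])]
    elif mode == "tissue":
        needle = query.strip().lower()
        hits = [mid for mid, info in models.items()
                if info["tissue"].lower() == needle]
    else:
        raise ValueError(f"unknown search mode: {mode!r}")
    return sorted(hits)


by_name = search_models(models, "toy")
by_tissue = search_models(models, "breast", mode="tissue")
print(by_name, by_tissue)
```

The resulting IDs can then be shown to the user with their name, tissue, and cancer type so that lines can be kept or discarded before export.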