NYSCF Automated Deep Phenotyping Dataset (ADPD)

◄ Back to Open Source @ NYSCF

Background

Update 03/16/2021: The work and dataset described here is from a manuscript that is currently in submission for peer review. The submitted manuscript is available on bioRxiv. After review and acceptance for publication, this page will be updated with download links to the dataset and example code. If you’d like to be notified when the dataset is available for download, please send us an email here.

The New York Stem Cell Foundation Research Institute (NYSCF), in collaboration with Google Accelerated Science has built a high-throughput, high-content Cell Painting–based phenotyping platform that combines advanced and scalable cell culture automation with cutting-edge deep learning algorithms. Great care was taken to minimize experimental confounds in the study design, including choice and source of cell lines, automated cell handling and assaying including cell expansion, freeze-down, thawing, seeding and Cell Painting. An important goal of this work was to achieve sufficient reproducibility that analysis and cross-validation could be performed across plates, plate layouts and, most importantly, batches, to create a technical foundation for large-scale, population-based phenotypic profiling and drug screening.

We chose to test our platform on primary fibroblasts from subjects with Parkinson’s disease (PD) and demographically matched healthy controls. Using deep learning in parallel with automated Cell Painting analysis, we were able to confidently separate PD (both sporadic and LRRK2, ROC AUC 0.79 (0.08 standard deviation (SD)) from healthy controls, demonstrating the potential use for this platform for unbiased PD disease modeling and drug discovery. Furthermore, our platform was able to successfully identify a cell line within a cohort of 96 total lines with 91% mean accuracy (6% SD), across batches and plate layouts, demonstrating the robustness of our screening platform and revealing the presence of surprisingly strong individual signatures. Full details of our platform and study are summarized in this article. As part of the publication, the entire dataset of raw and processed images along with example code for reproducing our findings and a near real-time image analysis Fiji macro has been made available to the scientific community, which are detailed below. To our knowledge, this is the largest publicly available Cell Painting–based high-content imaging data set in the world.

NYSCF Automated Deep Phenotyping Dataset 1 (NYSCF-ADPD-1)

This 48 TB data set contains over 1.7 million full resolution (5056×2960 pixels) Cell Painting images in 16-bit TIFF format. The images were acquired in 5 fluorescent channels using an automated epifluorescence system (Nikon Ti2) and the collection contains images from 4 replicate batches containing 12 96-well plates, each seeded with 96 cell lines from 91 individuals and imaged at 76 tiles per well. The images were pre-processed as described in the publication.

In the filenames, GFP-ER-ConcA is ER, CY3-AGP is AGP, CY5 is MITO and site is tile.

Table 1. Image datasets (no links)

Dataset sizes are uncompressed image sizes
ResolutionTile images
RawFull resolution48 TB
Pre-processedFull resolution48 TB
Junior: Downsampled 299x299160 GB

NYSCF Automated Deep Phenotyping Dataset 1 – Junior

For increased accessibility, the NYSCF-ADPD-1 data set has been downsampled to 299×299 pixel resolution (the Inception-v4 deep embedding generator’s input resolution and the same resolution used to generate tile deep embeddings) or a total of 160 GB (uncompressed).

FIJI plugin for near real-time quality analysis

We developed a Fiji-based macro to assess the quality and consistency of the images acquired from a 96-well plate which is available hereA sample test dataset can be downloaded here.

NYSCF-ADPD-1 Tile deep embeddings and cell image deep embeddings

We computed 320-dimensional deep embedding vectors for both 512×512 cropped cells and full tile images (5056×2960). For this, we used an Inception-v4 CNN pre-trained on the ImageNet object recognition dataset consisting of 1.2 million images of a thousand categories of (non-cell) objects to compute a 64-dimensional deep embedding for each channel, from which the final 320-dimensional deep embedding is obtained by concatenating the 64-dimensional vectors for each channel (in order: DAPI, ER, RNA, AGP and MITO). Table 2 summarizes deep embedding data sources and provides direct download links.

NYSCF-ADPD-1  CellProfiler feature vectors

We used the pipeline template as described in Bray et al Nature Protocols 2016, where we determined cells on the RNA channel, nuclei on the DAPI channel, and cytoplasm by subtracting the nuclei object from the cell object. A 3483-dimensional feature vector was obtained per cell, across 7,450,738 cells. Table 2 summarizes feature data sources and provides direct download links.

This data is licensed under CC BY-NC-SA 4.0.

Table 2. Data sources (no links)

The well-level deep embeddings or features contain a single deep embedding or feature vector per well, the mean across all cells or tiles in a well.
Table 2Cell image deep embeddingsTile deep embeddingsCellProfiler features
Cell or Tile levelRaw6 GB0.4 GB160 GB
Normalized12 GB0.7 GBNA
Well levelRawNANA100 MB
Normalized10 MB10 MB100 MB

Code Collection

We have made available the necessary code to recapitulate key findings of our paper, including the PD vs. healthy classification and cell-line classification, as well as sample code for generating deep embeddings from microscopy images and the CellProfiler pipeline.

  • Example Jupyter notebook training a PD vs healthy model achieving 0.79 (0.08 SD) ROC AUC
  • Example Jupyter notebook training a 96-way cell line model achieving 91% (6% SD) accuracy
  • Example Jupyter notebook generating embeddings from one site image set
  • Metadata with the information about the cell lines (supplementary tables from the paper)