scGPT: Cell Type Annotation

The figure below shows scGPT’s performance in annotating human pancreas cells after fine-tuning on a reference dataset with expert annotations. The expert annotations (left) can be compared side by side with the annotations inferred by scGPT (right). Notably, scGPT’s precision was lower for rare cell types with few cells in the reference dataset, so proceed with caution when using the pre-trained model.

This Superbio application ingests AnnData files as both (a) reference files for fine-tuning and (b) test files for inference and annotation. Please make sure you have both, or use the test data provided.

[Figure: expert annotations (left) and scGPT-inferred annotations (right) of human pancreas cells]

Fig. 2a, Cui et al. (2024), Nature Methods.

Navigate to “scGPT: Cell Type Annotation” on Superbio.


Find the scRNA-seq dataset(s) of interest. We recommend searching CELLxGENE for files of interest and downloading the AnnData file (.h5ad format) for scGPT ingestion.
- For a faster start, you can use the multiple sclerosis (MS) test dataset from Cui et al. Scroll down the application page and navigate left to the ‘Upload’ Data window.
If uploading data from a remote or local source, simply click ‘Remote’ or ‘Local’. Note that local file uploads are limited to 100 MB in size, while remote uploads have no limit.
Click ‘Use Demo Data’ to load the test MS dataset provided by Superbio. It contains nine healthy control samples in the reference file ‘c_data.h5ad’ for fine-tuning, and 12 MS samples in the test data file ‘filtered_ms_adata.h5ad’.
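As a rough pre-upload check, the choice between the ‘Local’ and ‘Remote’ routes described above can be sketched in a few lines of Python. The helper name `choose_upload_route` is hypothetical (it is not part of Superbio or scGPT), and reading the local limit as 100 × 10^6 bytes is an assumption; Superbio may measure the limit differently.

```python
from pathlib import Path

# Assumption: the local upload limit is interpreted as 100 * 10^6 bytes.
LOCAL_LIMIT_BYTES = 100 * 1000 * 1000


def choose_upload_route(path, limit_bytes=LOCAL_LIMIT_BYTES):
    """Return 'Local' if the .h5ad file fits under the limit, else 'Remote'.

    Hypothetical helper for illustration only.
    """
    file_path = Path(path)
    # The application expects AnnData files in .h5ad format.
    if file_path.suffix != ".h5ad":
        raise ValueError(f"expected an .h5ad file, got: {file_path.name}")
    size = file_path.stat().st_size  # file size in bytes
    return "Local" if size <= limit_bytes else "Remote"
```

For example, `choose_upload_route("c_data.h5ad")` would return the route to use for a reference file stored in the current directory.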
Once the data appears in the UI, turn your attention to the workflow parameters on the right-hand side. You may notice quite a few!
To begin, let’s set EPOCHs, FILTER_GENE_COUNTS, and FILTER_CELL_COUNTS.
EPOCHs are an important consideration in machine learning. An EPOCH is one complete pass of the training dataset through the algorithm. As a rough rule of thumb, a reasonable EPOCH number to begin with is about 3 × the number of columns in the dataset. Examine the output data, then adjust the EPOCH number to optimize model fitting.
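To make the arithmetic concrete, the sketch below computes a starting EPOCH number from the rule of thumb above and the number of training batches that number implies. The function names and the `batch_size` parameter are hypothetical, chosen for illustration; they are not Superbio or scGPT settings.

```python
import math


def starting_epochs(n_columns, factor=3):
    """Rule-of-thumb starting point: factor * number of columns."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    return factor * n_columns


def total_training_steps(n_samples, batch_size, epochs):
    """One epoch is a full pass over the data: ceil(n_samples / batch_size) batches."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs


# Toy numbers, not taken from the MS dataset.
print(starting_epochs(4))                  # 3 * 4 = 12
print(total_training_steps(1000, 32, 12))  # ceil(1000 / 32) = 32 batches, 32 * 12 = 384
```

Because the starting value scales linearly with the number of columns, it is only an initial guess; the EPOCH number should still be adjusted after examining the output.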