The figure below shows how scGPT performs at annotating human pancreas cells after fine-tuning on a reference dataset with expert annotations. The expert annotations (left) can be compared side by side with the annotations inferred by scGPT (right). Notably, scGPT’s precision was lower for rare cell types with few cells in the reference dataset, so proceed with caution when using the pre-trained model.

This Superbio application ingests AnnData files as both (a) reference files for fine-tuning and (b) test files for inference and annotation. Make sure you have both, or use the test data provided.
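Before uploading, it can help to sanity-check both files locally. The sketch below is a minimal, hedged example rather than part of the Superbio app: the function names `check_h5ad` and `check_upload_pair` are hypothetical, and the size check assumes that the local upload limit mentioned in the upload step means 100 MiB. It relies on the fact that .h5ad files are HDF5 files, which normally begin with a fixed 8-byte signature. It does not verify that the file is a valid AnnData object.

```python
import os

# First 8 bytes of a standard HDF5 file (files with a user block are not handled here).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"
# Assumption: the stated local upload limit is interpreted as 100 MiB.
LOCAL_UPLOAD_LIMIT_BYTES = 100 * 1024 * 1024


def check_h5ad(path, local_upload=True):
    """Return a list of problems found with one candidate .h5ad file (empty if none)."""
    problems = []
    if not path.lower().endswith(".h5ad"):
        problems.append("file extension is not .h5ad")
    if not os.path.isfile(path):
        problems.append("file does not exist")
        return problems
    with open(path, "rb") as handle:
        if handle.read(8) != HDF5_SIGNATURE:
            problems.append("file does not start with the HDF5 signature")
    # Only local uploads are size-limited; remote uploads skip this check.
    if local_upload and os.path.getsize(path) > LOCAL_UPLOAD_LIMIT_BYTES:
        problems.append("file exceeds the local upload limit")
    return problems


def check_upload_pair(reference_path, test_path, local_upload=True):
    """Check the reference (fine-tuning) file and the test (inference) file together."""
    return {
        "reference": check_h5ad(reference_path, local_upload),
        "test": check_h5ad(test_path, local_upload),
    }
```

Calling `check_upload_pair("reference.h5ad", "test.h5ad")` with your own paths returns an empty list for each file that passes all checks.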

Screenshot 2024-05-08 at 7.44.24 PM.png

Fig. 2a, Cui et al. (2024), Nature Methods.

  1. Navigate to “scGPT: Cell Type Annotation” on Superbio.

Screenshot 2024-05-08 at 7.37.19 PM.png

  1. Find scRNA-seq dataset(s) of interest. We recommend searching CELLxGENE for files of interest and downloading the AnnData file (.h5ad format) for scGPT ingestion.

  2. To upload data from a remote or local source, click ‘Remote’ or ‘Local’. Note that local file uploads are limited to 100 MB in size, while remote uploads have no limit.

    Screenshot 2024-05-08 at 8.14.58 PM.png

  3. Click ‘Use Demo Data’ to load the test MS dataset provided by Superbio. It contains 9 healthy control samples in the reference file ‘c_data.h5ad’, used for fine-tuning, and 12 MS samples in the test data file ‘filtered_ms_adata.h5ad’.

    Screenshot 2024-05-08 at 9.00.44 PM.png

  4. Once the data has loaded in the UI, turn your attention to the workflow parameters on the right-hand side. You may notice quite a few!

  5. Start by setting EPOCHs, FILTER_GENE_COUNTS, and FILTER_CELL_COUNTS.

    Screenshot 2024-05-08 at 9.07.32 PM.png

  6. EPOCHs are an important consideration in machine learning. An EPOCH is one complete pass of the training dataset through the algorithm. As a general rule, a good starting EPOCH number is roughly three times the number of columns in the dataset. Examine the output data, then adjust the EPOCH number to optimize model fitting.
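The rule of thumb in the last step can be written out as a short calculation. The sketch below is illustrative only: `starting_epochs` and `best_epoch` are hypothetical helpers, not Superbio or scGPT functions, and the loss values are made up. It also assumes you have a per-epoch validation loss to examine, which is one common way to judge model fit but may not match the outputs the app reports.

```python
def starting_epochs(n_columns, multiplier=3):
    """Starting epoch count from the rule of thumb: multiplier * number of columns."""
    if n_columns < 1:
        raise ValueError("n_columns must be a positive integer")
    return multiplier * n_columns


def best_epoch(validation_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    if not validation_losses:
        raise ValueError("validation_losses must not be empty")
    lowest = min(validation_losses)
    return validation_losses.index(lowest) + 1


# Hypothetical example: a dataset with 4 columns and made-up validation losses.
epochs = starting_epochs(4)
losses = [0.90, 0.61, 0.48, 0.45, 0.47, 0.52]
print(epochs, best_epoch(losses))  # prints: 12 4
```

In this made-up run the loss stops improving after epoch 4, which would suggest lowering the EPOCH setting from the starting value of 12 on the next run.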