<aside>
Tutorials
</aside>
Single-cell GPT (scGPT) represents a pioneering adaptation of generative pre-trained models to single-cell multi-omics. In fields like natural language processing, pre-trained large language models (LLMs) have shown remarkable performance on a wide range of downstream tasks, outperforming models trained on smaller, domain-specific datasets. This inspired the notion of ‘pre-training universally, fine-tuning on demand’ for single-cell genomics, which Cui et al. (2024) demonstrated in their Nature Methods work. Traditionally, multi-omics data has been both costly and laborious to generate, limiting the efficacy of machine learning models trained on it, since such models notoriously overfit in low-data regimes.
To ameliorate this, Cui et al. (2024) assembled a pre-training dataset of over 33 million scRNA-seq profiles, drawn from 51 organs or tissues across 441 separate studies. The resulting model, scGPT, demonstrates state-of-the-art performance using both its cell and gene embeddings, and fine-tuning extends it to additional downstream tasks. Among a diverse range of use cases, the authors leverage these embeddings for cell type annotation via reference mapping, the inference of gene regulatory networks, and the prediction of cellular responses to genetic perturbations.
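As a taste of the embedding workflow, here is a minimal sketch of extracting cell embeddings from a pre-trained checkpoint. The embed_data helper, its arguments, and the checkpoint path are assumptions modeled on the scgpt Python package and may differ in your installed version.

```python
import scanpy as sc
import scgpt

# Query scRNA-seq data (cells x genes) in AnnData format.
adata = sc.read_h5ad("query_cells.h5ad")

# Embed cells with a pre-trained whole-human scGPT checkpoint.
# NOTE: the helper name, its arguments, and the checkpoint path below
# are assumptions; check your installed scgpt version for the exact API.
adata = scgpt.tasks.embed_data(
    adata,
    model_dir="checkpoints/scGPT_human",  # hypothetical local path
    gene_col="gene_name",                 # .var column holding gene symbols
    batch_size=64,
)

# Cell embeddings land in .obsm and can feed any downstream task,
# e.g. clustering, annotation, or reference mapping.
print(adata.obsm["X_scGPT"].shape)  # (n_cells, embedding_dim)
```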
These three applications have been implemented in a user-friendly GUI on Superbio. Below, we offer concrete instructions for running successful inference with each one; for optimal performance, we recommend carefully considering every input parameter.
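Every application starts from an uploaded AnnData (.h5ad) file, so a typical preparation step looks like the sketch below. The QC thresholds and the gene_name column are illustrative assumptions rather than requirements imposed by the apps.

```python
import scanpy as sc

# Start from raw counts; the apps expect an AnnData .h5ad upload.
adata = sc.read_h5ad("raw_counts.h5ad")

# Light QC before upload; the thresholds here are illustrative.
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.filter_cells(adata, min_genes=200)

# scGPT matches genes by symbol, so keep symbols in a dedicated
# .var column (assumes var_names already hold gene symbols).
adata.var["gene_name"] = adata.var_names

# Write the cleaned object for upload to the Superbio GUI.
adata.write_h5ad("scgpt_input.h5ad")
```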
Fig. 1a, Cui et al. (2024), Nature Methods.
scGPT: Reference Mapping Using Cell Embedding
scGPT: Gene Regulatory Network Inference
scGPT: Predicting Perturbations by Fine-Tuning a Pre-Trained Model
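As a preview of the first tutorial above, here is a minimal sketch of reference mapping by nearest-neighbor label transfer in cell-embedding space. It assumes reference and query embeddings have already been computed (e.g., as in the embedding sketch earlier), the file names and the cell_type column are hypothetical, and scikit-learn's KNeighborsClassifier stands in for the app's own mapping logic.

```python
import scanpy as sc
from sklearn.neighbors import KNeighborsClassifier

# Reference atlas and query sample, each with scGPT cell embeddings
# already stored in .obsm["X_scGPT"] (see the embedding sketch above).
ref = sc.read_h5ad("reference_embedded.h5ad")  # hypothetical file
query = sc.read_h5ad("query_embedded.h5ad")    # hypothetical file

# k-nearest-neighbor label transfer in embedding space; this stands in
# for the app's mapping step and is not the authors' exact code.
knn = KNeighborsClassifier(n_neighbors=15, metric="cosine")
knn.fit(ref.obsm["X_scGPT"], ref.obs["cell_type"])
query.obs["predicted_cell_type"] = knn.predict(query.obsm["X_scGPT"])

print(query.obs["predicted_cell_type"].value_counts())
```

Cosine distance is a common choice for comparing transformer embeddings; both the neighbor count and the metric are tunable and worth revisiting for your own data.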