scGPT: Predicting Perturbations by Fine-Tuning a Pre-Trained Model

Fig. 3, Cui et al. (2024), Nature Methods.

Advances in sequencing and gene editing have made large-scale perturbation experiments practical for studying cellular responses, with promise for discovering new gene interactions and advancing regenerative medicine. Because testing every possible gene perturbation experimentally is impractical, scGPT uses self-attention to predict gene expression changes from existing experimental data, which helps extend results beyond the perturbations that were actually measured.

Navigate to “scGPT: Predicting Perturbations by Fine-Tuning a scGPT Trained Model” on Superbio.
Before uploading a custom dataset, edit it so that it works with this app. The app card also explains how to edit custom dataset files. Please follow the steps below before uploading:
a. First, to better understand the h5ad format, preview the demo dataset by clicking 'View example data'. The preview shows the adata.obs and adata.var sections.

b. As in the demo data, your custom dataset must contain gene names in a column of the adata.var object in the h5ad file. The demo data uses 'gene_name', but the column can have another name, such as 'genes' or 'geneNames'. Confirm that one of the adata.var columns holds gene names before running the app, then select that column in the parameter section.

c. As in the demo data, your dataset should also include cell types in a column of the adata.obs object. The column name can be 'celltypes', 'celltype', or something else. Select the column that contains cell types in the parameter section.

d. The most important column is 'condition', which contains the perturbations. Values in this column must follow a specific format:
- If the condition is the control, the value must be 'ctrl'.
- If the condition is a single perturbation, the value must be either 'A+ctrl' or 'ctrl+A'.
- If the condition is a combination of perturbations, the value must be 'A+B'.
Note: You can use the basic functionality of the scanpy package to edit your dataset.
```
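import scanpy as sc

# Example for adjusting the condition format. Please change it based on your dataset!
adata = sc.read("perturb_processed.h5ad")

def to_app_condition(value):
    # Map each original label to the format the app expects.
    value = str(value)
    if value in ("control", "ctrl"):
        return "ctrl"              # control cells
    if "+" in value:
        return value               # already 'A+B', 'A+ctrl', or 'ctrl+A'
    return f"{value}+ctrl"         # single perturbation

# Map whole labels instead of using str.replace, which would also rewrite
# labels that merely contain another label as a substring.
adata.obs["condition"] = adata.obs["condition"].astype(str).map(to_app_condition).astype("category")

adata.write("perturb_processed_edited.h5ad", compression="gzip")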
```
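Before uploading, it can help to confirm that every value in the condition column matches one of the formats listed above. The following is a minimal sketch using only plain Python. The function name `find_invalid_conditions` and the example labels are hypothetical, and the check assumes that a label has at most two parts separated by '+' and that 'ctrl+ctrl' is not a meaningful value; adjust it if your data differs.

```python
def find_invalid_conditions(conditions, control="ctrl"):
    """Return labels that are not 'ctrl', 'A+ctrl', 'ctrl+A', or 'A+B'."""
    invalid = []
    for label in conditions:
        label = str(label)
        if label == control:
            continue  # plain control cells are valid
        parts = label.split("+")
        # Assumption: a perturbed label has exactly two non-empty parts,
        # and the two parts are not both the control label.
        if len(parts) != 2 or not all(parts) or parts == [control, control]:
            invalid.append(label)
    return invalid


# Hypothetical labels for illustration only.
example = ["ctrl", "KLF1+ctrl", "ctrl+KLF1", "KLF1+BAK1", "control", "KLF1", "A+B+C"]
print(find_invalid_conditions(example))
```

With a real file, you could pass `adata.obs["condition"].unique()` instead of the example list; an empty result suggests the column follows the expected format.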
After adjusting your custom dataset, upload the Perturb-seq file in .h5ad format. For testing, we recommend downloading the Adamson or Norman dataset, or using the demo data provided by Superbio. These datasets are already processed and ready for use with this application, so you do NOT need to edit them before running the app.
Let’s move to the parameter section:
1. On the right, please specify the column names that contain gene names, cell types, and conditions in your Perturb-seq h5ad file. These fields are required. After uploading the custom dataset, column names will be populated for each parameter, and you can select your column name via the dropdown menu.
2. You are ready to run your job by clicking the “Submit Job” button.
After your job is successfully finished, you will receive an email from us, or you can check the 'Jobs' section.

You can navigate to your job and review your results. The results section contains three components:

All other output files can be downloaded using the button in the 'Actions' section on the result page.