ChemBERTa (arXiv, 2020) is one of the first efforts to bring the success of self-supervised learning with transformer models to molecular property prediction. Specifically, ChemBERTa was adapted from the RoBERTa transformer implementation on Hugging Face. It was pre-trained on SMILES strings from the PubChem 77M dataset, then fine-tuned on a variety of classification tasks from MoleculeNet.

Although the initial implementation of ChemBERTa approached, but did not beat, the classification performance of the GNN-based Chemprop, the model scales very well with more pre-training data. This suggests that it could eventually outperform Chemprop as the available datasets expand, which is happening day by day.

While not state-of-the-art, ChemBERTa still offers some useful additions to the drug developer’s toolkit, including downstream molecular toxicity prediction using a model fine-tuned on the ClinTox dataset.

Let’s examine a toxicity prediction task using the ChemBERTa implementation on Superbio.

Step 1: Examine the ClinTox training data to better understand how the model works. A screenshot of the ClinTox dataset is pictured below.

[Screenshot: the ClinTox dataset]

Chemical structures are given in SMILES format in the left-hand column. The ClinTox dataset contains records of 1,485 chemical compounds, each annotated with a “0” or “1” indicating FDA approval or, conversely, clinical toxicity and regulatory failure.

Since ChemBERTa-ClinTox was fine-tuned on this data, we can expect that the model will return a prediction indicating the likelihood of clinical trial failure for a given molecule in SMILES format.
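
To make "likelihood" concrete: a two-class classifier of this kind typically emits two raw scores (logits), and a softmax turns them into probabilities. The sketch below assumes that convention, and it assumes that index 1 corresponds to the toxic class. The function name is hypothetical and the logit values are made up for illustration. Neither comes from ChemBERTa or Superbio.

```python
import math

def toxicity_probability(logits):
    """Convert two raw class scores into the probability of the 'toxic' class.

    Assumption: logits[0] is the 'non-toxic' score and logits[1] is the
    'toxic' score. Check the label order of the model you actually use.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)

# Made-up logits for illustration only (not real model output).
example_logits = [0.0, 0.0]
print(toxicity_probability(example_logits))  # equal scores give 0.5
```

A value near 1 would indicate that the model considers the molecule likely to fail on toxicity grounds, and a value near 0 the opposite, subject to the label-order assumption above.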

Step 2: Identify a molecule of interest in SMILES format.

For this tutorial, we will select the small molecule rapamycin, an obsession of human longevity researchers ever since it was reported to extend lifespan by up to 60% in mice.

Because it inhibits cell growth and proliferation, rapamycin (sirolimus) has immunosuppressive properties and is currently FDA-approved for preventing organ rejection in kidney transplant recipients. However, major toxicities have also been reported, reducing the likelihood of mainstream approval for life extension.

Let’s find the SMILES format for the drug, and see what ChemBERTa has to say!

Step 3: Make sure you have the SMILES string for rapamycin saved in a .csv file.
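
One way to prepare such a file is with Python's standard csv module. This is a minimal sketch under assumptions: the column header `smiles`, the file name `molecules.csv`, and the helper `write_smiles_csv` are all hypothetical, so check which layout the Superbio app expects. Aspirin's SMILES is used here only as a short stand-in. Replace it with the rapamycin SMILES copied from a source such as PubChem.

```python
import csv

def write_smiles_csv(smiles_list, path, header="smiles"):
    """Write one SMILES string per row under a single column header.

    The header name is an assumption; adjust it to match the input
    format required by the tool that will read the file.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([header])
        for smiles in smiles_list:
            # Strip stray whitespace that often sneaks in when copy-pasting.
            writer.writerow([smiles.strip()])
    return path

# Stand-in molecule (aspirin); swap in the rapamycin SMILES string.
write_smiles_csv(["CC(=O)OC1=CC=CC=C1C(=O)O"], "molecules.csv")
```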