Motivation: Large language models (LLMs) are a class of artificial intelligence models trained on both natural and programming languages, so they can generate functional code from plain-language prompts. Biotech companies and academic institutions collect more sequencing, imaging, and other data every day, and it has been estimated that this industry will generate more data than any other by 2025. The ability to sift through such large volumes of information will only become more important over time, which makes democratizing access to useful bioinformatics workflows increasingly important.

Enter CodeBio! CodeBio is a GPT-based application that handles the upstream, domain-specific prompt engineering for you. As a result, CodeBio can generate functional bioinformatics code in Python while also providing natural-language explanations of what the code does.

Step 1: Navigate to CodeBio on Superbio

Screenshot 2025-01-21 at 7.57.43 AM.png

Step 2: Select (1) your role, (2) your preferred coding language, and (3) the level of verbosity of the response. Responses can be more or less concise depending on task complexity and user preference. You may also name specific packages that you'd like to see used in the results (e.g., pandas for tabular datasets or scikit-learn for machine learning).

In this case, we will ask CodeBio to help us perform quality-control data filtering steps upstream of a scRNA-seq analysis using pandas.

Screenshot 2025-01-21 at 7.58.17 AM.png
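
For reference, the kind of quality-control filtering requested here could look roughly like the sketch below. This is not CodeBio's output. It is a minimal hand-written illustration that assumes a cells x genes count matrix held in a pandas DataFrame, mitochondrial genes named with an "MT-" prefix, and a hypothetical helper called qc_filter. The thresholds are placeholders, not recommendations, and should be tuned to your dataset.

```python
import numpy as np
import pandas as pd


def qc_filter(counts: pd.DataFrame,
              min_genes: int = 200,
              max_pct_mito: float = 10.0,
              min_cells: int = 3,
              mito_prefix: str = "MT-") -> pd.DataFrame:
    """Filter a cells x genes count matrix using simple QC thresholds.

    Hypothetical helper for illustration; all defaults are assumptions.
    """
    # Per-cell metrics: total counts, genes detected, percent mitochondrial.
    total_counts = counts.sum(axis=1)
    genes_detected = (counts > 0).sum(axis=1)
    mito_genes = counts.columns[counts.columns.str.startswith(mito_prefix)]
    mito_counts = counts[mito_genes].sum(axis=1)

    # Cells with zero total counts get 0% mito instead of a division error.
    pct_mito = (100.0 * mito_counts / total_counts.replace(0, np.nan)).fillna(0.0)

    # Keep cells with enough detected genes and a low mitochondrial fraction.
    keep_cells = (genes_detected >= min_genes) & (pct_mito <= max_pct_mito)
    filtered = counts.loc[keep_cells]

    # Keep genes detected in at least `min_cells` of the remaining cells.
    cells_per_gene = (filtered > 0).sum(axis=0)
    return filtered.loc[:, cells_per_gene >= min_cells]


# Toy data so the sketch runs without any input files.
toy = pd.DataFrame(
    {
        "GeneA": [5, 0, 3, 1],
        "GeneB": [2, 0, 4, 0],
        "GeneC": [0, 1, 2, 0],
        "MT-CO1": [1, 0, 1, 8],
    },
    index=["cell1", "cell2", "cell3", "cell4"],
)

result = qc_filter(toy, min_genes=2, max_pct_mito=20.0, min_cells=2)
print(result)
```

On the toy data, cell2 is dropped for having too few detected genes and cell4 for a high mitochondrial fraction. GeneC is then dropped because only one remaining cell expresses it.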

Step 3: CodeBio will return a code block along with an LLM-generated explanation of that code. You can try running the code in Google Colab. Be aware that the code may be incomplete and may require additional prompting to produce a functional program.

Screenshot 2025-01-21 at 7.58.50 AM.png

Step 4: Re-prompt as necessary. For example, I would like the following code block to:

Screenshot 2025-01-21 at 8.00.45 AM.png