---
visibility: public
---

# Computational Biology

**repo:** [inoue0426/awesome-computational-biology](https://github.com/inoue0426/awesome-computational-biology)  
**category:** [[miscellaneous|Miscellaneous]]

---

# Awesome Computational Biology [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

A curated collection of databases, software, and papers related to computational biology.

> Computational biology involves the development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, ecological, behavioural, and social systems. — [Wikipedia](https://en.wikipedia.org/wiki/Computational_biology)

---

## Interface

Browse and search the resources via the [GitHub Pages UI](https://inoue0426.github.io/awesome-computational-biology/).

---

## Table of Contents

- [Awesome Computational Biology](#awesome-computational-biology-)
  - [Table of Contents](#table-of-contents)
  - [Databases](#databases)
    - [scRNA](#scrna)
    - [Compound](#compound)
    - [Pathway](#pathway)
    - [Mass Spectra](#mass-spectra)
    - [Protein](#protein)
    - [Genome](#genome)
    - [Disease](#disease)
    - [Interaction](#interaction)
      - [Drug-Gene Interaction](#drug-gene-interaction)
      - [Drug (Cell Line) Response](#drug-cell-line-response)
      - [Chemical-Protein Interaction](#chemical-protein-interaction)
      - [Protein-Protein Interaction](#protein-protein-interaction)
      - [Knowledge Graph](#knowledge-graph)
      - [Gene Regulatory Network](#gene-regulatory-network)
    - [Clinical Trial](#clinical-trial)
  - [Benchmarks & Datasets](#benchmarks--datasets)
  - [API](#api)
  - [Preprocessing Tools](#preprocessing-tools)
  - [Machine Learning Tasks and Models](#machine-learning-tasks-and-models)
    - [Drug Discovery](#drug-discovery)
      - [Drug Response Prediction](#drug-response-prediction)
      - [Drug Repurposing](#drug-repurposing)
      - [Drug Target Interaction](#drug-target-interaction)
      - [Compound-Protein Interaction](#compound-protein-interaction)
      - [Molecular Generation](#molecular-generation)
    - [LLM for Biology](#llm-for-biology)
    - [Foundation Models](#foundation-models)
      - [Single-cell Foundation Models](#single-cell-foundation-models)
        - [Transcriptomics Foundation Models](#transcriptomics-foundation-models)
        - [Spatial Foundation Models](#spatial-foundation-models)
        - [Multi-Omics Foundation Models](#multi-omics-foundation-models)
        - [Domain Alignment](#domain-alignment)
      - [Protein Foundation Models](#protein-foundation-models)
        - [Pre-trained Embedding](#pre-trained-embedding)
        - [Protein Structure Prediction and Design](#protein-structure-prediction-and-design)
      - [Multi-Modal Foundation Models](#multi-modal-foundation-models)
      - [Genomics Foundation Models](#genomics-foundation-models)

---

## Databases

### scRNA

- [CZ CELLxGENE](https://cellxgene.cziscience.com/) — Single-cell dataset repository and interactive explorer from the Chan Zuckerberg Initiative.
- [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/) — Public functional genomics [database](/@harrisonqian/awesome/wiki/databases/database).
- [Human Cell Atlas](https://www.humancellatlas.org/) — Open global atlas of all cells in the human body.
- [Single Cell PORTAL](https://singlecell.broadinstitute.org/single_cell) — Public [database](/@harrisonqian/awesome/wiki/databases/database) for single-cell RNA.
- [Single Cell Expression Atlas](https://www.ebi.ac.uk/gxa/sc/home) — Public [database](/@harrisonqian/awesome/wiki/databases/database) for single-cell RNA.

### Compound

- [PubChem](https://pubchem.ncbi.nlm.nih.gov/) — One of the largest chemical databases (compounds, genes, and proteins).
- [ChEBI](https://www.ebi.ac.uk/chebi/) — [Database](/@harrisonqian/awesome/wiki/databases/database) focused on small chemical compounds.
- [ChEMBL](https://www.ebi.ac.uk/chembl/) — Bioactive molecules with drug-like properties.
- [ChemSpider](http://www.chemspider.com/) — Chemical structure [database](/@harrisonqian/awesome/wiki/databases/database).
- [DrugTargetCommons](https://drugtargetcommons.fimm.fi/) — Community platform for curating and integrating experimental bioactivity data across drugs and targets.
- [HMDB (Human Metabolome Database)](https://hmdb.ca/) — Comprehensive [database](/@harrisonqian/awesome/wiki/databases/database) of small molecule metabolites found in the human body.
- [KEGG COMPOUND](https://www.genome.jp/kegg/compound/) — Collection of small molecules and biopolymers.
- [LIPID MAPS](https://www.lipidmaps.org/databases/lmsd/overview) — [Database](/@harrisonqian/awesome/wiki/databases/database) of lipids.
- [Rhea](https://www.rhea-db.org/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of chemical reactions.
- [DrugCentral](http://drugcentral.org/) — Online drug compendium with drug mode of action and indication information.
- [Drug Repurposing Hub](https://repo-hub.broadinstitute.org/repurposing#download-data) — Collections of drug repurposing data (drug, MoA, target, etc).
- [Therapeutic Target Database](https://idrblab.net/ttd/full-data-download) — Drug-target, target-disease, and drug-disease [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets).
- [ZINC ligand discovery database](https://zinc.docking.org/) — Free [database](/@harrisonqian/awesome/wiki/databases/database) of commercially-available compounds for virtual screening.

### Pathway

- [PathwayCommons](https://www.pathwaycommons.org/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of pathways and interactions.
- [KEGG PATHWAY](https://www.genome.jp/kegg/pathway.html) — Collection of pathway maps.
- [WikiPathways](https://wikipathways.org/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of biological pathways.
- [Reactome](https://reactome.org/) — Expert-curated, peer-reviewed pathway [database](/@harrisonqian/awesome/wiki/databases/database) with detailed reaction mechanisms.
- [BioCyc](https://biocyc.org/) — Collection of pathway/genome databases across thousands of organisms.
- [SIGNOR](https://signor.uniroma2.it/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of causal signaling interactions and pathways.
- [MSigDB (Molecular Signatures Database)](https://www.gsea-msigdb.org/gsea/msigdb) — Curated gene sets derived from pathways and biological processes.

### Mass Spectra

- [MassBank](http://www.massbank.jp/) — Open source databases and tools for mass spectrometry reference spectra.
- [MoNA MassBank of North America](https://mona.fiehnlab.ucdavis.edu/) — Meta-[database](/@harrisonqian/awesome/wiki/databases/database) of metabolite mass spectra, metadata, and associated compounds.

### Protein

- [THE HUMAN PROTEIN ATLAS](https://www.proteinatlas.org/) — Comprehensive human protein [database](/@harrisonqian/awesome/wiki/databases/database) (cells, tissues, organs).
- [PROTEIN DATA BANK (PDB)](https://www.rcsb.org/) — 3D structures of proteins, nucleic acids, complexes.
- [UniProt](https://www.uniprot.org/) — Functional information on proteins.
- [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/api-docs) — 3D protein structure predictions.
- [RCSB Protein Data Bank](https://www.rcsb.org/) — Repository for structural data of biological molecules.
- [Critical Assessment of Structure Prediction (CASP)](https://predictioncenter.org/) — Assessing methods for protein structure prediction.
- [Uniclust](https://uniclust.mmseqs.com/) — Clustered protein sequence databases.
- [UniRef](https://www.uniprot.org/uniref/) — Non-redundant sequence [database](/@harrisonqian/awesome/wiki/databases/database) clustering UniProtKB entries at multiple sequence identity thresholds.
- [CATH database](https://www.cathdb.info/) — Hierarchical classification of protein domain structures.
- [SAbDab](https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab) — Structural Antibody [Database](/@harrisonqian/awesome/wiki/databases/database) containing all antibody structures in the PDB.
- [OAS (Observed Antibody Space)](http://opig.stats.ox.ac.uk/webapps/oas/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of antibody sequences from immune repertoire sequencing.
- [InterPro](https://www.ebi.ac.uk/interpro/) — Protein families, domains, and functional sites [database](/@harrisonqian/awesome/wiki/databases/database) integrating 14 member databases including Pfam and PROSITE.
- [Pfam](https://www.ebi.ac.uk/interpro/entry/pfam/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of protein families described by multiple sequence alignments and hidden Markov models.
- [NeXtProt](https://www.nextprot.org/) — Expert knowledge base on human proteins with deep functional annotation, complementary to UniProt.

### Genome

- [ENCODE](https://www.encodeproject.org/) — Encyclopedia of DNA Elements; regulatory and functional genomic elements across the genome.
- [Ensembl](https://www.ensembl.org/) — Genome browser and annotation [database](/@harrisonqian/awesome/wiki/databases/database) for vertebrate and other eukaryotic genomes.
- [Human Genome Resources at NCBI](https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml) — [Database](/@harrisonqian/awesome/wiki/databases/database) for genomics, proteomics, transcriptomics, and systems biology.
- [GenBank](https://www.ncbi.nlm.nih.gov/genbank/) — NCBI's [database](/@harrisonqian/awesome/wiki/databases/database) of genetic sequences.
- [UCSC Genome Browser](https://genome.ucsc.edu/) — UCSC's genome browser.
- [cBioPortal](https://www.cbioportal.org/) — Cancer genomics [database](/@harrisonqian/awesome/wiki/databases/database) aggregating many patient [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets).
- [10x Genomics Dataset](https://www.10xgenomics.com/resources/datasets) — Collection of single-cell [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets).
- [The Genotype-Tissue Expression (GTEx)](https://gtexportal.org/home/) — Human gene expression and regulation resource.
- [Dependency Map (DepMap)](https://depmap.org/portal/) — CRISPR-Cas9 screens in cancer cell lines.
- [Catalogue Of Somatic Mutations In Cancer (COSMIC)](https://cancer.sanger.ac.uk/cosmic) — Resource on somatic mutations in cancers.
- [MGnify](https://www.ebi.ac.uk/metagenomics/) — Resource for metagenomic and metatranscriptomic data.
- [JASPAR](http://jaspar.genereg.net/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of transcription factor binding profiles.
- [gnomAD](https://gnomad.broadinstitute.org/) — Genome Aggregation [Database](/@harrisonqian/awesome/wiki/databases/database); genetic variation from large-scale sequencing projects.
- [Rfam](https://rfam.org/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of RNA families with sequence alignments and consensus structures.
- [ROADMAP Epigenomics](http://www.roadmapepigenomics.org/) — Reference epigenome maps for 111 primary human cell types and tissues, including histone modifications, chromatin accessibility, and DNA methylation.
- [FANTOM5](https://fantom.gsc.riken.jp/5/) — Functional annotation of mammalian genome; comprehensive atlas of active enhancers, promoters, and transcription start sites across human and mouse cell types.

### Disease

- [KEGG DRUG](https://www.genome.jp/kegg/drug/) — Comprehensive, approved drug information.
- [DrugBank](https://go.drugbank.com/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of drugs and targets (University of Alberta).
- [DisGeNET](https://www.disgenet.org/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of gene-disease associations integrating expert-curated and GWAS data.
- [OMIM (Online Mendelian Inheritance in Man)](https://www.omim.org/) — Comprehensive [database](/@harrisonqian/awesome/wiki/databases/database) of human genes and genetic disorders.
- [Open Targets Platform](https://platform.opentargets.org/) — Systematic target identification and prioritization platform integrating genetics, genomics, and drug data for drug discovery.
- [Human Phenotype Ontology (HPO)](https://hpo.jax.org/) — Standardized vocabulary of phenotypic abnormalities in human disease, linking genes, variants, and clinical features.
- [DISEASES](https://diseases.jensenlab.org/) — Gene–disease association [database](/@harrisonqian/awesome/wiki/databases/database) integrating evidence from text mining, curated databases, and experimental data.

### Interaction

#### Drug-Gene Interaction

- [DGIdb](https://www.dgidb.org/) — Drug-gene interactions and the druggable genome.
- [Comparative Toxicogenomics Database](http://ctdbase.org/) — Chemical-gene interactions, chemical-disease and gene-disease associations, chemical-phenotype associations.
- [SNAP](https://snap.stanford.edu/biodata/datasets/10002/10002-ChG-Miner.html) — Dataset of drug-gene interactions.

#### Drug (Cell Line) Response

- [NCI60](https://dtp.cancer.gov/discovery_development/nci-60/) — Focuses on 60 cancer cell lines and many drugs.
- [Genomics of Drug Sensitivity in Cancer (GDSC)](https://www.cancerrxgene.org/) — Drug sensitivity for ~1000 human cancer cell lines and hundreds of compounds.
- [Cancer Cell Line Encyclopedia](https://sites.broadinstitute.org/ccle/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of ~1000 cancer cell lines.
- [CellMiner Cross Database (CellMinerCDB)](https://discover.nci.nih.gov/cellminercdb/) — Integrates multiple cancer cell line databases.

#### Chemical-Protein Interaction

- [STITCH](http://stitch.embl.de/) — Chemical-protein interactions.
- [BindingDB](https://www.bindingdb.org/rwd/bind/index.jsp) — Compounds and target [database](/@harrisonqian/awesome/wiki/databases/database).
- [Davis kinase inhibitors DB](http://staff.cs.utu.fi/~aijrinas/dti/) — Experimental kinase inhibitor binding affinity dataset for protein–ligand interaction research.
- [Kinase Inhibitor Bioactivity Data (KIBA)](https://janeliascicomp.github.io/KIBA/) — Integrated bioactivity scores for kinase inhibitors combining Ki, Kd, and IC50 measurements.
- [PDBBind](https://www.pdbbind-plus.org.cn/) — Binding affinity data for biomolecular complexes.

#### Protein-Protein Interaction

- [STRING](https://string-db.org/) — PPI networks for multiple organisms.
- [BioGRID](https://thebiogrid.org/) — Protein, genetic, and chemical interactions.
- [HIPPIE](http://cbdm-01.zdv.uni-mainz.de/~mschaefer/hippie/) — Human protein-protein interaction [database](/@harrisonqian/awesome/wiki/databases/database).
- [IntAct](https://www.ebi.ac.uk/intact/home) — Open-source molecular interaction [database](/@harrisonqian/awesome/wiki/databases/database) and analysis system from EMBL-EBI.

#### Knowledge Graph

- [Drug Mechanism Database (DrugMechDB)](https://github.com/SuLab/DrugMechDB/tree/2.0.1) — Mechanisms of action from drug to disease.
- [DRKG](https://github.com/gnn4dr/DRKG) — Large-scale biological knowledge graph for drug discovery.
- [Hetionet](https://github.com/hetio/hetionet) — Heterogeneous network integrating genes, diseases, drugs, pathways, and more.
- [PrimeKG](https://github.com/mims-harvard/PrimeKG) — Multi-modal precision medicine knowledge graph integrating clinical, genetic, and drug data.

#### Gene Regulatory Network

- [TRRUST](https://www.grnpedia.org/trrust/) — Manually curated [database](/@harrisonqian/awesome/wiki/databases/database) of human and mouse transcriptional regulatory interactions between transcription factors and their target genes.
- [RegNetwork](http://www.regnetworkweb.org/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of gene regulatory networks covering transcription factor–target gene and miRNA–gene interaction data across multiple species.
- [miRBase](https://www.mirbase.org/) — Reference repository for microRNA gene annotations, sequences, and experimentally validated targets.

### Clinical Trial

- [ClinicalTrials.gov](https://clinicaltrials.gov/) — Privately and publicly funded clinical studies.
- [ICD10](https://icd.who.int/browse10/2019/en) — International Classification of Diseases, 10th revision.
- [EU Drug Regulating Authorities Clinical Trials DB (EudraCT)](https://eudract.ema.europa.eu/) — European clinical trial [database](/@harrisonqian/awesome/wiki/databases/database).
- [MIMIC-IV](https://mimic.mit.edu/) — Freely accessible critical care [database](/@harrisonqian/awesome/wiki/databases/database).

---

## Benchmarks & Datasets

- [1000 Genomes Project](https://www.internationalgenome.org/) — Reference panel of human genetic variation from 2,504 individuals across 26 populations.
- [BACE](https://www.kaggle.com/datasets/gokturkkoch/bace) — Binary classification and regression dataset for β-secretase 1 (BACE-1) inhibitor binding affinity.
- [BEAT AML](https://biodev.github.io/BeatAML2/) — Functional ex vivo drug sensitivity measurements paired with genomics for acute myeloid leukemia.
- [BindingDB Curated Sets](https://www.bindingdb.org/rwd/bind/chemsearch/marvin/SDFdownload.jsp?all_download=yes) — Curated binding affinity [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) for protein–ligand interaction benchmarking.
- [Cancer Therapeutics Response Portal (CTRP)](https://portals.broadinstitute.org/ctrp/) — Drug sensitivity profiles across ~900 cancer cell lines for >400 compounds.
- [ClinTox](https://tdcommons.ai/single_pred_tasks/tox/#clintox) — Clinical toxicity dataset contrasting FDA-approved drugs with those that failed clinical trials due to toxicity.
- [CPTAC (Clinical Proteomic Tumor Analysis Consortium)](https://proteomics.cancer.gov/programs/cptac) — Multi-omic proteogenomic [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) for multiple cancer types linking proteomics with genomics.
- [CrossDocked2020](https://arxiv.org/abs/2001.01037) — Large-scale dataset for structure-based virtual screening.
- [FLIP (Fitness Landscape Inference for Proteins)](https://github.com/J-SNACKKB/FLIP) — Benchmark collection of protein fitness landscape [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) for evaluating protein ML models.
- [Genomics of Drug Sensitivity in Cancer (GDSC)](https://www.cancerrxgene.org/) — Drug sensitivity for ~1000 human cancer cell lines and hundreds of compounds.
- [GuacaMol](https://github.com/BenevolentAI/guacamol) — Benchmark suite for generative molecular design models.
- [LINCS L1000](https://lincsproject.org/LINCS/tools/workflows/find-the-best-place-to-obtain-the-lincs-l1000-data) — Gene expression profiles (978 landmark genes) for >20,000 chemical and genetic perturbations across cell lines.
- [MoleculeNet](http://moleculenet.ai/) — Benchmark [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) for molecular [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning).
- [MOSES](https://github.com/molecularsets/moses) — Benchmarking platform for molecular generation models.
- [NCI60](https://dtp.cancer.gov/discovery_development/nci-60/) — Drug sensitivity benchmark across 60 diverse human cancer cell lines.
- [OGB (Open Graph Benchmark)](https://ogb.stanford.edu/) — Large-scale graph ML benchmark suite including biological [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) such as ogbl-ppa (protein-protein associations) and ogbg-molhiv.
- [OpenBioLink](https://github.com/OpenBioLink/OpenBioLink) — Benchmark [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) for biological knowledge graph completion.
- [PharmGKB](https://www.pharmgkb.org/) — Curated pharmacogenomics dataset linking genetic variants to drug response phenotypes across thousands of drugs.
- [PK-DB](https://pk-db.com/) — Open [database](/@harrisonqian/awesome/wiki/databases/database) of experimental pharmacokinetics (PK) and ADME data from clinical and preclinical studies.
- [PRISM](https://depmap.org/portal/prism/) — Cancer drug sensitivity profiling of >4,500 drugs across >900 cancer cell lines using pooled-cell-line barcoding.
- [ProteinGym](https://github.com/OATML-Markslab/ProteinGym) — Large-scale benchmark of deep mutational scanning assays for evaluating protein fitness landscape models.
- [QM9](https://figshare.com/collections/Quantum_chemistry_structures_and_properties_of_134_kilo_molecules/978904) — Quantum chemistry properties for 134K stable small organic molecules computed at DFT level.
- [scIB (Single-cell Integration Benchmarks)](https://github.com/theislab/scib) — Comprehensive benchmarking framework for single-cell data [integration](/@harrisonqian/awesome/wiki/platforms/integration) methods.
- [SIDER (Side Effect Resource)](http://sideeffects.embl.de/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of 1,430 approved drugs with their recorded adverse drug reactions across 27 system-organ classes.
- [Tabula Muris](https://tabula-muris.ds.czbiohub.org/) — Comprehensive single-cell atlas of 20 mouse organs and tissues, enabling cross-tissue and cross-species comparisons.
- [Tabula Sapiens](https://tabula-sapiens-portal.ds.czbiohub.org/) — Comprehensive human single-cell atlas of ~500K cells from 24 organs and tissues across multiple donors.
- [TAPE (Tasks Assessing Protein Embeddings)](https://github.com/songlab-cal/tape) — Benchmark suite of five biologically meaningful semi-supervised [learning](/@harrisonqian/awesome/wiki/programming-languages/learning) tasks for evaluating protein representations.
- [The Cancer Genome Atlas (TCGA)](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga) — Comprehensive multi-omics (genomics, transcriptomics, proteomics, methylation) dataset for 33 cancer types across ~11,000 patients.
- [Therapeutics Data Commons (TDC)](https://tdcommons.ai/) — Unified benchmark suite covering ADMET, drug-target interaction, drug response, and more.
- [Tox21](https://tripod.nih.gov/tox21/challenge/) — 12,707 compounds tested in 12 nuclear receptor and stress-response pathway biochemical assays for toxicity prediction.
- [UK Biobank](https://www.ukbiobank.ac.uk/) — Large-scale biomedical [database](/@harrisonqian/awesome/wiki/databases/database) of ~500K participants with genetic, imaging, and health data for population genetics and disease studies.

---

## API

- [PubMed E-utilities (esearch/efetch)](https://www.nlm.nih.gov/dataguide/edirect/esearch.html) — APIs for searching and retrieving biomedical literature from PubMed.
- [NCBI E-utilities](https://www.ncbi.nlm.nih.gov/books/NBK25501/) — Unified APIs for accessing NCBI databases (Gene, GEO, SRA, PubChem, etc).
- [UniProt [REST](/@harrisonqian/awesome/wiki/miscellaneous/rest) API](https://www.uniprot.org/help/api) — Programmatic access to protein sequence and functional annotation data.
- [Ensembl [REST](/@harrisonqian/awesome/wiki/miscellaneous/rest) API](https://rest.ensembl.org/) — API for genomic annotations, variants, genes, and comparative genomics.
- [KEGG [REST](/@harrisonqian/awesome/wiki/miscellaneous/rest) API](https://www.kegg.jp/kegg/rest/keggapi.html) — API for accessing KEGG pathways, compounds, genes, and reactions.
- [ChEMBL Web Services](https://www.ebi.ac.uk/chembl/ws) — [REST](/@harrisonqian/awesome/wiki/miscellaneous/rest) API for bioactive molecules, targets, and bioassays.
- [Open Targets Platform API](https://platform.opentargets.org/api) — API for target–disease associations integrating genetics, genomics, and drug data.
- [ClinicalTrials.gov API](https://clinicaltrials.gov/api/gui) — API for querying clinical trial metadata and results.
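
As a rough illustration of how the REST-style services above are typically queried, the sketch below builds request URLs for two of them using only the Python standard library. The helper names `esearch_url` and `ensembl_lookup_url` are hypothetical and not part of any listed project. The endpoint paths (`esearch.fcgi` for E-utilities and `lookup/id/` for Ensembl) follow each service's public documentation as best understood here, so check the current docs before relying on them.

```python
from urllib.parse import urlencode

# Base URLs as given in each service's public documentation (assumed current).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
ENSEMBL_BASE = "https://rest.ensembl.org"


def esearch_url(term: str, db: str = "pubmed", retmax: int = 20) -> str:
    """Build an E-utilities esearch URL that asks for matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def ensembl_lookup_url(stable_id: str) -> str:
    """Build an Ensembl REST lookup URL for a stable ID (for example a gene ID)."""
    query = urlencode({"content-type": "application/json"})
    return f"{ENSEMBL_BASE}/lookup/id/{stable_id}?{query}"


if __name__ == "__main__":
    # Only the URLs are printed; no network request is made in this sketch.
    print(esearch_url("BRCA1 AND breast cancer", retmax=5))
    print(ensembl_lookup_url("ENSG00000139618"))
```

Either URL can then be passed to an HTTP client such as `urllib.request.urlopen`. Keeping URL construction separate from the network call makes the query parameters easy to inspect and test offline.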

---

## Preprocessing Tools

- [Chemistry Development Kit](https://github.com/cdk/cdk) — [Cheminformatics](/@harrisonqian/awesome/wiki/miscellaneous/cheminformatics) software & [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) tools.
- [Biopython](https://biopython.org/) — Collection of [Python](/@harrisonqian/awesome/wiki/programming-languages/python) tools for biological computation including sequence analysis, structure parsing, and [database](/@harrisonqian/awesome/wiki/databases/database) access.
- [FlashDeconv](https://github.com/cafferychen777/flashdeconv) — High-performance spatial transcriptomics deconvolution (~1M spots in ~3 min).
- [RDKit](https://github.com/rdkit/rdkit) — [Cheminformatics](/@harrisonqian/awesome/wiki/miscellaneous/cheminformatics) software & [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) toolkit.
- [DeepChem](https://github.com/deepchem/deepchem) — [Deep learning](/@harrisonqian/awesome/wiki/computer-science/deep-learning) library for drug discovery, quantum chemistry, and materials science.
- [ChatSpatial](https://github.com/cafferychen777/ChatSpatial) — MCP server for spatial transcriptomics analysis via natural language.
- [Scanpy](https://scanpy.readthedocs.io/en/stable/) — [Python](/@harrisonqian/awesome/wiki/programming-languages/python) library for scRNA-seq analysis.
- [Seurat](https://satijalab.org/seurat/) — R library for scRNA-seq analysis.
- [scvi-tools](https://scvi-tools.org/) — Probabilistic models for single-cell omics data analysis.
- [CellTypist](https://github.com/Teichlab/celltypist) — Automated cell type annotation for scRNA-seq.
- [Squidpy](https://squidpy.readthedocs.io/) — [Python](/@harrisonqian/awesome/wiki/programming-languages/python) library for spatial single-cell analysis.
- [GROMACS](https://www.gromacs.org/) — Molecular dynamics simulation package for biochemical molecules.
- [MDAnalysis](https://www.mdanalysis.org/) — [Python](/@harrisonqian/awesome/wiki/programming-languages/python) library for analyzing and altering molecular dynamics simulation trajectories.
- [OpenMM](https://openmm.org/) — High-performance toolkit for molecular simulation and GPU-accelerated MD.
- [scVelo](https://github.com/theislab/scvelo) — RNA velocity estimation for single-cell transcriptomics, inferring the direction and speed of cell differentiation.
- [STAR](https://github.com/alexdobin/STAR) — Ultrafast universal RNA-seq aligner with support for spliced alignment and single-cell quantification via STARsolo.
- [kallisto](https://pachterlab.github.io/kallisto/) — Near-optimal RNA-seq quantification using pseudoalignment for fast transcript abundance estimation.
- [Harmony](https://github.com/immunogenomics/harmony) — Fast and scalable [integration](/@harrisonqian/awesome/wiki/platforms/integration) of single-cell data across [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets), conditions, technologies, and species.
- [Monocle3](https://cole-trapnell-lab.github.io/monocle3/) — Single-cell trajectory analysis tool for [learning](/@harrisonqian/awesome/wiki/programming-languages/learning) developmental trajectories and ordering cells in pseudotime.
- [CellChat](https://github.com/sqjin/CellChat) — Inference and analysis of cell-cell communication ligand-receptor networks from single-cell transcriptomics data.
- [SCENIC](https://github.com/aertslab/SCENIC) — Single-cell regulatory network inference and clustering linking transcription factors to co-expressed gene modules.
- [DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder) — [Machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) approach for detecting multiplet (doublet) artifacts in single-cell RNA-seq data.

---

## Machine Learning Tasks and Models

### Drug Discovery

#### Drug Response Prediction

- [drGAT](https://github.com/inoue0426/drGAT) — Attention-based model for drug response prediction with gene explainability.
- [MOFGCN](https://github.com/weiba/MOFGCN/tree/main) — GCN + heterogeneous network.
- [DeepDSC](https://ieeexplore.ieee.org/document/8723620) — Autoencoder + fully connected NN.
- [DGDRP](https://github.com/minwoopak/heteronet) — Multi-view embedding neural network.
- [DeepAEG](https://github.com/zhejiangzhuque/DeepAEG) — GNN embedding + attention mechanism.
- [RECOVER](https://github.com/RECOVERcoalition/Recover) — [Machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) framework for predicting synergistic drug combination responses across cell lines.
- [TGSA](https://github.com/violet-sto/TGSA) — Tumor gene set and attention-based model leveraging biological pathway knowledge for drug response prediction.
- [HiDRA](https://github.com/bsml320/HiDRA) — Hierarchical network model incorporating gene and pathway-level information for cancer drug response prediction.

#### Drug Repurposing

- [DeepPurpose](https://github.com/kexinhuang12345/DeepPurpose) — [Deep learning](/@harrisonqian/awesome/wiki/computer-science/deep-learning) library for drug repurposing.

#### Drug Target Interaction

- [NeoDTI](https://github.com/FangpingWan/NeoDTI) — Library for drug-target interaction prediction.
- [DTINet](https://github.com/luoyunan/DTINet) — Network-based framework integrating heterogeneous biological data for DTI prediction.
- [DeepDTA](https://github.com/hkmztrk/DeepDTA) — [Deep learning](/@harrisonqian/awesome/wiki/computer-science/deep-learning) model using CNNs on protein sequences and drug SMILES.
- [GraphDTA](https://github.com/thinng/GraphDTA) — Graph neural network–based DTI prediction using molecular graphs.
- [MolTrans](https://github.com/kexinhuang12345/MolTrans) — Transformer-based DTI model leveraging molecular substructures.
- [DrugBAN](https://github.com/peizhenbai/DrugBAN) — Bilinear attention network for interpretable DTI prediction.

#### Compound-Protein Interaction

- [MCPINN](https://github.com/mhlee0903/multi_channels_PINN) — Drug discovery via compound-protein interaction and [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning).
- [TransformerCPI](https://github.com/lifanchen-simm/transformerCPI) — CPI prediction using Transformer.

#### Molecular Generation

- [REINVENT](https://github.com/MolecularAI/Reinvent) — Reinforcement [learning](/@harrisonqian/awesome/wiki/programming-languages/learning) for de novo drug design.
- [MolGPT](https://github.com/devalab/molgpt) — Transformer-based model for molecular generation.
- [Molecular Transformer](https://github.com/pschwllr/MolecularTransformer) — Sequence-to-sequence model for chemical reaction prediction.
- [TargetDiff](https://github.com/guanjq/targetdiff) — 3D equivariant diffusion model for structure-based drug design.
- [DiffDock](https://github.com/gcorso/DiffDock) — Diffusion generative model for molecular docking, predicting the binding pose of small molecules to protein targets.
- [JTVAE](https://github.com/wengong-jin/icml18-jtnn) — Junction tree variational autoencoder for molecular graph generation that guarantees chemical validity via a hierarchical tree decomposition.

### LLM for Biology

- [AI4Chem/ChemLLM-7B-Chat](https://huggingface.co/AI4Chem/ChemLLM-7B-Chat) — LLM for chemical & molecular science.
- [BioGPT](https://github.com/microsoft/BioGPT) — LLM for biomedical text generation.
- [GeneGPT](https://github.com/ncbi/GeneGPT) — LLM for biomedical information, integrated with various APIs.
- [GenePT](https://github.com/yiqunchen/GenePT) — Foundation LLM for single-cell data.
- [scPRINT](https://github.com/cantinilab/scPRINT) — Pretrained on 50M cells for scRNA-seq denoising & zero imputation.
- [ClawBio](https://github.com/ClawBio/ClawBio) — [Bioinformatics](/@harrisonqian/awesome/wiki/miscellaneous/bioinformatics)-native AI agent skill library with local-first pharmacogenomics, ancestry PCA, semantic similarity, nutrigenomics, and metagenomics skills.
- [BioMedLM](https://huggingface.co/stanford-crfm/BioMedLM) — 2.7B parameter GPT-2-style language model trained exclusively on biomedical literature from PubMed for biomedical [question answering](/@harrisonqian/awesome/wiki/computer-science/question-answering) and text generation.
- [MolT5](https://github.com/blender-nlp/MolT5) — Language model for molecular tasks bridging text and SMILES, enabling molecule captioning and text-driven molecule generation.
- [ChatDrug](https://github.com/chao1224/ChatDrug) — LLM-based conversational pipeline for drug discovery, using natural language prompts for iterative drug editing and optimization.

### Foundation Models

#### Single-cell Foundation Models

##### Transcriptomics Foundation Models

- [scFoundation](https://github.com/biomap-research/scFoundation) — Large-scale foundation model for single-cell gene expression, enabling multiple downstream tasks.
- [scGPT](https://github.com/bowang-lab/scGPT) — Transformer-based foundation model pretrained on millions of single-cell profiles.
- [Geneformer](https://huggingface.co/ctheodoris/Geneformer) — Context-aware, attention-based [deep learning](/@harrisonqian/awesome/wiki/computer-science/deep-learning) model pretrained on a large corpus of single-cell transcriptomes.
- [BulkFormer](https://github.com/KangBoming/BulkFormer) — Foundation model for bulk RNA-seq data; learns general transcriptomic representations.
- [scBERT](https://github.com/TencentAILabHealthcare/scBERT) — BERT-based foundation model pretrained on large-scale scRNA-seq data for cell type annotation.
- [CellPLM](https://github.com/OmicsML/CellPLM) — Cell pre-trained language model with inter-cell transformer architecture for diverse single-cell analysis tasks.
- [UCE](https://github.com/snap-stanford/UCE) — Universal Cell Embeddings: zero-shot single-cell embedding model trained on 36M cells across species, tissues, and assays without fine-tuning.
- [GEARS](https://github.com/snap-stanford/GEARS) — Graph-based model for predicting transcriptional responses to single and combinatorial genetic perturbations using biological priors.

##### Spatial Foundation Models

- [GigaPath](https://github.com/prov-gigapath/prov-gigapath) — Slide-level digital pathology foundation model pretrained on 1.3 billion pathology image tiles from whole-slide images.
- [UNI](https://github.com/mahmoodlab/UNI) — General-purpose self-supervised pathology foundation model trained on 100K+ whole-slide images for diverse computational pathology tasks.
- [CONCH](https://github.com/mahmoodlab/CONCH) — Vision-language foundation model for computational pathology trained with contrastive captioning on pathology image–text pairs.
- [Phikon](https://huggingface.co/owkin/phikon) — ViT-based pathology foundation model pretrained with iBOT self-supervision on TCGA whole-slide images.

##### Multi-Omics Foundation Models

- [scMulan](https://github.com/SuperBianC/scMulan) — Single-cell multi-omic language model pretrained on ~10M cells spanning transcriptomics, epigenomics, and proteomics for cross-omics transfer tasks.
- [totalVI](https://github.com/scverse/scvi-tools) — Probabilistic framework for joint analysis of paired scRNA-seq and protein (CITE-seq) data enabling multi-modal cell state representation across single-cell [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets).
- [MultiVI](https://github.com/scverse/scvi-tools) — Multi-modal variational autoencoder for integrating paired and unpaired single-cell RNA-seq and ATAC-seq measurements into a unified latent space.
- [MIRA](https://github.com/cistrome/MIRA) — Probabilistic multimodal topic model jointly modeling single-cell transcriptomics and chromatin accessibility for regulatory network inference.
- [GLUE](https://github.com/gao-lab/GLUE) — Graph-Linked Unified Embedding framework for unpaired single-cell multi-omics data integration across RNA, ATAC, methylation, and protein modalities.
- [BABEL](https://github.com/wukevin/babel) — Cross-modality translation model enabling prediction between scRNA-seq and scATAC-seq profiles without requiring paired single-cell measurements.
- [Multigrate](https://github.com/theislab/multigrate) — Asymmetric multi-omics variational autoencoder for integrating single-cell data across RNA, ATAC, and protein modalities with missing-modality support.
- [MOFA+](https://github.com/bioFAM/MOFA2) — Multi-Omics Factor Analysis framework identifying shared axes of variation across bulk and single-cell [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) including RNA, ATAC, proteomics, methylation, and copy number.
- [GeneCompass](https://github.com/xCompass-AI/GeneCompass) — Large-scale foundation model integrating DNA regulatory sequences and single-cell transcriptomics from 120M+ cells across multiple species for gene regulation prediction.
- [UnitedNet](https://github.com/LiuLab-Bioelectronics-Harvard/UnitedNet) — Interpretable multi-task deep neural network for single-cell multi-omics integration spanning transcriptomics, chromatin accessibility, and proteomics.
- [SpatialGlue](https://github.com/zhanglabtools/SpatialGlue) — Graph attention network for spatial multi-omics integration jointly embedding spatial transcriptomics with chromatin accessibility or proteomics.
- [MIDAS](https://github.com/labomics/midas) — Mosaic integration and differential accessibility model for single-cell multi-omics data that handles arbitrary missing-modality combinations across transcriptomics, chromatin accessibility, and proteomics.

##### Domain Alignment

- [scArches](https://github.com/theislab/scarches) — Transfer learning framework for mapping new single-cell [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) onto pre-trained reference atlases across batches, conditions, and modalities.
- [TOSICA](https://github.com/JackieHanlaopo/TOSICA) — Transformer-based framework for one-stop interpretable cell-type annotation supporting cross-dataset and cross-species transfer.

#### Protein Foundation Models

##### Pre-trained Embedding

- [Evolutionary Scale Modeling (ESM)](https://github.com/facebookresearch/esm) — Protein embeddings.
- [ChemBERTa-2](https://github.com/seyonechithrananda/bert-loves-chemistry) — Chemical embeddings & prediction.
- [ProtTrans](https://github.com/agemagician/ProtTrans) — Suite of protein language models (ProtBERT, ProtT5, ProtXLNet) trained on billions of protein sequences from UniRef and BFD.
- [ProGen2](https://github.com/salesforce/progen) — Protein language model trained on diverse protein families for sequence generation and fitness prediction.
- [Ankh](https://github.com/agemagician/Ankh) — Efficient protein language model optimized for downstream prediction tasks including secondary structure, localization, and function annotation.

##### Protein Structure Prediction and Design

- [AlphaFold3](https://github.com/google-deepmind/alphafold3) — Predicts structures of proteins, nucleic acids, small molecules, and their complexes.
- [Boltz-1](https://github.com/jwohlwend/boltz) — Open-source all-atom biomolecular structure prediction model for proteins, nucleic acids, small molecules, and their complexes, reported to reach AlphaFold3-level accuracy.
- [Chai-1](https://github.com/chaidiscovery/chai-lab) — Unified molecular structure prediction model covering proteins, nucleic acids, small molecules, and complexes.
- [ESM3](https://github.com/evolutionaryscale/esm) — Multimodal protein language model that jointly reasons over sequence, structure, and function for generative protein design and engineering.
- [ESMFold](https://github.com/facebookresearch/esm) — Fast protein structure prediction using language model embeddings.
- [RFdiffusion](https://github.com/RosettaCommons/RFdiffusion) — Generative model for protein backbone design using diffusion.
- [ProteinMPNN](https://github.com/dauparas/ProteinMPNN) — [Deep learning](/@harrisonqian/awesome/wiki/computer-science/deep-learning) model for protein sequence design given backbone structure.
- [OmegaFold](https://github.com/HeliXonProtein/OmegaFold) — High-resolution de novo protein structure prediction from sequence.
- [RoseTTAFold](https://github.com/RosettaCommons/RoseTTAFold) — Three-track neural network for protein structure prediction.
- [OpenFold](https://github.com/aqlaboratory/openfold) — Trainable, memory-efficient open-source reproduction of AlphaFold2 enabling custom protein structure prediction workflows.
- [SaProt](https://github.com/westlake-repl/SaProt) — Structure-aware protein language model whose tokens encode both sequence and backbone geometry for improved function prediction.
- [EvoDiff](https://github.com/microsoft/evodiff) — Discrete diffusion framework for protein sequence generation trained on evolutionary-scale data, supporting unconditional generation, disordered region design, and functional motif scaffolding. [ [paper-2023](https://www.biorxiv.org/content/10.1101/2023.09.11.556673v1) ]

#### Multi-Modal Foundation Models

- [CHIEF](https://github.com/hms-dbmi/CHIEF) — Clinical Histopathology Imaging Evaluation Foundation model integrating histology images and clinical context for pan-cancer analysis.
- [BiomedCLIP](https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_256-vit_g_14) — CLIP-based vision-language foundation model for biomedical images and text trained on PubMed figure–caption pairs.

#### Genomics Foundation Models

- [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer) — Foundation model for genomic sequences across multiple species.
- [DNABERT](https://github.com/jerryji1993/DNABERT) — Pre-trained bidirectional encoder for DNA sequence analysis.
- [DNABERT-2](https://github.com/Zhihan1996/DNABERT_2) — Improved genome foundation model with efficient tokenization.
- [Enformer](https://github.com/deepmind/deepmind-research/tree/master/enformer) — Transformer model predicting gene expression from DNA sequence.
- [Basenji](https://github.com/calico/basenji) — Sequential regulatory activity prediction from DNA sequences.
- [Caduceus](https://github.com/kuleshov-group/caduceus) — Bidirectional equivariant long-range DNA sequence model based on Mamba.
- [Evo](https://github.com/evo-design/evo) — Long-context genomic foundation model (up to 1M tokens).
- [HyenaDNA](https://github.com/HazyResearch/hyena-dna) — Long-range genomic foundation model handling sequences up to 1M tokens, using sub-quadratic Hyena operators in place of attention.
- [Borzoi](https://github.com/calico/borzoi) — Extended successor to Enformer for predicting RNA-seq coverage from long genomic sequence windows (524 kb) with improved resolution.
- [DeepSEA](http://deepsea.princeton.edu/) — [Deep learning](/@harrisonqian/awesome/wiki/computer-science/deep-learning) framework for predicting chromatin effects of sequence alterations with single-nucleotide sensitivity across thousands of chromatin features.
- [Sei](https://github.com/FunctionLab/sei-framework) — Sequence-to-function framework learning a genome-wide regulatory activity code from DNA sequences for variant effect prediction.
- [GPN (Genomic Pre-trained Network)](https://github.com/songlab-cal/gpn) — Masked language model for DNA sequences enabling zero-shot variant effect prediction without requiring functional annotations.

---