---
visibility: public
---

# Computational Biology

**repo:** [inoue0426/awesome-computational-biology](https://github.com/inoue0426/awesome-computational-biology)

**category:** [[miscellaneous|Miscellaneous]]

---

# Awesome Computational Biology

A curated collection of databases, software, and papers related to computational biology.

> Computational biology involves the development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, ecological, behavioural, and social systems. — [Wikipedia](https://en.wikipedia.org/wiki/Computational_biology)

---

## Interface

Browse and search the resources via the [GitHub Pages UI](https://inoue0426.github.io/awesome-computational-biology/).

---

## Table of Contents

- [Awesome Computational Biology](#awesome-computational-biology)
- [Table of Contents](#table-of-contents)
- [Databases](#databases)
- [scRNA](#scrna)
- [Compound](#compound)
- [Pathway](#pathway)
- [Mass Spectra](#mass-spectra)
- [Protein](#protein)
- [Genome](#genome)
- [Disease](#disease)
- [Interaction](#interaction)
- [Drug-Gene Interaction](#drug-gene-interaction)
- [Drug (Cell Line) Response](#drug-cell-line-response)
- [Chemical-Protein Interaction](#chemical-protein-interaction)
- [Protein-Protein Interaction](#protein-protein-interaction)
- [Knowledge Graph](#knowledge-graph)
- [Gene Regulatory Network](#gene-regulatory-network)
- [Clinical Trial](#clinical-trial)
- [Benchmarks & Datasets](#benchmarks--datasets)
- [API](#api)
- [Preprocessing Tools](#preprocessing-tools)
- [Machine Learning Tasks and Models](#machine-learning-tasks-and-models)
- [Drug Discovery](#drug-discovery)
- [Drug Response Prediction](#drug-response-prediction)
- [Drug Repurposing](#drug-repurposing)
- [Drug Target Interaction](#drug-target-interaction)
- [Compound-Protein Interaction](#compound-protein-interaction)
- [Molecular Generation](#molecular-generation)
- [LLM for Biology](#llm-for-biology)
- [Foundation Models](#foundation-models)
- [Single-cell Foundation Models](#single-cell-foundation-models)
- [Transcriptomics Foundation Models](#transcriptomics-foundation-models)
- [Spatial Foundation Models](#spatial-foundation-models)
- [Multi-Omics Foundation Models](#multi-omics-foundation-models)
- [Domain Alignment](#domain-alignment)
- [Protein Foundation Models](#protein-foundation-models)
- [Pre-trained Embedding](#pre-trained-embedding)
- [Protein Structure Prediction and Design](#protein-structure-prediction-and-design)
- [Multi-Modal Foundation Models](#multi-modal-foundation-models)
- [Genomics Foundation Models](#genomics-foundation-models)

---

## Databases

### scRNA

- [CZ CELLxGENE](https://cellxgene.cziscience.com/) — Single-cell dataset repository and interactive explorer from the Chan Zuckerberg Initiative.
- [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/) — Public functional genomics [database](/@harrisonqian/awesome/wiki/databases/database).
- [Human Cell Atlas](https://www.humancellatlas.org/) — Open global atlas of all cells in the human body.
- [Single Cell PORTAL](https://singlecell.broadinstitute.org/single_cell) — Public [database](/@harrisonqian/awesome/wiki/databases/database) for single-cell RNA.
- [Single Cell Expression Atlas](https://www.ebi.ac.uk/gxa/sc/home) — Public [database](/@harrisonqian/awesome/wiki/databases/database) for single-cell RNA.

### Compound

- [PubChem](https://pubchem.ncbi.nlm.nih.gov/) — One of the largest chemical databases (compounds, genes, and proteins).
- [ChEBI](https://www.ebi.ac.uk/chebi/) — [Database](/@harrisonqian/awesome/wiki/databases/database) focused on small chemical compounds.
- [ChEMBL](https://www.ebi.ac.uk/chembl/) — Bioactive molecules with drug-like properties.
- [ChemSpider](http://www.chemspider.com/) — Chemical structure [database](/@harrisonqian/awesome/wiki/databases/database).
- [DrugTargetCommons](https://drugtargetcommons.fimm.fi/) — Community platform for curating and integrating experimental bioactivity data across drugs and targets.
- [HMDB (Human Metabolome Database)](https://hmdb.ca/) — Comprehensive [database](/@harrisonqian/awesome/wiki/databases/database) of small molecule metabolites found in the human body.
- [KEGG COMPOUND](https://www.genome.jp/kegg/compound/) — Collection of small molecules and biopolymers.
- [LIPID MAPS](https://www.lipidmaps.org/databases/lmsd/overview) — [Database](/@harrisonqian/awesome/wiki/databases/database) of lipids.
- [Rhea](https://www.rhea-db.org/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of chemical reactions.
- [DrugCentral](http://drugcentral.org/) — Online drug compendium with drug mode of action and indication information.
- [Drug Repurposing Hub](https://repo-hub.broadinstitute.org/repurposing#download-data) — Collections of drug repurposing data (drug, MoA, target, etc).
- [Therapeutic Target Database](https://idrblab.net/ttd/full-data-download) — Drug-target, target-disease, and drug-disease [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets).
- [ZINC ligand discovery database](https://zinc.docking.org/) — Free [database](/@harrisonqian/awesome/wiki/databases/database) of commercially-available compounds for virtual screening.

### Pathway

- [PathwayCommons](https://www.pathwaycommons.org/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of pathways and interactions.
- [KEGG PATHWAY](https://www.genome.jp/kegg/pathway.html) — Collection of pathway maps.
- [WikiPathways](https://wikipathways.org/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of biological pathways.
- [Reactome](https://reactome.org/) — Expert-curated, peer-reviewed pathway [database](/@harrisonqian/awesome/wiki/databases/database) with detailed reaction mechanisms.
- [BioCyc](https://biocyc.org/) — Collection of pathway/genome databases across thousands of organisms.
- [SIGNOR](https://signor.uniroma2.it/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of causal signaling interactions and pathways.
- [MSigDB (Molecular Signatures Database)](https://www.gsea-msigdb.org/gsea/msigdb) — Curated gene sets derived from pathways and biological processes.

### Mass Spectra

- [MassBank](http://www.massbank.jp/) — Open source databases and tools for mass spectrometry reference spectra.
- [MoNA MassBank of North America](https://mona.fiehnlab.ucdavis.edu/) — Meta-[database](/@harrisonqian/awesome/wiki/databases/database) of metabolite mass spectra, metadata, and associated compounds.

### Protein

- [THE HUMAN PROTEIN ATLAS](https://www.proteinatlas.org/) — Comprehensive human protein [database](/@harrisonqian/awesome/wiki/databases/database) (cells, tissues, organs).
- [PROTEIN DATA BANK (PDB)](https://www.rcsb.org/) — 3D structures of proteins, nucleic acids, complexes.
- [UniProt](https://www.uniprot.org/) — Functional information on proteins.
- [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/api-docs) — 3D protein structure predictions.
- [RCSB Protein Data Bank](https://www.rcsb.org/) — Repository for structural data of biological molecules.
- [Critical Assessment of Structure Prediction (CASP)](https://predictioncenter.org/) — Assessing methods for protein structure prediction.
- [Uniclust](https://uniclust.mmseqs.com/) — Clustered protein sequence databases.
- [UniRef](https://www.uniprot.org/uniref/) — Non-redundant sequence [database](/@harrisonqian/awesome/wiki/databases/database) clustering UniProtKB entries at multiple sequence identity thresholds.
- [CATH database](https://www.cathdb.info/) — Hierarchical classification of protein domain structures.
- [SAbDab](https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab) — Structural Antibody [Database](/@harrisonqian/awesome/wiki/databases/database) containing all antibody structures in the PDB.
- [OAS (Observed Antibody Space)](http://opig.stats.ox.ac.uk/webapps/oas/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of antibody sequences from immune repertoire sequencing.
- [InterPro](https://www.ebi.ac.uk/interpro/) — Protein families, domains, and functional sites [database](/@harrisonqian/awesome/wiki/databases/database) integrating 14 member databases including Pfam and PROSITE.
- [Pfam](https://www.ebi.ac.uk/interpro/entry/pfam/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of protein families described by multiple sequence alignments and hidden Markov models.
- [NeXtProt](https://www.nextprot.org/) — Expert knowledge base on human proteins with deep functional annotation, complementary to UniProt.

### Genome

- [ENCODE](https://www.encodeproject.org/) — Encyclopedia of DNA Elements; regulatory and functional genomic elements across the genome.
- [Ensembl](https://www.ensembl.org/) — Genome browser and annotation [database](/@harrisonqian/awesome/wiki/databases/database) for vertebrate and other eukaryotic genomes.
- [Human Genome Resources at NCBI](https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml) — [Database](/@harrisonqian/awesome/wiki/databases/database) for genomics, proteomics, transcriptomics, and systems biology.
- [GenBank](https://www.ncbi.nlm.nih.gov/genbank/) — NCBI's [database](/@harrisonqian/awesome/wiki/databases/database) of genetic sequences.
- [UCSC Genome Browser](https://genome.ucsc.edu/) — UCSC's genome browser.
- [cBioPortal](https://www.cbioportal.org/) — Cancer genomics [database](/@harrisonqian/awesome/wiki/databases/database); aggregating many patient [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets).
- [10x Genomics Dataset](https://www.10xgenomics.com/resources/datasets) — Collection of single-cell [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets).
- [The Genotype-Tissue Expression (GTEx)](https://gtexportal.org/home/) — Human gene expression and regulation resource.
- [Dependency Map (DepMap)](https://depmap.org/portal/) — CRISPR-Cas9 screens in cancer cell lines.
- [Catalogue Of Somatic Mutations In Cancer (COSMIC)](https://cancer.sanger.ac.uk/cosmic) — Resource on somatic mutations in cancers.
- [MGnify](https://www.ebi.ac.uk/metagenomics/) — Resource for metagenomic and metatranscriptomic data.
- [JASPAR](http://jaspar.genereg.net/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of transcription factor binding profiles.
- [gnomAD](https://gnomad.broadinstitute.org/) — Genome Aggregation [Database](/@harrisonqian/awesome/wiki/databases/database); genetic variation from large-scale sequencing projects.
- [Rfam](https://rfam.org/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of RNA families with sequence alignments and consensus structures.
- [ROADMAP Epigenomics](http://www.roadmapepigenomics.org/) — Reference epigenome maps for 111 primary human cell types and tissues, including histone modifications, chromatin accessibility, and DNA methylation.
- [FANTOM5](https://fantom.gsc.riken.jp/5/) — Functional annotation of mammalian genome; comprehensive atlas of active enhancers, promoters, and transcription start sites across human and mouse cell types.

### Disease

- [KEGG DRUG](https://www.genome.jp/kegg/drug/) — Comprehensive, approved drug information.
- [DrugBank](https://go.drugbank.com/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of drugs and targets (University of Alberta).
- [DisGeNET](https://www.disgenet.org/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of gene-disease associations integrating expert-curated and GWAS data.
- [OMIM (Online Mendelian Inheritance in Man)](https://www.omim.org/) — Comprehensive [database](/@harrisonqian/awesome/wiki/databases/database) of human genes and genetic disorders.
- [Open Targets Platform](https://platform.opentargets.org/) — Systematic target identification and prioritization platform integrating genetics, genomics, and drug data for drug discovery.
- [Human Phenotype Ontology (HPO)](https://hpo.jax.org/) — Standardized vocabulary of phenotypic abnormalities in human disease, linking genes, variants, and clinical features.
- [DISEASES](https://diseases.jensenlab.org/) — Gene–disease association [database](/@harrisonqian/awesome/wiki/databases/database) integrating evidence from text mining, curated databases, and experimental data.

### Interaction

#### Drug-Gene Interaction

- [DGIdb](https://www.dgidb.org/) — Drug-gene interactions and the druggable genome.
- [Comparative Toxicogenomics Database](http://ctdbase.org/) — Chemical-gene interactions, chemical-disease and gene-disease associations, chemical-phenotype associations.
- [SNAP](https://snap.stanford.edu/biodata/datasets/10002/10002-ChG-Miner.html) — Dataset of drug-gene interactions.

#### Drug (Cell Line) Response

- [NCI60](https://dtp.cancer.gov/discovery_development/nci-60/) — Focuses on 60 cancer cell lines and many drugs.
- [Genomics of Drug Sensitivity in Cancer (GDSC)](https://www.cancerrxgene.org/) — Drug sensitivity for ~1000 human cancer cell lines and hundreds of compounds.
- [Cancer Cell Line Encyclopedia](https://sites.broadinstitute.org/ccle/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of ~1000 cancer cell lines.
- [CellMiner Cross Database (CellMinerCDB)](https://discover.nci.nih.gov/cellminercdb/) — Integrates multiple cancer cell line databases.

#### Chemical-Protein Interaction

- [STITCH](http://stitch.embl.de/) — Chemical-protein interactions.
- [BindingDB](https://www.bindingdb.org/rwd/bind/index.jsp) — Compounds and target [database](/@harrisonqian/awesome/wiki/databases/database).
- [Davis kinase inhibitors DB](http://staff.cs.utu.fi/~aijrinas/dti/) — Experimental kinase inhibitor binding affinity dataset for protein–ligand interaction research.
- [Kinase Inhibitor Bioactivity Data (KIBA)](https://janeliascicomp.github.io/KIBA/) — Integrated bioactivity scores for kinase inhibitors combining Ki, Kd, and IC50 measurements.
- [PDBBind](https://www.pdbbind-plus.org.cn/) — Binding affinity data for biomolecular complexes.

#### Protein-Protein Interaction

- [STRING](https://string-db.org/) — PPI networks for multiple organisms.
- [BioGRID](https://thebiogrid.org/) — Protein, genetic, and chemical interactions.
- [HIPPIE](http://cbdm-01.zdv.uni-mainz.de/~mschaefer/hippie/) — Human protein-protein interaction [database](/@harrisonqian/awesome/wiki/databases/database).
- [IntAct](https://www.ebi.ac.uk/intact/home) — Open-source molecular interaction [database](/@harrisonqian/awesome/wiki/databases/database) and analysis system from EMBL-EBI.

#### Knowledge Graph

- [Drug Mechanism Database (DrugMechDB)](https://github.com/SuLab/DrugMechDB/tree/2.0.1) — Mechanisms of action from drug to disease.
- [DRKG](https://github.com/gnn4dr/DRKG) — Large-scale biological knowledge graph for drug discovery.
- [Hetionet](https://github.com/hetio/hetionet) — Heterogeneous network integrating genes, diseases, drugs, pathways, and more.
- [PrimeKG](https://github.com/mims-harvard/PrimeKG) — Multi-modal precision medicine knowledge graph integrating clinical, genetic, and drug data.

#### Gene Regulatory Network

- [TRRUST](https://www.grnpedia.org/trrust/) — Manually curated [database](/@harrisonqian/awesome/wiki/databases/database) of human and mouse transcriptional regulatory interactions between transcription factors and their target genes.
- [RegNetwork](http://www.regnetworkweb.org/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of gene regulatory networks covering transcription factor–target gene and miRNA–gene interaction data across multiple species.
- [miRBase](https://www.mirbase.org/) — Reference repository for microRNA gene annotations, sequences, and experimentally validated targets.

### Clinical Trial

- [ClinicalTrials.gov](https://clinicaltrials.gov/) — Privately and publicly funded clinical studies.
- [ICD10](https://icd.who.int/browse10/2019/en) — International Classification of Diseases, 10th revision.
- [EU Drug Regulating Authorities Clinical Trials DB (EudraCT)](https://eudract.ema.europa.eu/) — European clinical trial [database](/@harrisonqian/awesome/wiki/databases/database).
- [MIMIC-IV](https://mimic.mit.edu/) — Freely accessible critical care [database](/@harrisonqian/awesome/wiki/databases/database).

---

## Benchmarks & Datasets

- [1000 Genomes Project](https://www.internationalgenome.org/) — Reference panel of human genetic variation from 2,504 individuals across 26 populations.
- [BACE](https://www.kaggle.com/datasets/gokturkkoch/bace) — Binary classification and regression dataset for β-secretase 1 (BACE-1) inhibitor binding affinity.
- [BEAT AML](https://biodev.github.io/BeatAML2/) — Functional ex vivo drug sensitivity measurements paired with genomics for acute myeloid leukemia.
- [BindingDB Curated Sets](https://www.bindingdb.org/rwd/bind/chemsearch/marvin/SDFdownload.jsp?all_download=yes) — Curated binding affinity [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) for protein–ligand interaction benchmarking.
- [Cancer Therapeutics Response Portal (CTRP)](https://portals.broadinstitute.org/ctrp/) — Drug sensitivity profiles across ~900 cancer cell lines for >400 compounds.
- [ClinTox](https://tdcommons.ai/single_pred_tasks/tox/#clintox) — Clinical toxicity dataset contrasting FDA-approved drugs with those that failed clinical trials due to toxicity.
- [CPTAC (Clinical Proteomic Tumor Analysis Consortium)](https://proteomics.cancer.gov/programs/cptac) — Multi-omic proteogenomic [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) for multiple cancer types linking proteomics with genomics.
- [CrossDocked2020](https://arxiv.org/abs/2001.01037) — Large-scale dataset for structure-based virtual screening.
- [FLIP (Fitness Landscape Inference for Proteins)](https://github.com/J-SNACKKB/FLIP) — Benchmark collection of protein fitness landscape [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) for evaluating protein ML models.
- [Genomics of Drug Sensitivity in Cancer (GDSC)](https://www.cancerrxgene.org/) — Drug sensitivity for ~1000 human cancer cell lines and hundreds of compounds.
- [GuacaMol](https://github.com/BenevolentAI/guacamol) — Benchmark suite for generative molecular design models.
- [LINCS L1000](https://lincsproject.org/LINCS/tools/workflows/find-the-best-place-to-obtain-the-lincs-l1000-data) — Gene expression profiles (978 landmark genes) for >20,000 chemical and genetic perturbations across cell lines.
- [MoleculeNet](http://moleculenet.ai/) — Benchmark [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) for molecular [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning).
- [MOSES](https://github.com/molecularsets/moses) — Benchmarking platform for molecular generation models.
- [NCI60](https://dtp.cancer.gov/discovery_development/nci-60/) — Drug sensitivity benchmark across 60 diverse human cancer cell lines.
- [OGB (Open Graph Benchmark)](https://ogb.stanford.edu/) — Large-scale graph ML benchmark suite including biological [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) such as ogbl-ppa (protein-protein associations) and ogbg-molhiv.
- [OpenBioLink](https://github.com/OpenBioLink/OpenBioLink) — Benchmark [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) for biological knowledge graph completion.
- [PharmGKB](https://www.pharmgkb.org/) — Curated pharmacogenomics dataset linking genetic variants to drug response phenotypes across thousands of drugs.
- [PK-DB](https://pk-db.com/) — Open [database](/@harrisonqian/awesome/wiki/databases/database) of experimental pharmacokinetics (PK) and ADME data from clinical and preclinical studies.
- [PRISM](https://depmap.org/portal/prism/) — Cancer drug sensitivity profiling of >4,500 drugs across >900 cancer cell lines using pooled-cell-line barcoding.
- [ProteinGym](https://github.com/OATML-Markslab/ProteinGym) — Large-scale benchmark of deep mutational scanning assays for evaluating protein fitness landscape models.
- [QM9](https://figshare.com/collections/Quantum_chemistry_structures_and_properties_of_134_kilo_molecules/978904) — Quantum chemistry properties for 134K stable small organic molecules computed at DFT level.
- [scIB (Single-cell Integration Benchmarks)](https://github.com/theislab/scib) — Comprehensive benchmarking framework for single-cell data [integration](/@harrisonqian/awesome/wiki/platforms/integration) methods.
- [SIDER (Side Effect Resource)](http://sideeffects.embl.de/) — [Database](/@harrisonqian/awesome/wiki/databases/database) of 1,430 approved drugs with their recorded adverse drug reactions across 27 system-organ classes.
- [Tabula Muris](https://tabula-muris.ds.czbiohub.org/) — Comprehensive single-cell atlas of 20 mouse organs and tissues, enabling cross-tissue and cross-species comparisons.
- [Tabula Sapiens](https://tabula-sapiens-portal.ds.czbiohub.org/) — Comprehensive human single-cell atlas of ~500K cells from 24 organs and tissues across multiple donors.
- [TAPE (Tasks Assessing Protein Embeddings)](https://github.com/songlab-cal/tape) — Benchmark suite of five biologically meaningful semi-supervised [learning](/@harrisonqian/awesome/wiki/programming-languages/learning) tasks for evaluating protein representations.
- [The Cancer Genome Atlas (TCGA)](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga) — Comprehensive multi-omics (genomics, transcriptomics, proteomics, methylation) dataset for 33 cancer types across ~11,000 patients.
- [Therapeutics Data Commons (TDC)](https://tdcommons.ai/) — Unified benchmark suite covering ADMET, drug-target interaction, drug response, and more.
- [Tox21](https://tripod.nih.gov/tox21/challenge/) — 12,707 compounds tested in 12 nuclear receptor and stress-response pathway biochemical assays for toxicity prediction.
- [UK Biobank](https://www.ukbiobank.ac.uk/) — Large-scale biomedical [database](/@harrisonqian/awesome/wiki/databases/database) of ~500K participants with genetic, imaging, and health data for population genetics and disease studies.

---

## API

- [PubMed E-utilities (esearch/efetch)](https://www.nlm.nih.gov/dataguide/edirect/esearch.html) — APIs for searching and retrieving biomedical literature from PubMed.
- [NCBI E-utilities](https://www.ncbi.nlm.nih.gov/books/NBK25501/) — Unified APIs for accessing NCBI databases (Gene, GEO, SRA, PubChem, etc).
- [UniProt REST API](https://www.uniprot.org/help/api) — Programmatic access to protein sequence and functional annotation data.
- [Ensembl REST API](https://rest.ensembl.org/) — API for genomic annotations, variants, genes, and comparative genomics.
- [KEGG REST API](https://www.kegg.jp/kegg/rest/keggapi.html) — API for accessing KEGG pathways, compounds, genes, and reactions.
- [ChEMBL Web Services](https://www.ebi.ac.uk/chembl/ws) — [REST](/@harrisonqian/awesome/wiki/miscellaneous/rest) API for bioactive molecules, targets, and bioassays.
- [Open Targets Platform API](https://platform.opentargets.org/api) — API for target–disease associations integrating genetics, genomics, and drug data.
- [ClinicalTrials.gov API](https://clinicaltrials.gov/api/gui) — API for querying clinical trial metadata and results.

---

## Preprocessing Tools

- [Chemistry Development Kit](https://github.com/cdk/cdk) — [Cheminformatics](/@harrisonqian/awesome/wiki/miscellaneous/cheminformatics) software & [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) tools.
- [Biopython](https://biopython.org/) — Collection of [Python](/@harrisonqian/awesome/wiki/programming-languages/python) tools for biological computation including sequence analysis, structure parsing, and [database](/@harrisonqian/awesome/wiki/databases/database) access.
- [FlashDeconv](https://github.com/cafferychen777/flashdeconv) — High-performance spatial transcriptomics deconvolution (~1M spots in ~3 min).
- [RDKit](https://github.com/rdkit/rdkit) — [Cheminformatics](/@harrisonqian/awesome/wiki/miscellaneous/cheminformatics) software & [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) toolkit.
- [DeepChem](https://github.com/deepchem/deepchem) — [Deep learning](/@harrisonqian/awesome/wiki/computer-science/deep-learning) library for drug discovery, quantum chemistry, and materials science.
- [ChatSpatial](https://github.com/cafferychen777/ChatSpatial) — MCP server for spatial transcriptomics analysis via natural language.
- [Scanpy](https://scanpy.readthedocs.io/en/stable/) — [Python](/@harrisonqian/awesome/wiki/programming-languages/python) library for scRNA-seq analysis.
- [Seurat](https://satijalab.org/seurat/) — R library for scRNA-seq analysis.
- [scvi-tools](https://scvi-tools.org/) — Probabilistic models for single-cell omics data analysis.
- [CellTypist](https://github.com/Teichlab/celltypist) — Automated cell type annotation for scRNA-seq.
- [Squidpy](https://squidpy.readthedocs.io/) — [Python](/@harrisonqian/awesome/wiki/programming-languages/python) library for spatial single-cell analysis.
- [GROMACS](https://www.gromacs.org/) — Molecular dynamics simulation package for biochemical molecules.
- [MDAnalysis](https://www.mdanalysis.org/) — [Python](/@harrisonqian/awesome/wiki/programming-languages/python) library for analyzing and altering molecular dynamics simulation trajectories.
- [OpenMM](https://openmm.org/) — High-performance toolkit for molecular simulation and GPU-accelerated MD.
- [scVelo](https://github.com/theislab/scvelo) — RNA velocity estimation for single-cell transcriptomics, inferring the direction and speed of cell differentiation.
- [STAR](https://github.com/alexdobin/STAR) — Ultrafast universal RNA-seq aligner with support for spliced alignment and single-cell quantification via STARsolo.
- [kallisto](https://pachterlab.github.io/kallisto/) — Near-optimal RNA-seq quantification using pseudoalignment for fast transcript abundance estimation.
- [Harmony](https://github.com/immunogenomics/harmony) — Fast and scalable [integration](/@harrisonqian/awesome/wiki/platforms/integration) of single-cell data across [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets), conditions, technologies, and species.
- [Monocle3](https://cole-trapnell-lab.github.io/monocle3/) — Single-cell trajectory analysis tool for [learning](/@harrisonqian/awesome/wiki/programming-languages/learning) developmental trajectories and ordering cells in pseudotime.
- [CellChat](https://github.com/sqjin/CellChat) — Inference and analysis of cell-cell communication ligand-receptor networks from single-cell transcriptomics data.
- [SCENIC](https://github.com/aertslab/SCENIC) — Single-cell regulatory network inference and clustering linking transcription factors to co-expressed gene modules.
- [DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder) — [Machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) approach for detecting multiplet (doublet) artifacts in single-cell RNA-seq data.

---

## Machine Learning Tasks and Models

### Drug Discovery

#### Drug Response Prediction

- [drGAT](https://github.com/inoue0426/drGAT) — Attention-based model for drug response prediction with gene explainability.
- [MOFGCN](https://github.com/weiba/MOFGCN/tree/main) — GCN + heterogeneous network.
- [DeepDSC](https://ieeexplore-ieee-org.ezp2.lib.umn.edu/stamp/stamp.jsp?tp=&arnumber=8723620&tag=1) — Autoencoder + fully connected NN.
- [DGDRP](https://github.com/minwoopak/heteronet) — Multi-view embedding neural network.
- [DeepAEG](https://github.com/zhejiangzhuque/DeepAEG) — GNN embedding + attention mechanism.
- [RECOVER](https://github.com/RECOVERcoalition/Recover) — [Machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) framework for predicting synergistic drug combination responses across cell lines.
- [TGSA](https://github.com/violet-sto/TGSA) — Tumor gene set and attention-based model leveraging biological pathway knowledge for drug response prediction.
- [HiDRA](https://github.com/bsml320/HiDRA) — Hierarchical network model incorporating gene and pathway-level information for cancer drug response prediction.

#### Drug Repurposing

- [DeepPurpose](https://github.com/kexinhuang12345/DeepPurpose) — [Deep learning](/@harrisonqian/awesome/wiki/computer-science/deep-learning) library for drug repurposing.

#### Drug Target Interaction

- [NeoDTI](https://github.com/FangpingWan/NeoDTI) — Library for drug-target interaction prediction.
- [DTINet](https://github.com/luoyunan/DTINet) — Network-based framework integrating heterogeneous biological data for DTI prediction.
- [DeepDTA](https://github.com/hkmztrk/DeepDTA) — [Deep learning](/@harrisonqian/awesome/wiki/computer-science/deep-learning) model using CNNs on protein sequences and drug SMILES.
- [GraphDTA](https://github.com/thinng/GraphDTA) — Graph neural network–based DTI prediction using molecular graphs.
- [MolTrans](https://github.com/kexinhuang12345/MolTrans) — Transformer-based DTI model leveraging molecular substructures.
- [DrugBAN](https://github.com/peizhenbai/DrugBAN) — Bilinear attention network for interpretable DTI prediction.

#### Compound-Protein Interaction

- [MCPINN](https://github.com/mhlee0903/multi_channels_PINN) — Drug discovery via compound-protein interaction and [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning).
- [TransformerCPI](https://github.com/lifanchen-simm/transformerCPI) — CPI prediction using Transformer.

#### Molecular Generation

- [REINVENT](https://github.com/MolecularAI/Reinvent) — Reinforcement [learning](/@harrisonqian/awesome/wiki/programming-languages/learning) for de novo drug design.
- [MolGPT](https://github.com/devalab/molgpt) — Transformer-based model for molecular generation.
- [Molecular Transformer](https://github.com/pschwllr/MolecularTransformer) — Sequence-to-sequence model for chemical reaction prediction.
- [TargetDiff](https://github.com/guanjq/targetdiff) — 3D equivariant diffusion model for structure-based drug design.
- [DiffDock](https://github.com/gcorso/DiffDock) — Diffusion generative model for molecular docking, predicting the binding pose of small molecules to protein targets.
- [JTVAE](https://github.com/wengong-jin/icml18-jtnn) — Junction tree variational autoencoder for molecular graph generation that guarantees chemical validity via a hierarchical tree decomposition.

### LLM for Biology

- [AI4Chem/ChemLLM-7B-Chat](https://huggingface.co/AI4Chem/ChemLLM-7B-Chat) — LLM for chemical & molecular science.
- [BioGPT](https://github.com/microsoft/BioGPT) — LLM for biomedical text generation.
- [GeneGPT](https://github.com/ncbi/GeneGPT) — LLM for biomedical information, integrated with various APIs.
- [GenePT](https://github.com/yiqunchen/GenePT) — Foundation LLM for single-cell data.
- [scPRINT](https://github.com/cantinilab/scPRINT) — Pretrained on 50M cells for scRNA-seq denoising & zero imputation.
- [ClawBio](https://github.com/ClawBio/ClawBio) — [Bioinformatics](/@harrisonqian/awesome/wiki/miscellaneous/bioinformatics)-native AI agent skill library with local-first pharmacogenomics, ancestry PCA, semantic similarity, nutrigenomics, and metagenomics skills.
- [BioMedLM](https://huggingface.co/stanford-crfm/BioMedLM) — 2.7B parameter GPT-2-style language model trained exclusively on biomedical literature from PubMed for biomedical [question answering](/@harrisonqian/awesome/wiki/computer-science/question-answering) and text generation.
- [MolT5](https://github.com/blender-nlp/MolT5) — Language model for molecular tasks bridging text and SMILES, enabling molecule captioning and text-driven molecule generation.
- [ChatDrug](https://github.com/chao1224/ChatDrug) — LLM-based conversational pipeline for drug discovery, using natural language prompts for iterative drug editing and optimization.

### Foundation Models

#### Single-cell Foundation Models

##### Transcriptomics Foundation Models

- [scFoundation](https://github.com/biomap-research/scFoundation) — Large-scale foundation model for single-cell gene expression, enabling multiple downstream tasks.
- [scGPT](https://github.com/bowang-lab/scGPT) — Transformer-based foundation model pretrained on millions of single-cell profiles.
- [Geneformer](https://huggingface.co/ctheodoris/Geneformer) — Context-aware, attention-based [deep learning](/@harrisonqian/awesome/wiki/computer-science/deep-learning) model pretrained on a large corpus of single-cell transcriptomes.
- [BulkFormer](https://github.com/KangBoming/BulkFormer) — Foundation model for bulk RNA-seq data; learns general transcriptomic representations.
- [scBERT](https://github.com/TencentAILabHealthcare/scBERT) — BERT-based foundation model pretrained on large-scale scRNA-seq data for cell type annotation.
- [CellPLM](https://github.com/OmicsML/CellPLM) — Cell pre-trained language model with inter-cell transformer architecture for diverse single-cell analysis tasks.
- [UCE](https://github.com/snap-stanford/UCE) — Universal Cell Embeddings: zero-shot single-cell embedding model trained on 36M cells across species, tissues, and assays, usable without fine-tuning.
- [GEARS](https://github.com/snap-stanford/GEARS) — Graph-based model for predicting transcriptional responses to single and combinatorial genetic perturbations using biological priors.

##### Spatial Foundation Models

- [GigaPath](https://github.com/prov-gigapath/prov-gigapath) — Slide-level digital pathology foundation model pretrained on 1.3 billion pathology image tiles from whole-slide images.
- [UNI](https://github.com/mahmoodlab/UNI) — General-purpose self-supervised pathology foundation model trained on 100K+ whole-slide images for diverse computational pathology tasks.
- [CONCH](https://github.com/mahmoodlab/CONCH) — Vision-language foundation model for computational pathology trained with contrastive captioning on pathology image–text pairs.
- [Phikon](https://huggingface.co/owkin/phikon) — ViT-based pathology foundation model pretrained with iBOT self-supervision on TCGA whole-slide images.

##### Multi-Omics Foundation Models

- [scMulan](https://github.com/SuperBianC/scMulan) — Single-cell multi-omic language model pretrained on ~10M cells spanning transcriptomics, epigenomics, and proteomics for cross-omics transfer tasks.
- [totalVI](https://github.com/scverse/scvi-tools) — Probabilistic framework for joint analysis of paired scRNA-seq and protein (CITE-seq) data enabling multi-modal cell state representation across single-cell [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets).
- [MultiVI](https://github.com/scverse/scvi-tools) — Multi-modal variational autoencoder for integrating paired and unpaired single-cell RNA-seq and ATAC-seq measurements into a unified latent space.
- [MIRA](https://github.com/cistrome/MIRA) — Probabilistic multimodal topic model jointly modeling single-cell transcriptomics and chromatin accessibility for regulatory network inference.
- [GLUE](https://github.com/gao-lab/GLUE) — Graph-Linked Unified Embedding framework for unpaired single-cell multi-omics data integration across RNA, ATAC, methylation, and protein modalities.
- [BABEL](https://github.com/wukevin/babel) — Cross-modality translation model enabling prediction between scRNA-seq and scATAC-seq profiles without requiring paired single-cell measurements.
- [Multigrate](https://github.com/theislab/multigrate) — Asymmetric multi-omics variational autoencoder for integrating single-cell data across RNA, ATAC, and protein modalities with missing-modality support.
- [MOFA+](https://github.com/bioFAM/MOFA2) — Multi-Omics Factor Analysis framework identifying shared axes of variation across bulk and single-cell [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) including RNA, ATAC, proteomics, methylation, and copy number.
- [GeneCompass](https://github.com/xCompass-AI/GeneCompass) — Large-scale foundation model integrating DNA regulatory sequences and single-cell transcriptomics from 120M+ cells across multiple species for gene regulation prediction.
- [UnitedNet](https://github.com/LiuLab-Bioelectronics-Harvard/UnitedNet) — Interpretable multi-task deep neural network for single-cell multi-omics integration spanning transcriptomics, chromatin accessibility, and proteomics.
- [SpatialGlue](https://github.com/zhanglabtools/SpatialGlue) — Graph attention network for spatial multi-omics integration jointly embedding spatial transcriptomics with chromatin accessibility or proteomics.
- [MIDAS](https://github.com/labomics/midas) — Mosaic integration and knowledge transfer model for single-cell multi-omics data that handles arbitrary missing-modality combinations across transcriptomics, chromatin accessibility, and proteomics.

##### Domain Alignment

- [scArches](https://github.com/theislab/scarches) — Transfer learning framework for mapping new single-cell [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) onto pre-trained reference atlases across batches, conditions, and modalities.
- [TOSICA](https://github.com/JackieHanlaopo/TOSICA) — Transformer-based framework for one-stop interpretable cell-type annotation supporting cross-dataset and cross-species transfer.

#### Protein Foundation Models

##### Pre-trained Embedding

- [Evolutionary Scale Modeling (ESM)](https://github.com/facebookresearch/esm) — Protein embeddings.
- [ChemBERTa-2](https://github.com/seyonechithrananda/bert-loves-chemistry) — Chemical embeddings & prediction.
- [ProtTrans](https://github.com/agemagician/ProtTrans) — Suite of protein language models (ProtBERT, ProtT5, ProtXLNet) trained on billions of protein sequences from UniRef and BFD.
- [ProGen2](https://github.com/salesforce/progen) — Protein language model trained on diverse protein families for sequence generation and fitness prediction.
- [Ankh](https://github.com/agemagician/Ankh) — Efficient protein language model optimized for downstream prediction tasks including secondary structure, localization, and function annotation.

##### Protein Structure Prediction and Design

- [AlphaFold3](https://github.com/google-deepmind/alphafold3) — Predicts structures of proteins, nucleic acids, small molecules, and their complexes.
- [Boltz-1](https://github.com/jwohlwend/boltz) — Open-source all-atom biomolecular structure prediction model for proteins, nucleic acids, small molecules, and their complexes, reported to approach AlphaFold3-level accuracy.
- [Chai-1](https://github.com/chaidiscovery/chai-lab) — Unified molecular structure prediction model covering proteins, nucleic acids, small molecules, and complexes.
- [ESM3](https://github.com/evolutionaryscale/esm) — Multimodal protein language model that jointly reasons over sequence, structure, and function for generative protein design and engineering.
- [ESMFold](https://github.com/facebookresearch/esm) — Fast protein structure prediction using language model embeddings.
- [RFdiffusion](https://github.com/RosettaCommons/RFdiffusion) — Generative model for protein backbone design using diffusion.
- [ProteinMPNN](https://github.com/dauparas/ProteinMPNN) — [Deep learning](/@harrisonqian/awesome/wiki/computer-science/deep-learning) model for protein sequence design given backbone structure.
- [OmegaFold](https://github.com/HeliXonProtein/OmegaFold) — High-resolution de novo protein structure prediction from sequence.
- [RoseTTAFold](https://github.com/RosettaCommons/RoseTTAFold) — Three-track neural network for protein structure prediction.
- [OpenFold](https://github.com/aqlaboratory/openfold) — Trainable, memory-efficient open-source reproduction of AlphaFold2 enabling custom protein structure prediction workflows.
- [SaProt](https://github.com/westlake-repl/SaProt) — Structure-aware protein language model whose tokens encode both sequence and backbone geometry for improved function prediction.
- [EvoDiff](https://github.com/microsoft/evodiff) — Discrete diffusion framework for protein sequence generation trained on evolutionary-scale data, supporting unconditional generation, disordered region design, and functional motif scaffolding. [ [paper-2023](https://www.biorxiv.org/content/10.1101/2023.09.11.556673v1) ]

#### Multi-Modal Foundation Models

- [CHIEF](https://github.com/hms-dbmi/CHIEF) — Clinical Histopathology Imaging Evaluation Foundation model integrating histology images and clinical context for pan-cancer analysis.
- [BiomedCLIP](https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_256-vit_g_14) — CLIP-based vision-language foundation model for biomedical images and text trained on PubMed figure–caption pairs.
#### Genomics Foundation Models

- [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer) — Foundation model for genomic sequences across multiple species.
- [DNABERT](https://github.com/jerryji1993/DNABERT) — Pre-trained bidirectional encoder for DNA sequence analysis.
- [DNABERT-2](https://github.com/Zhihan1996/DNABERT_2) — Improved genome foundation model with efficient tokenization.
- [Enformer](https://github.com/deepmind/deepmind-research/tree/master/enformer) — Transformer model predicting gene expression from DNA sequence.
- [Basenji](https://github.com/calico/basenji) — Sequential regulatory activity prediction from DNA sequences.
- [Caduceus](https://github.com/kuleshov-group/caduceus) — Bidirectional, reverse-complement-equivariant long-range DNA sequence model based on Mamba.
- [Evo](https://github.com/evo-design/evo) — Long-context genomic foundation model (up to 1M tokens).
- [HyenaDNA](https://github.com/HazyResearch/hyena-dna) — Long-range genomic foundation model handling sequences up to 1M tokens with sub-quadratic, attention-free Hyena operators.
- [Borzoi](https://github.com/calico/borzoi) — Extended successor to Enformer for predicting RNA-seq coverage from long genomic sequence windows (524 kb) with improved resolution.
- [DeepSEA](http://deepsea.princeton.edu/) — [Deep learning](/@harrisonqian/awesome/wiki/computer-science/deep-learning) framework for predicting chromatin effects of sequence alterations with single-nucleotide sensitivity across thousands of chromatin features.
- [Sei](https://github.com/FunctionLab/sei-framework) — Sequence-to-function framework learning a genome-wide regulatory activity code from DNA sequences for variant effect prediction.
- [GPN (Genomic Pre-trained Network)](https://github.com/songlab-cal/gpn) — Masked language model for DNA sequences enabling zero-shot variant effect prediction without requiring functional annotations.

---