Structure Prediction & Docking
AlphaFold
Predicts protein structures from amino acid sequences using AlphaFold2. Generates high-confidence 3D models with optional relaxation. Supports single-sequence prediction mode for fast predictions without MSA generation.
By default each input sequence is folded as a separate monomer. Wrap proteins in Bundle(...) to fold the bundled sequences together as one multi-chain complex: ColabFold receives a single colon-joined query (SEQ_A:SEQ_B), auto-selects the multimer model, and runs its default paired+unpaired MSA pipeline. Bundle(static, Each(...)) holds static fixed and folds it against each iterated sequence (one complex per element). A bundled complex's output id is the chain ids joined with + (e.g. p1+p2). Pre-computed msas= cannot be combined with a Bundle (a bundled complex always uses ColabFold's own MSA pipeline).
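The query and id conventions described above can be sketched in plain Python. This is an illustration only, not the biopipelines implementation: the helper names `bundle_query` and `bundle_against_each` are hypothetical, and the sequences are placeholders.

```python
from typing import Dict, List, Tuple


def bundle_query(chains: Dict[str, str]) -> Tuple[str, str]:
    """Return (output_id, query) for one bundled complex.

    chains maps chain id -> amino acid sequence, in chain order.
    """
    output_id = "+".join(chains.keys())   # e.g. "p1+p2"
    query = ":".join(chains.values())     # e.g. "SEQ_A:SEQ_B"
    return output_id, query


def bundle_against_each(static: Dict[str, str],
                        iterated: Dict[str, str]) -> List[Tuple[str, str]]:
    """Mimic Bundle(static, Each(...)): one complex per iterated sequence."""
    return [bundle_query({**static, cid: seq}) for cid, seq in iterated.items()]


complex_id, query = bundle_query({"p1": "MKTAYIAK", "p2": "GSHMLEDP"})
print(complex_id, query)

pairs = bundle_against_each({"rec": "MKTAYIAK"}, {"b1": "GSHM", "b2": "LEDP"})
for pair_id, pair_query in pairs:
    print(pair_id, pair_query)
```

Each element of the iterated input yields its own complex, so the number of predictions equals the number of iterated sequences.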
Resources: GPU. H100 NVL is not compatible.
Environment: localcolabfold and biopipelines
Installation:
cd data
wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabbatch_linux.sh
bash install_colabbatch_linux.sh
rm install_colabbatch_linux.sh
Parameters:
- proteins: DataStream | StandardizedOutput | Bundle | Each (required) - Input protein sequences. A bare input folds one monomer per sequence; Bundle(...) folds the bundled sequences as one multi-chain complex.
- msas: StandardizedOutput = None - Pre-computed MSAs in A3M format. When provided, ColabFold uses these instead of generating new MSAs. Note: Only A3M files originally generated by AlphaFold/ColabFold or MMseqs2 are recognized. MSAs converted from Boltz2 CSV format via MSA(source, convert="a3m") are ignored by ColabFold, which will re-query the MSA server instead.
- num_relax: int = 0 - Number of best models to relax with AMBER
- num_recycle: int = 3 - Number of recycling iterations
- rand_seed: int = 0 - Random seed (0 = random)
Streams: structures, msas
Tables:
- structures:
| id | source_id | sequence |
|---|---|---|
- confidence:
| id | structure | plddt | max_pae | ptm |
|---|---|---|---|---|
- msas (only when MSAs are generated):
| id | sequences.id | sequence | msa_file |
|---|---|---|---|
Example:
from biopipelines.alphafold import AlphaFold
from biopipelines.combinatorics import Bundle, Each
from biopipelines.msa import MSA
# Standard prediction
af = AlphaFold(
    proteins=lmpnn,
    num_relax=1,
    num_recycle=5
)
# With pre-computed MSAs from a previous AlphaFold run
af = AlphaFold(proteins=seq, msas=previous_af_result)
# Fold a single A:B complex (one prediction, id "chainA+chainB")
af = AlphaFold(proteins=Bundle(seqs_a, seqs_b))
# One complex per binder: fixed receptor folded against each binder
af = AlphaFold(proteins=Bundle(receptor, Each(binders)))
Boltz2
Predicts biomolecular complexes including proteins, nucleic acids, and small molecules. State-of-the-art model for protein-ligand and protein-protein complex prediction.
Installation:
Environment: Boltz2Env
Parameters:
- config: Optional[str] = None - Direct YAML configuration string
- proteins: Optional[Union[DataStream, StandardizedOutput]] = None - Protein sequences
- ssDNA: Optional[Union[DataStream, StandardizedOutput]] = None - Single-stranded DNA sequences
- dsDNA: Optional[Union[DataStream, StandardizedOutput]] = None - Double-stranded DNA sequences
- ssRNA: Optional[Union[DataStream, StandardizedOutput]] = None - Single-stranded RNA sequences
- dsRNA: Optional[Union[DataStream, StandardizedOutput]] = None - Double-stranded RNA sequences
- ligands: Optional[Union[DataStream, StandardizedOutput]] = None - Ligand compounds stream (e.g. from Ligand / CompoundLibrary)
- msas: Optional[StandardizedOutput] = None - Pre-computed MSA files for recycling (pass entire tool output, not .msas). Supports recycling from AlphaFold via MSA(af_result, convert="csv").
- affinity: bool = True - Calculate binding affinity predictions
- output_format: str = "pdb" - Output format (pdb, mmcif)
- msa_server: str = "public" - MSA generation (public, local)
- recycling_steps: Optional[int] = None - Number of recycling steps (default: model-specific)
- diffusion_samples: Optional[int] = None - Number of diffusion samples (default: model-specific)
- top_only: bool = True - When True (default), keep only the top model as <id>. When False, surface every diffusion sample as a separate structure <id>_1..N, e.g. to feed a Boltz2 pose ensemble (with native covalent linkage / full scaffold) into downstream design. Pair with diffusion_samples=N to control how many.
- use_potentials: bool = False - Enable external potentials
- template: Optional[str] = None - Path to PDB template file for structure guidance
- template_chain_ids: Optional[List[str]] = None - Chain IDs to apply template to (e.g., ["A", "B"])
- template_force: bool = True - Force template usage
- template_threshold: float = 5.0 - RMSD threshold for template matching
- pocket_residues: Optional[List[int]] = None - Residue positions defining binding pocket (e.g., [50, 51, 52])
- pocket_max_distance: float = 7.0 - Maximum distance for pocket constraint
- pocket_force: bool = True - Force pocket constraint
- glycosylation: Optional[Dict[str, List[int]]] = None - N-glycosylation sites per chain (e.g., {"A": [164]})
- covalent_linkage: Optional[Dict[str, Any]] = None - Covalent attachment specification (see examples). The ligand_atom is a Boltz atom name; get it from Boltz2.predict_atom_names(...) (see below).
- contacts: Optional[List[Dict[str, Any]]] = None - Contact constraints between atoms/residues. A ligand token is [chain_id, atom_name] (e.g. ["B", "C8"]); get the atom name from Boltz2.predict_atom_names(...) (see below).
- disulfide_bonds: Optional[List[Dict[str, Any]]] = None - Disulfide bond constraints
- metal_coord: Optional[List[Dict[str, Any]]] = None - Metal-coordination bond constraints
Streams: structures
Tables:
- confidence:
| id | sequences.id | compounds.id | input_file | confidence_score | ptm | iptm | complex_plddt | complex_iplddt |
|---|---|---|---|---|---|---|---|---|
- affinity:
| id | sequences.id | compounds.id | input_file | affinity_pred_value | affinity_probability_binary |
|---|---|---|---|---|---|
Prefer affinity_probability_binary (binder probability, higher = more likely a binder) for ranking — it is more reliable than the affinity_pred_value regression score in most cases.
Provenance columns (sequences.id, compounds.id) track which protein and ligand produced each row, enabling filtering and joins without parsing the ID string.
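As a rough illustration of the two points above, the sketch below ranks affinity rows by affinity_probability_binary and filters them by a provenance column. It works on plain dictionaries rather than any biopipelines object; the row values and the helper names are made up for the example.

```python
from typing import Dict, List

# Illustrative rows only: the column names follow the affinity table,
# the values are invented for this example.
affinity_rows: List[Dict] = [
    {"id": "row1", "sequences.id": "p1", "compounds.id": "lig1",
     "affinity_pred_value": -1.2, "affinity_probability_binary": 0.81},
    {"id": "row2", "sequences.id": "p1", "compounds.id": "lig2",
     "affinity_pred_value": -2.0, "affinity_probability_binary": 0.35},
    {"id": "row3", "sequences.id": "p2", "compounds.id": "lig1",
     "affinity_pred_value": -0.4, "affinity_probability_binary": 0.92},
]


def rank_by_binder_probability(rows: List[Dict]) -> List[Dict]:
    """Sort rows so the most likely binders come first."""
    return sorted(rows, key=lambda r: r["affinity_probability_binary"], reverse=True)


def rows_for_protein(rows: List[Dict], protein_id: str) -> List[Dict]:
    """Select rows by provenance column instead of parsing the row id."""
    return [r for r in rows if r["sequences.id"] == protein_id]


ranked = rank_by_binder_probability(affinity_rows)
print([r["id"] for r in ranked])
print([r["compounds.id"] for r in rows_for_protein(ranked, "p1")])
```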
Example:
from biopipelines import Ligand
from biopipelines.boltz2 import Boltz2
# Basic apo and holo prediction
boltz_apo = Boltz2(proteins=lmpnn)
boltz_holo = Boltz2(
    proteins=lmpnn,
    ligands=Ligand(smiles="CC(=O)OC1=CC=CC=C1C(=O)O"),  # Aspirin SMILES
    msas=boltz_apo,  # Pass entire ToolOutput
    affinity=True
)
# With template guidance
boltz_template = Boltz2(
    proteins=lmpnn,
    ligands=compounds,
    template="reference.pdb",
    template_chain_ids=["A"]
)
# With pocket constraint
boltz_pocket = Boltz2(
    proteins=lmpnn,
    ligands=compounds,
    pocket_residues=[50, 51, 52, 80, 81, 82],
    pocket_max_distance=7.0
)
# With N-glycosylation (adds NAG at Asn-164)
boltz_glyco = Boltz2(
    proteins=lmpnn,
    ligands=compounds,
    glycosylation={"A": [164]}
)
# With covalent ligand attachment (e.g., to Cys-50)
boltz_covalent = Boltz2(
    proteins=lmpnn,
    ligands=covalent_inhibitor,
    covalent_linkage={
        "chain": "A",
        "position": 50,
        "protein_atom": "SG",  # Cysteine sulfur
        "ligand_atom": "C1"  # Ligand attachment atom
    }
)
Predicting ligand atom names (Boltz2.predict_atom_names)
Constraints that target a ligand atom — contacts, covalent_linkage, metal_coord — need the atom name Boltz assigns to the ligand (e.g. ["B", "C8"]). Boltz names ligand atoms deterministically from chemistry, so you can look them up before running anything. Boltz2.predict_atom_names(ligand) returns a 2-D depiction with every heavy atom labelled by its Boltz name (renders inline in Jupyter and is written to a PNG), plus a .names lookup. It is a configuration-time helper: no GPU, no Pipeline context required, and it works in a plain Python session.
from biopipelines.boltz2 import Boltz2
from biopipelines import Ligand
res = Boltz2.predict_atom_names(Ligand("ampicillin"))
res # labelled image renders inline in a notebook
res.path # PNG written to ./ampicillin_atoms.png (override with path=)
res.names # {"ampicillin": {"S24": ..., "C25": ..., ...}}
It accepts a SMILES string, a Ligand / CompoundLibrary (one labelled panel per compound), or the output one returns inside a pipeline. Two naming regimes, matching Boltz exactly:
- SMILES ligands (raw SMILES, Ligand(smiles=...), a PubChem name/CID/CAS lookup, CompoundLibrary): element.upper() + canonical rank index, with hydrogens included in the ranking, so heavy-atom indices are not 1..N (ethanol → C8, C9, O7).
- CCD ligands (Ligand("ATP") and any RCSB CCD code): the CCD's own atom names from the RCSB definition (PG, O5', C1', …).
A network call is made only to resolve a name/CID/CAS or CCD lookup; a raw SMILES or Ligand(smiles=...) is fully offline.
DiffDock
Blind diffusion-based docking: samples candidate ligand poses over translation, rotation, and torsion, then re-ranks them with a confidence model. Works directly from a protein PDB + a ligand (SMILES or SDF) without a binding-box hint — useful when the pocket is unknown. Output IDs follow the multi-axis pattern <protein>+<ligand>_rank<N>; the rank-1 pose is the structures stream, and the full ranked list is in the table.
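The output id pattern can be split back into its parts with a small regular expression. This is a hedged sketch, not part of biopipelines: `parse_pose_id` is a hypothetical helper, and it assumes neither the protein id nor the ligand id contains the literal `_rank` suffix. When the confidence table is available, its structures.id and compounds.id columns are the more reliable source.

```python
import re
from typing import Tuple

# <protein>+<ligand>_rank<N>; the greedy first group means the split happens
# at the last "+" that still lets the rest of the pattern match.
_POSE_ID = re.compile(r"^(?P<protein>.+)\+(?P<ligand>.+)_rank(?P<rank>\d+)$")


def parse_pose_id(pose_id: str) -> Tuple[str, str, int]:
    """Split a pose id into (protein_id, ligand_id, rank)."""
    match = _POSE_ID.match(pose_id)
    if match is None:
        raise ValueError(f"not a <protein>+<ligand>_rank<N> id: {pose_id!r}")
    return match["protein"], match["ligand"], int(match["rank"])


print(parse_pose_id("4ufc+ASA_rank1"))
```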
References: https://github.com/gcorso/DiffDock · https://arxiv.org/abs/2210.01776
Resources: GPU (CPU fallback exists but is much slower).
Environment: diffdock
Installation: DiffDock.install() clones the repo + the facebookresearch/esm side-clone, builds the env, and installs the matching PyG wheels.
Parameters:
- structures: DataStream | StandardizedOutput (required) - Protein backbones (PDB).
- compounds: DataStream | StandardizedOutput (required) - Ligands as SMILES or SDF.
- samples_per_complex: int = 10 - Poses sampled per (protein, ligand) pair.
- inference_steps: int = 20 - Reverse-diffusion steps.
- actual_steps: int = None - Steps actually executed (≤ inference_steps; default inference_steps - 1).
- no_final_step_noise: bool = True - Disable noise on the last step.
- batch_size: int = 32 - DiffDock internal batch size.
Streams: structures (rank-N pose SDF per pair)
Tables:
- confidence:
| id | structures.id | compounds.id | rank | confidence |
|---|---|---|---|---|
- missing:
| id | removed_by | cause |
|---|---|---|
Example:
from biopipelines.diffdock import DiffDock
target = PDB("4ufc", convert="pdb")
lig = Ligand(smiles="CC(=O)Oc1ccccc1C(=O)O", ids="ASA")
dock = DiffDock(structures=target, compounds=lig)
DynamicBind
Flexible-backbone docking. DynamicBind is an equivariant generative model that predicts a ligand-specific protein conformation, letting the receptor backbone flex toward its bound state rather than docking into a rigid pocket. Reports per-pose lDDT and a predicted affinity; can optionally render a transition movie.
References: https://github.com/luwei0917/DynamicBind · https://www.nature.com/articles/s41467-024-45461-2
Resources: GPU. Pin gpu="A100" — DynamicBind's torch build is cu11x and crashes ("no kernel image") on newer cards like the H100.
Environment: dynamicbind (inference) + dynamicbind_relax (OpenMM pose relaxation); both created at install.
Installation: DynamicBind.install() clones the repo, creates both envs, and fetches the weights bundle from Zenodo.
Parameters:
- structures: DataStream | StandardizedOutput (required) - Protein structures (PDB).
- compounds: DataStream | StandardizedOutput (required) - Ligands (SMILES or SDF).
- num_samples: int = 40 - Poses sampled per pair.
- num_saved: int = None - How many top poses to keep (default: all).
- inference_steps: int = 20 - Reverse-diffusion steps.
- num_workers: int = 1 - Data-loading workers.
- rigid_protein: bool = False - Keep the protein rigid (disables the flexing that distinguishes DynamicBind).
- make_movie: bool = False - Render a conformational-transition movie per pose (movies stream).
- seed: int = 42 - Random seed.
Streams: structures (predicted complex per pose), movies (only when make_movie=True)
Tables:
- affinity:
| id | structures.id | compounds.id | rank | lddt | affinity |
|---|---|---|---|---|---|
- missing:
| id | removed_by | cause |
|---|---|---|
Example:
from biopipelines.dynamicbind import DynamicBind
with Pipeline("Examples", "DynamicBind-demo"):
    Resources(gpu="A100", memory="32GB", time="6:00:00")
    target = PDB("4ufc", convert="pdb")
    lig = Ligand(smiles="CC(=O)Oc1ccccc1C(=O)O", ids="ASA")
    dock = DynamicBind(structures=target, compounds=lig)
ESMFold
Predicts a protein's 3-D structure directly from its amino-acid sequence using a protein language model — no MSA or templates needed, so it is much faster to set up than AlphaFold for single-sequence predictions. One PDB per input sequence, with per-residue pLDDT (in the B-factor column) and a global pTM.
References: https://github.com/facebookresearch/esm
Resources: GPU. Long sequences may need chunk_size / max_tokens_per_batch lowered to fit memory; set memory="32GB".
Environment: esmfold
Installation: ESMFold.install() creates the env (mirrors the One-command-install-ESMfold recipe on the cluster; the ColabFold notebook recipe on Colab).
Parameters:
- sequences: DataStream | StandardizedOutput (required) - Input sequences.
- chunk_size: int = None - Axial-attention chunk size to reduce GPU memory (try 128/64/32 for long sequences that OOM). None = no chunking.
- num_recycles: int = 4 - Recycling iterations.
- max_tokens_per_batch: int = 1024 - Max tokens per forward pass; lower to avoid OOM.
- cpu_offload: bool = False - Offload weights to CPU to save GPU memory.
- cpu_only: bool = False - Run entirely on CPU (very slow).
Streams: structures
Tables:
- structures:
| id | file | sequences.id |
|---|---|---|
- confidence:
| id | file | plddt | ptm |
|---|---|---|---|
Example:
from biopipelines.esmfold import ESMFold
seqs = Sequence(["MKTVRQERLKSIVRILERSKEPVSGAQ"], ids=["p1"])
esm = ESMFold(sequences=seqs)
Gnina
Molecular docking with CNN-based pose scoring. Combines AutoDock Vina search with a convolutional neural network for more accurate binding pose prediction. Supports multi-run docking with statistical analysis across independent runs, optional conformer generation, and pose consistency analysis.
The binding box is determined in order: explicit center+size > autobox_ligand > crystal ligand HETATM records in the input PDB.
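The precedence order can be written as a short decision function. This is a sketch of the documented order only, not Gnina's or the wrapper's actual code: `box_source` is a hypothetical name, and two behaviours are assumptions made for the example (an explicit box needs both center and size, and an error is raised when no source is available).

```python
from typing import Optional, Union


def box_source(center: Optional[str] = None,
               size: Union[float, str, None] = None,
               autobox_ligand: Optional[object] = None,
               has_crystal_ligand: bool = False) -> str:
    """Return which input defines the binding box, in documented priority order."""
    # 1. Explicit center + size wins (assumption: both must be given).
    if center is not None and size is not None:
        return "explicit"
    # 2. Otherwise a reference ligand for autoboxing.
    if autobox_ligand is not None:
        return "autobox_ligand"
    # 3. Otherwise crystal-ligand HETATM records found in the input PDB.
    if has_crystal_ligand:
        return "crystal_ligand"
    # Assumption: with no source at all, the box cannot be defined.
    raise ValueError("no binding box source available")


print(box_source(center="10.5,20.3,15.0", size=25.0, autobox_ligand="ref.sdf"))
print(box_source(has_crystal_ligand=True))
```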
Resources: GPU recommended (CPU fallback available but slow).
Environment: biopipelines (plus CUDA modules configured in config.yaml under gnina:)
Installation: Downloads pre-built gnina.1.3.2 binary from GitHub.
Parameters:
- structures: Union[DataStream, StandardizedOutput] (required) - Protein structures
- compounds: Union[DataStream, StandardizedOutput] (required) - Ligands (from Ligand(), CompoundLibrary(), etc.)
- autobox_ligand: Union[DataStream, StandardizedOutput, str, None] = None - Reference ligand for automatic box
- center: Optional[str] = None - Explicit box center as "x,y,z"
- size: Union[float, str, None] = None - Box dimensions in Angstroms (single float = cubic, "x,y,z" = asymmetric)
- autobox_add: float = 4.0 - Padding around autobox ligand (Angstroms)
- exhaustiveness: int = 8 - Search exhaustiveness
- num_modes: int = 9 - Docked poses per run
- num_runs: int = 1 - Independent docking runs per conformer
- seed: int = 42 - Base random seed (each run uses seed + run_index)
- cnn_scoring: str = "rescore" - CNN scoring mode ("rescore", "refinement", "none", "rescore_only")
- generate_conformers: bool = False - Generate RDKit ETKDGv3 conformers per ligand
- num_conformers: int = 50 - Conformers to generate before filtering
- energy_window: float = 2.0 - Max relative MMFF94 energy (kcal/mol) for conformer filtering
- conformer_rmsd: float = 1.0 - Heavy-atom RMSD cutoff (Angstroms) for Butina clustering
- conformer_energies: Optional[tuple] = None - Pre-computed energies as (TableInfo, "column_name")
- cnn_score_threshold: float = 0.5 - Min CNNscore to accept a pose (0-1)
- rmsd_threshold: float = 2.0 - RMSD cutoff (Angstroms) for pose consistency clustering
- protonate: bool = True - Add hydrogens with OpenBabel before docking
- pH: float = 7.4 - Protonation pH
Streams: structures (combined protein+ligand PDB files of best poses)
Tables:
- docking_results - One row per accepted pose across all runs:
| id | structures.id | compounds.id | conformer_id | run | pose | vina_score | cnn_score | cnn_affinity |
|---|---|---|---|---|---|---|---|---|
IDs are unique per pose: {protein}_{ligand}_r{run}_p{pose}.
- docking_summary - One row per (protein, ligand, conformer) group, aggregated across runs:
| id | structures.id | compounds.id | conformer_id | best_vina | mean_vina | std_vina | best_cnn_score | mean_cnn_affinity | std_cnn_affinity | pose_consistency | conformer_energy | pseudo_binding_energy | best_pose_file |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
- mean_vina / std_vina: mean and std of the best Vina score per run (not all poses, just the top pose from each independent run). Robust metric for ranking.
- mean_cnn_affinity / std_cnn_affinity: same logic for CNN affinity.
- pose_consistency: fraction of runs whose best pose clusters together by RMSD. Measures reproducibility (1.0 = all runs converge).
- conformer_energy: MMFF94 relative strain energy (only with generate_conformers=True).
- pseudo_binding_energy: best_vina + conformer_energy. Penalizes Vina score by conformer strain. Only populated when conformer generation is enabled.
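To make the aggregation concrete, here is a minimal sketch of how best_vina, mean_vina, and pseudo_binding_energy relate to per-pose rows. It is not the wrapper's code: `summarize_runs` is a hypothetical helper, the scores are invented, and it assumes that a lower (more negative) Vina score is the better one. std_vina is left out because the document does not say whether a sample or population standard deviation is used.

```python
from statistics import mean
from typing import Dict, List, Optional


def summarize_runs(poses: List[Dict], conformer_energy: Optional[float] = None) -> Dict:
    """Aggregate per-pose rows (keys: run, vina_score) for one group."""
    # Keep only the top pose of each independent run (lowest Vina score).
    best_per_run: Dict[int, float] = {}
    for pose in poses:
        run, score = pose["run"], pose["vina_score"]
        if run not in best_per_run or score < best_per_run[run]:
            best_per_run[run] = score

    best_vina = min(best_per_run.values())
    summary = {
        "best_vina": best_vina,
        "mean_vina": mean(best_per_run.values()),  # mean over runs, not over all poses
    }
    # Only populated when a conformer strain energy is available.
    if conformer_energy is not None:
        summary["pseudo_binding_energy"] = best_vina + conformer_energy
    return summary


poses = [
    {"run": 1, "vina_score": -7.0}, {"run": 1, "vina_score": -6.0},
    {"run": 2, "vina_score": -8.0}, {"run": 2, "vina_score": -5.0},
]
summary = summarize_runs(poses, conformer_energy=1.5)
print(summary)
```

Averaging only the per-run best scores keeps weak secondary poses from diluting the metric, which is why the mean over runs differs from a mean over every pose.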
Example:
from biopipelines.gnina import Gnina
# Basic docking (autobox from crystal ligand in PDB)
docked = Gnina(structures=boltz, compounds=ligand)
# Explicit box
docked = Gnina(
    structures=boltz,
    compounds=ligand,
    center="10.5,20.3,15.0",
    size=25.0
)
# With conformer generation for flexible ligand docking
docked = Gnina(
    structures=boltz,
    compounds=compound_library,
    generate_conformers=True,
    num_conformers=100,
    energy_window=3.0,
    num_runs=10,
    exhaustiveness=64
)
NeuralPLexer
Predicts protein–ligand complex structures from a protein sequence/structure plus a ligand (SMILES/SDF), using a physics-inspired flow-based generative model — no binding-box hint required. Output IDs follow <protein>+<ligand>_rank<N>.
References: https://github.com/zrqiao/NeuralPLexer · https://www.nature.com/articles/s42256-024-00792-z
Resources: GPU. On Colab it needs a high-RAM runtime — the openfold attention + complex sampling OOM the free T4's 12 GB system RAM (verified end-to-end on an A100 high-RAM runtime).
Environment: neuralplexer
Installation: NeuralPLexer.install() clones the repo, creates the env, builds the openfold CUDA extension, and fetches the (~8.7 GB) weights bundle from Zenodo.
Parameters:
- structures: DataStream | StandardizedOutput (required) - Protein structures.
- compounds: DataStream | StandardizedOutput (required) - Ligands (SMILES or SDF).
- n_samples: int = 16 - Complex samples per pair.
- num_steps: int = 40 - Sampling steps.
- chunk_size: int = 4 - Attention chunk size (lower to reduce memory).
- sampler: str = "langevin_simulated_annealing" - Sampling scheme.
- cuda: bool = True - Run on GPU.
Streams: structures (predicted complex PDB per sample)
Tables:
- confidence:
| id | structures.id | compounds.id | rank | confidence |
|---|---|---|---|---|
- missing:
| id | removed_by | cause |
|---|---|---|
Example:
from biopipelines.neuralplexer import NeuralPLexer
target = PDB("4ufc", convert="pdb")
lig = Ligand(smiles="CC(=O)Oc1ccccc1C(=O)O", ids="ASA")
cplx = NeuralPLexer(structures=target, compounds=lig, n_samples=16)
PLACER
Atomic-level graph neural network that stochastically regenerates coordinates from a partially corrupted input structure, producing a scored conformational ensemble. The wrapper exposes PLACER's two task modes, selected by the inputs given (no explicit mode flag):
- Ligand-pose mode (ligand provided): the input structure has the ligand bound as HETATM; PLACER resamples that ligand's pose (and pocket sidechains), keeping any other ligands fixed. The ligand code is resolved from the compounds stream at runtime and passed to PLACER's predict_ligand selector by name3 (every copy of that residue is predicted via predict_multi). Returns the full score set including the ligand-accuracy terms.
- Sidechain/apo mode (ligand omitted, target_res provided): a protein residue is the crop center and PLACER resamples the surrounding sidechains; no ligand is predicted. Returns only the lDDT-style confidences (no prmsd/rmsd/kabsch).
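The implicit mode selection can be pictured as a small validation function. This is a sketch of the rules stated in this section, not the wrapper's implementation; `placer_mode` is a hypothetical name, and raising ValueError for invalid combinations is an assumption about how rejection surfaces.

```python
from typing import Optional


def placer_mode(ligand: Optional[object] = None,
                target_res: Optional[str] = None) -> str:
    """Pick the task mode from the inputs given (there is no explicit mode flag)."""
    if ligand is not None:
        # Ligand-pose mode: a crop-center residue is not accepted here.
        if target_res is not None:
            raise ValueError("target_res is rejected in ligand-pose mode")
        return "ligand"
    # Sidechain/apo mode: the crop-center residue is mandatory.
    if target_res is None:
        raise ValueError("target_res is required in sidechain/apo mode")
    return "apo"


print(placer_mode(ligand="LDP"))
print(placer_mode(target_res="A-149"))
```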
PLACER reads PDB and RCSB mmCIF directly — no SDF staging. Each input produces nsamples models; the output IDs multiply by a sample index.
Prefer mmCIF for multi-ligand / multi-chain inputs. PLACER's PDB parser asserts that ligand chains and protein chains don't share a chain letter, and raises "One or more of ligand chains already exist in parsed protein chains" on structures where they collide (e.g. RCSB 4dtz as a .pdb). Its mmCIF parser doesn't have this limitation, so for crystal complexes feed an mmCIF (PDB("/path/4dtz.cif") or any RCSB CIF) rather than the converted PDB. The driver dispatches on file extension automatically (.cif/.cif.gz → CIF parser, else PDB parser). Note: RCSB-sourced mmCIF only; Rosetta/AF3 CIFs are not parsed correctly upstream.
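The extension-based dispatch amounts to a suffix check. The sketch below is illustrative rather than the driver's code: `parser_for` is a hypothetical name, and treating the extension case-insensitively is an assumption.

```python
def parser_for(path: str) -> str:
    """Return "cif" for .cif / .cif.gz inputs and "pdb" for everything else."""
    lower = path.lower()  # assumption: extension matching ignores case
    if lower.endswith((".cif", ".cif.gz")):
        return "cif"
    return "pdb"


for name in ("4dtz.cif", "4dtz.cif.gz", "4dtz.pdb"):
    print(name, "->", parser_for(name))
```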
Resources: GPU (CUDA 12.1 / torch 2.3 stack).
Environment: placer (dedicated conda env; model weights ship in the cloned repo).
Installation: Clones baker-laboratory/PLACER and creates the placer env. Weights (weights/PLACER_model_1.pt) are bundled — no separate download.
Parameters:
- structures: Union[DataStream, StandardizedOutput] (required) - protein structures (PDB/mmCIF), ligand bound as HETATM for ligand-pose mode.
- ligand: Union[DataStream, StandardizedOutput, None] = None - compounds stream with the bound ligand's 3-letter code. Presence selects ligand-pose mode.
- target_res: Optional[str] = None - crop-center residue "chain-resno" (e.g. "A-149"). Required in sidechain/apo mode; rejected in ligand-pose mode.
- exclude_sm: bool = False - drop all small molecules from the prediction (true apo).
- nsamples: int = 10 - samples per input (upstream recommends 50–100).
- rerank: Optional[str] = None - rank models/CSV by "prmsd", "plddt", or "plddt_pde" (use plddt/plddt_pde in apo mode; prmsd is ligand-only).
- bonds: Optional[tuple | List[tuple]] = None - covalent bonds to enforce during resampling (ligand-pose mode only). Each is (atom1, atom2, length) with atoms in <residue>.<atom> syntax, e.g. ("A145.SG", "LIG.C12", 1.8) to tether a ligand to a catalytic cysteine (true SNAP-Tag/AGT chemistry). The residue name (name3) is resolved per-structure from the input. Pair with PDB.break_bond to detach the ligand afterward for non-covalent downstream design.
Streams:
- structures: predicted PDB per sample (prmsd in the B-factor column).
- compounds (ligand mode only): chemistry passthrough of the input ligand (code/SMILES unchanged).
Tables:
- scores: one row per sample.
  - ligand mode: id | structures.id | compounds.id | sample | prmsd | plddt | plddt_pde | fape | rmsd | kabsch
  - apo mode: id | structures.id | sample | plddt | plddt_pde | fape
- missing: id | removed_by | cause, for inputs PLACER could not process.
Output IDs: <structure>+<ligand>_<sample> (ligand mode) or <structure>_<sample> (apo mode).
Example:
from biopipelines.placer import PLACER
# Ligand-pose mode: refine a bound ligand's pose
holo = PDB("4dtz") # ligand bound as HETATM
lig = Ligand("LDP") # residue code present in 4dtz
poses = PLACER(structures=holo, ligand=lig, nsamples=50, rerank="prmsd")
# Sidechain/apo mode: repack around a residue, no ligand
apo = PDB("dnHEM1_apo")
repacked = PLACER(structures=apo, target_res="A-149", exclude_sm=True,
nsamples=50, rerank="plddt")