Structure Prediction & Docking

AlphaFold

Predicts protein structures from amino acid sequences using AlphaFold2. Generates high-confidence 3D models with optional relaxation. Supports single-sequence prediction mode for fast predictions without MSA generation.

By default each input sequence is folded as a separate monomer. Wrap proteins in Bundle(...) to fold the bundled sequences together as one multi-chain complex: ColabFold receives a single colon-joined query (SEQ_A:SEQ_B), auto-selects the multimer model, and runs its default paired+unpaired MSA pipeline. Bundle(static, Each(...)) holds static fixed and folds it against each iterated sequence (one complex per element). A bundled complex's output id is the chain ids joined with + (e.g. p1+p2). Pre-computed msas= cannot be combined with a Bundle (a bundled complex always uses ColabFold's own MSA pipeline).
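The bundling rules above can be sketched in plain Python. This is only an illustration of how the colon-joined query and the "+"-joined output id are formed; `complex_query` and `bundle_with_each` are hypothetical helpers, not part of biopipelines, and the sequences are made up.

```python
def complex_query(chains):
    """Build (output_id, query) for one complex.

    chains: list of (chain_id, sequence) tuples, in chain order.
    The id joins chain ids with "+", the query joins sequences with ":".
    """
    ids = [chain_id for chain_id, _ in chains]
    seqs = [seq for _, seq in chains]
    return "+".join(ids), ":".join(seqs)


def bundle_with_each(static, iterated):
    """Mimic Bundle(static, Each(iterated)): one complex per iterated element."""
    return [complex_query([static, item]) for item in iterated]


# Made-up example data (hypothetical ids and sequences).
receptor = ("p1", "MKTAYIAKQR")
binders = [("b1", "GSHMLE"), ("b2", "GSAAKE")]

for out_id, query in bundle_with_each(receptor, binders):
    print(out_id, query)
```

The receptor appears in every complex while the binder changes, which matches the "one complex per element" behaviour described for Bundle(static, Each(...)).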

Resources: GPU. H100 NVL is not compatible.

Environment: localcolabfold and biopipelines

Installation:

cd data
wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabbatch_linux.sh
bash install_colabbatch_linux.sh
rm install_colabbatch_linux.sh

Parameters:

- proteins: DataStream | StandardizedOutput | Bundle | Each (required) - Input protein sequences. A bare input folds one monomer per sequence; Bundle(...) folds the bundled sequences as one multi-chain complex.
- msas: StandardizedOutput = None - Pre-computed MSAs in A3M format. When provided, ColabFold uses these instead of generating new MSAs. Note: Only A3M files originally generated by AlphaFold/ColabFold or MMseqs2 are recognized. MSAs converted from Boltz2 CSV format via MSA(source, convert="a3m") are ignored by ColabFold, which will re-query the MSA server instead.
- num_relax: int = 0 - Number of best models to relax with AMBER
- num_recycle: int = 3 - Number of recycling iterations
- rand_seed: int = 0 - Random seed (0 = random)

Streams: structures, msas

Tables:

- structures:

id	source_id	sequence

- confidence:

id	structure	plddt	max_pae	ptm

- msas (only when MSAs are generated):

id	sequences.id	sequence	msa_file

Example:

from biopipelines.alphafold import AlphaFold
from biopipelines.msa import MSA

# Standard prediction
af = AlphaFold(
    proteins=lmpnn,
    num_relax=1,
    num_recycle=5
)

# With pre-computed MSAs from a previous AlphaFold run
af = AlphaFold(proteins=seq, msas=previous_af_result)

# Fold a single A:B complex (one prediction, id "chainA+chainB")
from biopipelines.combinatorics import Bundle, Each
af = AlphaFold(proteins=Bundle(seqs_a, seqs_b))

# One complex per binder: fixed receptor folded against each binder
af = AlphaFold(proteins=Bundle(receptor, Each(binders)))

Boltz2

Predicts biomolecular complexes including proteins, nucleic acids, and small molecules. State-of-the-art model for protein-ligand and protein-protein complex prediction.

Installation:

mamba create -n Boltz2Env python=3.11
mamba activate Boltz2Env
pip install boltz[cuda] -U

Environment: Boltz2Env

Parameters:

- config: Optional[str] = None - Direct YAML configuration string
- proteins: Optional[Union[DataStream, StandardizedOutput]] = None - Protein sequences
- ssDNA: Optional[Union[DataStream, StandardizedOutput]] = None - Single-stranded DNA sequences
- dsDNA: Optional[Union[DataStream, StandardizedOutput]] = None - Double-stranded DNA sequences
- ssRNA: Optional[Union[DataStream, StandardizedOutput]] = None - Single-stranded RNA sequences
- dsRNA: Optional[Union[DataStream, StandardizedOutput]] = None - Double-stranded RNA sequences
- ligands: Optional[Union[DataStream, StandardizedOutput]] = None - Ligand compounds stream (e.g. from Ligand / CompoundLibrary)
- msas: Optional[StandardizedOutput] = None - Pre-computed MSA files to reuse (pass the entire tool output, not .msas). AlphaFold MSAs can be reused via MSA(af_result, convert="csv").
- affinity: bool = True - Calculate binding affinity predictions
- output_format: str = "pdb" - Output format (pdb, mmcif)
- msa_server: str = "public" - MSA generation (public, local)
- recycling_steps: Optional[int] = None - Number of recycling steps (default: model-specific)
- diffusion_samples: Optional[int] = None - Number of diffusion samples (default: model-specific)
- top_only: bool = True - When True (default), keep only the top model as <id>. When False, surface every diffusion sample as a separate structure <id>_1..N, e.g. to feed a Boltz2 pose ensemble (with native covalent linkage / full scaffold) into downstream design. Pair with diffusion_samples=N to control how many.
- use_potentials: bool = False - Enable external potentials
- template: Optional[str] = None - Path to PDB template file for structure guidance
- template_chain_ids: Optional[List[str]] = None - Chain IDs to apply template to (e.g., ["A", "B"])
- template_force: bool = True - Force template usage
- template_threshold: float = 5.0 - RMSD threshold for template matching
- pocket_residues: Optional[List[int]] = None - Residue positions defining binding pocket (e.g., [50, 51, 52])
- pocket_max_distance: float = 7.0 - Maximum distance for pocket constraint
- pocket_force: bool = True - Force pocket constraint
- glycosylation: Optional[Dict[str, List[int]]] = None - N-glycosylation sites per chain (e.g., {"A": [164]})
- covalent_linkage: Optional[Dict[str, Any]] = None - Covalent attachment specification (see examples). The ligand_atom is a Boltz atom name; get it from Boltz2.predict_atom_names(...) (see below).
- contacts: Optional[List[Dict[str, Any]]] = None - Contact constraints between atoms/residues. A ligand token is [chain_id, atom_name] (e.g. ["B", "C8"]); get the atom name from Boltz2.predict_atom_names(...) (see below).
- disulfide_bonds: Optional[List[Dict[str, Any]]] = None - Disulfide bond constraints
- metal_coord: Optional[List[Dict[str, Any]]] = None - Metal-coordination bond constraints

Streams: structures

Tables:

- confidence:

id	sequences.id	compounds.id	input_file	confidence_score	ptm	iptm	complex_plddt	complex_iplddt

- affinity:

id	sequences.id	compounds.id	input_file	affinity_pred_value	affinity_probability_binary

Prefer affinity_probability_binary (binder probability, higher = more likely a binder) for ranking — it is more reliable than the affinity_pred_value regression score in most cases.

Provenance columns (sequences.id, compounds.id) track which protein and ligand produced each row, enabling filtering and joins without parsing the ID string.
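As a rough illustration of ranking by affinity_probability_binary and filtering on a provenance column, the sketch below works on plain dictionaries shaped like the affinity table. The rows, ids, and values are invented for the example, and `rank_binders` is a hypothetical helper rather than a biopipelines function.

```python
# Made-up rows shaped like the affinity table (values are illustrative only).
affinity_rows = [
    {"id": "r1", "sequences.id": "p1", "compounds.id": "asp",
     "affinity_pred_value": -0.4, "affinity_probability_binary": 0.35},
    {"id": "r2", "sequences.id": "p1", "compounds.id": "ibu",
     "affinity_pred_value": -1.1, "affinity_probability_binary": 0.82},
    {"id": "r3", "sequences.id": "p2", "compounds.id": "asp",
     "affinity_pred_value": -0.9, "affinity_probability_binary": 0.67},
]


def rank_binders(rows, protein_id=None):
    """Sort rows by binder probability (highest first).

    If protein_id is given, keep only rows whose sequences.id matches,
    using the provenance column instead of parsing the row id.
    """
    if protein_id is not None:
        rows = [r for r in rows if r["sequences.id"] == protein_id]
    return sorted(rows, key=lambda r: r["affinity_probability_binary"], reverse=True)


for row in rank_binders(affinity_rows, protein_id="p1"):
    print(row["compounds.id"], row["affinity_probability_binary"])
```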

Example:

from biopipelines import Ligand
from biopipelines.boltz2 import Boltz2

# Basic apo and holo prediction
boltz_apo = Boltz2(proteins=lmpnn)
boltz_holo = Boltz2(
    proteins=lmpnn,
    ligands=Ligand(smiles="CC(=O)OC1=CC=CC=C1C(=O)O"),  # Aspirin SMILES
    msas=boltz_apo,  # Pass entire ToolOutput
    affinity=True
)

# With template guidance
boltz_template = Boltz2(
    proteins=lmpnn,
    ligands=compounds,
    template="reference.pdb",
    template_chain_ids=["A"]
)

# With pocket constraint
boltz_pocket = Boltz2(
    proteins=lmpnn,
    ligands=compounds,
    pocket_residues=[50, 51, 52, 80, 81, 82],
    pocket_max_distance=7.0
)

# With N-glycosylation (adds NAG at Asn-164)
boltz_glyco = Boltz2(
    proteins=lmpnn,
    ligands=compounds,
    glycosylation={"A": [164]}
)

# With covalent ligand attachment (e.g., to Cys-50)
boltz_covalent = Boltz2(
    proteins=lmpnn,
    ligands=covalent_inhibitor,
    covalent_linkage={
        "chain": "A",
        "position": 50,
        "protein_atom": "SG",  # Cysteine sulfur
        "ligand_atom": "C1"    # Ligand attachment atom
    }
)

Predicting ligand atom names (Boltz2.predict_atom_names)

Constraints that target a ligand atom — contacts, covalent_linkage, metal_coord — need the atom name Boltz assigns to the ligand (e.g. ["B", "C8"]). Boltz names ligand atoms deterministically from chemistry, so you can look them up before running anything. Boltz2.predict_atom_names(ligand) returns a 2-D depiction with every heavy atom labelled by its Boltz name (renders inline in Jupyter and is written to a PNG), plus a .names lookup. It is a configuration-time helper: no GPU, no Pipeline context required, and it works in a plain Python session.

from biopipelines.boltz2 import Boltz2
from biopipelines import Ligand

res = Boltz2.predict_atom_names(Ligand("ampicillin"))
res          # labelled image renders inline in a notebook
res.path     # PNG written to ./ampicillin_atoms.png (override with path=)
res.names    # {"ampicillin": {"S24": ..., "C25": ..., ...}}

It accepts a SMILES string, a Ligand / CompoundLibrary (one labelled panel per compound), or the output that either of these returns inside a pipeline. It follows two naming regimes, matching Boltz exactly:

- SMILES ligands (raw SMILES, Ligand(smiles=...), a PubChem name/CID/CAS lookup, CompoundLibrary): element.upper() + canonical rank index, with hydrogens included in the ranking, so heavy-atom indices are not 1..N (ethanol → C8, C9, O7).
- CCD ligands (Ligand("ATP") and any RCSB CCD code): the CCD's own atom names from the RCSB definition (PG, O5', C1', …).

A network call is made only to resolve a name/CID/CAS or CCD lookup; a raw SMILES or Ligand(smiles=...) is fully offline.
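To see why heavy-atom names for SMILES ligands do not start at 1, the sketch below applies the "element + rank" rule to ethanol. It is a toy illustration only: the ranks are hand-written, assumed 1-based, and chosen to reproduce the documented ethanol result; real ranks come from the canonical ranking Boltz uses, so call Boltz2.predict_atom_names(...) for actual names. `boltz_style_names` is a hypothetical helper.

```python
def boltz_style_names(elements, ranks):
    """Name each atom as ELEMENT + rank, where ranks cover ALL atoms (H included)."""
    return [element.upper() + str(rank) for element, rank in zip(elements, ranks)]


# Ethanol with explicit hydrogens: 2 C, 1 O, 6 H -> 9 atoms in total.
elements = ["C", "C", "O"] + ["H"] * 6

# ASSUMED 1-based ranks, picked so the hydrogens take ranks 1..6 and the
# heavy atoms land on the names quoted in the text above (C8, C9, O7).
ranks = [8, 9, 7, 1, 2, 3, 4, 5, 6]

names = boltz_style_names(elements, ranks)
heavy_atom_names = [n for n, el in zip(names, elements) if el != "H"]
print(heavy_atom_names)
```

Because the hydrogens occupy part of the rank range, the heavy atoms end up with indices above the heavy-atom count.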

DiffDock

Blind diffusion-based docking: samples candidate ligand poses over translation, rotation, and torsion, then re-ranks them with a confidence model. Works directly from a protein PDB + a ligand (SMILES or SDF) without a binding-box hint — useful when the pocket is unknown. Output IDs follow the multi-axis pattern <protein>+<ligand>_rank<N>; the rank-1 pose is the structures stream, and the full ranked list is in the table.
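Downstream code often needs the protein, ligand, and rank back out of an ID of the form <protein>+<ligand>_rank<N>. A minimal sketch, assuming IDs follow that pattern literally; `parse_pose_id` is a hypothetical helper and the example ID is made up from the names used in the example below.

```python
import re

# <protein>+<ligand>_rank<N>; the protein part may itself contain "+",
# so the greedy first group splits on the LAST "+" in the ID.
_POSE_ID = re.compile(r"^(?P<protein>.+)\+(?P<ligand>.+)_rank(?P<rank>\d+)$")


def parse_pose_id(pose_id):
    """Split a pose ID into (protein, ligand, rank); raise ValueError if malformed."""
    match = _POSE_ID.match(pose_id)
    if match is None:
        raise ValueError(f"not a <protein>+<ligand>_rank<N> id: {pose_id!r}")
    return match["protein"], match["ligand"], int(match["rank"])


print(parse_pose_id("4ufc+ASA_rank1"))
```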

References: https://github.com/gcorso/DiffDock · https://arxiv.org/abs/2210.01776

Resources: GPU (CPU fallback exists but is much slower).

Environment: diffdock

Installation: DiffDock.install() clones the repo + the facebookresearch/esm side-clone, builds the env, and installs the matching PyG wheels.

Parameters:

- structures: DataStream | StandardizedOutput (required) - Protein backbones (PDB).
- compounds: DataStream | StandardizedOutput (required) - Ligands as SMILES or SDF.
- samples_per_complex: int = 10 - Poses sampled per (protein, ligand) pair.
- inference_steps: int = 20 - Reverse-diffusion steps.
- actual_steps: int = None - Steps actually executed (≤ inference_steps; default inference_steps - 1).
- no_final_step_noise: bool = True - Disable noise on the last step.
- batch_size: int = 32 - DiffDock internal batch size.

Streams: structures (rank-N pose SDF per pair)

Example:

from biopipelines.diffdock import DiffDock

target = PDB("4ufc", convert="pdb")
lig = Ligand(smiles="CC(=O)Oc1ccccc1C(=O)O", ids="ASA")
dock = DiffDock(structures=target, compounds=lig)

DynamicBind

Flexible-backbone docking. DynamicBind is an equivariant generative model that predicts a ligand-specific protein conformation, letting the receptor backbone flex toward its bound state rather than docking into a rigid pocket. Reports per-pose lDDT and a predicted affinity; can optionally render a transition movie.

References: https://github.com/luwei0917/DynamicBind · https://www.nature.com/articles/s41467-024-45461-2

Resources: GPU. Pin gpu="A100" — DynamicBind's torch build is cu11x and crashes ("no kernel image") on newer cards like the H100.

Environment: dynamicbind (inference) + dynamicbind_relax (OpenMM pose relaxation); both created at install.

Installation: DynamicBind.install() clones the repo, creates both envs, and fetches the weights bundle from Zenodo.

Parameters:

- structures: DataStream | StandardizedOutput (required) - Protein structures (PDB).
- compounds: DataStream | StandardizedOutput (required) - Ligands (SMILES or SDF).
- num_samples: int = 40 - Poses sampled per pair.
- num_saved: int = None - How many top poses to keep (default: all).
- inference_steps: int = 20 - Reverse-diffusion steps.
- num_workers: int = 1 - Data-loading workers.
- rigid_protein: bool = False - Keep the protein rigid (disables the flexing that distinguishes DynamicBind).
- make_movie: bool = False - Render a conformational-transition movie per pose (movies stream).
- seed: int = 42 - Random seed.

Streams: structures (predicted complex per pose), movies (only when make_movie=True)

Example:

from biopipelines.dynamicbind import DynamicBind

with Pipeline("Examples", "DynamicBind-demo"):
    Resources(gpu="A100", memory="32GB", time="6:00:00")
    target = PDB("4ufc", convert="pdb")
    lig = Ligand(smiles="CC(=O)Oc1ccccc1C(=O)O", ids="ASA")
    dock = DynamicBind(structures=target, compounds=lig)

ESMFold

Predicts a protein's 3-D structure directly from its amino-acid sequence using a protein language model — no MSA or templates needed, so it is much faster to set up than AlphaFold for single-sequence predictions. One PDB per input sequence, with per-residue pLDDT (in the B-factor column) and a global pTM.

References: https://github.com/facebookresearch/esm

Resources: GPU. Long sequences may need chunk_size / max_tokens_per_batch lowered to fit memory; set memory="32GB".

Environment: esmfold

Installation: ESMFold.install() creates the env (mirrors the One-command-install-ESMfold recipe on the cluster; the ColabFold notebook recipe on Colab).

Parameters:

- sequences: DataStream | StandardizedOutput (required) - Input sequences.
- chunk_size: int = None - Axial-attention chunk size to reduce GPU memory (try 128/64/32 for long sequences that OOM). None = no chunking.
- num_recycles: int = 4 - Recycling iterations.
- max_tokens_per_batch: int = 1024 - Max tokens per forward pass; lower to avoid OOM.
- cpu_offload: bool = False - Offload weights to CPU to save GPU memory.
- cpu_only: bool = False - Run entirely on CPU (very slow).
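The "try 128/64/32" advice for chunk_size amounts to a simple fallback ladder. The sketch below shows that logic against a stand-in function; `fold_with_fallback` and `fake_fold` are hypothetical, and a real GPU out-of-memory failure would surface differently (here it is simulated with MemoryError). In a pipeline you would instead rerun ESMFold(...) with a smaller chunk_size.

```python
def fold_with_fallback(fold, sequence, chunk_sizes=(None, 128, 64, 32)):
    """Call fold(sequence, chunk_size=...) with progressively smaller chunks.

    Returns (chunk_size_used, result). Raises RuntimeError if every size fails.
    """
    last_error = None
    for chunk_size in chunk_sizes:
        try:
            return chunk_size, fold(sequence, chunk_size=chunk_size)
        except MemoryError as err:  # simulated OOM; an assumption of this sketch
            last_error = err
    raise RuntimeError("all chunk sizes ran out of memory") from last_error


def fake_fold(sequence, chunk_size=None):
    """Stand-in for a folding call: pretends anything above chunk size 64 runs out of memory."""
    if chunk_size is None or chunk_size > 64:
        raise MemoryError("simulated out-of-memory")
    return f"folded {len(sequence)} residues"


used, result = fold_with_fallback(fake_fold, "MKTVRQERLKSIVRILERSKEPVSGAQ")
print(used, result)
```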

Streams: structures

Example:

from biopipelines.esmfold import ESMFold

seqs = Sequence(["MKTVRQERLKSIVRILERSKEPVSGAQ"], ids=["p1"])
esm = ESMFold(sequences=seqs)

Gnina

Molecular docking with CNN-based pose scoring. Combines AutoDock Vina search with a convolutional neural network for more accurate binding pose prediction. Supports multi-run docking with statistical analysis across independent runs, optional conformer generation, and pose consistency analysis.

The binding box is determined in order: explicit center+size > autobox_ligand > crystal ligand HETATM records in the input PDB.
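The precedence order can be written as a small decision function. This is a sketch of the stated order only, not the wrapper's code: `resolve_box_source` is a hypothetical helper, and treating "explicit" as requiring both center and size is an assumption.

```python
def resolve_box_source(center=None, size=None, autobox_ligand=None,
                       has_crystal_ligand=False):
    """Return which source defines the docking box, in precedence order."""
    # 1. Explicit center + size wins (assumed: both must be given).
    if center is not None and size is not None:
        return "explicit"
    # 2. Otherwise a reference ligand for autoboxing.
    if autobox_ligand is not None:
        return "autobox_ligand"
    # 3. Otherwise crystal-ligand HETATM records found in the input PDB.
    if has_crystal_ligand:
        return "crystal_ligand"
    raise ValueError("no binding box could be determined")


print(resolve_box_source(center="10.5,20.3,15.0", size=25.0,
                         autobox_ligand="ref.sdf", has_crystal_ligand=True))
```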

Resources: GPU recommended (CPU fallback available but slow).

Environment: biopipelines (plus CUDA modules configured in config.yaml under gnina:)

Installation: Downloads pre-built gnina.1.3.2 binary from GitHub.

Parameters:

- structures: Union[DataStream, StandardizedOutput] (required) - Protein structures
- compounds: Union[DataStream, StandardizedOutput] (required) - Ligands (from Ligand(), CompoundLibrary(), etc.)
- autobox_ligand: Union[DataStream, StandardizedOutput, str, None] = None - Reference ligand for automatic box
- center: Optional[str] = None - Explicit box center as "x,y,z"
- size: Union[float, str, None] = None - Box dimensions in Angstroms (single float = cubic, "x,y,z" = asymmetric)
- autobox_add: float = 4.0 - Padding around autobox ligand (Angstroms)
- exhaustiveness: int = 8 - Search exhaustiveness
- num_modes: int = 9 - Docked poses per run
- num_runs: int = 1 - Independent docking runs per conformer
- seed: int = 42 - Base random seed (each run uses seed + run_index)
- cnn_scoring: str = "rescore" - CNN scoring mode ("rescore", "refinement", "none", "rescore_only")
- generate_conformers: bool = False - Generate RDKit ETKDGv3 conformers per ligand
- num_conformers: int = 50 - Conformers to generate before filtering
- energy_window: float = 2.0 - Max relative MMFF94 energy (kcal/mol) for conformer filtering
- conformer_rmsd: float = 1.0 - Heavy-atom RMSD cutoff (Angstroms) for Butina clustering
- conformer_energies: Optional[tuple] = None - Pre-computed energies as (TableInfo, "column_name")
- cnn_score_threshold: float = 0.5 - Min CNNscore to accept a pose (0-1)
- rmsd_threshold: float = 2.0 - RMSD cutoff (Angstroms) for pose consistency clustering
- protonate: bool = True - Add hydrogens with OpenBabel before docking
- pH: float = 7.4 - Protonation pH

Streams: structures (combined protein+ligand PDB files of best poses)

Tables:

- docking_results - One row per accepted pose across all runs:

id	structures.id	compounds.id	conformer_id	run	pose	vina_score	cnn_score	cnn_affinity

IDs are unique per pose: {protein}_{ligand}_r{run}_p{pose}.

- docking_summary - One row per (protein, ligand, conformer) group, aggregated across runs:

id	structures.id	compounds.id	conformer_id	best_vina	mean_vina	std_vina	best_cnn_score	mean_cnn_affinity	std_cnn_affinity	pose_consistency	conformer_energy	pseudo_binding_energy	best_pose_file

- mean_vina / std_vina: mean and std of the best Vina score per run (not all poses, just the top pose from each independent run). Robust metric for ranking.
- mean_cnn_affinity / std_cnn_affinity: same logic for CNN affinity.
- pose_consistency: fraction of runs whose best pose clusters together by RMSD. Measures reproducibility (1.0 = all runs converge).
- conformer_energy: MMFF94 relative strain energy (only with generate_conformers=True).
- pseudo_binding_energy: best_vina + conformer_energy. Penalizes Vina score by conformer strain. Only populated when conformer generation is enabled.
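The Vina aggregation described above can be reproduced by hand from per-pose rows. The sketch below is an approximation under stated assumptions: Vina scores are treated as lower-is-better, the standard deviation is the sample standard deviation (the wrapper may use a different estimator), and the pose rows are made up. `summarize_runs` is a hypothetical helper.

```python
from statistics import mean, stdev


def summarize_runs(poses, conformer_energy=None):
    """Aggregate per-pose rows into best/mean/std of the top Vina score per run.

    poses: list of dicts with "run" and "vina_score" (kcal/mol, lower is better).
    """
    best_per_run = {}
    for pose in poses:
        run, score = pose["run"], pose["vina_score"]
        if run not in best_per_run or score < best_per_run[run]:
            best_per_run[run] = score

    tops = list(best_per_run.values())  # one top score per independent run
    best_vina = min(tops)
    summary = {
        "best_vina": best_vina,
        "mean_vina": mean(tops),
        # Sample std is an assumption; a single run has no spread.
        "std_vina": stdev(tops) if len(tops) > 1 else 0.0,
    }
    if conformer_energy is not None:
        # Strain penalty: only meaningful when conformers were generated.
        summary["pseudo_binding_energy"] = best_vina + conformer_energy
    return summary


# Made-up poses from two runs.
poses = [
    {"run": 1, "vina_score": -7.0}, {"run": 1, "vina_score": -6.0},
    {"run": 2, "vina_score": -8.0}, {"run": 2, "vina_score": -5.0},
]
print(summarize_runs(poses, conformer_energy=1.5))
```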

Example:

from biopipelines.gnina import Gnina

# Basic docking (autobox from crystal ligand in PDB)
docked = Gnina(structures=boltz, compounds=ligand)

# Explicit box
docked = Gnina(
    structures=boltz,
    compounds=ligand,
    center="10.5,20.3,15.0",
    size=25.0
)

# With conformer generation for flexible ligand docking
docked = Gnina(
    structures=boltz,
    compounds=compound_library,
    generate_conformers=True,
    num_conformers=100,
    energy_window=3.0,
    num_runs=10,
    exhaustiveness=64
)

NeuralPLexer

Predicts protein–ligand complex structures from a protein sequence/structure plus a ligand (SMILES/SDF), using a physics-inspired flow-based generative model — no binding-box hint required. Output IDs follow <protein>+<ligand>_rank<N>.

References: https://github.com/zrqiao/NeuralPLexer · https://www.nature.com/articles/s42256-024-00792-z

Resources: GPU. On Colab it needs a high-RAM runtime: openfold attention plus complex sampling run out of memory on the free T4 runtime's 12 GB of system RAM (verified end-to-end on an A100 high-RAM runtime).

Environment: neuralplexer

Installation: NeuralPLexer.install() clones the repo, creates the env, builds the openfold CUDA extension, and fetches the (~8.7 GB) weights bundle from Zenodo.

Parameters:

- structures: DataStream | StandardizedOutput (required) - Protein structures.
- compounds: DataStream | StandardizedOutput (required) - Ligands (SMILES or SDF).
- n_samples: int = 16 - Complex samples per pair.
- num_steps: int = 40 - Sampling steps.
- chunk_size: int = 4 - Attention chunk size (lower to reduce memory).
- sampler: str = "langevin_simulated_annealing" - Sampling scheme.
- cuda: bool = True - Run on GPU.

Streams: structures (predicted complex PDB per sample)

Example:

from biopipelines.neuralplexer import NeuralPLexer

target = PDB("4ufc", convert="pdb")
lig = Ligand(smiles="CC(=O)Oc1ccccc1C(=O)O", ids="ASA")
cplx = NeuralPLexer(structures=target, compounds=lig, n_samples=16)

PLACER

Atomic-level graph neural network that stochastically regenerates coordinates from a partially corrupted input structure, producing a scored conformational ensemble. The wrapper exposes PLACER's two task modes, selected by the inputs given (no explicit mode flag):

Ligand-pose mode (ligand provided) — the input structure has the ligand bound as HETATM; PLACER resamples that ligand's pose (and pocket sidechains), keeping any other ligands fixed. The ligand code is resolved from the compounds stream at runtime and passed to PLACER's predict_ligand selector by name3 (every copy of that residue is predicted via predict_multi). Returns the full score set including the ligand-accuracy terms.
Sidechain/apo mode (ligand omitted, target_res provided) — a protein residue is the crop center and PLACER resamples the surrounding sidechains; no ligand is predicted. Returns only the lDDT-style confidences (no prmsd/rmsd/kabsch).
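Since the mode is inferred from the inputs rather than a flag, the selection rule can be sketched as follows. `select_placer_mode` is a hypothetical helper that only mirrors the rules stated in this section (ligand given means ligand-pose mode; otherwise target_res is required, and it is rejected when a ligand is given); it is not the wrapper's implementation.

```python
def select_placer_mode(ligand=None, target_res=None):
    """Infer the PLACER task mode from which inputs are provided."""
    if ligand is not None:
        if target_res is not None:
            raise ValueError("target_res is rejected in ligand-pose mode")
        return "ligand_pose"
    if target_res is None:
        raise ValueError("target_res is required when no ligand is given")
    return "sidechain_apo"


print(select_placer_mode(ligand="LDP"))        # ligand provided
print(select_placer_mode(target_res="A-149"))  # ligand omitted
```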

PLACER reads PDB and RCSB mmCIF directly, with no SDF staging. Each input produces nsamples models, and each output ID carries a sample-index suffix.

Prefer mmCIF for multi-ligand / multi-chain inputs. PLACER's PDB parser asserts that ligand chains and protein chains don't share a chain letter, and raises "One or more of ligand chains already exist in parsed protein chains" on structures where they collide (e.g. RCSB 4dtz as a .pdb). Its mmCIF parser doesn't have this limitation, so for crystal complexes feed an mmCIF (PDB("/path/4dtz.cif") or any RCSB CIF) rather than the converted PDB. The driver dispatches on file extension automatically (.cif/.cif.gz → CIF parser, anything else → PDB parser). Only RCSB-sourced mmCIF is supported: Rosetta/AF3 CIFs are not parsed correctly upstream.
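The extension-based dispatch is easy to check before submitting a job. A minimal sketch, assuming the rule is purely suffix-based; `pick_parser` is a hypothetical helper, and the case-insensitive match is an assumption of this example.

```python
def pick_parser(path):
    """Return "cif" for .cif / .cif.gz files and "pdb" for everything else."""
    # Lower-casing is an assumption; the real driver may be case-sensitive.
    if path.lower().endswith((".cif", ".cif.gz")):
        return "cif"
    return "pdb"


for name in ("4dtz.cif", "4dtz.cif.gz", "4dtz.pdb", "model.ent"):
    print(name, "->", pick_parser(name))
```

Checking this up front helps route crystal complexes with shared chain letters to the CIF parser rather than the PDB parser.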

Resources: GPU (CUDA 12.1 / torch 2.3 stack).

Environment: placer (dedicated conda env; model weights ship in the cloned repo).

Installation: Clones baker-laboratory/PLACER and creates the placer env. Weights (weights/PLACER_model_1.pt) are bundled — no separate download.

Parameters:

- structures: Union[DataStream, StandardizedOutput] (required) - protein structures (PDB/mmCIF), ligand bound as HETATM for ligand-pose mode.
- ligand: Union[DataStream, StandardizedOutput, None] = None - compounds stream with the bound ligand's 3-letter code. Presence selects ligand-pose mode.
- target_res: Optional[str] = None - crop-center residue "chain-resno" (e.g. "A-149"). Required in sidechain/apo mode; rejected in ligand-pose mode.
- exclude_sm: bool = False - drop all small molecules from the prediction (true apo).
- nsamples: int = 10 - samples per input (upstream recommends 50–100).
- rerank: Optional[str] = None - rank models/CSV by "prmsd", "plddt", or "plddt_pde" (use plddt/plddt_pde in apo mode; prmsd is ligand-only).
- bonds: Optional[tuple | List[tuple]] = None - covalent bonds to enforce during resampling (ligand-pose mode only). Each is (atom1, atom2, length) with atoms in <residue>.<atom> syntax, e.g. ("A145.SG", "LIG.C12", 1.8) to tether a ligand to a catalytic cysteine (true SNAP-Tag/AGT chemistry). The residue name (name3) is resolved per-structure from the input. Pair with PDB.break_bond to detach the ligand afterward for non-covalent downstream design.

Streams:

- structures - predicted PDB per sample (prmsd in the B-factor column).
- compounds (ligand mode only) - chemistry passthrough of the input ligand (code/SMILES unchanged).

Output IDs: <structure>+<ligand>_<sample> (ligand mode) or <structure>_<sample> (apo mode).

Example:

from biopipelines.placer import PLACER

# Ligand-pose mode: refine a bound ligand's pose
holo = PDB("4dtz")              # ligand bound as HETATM
lig = Ligand("LDP")             # residue code present in 4dtz
poses = PLACER(structures=holo, ligand=lig, nsamples=50, rerank="prmsd")

# Sidechain/apo mode: repack around a residue, no ligand
apo = PDB("dnHEM1_apo")
repacked = PLACER(structures=apo, target_res="A-149", exclude_sm=True,
                  nsamples=50, rerank="plddt")