Sequence Design
DNAEncoder
Reverse-translates protein sequences to DNA with organism-specific codon optimization. Uses thresholded weighted codon sampling based on CoCoPUTs genome frequency tables.
Environment: biopipelines
Parameters:
- sequences: Union[ToolOutput, StandardizedOutput] (required) - Input protein sequences
- organism: str = "EC" - Target organism for codon optimization:
  - "EC" (Escherichia coli)
  - "SC" (Saccharomyces cerevisiae)
  - "HS" (Homo sapiens)
  - Combinations: "EC&HS", "EC&SC", "HS&SC", "EC&HS&SC"
Tables:
- dna:
| id | protein_sequence | dna_sequence | organism | method |
|---|---|---|---|---|
- Excel file with color-coded codons (red <5‰, orange 5-10‰, black ≥10‰)
Example:
from biopipelines.dna_encoder import DNAEncoder
dna = DNAEncoder(
sequences=lmpnn,
organism="EC&HS" # Conservative optimization for both E. coli and human
)
Note: Uses thresholded weighted sampling (codons ≥10‰, fallback to ≥5‰). For multi-organism optimization, uses minimum frequency across organisms. Please cite CoCoPUTs (HIVE) when using.
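The sampling rule in the note above can be sketched as follows. This is a minimal illustration, not the DNAEncoder implementation: the frequency table holds made-up placeholder values (not real CoCoPUTs data), the function names are hypothetical, and the behavior when no codon reaches 5‰ (falling back to all codons) is an assumption.

```python
import random

# Hypothetical per-thousand codon frequencies for two amino acids.
# Placeholder numbers for illustration only, NOT real CoCoPUTs values.
FREQ = {
    "EC": {"K": {"AAA": 33.0, "AAG": 12.0}, "R": {"CGT": 21.0, "AGG": 2.0, "CGA": 4.0}},
    "HS": {"K": {"AAA": 24.0, "AAG": 32.0}, "R": {"CGT": 6.0, "AGG": 12.0, "CGA": 6.0}},
}

def combined_frequencies(aa, organisms):
    """Minimum frequency of each codon across the requested organisms."""
    codons = FREQ[organisms[0]][aa]
    return {c: min(FREQ[o][aa][c] for o in organisms) for c in codons}

def candidate_codons(freqs, primary=10.0, fallback=5.0):
    """Keep codons >= primary; if none, codons >= fallback; if none, all (assumption)."""
    for threshold in (primary, fallback):
        kept = {c: f for c, f in freqs.items() if f >= threshold}
        if kept:
            return kept
    return dict(freqs)

def encode(protein, organism="EC", seed=None):
    """Reverse-translate by frequency-weighted sampling among the kept codons."""
    rng = random.Random(seed)
    organisms = organism.split("&")
    dna = []
    for aa in protein:
        kept = candidate_codons(combined_frequencies(aa, organisms))
        codons, weights = zip(*kept.items())
        dna.append(rng.choices(codons, weights=weights, k=1)[0])
    return "".join(dna)
```

With these placeholder numbers, arginine in "EC&HS" has no codon at or above 10‰ after taking the minimum, so only the codons at or above 5‰ remain candidates.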
Frame2Seq
Fast structure-conditioned inverse folding. Frame2Seq is a non-autoregressive masked language model that generates multiple sequences per backbone in a single forward pass. It fills the same role as ProteinMPNN (structure → sequence) but is substantially faster, with slightly higher native-sequence recovery on CATH 4.2. Output IDs follow the ProteinMPNN multiplier convention <structure_id>_<n>.
References: https://github.com/dakpinaroglu/Frame2seq · https://arxiv.org/abs/2312.02447
Environment: frame2seq
Installation: Frame2Seq.install() creates the env and pip-installs the package (weights ship with it). Runs on CPU; a GPU speeds up large/many inputs.
Parameters:
- structures: DataStream | StandardizedOutput (required) - Input backbones.
- num_sequences: int = 1 - Sequences to sample per structure.
- temperature: float = 1.0 - Sampling temperature (>0).
- chain: str = "A" - Chain to redesign.
- omit_aa: str = "" - Single-letter codes to exclude from sampling (e.g. "CM" to omit Cys and Met).
- fixed: str | (TableInfo, column) = "" - Residues to keep at their input identity. Chain-aware selection ("A10-20+A30") or a table column reference. Mutually exclusive with redesigned.
- redesigned: str | (TableInfo, column) = "" - Residues to redesign; everything else on the chain is held fixed.
Streams: sequences, fasta
Tables:
- sequences:
| id | sequence | score | recovery | structures.id |
|---|---|---|---|---|
- missing:
| id | removed_by | cause |
|---|---|---|
Example:
from biopipelines.frame2seq import Frame2Seq
seqs = Frame2Seq(structures=rfd, num_sequences=10, temperature=0.5)
Fuse
Concatenates multiple sequences with flexible linkers. Creates fusion sequences with customizable linker lengths for domain engineering. Works with both protein and DNA sequences. Outputs include sequence/linker position columns in PyMOL selection format for easy visualization.
Environment: biopipelines
Parameters:
- sequences: Union[List[str], str] (required) - List of sequences or PDB file paths
- name: str = "" - Job name for output files
- linker: str = "GGGGSGGGGSGGGGSGGGGS" - Linker sequence that will be cut based on linker_lengths if specified
- linker_lengths: List[str] = None - List of length ranges for each junction to generate multiple variants by cutting the linker (e.g., ["1-6", "1-6"])
Streams: sequences
Tables:
- sequences:
| id | sequence | lengths | S1 | L1 | S2 | L2 | S3 | ... |
|---|---|---|---|---|---|---|---|---|
- lengths: Short name of the lengths, e.g. 2-4, 5-2-4, ...
- S1, S2, S3, ...: Sequence positions in PyMOL selection format (e.g., "1-73", "76-237")
- L1, L2, ...: Linker positions in PyMOL selection format (e.g., "74-75", "238-240")
- Number of columns depends on number of input sequences: n sequences → n sequence columns (S1...Sn) and n-1 linker columns (L1...Ln-1)
Example:
from biopipelines.fuse import Fuse
from biopipelines.pdb import PDB
N="GNH..."
mid=PDB("...")
C="EFT..."
fused = Fuse(
sequences=[N, mid, C],
linker="GSGAG",
linker_lengths=["2-4", "2-4"],
name="protein_fusion"
)
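To make the S/L position columns concrete, here is a small sketch of how one row per linker-length combination could be assembled. It is not the Fuse implementation: the helper names are hypothetical, and it assumes the linker is cut from its start and that the lengths column joins the chosen linker lengths with "-".

```python
from itertools import product

def parse_range(spec):
    """'2-4' -> [2, 3, 4]; a single number such as '3' -> [3]."""
    lo, _, hi = spec.partition("-")
    return list(range(int(lo), int(hi or lo) + 1))

def fuse_variants(sequences, linker, linker_lengths):
    """Yield one row (dict) per combination of linker lengths."""
    for lengths in product(*(parse_range(s) for s in linker_lengths)):
        parts, row, pos = [], {}, 1
        for i, seq in enumerate(sequences, start=1):
            parts.append(seq)
            row[f"S{i}"] = f"{pos}-{pos + len(seq) - 1}"
            pos += len(seq)
            if i <= len(lengths):
                cut = linker[: lengths[i - 1]]  # assumption: linker is cut from its start
                parts.append(cut)
                row[f"L{i}"] = f"{pos}-{pos + len(cut) - 1}"
                pos += len(cut)
        row["sequence"] = "".join(parts)
        row["lengths"] = "-".join(map(str, lengths))  # assumed naming of the short name
        yield row
```

For three inputs of 73, 162 and 5 residues joined by linkers of 2 and 3 residues, this yields S1 = "1-73", L1 = "74-75", S2 = "76-237", L2 = "238-240" and S3 = "241-245". Two junctions with ranges "2-4" each give 3 × 3 = 9 variants.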
LigandMPNN
Designs protein sequences optimized for ligand binding. Specialized version of ProteinMPNN that considers protein-ligand interactions during sequence design.
References: https://www.nature.com/articles/s41592-025-02626-1.
Installation: Following the official repository (https://github.com/dauparas/LigandMPNN), go to your data folder and run:
git clone https://github.com/dauparas/LigandMPNN.git
cd LigandMPNN
bash get_model_params.sh "./model_params"
mamba create -n ligandmpnn_env python=3.11
mamba activate ligandmpnn_env
pip3 install -r requirements.txt
Parameters:
- structures: Union[DataStream, StandardizedOutput] (required) - Input structures
- ligand: Optional[Union[str, DataStream, StandardizedOutput]] = None - Compounds stream (Ligand(code="LIG") or any compounds-producing tool) or a 3-letter code naming the bound ligand for binding-site focus; the residue code is read from the stream at runtime
- num_sequences: int = 1 - Number of sequences per batch
- fixed: str | (TableInfo, column) = "" - Fixed positions (LigandMPNN format "A3 A4 A5" or table reference)
- redesigned: str | (TableInfo, column) = "" - Designed positions (LigandMPNN format or table reference)
- design_within: float = 5.0 - Distance in Angstroms from ligand for post-generation analysis only (does not control design). For actually designing residues within a distance, use DistanceSelector to select positions first.
- chain: str = "A" - Default chain ID applied to chainless position input (e.g. when positions are given as "10-20" without chain prefix)
- model: str = "v_32_010" - LigandMPNN model version (v_32_005, v_32_010, v_32_020, v_32_025)
- num_batches: int = 1 - Number of batches to run. Total sequences = num_sequences × num_batches
- remove_duplicates: bool = True - Drop duplicate sequences from the output
- fill_gaps: str = "G" - Fill gaps in the protein with an amino acid (default glycine).
- temperature: float = 0.0 - Sampling temperature (0.0 = argmax / deterministic)
- bias_AA_per_residue: str = "" - Per-residue amino-acid bias (LigandMPNN JSONL path or spec)
- seed: int = 0 - Random seed (0 = random)
Streams: sequences
Tables:
- sequences:
| id | sequence | sample | T | seed | overall_confidence | ligand_confidence | seq_rec | gaps |
|---|---|---|---|---|---|---|---|---|
Example:
from biopipelines.ligand_mpnn import LigandMPNN
from biopipelines.ligand import Ligand
lmpnn = LigandMPNN(
structures=rfdaa,
ligand=Ligand(code="LIG"),
num_sequences=5,
redesigned=rfdaa.tables.structures.designed
)
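The chain parameter describes filling in a default chain for chainless positions such as "10-20". A minimal sketch of that kind of conversion into the space-separated "A3 A4 A5" style is shown below; the function name is hypothetical and the actual parsing inside the tool may differ.

```python
def to_ligandmpnn_positions(selection, chain="A"):
    """'10-12+B30' -> 'A10 A11 A12 B30': expand ranges, add the default chain if missing."""
    residues = []
    for part in selection.replace(" ", "").split("+"):
        if not part:
            continue
        has_chain = part[0].isalpha()
        prefix = part[0] if has_chain else chain
        body = part[1:] if has_chain else part
        lo, _, hi = body.partition("-")
        residues += [f"{prefix}{i}" for i in range(int(lo), int(hi or lo) + 1)]
    return " ".join(residues)
```

This sketch assumes single-letter chain IDs and non-negative residue numbers.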
Mutagenesis
Performs mutagenesis at specified positions. Generates systematic amino acid substitutions for experimental library design or computational scanning.
Environment: MutationEnv
Parameters:
- original: Union[DataStream, StandardizedOutput] (required) - Input structure/sequence
- position: Union[int, str, TableReference, StandardizedOutput] = None (required in practice) - Target position(s) for mutagenesis:
  - int: Fixed position (1-indexed) for all sequences
  - str: PyMOL-style selection (e.g., "141+143+145-149")
  - TableReference: Per-row position lookup (e.g., fuse.tables.sequences.L1)
  - StandardizedOutput: From Selection tool (extracts selections.selection column)
- mutate_to: str = "" - Target amino acid(s) for "specific" mode (e.g., "A" for alanine, "AV" for alanine and valine). Required when mode is "specific".
- mode: str = "specific" - Mutagenesis strategy:
  - "specific": Only the amino acid(s) given in mutate_to (default)
  - "saturation": All 20 amino acids
  - "hydrophobic": Hydrophobic residues only
  - "hydrophilic": Hydrophilic residues only
  - "charged": Charged residues only
  - "polar": Polar residues only
  - "nonpolar": Nonpolar residues only
  - "aromatic": Aromatic residues only
  - "aliphatic": Aliphatic residues only
  - "positive": Positively charged residues only
  - "negative": Negatively charged residues only
- include_original: bool = False - Include original amino acid in output
- exclude: str = "" - Amino acids to exclude (single letter codes as string, e.g., "CP")
- combinatorial: bool = False - When multiple positions are given, generate the full Cartesian product of substitutions across positions instead of mutating each position independently
- msas: Union[DataStream, StandardizedOutput] = None - Optional precomputed MSAs for the original protein(s) (e.g. from AlphaFold, MMseqs2, or the MSA tool). When provided, a synthetic per-mutant MSA is derived by copying the parent's MSA (matched on the original protein id) and substituting only the query (first) row at the mutated position(s); homolog rows pass through unchanged and the format (a3m/csv) is preserved. The emitted msas stream is keyed by the mutant ids, so it feeds straight into a downstream folding tool alongside the mutant sequences. No realignment is performed: point substitutions introduce no gaps, so alignment columns are unchanged. A mutant whose parent has no MSA is skipped (warned, not fabricated). When None (default), no msas stream is emitted.
Streams: sequences; msas (only when msas= is given)
Tables:
- sequences:
| id | sequences.id | sequence | mutations | mutation_positions | original_aa | new_aa |
|---|---|---|---|---|---|---|
When chaining multiple Mutagenesis steps, mutations accumulates (e.g., A42V,G50L) and mutation_positions uses PyMOL selection format (e.g., 42+50).
missing:
| id | removed_by | cause |
|---|---|---|
msas (only when msas= is given):
| id | sequences.id | original.id | sequence | msa_file |
|---|---|---|---|---|
id and sequences.id are both the mutant id (so downstream folding tools match the mutant query); original.id records the parent protein the MSA was derived from; sequence is the mutated query sequence.
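The per-mutant MSA derivation described for the msas parameter can be illustrated with a short sketch for the a3m case. This is not the tool's code: the function name is hypothetical, and it assumes the first record is the query, that each sequence sits on a single line, and that the file has no leading comment line.

```python
def mutate_a3m_query(a3m_text, mutations, mutant_id):
    """Return a copy of an a3m alignment in which only the query row is mutated.

    mutations: strings such as "A42V" (original aa, 1-indexed position, new aa).
    """
    lines = a3m_text.strip().splitlines()
    # lines[0] is the query header, lines[1] the query sequence, the rest are homologs.
    query, homologs = list(lines[1]), lines[2:]
    for mut in mutations:
        old, pos, new = mut[0], int(mut[1:-1]), mut[-1]
        if query[pos - 1] != old:
            raise ValueError(f"{mut}: query has {query[pos - 1]} at position {pos}")
        query[pos - 1] = new
    # The header is rewritten so the record is keyed by the mutant id.
    return "\n".join([f">{mutant_id}", "".join(query), *homologs]) + "\n"
```

Because a point substitution changes no lengths, the homolog rows are passed through byte for byte and the alignment columns stay in register.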
Example:
from biopipelines.mutagenesis import Mutagenesis
# Convert position 42 to alanine
sdm = Mutagenesis(original=template, position=42, mutate_to="A")
# Saturation mutagenesis at position 42 (excluding cysteine and proline)
sdm = Mutagenesis(original=template, position=42, mode="saturation", exclude="CP")
# Multiple positions
sdm = Mutagenesis(original=template, position="42+50+55-60", mode="saturation")
# Per-row positions from a table column (e.g., linker positions from Fuse)
sdm = Mutagenesis(original=fused, position=fused.tables.sequences.L1, mode="saturation")
# Positions from Selection tool
sdm = Mutagenesis(original=template, position=selection_output, mode="saturation")
# Reuse the wild-type MSA for all mutants: derive one synthetic MSA per mutant
# (query row substituted, homolog rows kept) and fold without re-querying.
af = AlphaFold(proteins=template) # builds the original MSA
sdm = Mutagenesis(original=template, position=42, mode="saturation", msas=af)
folded = AlphaFold(proteins=sdm, msas=sdm) # per-mutant MSAs, keyed by mutant id
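For a sense of how many variants a given call produces, the sketch below expands a PyMOL-style position string and enumerates saturation mutants, either per position or as a Cartesian product. It is a simplified stand-in with hypothetical function names, not the Mutagenesis implementation, and it covers only the "saturation" mode.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def parse_selection(selection):
    """'42+50+55-57' -> [42, 50, 55, 56, 57] (1-indexed positions)."""
    positions = []
    for part in selection.split("+"):
        lo, _, hi = part.partition("-")
        positions += range(int(lo), int(hi or lo) + 1)
    return positions

def saturation(sequence, selection, exclude="", include_original=False, combinatorial=False):
    """Return {mutation_label: mutant_sequence} for the selected positions."""
    positions = parse_selection(selection)
    options = {}
    for pos in positions:
        original = sequence[pos - 1]
        options[pos] = [aa for aa in AMINO_ACIDS
                        if aa not in exclude and (include_original or aa != original)]
    if combinatorial:
        # Full Cartesian product: one substitution at every position per mutant.
        combos = product(*([(p, aa) for aa in options[p]] for p in positions))
    else:
        # Independent scanning: exactly one substituted position per mutant.
        combos = ([(p, aa)] for p in positions for aa in options[p])
    mutants = {}
    for combo in combos:
        seq = list(sequence)
        for pos, aa in combo:
            seq[pos - 1] = aa
        label = ",".join(f"{sequence[p - 1]}{p}{aa}" for p, aa in combo)
        mutants[label] = "".join(seq)
    return mutants
```

Saturating two positions independently gives 19 + 19 = 38 mutants, while combinatorial=True gives 19 × 19 = 361, so the combinatorial option grows quickly with the number of positions.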
MutationComposer
Generates new protein sequences by composing mutations based on frequency analysis. Creates combinatorial mutants from mutation profiles with different sampling strategies.
Installation: Same environment as MutationProfiler.
Parameters:
- frequencies: Union[List, ToolOutput, StandardizedOutput, TableInfo, str] (required) - Mutation frequency table(s) from MutationProfiler
- num_sequences: int = 10 - Number of sequences to generate
- mode: str = "single_point" - Generation strategy:
  - "single_point": One mutation per sequence
  - "weighted_random": Random mutations weighted by frequency
  - "hotspot_focused": Focus on high-frequency positions
  - "top_mutations": Use only top N mutations
- min_frequency: float = 0.01 - Minimum frequency threshold for mutations
- max_mutations: int = None - Maximum mutations per sequence
- random_seed: int = None - Random seed for reproducibility
- prefix: str = "" - Prefix for sequence IDs
- hotspot_count: int = 10 - Number of top hotspot positions (for hotspot_focused mode)
- combination_strategy: str = "average" - Strategy for combining multiple tables (average, maximum, stack, round_robin)
Streams: sequences
Tables:
- sequences:
| id | sequence | mutations | mutation_positions |
|---|---|---|---|
Example:
from biopipelines.mutation_composer import MutationComposer
from biopipelines.mutation_profiler import MutationProfiler
profiler = MutationProfiler(original=ref, mutants=variants)
composer = MutationComposer(
frequencies=profiler.tables.relative_frequencies,
num_sequences=50,
mode="weighted_random",
max_mutations=5
)
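As a rough illustration of what frequency-weighted composition means, the sketch below draws mutations with probability proportional to their frequency after applying min_frequency. Everything here is an assumption made for illustration: the function name, the {(position, amino_acid): frequency} input layout, and the choice to always draw up to max_mutations distinct positions. The real tool reads MutationProfiler tables and may sample differently.

```python
import random

def compose_weighted(reference, frequencies, num_sequences=10, min_frequency=0.01,
                     max_mutations=3, random_seed=None):
    """Return (sequence, mutations_label) pairs sampled by mutation frequency."""
    rng = random.Random(random_seed)
    pool = {k: f for k, f in frequencies.items() if f >= min_frequency}
    results = []
    for _ in range(num_sequences):
        remaining, chosen = dict(pool), {}
        while remaining and len(chosen) < max_mutations:
            keys = list(remaining)
            pos, aa = rng.choices(keys, weights=[remaining[k] for k in keys], k=1)[0]
            chosen[pos] = aa
            # Drop other candidates at this position so it is mutated only once.
            remaining = {k: f for k, f in remaining.items() if k[0] != pos}
        seq = list(reference)
        for pos, aa in chosen.items():
            seq[pos - 1] = aa
        label = ",".join(f"{reference[p - 1]}{p}{a}" for p, a in sorted(chosen.items()))
        results.append(("".join(seq), label))
    return results
```

Passing a fixed random_seed makes the draw reproducible, which mirrors the purpose of the random_seed parameter above.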
ProteinMPNN
Designs protein sequences for given backbone structures. Uses graph neural networks to optimize sequences for structure stability while respecting fixed/designed region constraints.
References: https://www.science.org/doi/10.1126/science.add2187.
Installation: Go to your data folder and clone the official repository (https://github.com/dauparas/ProteinMPNN). The model will then work in the same environment as RFdiffusion.
Parameters:
- structures: Union[DataStream, StandardizedOutput] (required) - Input structures
- num_sequences: int = 1 - Number of sequences per structure
- fixed: str | (TableInfo, column) = "" - Fixed positions (PyMOL selection or table reference)
- redesigned: str | (TableInfo, column) = "" - Redesigned positions (PyMOL selection or table reference)
- chain: str = "auto" - Chain to apply fixed positions ("auto" detects from input structure)
- sampling_temp: float = 0.1 - Sampling temperature
- model_name: str = "v_48_020" - ProteinMPNN model variant
- soluble_model: bool = True - Use soluble protein model
- remove_duplicates: bool = True - Drop duplicate sequences from the output
- fill_gaps: str = "G" - Fill gaps in the protein with an amino acid (default glycine).
- bias_AA_jsonl: str = "" - Path to a ProteinMPNN amino-acid bias JSONL
- omit_AA_jsonl: str = "" - Path to a ProteinMPNN per-position omit-AA JSONL
- seed: int = 0 - Random seed (0 = random)
- ca_noise_std: float = 0.0 - Std. dev. of Gaussian noise added to Cα coordinates before design
Streams: sequences
Tables:
- sequences:
| id | structures.id | source_pdb | sequence | score | seq_recovery | rmsd | gaps |
|---|---|---|---|---|---|---|---|
Note: Sample 0 is the original/template sequence, samples 1+ are designs.
Example:
from biopipelines.protein_mpnn import ProteinMPNN
pmpnn = ProteinMPNN(
structures=rfd,
num_sequences=10,
fixed="1-10+50-60",
redesigned="20-40"
)
RBSDesigner
Designs synthetic ribosome binding sites (RBS) to control protein expression in bacteria. Uses the Salis thermodynamic model to predict translation initiation rates and a simulated annealing optimizer to design RBS sequences matching a target expression level. Requires ViennaRNA for RNA free energy calculations.
Reference: Salis, Mirsky & Voigt, Nat. Biotechnol. 27, 946–950 (2009). doi:10.1038/nbt.1568
Environment: rbs_designer (ViennaRNA from bioconda, requires flexible channel priority)
Installation:
Parameters:
- sequences: Union[ToolOutput, StandardizedOutput] (required) - Input DNA sequences (typically from DNAEncoder)
- tir: Union[str, int, float] = "medium" - Target translation initiation rate:
  - "low" (100 au)
  - "medium" (1000 au)
  - "high" (10000 au)
  - "maximum" (100000 au)
  - Or any numeric value (au)
- pre_sequence: str = "" - Optional fixed 5'UTR DNA to prepend before the designed RBS
- add_start_codon: bool = False - Prepend an ATG start codon to the gene if absent
Tables:
- rbs:
| id | dna_sequence | rbs_sequence | full_gene | dg_total | tir_predicted | target_tir | target_dg | spacing | dg_mrna_rrna | dg_start | dg_spacing | dg_mrna | dg_standby |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
full_gene = pre_sequence + rbs_sequence + dna_sequence (complete DNA ready for synthesis)
Example:
from biopipelines.dna_encoder import DNAEncoder
from biopipelines.rbs_designer import RBSDesigner
# Codon-optimize then design RBS for high expression in E. coli
dna = DNAEncoder(sequences=proteins, organism="EC")
rbs = RBSDesigner(sequences=dna, tir="high")
# Design RBS for specific TIR with 5'UTR prefix
rbs = RBSDesigner(sequences=dna, tir=5000, pre_sequence="AATTAA")
Thermodynamic model (Equation 2):
- dG_mRNA:rRNA: SD / anti-SD hybridization energy (ViennaRNA duplexfold)
- dG_start: Start codon identity (AUG = −1.194, GUG = −0.075 kcal/mol)
- dG_spacing: Penalty for non-optimal SD-to-start-codon distance
- dG_standby: Energy to unfold the 4-nt standby site upstream of SD
- dG_mRNA: Local mRNA folding energy (70 nt window around start codon)
Note: RBS design uses simulated annealing with adaptive temperature control (5–20% acceptance rate). Each sequence typically requires thousands of energy evaluations. Computation time scales with the number of input sequences. Please cite Salis et al. 2009 when using.
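The five terms above combine into one total free energy, which the Salis model relates exponentially to the translation initiation rate. The sketch below shows that relationship and how a target TIR maps back to a target dG. The value β = 0.45 mol/kcal is the apparent Boltzmann factor reported by Salis et al. (2009); the proportionality constant K = 2500 is an assumption taken from common RBS Calculator usage and may not match what RBSDesigner uses. The function names are hypothetical.

```python
import math

BETA = 0.45   # mol/kcal, apparent Boltzmann factor from Salis et al. (2009)
K = 2500.0    # assumption: proportionality constant; the tool's value may differ

def dg_total(dg_mrna_rrna, dg_start, dg_spacing, dg_standby, dg_mrna):
    """Total free energy change of translation initiation (kcal/mol)."""
    return dg_mrna_rrna + dg_start + dg_spacing - dg_standby - dg_mrna

def tir_from_dg(dg):
    """Predicted translation initiation rate (au) for a total free energy."""
    return K * math.exp(-BETA * dg)

def dg_from_tir(tir):
    """Total free energy (kcal/mol) that would give the requested rate."""
    return -math.log(tir / K) / BETA
```

A more negative dG_total means a higher predicted rate, so the "high" preset corresponds to a lower target dG than the "low" preset. Note that dG_standby and dG_mRNA enter with a minus sign because they describe structures that must unfold before the ribosome can bind.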
StitchSequences
Combines a template sequence with two types of modifications: substitutions (position-to-position copying from equal-length sequences) and indels (segment replacement that can change sequence length). Generates all Cartesian product combinations.
Environment: biopipelines
Parameters:
- template: Union[str, DataStream, StandardizedOutput] - Base sequence (raw string or tool output). Optional if using concatenation mode.
- substitutions: Dict[str, Union[List[str], DataStream, StandardizedOutput]] = None - Position-to-position substitutions from equal-length sequences. For each position in the selection, the residue at that position in the substitution sequence replaces the residue at that position in the template.
  - Keys: Position strings like "11-19" or "11-19+31-44", or table references
  - Values: tool output with sequences (must be same length as template)
- indels: Dict[str, Union[List[str], DataStream, StandardizedOutput]] = None - Segment replacements where each contiguous segment is replaced with the given sequence. Can change sequence length.
  - Keys: Position strings like "50-55" or "6-7+9-10+17-18", or integers for concatenation mode
  - Values: List of raw sequences (each segment replaced with full sequence)
- remove_duplicates: bool = True - Drop duplicate sequences from the generated combinations
Position Syntax:
- "10-20" → positions 10 to 20 (inclusive, 1-indexed)
- "10-20+30-40" → positions 10-20 and 30-40
- "145+147+150" → specific positions 145, 147, and 150
Processing Order:
1. Substitutions are applied first (position-to-position, same length)
2. Indels are applied second (segment replacement, can change length)
Indel Segment Behavior: For discontinuous selections like "6-7+9-10+17-18", each contiguous segment is replaced with the full replacement sequence. So "6-7+9-10": "GP" replaces segment 6-7 with "GP" AND segment 9-10 with "GP".
Concatenation Mode: When template is omitted and indels keys are integers (1, 2, 3...), sequences are concatenated in order.
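The indel segment behavior and the Cartesian product can be sketched in a few lines. This is an illustrative stand-in with hypothetical function names, not the StitchSequences code: it handles only string templates and position-string keys, assumes segments do not overlap, and applies replacements right to left so that every coordinate refers to the original template.

```python
from itertools import product

def parse_segments(selection):
    """'3-4+7-8' -> [(3, 4), (7, 8)] as 1-indexed inclusive segments."""
    segments = []
    for part in selection.split("+"):
        lo, _, hi = part.partition("-")
        segments.append((int(lo), int(hi or lo)))
    return segments

def stitch(template, indels):
    """Return every combination of indel options applied to the template."""
    keys = list(indels)
    results = []
    for combo in product(*(indels[k] for k in keys)):
        # Flatten to (start, end, replacement): each segment gets the full replacement.
        edits = [(s, e, repl) for key, repl in zip(keys, combo)
                 for s, e in parse_segments(key)]
        seq = template
        # Right-to-left keeps earlier coordinates valid when lengths change.
        for start, end, repl in sorted(edits, reverse=True):
            seq = seq[: start - 1] + repl + seq[end:]
        if seq not in results:  # mirrors remove_duplicates=True
            results.append(seq)
    return results
```

Applied to the alphabet template with the key "3-4+7-8+11-12" and the option "XX", this produces "ABXXEFXXIJXXMNOPQRSTUVWXYZ", with each of the three segments replaced by the full "XX".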
Streams: sequences
Tables:
- sequences:
| id | sequence |
|---|---|
Examples:
from biopipelines.stitch_sequences import StitchSequences
# Position-to-position substitution from ToolOutput
# Both template and substitution sequences are 180 residues
stitched = StitchSequences(
template=pmpnn,
substitutions={
"11-19+31-44": lmpnn # Copy residues at these positions from lmpnn
}
)
# Segment replacement with indels
stitched = StitchSequences(
template="MKTAYIAKQRQISFVKSHFS...",
indels={
"11-15": ["AAAAA", "GGGGG"], # Replace segment with 5-char options
"20-22": ["XX", "YYY", "ZZZZ"] # Can change length
}
)
# Output: 2 × 3 = 6 combinations
# Combined: substitutions then indels
stitched = StitchSequences(
template=pmpnn,
substitutions={
"6-12+19+21": lmpnn # Position-to-position from lmpnn
},
indels={
"50-55": ["LINKER", "GGG"] # Replace segment 50-55
}
)
# Discontinuous indel: replace multiple segments with same sequence
stitched = StitchSequences(
template="ABCDEFGHIJKLMNOPQRSTUVWXYZ",
indels={
"3-4+7-8+11-12": ["XX", "YY"] # Each segment replaced with "XX" or "YY"
}
)
# "3-4+7-8+11-12": "XX" -> "ABXXEFXXIJXXMNOPQRSTUVWXYZ"
# ToolOutput with table-based positions
stitched = StitchSequences(
template=pmpnn,
substitutions={
distances.tables.selections.within: lmpnn
}
)
# Concatenation mode (no template, integer keys in indels)
stitched = StitchSequences(
indels={
1: ["AAAA", "BBBB"], # First segment
2: ["CCCC"], # Second segment
3: ["DDDD", "EEEE", "FFFF"] # Third segment
}
)
# Output: 2 × 1 × 3 = 6 concatenated sequences