Computational design of soluble and functional membrane protein analogues

AF2_seq design protocol

Design target preparation

The design target structures were sourced from the Protein Data Bank (PDB) and included the following protein folds: IGF (3SD2), BBF (6D0T)³⁰, TBF (5BVL)³⁸, claudin (4P79)⁶¹, rhomboid protease (3ZEB)⁴⁴ and GPCR (6FFI)⁶². Owing to missing residue positions in the TBF, claudin and GPCR X-ray structures, we used AF2 to predict the protein structure using the X-ray structure as a template. Disordered regions in the claudin (residues 34–40) and GPCR (residues 875–896) targets were replaced by three-glycine and five-glycine linkers, respectively. The GPCR sequence was predicted using the experimental structure as a template but without the endolysin domain (residues 679–838) used for crystallization.

Loss function

For computation of error gradients, a composite loss function was used:

$${\rm{loss}}={W}_{{\rm{FAPE}}}{L}_{{\rm{FAPE}}}+{W}_{{\rm{dist}}}{L}_{{\rm{dist}}}+{W}_{{\rm{pLDDT}}}{L}_{{\rm{pLDDT}}}+{W}_{{\rm{pTM}}}{L}_{{\rm{pTM}}}.$$

In this expression, each L denotes the value of a loss term and each W its weight. The frame aligned point error (FAPE) loss quantifies the L2 norm between the predicted C_α atoms and the target structure. The distogram (dist) loss is the cross entropy over the C_β distogram for non-glycine residues and the C_α distance in the case of glycine. The model confidence (pLDDT) loss of the C_α positions is computed by taking 1 − pLDDT, penalizing low confidence. Finally, the pTM score loss is a prediction confidence metric focused on global structural similarity. In this work, the designs were generated using the loss weights W_FAPE = 1.0, W_pLDDT = 0.2 and W_pTM = 0.2. During initial trajectories, W_dist was set to 0.5, whereas it was disabled during trajectory reseeding (soft starts, described below).
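As an illustration only, the weighted sum above can be sketched in a few lines of Python. The names `LossWeights`, `composite_loss`, `INITIAL` and `SOFT_START` are hypothetical and do not come from the af2seq code; the individual loss values are taken as plain numbers that the AF2 networks are assumed to have produced already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Weights of the four loss terms (values as stated in the text)."""
    fape: float = 1.0
    dist: float = 0.5   # initial trajectories; set to 0.0 for soft starts
    plddt: float = 0.2
    ptm: float = 0.2


def composite_loss(l_fape, l_dist, l_plddt, l_ptm, w=LossWeights()):
    """Weighted sum of the FAPE, distogram, pLDDT and pTM loss values."""
    return (w.fape * l_fape + w.dist * l_dist
            + w.plddt * l_plddt + w.ptm * l_ptm)


# Hypothetical presets for the two trajectory types described in the text.
INITIAL = LossWeights()
SOFT_START = LossWeights(dist=0.0)  # distogram loss disabled
```

Calling `composite_loss(l_fape, l_dist, l_plddt, l_ptm, SOFT_START)` drops the distogram contribution while leaving the other three weights unchanged.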

Gradient descent

As previously described in ref. ⁷, amino acid sequences were initialized on the basis of the secondary structure of the target fold. The secondary structure assignments were encoded in sequences, using alanines for helix, valines for β-sheet and glycines for loop residues. This introduces a bias towards the correct local structure, aiding faster convergence of the design trajectories. To diversify the generated designs, 10% of the amino acids were randomly mutated in the initial sequence of each design trajectory. Subsequently, the sequence was passed through the AF2 networks, which generated five structures. These structures were then used to calculate the loss with the previously defined loss function. The error gradient was obtained by backpropagating the errors to the one-hot-encoded input, resulting in a 5 × 20 × N error gradient, where N represents the sequence length. We then took the average of the five matrices to obtain the mean error gradient (20 × N), which was used for gradient descent. A position-specific scoring matrix (PSSM) of 20 × N was updated using the ADAM optimizer⁶³ with the normalized error gradient. Following the update, the PSSM underwent a softmax function that transforms the matrix into a probability distribution of the amino acid identity for each position. The argmax function was subsequently used to determine the most probable amino acid identities per position; these were then used to construct the new input sequence for the next iteration. The cysteine residues in the PSSM were masked, so the designed sequences do not contain any cysteines.
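The initialization and decoding steps of this procedure can be sketched as follows. This is a minimal, hedged illustration rather than the af2seq implementation: the function names, the three-letter secondary structure alphabet (H, E, L), the amino acid ordering and the exclusion of cysteine from the random initial mutations are all assumptions, and the AF2 forward pass, backpropagation and ADAM update are omitted.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 PSSM rows
SS_TO_AA = {"H": "A", "E": "V", "L": "G"}  # helix, beta-sheet, loop


def init_sequence(ss, mutation_rate=0.10, rng=None):
    """Encode secondary structure as A/V/G, then mutate a fraction of sites."""
    rng = rng or random.Random()
    seq = [SS_TO_AA[s] for s in ss]
    n_mut = round(mutation_rate * len(seq))
    allowed = AMINO_ACIDS.replace("C", "")  # assumption: no cysteines here
    for pos in rng.sample(range(len(seq)), n_mut):
        seq[pos] = rng.choice([a for a in allowed if a != seq[pos]])
    return "".join(seq)


def softmax(logits):
    """Numerically stable softmax over one PSSM column."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(pssm):
    """Mask cysteine, apply softmax, and take the argmax at each position.

    pssm is a list of N columns, each holding 20 values ordered as
    AMINO_ACIDS.
    """
    cys = AMINO_ACIDS.index("C")
    seq = []
    for column in pssm:
        masked = list(column)
        masked[cys] = -math.inf  # cysteine gets zero probability
        probs = softmax(masked)
        best = max(range(len(probs)), key=probs.__getitem__)
        seq.append(AMINO_ACIDS[best])
    return "".join(seq)
```

In a full trajectory, `decode` would be called after every optimizer step to build the input sequence for the next iteration.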

Model settings

AF2 was run in single sequence mode using the network configuration of the original AF2 ‘model_5_ptm’ for all five AF2 models with multiple sequence alignments (MSAs) and templates disabled. For the design trajectories, we used zero recycles, meaning that each AF2 network was only executed once. For the claudin-1 and claudin-4 designs, we only used models 1 and 2 with the network configuration of the original AF2 ‘model_1_ptm’ with templates enabled. All design runs were executed on a single Nvidia Tesla V100 (32 GB) GPU.

Computational design protocol

In each AF2-sequence design trajectory, 500 rounds of gradient descent optimization were performed (https://github.com/bene837/af2seq). Not all design trajectories of the claudin, rhomboid protease and GPCR converged. Hence, we sampled sequences from successful trajectories and introduced mutations, while disabling distogram loss. These sequences were then used as starting points for new design trajectories, which we named soft starts, resulting in a higher convergence rate. All generated sequences were then predicted using AF2 with three recycles, followed by relaxation in an AMBER force field⁶⁴,⁶⁵. This resulted in high-quality structures that were used as inputs to ProteinMPNN for sequence generation. The total numbers of designs and designs passing in silico filtering are summarized in Supplementary Table 1. For the design of the claudin-1 and claudin-4 functional analogues, we first predicted their structures using AF2 with MSAs and templates enabled, owing to the lack of high-resolution experimental structures. The predictions were then used as structural templates for both design and reprediction, as the wild-type extracellular region could not be predicted by AF2 in single sequence mode. All sequence and side-chain information was removed from the template to reduce folding bias. We tried several design strategies for the functional claudin design, of which two were successful: (1) redesigning only the transmembrane surface, approximately 40% of the sequence; and (2) redesigning the entire transmembrane region, including the core, approximately 60% of the sequence. The residue positions that were fixed can be found in Supplementary Table 2.

For conformation-specific design of GPCRs, we used the template of the adenosine A2A GPCR in the active conformation bound to mini-G_s (PDB 5G53) and the inactive conformation (PDB 3VGA) to design each state individually. We fixed residues interacting with the G protein and the evolutionarily conserved DRY motif during the design of each state, resulting in designs with identical length and identical functional sites. For the design of the active conformation, we found that it was not possible to generate high-confidence designs without the presence of G protein; hence, gradient descent and prediction were performed in the presence of the mini-G_s binder.

Training of MPNN_sol

The MPNN_sol model was trained on protein assemblies in the PDB (as of 2 August 2021) determined by X-ray crystallography or cryo-EM to a resolution of better than 3.5 Å and with fewer than 10,000 residues. We followed training as described in ref. ¹³, modified only by excluding annotated transmembrane PDB codes. The list of excluded PDB codes and MPNN_sol model weights are available at https://github.com/dauparas/ProteinMPNN/tree/main/soluble_model_weights.

ProteinMPNN sequence redesign

The backbones generated by AF2_seq were used as inputs to ProteinMPNN. For the vanilla ProteinMPNN, we used the provided model weights trained on a dataset with 0.1 Å Gaussian noise¹³. For the biased ProteinMPNN (referred to in the main text as MPNN_bias), we used a modified version of the script ‘submit_example_8.sh’ as provided in the ProteinMPNN GitHub repository mentioned above. We found the best results by giving a positive sampling bias to the polar amino acids and a negative sampling bias to alanine. For MPNN_sol, we generated sequences with two different models that had different levels of noise during training (0.1 Å and 0.2 Å). For all ProteinMPNN models, we generated two sequences per AF2_seq-designed backbone. No Gaussian noise was added to the input backbone, and cysteine residues were masked during the decoding process.

Structural similarity calculations

The C_α atoms of the structures were aligned using the Superimposer from the Biopython package⁶⁶. The r.m.s.d._Cα was calculated as the mean Euclidean distance between predicted and target C_α atom coordinates. The r.m.s.d._fa was calculated by first aligning all atoms with the Superimposer, after which the mean Euclidean distance between atoms was computed. The template modelling scores were determined using TM-align⁶⁷.

Sequence diversity analysis

Sequence recovery was quantified as the number of positions at which the corresponding residue matches the residue in the target fold divided by the total number of residues in the sequence multiplied by 100%. The core residues were defined as residues with less than 20 Å² solvent-accessible surface area (SASA), and surface residues were defined as residues with more than 20 Å² SASA. The e-values were obtained through a protein BLAST search of the NCBI RefSeq database of 1 October 2022 with a maximum hit value of 1,000.
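The recovery metric can be illustrated with a short sketch. The function name and the optional `positions` argument (for restricting the calculation to a residue subset such as the core or the surface) are hypothetical, and the two sequences are assumed to be the same length and already in one-to-one residue correspondence.

```python
def sequence_recovery(design, target, positions=None):
    """Percentage of positions where the design matches the target residue.

    positions: optional iterable of 0-based indices to restrict the
    calculation to a subset of residues; all positions are used if None.
    """
    if len(design) != len(target):
        raise ValueError("design and target must have the same length")
    idx = range(len(target)) if positions is None else list(positions)
    if len(idx) == 0:
        raise ValueError("no positions to compare")
    matches = sum(design[i] == target[i] for i in idx)
    return 100.0 * matches / len(idx)
```

For example, a four-residue design that differs from its target at one position has a recovery of 75%.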

Surface hydrophobicity calculations

The fraction of surface hydrophobics was calculated using Rosetta³. First, all surface residues were identified using the layer selector; these were defined as residues with SASA > 40 Å². Of these surface residues, we counted the number of apolar amino acids (defined as ‘GPAVILMFYW’) and divided it by the total number of surface residues.
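A minimal sketch of this calculation is given below. It does not call Rosetta: per-residue SASA values are assumed to have been computed beforehand and are passed in as a list, and the function name is hypothetical.

```python
APOLAR = set("GPAVILMFYW")  # apolar set as defined in the text


def surface_hydrophobic_fraction(sequence, sasa, cutoff=40.0):
    """Fraction of surface residues (SASA > cutoff, in A^2) that are apolar."""
    if len(sequence) != len(sasa):
        raise ValueError("sequence and sasa must have the same length")
    surface = [aa for aa, area in zip(sequence, sasa) if area > cutoff]
    if not surface:
        return 0.0  # no surface residues under this cutoff
    return sum(aa in APOLAR for aa in surface) / len(surface)
```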

Design filtering and selection

All generated sequences were predicted with AF2 using three recycles and a relaxation step in an AMBER force field. Next, the sequences were filtered using the following criteria: (1) TM score > 0.80 for all designs except the rhomboid protease (the rhomboid protease yielded slightly lower TM scores in the design trajectory; hence, we chose a cut-off value of 0.75 instead); (2) pLDDT > 80 for all designs except the rhomboid protease (pLDDT > 75); and (3) an e-value > 0.1 for sequence novelty. Success rates are listed in Supplementary Table 1.
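The three filtering criteria can be expressed as a simple predicate. The sketch below is illustrative only: the `Design` record, the fold label `"rhomboid"` and the function name are assumptions, and the TM score, pLDDT and e-value are taken as precomputed inputs.

```python
from dataclasses import dataclass


@dataclass
class Design:
    name: str
    fold: str        # hypothetical fold label, e.g. "rhomboid" or "GPCR"
    tm_score: float  # TM score against the design target
    plddt: float     # AF2 confidence on a 0-100 scale
    e_value: float   # best BLAST e-value of the designed sequence


# (minimum TM score, minimum pLDDT) per fold, as stated in the text.
THRESHOLDS = {"default": (0.80, 80.0), "rhomboid": (0.75, 75.0)}


def passes_filters(design, e_value_min=0.1):
    """Return True if the design exceeds all three cut-offs."""
    tm_min, plddt_min = THRESHOLDS.get(design.fold, THRESHOLDS["default"])
    return (design.tm_score > tm_min
            and design.plddt > plddt_min
            and design.e_value > e_value_min)
```

A list of candidates can then be reduced with `[d for d in designs if passes_filters(d)]`.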

Structural fold similarity search

The fold similarity search was performed using FoldSeek⁶⁸ on the SCOP database¹⁷ (downloaded March 2023). For each of the design target folds, an exhaustive search on the basis of TM score alignment was performed. The SCOP database contains globular and membrane domain annotations, which were used for the hit classification.

Fold complexity calculations

Relative contact order was calculated at the secondary structure level by computing the residue distance in the sequence between secondary structures for all pairs within 8 Å of each other and then averaging these distances for all contacts that were more than four residues apart. To ensure consistency in secondary structure annotations across all structures, we used DSSP for the determination of secondary structural elements¹⁷. The de novo protein dataset comprised 70 helical proteins, six β-sheet proteins and 42 proteins containing both α-helices and β-sheets³⁴,⁶⁹. The natural protein dataset consisted of 1,000 proteins randomly selected from the entire collection of proteins in the CATH dataset (v.4.3)⁷⁰.

Transplantation of natural epitopes on to soluble scaffolds

Compatible epitopes were identified by means of a Foldseek search of the PDB, using soluble scaffolds as queries. Hits with TM scores above 0.7 and high structural similarity around the desired epitope were superimposed using structure visualization software, such as PyMOL or ChimeraX. Varying lengths of the epitope were selected for transplantation, encompassing either only interaction sites, entire loops or overlapping parts of the supporting secondary structures. The sequence of the overlaid epitope was then pasted into the overlapping region of interest in the soluble scaffold. The resulting chimeric sequences were predicted using AF2 in single sequence mode. Structures with high pLDDT (greater than 90) and high TM scores relative to the starting scaffold were manually inspected to verify the placement of the epitope. Finally, a subset of constructs in different soluble scaffolds were selected for experimental testing.

SPR binding assay

SPR measurements were carried out on a Biacore 8K system (Cytiva) in HBS-EP+ buffer (10 mM HEPES pH 7.4, 150 mM NaCl, 3 mM EDTA, 0.005% (v/v) Surfactant P20 Cytiva). The antibody (5 µg ml⁻¹) was immobilized on a CM5 sensor chip (Cytiva) by amide coupling in 10 mM NaOAc pH 4.5 (250 s, 10 µl min⁻¹; 700–1500 response units immobilized). Purified mini-G_s was immobilized with a contact time of 200 s (300 response units immobilized). Binding assays were carried out at a flow rate of 30 µl min⁻¹. Designed chimeras were injected as serial dilutions ranging from 18 µM to 0.1 nM, and 0 nM for 120 s, followed by dissociation for 400 s. Immobilized antibody was regenerated between cycles in 10 mM glycine-HCl pH 2.5 (30 s, 30 µl min⁻¹). GPCRs designed in the active or inactive state were injected at 0, 5, 15 and 25 µM for 90 s, followed by dissociation for 120 s. Immobilized mini-G_s ligand was not regenerated between cycles. Binding curves were fitted with a 1:1 Langmuir binding model in the Biacore 8K analysis software. Steady-state response units were plotted against analyte concentration, and a sigmoid function was fitted to the experimental data in Python 3.9 to derive the K_d.
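The text does not specify the functional form of the sigmoid or the fitting routine used. As one possible illustration, the sketch below assumes the steady-state 1:1 isotherm R = R_max · C / (K_d + C), which is sigmoidal on a logarithmic concentration axis, and replaces a library optimizer with a plain grid search over K_d in which R_max is solved in closed form at each grid point. The function name, grid bounds and grid size are hypothetical.

```python
def fit_kd(conc, resp, log10_min=-10.0, log10_max=-3.0, n_grid=2001):
    """Least-squares fit of R = Rmax * C / (Kd + C) to steady-state data.

    conc: analyte concentrations (M); resp: steady-state responses (RU).
    Kd is scanned on a logarithmic grid; for each Kd the optimal Rmax is
    the linear least-squares solution. Returns (Kd, Rmax).
    """
    best = None
    for k in range(n_grid):
        exponent = log10_min + (log10_max - log10_min) * k / (n_grid - 1)
        kd = 10.0 ** exponent
        occupancy = [c / (kd + c) for c in conc]
        denom = sum(x * x for x in occupancy)
        if denom == 0.0:
            continue  # all concentrations are zero; nothing to fit
        rmax = sum(r * x for r, x in zip(resp, occupancy)) / denom
        sse = sum((r - rmax * x) ** 2 for r, x in zip(resp, occupancy))
        if best is None or sse < best[0]:
            best = (sse, kd, rmax)
    if best is None:
        raise ValueError("at least one non-zero concentration is required")
    return best[1], best[2]
```

The resolution of the estimate is limited by the grid spacing (0.0035 log units with the defaults above); a nonlinear least-squares routine would normally be preferred for real data.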

Bio-layer interferometry

For BLI studies of claudins, synthetic claudin-His and tagless CpE in 20 mM Tris pH 7.4, 100 mM NaCl, and 5% glycerol were used. BLI was performed at 25 °C in 96-well black flat-bottomed plates (Greiner) using an acquisition rate of 5 Hz averaged by 20 using an Octet R8 System (FortéBio/Sartorius), with assays designed and set up using Blitz Pro 1.3 software. Binding experiments consisted of the following steps: sensor equilibration (30 s), loading (300 s), baseline (180 s), and association and dissociation (120–300 s each). Experiments were conducted by immobilizing 1.5–3.0 µM of synthetic claudin-His on NiNTA (Dip and Read) sensors and quantifying their binding to 0.05–5.00 µM CpE. Association and dissociation times for the two claudin-1 designs were performed for 120 s, as they exhibited rapid on and off rates, whereas for the claudin-4 design, these times were extended to 300 s to capture the slower off rates. Data were fitted to a 1:1 binding model using Octet Analysis Studio (Sartorius), which generated the K_d from the association and dissociation rate constants. At the protein concentrations used, no significant non-specific binding of CpE to NiNTA sensors was detected.

Protein crystallization and structure determination

The TBF_24 design was crystallized using sitting drop vapour diffusion at 4 °C in 0.1 M Na₃ citrate pH 4.0, 1 M LiCl, and 20% PEG 6000 buffer. The CLF_4 design was crystallized using sitting drop vapour diffusion at 4 °C in 0.1 M Na₃ citrate pH 5.0, 0.1 M Na/K phosphate pH 5.5, 0.1 M RbCl, and 25% v/v PEG smear medium (BCS Screen, Molecular Dimensions). The RPF_9 design was crystallized using sitting drop vapour diffusion at 4 °C in 0.1 M HEPES pH 7.8, 0.15 M Na₃ citrate dihydrate, and 25% v/v PEG smear low (BCS Screen, Molecular Dimensions). The GLF_18 design was crystallized using sitting drop vapour diffusion at 4 °C in Na phosphate-citrate pH 4.2, 0.2 M Li₂SO₄, and 20% PEG 1000 buffer. The GLF_32 design was crystallized using sitting drop vapour diffusion at 4 °C in 0.1 M Na acetate pH 5.5, 0.2 M KBr, and 25% PEG MME 2000 buffer. Crystals were cryoprotected in 20% glycerol and flash-cooled in liquid nitrogen. Diffraction data were collected at the beamline PXI (X06SA) of the Swiss Light Source (Paul Scherrer Institute, Villigen, Switzerland) and the MASSIF-1 beamline of the European Synchrotron Radiation Facility (Grenoble, France) at a temperature of 100 K. Data were processed using the autoPROC package⁷¹. Phases were obtained by molecular replacement using Phaser⁷². Atomic model refinement was completed using COOT⁷³ and Phenix.refine⁷². The quality of refined models was assessed using MolProbity⁷⁴. Structural figures were generated using PyMOL (Schrödinger, LLC; https://www.pymol.org/) and ChimeraX⁷⁵. Data collection and refinement statistics are listed in Extended Data Table 1.

Cryo-EM structure determination of CLN4_20 in complex with cCpE

Expression and purification of cCpE, COP-2 Fab and the anti-Fab nanobody were performed as described previously⁷⁶. Concentrated CLN4_20 was complexed with cCpE followed by COP-2 in a 1:1.2:1 molar ratio. Next, the anti-Fab nanobody was added at a 1.3-fold molar excess over COP-2. The mixture was incubated on ice for 30 min, concentrated and subjected to SEC using a Superdex 200 increase 10/300 GL column (GE Healthcare) in 20 mM HEPES pH 8.0, 150 mM NaCl. The purified complex was concentrated to 5 mg ml⁻¹.

UltraAuFoil 1.2/1.3 grids (Quantifoil) were glow discharged for 30 s at 15 mA and vitrified using a Leica GP2 instrument (Leica microsystems). Then, 3.5 µl of the complex was applied to grids and blotted for 3 s at 4 °C under 100% humidity, before being plunge frozen into liquid ethane. Grid screening and data collection were performed on a 200 kV Glacios 2 Cryo-TEM (ThermoFisher Scientific) with a Falcon 4i direct electron detector at Hauptman-Woodward Medical Research Institute. A total of 1,159 videos were collected at a physical pixel size of 0.884 Å, with an electron dose of 49.4 e/Å² fractioned over 93 frames.

Videos were processed, patch motion corrected and patch CTF estimated in cryoSPARC. Blob picking generated a suitable template for an initial three-dimensional volume; this was used to produce two-dimensional projections for template picking, followed by two-dimensional classification, ab initio reconstruction and three-dimensional refinement, resulting in a cryo-EM density resolved to a resolution of 4.1 Å. Structural coordinates for the complex of CLN4_20, cCpE and COP-2 Fab from PDB ID 7TDM⁷⁶ were rigid body docked. The nanobody from PDB 8U4V was docked on to the L chain of COP-2. Each protein chain was then real-space refined in Coot. Final model refinement was conducted with Namdinator⁷⁷, followed by real-space refinement using Phenix phenix.real_space_refine⁷². Extended Data Table 2 shows data collection and refinement statistics for the CLN4_20/cCpE/COP-2/Nb structure.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
