Expression and purification of proteins
Plasmids for analyte proteins were constructed using gBlocks (Integrated DNA Technologies) inserted into the pET-49b(+) plasmid (Novagen), with a dihydrofolate reductase domain, a polyhistidine tag and a TEV cleavage site upstream of the sequence encoding an analyte protein. The NEBuilder HiFi DNA assembly and Q5 site-directed mutagenesis kits (New England Biolabs) were used for plasmid construction. Cloning was done using NEB 5-α-competent Escherichia coli cells. Plasmid sequences were verified by Sanger sequencing through Genewiz. Protein expression was induced overnight at 30 °C with BL21 (DE3) E. coli cells in Overnight Express Instant TB medium (Novagen). Proteins were purified by immobilized metal affinity chromatography (IMAC) with TALON metal affinity cobalt resin and its associated buffer set (Takara), following the manufacturer’s instructions. Proteins were cleaved with TEV protease (New England Biolabs) and further purified by reverse IMAC. Purified proteins were concentrated using ultracentrifugal filters with a 10 kDa cutoff (Amicon) and stored in the short term at 4 °C or in the long term at −80 °C until use.
A covalently linked hexamer of an N-terminal truncated ClpX variant (ClpX-ΔN6)60 was prepared using the BLR E. coli strain as described previously43. In brief, cells were grown to an optical density at 600 nm (OD600) of around 0.6 in LB medium and then incubated in the presence of 0.5 mM isopropyl β-D-1-thiogalactopyranoside (IPTG) at 23 °C for about 3 h to induce ClpX expression. ClpX was purified by IMAC and anion-exchange chromatography. Purified ClpX was stored at −80 °C in small aliquots until use. ClpP expression was induced at an OD600 of around 0.6 with 0.5 mM IPTG at 30 °C for about 3 h43. ClpP was purified by IMAC and stored at −80 °C until use.
PTM assays
For asparagine deamidation, protein (around 1 mg ml⁻¹) was incubated overnight in 100 mM sodium bicarbonate buffer (pH 9.6) at 25 °C to catalyse deamidation. For protein phosphorylation with kinase, protein was incubated with either 50,000 units per ml PKA (New England Biolabs) or 10,000 units per ml CKII (New England Biolabs) in a protein kinase buffer (10 mM MgCl2, 0.1 mM EDTA, 2 mM DTT, 0.01% Brij 35, 260 µM ATP and 50 mM Tris-HCl, pH 7.5) at 30 °C. The protein solution was used for nanopore analysis immediately after the incubation without purification.
MinION experiments
All the experiments were done on the MinION platform using R9.4.1 flow cells. Run conditions were set with a custom MinKNOW script (available from Oxford Nanopore Technologies) at a temperature of 30 °C and a constant voltage of −140 mV with a 3 kHz sampling frequency, except for initial proteins P1–P4, for which runs were performed at a constant voltage of −180 mV with a 10 kHz sampling frequency. Using the priming port, flow cells were first washed with 1 ml cis running buffer (200 mM KCl, 5 mM MgCl2, 10% glycerol and 25 mM HEPES–KOH, pH 7.6) and then loaded with 200 μl protein analyte in cis running buffer at a final concentration of 500 nM, unless otherwise specified. Following the observation of protein captures in the pores, flow cells were washed with 1 ml cis running buffer to remove uncaptured proteins and subsequently loaded with 75 μl cis running buffer supplemented with 4 mM ATP and 200 nM ClpX-ΔN6 unless otherwise specified. The flow cell was washed about 4 min after analyte loading in the initial method, and around 6 min and 2 min after analyte loading at concentrations of 5 nM and 500 nM, respectively, in the optimized method (Extended Data Fig. 10a). For MinION runs in the high-salt condition (Extended Data Fig. 6b), a buffer containing 400 mM KCl, 5 mM MgCl2 and 25 mM HEPES–KOH (pH 7.6) was used instead of standard cis running buffer to see if it would improve the signal-to-noise ratio.
Bulk degradation assays
The time-course degradation assay of the PASTOR-HDKER protein was performed in cis running buffer with 6 μM PASTOR-HDKER, 150 nM ClpX-ΔN6, 300 nM ClpP14 and an ATP-regeneration mix (4 mM ATP, 16 mM creatine phosphate and 7 units per ml creatine phosphokinase) at 30 °C. Incubation was stopped by denaturing samples in Laemmli buffer at 95 °C for 5 min. Samples were run on SDS–PAGE and stained with Coomassie blue to quantify the protein bands using the ImageJ software.
Nanopore signal analysis
Preprocessing
To help identify ClpX-mediated protein translocations, we applied detection thresholds to ionic current blockades that preceded a return to the open-channel state. The thresholds were set on statistical parameters indicative of translocation: the standard deviation, the median value, the standard deviation of the mean of windows, and the ratio of values relative to the open-pore value. This analysis was used to assist the process of manually checking traces for translocations, and translocations with particularly high noise or disruptions were discarded. PASTOR proteins were auto-segmented as described below, with the exception of those containing folded domains and PASTOR-rereads, which were segmented manually. PASTOR-reread rereads with a complete Y2–Y3–Y4–Y5–Y2 signal were assumed to be full-length reads with a back-slipping distance of 310 amino acids. Partial rereads missing the signal(s) of the C-terminal Y2, Y3, Y4 and Y5 were assigned back-slipping distances of 250, 188, 125 and 61 amino acids, respectively. All figures with raw traces (those shown in pA) had a low-pass Bessel filter applied using SciPy with N = 10 and Wn = 0.025, except for those showing stepping analysis (Figs. 2c and 6c, Extended Data Fig. 3 and Supplementary Figs. 5 and 6), which had Wn = 0.7. Before use in data analysis, traces were smoothed by applying a low-pass Bessel filter with N = 10 and Wn = 0.03 with SciPy, and by applying average downsampling by a factor of 50 for proteins P1–P4, 20 for the 8 PASTORs and 10 for the other proteins. Then, to scale, the segment was split into tenths, and the median of the minima of each tenth and the median of the maxima of each tenth were used as the min and max, respectively, to perform min–max scaling (Extended Data Fig. 2b). For PASTOR-phos, the signals were iteratively scaled. We first used this approach, then DTW-aligned traces to two canonical presegmented traces and selected the alignment with the lowest DTW distance. 
The max value of the N-terminal VR was multiplied by 1.4, and the max value of VR GLSARRL was multiplied by 1.2, and the minimal max was used as the max value for min–max scaling. This was repeated after realigning to the canonical traces and segmenting the VRs. Unless otherwise specified, ‘normalized’ refers to z-score normalization, as in ‘normalized current’ when comparing a model signal with experimental signals.
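The filtering, downsampling and scaling steps described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, and the use of a zero-phase (forward-backward) filter in second-order sections is an assumption, because the text specifies only the SciPy Bessel filter order (N = 10) and cutoff (Wn).

```python
import numpy as np
from scipy import signal


def lowpass_bessel(trace, order=10, wn=0.03):
    """Low-pass Bessel filter; wn is the cutoff as a fraction of Nyquist.

    ASSUMPTION: zero-phase filtering with sosfiltfilt. The text gives
    only N and Wn, not how the filter was applied.
    """
    sos = signal.bessel(order, wn, btype="low", output="sos")
    return signal.sosfiltfilt(sos, np.asarray(trace, dtype=float))


def average_downsample(trace, factor):
    """Average consecutive blocks of `factor` samples (remainder dropped)."""
    trace = np.asarray(trace, dtype=float)
    n = (len(trace) // factor) * factor
    return trace[:n].reshape(-1, factor).mean(axis=1)


def robust_min_max_scale(segment, n_parts=10):
    """Min-max scale with the medians of per-tenth minima and maxima.

    The segment must contain at least `n_parts` samples.
    """
    segment = np.asarray(segment, dtype=float)
    parts = np.array_split(segment, n_parts)
    lo = np.median([p.min() for p in parts])
    hi = np.median([p.max() for p in parts])
    return (segment - lo) / (hi - lo)
```

Taking the median over tenths rather than the global extremes makes the scaling less sensitive to a single noise spike, at the cost that scaled values can fall slightly outside the 0 to 1 range.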
Signal alignment
To align signals, we used DTW61 and normalized the DTW distances by dividing by the sum of the lengths of the two signals. To describe the similarity of a set of traces, we computed the DTW distance between all pairs of traces. For t-distributed stochastic neighbor embedding (t-SNE) plots, we then embedded each trace using the vector of its DTW distances to all other traces. To create ensemble traces, we first identified the trace with the lowest mean DTW distance to all other traces and stretched it to create Tmedoid = [t1, t2, …, tn], where n is the mean length of all traces. We then DTW-aligned every other trace to Tmedoid and created Tconsensus = [median(alignments to t1), median(alignments to t2), …, median(alignments to tn)]. Ensemble traces in Fig. 1c, Fig. 5b and Extended Data Fig. 9d show all traces aligned to the Tconsensus, but do not plot Tconsensus.
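As an illustration of the length-normalized DTW distance and the medoid selection described above, a minimal NumPy sketch is given here. The function names are hypothetical, and the absolute difference used as the local cost is an assumption, because the text does not state which point-wise cost was used.

```python
import numpy as np


def dtw_distance(a, b):
    """DTW distance divided by len(a) + len(b).

    ASSUMPTION: the local cost is the absolute difference of samples.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return cost[n, m] / (n + m)


def pairwise_dtw(traces):
    """Symmetric matrix of normalized DTW distances between all traces."""
    k = len(traces)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw_distance(traces[i], traces[j])
    return dist


def medoid_index(dist):
    """Index of the trace with the lowest mean distance to the others."""
    return int(np.argmin(dist.mean(axis=1)))
```

Each row of the matrix returned by `pairwise_dtw` is the distance vector that would be passed to an embedding such as t-SNE, and `medoid_index` picks the trace that would be stretched into Tmedoid.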
Protein sequence-to-signal model
To describe the amino acids, we used their volumes62 and their charges at pH 7.6, at which the histidine residue is assumed to be neutral. The volume of phosphoserine was estimated as 126.6 cm³ mol⁻¹, on the basis of a linear regression of molecular weight to volume of the other residues. The model signal, S = [S1, S2, …, Sn−19], of amino acid sequence [aa1, aa2, …, aan] is calculated by computing the signal for each of the n − 19 windows of width 20 (Extended Data Fig. 5a–d). The vector Xi describes the window starting at index i in the sequence. The j-th index in Xi is 1 + Vc × volume(aai+j) + Pc × PositiveCharge(aai+j) + Nc × NegativeCharge(aai+j), for 0 ≤ j < 20, where the functions PositiveCharge and NegativeCharge take 1 if the residue has a positive or negative charge, respectively, and 0 otherwise. The constants representing weights between charge and volume, Vc = −3.9 × 10⁻³, Nc = 4.08 × 10⁻¹ and Pc = −8.16 × 10⁻², were determined empirically to minimize the average post-DTW distance of a training subset of protein traces to the model of their sequences. To weight the values in Xi, we use a vector PW (parabolic weight) of length 20 containing values representing a negative, centrally positioned parabolic curve. The i-th index in S is then finally computed as the dot product of Xi and PW.
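The windowed model above can be sketched in a few lines. Several details here are assumptions rather than the published implementation: the exact shape and normalization of PW are not given in the text, so `parabolic_weights` uses one plausible downward-opening parabola that sums to 1; lysine and arginine are treated as positive and aspartate and glutamate as negative; and the residue volumes must be supplied by the caller from the cited volume table, because no values are reproduced here.

```python
import numpy as np

# Weights quoted in the text.
VC, NC, PC = -3.9e-3, 4.08e-1, -8.16e-2
WINDOW = 20


def parabolic_weights(width=WINDOW):
    """ASSUMPTION: downward-opening parabola centred on the window,
    positive everywhere and normalized to sum to 1."""
    j = np.arange(width)
    centre = (width - 1) / 2
    w = 1.0 - ((j - centre) / (centre + 1)) ** 2
    return w / w.sum()


def model_signal(sequence, volumes, positive="KR", negative="DE"):
    """Return S, one value per window of 20 residues (n - 19 values).

    `volumes` maps one-letter residue codes to volumes (caller-supplied).
    ASSUMPTION: K/R are positive and D/E negative at pH 7.6.
    """
    x = np.array([
        1 + VC * volumes[aa] + PC * (aa in positive) + NC * (aa in negative)
        for aa in sequence
    ])
    pw = parabolic_weights()
    n_windows = len(sequence) - WINDOW + 1
    return np.array([x[i:i + WINDOW] @ pw for i in range(n_windows)])
```

With these signs, a larger residue lowers the modelled current, a negative charge raises it and a positive charge lowers it slightly, which matches the direction of the quoted constants.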
ClpX step identification
For this analysis, the signals were not scaled or downsampled; they were filtered with a low-pass Bessel filter with N = 10 and Wn = 0.7. YY dips were extracted manually, including portions of the signal that would otherwise be considered part of the VR in this study, to best capture the entire portion for which the double tyrosines contribute to the signal. The number of residues per YY dip was calculated as pw/d, where p is the mean proportion of the total translocation dwell time spent in these regions (0.318; Extended Data Fig. 3a), w is the total number of reading windows in the sequence (359; Extended Data Fig. 1) and d is the number of YY dips per read (6). We primarily used a Bayesian-based algorithm63 to identify steps, unless otherwise noted. When applying this algorithm, a minimum length of 10 observations and a threshold of 18 were used. A total of 776 YY-dip regions were analysed, comprising 45% of all the YY dips in the dataset, omitting dips affected by potential backstepping (non-monotonic steps) or excessive noise. This selection was made by excluding YY dips in which the means of the segmented steps did not decrease monotonically to the minimum and then increase monotonically. A secondary t-test-based algorithm64, which was used in a different study of ClpX stepping behaviour65, was also applied to confirm the stepping-rate results. When using the t-test-based algorithm, a minimum window length of 10 observations and a threshold P-value of 5 × 10⁻⁵ were used, and a total of 456 dips were analysed.
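Two of the calculations above are simple enough to write out: the residues-per-dip estimate pw/d and the monotonicity filter used to exclude dips with possible backstepping. The sketch below uses hypothetical function names and assumes strict monotonicity, which the text does not specify.

```python
import numpy as np


def residues_per_dip(p, w, d):
    """Number of residues per YY dip, computed as p * w / d."""
    return p * w / d


def is_clean_dip(step_means):
    """True if the step means fall to the minimum and then rise.

    ASSUMPTION: both the descent and the ascent are strictly monotonic.
    """
    m = np.asarray(step_means, dtype=float)
    k = int(np.argmin(m))
    falling = np.all(np.diff(m[:k + 1]) < 0)
    rising = np.all(np.diff(m[k:]) > 0)
    return bool(falling and rising)


# Values quoted in the text: p = 0.318, w = 359, d = 6.
print(round(residues_per_dip(0.318, 359, 6), 2))  # 19.03
```

With the quoted values, each YY dip therefore spans roughly 19 residues of translocation.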
YY segmentation
To identify the YY dips and VRs, a single PASTOR trace was segmented manually into each coloured section in Fig. 2a, and the remainder of the traces were aligned to it with DTW. The corresponding regions were assigned the label from the one manually segmented trace (Supplementary Fig. 4). For PASTOR-phos, two canonical traces were segmented manually, and the rest of the traces were aligned to both, and then labels were assigned according to the canonical trace with the lowest DTW distance.
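The label transfer described above can be illustrated by backtracking the DTW warping path from a manually segmented reference trace to a new trace. This is a sketch with hypothetical function names; the absolute-difference local cost and the tie-breaking rule (when several reference samples map to one query sample, the last one on the path wins) are assumptions.

```python
import numpy as np


def dtw_path(ref, query):
    """Optimal DTW warping path as (ref_index, query_index) pairs."""
    ref, query = np.asarray(ref, dtype=float), np.asarray(query, dtype=float)
    n, m = len(ref), len(query)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - query[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end of both signals to their start.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (cost[i - 1, j - 1], i - 1, j - 1),
            (cost[i - 1, j], i - 1, j),
            (cost[i, j - 1], i, j - 1),
        ]
        _, i, j = min(moves, key=lambda t: t[0])
    return path[::-1]


def transfer_labels(ref, ref_labels, query):
    """Give each query sample the label of its aligned reference sample."""
    labels = [None] * len(query)
    for ri, qi in dtw_path(ref, query):
        labels[qi] = ref_labels[ri]  # later path entries overwrite earlier ones
    return labels
```

For PASTOR-phos, the same transfer would be run against both canonical traces, keeping the labels from whichever reference gives the lower DTW distance.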
VR classification
We used scikit-learn to develop and test classical machine learning models and PyTorch to develop and test convolutional neural-network models. The test set was composed of all current traces from a given set of experiments to create an out-of-sample test set. The set of test experiments was selected using linear programming (Python package PuLP) to ensure at least 12 VRs with each amino acid in the test set while minimizing the test set size. We decided to use 12 because it gave the closest to an 80–20 train–test split: 79.6% of the VRs were in the training set and 20.4% were in the testing set (full counts are shown in Extended Data Table 1a). In classification tasks for which only VRs corresponding to a subset of amino acids were used, the test set was composed of a subset of this test set. We performed hyperparameter tuning with scikit-optimize on the training set using 5-fold cross-validation. The optimal parameters were: n_estimators = 250, min_samples_leaf = 2, max_features = ‘log2’, max_depth = 20, ccp_alpha = 0.0001, class_weight = ‘balanced_subsample’ and criterion = ‘gini’. All the results in Fig. 3b,c, Extended Data Fig. 6, Extended Data Table 2 and Supplementary Fig. 9 are from models evaluated on the test set. All the VRs containing an asparagine with a maximum transformed value above 1.3 had their labels changed to aspartate. In training all classical models, we upsampled minority classes, such that there was an equal representation of all classes in the training set. When training the convolutional neural network (CNN) in Extended Data Fig. 6c, we weighted the loss inversely proportional to each label’s class representation in the training set. To featurize each VR, we performed principal component analysis on the vector of its DTW distances to all VRs in the training set, reducing the vector to 64 dimensions. 
We also used the median, max, middle, mean, dip, mean absolute value of the derivative and median absolute value of the derivative of the transformed signals, as well as the standard deviation of the raw (unfiltered, unscaled) signal. The CNN had the transformed signal as input. It was trained with a stochastic gradient descent optimizer with a learning rate of 0.01, had four convolutional layers followed by a gated recurrent unit (GRU) and then a fully connected layer, and was initialized with Kaiming initialization. Max pooling and a ReLU activation function were applied after each convolutional layer. The dummy classifier was implemented with the scikit-learn dummy classifier with default parameters.
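A minimal sketch of the class balancing and the tuned classifier is shown here. It assumes that the tuned model is scikit-learn's `RandomForestClassifier`, which the listed parameter names suggest but the text does not state at this point, and it assumes that upsampling keeps every original sample and adds random duplicates of minority-class samples; the helper names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample


def upsample_minority_classes(X, y, seed=0):
    """Add resampled copies so that every class matches the largest one.

    ASSUMPTION: originals are kept and extras are drawn with replacement.
    """
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    xs, ys = [X], [y]
    for i, (c, count) in enumerate(zip(classes, counts)):
        if count < target:
            extra = resample(X[y == c], replace=True,
                             n_samples=int(target - count),
                             random_state=seed + i)
            xs.append(extra)
            ys.append(np.full(len(extra), c))
    return np.vstack(xs), np.concatenate(ys)


def build_forest(seed=0):
    """Random forest with the hyperparameters quoted in the text.

    ASSUMPTION: the model class; `random_state` is added for repeatability.
    """
    return RandomForestClassifier(
        n_estimators=250, min_samples_leaf=2, max_features="log2",
        max_depth=20, ccp_alpha=0.0001, class_weight="balanced_subsample",
        criterion="gini", random_state=seed,
    )
```

In use, the feature matrix (PCA-reduced DTW distances plus the summary statistics listed above) would be balanced with `upsample_minority_classes` on the training set only, then passed to `build_forest().fit(...)`.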
Reread simulation
To collect the results shown in Extended Data Fig. 7d,e, we used a random forest without hyperparameter tuning and used 100 randomly selected 80–20 train–test splits. This was necessary to estimate the accuracy well enough with a large number of rereads, given the data limitation and the need to group samples in the test set.
Barcode error correction
To calculate the accuracy of barcode identification when using linear error-correcting codes, we started with our accuracy, pVR, of identifying a VR given an alphabet size, a, of 2, 4, 8 or 16. For a given a and number of VRs, L, we calculated the number of bits, n = L × log2(a), that could be encoded in a protein. We simulated the accuracy with error correction, p′, when n − k of the bits were allocated to linear error-correcting codes, for all integers k = 1 to n. We did this by conducting 50,000 trials of: first, encoding a random integer from 0 to 2^k with a generator matrix into a message of n bits; second, randomly and independently, with probability 1 − pVR, changing each of the n/log2(a) consecutive sets of log2(a) bits in the encoded message (to a different set of bits of the same length) to simulate misclassifying one VR; and third, decoding the number with syndrome decoding. We calculated p′ to be the percentage of trials in which the decoded number was the same as the original random number.
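The simulation loop can be illustrated for one concrete case. The text does not name the specific linear codes used, so this sketch substitutes the Hamming(7,4) code (n = 7, k = 4) with a binary alphabet (a = 2, so each VR carries one bit), and treats pVR as the per-VR accuracy, so that a VR is misread with probability 1 − pVR. All names are hypothetical.

```python
import itertools
import numpy as np

# Systematic Hamming(7,4): G = [I | P], H = [P^T | I], so G @ H.T = 0 (mod 2).
P = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]])
G = np.hstack([np.eye(4, dtype=int), P])
H = np.hstack([P.T, np.eye(3, dtype=int)])


def syndrome_table(H):
    """Map each syndrome to its lowest-weight error pattern."""
    n = H.shape[1]
    table = {}
    for weight in range(n + 1):
        for idx in itertools.combinations(range(n), weight):
            e = np.zeros(n, dtype=int)
            for i in idx:
                e[i] = 1
            table.setdefault(tuple(H @ e % 2), e)
    return table


def decode(word, H, table, k):
    """Syndrome decoding: remove the most likely error, keep message bits."""
    error = table[tuple(H @ word % 2)]
    return ((word + error) % 2)[:k]


def simulate_accuracy(p_vr, trials=50_000, seed=0):
    """Fraction of trials in which the decoded message equals the original."""
    rng = np.random.default_rng(seed)
    k, n = G.shape
    table = syndrome_table(H)
    hits = 0
    for _ in range(trials):
        msg = rng.integers(0, 2, size=k)
        sent = msg @ G % 2
        misread = rng.random(n) >= p_vr  # each VR is wrong with prob. 1 - p_vr
        received = (sent + misread) % 2
        hits += np.array_equal(decode(received, H, table, k), msg)
    return hits / trials
```

For larger alphabets, the same loop would flip whole groups of log2(a) bits at once, and the generator and parity-check matrices would be replaced by those of whichever (n, k) code is being evaluated.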
Phosphorylation detection
Each section (C-terminal linker, VR V, VR GLSARRL, VR A and N-terminal linker) was extracted with YY segmentation. For each section, the transformed current was aligned to the model of all possible phosphorylation states, shown in Supplementary Fig. 12. We determined the number of phosphorylations in each section by the number of phosphorylations in the best-matching (lowest DTW distance) phosphorylation-state model (Supplementary Table 2) to the actual trace. When describing the signal increase in VR GLSARRL caused by PKA (Extended Data Fig. 8a), only the portion of the section up to the (n/3)-th index, where n is the length of the YY-segmented VR GLSARRL, was used because that is where PKA causes the signal to increase, as seen in Fig. 6b.
Null-hypothesis tests
All PERMANOVA tests were done on the DTW distance matrix of signals using scikit-bio and 10⁶ permutations, unless we used a Bonferroni correction, in which case n × 10⁶ permutations were used, where n is the number of comparisons performed. Kruskal–Wallis tests, t-tests and Mann–Whitney U tests were performed using SciPy. Reported P values were multiplied by n if we noted that we used a Bonferroni correction. All tests were two-sided unless stated otherwise, and P values were considered significant if P < 0.05.
Materials availability
Protein expression plasmids are available at Addgene.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.