The ultra-high affinity transport proteins of ubiquitous marine bacteria

Identification of SBP genes

Nineteen candidate SBP genes in the genome of Ca. P. ubique strain HTCC1062 were identified through a search of the TransportDB 2.0 database⁵⁹ (http://membranetransport.org; accessed 22 January 2020). One of these genes, SAR11_0371, was annotated as a ‘possible transmembrane receptor’ in UniProt and showed a non-canonical predicted domain structure consisting of a short SBP-like domain (170 amino acids) followed by a coiled coil domain and unidentified C-terminal domain. Additionally, genome context analysis showed that, unlike the other ABC SBP genes in Ca. P. ubique HTCC1062, SAR11_0371 was not colocalized with genes encoding the membrane permease or ATP-binding cassette components of an ABC transport system. Thus, SAR11_0371 was considered not to represent the SBP component of an SBP-dependent transport system and was excluded from the analysis. We also attempted to identify additional SBP genes through a search of the UniProt database for proteins in Ca. P. ubique belonging to Pfam clans CL0177 (PBP; periplasmic binding protein) and CL0144 (Periplas_BP; periplasmic binding protein like); however, this search did not return any additional candidate genes.

Cloning

The protein sequence of each SBP from Ca. P. ubique HTCC1062 was obtained from the UniProt database. Signal sequences were predicted using the SignalP 5.0 server⁶⁰ and removed. The protein sequences were then back-translated and codon-optimized for expression in E. coli, and the resulting genes were obtained as synthetic DNA from Twist Bioscience or Integrated DNA Technologies. The synthetic genes were cloned into the NdeI/XhoI site of the pET-28a(+) expression vector by In-Fusion cloning using the In-Fusion HD Cloning Kit (Takara Bio), yielding expression constructs with an N-terminal hexahistidine tag and thrombin tag. Correct assembly of each expression vector was confirmed by Sanger sequencing (FASMAC). The putative csiD gene, SAR11_1354, and several homologues of the Ca. P. ubique HTCC1062 SBPs (Supplementary Table 8) were cloned similarly into the pET-28a(+) vector, except that the thrombin tag was removed from the constructs of SAR11_1354, SAR11_0266 (Fub), or SAR11_1290 (SAR324). The sequences of oligonucleotides and synthetic genes used in this study are listed in Supplementary Table 9.

Optimization of protein expression

Protein expression was initially tested in E. coli BL21(DE3) cells grown in Luria-Bertani (LB) and Terrific Broth (TB) media at 30 °C and 17 °C. SAR11_0655 showed optimal soluble expression in LB medium at 17 °C, SAR11_1203 showed optimal soluble expression in TB medium at 30 °C, and eight proteins (SAR11_0797, SAR11_0807, SAR11_0864, SAR11_1068, SAR11_1179, SAR11_1210, SAR11_1238, and SAR11_1361) showed optimal soluble expression in TB medium at 17 °C. Next, the remaining proteins were tested for expression in E. coli SHuffle T7 cells (New England Biolabs) in TB medium at 17 °C; this strain expresses the disulfide bond isomerase DsbC, which can increase soluble recombinant expression of cytoplasmic proteins by promoting correct formation of disulfide bonds. Soluble expression of SAR11_0769, SAR11_0953, SAR11_1302, and SAR11_1336 was achieved under these conditions. Due to the lack of soluble expression for the remaining four proteins (SAR11_0266, SAR11_0271, SAR11_1290 and SAR11_1346), we also tested expression of one or two close homologues of each protein (Supplementary Table 8). The SAR11_0271 homologue from ‘Ca. Pelagibacter’ sp. HIMB1321 (denoted SAR11_0271*) could be expressed in soluble form in SHuffle T7 cells in TB medium at 17 °C, while the SAR11_1346 homologue from the same species (denoted SAR11_1346*) could be expressed in soluble form in BL21(DE3) cells in TB medium at 17 °C. SAR11_0271* and SAR11_1346* share 91.4% and 88.9% sequence identity, respectively, with the corresponding proteins from Ca. P. ubique HTCC1062, and the binding site residues are completely conserved (Supplementary Fig. 5), indicating that the functions and properties of the homologous SBPs are likely to be identical. None of the homologues of SAR11_0266 or SAR11_1290 could be expressed in soluble form in BL21(DE3) or SHuffle T7 cells. Expression of SAR11_0266 and SAR11_1290 without His₆ or thrombin tags also yielded insoluble protein.

Protein expression was typically evaluated by SDS–PAGE analysis as follows. Cells transformed with the relevant expression vector by electroporation were spread from a frozen glycerol stock onto an LB agar plate containing 0.2% (w/v) glucose and 25 µg ml⁻¹ kanamycin and incubated at 30 °C overnight. The cells were then scraped into a small volume of LB medium and used to inoculate 3 ml of the relevant growth medium containing 25 µg ml⁻¹ kanamycin in a 10 ml round bottom tube at a starting OD₆₀₀ of 0.05. The culture was incubated at 37 °C with shaking at 220 rpm until the OD₆₀₀ reached 0.5. One-millilitre aliquots were transferred to clean round bottom tubes and isopropyl β-d-1-thiogalactopyranoside (IPTG) was added to a final concentration of 0.5 mM. The induced cultures were incubated with shaking at 220 rpm at 17 °C overnight or 30 °C for 3 h. Cells from a 500-µl aliquot of each culture were resuspended in lysis buffer (20 mM Tris, 0.5 M NaCl, 1% (v/v) Triton X-100, pH 8.0) and incubated at room temperature for 10 min. The cell lysate was centrifuged at 21,000g for 5 min (4 °C). The soluble fraction of the cell lysate was transferred to a tube containing 30 µl cOmplete His-Tag purification Ni-NTA resin (Roche) suspended in 500 µl buffer A (8 M urea, 20 mM Tris, 0.5 M NaCl, pH 8.0), while the insoluble fraction of the cell lysate was dissolved in 500 µl buffer A, centrifuged at 21,000g for 5 min, and then transferred to a tube containing 30 µl Ni-NTA resin suspended in 500 µl buffer A. In both cases, the resin was incubated at room temperature for 10 min, washed twice with 500 µl buffer A, and then eluted by incubation with 50 µl buffer B (8 M urea, 20 mM Tris, 0.5 M NaCl, 0.5 M imidazole, pH 8.0) at room temperature for 5 min. Fifteen microlitres of supernatant was mixed with 5 µl of 4× SDS–PAGE sample loading buffer and heated at 90 °C for 10 min, then loaded onto a 4–15% pre-cast SDS–PAGE gel (Bio-Rad). The gel was run at 200 V for 30 min and visualized with Coomassie Blue.

Large-scale protein expression and purification

For expression and purification of the Ca. P. ubique SBPs, E. coli BL21(DE3) or SHuffle T7 cells transformed with the relevant expression vector were spread from a frozen glycerol stock onto an LB agar plate containing 0.2% (w/v) glucose and 25 µg ml⁻¹ kanamycin, and incubated at 30 °C overnight. The cells were then scraped into 3 ml LB medium, and 500 µl of the resulting cell suspension was used to inoculate 500 ml LB or TB medium supplemented with 25 µg ml⁻¹ kanamycin in a 2 l or 3 l flask, preheated at 37 °C. The culture was incubated at 37 °C with shaking at 220 rpm until the OD₆₀₀ reached 0.5, then cooled briefly in an ice-water bath until the temperature reached ~25 °C. IPTG was added to a concentration of 0.5 mM, and the culture was incubated at 17 °C with shaking at 220 rpm for a further 16 h. Cells were pelleted by centrifugation (3,300g, 15 min, 4 °C) and frozen at −20 °C until use. For protein purification, cells were thawed on ice, resuspended in 100 ml Ni binding buffer (20 mM Tris, 500 mM NaCl, 20 mM imidazole, pH 8.0), and lysed by sonication. After addition of 500 U Benzonase Nuclease (Sigma-Aldrich) to digest DNA, the cell lysate was centrifuged at 10,000g for 1 h (4 °C). The supernatant was filtered through a 0.45-µm syringe filter and then loaded onto a 1 ml HisTrap HP column (Cytiva) equilibrated with Ni binding buffer using an ÄKTA Pure FPLC system (Cytiva). For purification under native conditions, the column was washed with 10 ml Ni binding buffer followed by 10 ml Ni wash buffer (20 mM Tris, 500 mM NaCl, 44 mM imidazole, pH 8.0), and then the target protein was eluted in 10 ml Ni elution buffer (20 mM Tris, 500 mM NaCl, 500 mM imidazole, pH 8.0).
For purification under denaturing conditions, the column was washed with denaturing Ni binding buffer (8 M urea, 20 mM Tris, 250 mM NaCl, 20 mM imidazole, pH 8.0) at 1 ml min⁻¹ for 30 min after loading of the clarified cell lysate, and the target protein was eluted with 10 ml denaturing Ni elution buffer (8 M urea, 20 mM Tris, 250 mM NaCl, 250 mM imidazole, pH 8.0). Proteins purified under native conditions were concentrated to 400 µl using a 10 kDa molecular weight cut-off (MWCO) Amicon Ultra-4 centrifugal spin concentrator (Merck-Millipore) and purified by size-exclusion chromatography using a Superdex 200 Increase 10/300 column (Cytiva), eluting in DSF buffer (20 mM HEPES, 0.3 M NaCl, pH 7.5). For storage, proteins were concentrated to a volume of 0.5–2 ml and glycerol was added to a concentration of 10% (v/v). The protein was then flash-frozen in 100–200-µl aliquots in liquid nitrogen and stored at −80 °C until use. ArgT from S. enterica was expressed from a pETMCSIII plasmid and purified as described previously⁶¹.

Protein refolding

In most cases, protein purified under denaturing conditions was diluted to a concentration of 0.5 mg ml⁻¹ and a volume of 10–30 ml in denaturing Ni binding buffer (8 M urea, 20 mM Tris, 250 mM NaCl, 20 mM imidazole, pH 8.0) and transferred to 10 kDa MWCO SnakeSkin dialysis tubing (Thermo Scientific). The protein was then dialysed against 2 l dialysis buffer (20 mM Tris, 150 mM NaCl, pH 8.0) at 4 °C with three buffer changes over a period of 24 h. The protein was collected and exchanged into DSF buffer using a 10 kDa MWCO Amicon Ultra-15 centrifugal concentrator, then concentrated to 400 µl and purified by size-exclusion chromatography as described above. For SAR11_1346*, an improved yield of monomeric protein was obtained by refolding through rapid dilution: 2 ml of denatured protein (5 mg ml⁻¹ in denaturing Ni binding buffer) was added dropwise with stirring to 40 ml pre-chilled refolding buffer (20 mM Tris, 150 mM NaCl, 10% (v/v) glycerol, pH 8.0) and incubated at 4 °C with stirring for 20 h. The protein was then concentrated and purified by size-exclusion chromatography as above.
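As a rough check of the rapid-dilution conditions, the final protein and urea concentrations can be worked out from the stated volumes. The sketch below is illustrative only; it assumes that volumes are additive and that the refolding buffer contains no urea, and the function name `dilute` is hypothetical.

```python
def dilute(stock_conc, sample_ml, buffer_ml):
    """Concentration after mixing sample_ml of stock into buffer_ml of buffer.

    Assumes additive volumes and that the buffer contains none of the solute.
    """
    return stock_conc * sample_ml / (sample_ml + buffer_ml)


# Values taken from the text: 2 ml of 5 mg/ml protein in 8 M urea,
# added to 40 ml refolding buffer.
protein_final_mg_ml = dilute(5.0, 2.0, 40.0)
urea_final_molar = dilute(8.0, 2.0, 40.0)

print(f"protein: {protein_final_mg_ml:.2f} mg/ml, urea: {urea_final_molar:.2f} M")
```

Under these assumptions the 21-fold dilution brings the urea concentration well below 1 M.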

Differential scanning fluorimetry

DSF experiments were performed using a StepOnePlus Real-Time PCR System and StepOne software (Applied Biosystems) based on literature protocols⁶²,⁶³. Reaction mixtures were prepared in twin.tec Real-Time PCR Plates (Eppendorf) and contained 5× SYPRO Orange (Sigma-Aldrich), 2.5 µM protein, and 2 µl 10× ligand in a total volume of 20 µl DSF buffer. The plate was sealed with optically clear sealing film and centrifuged at 2,000g for 1 min before loading into the real-time PCR instrument. The temperature was ramped at a rate of 1% (approximately 1.33 °C min⁻¹), typically over a 60 °C window centred on the melting temperature (T_M) of the target protein. Fluorescence was monitored using the ROX channel. T_M values were determined by taking the derivative of fluorescence intensity with respect to temperature and fitting the resulting data to a quadratic equation in a 6 °C window in the vicinity of the T_M in R software.
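The T_M determination described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in this study; it assumes a central-difference derivative, a window centred on the maximum of dF/dT, and an unweighted least-squares quadratic fit. All function names are hypothetical, and the melting curve at the end is synthetic.

```python
import math


def derivative(temps, fluor):
    """Central-difference dF/dT evaluated at the interior temperature points."""
    mid = temps[1:-1]
    dfdt = [(fluor[i + 1] - fluor[i - 1]) / (temps[i + 1] - temps[i - 1])
            for i in range(1, len(temps) - 1)]
    return mid, dfdt


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def quadratic_vertex(xs, ys):
    """Least-squares fit of y = a*u^2 + b*u + c with u = x - mean(x).

    Solves the normal equations by Cramer's rule and returns the x position
    of the vertex of the fitted parabola.
    """
    x0 = sum(xs) / len(xs)
    us = [x - x0 for x in xs]
    s = [sum(u ** k for u in us) for k in range(5)]            # sums of u^k
    t = [sum((u ** k) * y for u, y in zip(us, ys)) for k in range(3)]
    m = [[s[4], s[3], s[2]], [s[3], s[2], s[1]], [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]

    def replaced(col):
        return [[rhs[r] if c == col else m[r][c] for c in range(3)]
                for r in range(3)]

    d = det3(m)
    a = det3(replaced(0)) / d
    b = det3(replaced(1)) / d
    return x0 - b / (2 * a)


def estimate_tm(temps, fluor, window=6.0):
    """T_M as the vertex of a quadratic fitted to dF/dT near its maximum."""
    mid, dfdt = derivative(temps, fluor)
    t_peak = mid[dfdt.index(max(dfdt))]
    pts = [(t, d) for t, d in zip(mid, dfdt) if abs(t - t_peak) <= window / 2]
    return quadratic_vertex([p[0] for p in pts], [p[1] for p in pts])


# Synthetic two-state melting curve with a midpoint of 55.3 degrees C.
temps = [25 + 0.25 * i for i in range(241)]
fluor = [1 / (1 + math.exp(-(t - 55.3) / 2.0)) for t in temps]
tm = estimate_tm(temps, fluor)
```

Fitting a parabola to the derivative peak, rather than taking the grid point with the largest derivative, gives a T_M estimate that is not limited by the temperature step of the instrument.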

Proteins were initially screened for binding to metabolites in four Phenotype MicroArray plates, PM1 to PM4 (Biolog). The contents of each well were dissolved in 50 µl (PM1 to PM3) or 20 µl (PM4) sterile filtered water, giving a concentration of approximately 10–20 mM in each well⁶³. The plates were then sealed with aluminium sealing films and stored at −80 °C. Prior to use, the plates were thawed at room temperature and then shaken at 30 °C until the compounds had redissolved. Two microlitres of each compound was added to 18 µl reaction mixture prepared as described above. A 2 °C increase in T_M compared with the median value across the plate was taken as indicative of binding⁶³,⁶⁴.
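The hit-calling rule for the plate screens can be expressed compactly. The following sketch is illustrative only: it treats the criterion as an increase of at least 2 °C over the plate median, and the well names and T_M values are invented for the example.

```python
from statistics import median


def call_hits(tm_by_well, threshold=2.0):
    """Return wells whose T_M exceeds the plate median by >= threshold (deg C).

    tm_by_well maps a well label to its fitted T_M; the returned dict maps
    each hit to its T_M shift relative to the plate median.
    """
    plate_median = median(tm_by_well.values())
    return {well: tm - plate_median
            for well, tm in tm_by_well.items()
            if tm - plate_median >= threshold}


# Hypothetical T_M values for five wells of one plate.
plate = {"A1": 50.1, "A2": 49.9, "A3": 50.0, "A4": 56.4, "A5": 50.2}
hits = call_hits(plate)
```

Using the plate median as the reference avoids the need for a dedicated no-ligand control in every plate, on the assumption that most compounds in a plate do not bind.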

For screening of individual compounds and confirmatory assays, compounds were dissolved at a concentration of 100 mM in ligand buffer (0.1 M HEPES pH 7.5), and the pH was adjusted with 1 M NaOH or 1 M HCl if necessary (specifically, if the pH of a 10 mM solution of the compound diluted in DSF buffer fell outside the range 6.5–8.0). These stock solutions were stored at −20 °C. Two microlitres of each compound was directly added to 18 µl reaction mixture, giving a final concentration of 10 mM, or first diluted 10-fold or 100-fold in DSF buffer to give final concentrations of 1 mM or 0.1 mM in the assay. A list of chemicals used for screening, including the supplier and catalogue number, is provided in Supplementary Table 3. Sodium (R)- and (S)-2,3-dihydroxypropane-1-sulfonate were synthesized from (R)- and (S)-3-chloro-1,2-propanediol following a literature protocol⁶⁵ and verified by ¹H and ¹³C NMR.

In the case of the TRAP and TTT SBPs, SAR11_0864 and SAR11_1203, we hypothesized that a metal ion might be required for high-affinity binding, due to the biphasic melting curve observed in the presence of isethionate in Biolog screening experiments, suggesting the presence of a mixture of active and inactive protein (SAR11_0864) or due to the discord between the highly charged ligand and the largely uncharged binding site of the SBP (SAR11_1203). Therefore, we tested the effect of the addition of metal ions (Mg²⁺, Ca²⁺, K⁺, Zn²⁺, Mn²⁺, Co²⁺, Ni²⁺, Fe²⁺ and Fe³⁺) on binding of isethionate to SAR11_0864 and citrate to SAR11_1203 by DSF (Supplementary Fig. 6). DSF experiments were performed using refolded protein as described above, with the addition of 1 mM metal ion and 1 mM ligand. Based on these results, and considering the concentration of each metal ion in seawater⁶⁶, 10.3 mM CaCl₂ (SAR11_0864) or 53 mM MgSO₄ (SAR11_1203) was included in subsequent DSF and ITC binding experiments for these SBPs.

Isothermal titration calorimetry

ITC experiments were performed using a MicroCal PEAQ-ITC system (Malvern Panalytical). Protein samples were refolded and freshly purified (not frozen), and protein and ligand samples were prepared in the same batch of DSF buffer used for size-exclusion chromatography to minimize the heat of dilution. For SAR11_0864 and SAR11_1203, calcium chloride (final concentration 10.3 mM) or magnesium sulfate (final concentration 53 mM), respectively, was added to the protein and ligand samples. Experiments were performed at 25 °C with stirring at 700 rpm and 10 µcal s⁻¹ reference power. Titration parameters were varied depending on the protein yield, the fraction of active protein, and the affinity and enthalpy of the interaction. In a typical titration, 35 µM protein was titrated with 1× 0.4-µl and 19× 1.6-µl injections of ligand, with the ligand concentration chosen to give >1.5-fold molar excess of ligand to active protein at the end of the titration. ITC experiments were generally performed at least in duplicate.

For simple 1:1 binding interactions, the association constant (K_a), enthalpy (ΔH), and stoichiometry (n) of the interaction were determined by fitting the data to the one-set-of-sites model in MicroCal PEAQ-ITC analysis software. In the case of the SAR11_0769 + d-glucose interaction, thermodynamic parameters were estimated through Bayesian fitting to a modified competitive binding model, which incorporated an additional parameter to account for the fraction of the ligand in each anomeric form, and a two-sets-of-sites model implemented in pytc software⁶⁷; the latter model is equivalent to the two-sets-of-sites model in the MicroCal software, except without the minor correction for heat associated with the displaced volume for each injection (for consistency with the other models in pytc). Thermodynamic parameters for the SAR11_0953 + l-glutamate, SAR11_1203 + citrate, SAR11_1210 + l-arginine, SAR11_1336 + glycine betaine, and SAR11_1346* + l-leucine interactions were determined through competitive displacement experiments⁶⁸, in which l-phenylalanine, cis-aconitate, d-octopine, glycine, or l-serine (respectively) were included at a fixed concentration in the cell to reduce the apparent binding affinity for the ligand of interest. The data for these competitive binding experiments were analysed by Bayesian fitting to the competitive binding sites model in pytc software. To confirm the high affinity of the SAR11_1210 + l-arginine interaction, a competitive binding experiment was performed where SAR11_1210 and ArgT from S. enterica (which has a K_d of 15 nM for l-arginine) were included in the cell together at the same concentration (28 µM) and titrated with l-arginine. Similarly, for the SAR11_1210(E108A) + l-arginine interaction, a mixture of SAR11_1210(E108A) and SAR11_1210 (35 µM each) was titrated with l-arginine. 
For these titrations, the data were fitted to a two-sets-of-sites binding model as described above to obtain thermodynamic parameters for both protein–ligand interactions. For all analyses, the heat of dilution was assumed to be a small constant value and included as a fitted parameter in the model. The validity of this assumption was confirmed for each ligand by performing a control titration where the ligand was injected into DSF buffer.
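The rationale for the competitive displacement experiments can be illustrated with the standard relationship for competitive binding, in which a weak competitor B present at concentration [B] raises the apparent dissociation constant of the tight ligand A to K_d,app = K_d,A (1 + [B]/K_d,B). The sketch below is not the Bayesian fitting procedure used in this study; it assumes that the free competitor concentration is approximately equal to the total, and the function names and numerical values are hypothetical.

```python
def apparent_kd(kd_tight, kd_weak, weak_conc):
    """Apparent K_d of the tight ligand in the presence of a weak competitor.

    All quantities are in molar units; weak_conc is the (approximately free)
    concentration of the competitor in the cell.
    """
    return kd_tight * (1.0 + weak_conc / kd_weak)


def true_kd(kd_app, kd_weak, weak_conc):
    """Recover the K_d of the tight ligand from the apparent value."""
    return kd_app / (1.0 + weak_conc / kd_weak)


# Hypothetical example: a 1 nM ligand measured against 1 mM of a 10 uM competitor.
kd_app = apparent_kd(1e-9, 1e-5, 1e-3)
print(f"apparent K_d = {kd_app:.3g} M")
```

In this hypothetical case the competitor weakens the apparent affinity about 100-fold, which is the kind of shift that brings a very tight interaction into the range that can be resolved in a direct titration.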

Spectrophotometric analysis of iron(iii) binding

Binding of iron(iii) to SAR11_1238 was analysed using a spectrophotometric assay based on literature protocols⁶⁹,⁷⁰. UV–vis spectra were recorded at room temperature (25 °C) in a 96-well plate from 300 nm to 630 nm with 1 nm bandwidth using a Multiskan GO spectrophotometer (Thermo Scientific). An initial protein concentration of 100 µM and an initial volume of 200 µl were used for all spectrophotometric assays. First, purified SAR11_1238 was thawed and exchanged into 50 mM Tris, 200 mM NaCl buffer (pH 8.0) using a centrifugal concentrator, and the spectrum of the resulting protein sample was recorded. To prepare unliganded protein for iron-binding assays, the protein was exchanged into 50 mM Tris, 200 mM NaCl, 20 mM sodium citrate buffer (pH 8.0) by three rounds of 30-fold dilution and concentration, allowing chelation and removal of the metal ligand. Citrate was then removed by four rounds of 30-fold dilution and concentration with 50 mM Tris, 200 mM NaCl buffer (pH 8.0). Binding assays were performed by titrating the unliganded protein (200 µl of 100 µM solution) with 8× or 10× 5-µl injections of 800 µM iron(iii) solution, which was prepared from iron(iii) chloride and a 2.5-fold molar excess of trisodium citrate (which ensures that the iron(iii) remains soluble) in ultrapure water. To confirm that SAR11_1238 binds iron(iii) rather than the iron(iii)–citrate complex, the protein was also titrated under the same conditions with 800 µM ammonium iron(ii) sulfate; under the aerobic conditions of the assay, iron(ii) is rapidly oxidized to iron(iii)⁶⁹. UV–vis spectra were recorded 1 min (iron(ii)) or 15 min (iron(iii)) after each injection. Finally, a competitive binding assay with citrate was used to estimate the affinity of SAR11_1238 for iron(iii).
The protein was saturated with a twofold molar excess of iron(iii) solution, diluted to a volume of 1 ml, and then dialysed against 500 ml of 50 mM Tris, 200 mM NaCl buffer (pH 8.0) at 4 °C overnight to remove excess iron(iii) and citrate. The protein was then concentrated to 100 µM and titrated with 5-µl injections of eight twofold serial dilutions of 500 mM sodium citrate (adjusted to pH 8.0 in 50 mM Tris, 200 mM NaCl buffer). The absorbance at 440 nm was recorded 5 min after each addition. The data were fitted to a hyperbolic curve, yielding an apparent K_d of 9.0 mM for citrate. Given that citrate has a K_d of ~10⁻¹⁷ M for iron(iii), this implies that SAR11_1238 has a K_d for iron(iii) on the order of 10⁻¹⁹ M, similar to previously characterized iron(iii)-binding proteins⁷⁰,⁷¹.
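The order of magnitude of this estimate can be reproduced with a simple exchange-equilibrium argument. The sketch below is an illustration, not the exact calculation used in this study: it assumes a 1:1 exchange of iron(iii) between protein and citrate, that the protein is fully loaded at the start, that half of the iron has moved to citrate at the apparent K_d for citrate, and that dilution during the titration can be ignored. The function name is hypothetical.

```python
import math


def protein_kd_from_competition(kd_competitor, competitor_at_midpoint, total_metal):
    """Estimate the protein K_d for a metal from a competition midpoint.

    For the exchange P.M + C <-> P + C.M, the ratio of dissociation constants
    is K_d(P) / K_d(C) = [P][C.M] / ([P.M][C]). At the midpoint [P] = [P.M],
    so the ratio reduces to [C.M] / [C], with [C.M] equal to half of the total
    metal and [C] approximated by the total competitor concentration.
    """
    metal_on_competitor = total_metal / 2.0
    exchange_constant = metal_on_competitor / competitor_at_midpoint
    return kd_competitor * exchange_constant


# Values from the text: K_d(citrate, iron(iii)) ~ 1e-17 M, apparent K_d for
# citrate 9.0 mM, protein (and hence bound iron) at 100 uM.
kd_protein = protein_kd_from_competition(1e-17, 9.0e-3, 100e-6)
print(f"log10 K_d = {math.log10(kd_protein):.1f}")
```

Under these simplifying assumptions the estimate is consistent with the order of magnitude quoted above; because it scales linearly with the assumed K_d of citrate for iron(iii), it should be read as a rough bound rather than a measured constant.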

X-ray crystallography

For the SAR11_0769/d-glucose and SAR11_1210/l-arginine structures, the proteins were first expressed and purified by nickel affinity chromatography under native conditions as described above. After addition of a 20-fold molar excess of d-glucose (SAR11_0769) or l-arginine (SAR11_1210), the protein was purified further by size-exclusion chromatography on a HiLoad 26/600 Superdex 75 pg column (Cytiva), eluting in 3× crystallization buffer (60 mM HEPES, 150 mM NaCl, pH 7.5). Fractions containing the target protein were collected, and d-glucose (SAR11_0769) or l-arginine (SAR11_1210) was added to a concentration of 30 µM. The protein was concentrated to a volume of ~500 µl, diluted threefold in water to reduce the NaCl concentration to 50 mM, and then concentrated further to 12 mg ml⁻¹. For the SAR11_0769/d-galactose and SAR11_0655/l-pyroglutamate structures, the proteins were expressed and purified in the same way, except that no ligands were added. Protein crystals were obtained using the vapour diffusion method in hanging drops at 20 °C, then cryoprotected and flash-frozen in liquid nitrogen. Crystallization and cryoprotection conditions for each protein are given in Supplementary Methods. X-ray diffraction data were collected on beamline BL32XU at the SPring-8 synchrotron (Harima, Japan), using the ZOO suite for automated data collection⁷². The data were automatically indexed, integrated, scaled and merged in XDS⁷³ using KAMO⁷⁴. The structures were solved by molecular replacement in Phaser⁷⁵ or MOLREP⁷⁶. For SAR11_1210, the structure of an opine-binding protein from Agrobacterium fabrum (PDB ID 5OT8) was used as a search model; in the remaining cases, an AlphaFold2 model was used⁷⁷. The structures were then refined by iterative real-space and reciprocal-space refinement in REFMAC⁷⁸, Phenix⁷⁹, and COOT⁸⁰. Data collection and refinement statistics are given in Supplementary Tables 10 and 11. Structures were visualized in PyMOL.

Gas chromatography–mass spectrometry

SBPs purified under native conditions were exchanged into 200 mM ammonium acetate using a PD-10 desalting column (Cytiva) and concentrated to ~1 mM. A 10-nmol aliquot of protein was mixed with 10 µl of 300 µM α-methylglucopyranoside (as an internal control) and 200 µl methanol. The mixture was agitated at 1,500 rpm at 24 °C for 10 min and then centrifuged at 21,000g for 20 min at 4 °C. The supernatant was evaporated to dryness using a vacuum evaporator, redissolved in 20 µl anhydrous pyridine, and derivatized by addition of 30 µl N-methyl-N-(trimethylsilyl)trifluoroacetamide (MSTFA) containing 1% trimethylchlorosilane (Supelco) followed by incubation at 70 °C for 1 h. In the case of SAR11_1361, the dried sample was instead dissolved in 20 µl of 20 mg ml⁻¹ methoxyamine hydrochloride in anhydrous pyridine and incubated at 37 °C for 90 min with agitation at 750 rpm before addition of the MSTFA mixture. The derivatized samples were injected immediately onto an Agilent 7890A GC System (Agilent Technologies) equipped with a PAL COMBI-XT autosampler (CTC Analytics) and connected to a PEGASUS 4D GC×GC TOF-MS instrument (LECO) operating in one-dimensional mode. The GC was fitted with a DB-1MS column (Agilent Technologies) with 30 m length, 0.25 mm internal diameter, and 0.25 µm film thickness. The instrument was operated in pulsed split mode with a split ratio of 2 and injection volume of 1 µl. The inlet temperature was 250 °C. Helium was used as the carrier gas with a flow rate of 1 ml min⁻¹. The GC oven temperature was held at 70 °C for 5 min, then raised at 12 °C min⁻¹ to 300 °C, and finally held at 300 °C for 10 min. Mass spectrometry data were collected from 50 to 500 m/z after a 6.5-min solvent delay. The ion source and transfer line temperatures were 250 °C and the ionization energy was 70 eV. Data analysis and spectral database searches against the NIST database were performed using ChromaTOF software (LECO).
Control samples were analysed before protein-derived samples to prevent carryover.

Biogeographical analysis

Biogeographical analysis was performed using the Ocean Gene Atlas v2.0 server³³. Abundance data for each SBP gene from Ca. P. ubique HTCC1062 in the Tara Oceans OM-RGC_v2_metaG and OM-RGC_v2_metaT datasets were obtained through a BLAST search with a stringent e-value threshold of 10⁻³⁰. To avoid inclusion of homologous SBPs with different transport functions, hits with a sequence identity of less than 40% (for ABC SBPs) or 55% (for TRAP and TTT SBPs) compared with the corresponding HTCC1062 SBP were excluded from the analysis.
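The hit-filtering criteria can be summarized in a few lines. The sketch below is illustrative and does not reproduce the Ocean Gene Atlas interface; the hit records, field names, and function name are hypothetical, and only the thresholds are taken from the text.

```python
# Minimum percentage identity to the HTCC1062 SBP, by transporter family.
IDENTITY_CUTOFF = {"ABC": 40.0, "TRAP": 55.0, "TTT": 55.0}
EVALUE_CUTOFF = 1e-30


def keep_hit(hit, transporter_family):
    """Keep a BLAST hit that passes both the e-value and identity thresholds."""
    return (hit["evalue"] <= EVALUE_CUTOFF
            and hit["identity"] >= IDENTITY_CUTOFF[transporter_family])


# Hypothetical hits for one ABC SBP query.
hits = [
    {"gene": "hit_1", "evalue": 1e-80, "identity": 72.5},
    {"gene": "hit_2", "evalue": 1e-35, "identity": 38.0},   # identity too low
    {"gene": "hit_3", "evalue": 1e-12, "identity": 45.0},   # e-value too high
]
kept = [h["gene"] for h in hits if keep_hit(h, "ABC")]
```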

To estimate the total abundance of SBP transcripts, abundance data for each of the 38 Pfam families in CL0177 (PBP; periplasmic binding protein) and CL0144 (Periplas_BP; periplasmic binding protein like), excluding the transferrin family (PF00405) and any families that contain solely enzymes or transcription factors (PF00800, PF01379, PF01634, PF02621, PF03466, PF09084), were obtained using a hmmer search of the OM-RGC_v2_metaT dataset with an e-value threshold of 10⁻¹⁰. Hits were obtained for 26 out of 31 Pfam families. For each Pfam family, the corresponding hidden Markov model (HMM) was obtained from the InterPro database⁸¹. The protein sequences from the hmmer search were then aligned to this HMM using hmmalign and used to construct a new HMM using hmmbuild in HMMER 3.4 (http://hmmer.org). A second hmmer search of the OM-RGC_v2_metaT dataset, with a less stringent e-value threshold of 10⁻⁵, was then conducted using the resulting HMM. The hits from all 52 searches were combined and redundant hits were removed, resulting in a total of 211,222 unique SBP genes. The two-step search recovered 94% of the 23,879 genes identified as homologues of the Ca. P. ubique HTCC1062 SBPs in the BLAST analysis before application of a sequence identity threshold; the remaining 1,267 genes were also added to the list of SBP genes. Finally, the total abundance of SBP genes at each site was calculated.
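The final merging step can be sketched with set operations. This is a minimal illustration that assumes each search returns a collection of gene identifiers; the function name and the toy identifiers are hypothetical, and the HMMER searches themselves are not reproduced here.

```python
def merge_sbp_hits(hmm_searches, blast_hits):
    """Combine hits from all HMM searches and add BLAST hits they missed.

    hmm_searches: iterable of collections of gene identifiers, one per search.
    blast_hits: collection of gene identifiers from the BLAST analysis.
    Returns (all unique genes, fraction of BLAST hits recovered by the HMM
    searches, BLAST hits that had to be added).
    """
    hmm_hits = set()
    for hits in hmm_searches:
        hmm_hits.update(hits)          # duplicates across searches collapse
    blast_hits = set(blast_hits)
    missing = blast_hits - hmm_hits
    recovered_fraction = 1.0 - len(missing) / len(blast_hits)
    return hmm_hits | missing, recovered_fraction, missing


# Toy example with two searches and two BLAST hits.
all_genes, recovered, added = merge_sbp_hits(
    [["g1", "g2"], ["g2", "g3"]], ["g3", "g4"])
```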

To estimate the percentage of SAR11 bacteria at a site containing a given SBP from Ca. P. ubique HTCC1062, we used the recruitment values of 159 SAR11 genomes in the Tara Oceans metagenome dataset calculated by Haro-Moreno et al.³⁴. The presence of a homologue of each SBP in each of the corresponding genomes was determined by BLAST using a 50% sequence identity and 50% coverage threshold. The relative abundance of SAR11 bacteria containing a given SBP homologue was then calculated for each station. Plots were generated using R and GraphPad Prism.
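One way to express this calculation is as a recruitment-weighted fraction. The sketch below is an assumption about the form of the calculation rather than the code used in this study: it takes the recruitment value of each genome at a station as its weight, and the genome names and values are invented.

```python
def percent_with_sbp(recruitment, has_homologue):
    """Percentage of recruited SAR11 signal from genomes carrying the SBP.

    recruitment: genome -> recruitment value at one station.
    has_homologue: genome -> True if a homologue passed the BLAST thresholds.
    """
    total = sum(recruitment.values())
    if total == 0:
        return 0.0
    with_sbp = sum(value for genome, value in recruitment.items()
                   if has_homologue.get(genome, False))
    return 100.0 * with_sbp / total


# Hypothetical station with three genomes.
station = {"genome_a": 30.0, "genome_b": 10.0, "genome_c": 60.0}
presence = {"genome_a": True, "genome_b": True, "genome_c": False}
pct = percent_with_sbp(station, presence)
```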

Phylogenetic analysis

Protein sequences homologous to the SBP of interest were identified via a BLAST search of the UniProtKB Reference Proteomes and Swiss-Prot databases⁸². The resulting sequences were filtered to remove a small number of unusually long sequences (>20% greater than mean length) and aligned in MUSCLE v3.8.31⁸³. The alignment was trimmed in trimAl v1.2 using the automated1 option⁸⁴ and then used to generate a maximum-likelihood phylogeny in FastTree v2.1.11, using LG + Γ₂₀ as the substitution model⁸⁵. For each protein sequence in the tree, the fraction of conserved binding site residues, compared with the corresponding protein from Ca. P. ubique HTCC1062, was estimated. The binding site residues were obtained from the crystal structure (SAR11_0769) or estimated from an AlphaFold2 model⁸⁶,⁸⁷. For this analysis, the following substitutions were treated as conservative: S/T, I/M, V/L, I/V, L/M, D/E, Q/N, A/V, F/Y, Y/W, F/W. Phylogenetic tree figures were generated using the ggtree package in R⁸⁸. Figures showing taxonomic distribution (Extended Data Fig. 8b) were generated using Krona⁸⁹.
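The binding-site conservation score can be sketched as follows. This illustration assumes that the binding-site residues of the reference and of each homologue have already been extracted from the alignment as equal-length strings of one-letter codes; the function names and example residues are hypothetical, and only the list of conservative substitutions is taken from the text.

```python
# Unordered residue pairs treated as conservative substitutions.
CONSERVATIVE = {frozenset(pair) for pair in
                ("ST", "IM", "VL", "IV", "LM", "DE", "QN", "AV", "FY", "YW", "FW")}


def is_conserved(ref, other):
    """True if residues are identical or form a listed conservative pair."""
    return ref == other or frozenset((ref, other)) in CONSERVATIVE


def conserved_fraction(ref_residues, other_residues):
    """Fraction of binding-site positions conserved relative to the reference."""
    if len(ref_residues) != len(other_residues):
        raise ValueError("binding-site residue strings must be the same length")
    matches = sum(is_conserved(r, o)
                  for r, o in zip(ref_residues, other_residues))
    return matches / len(ref_residues)


# Hypothetical five-residue binding site: D/E, S/T and F/Y count as conserved,
# W is identical, and R/K is not in the list of conservative substitutions.
fraction = conserved_fraction("DSFWR", "ETYWK")
```

Storing each pair as a frozenset makes the comparison symmetric, so S/T and T/S are treated identically.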

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
