Ethics and tissue acquisition
Human fetal bone and liver samples were obtained from 15 fetuses with Ts21 12–20 post-conception weeks (PCW) of age and 5 disomic fetuses 11–19 PCW of age, following termination of pregnancy and informed written consent. The human fetal material was provided by the Joint MRC/Wellcome Trust (grant MR/R006237/1) Human Developmental Biology Resource (http://www.hdbr.org), with maternal informed consent, in accordance with ethical approval by the National Health Service (NHS) Research Health Authority, REC Ref: 18/LO/0822. HDBR is regulated by the UK Human Tissue Authority (HTA; www.hta.gov.uk) and operates in accordance with the relevant HTA Codes of Practice. Sample size was not predetermined but based on sample availability and limited by the time period. Randomization and blinding were not applicable because samples were collected based on their karyotype, and data were analysed following automated computational pipelines.
Dissociation of fetal tissues
Fetal livers and femurs were received in L15 media and processed within 3 h from dissection. Livers were cut in smaller pieces with a scalpel and transferred to a tube containing prewarmed digestion media: RPMI (Gibco) supplemented with 10% FBS (Gibco), penicillin–streptomycin (10 U ml−1 penicillin and 100 ng ml−1 streptomycin, Sigma-Aldrich), 2 mM l-glutamine (Thermo Scientific), 1× MEM NEAA (Gibco), 1 mM sodium pyruvate (Gibco) and 1.6 mg ml−1 collagenase IV (Sigma-Aldrich). The tube was vortexed for 10 s, then incubated at 37 °C for 30 min, and vortexed for 10 s every 15 min. The digested tissue was filtered through a 100-μm filter and diluted in cold D-PBS (Gibco). Cells were centrifuged at 300g for 5 min, then aliquoted and cryopreserved in KnockOut Serum Replacement (Gibco) + 5% DMSO (Sigma-Aldrich). For femurs, adherent material was removed, then the epiphyses were removed with a scalpel and the bone marrow flushed with D-PBS. The remaining bone was cut in small pieces, then ground with a mortar and pestle using digestion media and incubated at 37 °C for 30 min, vortexing every 15 min. The digested material and the bone marrow flush were mixed and filtered through a 100-μm filter. Cells were centrifuged at 300g for 5 min, then aliquoted and cryopreserved in KnockOut Serum Replacement (Gibco) + 5% DMSO (Sigma-Aldrich). Cells were stored in liquid nitrogen until further analysis.
The karyotype for each sample used in this study was determined by quantitative fluorescent PCR (QF-PCR). The analysis was performed by the tissue bank from which the samples were obtained. QF-PCR was performed using chromosome-specific microsatellite markers. The analysis showed normal results with an apparently normal diploid complement for chromosomes 13, 15, 16, 18, 21 and 22 and the sex chromosomes in disomic samples and trisomy for chromosome 21 in Ts21 samples. No mosaicisms were detected.
FACS sorting for scRNA-seq
On the day of FACS sorting, cells were rapidly thawed at 37 °C and transferred to complete RPMI media (RPMI (Gibco) supplemented with 10% FBS (Gibco), penicillin–streptomycin (10 U ml−1 penicillin and 100 ng ml−1 streptomycin, Sigma-Aldrich), 2 mM l-glutamine (Thermo Scientific), 1× MEM NEAA (Gibco) and 1 mM sodium pyruvate (Gibco)). Live-cell enrichment was performed using MACS Dead Cell Removal Kit (130-090-101, Miltenyi Biotec) following the manufacturer’s instructions. When depleting for CD235a+ cells, a magnetic negative selection was performed using CD235a Microbeads (130-050-501, Miltenyi Biotec) and MACS LS columns (130-042-401, Miltenyi Biotec) following the manufacturer’s instructions.
For FACS sorting, cells were stained with Zombie Aqua (Thermo Fisher) to exclude dead cells and the cocktail of antibodies (Supplementary Table 21, sc-sorting panel) for 30 min at 4 °C. Cells were centrifuged for 5 min at 300g at 4 °C, resuspended in a final volume of 500 μl of 5% FBS in PBS, subsequently filtered into polypropylene FACS tubes (352063, Thermo Fisher) and sorted on a BD FACSAria Fusion.
scRNA-seq
Each cell suspension was submitted for 3′ scRNA-seq using Single Cell G Chip Kit, chemistry v3.1 (10X Genomics), following the manufacturer’s instructions. Libraries were sequenced on an Illumina NovaSeq S4 targeting 50,000 reads per cell, and mapped to the GRCh38 human reference genome using the Cell Ranger toolkit (v3.0.0).
Nuclei preparation and multiome sequencing
Nuclei preparation was performed following 10X genomics recommendations. Live, CD45+-sorted cells were centrifuged at 300g for 10 min at 4 °C. Pellets were resuspended in 45 µl chilled lysis buffer (10 mM Tris-HCl (pH 7.4), 10 mM NaCl, 3 mM MgCl2, 0.1% Tween-20, 0.1% Nonidet P40 substitute, 0.01% digitonin, 1% BSA, 1 mM dithiothreitol and 1 U ml−1 RNase inhibitor in nuclease-free water) and incubated on ice for 5 min. Of chilled wash buffer (10 mM Tris-HCl (pH 7.4), 10 mM NaCl, 3 mM MgCl2, 1% BSA, 0.1% Tween-20, 1 mM dithiothreitol and 1 U ml−1 RNase inhibitor in nuclease-free water), 50 µl was added and nuclei were centrifuged at 500g for 7 min at 4 °C. After removing the supernatant, the nuclear pellets were washed with 45 µl chilled diluted nuclei buffer (1× nuclei buffer, 1 mM dithiothreitol and 1 U ml−1 RNase inhibitor in nuclease-free water) without pipetting. Nuclei were centrifuged at 500g for 10 min at 4 °C. Nuclei were resuspended in 7 µl of chilled diluted nuclei buffer and counted. Fifteen thousand nuclei were targeted for library preparation. Each nuclei suspension was submitted for library preparation using the Chromium Next GEM Chip J Single Cell Kit (10X Genomics), following the manufacturer’s instructions. Libraries were sequenced on an Illumina NovaSeq S4 targeting 50,000 reads per nucleus, and mapped to the GRCh38 human reference genome using the Cell Ranger Arc toolkit (v1.0.1).
Processing of tissues for 10X Visium spatial transcriptomics
Tissues were frozen in dry-ice-cooled isopentane and stored in air-tight tissue cryovials at −80 °C. Before undertaking any spatial transcriptomics protocol, the tissues were embedded in an optimal cutting temperature compound (OCT) and tested for RNA quality with RNA integrity number (RIN). Tissues with RIN values > 7 were cryosectioned in a pre-cooled cryostat at 10 μm thickness. Two consecutive sections were cryosectioned at 10 μm thickness in a pre-cooled cryostat and transferred to the four 6.5 mm × 6.5 mm capture areas of the gene expression slide. Slides were fixed in methanol for 30 min before staining with haematoxylin and eosin and then imaged using the Nanozoomer slide scanner. The tissues underwent permeabilization for 6 min. Reverse transcription and second-strand synthesis was performed on the slide with cDNA quantification using quantitative PCR with reverse transcription (qRT–PCR) using the KAPA SYBR FAST-qPCR kit (KAPA Biosystems) and analysed on the QuantStudio (Thermo Fisher). Following library construction, these were quantified and pooled at 2.25 nM concentration. Pooled libraries from each slide were sequenced on NovaSeq SP (Illumina) using 150-bp paired-end dual-indexed setup to obtain a sequencing depth of approximately 50,000 reads as per the 10X Genomics recommendations.
Single-cell in vitro culture
Single-cell colony-forming unit (sc-CFU) was performed on fetal HSCs, as previously described51. Single, live, Lin−, CD34+, CD38−, CD62L+, CD52+ cells isolated from the fetal liver of three different fetuses with Ts21 (median of 13 PCW) and disomy (median of 12 PCW) were index-sorted into 96-well plates (Supplementary Table 21, HSC sc-CFU panel) containing StemSpan SFEM (Stemcell Technologies) supplemented with penicillin–streptomycin (10 U ml−1 penicillin and 100 ng ml−1 streptomycin, Sigma-Aldrich), 2 mM l-glutamine (Thermo Scientific), 20 ng ml−1 G-CSF (Peprotech), 20 ng ml−1 SCF (Peprotech), 20 ng ml−1 Flt3-L (Peprotech), 50 ng ml−1 TPO (Peprotech), 20 ng ml−1 IL-3 (Peprotech), 20 ng ml−1 IL-6 (Peprotech), 20 ng ml−1 IL-5 (Peprotech), 20 ng ml−1 M-CSF (Peprotech), 20 ng ml−1 GM-CSF (Peprotech) and 20 U ml−1 EPO (RnD). Cells were cultured for 15 days at 37 °C at 5% CO2. At the end of the culture, colonies were assessed for their lineage output by the expression of CD41a (megakaryocytic), CD235a (erythroid), CD3/CD56 (lymphoid) and CD11b (myeloid) by flow cytometry using a BD LSR-Fortessa analyser (Supplementary Table 21, sc-CFU lineage panel). Colonies were considered positive for a lineage if 30 or more cells were detected in the relative gate. The total number of cells in the colony was determined by Trypan blue exclusion using a Countess II cell counter (Thermo Fisher). To assess differences in the colony output between Ts21 and disomy, we performed a chi-squared test using Ts21 as the observed distribution and disomy as the expected distribution.
To compare Ts21 erythroid lineage output to disomic erythroid lineage output in the largest colonies, we first subset the Ts21 colonies with a cell count higher than 95% of all disomic colonies (equivalent to 20% of all Ts21 colonies), and the top 20% size colonies in disomic colonies. The output of multi-lineage colonies was binarized to the lineage with the highest number of cells in the relative gate. We then performed a binomial test with n = 17 observed Ts21 erythroid lineages, k = 23 total Ts21 lineages, and P = 0.5 (the proportion of the disomic lineages that are erythroid).
Fetal liver phenotyping by flow cytometry
Cells were rapidly thawed in complete RPMI media at 37 °C, then centrifuged at 300g for 5 min and washed again with D-PBS. Cells were then resuspended in D-PBS and LIVE/DEAD blue was added at a 1:800 final concentration. Cells were incubated for 15 min in the dark at room temperature, then washed with D-PBS. Cells were then stained for 30 min in the dark at room temperature with the antibody cocktail (Supplementary Table 21, phenotype panel), in the presence of BD Horizon Brilliant Stain Buffer (final dilution 1:4) and Miltenyi FcR blocking reagent (final dilution 1:5) in a final volume of 200 µl. Cells were washed with D-PBS and immediately acquired on a Cytek Aurora (five lasers setup). Data were analysed on FlowJo (v10.8.2).
MitoTracker and MitoSOX staining
MitoSOX is a specific fluorogenic dye for live-cell mitochondria, producing bright green fluorescence upon oxidation by mitochondrial superoxide. MitoTracker green FM accumulates in mitochondria independent of membrane potential and oxidative stress, serving as a reliable tool for mitochondrial mass measurement38. Considering dye efflux bias in HSCs and progenitor cells39, we used verapamil treatment to block xenobiotic efflux pumps and mitigate preferential dye efflux (Supplementary Fig. 16c,e), ensuring a more accurate representation of mitochondrial mass and mtROS levels.
Cells were rapidly thawed in complete RPMI media at 37 °C, then centrifuged at 300g for 5 min and washed again with D-PBS. Immediately before the incubation with the dyes, MitoTracker Green FM reagent was dissolved in DMSO, and MitoSOX green was dissolved in anhydrous N,N-dimethylformamide at a concentration of 1 mM. Cells were resuspended in 1 ml D-PBS and incubated with MitoTracker green FM (final dilution 1:1,000) or 2 µM MitoSOX green for 30 min at 37 °C in the presence of 50 µM verapamil (diluted from an aqueous 10 mM solution). Cells were then washed with D-PBS and stained for 30 min in the dark on ice with the antibody cocktail (Supplementary Table 21, mito panel) in the presence of BD Horizon Brilliant Stain Buffer (final dilution 1:4), Miltenyi FcR blocking reagent (final dilution 1:5) and 50 µM verapamil, in a final volume of 100 µl. Cells were washed again in D-PBS and immediately acquired on a Cytek Aurora (five lasers setup). Data were analysed on FlowJo (v10.8.2). The populations of interest (Lin+, CD38+, CD38−, HSCs) were exported as a fsc file and imported in R via flowCore 2.2 to obtain the fluorescence data of each mitochondrial probe for each cell.
MitoTracker and MitoSOX data analysis
To test whether Ts21 cells have significantly different mitochondrial mass or mtROS from disomic cells, we fit a Gaussian generalized linear mixed model. At single-cell resolution, we transformed the MitoSOX and mitochondrial mass values using a rank inverse normal transformation and used the transformed values as the response variable. We accounted for age as a fixed effect and sample as a random intercept. We tested for the effect of disease status (a fixed effect) within our model and determined significance using FDR across the eight fitted models (four cell types by two response variables).
TFR2 overexpression in HUDEP-2 cells
For the preparation of lentiviruses, HEK293T cells were co-transfected with lentiviral TFR2 expression plasmid or an empty vector (purchased from vectorbuilder) together with psPAX2 packing plasmid and pMD2.G envelope plasmid using Lipofectamine 3000 (Thermo Fisher). The viral supernatants were harvested at 48 h post-transfection and 72 h post-transfection, pooled and concentrated by ultracentrifugation (for 90 min at 90,000g). The pellet was resuspended in StemSpan SFEM II medium (Stemcell Technologies), aliquoted and stored at −80 °C. Functionality and titre of the lentiviruses were determined by serial dilutions on 293T cells and assessing GFP/TFR2 expression 48 h after transduction. HUDEP-2 cells were cultured and differentiated as previously described52. In brief, the cultivation medium of HUDEP-2 is based on StemSpan SFEM II medium (Stemcell Technologies) supplemented with 2% penicillin–streptomycin (20 U ml−1 penicillin and 200 ng ml−1 streptomycin, Sigma-Aldrich), 2 mM l-glutamine (Thermo Scientific), 50 ng ml−1 SCF (Peprotech), 3 U ml−1 EPO (R&D), 1 µM dexamethasone (Sigma) and 1 μg ml−1 doxycycline (Sigma). HUDEP-2 cells were incubated with concentrated lentiviruses (multiplicity of infection of 1) for 6 h, spun down and resuspended in fresh cultivation medium. After 2 days, cells were spun down and resuspended in differentiation medium, which is composed of IMDM (Thermo Fisher), supplemented with 2% penicillin–streptomycin (20 U ml−1 penicillin and 200 ng ml−1 streptomycin, Sigma-Aldrich), 2 mM l-glutamine (Thermo Scientific), 50 ng ml−1 SCF (Peprotech), 3 U ml−1 EPO (R&D), 5% human AB serum (Sigma), 10 μg ml−1 insulin (Sigma), 330 μg ml−1 holo-transferrin (Sigma), 2 U ml−1 heparin (Sigma) and 1 μg ml−1 doxycycline (Sigma). Cells were cultured for 4 days in differentiation medium at 37 °C at 5% CO2. Cells were assessed for TFR2 and erythroid lineage marker expression (CD235a, CD71 and CD36) by flow cytometry using a BD LSR-Fortessa analyzer (Supplementary Table 21, HUDEP panel). HUDEP-2 and HEK293T cells were gifted by the Grønbæk and Issazadeh-Navikas laboratories, respectively (BRIC, University of Copenhagen), and routinely tested for mycoplasma contamination.
Analysis of the scRNA-seq data
For each CellRanger output corresponding to a specific technical and biological replicate, we identified low-quality cells or empty droplets by applying the barcodeRanks and emptyDrops functions using the R package DropletUtils53. We then merged all CellRanger outputs into a single Scanpy object54. Following per-sample droplets removal, quality control was applied based on three parameters: the total unique molecular identifier count (lower-upper threshold (750, 110,000)), the number of detected genes (lower-upper threshold (250, 8,500)), and the proportion of mitochondrial gene count per cell (an upper bound of 20%). We further applied Scrublet55 to remove potential doublets.
Next, we subsetted to samples of the same organ and Ts21 status and merged into a single dataset (for example, a dataset containing only Ts21 liver samples). We reasoned that Ts21 and disomic cell transcriptomes would be influenced heavily by an extra copy of chromosome 21 (for example, 50% higher expression within Ts21 cells that would not be present in disomic cells). As a result, the highly variable genes will be impacted by disomic versus Ts21 differences and insufficiently capture the genes relevant to liver-residing or femur-residing cells. Therefore, we chose to create disomic liver, Ts21 liver, disomic femur and Ts21 femur datasets to most accurately annotate the population of the individual cells in the data. Within each of the four merged datasets, we applied log-normalization, using the scaling factor 10,000 to correct for between-sample differences in library size, and calculated highly variable genes, using the Seurat (v5.0.3) implementation56. We performed principal component analysis (PCA) on the highly variable genes for dimensionality reduction, retaining the top 15 components using the Scree plot elbow rule. Data were batch-corrected using Harmony16 to account for additional technical variations arising between samples that are non-biological in origin.
We then performed an iterative clustering procedure to identify clusters in the single-cell data. Broadly, our iterative clustering procedure first found initial clusters using the Leiden algorithm, then merged clusters from seemingly identical cell populations, and finally subclustered into further refined populations using K-means clustering. Thus, the iterative clustering allowed us to further refine initial clustering, such that initial clusters containing multiple cell types can be further split into lower-level cell types. This is particularly useful for broad cell types such as erythroid cells that were further split into early and late erythroid cells. Following the between-sample batch correction above, we computed a neighbourhood graph using the uniform manifold approximation and projection (UMAP) approach implemented in Scanpy and subsequently clustered with the Leiden algorithm. For visualization purposes, we used UMAP manifold embedding to capture the global features in two and three dimensions. We identified marker genes for each cluster by performing a Wilcoxon signed-rank test with FDR correction, and we annotated clusters using these marker genes and canonical marker genes. We performed further clustering, by manually choosing clusters to subcluster using K-means clustering, merging clusters of the same cell type and performing marker gene detection. With this approach, we generated four separate annotated scRNA-seq datasets, together with associated marker genes, for the Ts21 (liver and femur) and disomic (liver and femur) datasets.
Contrasting cell-type abundances between Ts21 and disomic samples
To compare cell-type abundances, we calculated the proportion of each major cell type group in each sample. We contrasted cell-type proportions between developmental stage-matched Ts21 and disomic samples with the same sorting strategy using a Mann–Whitney U-test. Finally, we corrected for multiple testing using FDR and assessed significance at FDR < 0.1.
Ligand–receptor analysis using CellPhoneDB
We inferred statistically significant ligand–receptors and their corresponding cell types using CellPhoneDB on a subsampled Ts21 liver dataset, such that the proportion of cells in the reduced sample recapitulated the proportion in the full Ts21 dataset and corresponded to the number of cells in the disomic dataset. We repeated the same analysis on the full disomic dataset (which now has an identical cell size). We kept any pairs that did not involve HLA or a protein complex, and kept only those that involved a single receptor. Among the significant ligand–receptors (P < 0.001), we selected ligands or receptors identified in HSC/MPPs and used to communicate with vascular endothelial cells, and performed gene set enrichment analysis (GSEA) on those using EnrichR.
Analysing differential trajectory of the osteo-lineage
For the disomic and Ts21 femur, we computed partition-based graph abstraction (PAGA) using all annotated stromal cells (with a PAGA threshold of 0.05). We also computed a force-directed diffusion graph using Pegasus57, and overlaid the Pegasus and PAGA outputs.
Next, we focused on two different cases of osteo-linage transitioning: (1) within CAR cells, LepR+ CAR cells, osteoprogenitors and osteoblasts, and (2) within arterial endothelial cells, transitioning endothelial cells and osteoblasts. We computed pseudotime using Scanpy, and used the pseudotimes as input into PseudotimeKernel in CellRank58 (without usage of RNA velocity information) to obtain generalized Perron cluster cluster analysis (GPCCA) estimators for identifying macrostates and computing transition probabilities among them. We set the terminal state number according to the shape of the force-directed graph. For case (1), we chose two states in disomic cells based on the observation that there are two clear branches splitting between CAR cells and osteoblasts, and we chose one state in Ts21 because we observed a single branch leading to osteoprogenitors. For case (2), we chose two states considering osteoblasts as one end and some transitioning endothelial cells as the other. Next, we plotted the STREAM plot using the scVelo package59 to visualize the cell-type transition matrix. Finally, we correlated gene expression with estimated absorption probabilities (Pearson correlation, as implemented in the CellRank package). We identified the positively or negatively correlated genes at a significance level of FDR < 0.05 separately in Ts21 and the disomic femur. We checked the Gene Ontology terms of the top 500 genes that were most positively or negatively correlated to absorption probabilities using the clusterProfiler R package60.
Performing differential expression analysis in scRNA-seq
Within each cell type, four distinct differential expression analyses were performed to identify differentially expressed genes (DEGs) due to disease status (Ts21 or disomic) or the microenvironment (liver or femur).
-
(1)
Liver versus femur in disomic samples
-
(2)
Liver versus femur in Ts21 samples
-
(3)
Ts21 versus disomic in liver samples
-
(4)
Ts21 versus disomic in femur samples
Previous literature has shown that pseudobulk differential expression methods have improved FDRs compared with single-cell differential expression methods61. As a result, our analyses were performed by first computing cell-type-specific pseudobulk profiles for each sample and then analysing pseudobulk RNA-seq profiles using limma62.
To calculate sample-level pseudobulk profiles, we aggregated the read counts across cells of the same type. We kept samples for analysis that contained at least ten cells, and we used the filterByExpr() function in the edgeR package with default settings to retain genes for differential expression analysis and reduce the burden of multiple test correction, by removing genes with low expression across samples63.
Next, limma-voom was used to perform a statistical analysis for differential expression. In brief, sample-level weights were calculated by computing normalization factors for transforming count data into log2 counts per million and deriving weights based on a mean-variance relationship (using the calcNormFactors() function in edgeR and the voom() function in limma in R). Log fold changes for each gene were estimated using a linear model with sorting strategy as a covariate. P values were estimated after empirical Bayes shrinkage (lmFit and eBayes() functions in limma). A Benjamini–Hochberg FDR correction was applied across all gene P values, and significance was assessed at FDR < 0.05.
Analysis of Ts21 versus disomic DEGs
As we observed an exponential cross-dependency between the proportion of DEGs on chromosome 21 and other chromosomes, we investigated additional factors that could be relevant. We first tested whether cell-type-specific overexpression of a particular gene on chromosome 21 can lead to greater dysregulation on either chromosome 21 or on other chromosomes. As the probability of a gene being differentially expressed is linked to the number of cells tested in the differentially expressed analysis, we tested this using log fold change values. For each chromosome 21 gene and across cell types, we tested the Pearson correlation between log2 fold change of gene expression (from Ts21 versus disomic samples) and (1) chromosome 21 DEG (%) or (2) non-chromosome 21 DEG (%), and determined the significance of the correlation using FDR < 0.1. Second, we reasoned that overall expression of an important chromosome 21 gene could lead to greater dysregulation, as a highly expressed chromatin modifier or transcription factor might have a consistent 50% overexpression in Ts21 across cell types, but the overexpression might matter more in the cell types where the gene is expressed. We tested this for each chromosome 21 gene. Across cell types, we tested the Pearson correlation between average cell-type-specific gene expression (from Ts21 and disomic samples) and (1) chromosome 21 DEG (%) or (2) non-chromosome 21 DEG (%).
Identification of context-specific DEGs in HSC/MPPs
It is difficult to ascertain whether a gene is commonly or uniquely upregulated in single-cell data (for example, a gene upregulated in Ts21 liver HSCs compared with Ts21 femur HSCs, but not disomic liver HSCs compared with disomic femur HSCs). The presence of a DEG in one cell type and the absence in another may be a result of differences in population size, and thus purely statistical.
As there are sample and cell count differences between datasets, we could not directly take the Ts21 liver versus femur DEGs as the Ts21 population was much larger than the disomic datasets. Instead, we identified DEGs specific to disease status (in liver versus femur analyses) and the microenvironment (in Ts21 versus disomic analyses) in HSC/MPPs by using a subsampling procedure. Downsampling allows the ability to compare two analyses from distinct datasets that are confounded by differences in size. We did not repeat the same procedure across additional cell populations to conclude whether genes are differentially expressed in specific cell populations, as this would require to downsample to the smallest population sizes. This would erode statistical power and be computationally expensive.
Within liver versus femur analyses, we downsampled the Ts21 liver and Ts21 femur dataset to have the same number of fetuses contributing the same number of samples with the same number of HSC/MPPs as the disomic liver and disomic femur data. As a result, the Ts21 data matched the disomic data in terms of fetus sample cell counts. Similarly, within Ts21 versus disomic analyses, we downsampled the Ts21 and disomic liver data based on fetus sample cell counts in the Ts21 and disomic femur data. As an additional restriction in our downsampling, we ensured that fetuses present in both liver and femur data, with equal or greater number of cells and samples in the liver than in the femur, would still be selected in the downsample. The downsampling routine was repeated 100 times, such that 100 new datasets were created that match the smaller dataset. Differential expression analysis was performed identically to the full data using sample-level pseudobulks and limma-voom. The median nominal P value for each DEG was calculated across 100 iterations. We verified the robustness of this choice of 100 iterations by visualizing the variability of the median P value across iterations to assess its stability.
Next, we used differential expression analyses in the full data and in the downsampled data to categorize the context dependence of DEGs. In the liver versus femur analysis, we implicate environment-driven DEGs, Ts21-induced DEGs and Ts21-reverted DEGs.
-
(1)
Environment-driven DEGs are genes with an adjusted FDR < 0.05 in either Ts21 or disomic samples, and nominal P < 0.05 in the other dataset.
-
(2)
Ts21-induced DEGs were discovered by identifying genes with adjusted FDR < 0.05 and median P value across 100 subsamples of P < 0.05 in the Ts21 data and a nominal P > 0.05 in the disomic data.
-
(3)
Ts21-reverted DEGs are those discovered in only the disomic dataset: an adjusted FDR < 0.05 in disomic samples but nominal P > 0.05 in the Ts21 dataset.
-
(4)
Environment-driven or Ts21-induced DEGs are those genes that have FDR < 0.05 in the full Ts21 dataset, but have a median nominal P value across 100 Ts21 subsamples of P > 0.05 and FDR > 0.05 in the disomic data. For these genes, we do not have sufficient evidence to claim context dependence nor reject context dependence, as the discovery of these DEGs might be highly dependent on sample size.
To visualize gene–environment interactions, we examined the expression of environment-driven DEGs and Ts21-induced DEGs across Ts21 and disomic liver and femur HSC/MPPs. We scaled expression across cells for each DEG to mean = 0 and variance = 1. We averaged the scaled expression across genes within the environment-driven and Ts21-induced gene sets, such that each cell has its own value for each gene set. We visualized the mean and standard error of these values across cells in Fig. 2c.
GSEA of upregulated Ts21-induced genes was performed by inputting the list of genes into EnrichR. Scatterplots show the top Gene Ontology terms (molecular function and biological process) or ENCODE and ChEA transcription factors.
Identifying differences in cell cycling across cell types, environment and Ts21
We assigned a cell cycle using the score_genes_cell_cycle() function in Scanpy with the standard list of cycling genes from Tirosh et al.64, as applied to all cells from samples of the same environment and disease type. We determined cycling by the predicted cycling phase being equal to ‘G1’ or not (either ‘G2M’ or ‘S’). We compared the proportions of cycling cells using a Mann–Whitney U-test.
Evaluating the regional distribution of catalogued somatic mutations
Using publicly available data, we evaluated the hypothesis that the regulatory landscape is affected in Down syndrome to influence where somatic mutations occur. First, we downloaded somatic mutation data from Hasaart et al. in fetal Ts21 and disomic HSCs4, and converted the mutation positions from hg19 to hg38 using liftOver. Second, we identified genes expressed in Ts21 HSCs. We used the filtered set from the Ts21 cycling versus less-cycling HSC differential expression analyses, which were identified by the filterByExpr() function. Next, we narrowed down the Ts21 and disomic sets of mutations within the HSC-expressed genes to the set of non-coding intronic mutations. Finally, we downloaded candidate cis-regulatory elements (cCREs) in hg38 from ENCODE; for Ts21 and disomic mutation sets, we calculated the proportion of intronic somatics in Ts21 HSC-expressed genes that overlap with ENCODE cCREs. To determine significant differences, we bootstrapped the disomic intronic mutation set for 1,000 times and compared the observed Ts21 proportion to the disomic distribution. We calculated P values as the proportion of bootstrapped disomic mutation sets with larger values than the Ts21 value.
Analysis of the 10X Visium data
For disomic and Ts21 liver datasets, we used our annotated cell types from the disomic and Ts21 large scRNA-seq datasets as our input reference data for Cell2location65. Next, we merged all SpaceRanger outputs of tissue sections for the disomic liver and then separately for the Ts21 liver to create two Scanpy objects56. We removed mitochondrial genes and spots with the total expressed gene count of less than 800 (the remaining spots were of sufficient good quality for downstream analysis).
We then estimated cell-type abundances for each spatial spot. Using Cell2location, we trained a negative binomial regression model on the input reference data. We applied our model to the Scanpy formatted data, considering tissue section as a covariate to account for distinct batches. We used the estimated posterior mean value of each cell type (from Cell2location) as the local abundances. For each section, we computed the section-level relative abundance of each cell type as the proportion of its estimated abundance across all spots over the total estimated abundance of all cell types across all spots. We compared relative abundances between disomic and Ts21 using a Wilcoxon rank-sum test, and corrected P values by using Benjamini–Hochberg.
To evaluate cell-type colocalization, we computed spot-level relative abundance of each cell type, dividing the abundance of each cell type on an individual spot by the total abundance of all cell types on the same spot. We then computed a Pearson distance matrix among cell types, based on these spot-level relative abundances across sufficient-quality spots of tissue sections in the same disease status, respectively, for the Ts21 liver and disomic liver. We next performed hierarchical clustering, with inter-cluster distance estimated by the Ward variance minimization algorithm.
Processing the multiome data
We performed the initial processing of multiome data using Seurat (v5.0.3) and Signac (v1.13)66. After Cellranger processing, we identified high-quality multiome cells for downstream analysis if they satisfied the following criteria: more than 750 RNA unique molecular identifiers, more than 250 expressed genes, less than 40% mitochondrial read fraction, transcription start site (TSS) enrichment score of more than 3 and more than 1,000 ATAC fragments in peaks. We next identified and annotated transcriptionally distinct clusters within Ts21 and disomic samples using Seurat. We created Ts21-specific and disomic-specific expression matrices by merging the matrices across Ts21 or disomic samples, respectively. Within the separate Ts21 and disomic expression datasets, we log-normalized with a scaling factor of 10,000, identified 2,000 highly variable genes, scaled and centred the data, performed PCA, and used Harmony with lambda = 1 to batch correct for sample-specific variation. We then constructed a k-nearest neighbours graph based on the Euclidean distance in PCA space (using the first 30 components), and identified transcriptionally distinct clusters using the Leiden algorithm. We nominated marker genes for each cluster by performing a Wilcoxon signed-rank test that compares cells within one cluster to all other cells. We performed further clustering by performing K-means clustering and we merged clusters of the same cell type. With this approach, we annotated each cell in two separate multiome datasets (Ts21 and disomic). The Harmony-corrected datasets were visualized in two dimensions using UMAP.
Our overarching process for creating the chromatin accessibility matrix was to call peaks within each sample before calling a final set of peaks from cell-type-specific ATAC profiles. We first called peaks within each sample using macs2, as implemented with default parameters in the callPeaks() function in Signac. We created a unified set of peaks across all samples by combining any intersecting peaks into a single peak, and removing the combined peaks that were less than 20 bp or more than 10 kb wide. This set of peaks was used to compute a cell × peaks matrix for each sample from the ATAC fragments file. Using all peaks present in at least ten cells, we ran latent semantic indexing (a two-step procedure of first using term frequency-inverse document frequency normalization and then singular-value decomposition) to project the ATAC matrix into a reduced dimension representation. We performed batch correction over all samples using Harmony before constructing a k-nearest neighbours graph across the first 30 components (except omitting the first component, which correlates with sequencing depth). This graph was used to perform clustering using the Leiden algorithm with resolution = 1; within each cluster, a new set of peaks were called using macs2. This set of peaks was combined into a unified set of peaks, which was used to form the final cell × peaks matrix, which contained all final Ts21 and disomic cells.
For downstream analyses involving a single combined dataset, the two datasets were merged into a single matrix. Log-normalization, scaling, PCA, Harmony batch correction and UMAP were applied to the combined dataset.
Myeloid trajectory analysis of snRNA using CellRank
Processed snRNA data from multiome was subset to cells of the myeloid lineage. Both disomic and Ts21 cohorts were downsampled to 15,000 cells. A trajectory graph was calculated using the force-directed layout (FLE) function in Pegasus. Instead of the UMAP space, the cell trajectories were plotted in the FLE space. Trajectory analysis of the RNA expression data closely followed the CellRank tutorial ‘CellRank beyond RNA velocity’. Moments of connectivity were calculated using scVelo with 30 principal components and 15 neighbours. The root cell was manually selected according to the diffusion map of HSCs, selecting the cell with the greatest Euclidean distance in FLE space from the centre of the cluster, indicating a cell with a divergent transcriptome. The pseudotimeKernel was used to calculate pseudotime with default parameters. The CytoTRACEKernel was used to compute the transition matrix with the parameters used in the tutorial (threshold_scheme = ‘soft’, nu = 0.5). To compute terminal states and the probability of each cell differentiating towards each terminal state, the GPCCA estimator was utilized with default parameters. Schur decomposition was performed, and five terminal states were automatically selected according to an eigengap in the real part of the eigenvalues. Terminal states were labelled according to the cell type with the closest association (late erythroid, monocytes, mast cells, pDCs and megakaryocytes) and absorption probability was calculated. Significant differences in predicted terminal states between Ts21 and disomic HSCs were calculated using a binomial test that used disomic proportions as the background probabilities.
Topic modelling of the HSC multiome using MIRA
We used MIRA to perform topic modelling of HSCs in the 10X multiome dataset42. Only HSCs were included for all downstream MIRA analysis. Out of the 6,215 HSCs, 3,784 were Ts21 and 2,431 were disomic. All MIRA analysis closely followed the online tutorials.
To generate the latent topics for RNA, the variational autoencoder (VAE) framework uses raw expression counts as input. Rare genes were removed by filtering genes only expressed in 15 or fewer cells. Exogenous genes (n = 7,905) were selected using the highly variable gene function in Scanpy, selecting for all genes with a minimum mean dispersion of 0.1. Exogenous genes are genes that will be captured in topics but will not be used as VAE features. Endogenous genes (n = 4,359) were selected by filtering the exogenous genes for those with a normalized dispersion greater than 0.5. Endogenous genes were used as features for the VAE network. The ExpressionTopicModel was instantiated with the default parameters. The learning rate bounds were manually tuned to cover the portion of the learning rate versus loss curve with the steepest slope. The model was then tuned using TopicModelTuner with default iterations, a minimum number of topics set to 2, a maximum number of topics set to 15, a batch size of 32, threefold cross-validation and a training size of 0.8.
To generate the latent topics for ATAC, the VAE framework used binarized peak counts. Peaks were filtered according to the epiScanpy tutorial. Using the same process for RNA, 72,541 exogenous peaks and 45,095 endogenous peaks were selected according to a minimum mean dispersion of 0.05 and a normalized dispersion greater than 0.5. The AccessibilityTopicModel was instantiated with the default parameters and ‘dataset_loader_workers’ set to 3. The learning rate bounds were manually tuned to cover the portion of the learning rate versus loss curve with the steepest slope. The model was then tuned using TopicModelTuner with default iterations, a minimum number of topics set to 2, a maximum number of topics set to 15, a batch size of 8, onefold cross-validation and a training size of 0.8.
GSEA was performed on the expression topics using wrapper of enrichr in MIRA and the top 200 genes associated with each topic. For the accessibility topics, transcription factor-binding sites were annotated in peaks using the motif scanning in MIRA and the hg38 reference from the UCSC repository. Each peak was scanned according to the JASPAR 2020 vertebrate collection of transcription factor-binding motifs. Transcription factors that were not expressed in the RNA data were removed. For comparing RNA and ATAC topics, the joint data were split into a disomic and Ts21 dataset and the ‘get_topic_cross_correlation’ function in MIRA was performed.
Trajectory analysis of the HSC multiome using MIRA
First, a joint embedding space was calculated using the ‘make_joint_representation’ function in MIRA to combine both modalities. A neighbourhood graph and UMAP embedding were performed on the joint representation (15 neighbours and a minimum distance of 0.1). The datasets were then batch corrected using Harmony on the joint UMAP features. The neighbourhood graph and UMAP embedding were re-calculated on the Harmony-adjusted feature space.
We next calculated several different HSC branches. First, the diffusion map was calculated using Scanpy ‘diffmap’ with default parameters and then normalized by MIRA to regularize distortions in magnitude of the eigenvectors. Schur decomposition was performed and the eigengap heuristic was used to automatically select the proper number of diffusion components within the data (3). The data were subsetted to three diffusion components and the neighbourhood graph was calculated on the diffusion map embedding and components were connected. Pseudotime was calculated using the ‘get_transport_map’ function in MIRA, which defines a transport map using a Markov chain model of forward differentiation. A root cell was selected as the maximum value of the third diffusion component based on the suggestion in the tutorial. This root cell was located in the centre of the highest density of HSCs. Terminal cells were identified using the ‘find_terminal_cells’ function in MIRA with 8 iterations and a threshold of 0.01. There were three distinct clusters of terminal cells, the cell in each cluster farthest away from the root cell was selected, and the probability of each cell differentiating towards those three branches was calculated. The lineage probabilities were parsed into a bifurcating tree structure using ‘get_tree_structure’ with the threshold set to 1. Expression and accessibility dynamics across time were plotted using a streamgraph that depicts this tree structure.
Regulatory potential modelling of the HSC multiome using MIRA
LITE modelling in MIRA was performed to link gene expression to nearby cis-regulatory elements. For every gene, MIRA learns a regulatory window describing a range in which changes in local accessibility appear to influence gene expression. The regulatory window decays exponentially both upstream and downstream according to the TSS of each gene. Consequently, each gene is associated with a unique TSS using non-redundant human TSS annotations (hg38 GENCODE VM39). The model was trained on the union of genes that were highly variable and were the top 5% most activated in each of the 10 expression topics (n = 5,367). The genes that did not have an annotated TSS were removed leaving 4,454 genes. The LITE model was instantiated with default parameters and the raw expression and accessibility data were then fitted (4 out of 4,454 genes failed to fit). Now that each gene contained a trained regulatory potential model, the expression of each gene was estimated by calculating the maximum a posteriori prediction given the accessibility state of each gene in each cell.
Next, we identified transcription factors that regulate the expression of genes specific to each of the three HSC branches using the probabilistic in silico depletion method in MIRA. MIRA simulates ‘computational knockouts’ of each transcription factor. MIRA uses regulatory potential modelling to predict gene expression based on local chromatin accessibility, and then masks cis-regulatory elements with specific motifs to define transcription factors where motif accessibility is important to gene expression prediction. This is measured by the changes in performance of the regulatory potential model of the gene to predict expression after computationally masking the binding sites of every transcription factor. In this manner, transcription factors that strongly regulate the expression of a gene will be prioritized because masking of their binding site will significantly decrease the accuracy of the LITE model prediction. The function ‘probabilistic_isd’ was run with default parameters across all modelled genes. To counteract the inefficiency of noisy transcription factor-binding site predictions, co-varying genes associated with individual topics were queried for a shared association across many transcription factors. The ‘driver_TF_test’ function was utilized to identify potential transcription factors regulating the expression of branch-specific topics. The top 150 topic-specific genes were included, and a Wilcoxon rank-sum test was performed over the association scores. After identifying transcription factors regulating topic expression, the top genes being regulated by these transcription factors were queried using ‘fetch_ISD_matrix’ and selected according to the ranked association score. We additionally performed this procedure across all RNA topics, which we reported in Supplementary Table 16.
Divergence in accessibility and expression of the HSC multiome using MIRA
To identify genes in which expression cannot be accurately predicted by local chromatin accessibility alone, MIRA NITE modelling was performed. In addition to cis-regulation, the NITE model expands on the scope of the LITE model by incorporating accessibility topics, which are genome wide. The NITE model was initialized using the same parameters, topics and genes as the LITE model using ‘spawn_NITE_model’. The model was fit and expression was predicted using default parameters. The difference between LITE and NITE model performance for every gene was calculated using ‘get_chromatin_differential’. The ‘get_NITE_score_genes’ function was used to calculate a cumulative metric per gene describing the divergence of local accessibility and expression across all cells. Genes with a high cumulative NITE score indicate genes that are regulated in part by non-local mechanisms. The top 500 genes with the highest NITE score were incorporated into the GSEA analysis in MIRA. To ascertain whether Ts21 HSCs expressed genes enriched for non-local regulation, pseudobulk differential expression was performed (DESeq2) between disomic and Ts21 HSCs. The distribution of cumulative gene NITE scores for the top 300 condition-specific genes (log2FC) was compared using Wilcoxon rank-sum test with Benjamini–Hochberg correction.
Defining differentially accessible promoters using the EPD
We downloaded a set of curated promoter annotations from the EPD43 available through EPDnew for build hg38 (https://epd.expasy.org/epd/human/human_database.php?db=human). We then intersected these annotations with peaks that were differentially accessible in Ts21 HSCs (as compared with disomy HSCs) using the genomicRanges R package. Differentially accessible promoters (FDR < 0.1) were then tested to determine whether they were more likely to be linked to a significantly DEG (FDR < 0.1) using a Fisher’s exact test.
Defining functional enhancers in ATAC data using ABC
We defined functional enhancers as those with a high potential to regulate gene expression. Following this definition, we used the ABC model to construct genome-wide maps of enhancer–gene connections45. The ABC scores generated for each enhancer reflect the chromatin activity and chromosome interaction between the enhancer and surrounding genes. We utilized merged ATAC-seq mapping results from disomy HSCs and Ts21 HSCs to identify potential enhancer locations and quantify their chromatin activity. We applied the averaged Hi-C contact data provided by the ABC authors to map contact interactions between chromosomal regions at 5-kb resolution. The default parameters and thresholds were then used to run the ABC model pipeline available on GitHub for implicating a list of relevant enhancers. For motif analyses, non-promoter enhancers were kept (those ABC enhancers that did not overlap with an EPD promoter).
Motif enrichment analysis in differentially accessible peaks using monaLisa
To identify transcription factor motifs that are responsible for differential accessibility, we utilized the JASPAR2022 (v0.99.7)67 and monaLisa (v0.63.0)68 R packages. First, liver HSC peaks with FDR < 0.1 in a Ts21 versus disomy HSC analysis were identified and stratified into separate bins based on the direction of differential accessibility. Next, to account for any size differences between groups, we resized all peaks to the median peak length across all groups. After accounting for differences in peak length, we used the ‘calcBinnedMotifEnrR’ function from monaLisa, which internally corrects for GC content and identifies motifs that are significantly enriched in either Ts21-biased or disomy-biased peaks. The motif enrichment analysis above was repeated using all differentially accessible peaks, only promoters and only enhancers.
Next, to assess differential accessibility of promoters, we downloaded a set of curated promoter annotations from the EPDnew database for hg38 (https://epd.expasy.org/epd/human/human_database.php?db=human). We then intersected these annotations with peaks that were differentially accessible using the genomicRanges R package. Differentially accessible promoters were then tested to determine whether they were more likely to be linked to a significantly DEG (FDR < 0.1) using a Fisher’s exact test. The motif enrichment analysis above was repeated using the EPD promoters.
To assess utilization of the AP-1 motif between Ts21 and disomy, we specifically used motifs for AP-1 identified in Isakova et al.44. We first identified which motifs belonged to the TRA-response element (TRE) or CRE family by calculating the similarity between motifs as the Pearson correlation between the position frequency matrices using the ‘MotifSimilarity’ function from monaLisa. We then identified two distinct clusters of motifs using hclust() and cutree() R functions with K = 2. These clusters corresponded to the TRE and the CRE motif families. All TRE and CRE motifs were then used in the pipeline described above to identify differential usage of AP-1 motifs. Gene Ontology terms were then assessed using enrichR.
Finally, we quantified the contribution of each motif to differential accessibility in Ts21 HSCs. We utilized motifmatchR (v1.20.0) to identify ABC enhancers or EPD promoters that contain a significant match to each motif. Then, for each motif, we fit a linear regression model logFC ~ match + peak_type + match:peak_type, where match indicates whether the peak contains a significant match of the motif, peak_type indicates whether the peak is an enhancer or promoter, and logFC represents the log fold change values of the peak. This model allows each motif to have a different effect on differential accessibility between enhancers and promoters. To quantify the contribution of the motif to overall differential accessibility, we computed the multiple R2 from the regression model.
Assessing enrichment of GWAS SNPs in single cells using SCAVENGE
Trait-relevant individual cells were calculated using SCAVENGE69, which combines network propagation and SNP enrichment analysis to map causal variants to their cellular context. First, variants with posterior inclusion probability (PIP) > 0.001 were downloaded in the format used by Yu et al.69, which were originally processed and described within Vuckovic et al.70. Second, the full multiome ATAC dataset (including all cell populations) was used as input to downstream tasks. A mutual k-nearest neighbour graph was computed to represent the relationship between neighbouring single cells, using k = 30. Next, g-chromVAR71 was used to calculate bias-corrected z scores for each tested trait and each single cell to estimate cell-trait relevance. The top 5% ranked cells served as seed cells for the SCAVENGE network propagation, which were further scaled and normalized to calculate the final SCAVENGE trait relevance score (TRS).
SCAVENGE TRS was contrasted between different lineage branches using a pseudobulk approach similar to the approach that we used for differential expression analysis. On a per-sample-and-branch level, we pseudobulked the SCAVENGE scores by summing SCAVENGE scores across all cells of the same branch and of the same sample. We then applied three linear models to assess the effect of each of the three branches on the SCAVENGE TRS. In each model, we accounted for the number of cells within the pseudobulk as a covariate. We determined significance by multiple test corrections at FDR < 0.1 across the 9 analyses (3 branches (1, 2, 3) and 3 traits (RBC, white blood cell and lymphocyte counts)).
Identification and analysis of peak–gene links using SCENT
SCENT72 was used to identify peak–gene links in disomic HSCs and Ts21 HSCs, by correlating peak accessibility (binarized) and gene expression (raw) counts. SCENT is a Poisson regression model that recomputes the standard errors in the model coefficients by using bootstrapping, which helps to maintain false-positive rates. We first identified all peak–gene combinations within 500 kb of each other. Separately within disomic and Ts21 HSCs, we retained all peak–gene combinations where more than 5% of cells were accessible and expressed the peak and the gene, retaining a total of 38,478 peak–gene combinations. Owing to biological differences between disomy and Ts21, this was an overlapping but not identical peak–gene set tested in disomic and Ts21 HSCs.
Next, we applied a SCENT model using binarized peak accessibility, percent mitochondrial reads, log(number of UMIs) and sample as the covariates, and expression counts as the dependent variable to each peak–gene combination. On the basis of the SCENT paper, we set up an iterative bootstrapping scheme to balance runtime and P value accuracy based on the Poisson regression model P values, where P > 0.1 consisted of 100 bootstraps, P < 0.1 consisted of 1,000 bootstraps and P < 0.01 consisted of 10,000 bootstraps. SCENT was performed on a computing cluster using chunks of 100–500 peak–gene sets. We corrected for multiple testing using FDR, and determined significant peak–gene links at FDR < 0.2. Finally, we repeated the Ts21 analysis after downsampling the 3,784 Ts21 HSCs to match the sample size of 2,431 disomic HSCs to assess the impact of cell population size.
Furthermore, we used SCENT to test whether the effect of peak accessibility on gene expression is modified by trisomy. To do so, we included an interaction term between ATAC peak accessibility and Ts21 status in the SCENT model to address whether the effect of accessibility on expression depends on trisomy. We applied the new SCENT interaction model to a combined Ts21 and disomic HSC dataset. The analysis was performed on all significant peak–gene links in the disomic-only or Ts21-only analyses, and significant interaction terms were assessed at FDR < 0.2.
One reason why Ts21-only peak–gene links were only identified in Ts21 yet have no significant interaction term would be because of a lack of gene expression or peak accessibility in disomic cells. To test whether Ts21-only peak–gene links without a significant interaction term were more accessible and expressed in Ts21 HSCs than in disomic HSCs, we used differential accessibility (from the 10X multiome ATAC) and differential expression results (from the large scRNA-seq analysis). To calculate differential accessibility in the 10X multiome ATAC, we computed pseudobulk profiles and performed limma-voom with trisomy status as the covariate of interest, as similarly described in the large scRNA-seq analysis. We filtered peaks for differential analysis using the filterByExpr() function in edgeR. We performed two separate binomial tests to assess whether the list of (1) Ts21-only peaks or (2) Ts21-only genes were upregulated in Ts21 compared with all other peaks or genes tested in differential analyses. We defined upregulated in Ts21 as nominal P < 0.05 and logFC > 0 in the limma-voom results.
We performed two enrichment tests with regard to our Ts21 peak–gene links and RBC GWAS. We defined SCENT peaks and SCENT genes according to three significance thresholds from the SCENT peak–gene analysis: FDR < 0.1, FDR < 0.2 and nominal P < 0.05, and identified GWAS-related peaks and genes using fine-mapped RBC GWAS SNPs at PIP > 0.2. In the first analysis, we assessed the enrichment of fine-mapped RBC GWAS SNPs (PIP > 0.2) within Ts21 SCENT peaks (also known as active cis-regulatory regions) using a Fisher’s exact test. Here, we are comparing whether peaks with a GWAS variant are more likely to be associated with gene expression than with a background set of peaks without a GWAS variant, which include peaks accessible (in more than 5% of cells) and close to a gene (less than 500 kb) that is expressed (in more than 5% of cells). In the second analysis, we assessed the enrichment of differential expression at target GWAS genes defined by Ts21 SCENT peak–gene links. We defined differential expression as FDR < 0.1 in the differential expression analysis of HSCs using the large scRNA-seq data. We subsetted to significant peak–gene links and calculated a 2 × 2 contingency table reflecting (1) whether the peak contained a fine-mapped variant, and (2) whether the target gene is differentially expressed. A Fisher’s exact test assessed the hypothesis that important enhancers (containing fine-mapped GWAS SNPs) have a role in differential expression within HSCs (affecting their target genes).
Within the results, we report the enrichment P values within the text when using the set of peak–gene links with nominal P < 0.05 (from SCENT). In addition, when reporting peak–gene examples within the results, we use the nominal P values for the SCENT peak–gene test, the differential expression test or the differentially accessibility test. We include the full summary statistics within the Supplementary tables.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.