Strand-resolved mutagenicity of DNA damage and repair

Genomic annotation

The C3H/HeJ mouse strain reference genome assembly C3H_HeJ_v1 (ref. ⁵⁸) was used for read mapping, annotation and analysis. WGS regions with abnormal read coverage (ARC regions; 12.7% of the genome) were masked from analysis, as previously described². Gene annotation was obtained from Ensembl v.91 (ref. ⁵⁹).

Mutation asymmetry

Mutation calling and quality filtering were performed using WGS of 371 DEN-induced liver tumours from n = 104 male C3H mice (Supplementary Table 1), as previously reported². All mutation data were derived from sequence data in the European Nucleotide Archive (ENA) under accession PRJEB37808, and processed files directly used as input for this work are publicly available².

Genomic segmentation on mutational asymmetry was performed as previously reported². Mutational strand asymmetry was scored for each genomic segment using the relative difference metric S = (F − R)/(F + R) where F is the rate of mutations from T on the forward (plus) strand of the reference genome and R is the rate of mutations from T on the minus strand (mutations from A on the plus strand). A mutational asymmetry score of S > 0.33 was used to identify the inheritance of forward strand lesions and S < −0.33 as the inheritance of reverse strand lesions. A rare subset of tumours (2.7%) exhibited uniform mutational symmetry (more than 99% of autosomal mutations in genomic segments with abs(S) < 0.2; these were labelled ‘symmetric’ tumours.

Except where otherwise stated (within the final results section), analyses were confined to n = 237, clonally distinct DEN-induced tumours that met the combined criteria of: (1) not labelled as symmetric, (2) tumour cellularity of more than 50%, and (3) more than 80% of substitution mutations attributed to the DEN1 signature² by sigFit (v.2.0)⁶⁰.

Relative to the reference genome sequence, a plus (P) strand gene was transcribed using the reverse (R) strand as a template. So, a P strand gene in a genomic segment with R strand lesions (denoted RP orientation) is expected to be subject to TCR. A minus (M) strand gene with forward (F) strand lesions (FM orientation) is also expected to be subject to TCR, as the retained lesions are again on the transcription template strand. Conversely, FP and RM orientation combinations will have lesions on the non-template strand for transcription. For DNA replication, we similarly refer to whether the preferential template for the leading strand contains the retained lesions or whether the preferential template for the lagging strand contains the retained lesions.

Mutation rates and spectra

Mutation rates were calculated as 192 category vectors representing every possible single-nucleotide substitution conditioned on the identity of both the upstream and the downstream nucleotides. Each rate being the observed count of a mutation category divided by the count of the trinucleotide context in the analysed sequence. To report a single aggregate mutation rate, the three rates for each trinucleotide context were summed to give a 64 category vector and the weighted mean of that vector reported as the mutation rate. The vector of weights being the fraction of each trinucleotide in a reference sequence, for example, the composition of the whole genome. Strand-specific mutation rates were calculated with respect to the lesion-containing strand, with both mutation calls and sequence composition reverse complemented for reverse strand lesions. Autosomal chromosomes were considered diploid and the X chromosome haploid (male mice) for the purposes of calculating mutation rates and sequence composition. For the counting of strand-specific mutations, a threshold VAF > 10% was applied to remove mutation calls from contaminating non-clonal cells.

Subtracted spectra plots (Fig. 2c,d) were calculated by subtracting the counts of simulated tumour datasets from those of observed datasets and then scaling as for mutation spectra, so that the absolute area of the histogram summed to 100. Percent repair efficiency (Extended Data Fig. 7j) was calculated as (observed/expected) × 100, where expected was the corresponding mutation rate for non-expressed genes (stratum 1, see below) averaged between the template and non-template strand. Cosine similarity was used as a relative measure of mutation signature similarity. Mutation signature deconvolution was performed using sigFit (v.2.0), with two component signatures (K = 2) chosen based on heuristic goodness-of-fit for integer values of K from 2 to 8, with 2,000 iterations each. Final K = 2 deconvolution used 40,000 iterations.

The expected number of mutations at each position of the analysed transcription factor-binding site (Supplementary Table 2) and nucleosome regions was calculated as a sum of genome-wide rates (mutations per base pair) for that particular trinucleotide context from each tumour that had this region classified as either forward or reverse segment. The genome-wide rate for each tumour was calculated by dividing the number of mutations in a particular trinucleotide context (that fall within genomic space phased to have inherited either a forward or a reverse lesion-containing strand) by the total count of that trinucleotide in that genomic space; this was done separately for forward and reverse segments.

Excess mutations per Mb were calculated as (observed_i,n − expected_i,n) × 10⁶/(count_i), where i is the relative position within the region, count_i represents a total number of regions with non-‘N’ nucleotide at position i, and n is the specific mutation context (for example, mutation from A). Mutation enrichment was calculated as (observed_i,n − expected_i,n)/ (observed_i,n + expected_i,n). Rolling mean values were plotted using windows of 51 bp and 21 bp for nucleosome-centred and CTCF-centred plots, respectively. On the basis of bootstrap sampling of the analysed regions, 95% confidence intervals were calculated.

Multiallelic mutation rates

Aligned reads spanning genomic positions of somatic mutations were re-genotyped using SAMtools mpileup (v1.9)⁶¹. Genotypes supported by 2 or more reads with a nucleotide quality score of 20 or more were reported, considering sites with two alleles as biallelic, those with three or four alleles as multiallelic. For a defined set of mutations, the background composition is the count of mutations in each of the 64 possible trinucleotide contexts. The count of multiallelic mutations in each of those 64 categories was divided by the corresponding background mutation count and the weighted average of those ratios are reported as the multiallelic rate. As for mutation rates, the vector of weights being the fraction of each trinucleotide in a reference sequence, for example, the composition of the whole genome.

Replication time

We generated early–late Repli-seq as previously described⁶² for two mouse hepatocellular carcinoma-derived cell lines (Hep−74.3a and Hepa1-6, obtained from biohippo and the American Type Culture Collection, respectively, and tested for mycoplasma at source), matching for the study cell type⁶³. Furthermore, the tumour from which the Hep-74.3a cell line was derived was induced by a single intraperitoneal injection of DEN at P15 into a C3H/He mouse⁶⁴, thus closely matching the DEN-induced tumours in our study. For each cell line, two ENCODE-style biological replicates were generated with individual BrdU labelling and fluorescence-activated cell sorting (FACS) into early and late S-phase fractions for Repli-seq Illumina sequencing library preparation⁶². Sequencing was performed on Illumina NextSeq550 using a Mid-Output v2.5 kit generating 75 bp paired-end reads, producing a total of 1.2 × 10⁸ read pairs (Hep-74.3a), and Illumina NovaSeq with an S1 flowcell generating 50 bp paired-end reads, producing a total of 3.9 ×1 0⁷ read pairs (Hepa1-6). Sequencing reads were mapped using Bowtie2 (v2.4.5) to the C3H_HeJ_v1 reference genome. SAMtools (v1.15.1) was used for alignment quality filtering (-bSq 20), matepair annotation (fixmate -m) and deduplication (markdup -r -s). After confirming concordance, replicates were aggregated and read coverage was calculated for 10 kb consecutive windows with local smoothing: 50 kb windows with a step-length of 10 kb using the central 10 kb window coordinates using bedtools (v2.30.0) multicov. Windowed read counts were normalized to aggregate library size (tags per million, separately for early (E) and late (L)) and replication time was taken as the relative enrichment (E − L)/(E + L). For replication time analysis, genomic regions were categorized into 21 quantile bins of replication time relative enrichment, and the median value for each bin used in quantile-based visualization and regression analysis. As the Hep-74.3a cell line is better matched for both strain and treatment, these Repli-seq data were used throughout the paper. The results were replicated with matched analyses of the Hepa1-6 Repli-seq data (Extended Data Figs. 2a and 3h–j).

Repli-seq data are available at the ENA at EMBL-EBI under accessions PRJEB72349 (Hep-74.3a) and PRJEB67994 (Hepa1-6).

Replication strand bias

Replication fork directionality (RFD) is a relative difference metric that scales from 1 to −1. RFD values > 0 indicate a consensus rightward progressing replication fork, whereas RFD < 0 indicates a consensus leftward progressing fork. RFD can be directly measured at 1 kb resolution from Okazaki fragment sequencing (OK-seq)⁶⁵, but such data have only been obtained from cultured cells that can be prepared in large quantities with a high fraction in S phase. Alternatively, RFD has been inferred from Repli-seq data, where RFD is calculated as the derivative of the change in replication time along the genome^12,66, but has lower spatial resolution and is dependent on ad hoc filtering. Here we intersected cell-type-matched Repli-seq RFD with higher resolution OK-seq to ensure high-resolution tissue-matched RFD, and removing the need for ad hoc filtering. Replication time was converted to Repli-seq RFD by taking the average of the difference in replication time of the adjacent upstream and downstream windows.

OK-seq data from mouse activated primary splenic B cells⁶⁵ were aligned to the C3H_HeJ_v1 reference genome using Bowtie2 (v2.4.5)⁶⁷, quantified using bedtools multicov and RFD calculated as the relative enrichment of reverse (R) versus forward (F) read coverage (RFD = (R − F)/(R + F))⁶⁸. This OK-seq RFD (OK-RFD) metric was calculated for 10 kb consecutive windows to match Repli-seq RFD analysis. Both OK-RFD and Repli-seq RFD measures were categorized into 21 quantile bins. Subsequent mutation rate analysis used OK-RFD quantile classification but was restricted to those that differed from the corresponding Repli-seq RFD by less than 19% of the category range (four bins). Other OK-seq and Repli-seq datasets (Supplementary Table 3) were processed as outlined above, aligning to the GRCh37 reference genome in the case of human-derived sequences. For comparisons between Repli-seq RFD and high-resolution OK-RFD (Extended Data Fig. 2), OK-RFD was calculated as above but in 1 kb consecutive windows and smoothed (R loess function), with the span parameter set to encompass 25 windows.

For each DEN-induced tumour, we identified all RFD segments that were completely contained within lesion segregation mutational asymmetry segments (as defined above) with |S| > 0.33. For these segments, we resolved the lesion-containing strand to the template of either the leading or lagging replication strand. A forward strand mutation asymmetry (lesions on the forward strand, S > 0.33) and rightward progressing replication fork (RFD > 0) was consensus lagging strand replication over the lesions (Fig. 1e). Similarly S < −0.33 and RFD < 0 was also lagging strand replication over lesions. Consensus leading strand replication over lesions is indicated by S > 0.33, RFD < 0; or S < −0.33, RFD > 0. For the purposes of visualization and the aggregation of equivalent data for increased statistical power, a single replication strand bias (RSB) metric was defined by consistently orienting the strandedness of analyses such that the lesion-containing strand is the reverse strand (compare Extended Data Fig. 3d and 3f). Consequently, new replication and transcription will proceed left to right as the forward strand over a damaged template strand in all RSB figures.

Gene expression

Paired-end, stranded total RNA-seq from unexposed P15 C3H male mouse livers (n = 4, matching the developmental time of mutagenesis) were aligned, annotated and quantified previously². All transcriptome data used were derived from sequence data in Array Express under accession E-MTAB-8518 and are publicly available².

The transcription strand of RNA-seq reads was resolved using read-end and mapping orientation using SAMtools (v.1.7.0) and read pairs exclusively mapping within annotated exons were identified using bedtools intersect (v2.29.2)⁶⁹. Intronic read pairs were defined as those mapping within a genic span, derived from a sense strand transcript and not in the exonic set.

For genes with multiple annotated transcript isoforms, the sum of transcripts per million (TPM) over the isoforms was taken as the expression measure (mature transcript, steady state), although similar results — with the same conclusions — were obtained if the maximum for any one isoform was used. Nascent transcription was quantified by counting read pairs with a mapping quality of more than 10 overlapping intronic regions (defined as intronic in all annotated transcript isoforms of the gene) using bedtools multicov (v2.29.2). The read count was normalized to reads per kilobase of analysed intron for each gene in each sequence library, and then normalized to TPM for each library. The final nascent transcript expression estimate per gene was taken as the mean of nascent TPM over replicate libraries. Nascent transcription estimates could be generated for 85% (n = 17,304) of protein-coding genes.

Gene-based analyses of mutation rates used the genomic extent of the most highly expressed transcript isoform (the primary transcript) based on P15 C3H mouse liver gene expression. Overlapping genes, defined by primary transcript coordinates, were hierarchically excluded from analysis. Starting with the most expressed gene, any overlapping less-expressed genes were excluded. For the plotting of per-gene, per-strand mutation rates (Fig. 3b and Extended Data Fig. 7b–d), only genes spanning more than 2 million nucleotides of strand-resolved tumour genome in aggregate were shown (n = 3,392 genes) to minimize stochastic noise from genes with little power individually to accurately estimate mutation rates. Analyses of aggregating rates by expression bin included all genes within the bin.

Genes with similar estimates of nascent expression were aggregated for analysis of TCR. The sigmoidal distribution relating nascent transcription rate to mutation rate (Fig. 3b) was segmented using linear regression models in the R package Segmented (v1.3-3)⁷⁰. This defined n = 4,649 genes with zero or low-detected nascent expression (less than 0.287 TPM) in which reduced mutation rates associated with TCR are essentially undetectable; subsequently, stratum 1 genes (light blue in plots). Genes expressed at a greater rate than segmentation threshold (more than 3.73 TPM) do not show a further decrease in mutation rate with increased expression; these n = 7,176 highly expressed genes were defined as stratum 6 (bright red in plots). The n = 4,005 genes with intermediate expression (0.287–3.73 TPM) exhibited a log-linear relationship between expression and mutation rate. These were quantile split into strata 2–5, containing approximately 1,000 genes in each strata.

Genomic intersection and bootstrapping

The intersection and subsetting of genomic intervals were performed using bedtools intersect (v2.30.0). For the removal of genic subregions, overlapping genes were merged (bedtools merge), the regions extended 5 kb upstream and downstream (bedtools slop) and removed from pre-defined intervals using bedtools subtract. Genomic window coordinates were defined using bedtools makewindows. Bootstrap analysis, for example, in mutation rate calculations, resampled genomic intervals that met the selection criteria (for example, RFD category 1, non-genic, minus strand lesions) with replacement to the same total count, within the same tumour.

Multivariate regression analysis was performed using the lm function of R. The reference genome was partitioned into consecutive 10 kb windows, and composition-corrected mutation rates were calculated for each window in aggregate across tumours, separately for forward- strand and reverse strand lesions. Windows in a tumour with an unresolved lesion strand or containing lesion strand transitions were excluded. The fraction of nucleotides within a window overlapping genomic extents expressed at more than 1 TPM were separately calculated for template and non-template strand lesions. Replication time and RSB were both annotated for 10 kb windows by overlap with larger-scale replication time and RSB measures described above, taking the consensus measure (most nucleotide span) for the 10 kb window as the value for regression analysis. The fraction of window nucleotides annotated as genic but excluding regions identified as expressed genes was also included as a predictor variable (residual genic). The relative enrichment measures RSB and replication time were bounded (−1,1), whereas other parameters were fractions bounded (0,1). To ensure equal scaling for regression analysis, RSB and replication time were rescaled to the (0,1) range as f = 1 − (1 − r)/2, where r is the relative enrichment metric and f is the rescaled fractional range. Regression models were constructed with mutation rate as the outcome variable and other variables as independent predictor variables.

Substitution mutation clusters

For each nucleotide substitution mutation, the closest adjacent mutation was found. Null expectations of mutation spacing were generated by sampling mutation positions from other tumours without replacement, to generate an identical number of proxy mutations for each tumour. Initial analysis of mutation spacing indicated strong enrichment of mutations spaced less than 11 nt apart and evidence of enrichment to 100 nt spacing. Mutation clusters were defined as chains of mutations within the same tumour spaced less than X nucleotides from adjacent mutations, with X = 11, X = 101 or X = 201 depending on analysis as indicated. Over 97% of X = 101 mutation clusters (29,307 of 30,028) contained only two mutations, 721 clusters contained three mutations and no larger clusters were identified. Of X = 101 clusters from proxy-tumour mutations, 100% contained only two mutations.

For each mutation cluster, if it was located within a lesion segregation mutation asymmetry segment, we annotated the mutations within the cluster with respect to the inferred lesion-containing strand. For a genomic segment containing reverse strand lesions, the leftmost mutation site would be the first used as a template for an extending DNA polymerase (as DNA synthesis extends 5′→3′), and the rightmost mutation site replicated over subsequently. These orientations are reversed for a genomic segment containing forward strand lesions. The first replicated-over mutation site for each cluster was annotated distinctly from subsequent sites in the cluster.

Pairs of mutations were phased to the same chromosome by co-occurrence in the same sequencing read. Sequencing reads were extracted from genomic alignments using SAMtools mpileup (v1.7) where they overlapped both genomic positions of a pair of mutations called from the same tumour and separated by 75 nt or fewer. Any sequencing read supporting the called mutant allele with a phred-scaled quality score ≥ 20 at both mutation positions was taken as support for those mutations occurring on the same chromosome.

Mutation clusters were resolved to preferential leading or lagging strand replication-based RSB measures as defined above. Only the more extreme RSB windows (quantiles 1, 2, 20 and 21; |RSB| > 0.51) were considered for comparisons of leading versus lagging strand asymmetry, so that any strand differences were not swamped by regions with low levels of replicative asymmetry. Clusters were defined with X = 101 as above, resulting in n = 2,791 leading strand and n = 3,289 lagging strand clusters, the difference in count attributable to TCR correlating with leading strand replication (Fig. 1f). Cluster length distributions were compared using a two-sample, two-sided Kolmogorov–Smirnov test (ks.test function in R). To estimate statistical power for detecting differences in cluster size distribution between leading and lagging strands, we simulated distorted length distributions. The lagging strand length distribution vector was partitioned into clusters of length of 10 or less (short) or more than 10 (long) and randomly sampled with replacement to produce a vector of length matching the leading strand vector. Bias sampling between the short and long cluster bins was controlled by parameter d. An undistorted sample of the original distribution would be d = 0; whereas 10% of short clusters sampled from the long bin instead of the short bin would be d = 0.1. Two-sample, two-sided Kolmogorov–Smirnov tests comparing the original to the distorted sample distribution were applied to 100 bootstraps for each tested value of d (0–0.1 in increments of 0.0005), recording nominal significant difference at P < 0.05. The percent of bootstraps supporting nominal significance is the power to detect significance at the tested value of d.

Indel–substitution mutation clusters

Insertion and deletion (indel) mutations were filtered as previously described for base substitutions². For clustering analysis, we only considered indel mutations in lesion strand-resolved autosomal regions where at least three reads support precisely the called mutation. We identified the closest upstream or downstream substitution to each insertion or deletion, called within the same tumour. Null expectation datasets were generated by sampling substitution mutations between tumours as described for substitution mutation clustering above; 100 of these permuted datasets were generated for each tumour. Enrichment of clustering was evaluated by two-sided Fisher’s exact test (fisher.test function in R) considering the observed count of indels with a substitution within 100 bp versus the count of indels without a substitution within 100 bp, as compared with the same values estimated from the average of permuted datasets.

For a pair of sequences that differ by a single substitution and a single indel, there can be multiple equally optimal alignments. We identified all cases where there was a substitution mutation within 100 nt of the indel. For each of these, the ancestral and derived sequences were constructed by editing the mutations into the reference genome sequence, and they were oriented to represent the forward strand being newly synthesized over a lesion-containing template (that is, reverse complemented if the reference genome forward strand was the lesion-containing strand). We considered all possible gap placements within those more than 200 bp (2 × 100 flanks + indel length) alignments between ancestral and derived sequence. All alignments that had a single indel-length gap and one substitution were kept, but multiple solutions fractionally weighted, for example, four equally scoring alignment solutions would each be scored 1/4 = 0.25, whereas an alignment with just one solution would score 1/1 = 1. For the distance between indel and substitution, and the identity of the substituted, inserted or deleted bases were recorded for each weighted solution. Observed indel–substitution clusters were further filtered to ensure at least two sequence reads supported the existence of both the indel and the substitution in the same read (SAMtools v1.7.0 mpileup), confirming that the mutations occur on the same copy of the same chromosome. This filtering was not possible for the permuted data and thus makes our estimate of mutation clustering in the observed data conservative.

To consider whether substitutions were preferentially located upstream or downstream of the indel with respect to synthesis over the lesion strand, we considered both the full set of indel–substitution mutation clusters and additionally the subset where all equally scoring alignments placed the substitution on a single side of the indel. To generate a null expectation, for each of these datasets, the annotation of the lesion strand was randomly permuted, the distribution of biases from 10,000 permuted datasets were used to derive an empirical P value for each considered set of indel–substitution clusters.

Transcription-coupled repair

Annotated genes (Ensembl v91) were partitioned into six expression strata based on P15 liver RNA-seq (see above). For each tumour, genes were identified that were wholly contained within a mutation asymmetry segment. Using the annotated transcriptional orientation of the gene and mutational asymmetry of the tumour, each of these genes was categorized as either template strand lesion or non-template strand lesion.

Mouse colony management

Animal experimentation was carried out in accordance with the Animals (Scientific Procedures) Act 1986 (UK) and with the approval of the Cancer Research UK Cambridge Institute Animal Welfare and Ethical Review Body (AWERB). Animals were maintained using standard husbandry: mice were group housed in Tecniplast GM500 IVC cages with a 12–12-h light–dark cycle and ad libitum access to water, food (LabDiet 5058) and environmental enrichments. Ethical approval, tumour size limits, sample size choice, randomization and blinding for the tumour samples have been previously reported². At least three biological replicates were included for ATAC-seq and ChIP–seq experiments.

ATAC-seq

Liver samples from P15 mice (matching the developmental time of mutagenesis) were isolated and flash frozen. ATAC-seq was performed as previously described⁷¹, with minor modifications to the nuclear isolation steps (in step 1, 1 ml of 1× homogenizer buffer was used instead of 2 ml; in step 4, douncing was performed with 30 strokes instead of 20). Pooled libraries were sequenced on a NovaSeq6000 (Illumina) to produce paired-end 50 bp reads, according to the manufacturer’s instructions. Experiments were performed with three biological replicates.

ATAC-seq data processing and analysis

ATAC-seq data processing was performed using a Snakemake pipeline (v6.1.1)⁷². Adaptor sequences were removed using cutadapt (v2.6)⁷³. Reads were aligned to the reference genome (Ensembl v91: C3H_HeJ_v1 (ref. ⁵⁹)) using BWA (v0.7.17)⁷⁴. Data from multiple lanes were merged before deduplication; duplicates were marked using Picard (v2.23.8)⁷⁵. Reads overlapping ARC regions were removed using SAMtools (v1.9). Reads aligning to mitochondrial DNA were excluded from further analysis. Read positions aligning to forward and reverse strands were offset by +5 bp and −4 bp, respectively, to represent the middle of the transposition event, as previously described⁷⁶. ATAC-seq peaks were called using MACS2 (v2.1.2)⁷⁷ on pooled data containing all replicates. Single-nucleotide-resolution chromatin accessibility was measured and plotted as coverage of ATAC-seq ‘tags’ (Tn5 insertion sites, adjusted to represent the middle of the transposition event, as described above).

ATAC-seq data are available from Array Express at EMBL-EBI under accession E-MTAB-11780.

Nucleosome positioning analysis

We used nucleosome positions determined through chemical profiling of mouse embryonic stem cells³⁹ using a nucleosome centre positioning score to signify the prevalence of nucleosome dyads for a given genomic position. We transferred genome coordinates from mm9 to mm10 using UCSC liftover⁷⁸, before using halLiftover (v2.1) to derive expanded C3H-specific coordinates, considering only unique non-overlapping and syntenic positions. The top 4 million dyad positions were selected based on the nucleosome centre positioning score.

The positions and span of the major groove (either facing out or into the histones relative to the dyad) were calculated with the centre of the major groove facing inwards, repeating every ±10.3 bp away from the dyad position and spanning 5.15 bp (ref. ¹⁰).

CTCF ChIP–seq

Livers from P15 mice (matching the developmental time of mutagenesis) were perfused in situ with PBS and then dissected, minced, cross-linked using 1% formaldehyde solution for 20 min, quenched for 10 min with 250 mM glycine, washed twice with ice-cold PBS and then stored as tissue pellets at –80 °C. Tissues were homogenized using a dounce tissue grinder, washed twice with PBS and lysed according to published protocols⁷⁹. Chromatin was sonicated to an average fragment length of 300 bp using a Misonix tip sonicator 3000. To negate batch effects and allow multiple ChIP experiments to be performed using the same tissue, we pooled ten livers for each experiment; 0.5 g of washed homogenized tissue was used for each ChIP, using 20 μg CTCF antibody (rabbit polyclonal; 07-729, lot 2517762, Merck Millipore). Library preparation was performed using immunoprecipitated DNA or input DNA (maximum 50 ng) as previously described⁸⁰ with the ThruPLEX DNA-Seq library preparation protocol (Rubicon Genomics). Libraries were quantified by qPCR (Kapa Biosystems), and fragment size was determined using a 2100 Bioanalyzer (Agilent). Pooled libraries were initially sequenced on a MiSeq (Illumina) to ensure balanced pooling, followed by deeper sequencing on a HiSeq4000 (Illumina) to produce paired-end 150 bp reads, according to the manufacturer’s instructions; only HiSeq libraries were used for downstream analyses. Experiments were performed with five biological replicates.

To identify ChIP–seq-positive regions, we trimmed the HiSeq sequencing reads to 50 bp and then aligned them using BWA (v0.7.17) using default parameters. Uniquely mapping reads were selected for further analysis. Peaks were identified for each ChIP library and input control using MACS2 (v2.1.2) callpeak with default parameters, and all peaks with a q > 0.05 were included in downstream analyses. Input libraries were used to filter spurious peaks associated with a high-input signal using the GreyListChIP R package⁸¹. Biologically reproducible peaks were identified by merging ChIP–seq peaks defined as above from individual replicates and selecting those that overlapped with two or more individual replicate peaks.

ChIP–seq data are available from Array Express at EMBL-EBI under accession E-MTAB-11959.

Transcription factor binding site identification and analysis

ChIP–seq data for transcription factors, apart from CTCF (see above), were obtained from Life Science Database Archive (https://dbarchive.biosciencedbc.jp/datameta-list-e.html) with genomic coordinates for the mm9 reference assembly. Liver-specific ChIP–seq was used whenever possible, otherwise files marked with ‘All cell types’ were used instead (Supplementary Table 2). Genomic coordinates were lifted to mm10 using liftOver, and then lifted to the C3H genome assembly using halLiftover (as above). Overlapping ChIP–seq regions were merged, using the outermost coordinates as the new start/end of regions. FASTA sequences of the regions were extracted using bedtools getfasta (v2.27.1) and used together with non-redundant vertebrate position weight matrices from JASPAR⁸² to run FIMO (MEME suite)⁸³ with default parameters to detect motifs within ChIP–seq peaks. Those motifs were then filtered based on an overlap with ATAC-seq peaks (defined above) to ensure that the analysed set was within open chromatin regions of P15 C3H mouse livers. For CTCF-binding site analysis, in-house generated ChIP–seq data (described above) was used. For wider flank (1 kb) analysis, all motifs (JASPAR matrix profile MA0139.1) within the peaks were retained regardless of ATAC-seq intersection, allowing multiple motifs per ChIP–seq peak.

For high-resolution CTCF and transcription factor-binding site analysis (Extended Data Fig. 8), only one highest-scoring motif per ChIP–seq peak was retained. Similarly, for aggregate transcription factor analysis, only one highest-scoring motif per ChIP–seq peak was retained if it overlapped with an ATAC-seq peak. A total of 129 transcription factors were analysed based on ChIP–seq and position weight matrix availability, RNA-seq support for transcription factor expression (1 TPM or more) in the P15 mouse liver². In all the analyses, ‘bit score’ refers to the information content of the whole position. Within the motif, only mutations with the reference nucleotide matching the consensus nucleotide from position weight matrix were retained. In the flanks, mutations from all reference nucleotides were used.

CTCF structural analysis

High-resolution crystal structures for CTCF zinc fingers complexed with binding site DNA were obtained from the Protein Data Bank (PDB; 5YEL, 5T0U and 5UND)^84,85. As no single structure contains all 11 CTCF zinc fingers, a composite structure was compiled through alignment using PyMOL (v2.5.2)⁸⁶ align function. The PDB 5UND A chain 406–556 was aligned to the PDB 5T0U A chain (root mean square deviation of 1.06 Å); then the PDB 5YEL A chain was aligned to the PDB 5UND chain A (root mean square deviation of 1.3 Å). The composite image (Extended Data Fig. 8d) then shows the PDB 5T0U A chain 289–405, PDB 5UND A chain 406–488 and PDB 5YEL A chain 489–556, which collectively spans CTCF zinc fingers 2–11 inclusive. The bound DNA strands comprise the PDB 5YEL F chain 1–24, PDB 5T0U C chain 7–23, PDB 5T0U B chain 1–18 and PDB 5YEL E chain 5–26.

Protein–DNA contact distance measurements were performed using the Protein Contacts Atlas⁸⁷. Non-covalent interatomic contacts of 3 Å or less between CTCF protein and DNA were considered close contacts. Close contacts of atoms within phosphate groups or deoxyribose were considered backbone, and other DNA contacts were annotated as base contacts. Close base contacts involving atoms expected to acquire DEN-induced mutagenic adducts²³ or structurally equivalent positions in other bases (purines: N6 and O6; pyrimidines: O4, N4 and O2) were annotated as lesion site contacts. Distance measurements were taken separately for each structure (rather than from the composite) and excluded PDB 5T0U nucleotide contacts upstream of binding motif position +1 where this structure substantially deviates from PDB 5YEL. PDB 5T0U is truncated at zinc finger 7, whereas PDB 5YEL extends to zinc finger 11 and makes additional base-specific contacts absent from PDB 5T0U. Close backbone, base and lesion site contacts were reported if the distance threshold criteria were met in any of the three considered structures, although concordance was high in the overlapping regions.

Histology and image analysis

Digitized histology images of DEN-induced tumours² were obtained from Biostudies (accession S-BSST383).

Whole-slide images of tumours that met inclusion criteria (cellularity of more than 50% and DEN1 signature of more than 80%) were annotated in QuPath (v0.2.2)⁸⁸ using the polygon tool to include neoplastic tissue and excluded adjacent parenchyma, cyst cavities, processing artefacts and white space. For tumours with multiple transections, only a single whole-slide image was used. Annotations were reviewed for quality by a histopathologist (S.J.A.). Using Groovy in QuPath, annotated regions were tessellated into fixed size, non-overlapping 256 × 256 µm tiles. For segmentation of epithelioid nuclei, a pre-trained StarDist⁸⁹ model (he_heavy_augment.zip) was downloaded from https://github.com/stardist/stardist-imagej/tree/master/src/main/resources/models/2D, and an inference instance was deployed using Groovy across the tiles in QuPath, built from source with Tensorflow⁹⁰, with a minimum detection threshold of 0.5. Python (v3.9.7) was used for downstream analyses. Data were filtered to exclude extreme outliers: tiles with 43 nuclei per tile or fewer; nuclei with an area of 227.18386 µm or more, circularity of 0.4841 or less, or non-computable circularity were excluded. From the 245 whole-slide images (n = 237 mutationally asymmetric tumours and n = 8 symmetric tumours), 70,414 tiles were generated, and 9,999,783 nuclei were segmented (post-filtering). To compute inter-nuclear distance, for each nucleus in a tile represented by its x–y centroid coordinates, nearest neighbours were identified using the k-dimensional tree function from the spatial module of SciPy (v1.7.1)⁹¹. The Euclidean distance for each nearest neighbour pair was computed using the paired distances function from the metrics module of SciKit-Learn (v1.0.2)⁹². The median nuclear area, median nuclei per tile and median inter-nuclear distances were compared between asymmetric and symmetric tumours using a two-tailed Wilcoxon rank-sum test.

Symmetric versus asymmetric tumour comparison

Mutationally symmetric tumours (defined above; more than 99% of autosomal mutations in genomic segments with abs(S) < 0.2) were filtered to the subset that met the same inclusion criteria as the other n = 237 tumours analysed in this study (more than 50% cellularity (after adjusting for the presence of two genomes) and more than 80% substitution mutations attributed to the DEN1 signature). Eight tumours met this criteria. We subsequently show that these tumours are not whole-genome duplicated, but that they contain both daughter lineages of an originally mutagenized cell (Extended Data Fig. 10b). For each autosomal variant in a tumour, we calculated its VAF quantile position among point mutations in that tumour, using the R ecdf function⁹³. The quantile positions (range 0–1) were grouped into consecutive bins of 0.005 unit span, that is, the 0.995–1.0 was the rightmost bin representing the top 0.5% of VAF values for mutations in a tumour. The mutations within a VAF quantile bin were classified as either overlapping or not overlapping with the genomic span of the most highly expressed genes (stratum 6) using the R data.table foverlaps function⁹⁴. The counts of overlapping and non-overlapping mutations from the focal tumour were compared as a two-tailed Fisher’s exact test to the equivalent counts aggregated from all asymmetric tumours (excluding the focal tumour in the case of asymmetric focal tumours for the calculation of background expectation). The same analysis was performed in aggregate for all symmetric tumours (n = 8) compared with all asymmetric tumours (n = 237). The calculations were repeated for each of the 200 consecutive bins to demonstrate the VAF range over which high VAF mutations are preferentially enriched in highly expressed genes specifically in symmetric tumours, as predicted under NER-TRIM.

Computational analysis environment

Except where otherwise noted, analysis was performed in Conda environments and choreographed with Snakemake⁷² running in an LSF 965 or Univa Grid Engine batch control system (Supplementary Table 3). Statistical tests were performed in R (v4.0.5) using fisher.test, ks.test, cor.test and wilcox.test functions for Fisher’s exact, Kolmogorov–Smirnov, Pearson’s and Spearman’s correlation and Wilcoxon tests, respectively. Graphics were generated using R.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link