The progressive plasticity during colorectal cancer is happening

Getting Started with Xenium Panel Designer: Calculation and Analysis of Module Scores on the Top of each Heatmap in Extended Data Fig. 6

Module scores on top of each heatmap in Extended Data Fig. 6 were calculated with the AddModuleScore function from Seurat68 using the genes listed in each heatmap. This score represents the average expression levels of a gene set. The score was calculated for each spot and a box plot was used to show the distribution of module scores in each microregion.
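The idea behind a per-spot module score can be sketched as follows. This is a simplified illustration, not Seurat's implementation: `AddModuleScore` selects expression-matched control genes automatically, whereas here the control genes are passed in explicitly, and the function names `module_score` and `median_by_microregion` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def module_score(expr, gene_set, control_genes):
    """Mean expression of the gene set minus mean expression of control genes.

    `expr` maps gene name -> normalized expression for a single spot.
    Passing the control genes in by hand is an assumption of this sketch.
    """
    target = mean(expr[g] for g in gene_set)
    control = mean(expr[g] for g in control_genes)
    return target - control


def median_by_microregion(scores, microregion_of):
    """Group per-spot scores by microregion and return the median of each group."""
    grouped = defaultdict(list)
    for spot, score in scores.items():
        grouped[microregion_of[spot]].append(score)
    return {region: median(values) for region, values in grouped.items()}
```

The grouped scores are what a box plot per microregion would summarize; the median is shown here only as one summary statistic.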

Custom Xenium gene and mutation probes were designed using Xenium Panel Designer (https://cloud.10xgenomics.com/xenium-panel-designer) following the ‘Getting Started with Xenium Panel Design’ instructions (https://www.10xgenomics.com/support/in-situ-gene-expression/documentation/steps/panel-design/xenium-panel-getting-started#design-tool). In brief, 21-bp sequences flanking the targeted transcribed variant site were curated from the Ensembl canonical transcript (Ensembl v.100). Three of the four possible ligation junctions were evaluated in the case of deletions and two of the four for the WT allele. Sites where only non-preferred junctions were available were not included. The two bases of the ligation junction sequence were the last base of the RBD5 (RNA binding domain) probe and the first base of the RBD3 probe. Preferred junctions were always prioritized over neutral junctions unless a neutral junction was necessary to avoid hairpins, homopolymer regions, dimers or an unfavourable annealing temperature. The length of both probes was adjusted from the 21-bp starting length to reach a target annealing temperature of between 50 °C and 70 °C. Sites predicted by IDT to form dimers or hairpins were not included. Sites with at least one homopolymer region of five consecutive bases were excluded.
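The homopolymer exclusion rule can be illustrated with a short check. This is a minimal sketch assuming probe sequences are plain A/C/G/T strings; the function name `has_homopolymer` is hypothetical and is not part of Xenium Panel Designer.

```python
import re


def has_homopolymer(sequence, run_length=5):
    """Return True if the sequence contains `run_length` identical consecutive bases."""
    # A base followed by (run_length - 1) copies of itself.
    pattern = r"([ACGT])\1{%d}" % (run_length - 1)
    return re.search(pattern, sequence.upper()) is not None


def filter_sites(candidate_sequences):
    """Keep only candidate sequences without a homopolymer run of five bases."""
    return [seq for seq in candidate_sequences if not has_homopolymer(seq)]
```

A run longer than five bases necessarily contains a run of five, so the same check covers longer homopolymers.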

3D feature volume for tumour, HT397B1 and lymph node annotated CODEX imaging data using Napari and PASTE70

The 3D feature volumes were generated, and points on the surface mesh were coloured on the basis of the voxel value at the corresponding location. A feature volume is a volume that describes a feature from the serial section dataset, for example, the expression of a given gene. Feature volumes used in this analysis were constructed in the following manner. The 3D neighbourhoods were binned at the same resolution as the serial sections for which the feature was applicable. The gaps between sections were filled with the binned feature. The resulting volume was of the same shape as the integrated neighbourhood volume, and the value of each voxel was the aggregated feature count for that voxel. For HT268B1, the features used were log-transformed expression of TYMP1 and IGLC2. For HT397B1, we used fibroblast and immune cell fraction. Cells were annotated as described in the section ‘Cell-type annotation of CODEX imaging data’. The surface mesh was visualized using Napari (https://github.com/napari/napari) and contrast was adjusted on a volume-to-volume basis. We also visualized the HT397B1 tissue volume with the Imaris platform, for which we generated surfaces from the following CODEX markers: pan-cytokeratin (epithelial), CD45 (immune) and SMA (stromal).

The neighbourhoods that were used to construct the new volume were classified as either tumour-positive or tumour-negative. This 3D tumour mask was then smoothed with a Gaussian kernel (sigma = 1.0). The resulting values were used as input to generate a surface mesh for the tumour volume. We used the scikit-image implementation (skimage.measure.marching_cubes) of the marching cubes algorithm with default parameters.

The update of the alignment tool PASTE70 enabled partial image alignment. Serial sections of the same tumour piece were aligned pairwise with default settings. Each Visium data point in each section received new coordinates based on the alignment results. We then identified the nearest spot on each adjacent section for every spot, connecting them along the z axis. This process allowed for the linking of all the spots on the z axis. We counted connected spots after removing stromal spots to determine if one microregion was connected to another in an adjacent section. If any microregion on one section connected to the next section with more than three shared spots, then we considered these two microregions, located on different sections, as connected in 3D space and forming the same tumour volume. This connection was labelled as volume 1, volume 2, and so forth in the figures (Fig. 5d,e and Extended Data Fig. 9a–d…
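The linking rule can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: spots are given as `(x, y, region)` tuples with already aligned coordinates, stromal spots carry the region `None`, nearest neighbours are found by brute force, and links are counted in one direction only. The function name `link_microregions` is hypothetical.

```python
from collections import Counter
from math import dist


def link_microregions(section_a, section_b, min_shared=3, stroma=None):
    """Return pairs of microregions joined by more than `min_shared` spot links.

    Each section is a list of (x, y, region) tuples holding aligned
    coordinates; `region` equals `stroma` for non-tumour spots.
    """
    if not section_b:
        return set()
    links = Counter()
    for xa, ya, region_a in section_a:
        # Nearest spot on the adjacent section (brute force, fine for a sketch).
        _, _, region_b = min(
            section_b, key=lambda s: dist((xa, ya), (s[0], s[1]))
        )
        # Count the link only when both ends are tumour spots.
        if region_a != stroma and region_b != stroma:
            links[(region_a, region_b)] += 1
    return {pair for pair, n in links.items() if n > min_shared}
```

Chaining the returned pairs across all adjacent sections (for example with a union-find structure) would group microregions into the numbered volumes.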

The set of all possible annotations comprised epithelial, CD4 T cell, CD8 T cell, regulatory T cell, T cell, macrophage, macrophage-M2, B cell, dendritic, immune and fibroblast. For some images, not all proteins required to gate a specific cell type were present. For example, CD4 was not available for the annotation of CD4 T cells in most image panels. In those instances, the gating strategy allowed cells to be labelled more broadly (for example, as T cell) when the specific proteins were not present. If a cell was negative for all the steps in the gating strategy, it was marked as unlabelled. The code for image format conversion can be found at GitHub.

Afterwards, we ran the Morph toolset (https://github.com/ding-lab/morph), which uses mathematical morphology to refine the tumour microregions. That is, if the total number of spots in a microregion was less than or equal to three, then we labelled all such spots as stroma. Morph also assigned a layer to each spot of a tumour microregion through a sequence of mathematical morphology operations, as described in the ‘Spot-depth correlation analysis’ section.
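The size rule for small microregions can be sketched in a few lines. This is not the Morph implementation; it only illustrates the relabelling rule stated above, and the function name and the `"stroma"` label string are hypothetical.

```python
from collections import Counter


def relabel_small_microregions(labels, max_size=3, stroma="stroma"):
    """Relabel every spot of a microregion with at most `max_size` spots as stroma.

    `labels` maps spot barcode -> microregion ID (or the stroma label).
    """
    sizes = Counter(label for label in labels.values() if label != stroma)
    return {
        spot: stroma if label != stroma and sizes[label] <= max_size else label
        for spot, label in labels.items()
    }
```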

Source: Tumour evolution and microenvironment interactions in 2D and 3D space

Multi-Vector Graph-Based Clustering for 3D Neighbourhood Volume Generation and Transfer in Serial Section Experiments

To focus on neighbourhoods most related to the TME biology, we filtered out neighbourhoods with >50% overlap with copy number annotated subclones. Additionally, we excluded neighbourhoods that mapped to fewer than ten total spots across all ST sections for a sample.
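The two filters can be expressed as a single predicate. This is a minimal sketch; the function name and the way overlap is supplied (as a fraction between 0 and 1) are assumptions made for illustration.

```python
def keep_neighbourhood(subclone_overlap, n_spots, max_overlap=0.5, min_spots=10):
    """Apply the two filters described above to one neighbourhood.

    `subclone_overlap` is the fraction of the neighbourhood overlapping
    copy number annotated subclones; `n_spots` is the total number of
    spots it maps to across all ST sections of the sample.
    """
    return subclone_overlap <= max_overlap and n_spots >= min_spots
```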

The Visium ST spots were assigned to neighbourhoods in the following manner. Each spot was assigned the label of the neighbourhood overlapping its spot centroid.

We used a graph-based clustering approach to create multiple data-type-specific volumes for HT397B1, which we integrated with other neighbourhood volumes. All combinations of neighbourhood voxel annotations were identified. These partition combinations were represented in the graph, with edges given by the distance between them. The graph was then clustered to identify integrated neighbourhoods. Hyperparameters for the above clustering process are provided in Supplementary Table 4. 3D neighbourhoods were displayed using the open-source visualization tool Napari (https://github.com/napari/napari).

After the assignment of neighbourhoods for each section, slides were interpolated to generate a 3D neighbourhood volume. For this, we used linear interpolation of neighbourhood assignment probabilities with the torchio library74.
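Linear interpolation of assignment probabilities between two adjacent sections can be sketched for a single pixel as follows. This pure-Python illustration does not use torchio; the function name and the choice of evenly spaced intermediate planes are assumptions.

```python
def interpolate_labels(p_lower, p_upper, n_between):
    """Neighbourhood label for `n_between` planes between two sections.

    `p_lower` and `p_upper` are the neighbourhood probability vectors of
    the same pixel on two adjacent sections. Each intermediate plane
    gets the label with the highest linearly interpolated probability.
    """
    labels = []
    for k in range(1, n_between + 1):
        t = k / (n_between + 1)  # fractional position between the sections
        probs = [(1 - t) * a + t * b for a, b in zip(p_lower, p_upper)]
        labels.append(max(range(len(probs)), key=probs.__getitem__))
    return labels
```

Interpolating probabilities rather than hard labels lets the boundary between two neighbourhoods shift gradually through the gap instead of switching at a fixed plane.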

After this, a slide token was concatenated to the patch tokens. The slide token (representing the slide from which the image tile was selected) was indexed from a trainable embedding of size n_slides × d, where n_slides is the number of slides in the serial section experiment. Because the slide token is passed through the transformer blocks along with the patch tokens, it can interact with all of them, which allows the slide token to learn from useful representations of the patches and makes the model more robust. After the addition of the slide token, the embedded tokens were passed through the transformer blocks of the ViT. All variables and details of the transformer architecture are included in Supplementary Table 4.

Two separate runs of the model were trained, one for HT397B1 (six H&E, four CODEX and two Visium ST slides) and one for HT268B1 (four Visium ST slides). Supplementary Table 4 contains training hyperparameters such as the batch size and number of training steps. For HT268B1, only one instance was trained because only one data type was present. For HT397B1, three model instances (one for each data type) were trained and merged following the procedure described in the section ‘3D neighbourhood construction and integration’.

In the overall loss, λNBHD (maximum of 0.01) and λMSE (set to 1.0) are scaling factors for the neighbourhood loss (LNBHD) and reconstruction loss (LMSE), respectively. λNBHD was increased to its maximum value during training.

Vision Transformer Network: Unsupervised Training in a Novel Neighbourhood Model for Image Representations and Alignment of Representation Patterns

$$\mathbf{L}_{\mathrm{overall}}=\lambda_{\mathrm{NBHD}}\,\mathbf{L}_{\mathrm{NBHD}}+\lambda_{\mathrm{MSE}}\,\mathbf{L}_{\mathrm{MSE}}$$

During training, the autoencoder was optimized for two main objectives: the reconstruction of the expression profile of each patch and the alignment of neighbourhood labels between adjacent sections. These two competing objectives forced the model to learn representation patterns while also keeping neighbourhoods aligned, which counteracted neighbourhood differences due to batch effects. Differences in patch expression were quantified by MSE, whereas neighbourhood adjacency was enforced by minimizing the cross-entropy between patches adjacent to each other in the z direction during training.
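The weighted combination of the two objectives can be sketched in plain Python. The weights (λMSE = 1.0, λNBHD up to 0.01) come from the text; the linear ramp in `nbhd_weight` is an assumption, because the schedule by which λNBHD was increased is not stated here, and all function names are hypothetical.

```python
from math import log


def mse(x, x_hat):
    """Mean squared error between a patch expression profile and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def cross_entropy(p_target, p_pred, eps=1e-12):
    """Cross-entropy between neighbourhood probabilities of two z-adjacent patches."""
    return -sum(t * log(q + eps) for t, q in zip(p_target, p_pred))


def nbhd_weight(step, ramp_steps, max_weight=0.01):
    """Assumed linear ramp of the neighbourhood weight up to its maximum."""
    return max_weight * min(1.0, step / ramp_steps)


def overall_loss(l_nbhd, l_mse, lam_nbhd, lam_mse=1.0):
    """L_overall = lam_nbhd * L_NBHD + lam_mse * L_MSE."""
    return lam_nbhd * l_nbhd + lam_mse * l_mse
```

Starting with a small neighbourhood weight lets reconstruction dominate early in training, with the alignment term gaining influence later.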

Mean squared error on the reconstruction of the input patches is one of the main contributions to the overall loss function.

The neighbourhood model contained a vision transformer (ViT) and an autoencoder. In brief, an autoencoder is an unsupervised training method in which an encoder (embedding component) and a decoder (reconstruction component) work together to learn how input data are generated. The network derives an approximation, Q, to the true generating function, P, from the input. The autoencoder used was asymmetric, meaning that the encoder and decoder were not inverse copies of one another. The encoder used architectures similar to those previously described; details are provided in Supplementary Table 4.

ViTs work on image tokens as input. An image token is a representation of a patch of the input image. During training, image tiles were sampled from a uniform distribution across the set of input sections (Supplementary Fig. 7a). The sampled tile was then split into patches, for which the number of patches was determined by two hyperparameters: patch height (ph) and patch width (pw). Each patch was flattened to a 1 × (ph · pw · c) vector, where c is the number of channels (in the case of spatial transcriptomics data, the number of genes). The unrolled patches were stacked into an n × (ph · pw · c) matrix, where n is the number of patches in the image tile. Each patch of the image tile is represented by a token (row) in this matrix. Before the transformer blocks, the tokens were projected by a linear layer to shape n × d.
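The patch-to-token step can be sketched without any array library. This is an illustration assuming the tile is a nested list of shape height × width × channels and that patches do not overlap; the function name `tile_to_tokens` is hypothetical and the linear projection to d dimensions is omitted.

```python
def tile_to_tokens(tile, ph, pw):
    """Split a tile into non-overlapping patches and flatten each one.

    `tile` is a nested list indexed as tile[row][col][channel]. The
    result has one row per patch, each of length ph * pw * c.
    """
    height, width = len(tile), len(tile[0])
    tokens = []
    for top in range(0, height - ph + 1, ph):
        for left in range(0, width - pw + 1, pw):
            token = [
                value
                for row in range(top, top + ph)
                for col in range(left, left + pw)
                for value in tile[row][col]
            ]
            tokens.append(token)
    return tokens
```

For a 4 × 4 tile with two channels and ph = pw = 2, this yields n = 4 tokens of length 8.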

Expression profiles for each patch were generated differently for image-native data (CODEX and H&E) and point-based data (Visium). For CODEX and H&E patches, the average intensity of each image channel over all the pixels within the patch bounds was calculated. For Visium patches, the contribution of each spot to the expression profile was weighted by its distance to the centre of the patch. This weighting helps to account for the differing number of spots within the patch boundaries.

For registration, we used BigWarp71, which is packaged in the Fiji/ImageJ software application. To register each collection of serial images, we used the first serial section as the fixed image and the second image as the moving image. The second image was then used as the fixed image for registering the third image. Key point registration proceeded in this way for all images in the experiment. Key points were selected for every image transformation. Once key points were selected, a dense displacement field was exported from BigWarp for each image transformation. This dense displacement field was upscaled by a factor of 5 so that it could be used to warp the full-resolution images. The data were then registered using the dense displacement field. The code used for registration is available at GitHub (https://github.com/ding-lab/mushroom/tree/subclone_submission).

Before registration, imaging data underwent the following transformations. Multiplex images were converted to greyscale images of DAPI intensity. The image was then downscaled by a factor of 5 before key point selection. H&E images (also downsampled by a factor of 5) were used for keypoint selection with Visium data.

Following the same threshold as used for Visium, we evaluated spatial cell–cell interactions in the sample with COMMOT69 using the CellChat database. Median sender and receiver signals for each interaction family were compared for tumour boundary spots using the rank-sum test. Interaction pathways with a signal difference greater than 0.1 and FDR less than 0.05 were considered significantly boundary-enriched. Boundary DEGs were identified with the FindMarkers function on three sets of comparisons: boundary/tumour, boundary/TME and boundary/all non-boundary. A boundary DEG has changed its P value 0.25 in the boundary/non-boundary test and log2(fold change) > 0 in the other tests.

The correlation coefficients for covariant purity and layer correlation were calculated by dividing the layer number by the total number of layers in a tumour. Purity was inferred with deconvolution when there was matching snRNA-seq data (deconvoluted tumour fraction per spot by RCTD), or with ESTIMATE (that is, tumour purity estimate score per spot) otherwise. Each gene was checked against a set of snRNA-seq-derived non-malignant gene lists to ensure that the change in fraction did not derive from a shift in cell-type composition. Finally, we performed multiple-testing adjustments for all tests done in each ST section.

Quantile values of marker expression across all cells, taken at small increments from 0.0 to 1.0, were used to determine whether cells were positive or negative for each marker. This method allowed us to avoid using a single threshold for all markers, as markers have different expression across the samples. The intuition behind our approach is that the knee-point at which quantile values start increasing rapidly reflects where the population changes from negative to positive. This can also be seen as a way of finding the knee-point in the distribution without fitting an estimated distribution to the data. To validate our labels, we compared the expression of tumour markers with the tumour regions manually annotated by the doctor in the images. To identify the subset of cells with high marker expression, we performed a knee-point analysis on all cells, reporting a percentage of them positive for any tumour marker. 3f and 5a–d.
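One way to locate such a knee-point is sketched below. The knee detection rule used here (the point lying farthest below the straight line joining the ends of the quantile curve) is an assumption chosen for illustration; the text does not specify which knee-finding procedure was used, and the function names are hypothetical.

```python
def quantile_values(values, n_steps=100):
    """Expression values at evenly spaced quantiles from 0.0 to 1.0."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i / n_steps * last)] for i in range(n_steps + 1)]


def knee_index(curve):
    """Index of the point farthest below the chord joining the curve's endpoints.

    Assumes a non-decreasing curve with at least two points.
    """
    n = len(curve) - 1
    y0, y1 = curve[0], curve[-1]
    gaps = [y0 + (y1 - y0) * i / n - y for i, y in enumerate(curve)]
    return max(range(len(curve)), key=gaps.__getitem__)


def positive_cells(values, n_steps=100):
    """Flag cells whose expression exceeds the value at the knee-point."""
    curve = quantile_values(values, n_steps)
    threshold = curve[knee_index(curve)]
    return [v > threshold for v in values]
```

Because the threshold is read off each marker's own quantile curve, every marker receives its own cut-off.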

We studied ten cases, four of which were from the BRCA, two of which were from the CRC and one from the PDAC. To obtain subclone-specific DEGs, we used the FindMarkers function in Seurat with the Wilcoxon test option to identify DEGs between the subclones. Cut-offs of P < 0.01, average log2(fold change) > 1 and per cent expression in at least one cell type were applied to select significant DEGs. To infer treatment response, we used the perturbation database LINCS L1000 (ref. 65), specifically the LINCS_L1000_Chem_Pert_down dataset from Enrichr66, to evaluate the gene set overlap between upregulated DEGs in spatial subclones and downregulated genes after compound treatment. The data were sorted by subclone name and the compounds were chosen from the list. The corresponding compound metadata, including mechanism of action, were obtained from CLUE (clue.io, ‘Expanded CMap LINCS Resource 2020 Release’) to add annotation on the heatmap.

The reverse applied to candidate stromal-specific DEGs. If a DEG did not meet both of these requirements to be tumour or stromal specific, it was designated as either tumour-enriched or stromal-enriched based on whether the expression level was higher in tumour or stromal cell types (Supplementary Fig. 8a).

Cell-type assignment was done based on the following known markers: B cell, CD79A, CD79B, CD19, MS4A1, IGHD, CD22 and CD52; cDC1, CADM1, XCR1, CLEC9A, RAB32 and C1orf54; cDC2, CD1C, FCER1A, CLEC10A and CD1E; mregDC, LAMP3, CCR7, FSCN1, CD83 and CCL22; pDC, IL3RA, BCL11A, CLEC4C and NRP1; macrophage, CX3CR1, CD80, CD86, CD163 and MSR1; mast cell, HPGD, TPSB2, HDC, SLC18A2, CPA3 and SLC8A3; endothelial, EMCN, FLT1, PECAM1, VWF, PTPRB, ACTA2 and ANGPT2; fibroblast, COL1A1, COL3A1, COL5A1, LUM and MMP2; pericyte, RGS5, PLXDC1, FN1 and MCAM; NK cell, FCGR3A, GZMA and NCAM1; plasma cell, CD38, SDC1, IGHG1, IGKC and MZB1; T cell, IL7R, CD4, CD8A, CD8B, CD3G, CD3D and CD3E; and regulatory T cell, IL2RA, CTLA4, FOXP3, TNFRSF18 and IKZF2. The breast’s normal cells were annotated with markers: Lum Sec, GABRP, ELF5 and CL28; LumHR, ANKRD30A, ERBB4. Normal epithelial cells in the liver were annotated with the following markers: hepatocyte, ALB, CYP3A7, HMGCS1, ACSS2 and AKR1C1; cholangiocyte, SOX9, CFTR and PKD2. Normal epithelial cells in the pancreas, including ductal, acinar, islet-α, islet-β and islet-γ cells, were annotated with singleR (v.1.8.1) using reference data BaronPancreasData(‘human’).

Calculation of the area of microregions in a lattice of Visium spots using a modified Jaccard similarity score

The microregion density per mm2 was calculated by dividing the number of microregions in a section by the section area. When the area is expressed in µm2, the ratio is multiplied by 10^6 to give the density per mm2.

We used the spot size of 55 µm and the centre-to-centre distance of 100 µm to calculate the area that each spot takes. As shown in Supplementary Fig. 6, the Visium spots form a hexagonal lattice. The repeating unit is made of eight equilateral triangles and is centred at each spot’s centre. Each triangle has a side of 50 µm (half of the spot centre-to-centre distance). Using the area equation of equilateral triangles and multiplying it by 8, we obtained the area of each repeating unit as 8,660 µm2, which is the average area occupied by each spot. We multiplied the spot count by this area and converted the result to obtain the size in square millimetres.
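The arithmetic behind the per-spot area can be checked directly. The sketch below reproduces the 8,660 µm2 figure from the triangle side length given above; the function names are hypothetical, and the density helper simply applies the count-per-area definition.

```python
from math import sqrt


def equilateral_area(side):
    """Area of an equilateral triangle: (sqrt(3) / 4) * side^2."""
    return sqrt(3) / 4 * side ** 2


# Eight triangles with 50 um sides per repeating unit (about 8,660 um^2).
AREA_PER_SPOT_UM2 = 8 * equilateral_area(50.0)


def microregion_area_mm2(n_spots):
    """Approximate microregion area in mm^2 from its spot count."""
    return n_spots * AREA_PER_SPOT_UM2 / 1e6


def density_per_mm2(n_microregions, section_area_um2):
    """Number of microregions per mm^2 for a section area given in um^2."""
    return n_microregions / section_area_um2 * 1e6
```

The same value follows from the area of the hexagonal cell around each spot, (√3/2) × 100², which is also about 8,660 µm2.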

To determine the similarity between two spatial CNV profiles, we used a modified Jaccard similarity score. CNV profiles were defined as a set of genomic windows with an amplification or deletion at least one copy neutral. The two profiles were compared, and overlapping windows were broken down so that both profiles had the same set of windows. Then, the CNV similarity score (Sim) was defined as follows:

irmsizeleft(w) times

InferCNV and CalicoST for Large-scale chromosomal CNV detection with snRNA-seq and Visium data

To detect large-scale chromosomal CNVs using scRNA-seq, snRNA-seq and Visium data, InferCNV (v.1.10.1) was used with default parameters recommended for 10x Genomics data (https://github.com/broadinstitute/inferCNV). InferCNV was run at the sample level and only data that passed quality control were used. For snRNA-seq and scRNA-seq data, all non-malignant cells were used as a reference with the annotation ‘non-tumour’ and all malignant cells had the same annotation ‘tumour’, with the following parameters: analysis_mode=“subclusters”, –cluster_by_groups=T, –denoise=T, and –HMM=T. For Visium ST data, 200 spots annotated as ‘non-malignant’ with the lowest ESTIMATE purity score were used as a reference, and ‘malignant’ spots had their microregion ID as annotation, with the following parameters: window_length=151, analysis_mode=“sample”, –cluster_by_groups=T, –denoise=T, and –HMM=T. CalicoST (https://github.com/raphael-group/CalicoST)63 was run on Visium ST data with the same input annotation (microregion ID). The spots from the same region were considered the smallest unit of analysis. CalicoST was then run with default parameters and the results were manually inspected.

For each sample, we obtained the unfiltered feature–barcode matrix by passing the demultiplexed FASTQ files and associated H&E image to Space Ranger (v.1.3.0, v.2.0.0 and v2.1.0 ‘count’ command using default parameters with reorient-images enabled) and the prebuilt GRCh38 genome reference 2020-A (GRCh38 and Ensembl 98). Seurat was used for all subsequent analyses. We constructed a Seurat object using the Load10X_Spatial function for every slide. The slide was scaled and normalized to correct for the effects. The same scaling and normalization method was used when analysing cells and samples with several slides. Spots were clustered using the original Louvain algorithm, with the top 30 principal component analysis dimensions as input to the FindNeighbors and FindClusters functions.

Bedtools62 intersection was used to map copy number ratios from segments to genes and to assign the called amplifications or deletions. A custom script written in Python used a copy number ratio weighted by the length of overlaps to determine whether a gene was amplified, neutral or deleted. The bounds of the default Z score thresholds were used instead if the resulting Z score cut-off value was within that range.
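The overlap-weighted gene-level call can be sketched as follows. This is not the authors' custom script: the input layout (a list of segment ratio and overlap length pairs per gene) and the cut-off values of ±0.2 are hypothetical and stand in for whatever thresholds the analysis derived.

```python
def gene_copy_ratio(overlaps):
    """Overlap-length-weighted mean of segment copy number ratios for one gene.

    `overlaps` is a list of (segment_ratio, overlap_length_bp) pairs,
    for example as produced by intersecting segments with a gene.
    """
    total = sum(length for _, length in overlaps)
    if total == 0:
        raise ValueError("gene does not overlap any segment")
    return sum(ratio * length for ratio, length in overlaps) / total


def call_gene(ratio, amp_cutoff=0.2, del_cutoff=-0.2):
    """Classify a gene from its weighted ratio (cut-offs are illustrative only)."""
    if ratio > amp_cutoff:
        return "amplified"
    if ratio < del_cutoff:
        return "deleted"
    return "neutral"
```

Weighting by overlap length means a segment covering most of a gene dominates the call when the gene spans a breakpoint.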

Inference for comparisons of the probabilities of mutations per location between different groups was done using percentile confidence intervals for each type of tumour.

All samples were collected with consent from the Washington University School of Medicine in St Louis. Samples from BRCA, PDAC, CRC, CHOL, RCC and UCEC were collected during surgical resection and verified by standard pathology (institutional review board protocols 201108117, 201411135 and 202106166). After verification, a 1.5 × 1.5 × 0.5 cm3 portion of the tumour was removed, photographed, weighed and measured. Each portion was then subdivided into 6–9 pieces and then further subdivided into 4 transverse-cut pieces. The pieces were then placed into formalin, snap-frozen in liquid nitrogen, or snap-frozen before being embedded in OCT. Grid processing was chosen over punch sampling because it minimized the amount of remaining tissue. Relevant protocols can be found at protocols.io (https://doi.org/10.17504/protocols.io.bszynf7w)46.

Paraffin blocks (FFPE blocks) were sectioned at 5 μm and placed on Xenium slides following the FFPE Tissue Preparation guide (10x Genomics, CG000578, Rev B). Those slides underwent a series of xylene and ethanol washes for deparaffinization, followed by decrosslinking using the FFPE tissue enhancer as outlined (10x Genomics, CG000580, Rev B). In situ probe hybridization took place overnight with 409 probes from the Xenium Human Multi-Tissue Panel, plus an additional 100 custom probes. After hybridization, probes were ligated, the sample underwent rolling circle amplification, and the background was quenched using an autofluorescence mixture. Nuclei were stained with DAPI to improve sample tracking and approximate cell boundaries (10x Genomics, CG000582, Rev D). These samples, along with buffers and decoding consumables, were loaded into a Xenium analyzer (10x Genomics, 1000481). The run was performed following the guidance provided by 10x Genomics. Diagnostic images of the reporters were made using the barcoded circularized cDNA. H&E staining was performed on the same region after the run was complete.

FFPE tissue sections (4 µm) were baked for 2.5 h at 63 °C in vertical slide orientation. Subsequent deparaffinization was performed on the Leica Bond RX, followed by 30 min of antigen retrieval with Leica Bond ER2 and six sequential cycles of staining, with each round including a 30 min combined block and primary antibody incubation (Akoya antibody diluent/block ARD1001EA), except for HER2, which required a 1 h incubation. Around nine fields of view (FOVs), each 1.34 mm2, were captured from each tissue section.

Nuclei or cells and barcoded beads were isolated in oil droplets using a 10x Genomics Chromium instrument. Single-nucleus suspensions were counted using a haemocytometer and adjusted to a range of 500–1,800 nuclei per µl. Reverse transcription was subsequently performed to incorporate cell- and transcript-specific barcodes. All snRNA-seq samples were run using a Chromium Next GEM Single Cell 3′ Library and Gel Bead kit v.3.1 (10x Genomics). For multiome samples, the 10x Genomics multiome kit was used. Nuclei were then subjected to downstream protocols by 10x (Next GEM Single Cell Multiome ATAC + Gene Expression: https://cdn.10xgenomics.com/image/upload/v1666737555/support-documents/CG000338_ChromiumNextGEM_Multiome_ATAC_GEX_User_Guide_RevF.pdf). The user guides for the Chromium single-cell kits can be found at support.10xgenomics.com. Single-cell suspensions were subjected to the Next GEM Single Cell 3′ Kit v.3.1 protocol. Barcoded libraries were then pooled and sequenced on an Illumina NovaSeq 6000 system with associated flow cells.

About 100–250 ng of genomic DNA was fragmented on a Covaris LE220 instrument targeting 250-bp inserts. A dual-indexed library was created using a KAPA Hyper library prep kit. Up to ten libraries were pooled at an equimolar ratio by mass before the hybrid capture, targeting a 5-µg library pool. The library pools were hybridized using xGen Exome Research Panel v.1.0 reagent (IDT Technologies), which spans a 39-Mb target region (19,396 genes) of the human genome. The libraries were hybridized for 16 h at 65 °C and then washed to remove spuriously hybridized library fragments. Enriched library fragments were eluted and PCR cycle optimization was performed to prevent overamplification. The enriched libraries were amplified. Library concentrations were determined using a KAPA kit according to the manufacturer’s protocol to obtain cluster counts appropriate for the Illumina NovaSeq 6000 instrument. Next, 150-bp paired-end reads were generated to achieve around 100× coverage per library.

Automated data analysis and analysis of GDC human reference genome GRCh38 in FASTQ, MSK-IMPACT and samstat

Data analyses were done in R and Python. Details of specific functions and libraries are provided in the relevant methods sections above. Significance was determined using various tests, including the Wilcoxon rank-sum test, proportion test, hypergeometric test and Pearson correlation test. P < 0.05 was considered significant. Details of the tests are provided in the figure legends.

FASTQ files were preprocessed with the parameter length 36 and all others set to default. FASTQ files were then aligned to the GDC’s GRCh38 human reference genome (GRCh38.d1.vd1) using BWA-mem (v.0.7.17) with parameter -M and all others set to default. The output SAM file was converted to a BAM file using samtools (https://github.com/samtools/samtools; v.1.14) view with parameters -Shb, and all others set to default. BAM files were sorted and duplicates were marked using Picard (v.2.6.26): the SortSam tool was run with the parameters CREATE_INDEX=true, SORT_ORDER=coordinate, VALIDATION_STRINGENCY=STRICT, and all others set to default; MarkDuplicates was run with the parameter REMOVE_DUPLICATES=true, and all others set to default. The final BAM files were then indexed using samtools (v.1.14) index with all parameters set to default.

Copy-number alterations in solid tumours were computed from MSK-IMPACT using the FACETS (Fraction and Allele-Specific Copy Number Estimates from Tumour Sequencing) algorithm56, which provides allele-specific copy-number estimates at the level of both gene and chromosome arm. FACETS was also used to generate purity-corrected segmentation files, for detection of whole-genome duplication events, to infer the clonality of somatic mutations, to assess arm-level copy-number changes and to generate mutant allele copy-number estimates.

To determine the error rate of long-read sequencing in this context, four samples were run as technical duplicates. The minimum VAF threshold was chosen such that calls were reproduced in technical replicates; this was found to be 0.02. Additionally, mutation calls were filtered to only include nonsense mutations with ≥100 read depth.

FASTQ files were aligned against the Genome Reference Consortium mouse genome 39 (GRCm39)59 using BWA-MEM (https://github.com/lh3/bwa). Mutation calling was performed using VarDict as the variant caller with a minimum allele fraction of 0.01. Ensembl VEP60 was used for variant annotation. The list of called mutations was filtered to remove variants that did not pass internal noise filters. Indels were removed because of the propensity of ENU to cause single-nucleotide variants. Finally, variants were retained only if they were called in at least two amplicons per sample and supported by at least five mutant reads. If at least five different reads or a VAF of less than 0.01 were used, the codons 73–84 and 122–139 would be inspected and retained. Only one sample was affected by this.
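The retention criteria can be sketched as a single predicate. The record layout used here (a dictionary with `ref`, `alt`, `vaf`, `alt_reads` and `amplicons` keys) is hypothetical, and treating the five-read requirement as a total across amplicons is an assumption; the internal noise filters and the manual inspection step are not modelled.

```python
def is_snv(ref, alt):
    """True for a single-nucleotide substitution (indels are excluded)."""
    return len(ref) == 1 and len(alt) == 1 and ref != alt


def retain_variant(variant, min_vaf=0.01, min_amplicons=2, min_alt_reads=5):
    """Apply the allele fraction, amplicon and read-support filters to one call."""
    return (
        is_snv(variant["ref"], variant["alt"])
        and variant["vaf"] >= min_vaf
        and len(variant["amplicons"]) >= min_amplicons
        and variant["alt_reads"] >= min_alt_reads
    )
```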

The primer is a product of NEB Q5 High-Fidelity DNA polymerase, which has a number of exons. Each sample had a different barcode placed on the end of the forward primer. The expected size distribution of the minor clone sample products was confirmed by gel electrophoresis, and the products were then purified, quantified and pooled in an equimolar ratio. The Earlham Institute constructed a library and sequenced it on a PacBio Revio SMRT Cell.

mRNA was isolated from the micro-dissected minor tumour clones. First-strand cDNA synthesis was carried out using a NEB ProtoScript II First Strand cDNA synthesis kit (E6560) according to the manufacturer’s instructions. Either oligo-dT or a gene-specific reverse primer (rev_3699 GCCTTTTGGCATTAGATGGA) was used.

The iScript cDNA synthesis kit was used for the synthesis of the cDNA. Real-time quantitative PCR for Notum was performed using a Taqman gene expression assay (Mm01253273_m1) according to the manufacturer’s instructions on a QuantStudio 6 (Applied Biosystems). Relative fold change in gene expression was calculated using the $2^{-\Delta\Delta C_{\mathrm{t}}}$ method. All ΔΔCt values were normalized to the housekeeping genes Gapdh (Mm99999915_g1) and Rpl37 (Mm00782745_s1).
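A worked example of the ΔΔCt calculation is given below. Averaging the Ct values of the two housekeeping genes to form the reference is an assumption of this sketch, as the text does not state how the two reference genes were combined; the function names and Ct values are illustrative only.

```python
from statistics import mean


def delta_ct(ct_target, ct_references):
    """Ct of the target gene minus the mean Ct of the reference genes."""
    return ct_target - mean(ct_references)


def fold_change(ct_target_sample, ct_refs_sample, ct_target_control, ct_refs_control):
    """Relative expression by the 2^(-ddCt) method."""
    dd_ct = delta_ct(ct_target_sample, ct_refs_sample) - delta_ct(
        ct_target_control, ct_refs_control
    )
    return 2 ** (-dd_ct)


# Illustrative values: target Ct 22 in the sample and 24 in the control,
# with reference gene Cts of 18 and 20 in both conditions.
example = fold_change(22.0, [18.0, 20.0], 24.0, [18.0, 20.0])
```

Here ΔCt is 3 in the sample and 5 in the control, so ΔΔCt is −2 and the relative expression is 2² = 4-fold.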

The knockout of Apc was performed by Skoufou-Papoutastic and his team. Mouse small intestinal organoids were dissociated into single cells in TrypLE Express for 30 min. One-hundred thousand cells were then incubated with a Cas9 enzyme (TrueCut Cas9 Protein v2, Invitrogen, A36497) and single guide RNA (sgRNA) complex (Synthego). sgRNAs were designed using Benchling (https://www.benchling.com/) and Indelphi (https://indelphi.giffordlab.mit.edu/)74. Guides were designed to produce out-of-frame indels and thus truncations at codons S 96, T619, and F1378. The sgRNA sequences were: pre-Armadillo, CCTTCGCTCCTACGGAAGTC. The cells were then transferred to a 16-well Lonza strip and held at room temperature for 10 min. Electroporation was performed on an Amaxa 4D Nucleofector (Lonza) using the DS138 programme. After 10 min at 37 °C, cells were transferred to a 0.5 ml tube, suspended in 20 μl Cultrex and plated as described above. Owing to their small size, single organoids were picked using a micropipette once they had formed. To generate clonal organoids, single picked organoids were dissociated in TrypLE Express for 10 min at room temperature. DNA was extracted using the PicoPure DNA extraction kit according to the manufacturer’s instructions. Custom-designed primers were used for PCR. PCR primers used were: Pre_Armadillo forward, GGCAGATGGGTTCAAAGGGGTAGAG; Is there a pre_armadillo reverse? Arm forward, TGACTCATAGAAACAGCACTGACCCA; Arm reverse, GCATGGCTGGATTTCTCAACTACCA; MCR forward, TCAGACAACACAGGAAGCAGA; and MCR reverse, GGCCCACTCTCTCTCTTCTC. Deconvolution was performed using the ICE Synthego platform (https://ice.synthego.com/) to determine the knockout score and clonality.

Primary and metastatic CRC and normal colon organoid lines were established as previously described10,43,54. Cells processed as described above were centrifuged at 600g for 5 min at 4 °C and resuspended at 2,000 cells per 40 μl of Matrigel. Matrigel domes were allowed to solidify at 37 °C before HISC medium was added to the wells. Organoids were passaged every 7–10 days, and were considered established after three passages. For non-tumour organoid culture, HISC medium was supplemented with human R-spondin 1 (1 μg ml−1; Peprotech) and NGS-WNT (0.5 M, ImmunePrecise N000). The medium was changed every 3–4 days. Organoid lines were expanded and early-passage stock vials were cryopreserved in liquid nitrogen.

For experiments, organoids were collected from Matrigel with 3 mM EDTA in DPBS and, where indicated, were treated with TrypLE (Thermo Fisher Scientific) for 5–10 min at 37 °C and filtered through a 40-μm cell strainer to generate single cells. Cells were plated at a density of 20,000 cells per 40 μl of Matrigel and left for 30 min until the domes solidified. Organoids were cultured in HISC medium based on Advanced DMEM/F12 (AdDF12; Thermo Fisher Scientific) with Glutamax (2 mM), HEPES (10 mM) and N-acetyl-l-cysteine (1 mM). The medium was changed every few days. Organoids were collected at 7 days for downstream assays. Small intact organoids were treated with 250 nM irinotecan added to the HISC medium for 7 days and then collected for downstream assays. Production and transduction of organoids were performed as described previously. HEK293T cells were cultured in medium supplemented with penicillin and streptomycin (100 IU ml−1 and 0.1 mg ml−1, respectively; Thermo Fisher Scientific). All cells tested negative for mycoplasma. Organoids were collected from Matrigel with 3 mM EDTA in DPBS, dissociated to single cells with Accutase (Sigma-Aldrich) for 30–45 min at room temperature, washed with IGFF medium and processed as described above.

Sequence read quality was assessed using FastQC (v0.11.9; http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Adapter content was trimmed from the reads using Trimmomatic (v0.39)62. Trimmed reads were aligned to GRCm39 Ensembl release 103 for quality control purposes using STAR version 2.7.7a63 and quality control of the aligned reads was carried out using Picard tools (v2.27.3). Gene expression quantification was carried out using Salmon (v1.9.0) against indexes generated from Gencode Mouse release M30. Differential gene expression analysis was performed using the DESeq2 package64. Genes were considered differentially expressed at an adjusted P value of <0.05. Gene set enrichment analysis was performed using the GSEA function of the clusterProfiler package. The mouse small intestinal and colonic secretory signatures were obtained from Tomic et al.69 and the mouse gene sets were published by Muñoz et al.67. Additional published Wnt pathway and Apc knockout gene sets, as well as an unpublished mouse intestinal-specific KrasG12D list of genes, were used70. Mouse intestinal cell-type signatures were derived from a compendium of single-cell RNA-sequencing experiments hosted at PanglaoDB71. Consensus molecular subtyping was performed using option C of the MmCMS package. Pathway-derived subtyping was performed using the default prediction probability of 0.6 recommended by Malla et al.72.

The target amplicon library was prepared using the 8.8.6 IFC and the Juno system according to the Standard BioTools protocol. The highly multiplexed interrogation of 48 samples against 8 independent panels of primers resulted in the generation of 286 amplicons for each sample. The harvested amplicons from each IFC were quantified using a Bioanalyzer 2100 (Agilent) and pooled equimolarly. Paired-end sequencing was performed on an Illumina platform.

Standard BioTools’ D3 Assay Design software was used to design a targeted panel of primers covering ten genes (Apc, Ctnnb1, Kras, Nras, Hras, Braf, Pten, Fbxw7, Smad4 and Trp53). Apc, Ctnnb1, Kras and Trp53 had 100% of their exonic regions covered by the panel, whereas the coverage for the other genes was limited to previously identified hotspots on an exome hybridization panel (unpublished). All the targets in the panel were covered by two amplicons, except for Apc and Pten, which had 98% and 85% dual coverage, respectively.

DNA extraction from bulk or micro-dissected tumours was performed using a QIAamp DNA FFPE Tissue kit (Qiagen, 56404) according to the manufacturer’s instructions, apart from a longer lysis incubation time of 12 h at 56 °C and omission of the 90 °C incubation step. The purified DNA was quantified using a NanoDrop spectrophotometer. The DNA was stored at −20 °C.

The mice were culled. The whole intestine was dissected, flushed with cold PBS, cut longitudinally and wholemounted. The tissue was washed in PBS, and randomly chosen segments of the bowel were excised after fixation in 4% paraformaldehyde. The tissue was then optically cleared. In brief, excised segments were incubated with CUBIC-1a solution (10% urea, 5% N,N,N′,N′-tetrakis(2-hydroxypropyl)ethylenediamine, 10% Triton X-100 and 25 mM NaCl in distilled water) at 37 °C for 7–10 days. DAPI was used for nuclear counterstaining at a dilution of 1:1,000. The cleared tissue was washed in PBS for 24 h. Additional clearing and refractive index matching were performed with Rapiclear 1.52 (SunJin Labs 152002) for 24 h. Finally, the samples were mounted in a 0.25 mm i-Spacer (SunJin Labs) for confocal imaging.

All histological quantification was performed using QuPath (v.0.4.3; https://github.com/qupath/qupath)58. Annotations based on Confetti were first created for different types of tumours using a section stained for RFP. Positive cells for other markers were then identified using the positive cell detection feature with intensity threshold of 5 and a nucleus background radius of 8 μm, using DAPI as nuclear marker. The number of positive cells per unit area of the annotations were reported for the analysis of the chromogenic or fluorescent duplex staining. For Ki67 staining, results were reported as percentage of DAPI-positive cells.

Simultaneous detection of Lgr5 and Anxa1 and detection of Notum were performed on paraffin-embedded sections using the Advanced Cell Diagnostics (ACD) RNAscope 2.5 LS Duplex Reagent Kit (322440), RNAscope 2.5 LS Probe-Mm-Anxa1 (509298), RNAscope 2.5 LS Probe-Mm-Lgr5-C2 (312178-C2) and RNAscope 2.5 LS Probe-Mm-Notum-C1 (428988-C1) (ACD). Sections were baked for 1 h at 60 °C before being loaded on the Bond RX instrument. The slides were deparaffinized and rehydrated on board before pre-treatments with Epitope Retrieval solution 2 and ACD from the Duplex Reagent kit. Probe hybridization and signal amplification were performed according to the manufacturer’s instructions. Fast red detection of C2 was performed on the Bond RX using the Bond Polymer Refine Red Detection Kit (Leica Biosystems, DS9390) according to the ACD protocol. Slides were then removed from the Bond RX and detection of the C1 signal was performed using the RNAscope 2.5 LS Green Accessory Pack (ACD, 322550) according to kit instructions. Slides were heated at 60 °C for 1 h, dipped in xylene and mounted using VectaMount Permanent Mounting Medium (Vector Laboratories, H-5000). The slides were scanned to create whole-slide images. Images were captured at 40× magnification, with a resolution of 0.25 μm per pixel.

Images were acquired throughout the thickness of the tissue on a confocal microscope with a ×10 objective and ×0.7 optical zoom. The image analysis was done using a software program. All identified tumours had their Confetti status manually assessed at their acquired Z positions. A tumour was only identified as heterotypic if it showed evidence of glands of at least two Confetti colours or one Confetti colour in the presence of unlabelled glands. Single intermixed glands, which are most likely entrapped normal crypts, were not used to determine heterotypic status.

Tumour suppressor and/or oncogene fields were induced by a single injection of 4mt of tamoxifen dissolved in ethanol and sunflower oil. Chemical mutagenesis was performed exactly 10 days after field induction using 200 mg kg−1 ENU dissolved in ethanol/phosphate-citrate buffer (1:9) given intraperitoneally.

The mice were at least 8 weeks old. Mice were housed under controlled conditions in individually ventilated cages in a specific-pathogen-free facility, which was tested according to the recommendations of the Federation of European Laboratory Animal Science Associations. Water and food were provided. Prior to the study, no mice had been involved in any procedures. Mice were aged until they showed pre-defined signs of tumour burden, such as anaemia and loss of body condition. No mice were allowed to exceed these pre-defined endpoints. No randomization or blinding was used. Sample sizes were determined on the basis of preliminary experiments. All animal experiments were approved by the Animal Welfare and Ethical Review Body at the CRUK Cambridge Institute, University of Cambridge, and performed under a Home Office project licence.

Modelling and predicting the growth dynamics of heterotypic tumours using the Spatstat package for statistical analysis of real-time PCR data

The inducible Cre line was crossed with the other lines on the background. Some experiments used the LSL-KrasG12D54 and Trp53FL alleles. Genotyping was performed by Transnetyx using real-time PCR.

The Confetti label was randomly assigned to crypts in a 10 × 100 field of unmarked crypts using a custom Perl script. A total of 1,000 simulations were used to calculate the number of patches and the distribution of coloured crypts within patch sizes.
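The random-labelling simulation can be sketched as follows. This is not the custom script used in the study: the labelling probability, the equal colour frequencies, the function names and the definition of a patch as a set of same-coloured crypts joined through 4-connected neighbours are all assumptions made for illustration, as is the reading of the field as a 10 by 100 grid.

```python
import random
from collections import Counter

def simulate_patches(rows=10, cols=100, p_label=0.1,
                     colours=("CFP", "RFP", "YFP"), rng=None):
    """Randomly label crypts and return the size of every coloured patch."""
    rng = rng or random.Random()
    # None marks an unlabelled crypt (hypothetical labelling probability)
    grid = [[rng.choice(colours) if rng.random() < p_label else None
             for _ in range(cols)] for _ in range(rows)]
    seen, sizes = set(), []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is None or (r, c) in seen:
                continue
            # Flood fill over 4-connected neighbours of the same colour
            stack, size = [(r, c)], 0
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                size += 1
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and grid[ny][nx] == grid[r][c]):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            sizes.append(size)
    return sizes

def patch_size_distribution(n_sim=1000, seed=0, **kwargs):
    """Repeat the simulation and tally patch counts and patch sizes."""
    rng = random.Random(seed)
    n_patches, size_counts = [], Counter()
    for _ in range(n_sim):
        sizes = simulate_patches(rng=rng, **kwargs)
        n_patches.append(len(sizes))
        size_counts.update(sizes)
    return n_patches, size_counts

n_patches, size_counts = patch_size_distribution()
print(sum(n_patches) / len(n_patches), sorted(size_counts.items())[:5])
```

The tallies give a null expectation for how often neighbouring crypts share a colour by chance alone.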

Tumour growth was simulated using the observed growth rates. For each confocally imaged region of intestine, a number of points were initialized at t = 0 after ENU with each point attributed a Confetti label (based on observed frequencies of CFP, RFP, YFP and uncoloured crypts) and a growth rate sampled from the distribution of observed growth rates. These points were then allowed to expand until the simulation was stopped at the humane endpoint reached by the mouse. The number of collisions resulting in heterotypic tumours for each of these simulations was tallied. 10,000 seeds were created for each segment to be scored. The observed number of heterotypic tumours was compared with the expected number using a paired t-test.

To investigate whether heterotypic tumours arise in regions of highest density, the spatstat package was used to calculate local spatial tumour density within each imaged bowel segment. The density distributions at heterotypic and non-heterotypic tumour locations were compared using a Q-Q plot and the Kolmogorov–Smirnov test. If clustering occurs in high-density regions, local densities would be expected to be higher at the locations of heterotypic tumours.
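The comparison of the two density distributions can be sketched in Python with SciPy rather than in R. The density values below are simulated placeholders, not study data, and the function name and the quantile grid used for the Q-Q pairs are hypothetical choices.

```python
import numpy as np
from scipy.stats import ks_2samp

def compare_density_distributions(density_hetero, density_other, n_quantiles=19):
    """Two-sample Kolmogorov-Smirnov test plus matched quantiles for a Q-Q plot."""
    result = ks_2samp(density_hetero, density_other)
    probs = np.linspace(0.05, 0.95, n_quantiles)
    # Column 0: quantiles at non-heterotypic locations; column 1: heterotypic
    qq = np.column_stack([np.quantile(density_other, probs),
                          np.quantile(density_hetero, probs)])
    return result.statistic, result.pvalue, qq

# Simulated local densities (placeholders for the values computed per tumour)
rng = np.random.default_rng(0)
hetero = rng.gamma(shape=3.0, scale=1.0, size=40)
other = rng.gamma(shape=2.0, scale=1.0, size=200)
stat, pvalue, qq = compare_density_distributions(hetero, other)
print(stat, pvalue)
```

Points lying above the identity line in the Q-Q pairs would indicate higher local density at heterotypic tumour locations.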

A mixed-effects model was built using mouse identity as a random effects term to quantify the growth dynamics of tumours. The rate of growth was quantified during the exponential growth phase between 40 and 63 days after the ENU.

The R statistical computing environment was used for visualization and statistical analysis of data. Multiple testing correction of P values was carried out using the Benjamini–Hochberg method77 for the RNA sequencing. All immunostaining and RNAscope experiments were performed on at least three independent biological replicates (three different mice). The data was derived from at least three independent biological replicates. Statistical tests and P values can be found in figures and legends. The lower whisker shows the smallest observation greater than or equal to the lower hinge, and the centre line shows the median, 50% quantile.
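For reference, the Benjamini–Hochberg adjustment can be written out in a few lines. This is a generic NumPy sketch of the standard procedure with a hypothetical function name, not the code used for the RNA-sequencing analysis, which was carried out in R.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted P value by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest P value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))  # [0.02 0.04 0.04 0.02]
```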

We applied MAGIC (v.3.0.0) imputation59 to normalized, log-transformed count matrices to denoise and recover missing transcript counts due to dropout. Imputation was performed using conservative parameters (t = 3, ka = 5, k = 15). The imputed values were used for visualization and for analysis of gene expression and gene–gene correlations in both treatment-resistant and previously untreated patients.

Patients undergoing synchronous colorectal resection and metastasectomy at MSKCC were identified by chart review, and those who had signed pre-procedure informed consent to MSK IRB protocols 06-107, 12-245, 14-244 and 22-404 for biospecimen collection were selected for this study. No statistical method was used to pre-determine sample size. Freshly resected surgical tissue in surplus of clinical diagnostic requirements was processed into single-cell suspensions for scRNA-seq analysis and, where sufficient tissue was available, processed to generate organoids. Portions were also fixed in formalin and embedded in paraffin. The tissue was typically processed within a few hours after surgery. Archival formalin-fixed, paraffin-embedded (FFPE) clinical tissue blocks for immunostaining were identified by database search and chart review. An expert gastrointestinal pathologist oversaw the processing and interpretation of histopathological data. If trios were successfully collected, the patient was tracked through their clinical course at MSK and surplus tissue from any subsequent procedures was collected.

Clinical data, including baseline demographic data and previous treatments (Supplementary Table 1 and Supplementary Fig. 1), were abstracted through manual review of patient electronic medical records by board-certified medical oncologists (M.L. and K.G.), collected as part of institutional review board-approved protocols (MSK IRB, 14-244 and 22-404). The time to the treatment event was calculated from the date of diagnosis. Study data were collected and managed using REDCap electronic data capture tools hosted at MSKCC on secure central servers. Out of 31 patients, 17 had multiple sites at the time of surgery and 50% of the sites were still there after surgery. 17 out of 31 patients had early-onset CRC. Clinical MSK-IMPACT targeted exon sequencing was performed on tumour/normal tissue from 27 out of 31 patients and revealed expected mutations53 (Extended Data Fig. 1a). Consistent with the low percentage (<5%) of metastatic CRC that is mismatch repair deficient/microsatellite instability high, only one patient in our cohort had a microsatellite instability indeterminate tumour. Clinical data collection was stopped by the government.

50–300 mg of surgical tissue was collected in 5 ml of organoid medium (AdDF12, Advanced DMEM/F12; Thermo Fisher Scientific). For primary and metastatic tumours, specimens were placed into a 15 cm Petri dish using sterile forceps and washed three times with DPBS (Thermo Fisher Scientific) supplemented with the above-described antibiotic cocktail, and minimally chopped with sharp sterile blades to enable transfer of tumour fragments using a pre-wet 25 ml serological pipette.

Non-tumour tissue was transferred into a 50 ml tube pre-filled with 25 ml dissociation/chelation buffer (8 mM EDTA, 0.5 mM DTT, DNase I (100 U ml−1, Millipore Sigma)). Mucosal fragments were incubated with gentle rotation at 4 °C for a maximum of 30 min. The dissociation state of the tissue fragments was assessed every 10 min under an inverted microscope. Dissociation was interrupted before 30 min if at least 30% of the mucosal material appeared broken into clusters of 1 to 5 colonic crypts. Next, the crypt solution was filtered through a 1 mm cell strainer (PluriSelect) to separate individual crypts or small crypt clusters from large chunks of undissociated mucosa. An equal volume of DPBS supplemented with antibiotics was added. At this point, the 1 mm filter was inverted into a fresh 50 ml tube. Up to 25 ml of DPBS supplemented with antibiotics was flushed through the inverted filter to recover the undissociated mucosal tissue. After manually shaking the suspension of mucosal tissue fragments (approximately 5 times), the collection of clusters of colonic crypts was reattempted as described above. At least three additional fractions of crypt suspensions were collected from the combination of the manual shaking and filtration steps. Crypt suspensions were washed three times and each centrifugation step was carried out at 100g for 3 min at room temperature. Based on visual inspection under an inverted microscope, one or more crypt suspensions were selected for subsequent processing according to the size and integrity of the crypts, and either processed separately or pooled together if individual suspensions were assessed to have low crypt content.

If blood traces were visible under an inverted microscope, the cell pellet was resuspended in 1–1.5 ml ACK lysis buffer, according to the pellet size. Quenching was performed with three volumes of DPBS supplemented with antibiotics, followed by an additional wash to remove ACK traces. The resulting cell pellet was further processed for either scRNA-seq, organoid generation or both. Tissue processing protocols were extensively and iteratively optimized to maximize retrieval of high-quality (low mitochondrial and ribosomal content) viable single-cell suspensions for downstream analyses.

For validation, organoids underwent targeted exome sequencing by MSK-IMPACT53 and key oncogenic genomic alterations were identified by OncoKB55 (see below). Diagnostic tissue from originating tumours was sequenced to confirm that these alterations were conserved in each derived organoid line. Organoids were verified on the basis of short tandem repeats at the time of establishment and before every experiment, and were routinely tested for mycoplasma contamination (MycoALERT PLUS detection kit, Lonza).

Within the epithelial compartment, we then recomputed HVGs (2,097 HVGs), re-performed PCA (210 PCs, 75% variance explained), clustered cells with PhenoGraph (v.1.5.7) (k = 30) and removed the four remaining outlier clusters containing cells belonging to patients KG103, KG105 and KG66. These clusters were characterized by low library size, sparse block structure and a strong overlap between cells from both the primary and non-tumour samples. Our observations of histological images of non-tumour samples suggested that the poor sample quality in these patients was associated with previous disease conditions. These observations suggested that some of the clusters may represent stressed or dying disease-associated cells that would not be useful to our study. After removing these clusters, 47,437 cells remained, and all downstream analysis of the epithelial compartment was performed on them.

To confirm that the set of genes is robust to the number of features, we created Hotspot modules for 1,500-2,500 HVGs with minimum-gene threshold set to 20 and core-only set. Pearson correlations were calculated using the Hotspot modules and the set obtained with 2,000 HVGs. For each original Hotspot module obtained with 2,000 HVGs, we report the correlation of the module used in this study and the best-matching module—the module that is most highly correlated with the original module in this study (Supplementary Fig. 2). Every set of gene features identified a subset of modules showing close correspondence to our final set of modules.
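The best-match comparison between module sets can be sketched as follows. It assumes that each module is represented by a per-cell score vector and that modules from two runs are compared through the Pearson correlation of these vectors; the function name and the random example data are hypothetical.

```python
import numpy as np

def best_matching_modules(ref_scores, alt_scores):
    """For each reference module, find the most correlated alternative module.

    Both inputs are (cells x modules) score matrices over the same cells.
    Returns the index of the best match and its Pearson correlation.
    """
    ref = np.asarray(ref_scores, dtype=float)
    alt = np.asarray(alt_scores, dtype=float)
    # Standardize columns so the scaled cross-product equals Pearson's r
    ref_z = (ref - ref.mean(axis=0)) / ref.std(axis=0)
    alt_z = (alt - alt.mean(axis=0)) / alt.std(axis=0)
    corr = ref_z.T @ alt_z / ref.shape[0]
    best = corr.argmax(axis=1)
    return best, corr[np.arange(corr.shape[0]), best]

# Illustrative data: the alternative run recovers the same modules, reordered
rng = np.random.default_rng(0)
ref = rng.normal(size=(50, 3))
alt = ref[:, [2, 0, 1]]
print(best_matching_modules(ref, alt))
```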

The nearest neighbor and between-layer graphs were converted from their previous state to within-sample and between-layer affinity matrices. The resulting matrices were used to get an augmented cell–cell affinity matrix that consists of three main components. This matrix was input to PhenoGraph (v.1.5.7) classification (see below) to propagate labels from the reference (KG146) dataset to the unlabelled dataset (organoid) and generate UMAP co-embeddings of patient and organoid datasets.

Synthesis and PCR amplified 97-mer mirE shRNA sequences for doxycycline inducible knockdown experiments with organoids in HISC and IGFF medium

For doxycycline inducible PROX1 knockdown experiments, de novo 97-mer mirE shRNA sequences were synthesized (IDT Ultramers) and PCR amplified using the primers miRE-Xho-fw (5′-TGAACTCGAGAAGGTATATTGCTGTTGACAGTGAGCG-3′) and miRE-EcoOligo-rev (5′-TCTCGAATTCTAGCCCCTTGAAGTCCGAGGCAGTAGGC-3′) as described previously83. TGCTGT is the sequence (sh PROX1-2) The original shRNA sequence was used as a control. Transduced organoids were selected using HISC medium supplemented with 2 μg ml−1 puromycin for 7 days. For inducible knockdown experiments, organoids were dissociated into single cells, plated at a density of 2,000 cells per 40 μl of Matrigel and maintained in HISC or IGFF medium supplemented with 2 μg ml−1 doxycycline (Thermo Fisher Scientific) unless otherwise specified for 7 days before downstream assays. For organoid initiation and outgrowth assays, organoids containing inducible PROX1 or control shRNA cultured in HISC medium supplemented with 2 μg ml−1 doxycycline for 7 days were dissociated into single cells, stained with DAPI (1 μg ml−1, Thermo Fisher Scientific), and live cell (DAPI−) and GFP+ sorted to select for healthy cells expressing the shRNA construct, and plated at a density of 750 cells per 15 μl Matrigel in HISC medium without Y-27632 and supplemented with 2 μg ml−1 doxycycline, and imaged at 7 days (BioTek).

Approximately 4,000 organoids (3–4 million cells) were recovered from Matrigel using 3 mM EDTA in DPBS, washed, centrifuged (200g, 5 min, 4 °C) and lysed with 1× RIPA buffer supplemented with PPI (1:100, Sigma-Aldrich, 04693132001) and benzonase (1:100, Thermo Fisher Scientific, 70-664-3) on ice for 30 min. The protein concentration was determined using Pierce BCA assay (Thermo Fisher Scientific, 23227). A total of 10 μg protein per sample was separated by SDS–PAGE on Bis-Tris polyacrylamide gels (Thermo Fisher Scientific, NW04120BOX), transferred to activated PVDF membranes (Millipore, IPFL00010) and blocked in 3% BSA-TBST solution for 30 min. The membranes were incubated overnight at 4 °C with the following antibodies: mouse anti-β-actin (1:1,000, Thermo Fisher Scientific, AM4302) and rabbit anti-PROX1 (1:1,000, Abcam, ab199359), followed by secondary antibody incubation with 488 anti-mouse and 680 anti-rabbit secondary antibodies (1:5,000, LI-COR Biosciences, 1 h, room temperature) before imaging (Odyssey CLx). Western blots were quantified using ImageJ (v.1.53t)84.

We used Mesmer (v.0.12)81, a deep-learning cell segmentation algorithm, to identify cell boundaries in all COMET and Vectra images. Mesmer uses one nucleus-stained image and one membrane- or cytoplasm-stained image to define the extent of the nucleus and cell, respectively. We used DAPI as the nuclear marker for COMET and Vectra images. To create an image that defines the boundaries of multiple cell types, we combined the channels for several cell-type-specific membrane or cytoplasmic markers into a single image by min–max scaling each channel (using the MinMaxScaler function in the sklearn.preprocessing (v.1.4.2) package with the default parameters) and summing them. For COMET, we combined CK20, HER2, CK5, SYNC (normal and tumour epithelial cells) and VIM (stromal cells). For Vectra panel 1, we used HER2, SOX2, CK20, CDX2 and CHGA. For Vectra panel 2, we used TROP2 and TP63.
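The channel-combination step can be sketched with plain NumPy. The sketch assumes that each channel is scaled to the range 0 to 1 over all of its pixels, which mirrors min–max scaling but is not a call to MinMaxScaler itself; the function name and the toy arrays are hypothetical.

```python
import numpy as np

def combine_channels(channels):
    """Min-max scale each marker channel to [0, 1] and sum into one image."""
    combined = np.zeros(np.asarray(channels[0]).shape, dtype=float)
    for channel in channels:
        channel = np.asarray(channel, dtype=float)
        span = channel.max() - channel.min()
        # A constant channel carries no boundary information; treat it as zero
        if span > 0:
            combined += (channel - channel.min()) / span
    return combined

# Two toy 2 x 2 "marker" images standing in for membrane/cytoplasm channels
marker_a = np.array([[0, 10], [5, 10]])
marker_b = np.array([[2, 2], [4, 6]])
print(combine_channels([marker_a, marker_b]))
```

Scaling before summing keeps a single bright marker from dominating the combined membrane image.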

A physician manually annotated 167 of the 602 FOVs as having high or low background signal. To identify high-background images in the remaining 435 FOVs, we first found the highest level of background CK5 signal (10th percentile of expression across all cells in the FOV) within FOVs labelled low-background (around 0.0068). All unlabelled FOVs with a background signal above this level (>0.0068) were considered high-background. As a result, 327 unlabelled FOVs with low background were used in later analyses, and 108 with high background were removed. In total, we used 454 FOVs with a low level of CK5 background expression for Vectra panel 2.

To determine raw per-cell marker expression levels, we first aggregated the brightness values within each cell boundary. To ensure that the analysis was unaffected by cell size, we then divided the per-cell expression by the cell boundary size determined by the regionprops function. Once normalized, all cells were pooled into cell-by-expression matrices within the same imaging technology and panel (Vectra panel 1, Vectra panel 2 and COMET) for downstream analyses and annotated with patient- and sample-level metadata.
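A minimal sketch of the size normalization follows. It assumes that raw per-cell expression is the sum of pixel intensities inside each segmented cell and that cell size is the pixel area of the label, computed here with NumPy instead of regionprops; the function name and toy arrays are hypothetical.

```python
import numpy as np

def per_cell_expression(label_mask, intensity):
    """Size-normalized marker expression for each segmented cell.

    label_mask: integer image, 0 = background, 1..n = cell labels
                (assumed contiguous, with every label present).
    intensity:  marker image of the same shape.
    """
    labels = np.asarray(label_mask).ravel()
    values = np.asarray(intensity, dtype=float).ravel()
    n = labels.max() + 1
    totals = np.bincount(labels, weights=values, minlength=n)  # summed intensity
    areas = np.bincount(labels, minlength=n)                   # pixel counts
    cell_ids = np.arange(1, n)
    return cell_ids, totals[1:] / areas[1:]

mask = np.array([[0, 1, 1],
                 [2, 2, 2]])
image = np.array([[9.0, 2.0, 4.0],
                  [1.0, 2.0, 3.0]])
print(per_cell_expression(mask, image))  # cells 1 and 2 -> 3.0 and 2.0
```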

We ran Mesmer (v.0.12) on these images with the default parameters to predict cell boundaries, and calculated the cell size, eccentricity and centroid of each cell boundary using the regionprops function (default parameters) in the Python skimage (v.0.23.2) package. We subsampled COMET images by half so that they would fit in system memory. The cells produced by Mesmer followed a bimodal distribution, in which the lower mode consisted mostly of empty segments rather than real cells. We therefore filtered out predicted cell boundaries below the threshold values, including a log2-normalized DAPI intensity of 11. This resulted in a COMET dataset of 6,852,690 cells across 18 FOVs; a Vectra panel 1 dataset of 6,090,968 cells across 664 FOVs; and a Vectra panel 2 dataset of 5,213,051 cells across 602 FOVs.

Given the large differences between the in vitro and in vivo samples, we summarized our classification using coarser cell typing. For each organoid sample, we aggregated the probabilities of different cell states into three categories, including differentiated intestine and fetal/injury repair. These groupings can be interpreted as the likelihood of a cell belonging to any of the several cell states that were combined. The probabilities for each resulting category were plotted using the python-ternary (v.1.0.8)79 package (Fig. 3b and Extended Data Fig. 10b).

Next, we used Palantir to compute expression trends for all the transcription factors and the fetal progenitor signature score along the canonical-to-non-canonical DC (see the ‘Delineation of canonical to non-canonical tumour axes across patients’ section). As we are interested in transcription factors that drive the transition from canonical to fetal cell states, we focused on those with peak expression just prior to entering non-canonical states along the DC of each patient. We located the position of the first maximum of the fetal progenitor signature calculated along this DC for Patients KG 146, 136, 149, and 183. 7b). The first maximum was defined as the first point on the trend at which the first derivative changed from positive to negative. Trends for KG150 and KG183 lack such a sign change in the first derivative, so we used the position of the maximum value for these patients. We then calculated the Pearson correlation between the expression of each transcription factor and the fetal progenitor gene signature score using only cells at positions along a patient’s DC that precede the signature score peak. The correlation values for each factor were different for each patient. The correlation between transcription factors and patient KG146 was only considered, leaving 14 other factors in all four patients.
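The peak-finding rule and the pre-peak correlation can be sketched as below. The sketch works on trend values ordered along the DC; the function names are hypothetical, and applying the correlation to ordered arrays rather than to individual cells is a simplification.

```python
import numpy as np

def signature_peak_index(trend):
    """Index of the first local maximum of a trend ordered along the DC.

    The first maximum is where the first difference changes from positive
    to negative; if no such change exists, fall back to the global maximum.
    """
    trend = np.asarray(trend, dtype=float)
    diff = np.diff(trend)
    sign_change = np.where((diff[:-1] > 0) & (diff[1:] < 0))[0]
    if sign_change.size:
        return int(sign_change[0] + 1)
    return int(np.argmax(trend))

def correlation_before_peak(tf_values, signature_values):
    """Pearson correlation using only positions preceding the signature peak."""
    peak = signature_peak_index(signature_values)
    tf = np.asarray(tf_values, dtype=float)[:peak]
    sig = np.asarray(signature_values, dtype=float)[:peak]
    return float(np.corrcoef(tf, sig)[0, 1])

print(signature_peak_index([0, 1, 3, 2, 5, 4]))  # 2 (first local maximum)
print(signature_peak_index([0, 1, 2, 3]))        # 3 (fallback to global maximum)
```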

For the remaining six transcription factors, we determined their treatment response in HISC-grown organoids by computing the log-transformed fold change between irinotecan-treated and untreated conditions. The data used to calculate the log-transformed fold changes was from single-cell data only.

Source: Progressive plasticity during colorectal cancer metastasis

Multispectral imaging of fluorophores: bridge between in situ and in vitro data using the Vectra multispectral imaging system

Seven-colour multiplex-stained slides were imaged using the Vectra Multispectral Imaging System version 3 (Akoya). Scanning was performed at ×20 (×200 final magnification). Filter cubes used for multispectral imaging were DAPI, FITC, Cy3 and Texas Red. A spectral library containing the emitted spectral peaks of the fluorophores in this study was created using the Vectra image analysis software (Akoya). Using the spectral library, the software separated each multispectral cube into its components and identified the seven marker channels of interest.

Finally, we supplied the augmented affinity matrix from step 2 to the PhenoGraph (v.1.5.7) classify function11 with the default parameters. This function converts the affinity matrix into a row-normalized Markov matrix and computes the probability of random walks starting from unlabelled cells from the in vitro samples, and reaching a class of labelled cells in the in vivo sample. Finally, each unlabelled cell is assigned the cell-state label with the maximum probability.
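The random-walk classification can be illustrated with a small absorbing Markov chain. This is a sketch of the general technique and not the PhenoGraph implementation: labelled cells are treated as absorbing states, and the function name, the use of -1 for unlabelled cells and the toy affinity matrix are assumptions.

```python
import numpy as np

def propagate_labels(affinity, labels):
    """Assign each unlabelled cell the class most likely to absorb its random walk.

    affinity: (n x n) non-negative cell-cell affinity matrix.
    labels:   integer class per cell, -1 for unlabelled cells.
    """
    A = np.asarray(affinity, dtype=float)
    labels = np.asarray(labels)
    P = A / A.sum(axis=1, keepdims=True)        # row-normalized Markov matrix
    unl = np.where(labels < 0)[0]
    lab = np.where(labels >= 0)[0]
    classes = np.unique(labels[lab])
    onehot = (labels[lab][:, None] == classes[None, :]).astype(float)
    P_uu = P[np.ix_(unl, unl)]                  # unlabelled -> unlabelled steps
    P_ul = P[np.ix_(unl, lab)]                  # unlabelled -> labelled steps
    # Absorption probabilities: (I - P_uu)^-1 P_ul, summed per class
    probs = np.linalg.solve(np.eye(unl.size) - P_uu, P_ul @ onehot)
    assigned = labels.copy()
    assigned[unl] = classes[probs.argmax(axis=1)]
    return assigned, probs

# Chain of four cells: 0 and 3 are labelled, 1 and 2 are weakly connected
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
print(propagate_labels(A, [0, -1, -1, 1])[0])  # [0 0 1 1]
```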

As expected given the differences between in vivo and in vitro data, using a standard co-embedding approach consisting of a joint PCA and UMAP, we observed extreme batch effects between the two datasets, making label transfer between similar cells ineffective. We therefore followed the approach outlined previously45 to bridge between datasets. We first computed a nearest-neighbour graph within each dataset and then computed mutual nearest neighbours between datasets after integrating the samples using Harmony77. The distance between cells was quantified with the cosine metric, which is less sensitive to technical artifacts and better reflects biological states in both in situ and in vitro samples. We chose a higher number of neighbours because it is more robust to sparsity.

We used the dotplot function to plot Hotspot module scores for cells grouped by cluster.

Palantir gene trends were visualized as described in the ‘Visualization of module trends’ section using generalized additive models to fit gene expression along Palantir-computed pseudotime (Extended Data Fig. 8c). All expression trends for individual genes were calculated on MAGIC-imputed data (see the ‘Gene denoising and imputation’ section), and the s.d. of each expression bin was represented by the s.d. of the residuals of the fit.

The tumours of four patients contained the most cells with non-canonical fates, all of which were squamous and non-endocrine. We used Palantir to study these fates and the genes underlying them. As an input, Palantir requires an initial state, and as the output, it computes terminal fates and provides a cell-fate map that assigns a probability for each cell to differentiate into each terminal fate. Palantir also outputs a pseudotime alignment of cells from the initial to each of the terminal states and, therefore, by combining pseudotime and fate probability for each cell, it can provide branching gene trends leading to each terminal fate (by weighing the contribution of each cell to the gene trend based on the fate probability). There was a deficiency in the number of non-canonical cells in this patient that led to the separate run of Palantir. We selected cells with the highest imputed expression of LGR5 as the initial state, motivated by their identification as cells of origin in CRC studies. Notably, Palantir has been shown to be robust to the exact choice of starting cell. Running Palantir with 500 waypoints and 6 DCs (a number chosen on the basis of the eigengap) identified three non-intestinal terminal states. We ignored three other branches, which probably represent differentiated intestinal states, as their terminal cells express CDX2 and differentiated markers.

For each module, samples were divided into a high-enrichment group and a low-enrichment group. We then performed a log-rank test on DFS between these groups using the lifelines (v.0.27.4) package in Python74. Survival curves are shown for all modules that yielded significant results (Extended Data Fig. 6k). Correlations between OS and module expression were evaluated by logistic regression models. Each sample was annotated as high or low depending on the ssGSEA score. The signatures of the samples were not included in the analysis. Cox proportional hazards tests were used for analysis. Forest plots are shown for all modules that yielded significant results. R packages survival (v.3.6-4) and survminer (v.0.4.9) were used for the survival analysis.

We compared our fetal signature with previously published dedifferentiation signatures. For each pair of signatures, we calculated the Jaccard index (the number of genes shared between the signatures divided by the total number of distinct genes in either signature), demonstrating that existing signatures are clearly distinct from our fetal signature and lack consensus (Extended Data Fig. 7b). The number of core fetal signature genes in each dedifferentiation signature was assessed relative to the number of genes in that signature.
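As a small worked example of the Jaccard index described here, the sketch below compares two gene lists. The gene names are placeholders and the function name is hypothetical.

```python
def jaccard_index(signature_a, signature_b):
    """Shared genes divided by the distinct genes found in either signature."""
    a, b = set(signature_a), set(signature_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Placeholder gene lists, for illustration only
sig_one = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
sig_two = ["GENE_C", "GENE_D", "GENE_E"]
print(jaccard_index(sig_one, sig_two))  # 2 shared / 5 distinct = 0.4
```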

We found that the first-trimester samples consist predominantly of progenitor cells: proximal progenitor, distal progenitor and stem cells comprise 88% of all cells. The second-trimester samples consist of mature colon mucosal cell types only and exhibit strong expression of TFF3/NEUROD1/POU2F3. We concluded that the separation of the first- and second-trimester samples reflects the differences between progenitor-like cell types and colonic crypts.

Source: Progressive plasticity during colorectal cancer metastasis

Trend analysis of cancer progression using GAMs with smoothing functions. Part 1: Data normalization, dimensionality reduction, and ssGSEA analysis

We used generalized additive models (GAMs) with smoothing functions to analyse module score trends along DC axes (… 7a). GAMs increase robustness, reduce sensitivity to densities and are well suited to capturing nonlinear relationships. Trends were fitted to the module score values using a regression model on the DC values. The smoothed trend was derived by dividing the data into 500 equally sized bins along the DCs and predicting the module score at each bin using the regression fit. We visualized module score trends from the 20th percentile value to the highest saturation (… 7a).
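The binning-and-prediction step can be sketched as follows. Note the substitution: to stay within the standard library, a Gaussian kernel smoother stands in for the GAM, so this illustrates only how a fitted trend is evaluated at 500 equally sized bins along a DC. The function name, the bandwidth and the data are hypothetical.

```python
import math


def smoothed_trend(dc_values, module_scores, n_bins=500, bandwidth=0.1):
    """Predict a smoothed module score at the centre of each bin along a DC.

    A Gaussian kernel smoother is used here as a stand-in for the GAM fit.
    """
    lo, hi = min(dc_values), max(dc_values)
    step = (hi - lo) / n_bins
    centres = [lo + (i + 0.5) * step for i in range(n_bins)]
    trend = []
    for c in centres:
        weights = [math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for x in dc_values]
        total = sum(weights)
        trend.append(sum(w * y for w, y in zip(weights, module_scores)) / total)
    return centres, trend


# Hypothetical data: a module score that rises nonlinearly along the DC
dc = [i / 99 for i in range(100)]
scores = [x ** 2 for x in dc]
centres, trend = smoothed_trend(dc, scores)
print(len(trend), round(trend[0], 3), round(trend[-1], 3))
```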

To characterize trends in cancer progression, we analysed the four patients with a sufficient number of cells in non-canonical states for robust characterization, namely KG146 (3,351 cells), KG182 (935 cells), KG183 (1,203 cells) and KG150 (2,574 cells). We reprocessed each patient individually to capture trends within each patient as faithfully as possible; data from primary tumour, synchronous metastatic tumour and metachronous metastatic tumour samples were pooled for each patient and processed as described in the ‘Data normalization and dimensionality reduction’ section. To identify the largest axes of nonlinear variation in the data, we used diffusion component (DC) analysis, which has been shown to be effective in capturing cell-state transitions in scRNA-seq data75. DCs were calculated for each patient independently to avoid artificially imposing trends from patients with larger samples on those with smaller samples.

The enrichment scores were separated into two groups on the basis of patient status, and the groups were compared using the Mann–Whitney U-test.
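A minimal sketch of a two-sided Mann–Whitney U-test on two groups of enrichment scores is given below. It uses the tie-corrected normal approximation without a continuity correction, which is rough for very small groups; the function name and the scores are hypothetical, and this is not necessarily the implementation used in the study.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U-test; returns (U for x, approximate p value)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical enrichment scores for two patient-status groups
u_stat, p_value = mann_whitney_u([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
print(u_stat, round(p_value, 4))
```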

There were 745 tumour samples from the TCGA-COAD study. Genes with a count of zero were removed, as were genes with more than one symbol or no symbol at all. Variance-stabilizing transformation (VST) was performed using the DESeq2 (v.1.38.3) package72. Subsequently, ssGSEA analysis was conducted using the R package GSVA (v.1.46.0)73. In this cohort, 13% of patients have a survival status of alive and an OS follow-up time of <12 months (7% with an OS follow-up of <6 months); 0.9% of patients have no new tumour event and a DFS follow-up time of <12 months (0.7% with a DFS follow-up of <6 months).
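The gene-filtering step can be sketched as below. The record layout (a list of symbols plus a list of per-sample counts) and the function name are hypothetical, and "a count of zero" is interpreted here as zero counts across all samples, which is an assumption.

```python
def filter_genes(records):
    """Keep genes with exactly one symbol and at least one non-zero count."""
    kept = []
    for symbols, counts in records:
        if len(symbols) != 1:  # drops genes with no symbol or several symbols
            continue
        if sum(counts) == 0:   # assumption: zero counts across all samples
            continue
        kept.append((symbols[0], counts))
    return kept


# Placeholder records, for illustration only
records = [
    (["GENE_A"], [5, 0, 2]),
    (["GENE_B"], [0, 0, 0]),
    (["GENE_C", "GENE_C_ALT"], [3, 1, 4]),
    ([], [7, 2, 1]),
]
filtered = filter_genes(records)
print([symbol for symbol, _ in filtered])  # ['GENE_A']
```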

We visualized the prevalence of non-canonical or canonical modules in patients with matching primary tumours. The Hotspot module groupings and annotations are described in the section. Since cells can express a wide range of non-canonical modules, we labelled them non-canonical if they do not express both of them highly.

Consistency of gene autocorrelation to number of HVGs: we evaluated whether the input cell–cell similarity matrix faithfully captures the structure of the data across input features (highly variable genes, HVGs). Given an input gene set, Hotspot removes genes with low autocorrelation along the k-NN graph, ensuring that only genes that vary along the manifold in an informative manner are selected for module detection. We recomputed modules with different numbers of HVGs to determine whether gene autocorrelations are robust to the number of selected HVGs. For each combination, we calculated the difference between the Hotspot local autocorrelation of each gene and its autocorrelation score when 2,000 HVGs (the value used in this study) were used as input, and visualized the average as a box plot (Supplementary Fig. 2). The maximum difference of 0.07 suggests that the cell similarity graph remains essentially the same regardless of how many HVGs are chosen.
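The comparison against the 2,000-HVG baseline can be sketched as follows. The autocorrelation values and the 3,000-HVG alternative are invented for illustration, the function name is hypothetical, and taking the absolute difference is an assumption (the text states only "difference").

```python
from statistics import mean


def autocorrelation_differences(baseline, alternative):
    """Absolute per-gene difference in autocorrelation for shared genes."""
    shared = sorted(baseline.keys() & alternative.keys())
    return {gene: abs(alternative[gene] - baseline[gene]) for gene in shared}


# Invented autocorrelation scores, for illustration only
baseline_2000 = {"GENE_A": 0.42, "GENE_B": 0.31, "GENE_C": 0.18}
alternative_3000 = {"GENE_A": 0.45, "GENE_B": 0.29, "GENE_D": 0.50}

diffs = autocorrelation_differences(baseline_2000, alternative_3000)
print(sorted(diffs), round(mean(diffs.values()), 3), round(max(diffs.values()), 3))
```

Only genes retained in both runs are compared, since a gene filtered out in one run has no autocorrelation score to difference against.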

The function of the gut depends on the details of the cell types and processes that are described in the modules.

We focused on 23 of the 37 modules that represent meaningful biological gene programs and did not explore the remaining 14 modules (722 genes), which were annotated as cell cycle/fecundity (2 modules), cell stress (4 modules), etc. We manually grouped 19 modules into 6 groups after finding the same biological interpretation for all modules within each group and ensuring that the local correlations between genes of grouped modules were high on average (… 5a,b). This yielded ten final gene modules, comprising six grouped modules and four single modules (… 5a,b and Supplementary Table 4).

We performed differential expression analysis of ISC cells against all untreated tumour cells using MAST (v.1.16.0) and GSEA using relevant cell type gene sets from the literature (Supplementary Table 3) as well as all Hallmark66 and KEGG67 gene sets (Extended Data Fig. 3f). GSEA was performed using the prerank function of the Python package gseapy (v.0.14.0) with 10,000 permutations and the default parameters (Supplementary Table 3).

We identified cancer cells in the epithelial compartment (Extended Data Fig. 3c–e) using two criteria: evidence of copy-number alterations and clustering distinct from non-tumour cells.

We next partitioned clusters into epithelial, stromal and immune compartments based on marker gene expression (Extended Data Fig. 3a,b). We used the score_genes function to score expression of the compartment-specific genes from ref. 61, following a strategy similar to the one used in that study. Each cluster was assigned to the compartment with the maximal score.
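The assignment rule in the last sentence is a simple arg-max over compartment scores. A minimal sketch, with a hypothetical function name and invented per-cluster scores:

```python
def assign_compartments(cluster_scores):
    """Assign each cluster to the compartment with the maximal score."""
    return {
        cluster: max(scores, key=scores.get)
        for cluster, scores in cluster_scores.items()
    }


# Invented compartment scores per cluster, for illustration only
cluster_scores = {
    "cluster_0": {"epithelial": 1.8, "stromal": -0.2, "immune": -0.5},
    "cluster_1": {"epithelial": -0.4, "stromal": 0.1, "immune": 2.3},
    "cluster_2": {"epithelial": -0.3, "stromal": 1.2, "immune": 0.0},
}
assignments = assign_compartments(cluster_scores)
print(assignments)
```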

To generate all gene signature scores in our study, we used the Scanpy (v.1.9.1) score_genes function, which computes the mean expression of the genes of interest minus the mean expression of an expression-matched set of reference genes. To account for expression-level differences, we provided Z-normalized expression data.
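The arithmetic behind such a score can be shown for a single cell. This sketch is not the Scanpy implementation: Scanpy samples the reference genes from expression-matched bins, whereas here the reference set is supplied directly. The function name, gene names and values are hypothetical.

```python
from statistics import mean


def signature_score(cell_expression, signature_genes, reference_genes):
    """Mean expression of the signature genes minus that of the reference genes."""
    signature_mean = mean(cell_expression[g] for g in signature_genes)
    reference_mean = mean(cell_expression[g] for g in reference_genes)
    return signature_mean - reference_mean


# Invented Z-normalized expression values for one cell
cell = {"SIG_1": 2.0, "SIG_2": 1.0, "REF_1": 0.5, "REF_2": -0.5}
score = signature_score(cell, ["SIG_1", "SIG_2"], ["REF_1", "REF_2"])
print(score)  # 1.5 - 0.0 = 1.5
```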