There are proteomic associations with genetics and health in the UK Biobank


Protein association analysis using the UK Biobank dataset with European ancestry data from baseline via inverse-rank normalization and quality control

Olink Explore 3072 is a part of UKB-PPP and was used to measure the 54,265 samples collected at the baseline visit. All participants provided informed consent. A large majority of the samples were randomly selected across the UK Biobank, and only those were used for the analysis presented here. Many GWASs using the UKB data have been based on the European ancestry of 409,559 participants who identify as white British. To better leverage the value of a wider range of UKB participants, we have defined three separate groups of individuals, each with their own set of genes.

There are details in the supplementary information about data processing and quality control. One protein (GLIPR1) had >80% of data failing quality control (99.4% failing quality control; Supplementary Table 3) and was excluded from analyses. We did not perform further NPX processing after the quality-control procedures described in the Supplementary Information. Each protein level was inverse-rank normalized, including NPX data below the LOD, before analyses and association testing.

We used a linear regression to estimate the association ofprotein levels with quantitative traits. Logistic regression was used to estimate the association ofProtein levels with a prior or past disease. All analyses were adjusted for the sex and age of the individual at the time of plasma collection, and in addition, quantitative measures were inverse normal transformed.

To estimate how the number of associations scaled with sample size, we took random samples without replacement of [100, 250, 500, 1,000, 5,000, 10,000, 15,000, 20,000, 25,000, 30,000, 35,000, 40,000, 45,000 and 50,000] from the full cohort, then performed the association testing and examined the proteomic variance explained in the exact same manner as for the main analyses described above. We looked at how associations were adjusted against the number of proteins measured, for the likelihood that additional protein would be less abundant in the blood. We performed random subsampling of [100, 250, 500, 1,000, 1,500, 2,000, 2,500, 2,800] proteins starting preferentially from the most expected abundant dilution, a priori, (1:100,000) to the least abundant dilution (1:1). We did multiple samples to make sure the results were consistent and stable.

We adjusted the age, sex, and sample age of the measuring cups for each test. We standardized the residuals using rank-inverse normal transformation and used the standardized values as phenotypes for genome-wide association testing using a linear mixed model (BOLT-LMM71). There isInflation in test statistics because of cryptic relatedness and stratification.

Before the analysis, individual NPX were normalized with data below the LOD. The discovery cohort had associations with age, sex, age2 and time between blood sampling and measurement. The covariates in the replication and full cohort with genetic ancestry analyses included the participant’s preselectedness, either by the UKB-PPP consortium members or part of the COVID 19 repeat-imaging study.

We used a conservative multiple-comparison-corrected threshold of P < 1.7 × 10−11 (5 × 10−8 adjusted for 2,923 unique proteins) to define significance. We defined primary associations through clumping ±1 Mb around the significant variants using PLINK60, excluding the HLA region (chromosome 6: 25.5–34.0 Mb), which is treated as one locus owing to complex and extensive LD patterns. Overlapping regions were merged into one, deeming the variant with the lowest P value as the sentinel primary associated variant. When we first determine the most significant association, we group it together with the primary associations that overlap with the major marginal associations of all the proteins. In cases in which the primary associations contained marginal associations that overlapped across multiple groups, we grouped together these regions iteratively until convergence.

Annotation was done using the EnsemblVariant Effect Predictor,WGS Annotator, and UCSC Genome Browser. The gene/protein consequence was based on RefSeq and Ensembl. We reported exon and intron numbers that a variant falls in as in the canonical transcripts. For synonymous mutations, we estimated the rank of genic intolerance and consequent susceptibility to disease based on the ratio of loss of function. PolyPhen and SIFT were used for coding variant scores. The Encyclopedia of DNA Elements Project has histone marks and histone marks on open and closed chromatin areas for non-coding variants. We mapped out the nearby genes and provided a distance from the 5′ transcription start site to the variant. The combined annotation dependent depletion score (https://cadd.gs.washington.edu) was estimated for non-coding variants. An enrichment analysis hypergeometric test was performed to figure out the amount of pQTL in specific regions.

We investigated evidence of shared genetic associations between variants directly affecting circulating protein expression levels and tissue-level gene expression using the coloc with SuSiE framework61. We used SuSiE regression to define independent e QTS signals for 49 tissues from GTEx31 v.8 with genes that have significant results in the marginal e QTS associations. We used default priors to conduct colocalization analyses between cis pQTL and eQPL signals and found a correlation between the two signals. The directionality of the estimates on gene expression was compared to the directionality of the top variant of each pQTL signal.

The effect of PCSK9 abundance on genetic liability was estimated using a two- sample randomization. The pooled effect of Wald ratios was estimated using the two-term Taylor series expansion of the Wald ratio as well as using the weighted delta inverse-variance weighted method to estimate individual SNP effects. Results from the Mendelian randomization analyses were analysed using standard sensitivity analyses. We used filters to find evidence of the estimated effect from PCSK9 abundance to the outcome not being due to reverse causality.

The genes on the Olink panel were used to test the tissue enrichment of associated genes. We used data from a previous study to enrichment human genes, while using the data from the Human Protein Atlas70 for orthologous mouse genes. The number of tissues tested where applicable caused the enrichment P-value thresholds to be corrected.

The UKB has a blood-type imputation method that imputes the ABO blood group using three single-syllable genes. The secretor status was determined using the inactivating Mutation rs603338, with GG or GA and AA being non secretors. The blood group and secretor status were adjusted for the same variables as in the analyses of the other protein’s covariates. A multiple-testing threshold of P < 1.7 × 10−5 (0.05/2,923 proteins) for the interaction terms was used to define statistically significant interaction effects.

A comparative study of the effect of blood cell composition on the gene associations with plasma proteins in hospitalized COVID-19 cases and reported infections

For colocalization with COVID-19 loci, the top loci reported by the COVID-19 Host Genetics consortium (https://app.covid19hg.org/variants) were updated with estimates from the R7 summary results (https://www.covid19hg.org/results/r7/) for hospitalized cases of COVID-19 and reported COVID-19 infections compared with population controls. We used a region association threshold to perform multi-trait colocalization across all of the significant genes.

We investigated the effect of blood cell composition on the genetic association with plasma proteins through sensitivity analyses of pQTLs from the discovery analyses. The top hits from the discovery analyses were reanalysed adjusting for the following blood cell covariates: monocyte count; basophil count; lymphocyte count; neutrophil count; eosinophil count; leukocyte count; platelet count; haematocrit percentage; and haemoglobin concentration. These blood cell covariates were selected to represent blood cell composition due to their common clinical use. We followed the previously described method to exclude blood cell measures from individuals with certain medical conditions. Relevant medical conditions for exclusion included pregnancy at the time the complete blood count was performed, congenital or hereditary anaemia, HIV, end-stage kidney disease, cirrhosis, blood cancer, bone marrow transplant and splenectomy. Extreme measures were defined as leukocyte count, >200 × 109 per l or >100 × 109 per l with 5% immature reticulocytes; haemoglobin concentration, >20 g dl−1; haematocrit, >60%; and platelet count, >1,000 × 109 per l. Following these exclusions and quality control, genetic analyses of the sentinel variant–protein associations adjusted for blood cell covariates were performed using the same approach as described for the main analysis.

We further tested whether blood cell composition is partially or fully mediating variant–protein associations (genotype → blood cell measure → protein) for genetic associations that were significant within the discovery (P < 1.7 × 10−11) and not in the sensitivity analyses (P > 1.7 × 10−11). We started with the blood cell phenotypes and identified them as the predictors, which were then adjusted for all other co-variates. We have found that there was an association between the genetic variation and the blood cell phenotypes and each of the blood cell genes. We compared the strength of the associations to the strength of the association in a multivariable model, to see if they are either fully or partially related.

To create dynamic test regions that accounted for potential long-range LD, we performed a two-step clumping procedure using PLINK with the parameters (1) –clump-r2 0.1 –clump-kb 10000 –clump-p1 1.7×10−11 –clump-p2 0.05 on the marginal association summary statistics and (2) –clump-kb 500 on the results of the first clumping step. We defined the test regions as the clumps with the left and rightmost variant coordinates extended to a minimum size of 1 MB.

Each test region was applied with SuSiE regression using the first parameters min-abs_corr and L. When SuSiE found the maximum number of independent credible sets, we incremented L by 1 until there were no additional credible sets. We applied a post hoc filter to remove credible sets in high LD with another credible set in the same region (lead variants r2 > 0.8). For regions with multiple credible sets, we assessed statistical independence by performing multiple linear regression with the most probable variants for each credible set and the same genotype and phenotype residuals.

For pleiotropic pQTL loci and multiple associated trans pQTL proteins, gene-set enrichment analyses were performed by ingenuity pathway analysis to identify enrichment of biological functions relevant to cell-to-cell signalling, cellular development, development and process. Gene pathways and networks annotated based on STRING-db and KEGG pathway databases were also used for enrichment analyses. Estimation of statistical significance and clustering trees and networks were generated using hyperbolic tests. The false discovery rate is a factor that can be adjusted to correct for multiple testing. FDR 0.07 was considered to be significant.

Samples likely to be incorrectly labelled were identified based on individual predictions of sex by protein levels, and of protein levels by genotypes. Whole plates and rows were excluded from the data because they were likely to be wrongly labelled. 13 whole plates and five rows of samples were not included in the 1,179 samples due to the Expansion set of assays. From the 1536 set of assays, this resulted in the exclusion of four whole plates and seven rows or columns of samples, in total 404 samples. In the 1536 set of assays, a single panel was excluded for two plates, impacting 174 samples.

In Iceland health care records were used to create lists of disease diagnoses. This resulted in 275 case–control phenotypes. We furthermore had measurements of 110 quantitative traits from various sources, in general not measured at the same time as the plasma was collected.

We used 1,474 and 419 additional duplicate measurements of samples to evaluate assay precision for the Olink Explore (UKB sets) and SomaScan (Iceland 36K) platforms, respectively. Two of the samples were chosen at random for measuring more than twice.

Genome based data processing using Olink Explore and DeCODE for large-scale plasma proteomics comparisons through genetics and disease associations

URLs for external data used are as follows: the GWAS catalogue (https://www.ebi.ac.uk/gwas/), the GTEx project (https://gtexportal.org/home/), the Human Protein Atlas (https://www.proteinatlas.org/), STRING database (https://string-db.org/; file name: 9606.protein.actions.v11.txt.gz) and UniProt (https://www.uniprot.org/).

We used the following software in conjunction with the above methods. GATK resource bundle is available at http://www.DecodeGenetics.co.uk/. Thegenomics and pQTL processing can be described in detail. We used Olink Explore to process data that was generated on the Olink platform. Data was analysed and figures were created using a number of programs, along with Python and R.

We assigned genomic coordinates to assay targets using UniProt IDs70 for each assay provided by the manufacturer. Out of 4,963 valid assays on the SomaScan platform (excluding non-human proteins and assays marked as defective by the manufacturer), this resulted in 4,961 assays getting assigned the genomic coordinates of their intended targets. Out of 2,941 valid assays on the Olink Explore platform, this resulted in 2,923 assays getting assigned the genomic coordinates of their intended targets.

DeCODE performed the Olink samples at their facility in accordance with the Olink Explore manual. The same quality control measures were used by Olink for the UK Biobank samples.

Source: Large-scale plasma proteomics comparisons through genetics and disease associations

Comparison of normalized and non-normalized data in SomaScan v4 based protein-protein microarray studies using the robust median absolute deviation estimator

The SomaScan platform utilizes a surface bound enrichment of proteins alongside a universal polyanionic competitor to prevent transient non-specific interactions62. SomaScan v4 consists of 4,907 aptamer-based assays targeting 4,719 proteins. A aptamer is a short, single-linked oligonucleotide thatbinds to a target. The bound aptamers are quantified with a DNA microarray. In most studies performed using SomaScan, the last step in the normalization process adjusts the median protein levels for each individual to a reference5,64. As this can affect the correlation of protein levels to other factors, some studies omit this step2. We refer to the former data as normalized and the latter as non-normalized. In addition to using non-normalized SomaScan protein measurements as we had done before2, we also applied SomaLogic’s SMP normalization64 and performed all analyses using both non-normalized and normalized data. Comparison of the two normalization methods can be found in Supplementary Note 4.

Accuracy refers to how similar repeated measurements will be while CV is the s.d. of measurements divided by their mean. CV is a measure of accuracy, it is not in all cases. If the platform was to produce random values in a tight range, it would have a low CV but no accuracy.

We assumed that the two duplicate measurements were independent of each other, so we estimated the CV for the available duplicate measurement and compared it to the expectation of the CV. We used the robust median absolute deviation estimator to estimate the s.d. of the repeated measurements on the log-scale and inserted this estimate into the formula for the CV above, obtaining CVrep. The CV of allprotein levels is equal to the CV of CVrep and it will be expected to be zero if repeated measurements show a different ratio.

Both Olink and SomaScan use repeated measurements of control samples, specific to the platform, for quality control. When using two measurements of the same control sample on the same plate to evaluate the CV, the evaluation does not include the inter-plate variation, while the CV estimated assuming that the samples are not measured on the same plate but chosen at random from the set of all samples does include inter-plate variation. Comparing the CV ratio computed from the repeated control samples between the two platforms can therefore help comparison in terms of batch effects, with values closer to one suggesting that the platform is less susceptible to batch effects and closer to zero that the platform is more so (Supplementary Note 2, Supplementary Fig. 9).

Source: Large-scale plasma proteomics comparisons through genetics and disease associations

Bayesian colocalization in the GWAS data: Sentinel vs top pQTL variants and associated factor ontologies

In the previous study, we adjusted P values for multiple tests by using the same significance threshold as in our previous study, and we used a likelihood ratio test to calculate P values.

We calculated r2 values (based on the Icelandic population for SomaScan-Iceland and the UKB-BI population for Olink-UKB) between all sentinel pQTL variants and top (most significantly associated) variants per Mb bin and per experimental factor ontology (EFO) term reported in the NHGRI-EBI GWAS catalogue74 (downloaded 7 April 2022), using the same methods as described previously2.

The threshold for replicating between platforms was 0.05 with requirement that the initial and replication associations were in the same direction.

We considered sequence variants from the conditional analysis to belong to the same region if they were within 2 Mb of each other. The major histocompatibility complex is a 38 chr region. 6:25.5-34.0MB) as a single region. The most significant variant in each region is referred to as the sentinel variant, while the other variant is called a secondary variant.

We used the ‘LD-based clumping approach’ proposed by Sun et al.6 to identify pQTLs associating with multiple assays: we considered variants associating with a different assay to belong to the same pQTL if they are in high LD (r2 > 0.8).

For a given P value threshold P, sample size N, effect size β, and MAF f, the probability of rejecting the null hypothesis of no association is given by 1 − F(X − 1(1 − P), 2Nβ2f(1 − f)), where X–1(·) denotes the inverse cumulative distribution function (inverse CDF) of the chi-squared distribution with one degree of freedom, while F(a, b) denotes the CDF of the non-central chi-squared distribution with one degree of freedom for quantile a and non-centrality parameter73 b.

The COLOC software package was used to test the colocalization of the pQTL signals with other signals. We used statistics to calculate Bayes factors for each of the variations in the associated region for pQTL A and B and COLOC to calculate the probability that the two features are related. Prior probabilities for COLOC were left at default.

We note that the cis pQTLs are probably in high LD with a PAV or cis eQTL on both platforms compared to trans. There were no differences between the results on both platforms and when limited to the specific targets of both platforms.

The same approach was used as in Sun et al.6 whereProteins annotated as “membrane” by the Human Protein Atlas were considered to be secret.

Source: Large-scale plasma proteomics comparisons through genetics and disease associations

Plasma processing and incubation of bead-immunocomplexes and resorufin -d-galactopyranosides

Blood was collected in inverted tubes and then put into centrifuges for 10 min at 4,000g. The samples were frozen to 80 C. The aliquors were kept away from light while they thaw on ice. The aliquots were mixed by inverting the tubes a few times, and thencentrifugationd for 10min at 3,224g at 4 C.

Plasma samples were measured in duplicates with commercially available Simoa NF-light Advantage (SR-X) kit (Quanterix, cat. 103400). samples were given a 2:1 ratio and incubated with 25 l anti-NF- light immunocapture beads and 20 l biotinylated detector antibodies for 30 min. Following the incubation, the bead-immunocomplexes were washed and resuspended before being incubated with 100 µl streptavidin-labelled β-galactosidase at 30 °C and 800 rpm for 10 min. After a second washing step, the bead-immunocomplexes and resorufin β-d-galactopyranoside were loaded onto an SR-X instrument (Quanterix) for processing and analysis.