BPRMeth 1.8.1
## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("BPRMeth")
## Or download from Github repository
# install.packages("devtools")
devtools::install_github("andreaskapou/BPRMeth", build_vignettes = TRUE)
DNA methylation is probably the best studied epigenomic mark, due to its well established heritability and widespread association with diseases. Yet its role in gene regulation, and the molecular mechanisms underpinning its association with diseases, are still imperfectly understood. While the methylation status of individual cytosines can sometimes be informative, several recent papers have shown that the functional role of DNA methylation is better captured by a quantitative analysis of the spatial variation of methylation across a genomic region.
The BPRMeth
package is a probabilistic method to quantify explicit features of methylation profiles, in a way that would make it easier to formally use such profiles in downstream modelling efforts, such as predicting gene expression levels or clustering genomic regions according to their methylation profiles. The original implementation [1] has now been enhanced in two important ways: we introduced a fast, variational inference approach which enables the quantification of Bayesian posterior confidence measures on the model, and we adapted the method to use several observation models, making it suitable for a diverse range of platforms including single-cell analyses and methylation arrays. Technical details of the updated version are explained in [2].
In addition to being a flexible tool for methylation data, the proposed framework is in principle deployable to other measurements with a similar structure, and indeed the method was already used for single-cell chromatin accessibility data in [3].
Mathematically, BPRMeth
is based on a basis function generalised linear model. The basic idea is as follows: the methylation profile associated to a genomic region \(D\) is defined as a (latent) function \(f\colon D\rightarrow (0,1)\) which takes as input the genomic coordinate along the region and returns the propensity for that locus to be methylated. In order to enforce spatial smoothness, and to obtain a compact representation for this function in terms of interpretable features, we represent the profile function as a linear combination of basis functions
\[ f(x)=\Phi\left(\mathbf{w}^Th(x)\right) \]
where \(h(x)\) are the basis functions (Gaussian bells by default), and \(\Phi\) is a probit transformation (Gaussian cumulative distribution function) needed in order to map the function output to the \((0,1)\) interval. The latent function is observed at specific loci through a noise model which encapsulates the experimental technology.
The optimal weight parameters \(\mathbf{w}\) can be recovered either by Bayesian inference or maximum likelihood estimation, providing a set of quantitative features which can be used in downstream models: in [1] these features were used in a machine learning predictor of gene expression, and to cluster genomic regions according to their methylation profiles. Modelling details and mathematical derivations for the different models can be found online: http://rpubs.com/cakapourani.
The workflow diagram and functionalities of the BPRMeth
package for analysis of methylation profiles are shown in Figure 2.
To illustrate the functionality of the BPRMeth
package we will use real datasets that are publicly available from the ENCODE project consortium [4]. More specifically we will focus on the K562 immortalized cell line, with GEO: GSE27584 for the bulk RRBS data and GEO: GSE33480 for the RNA-Seq data. We will use directly the preprocessed files,
where we have kept only the protein coding genes, and for the purpose of this vignette we focus only on chr12 and chr13. Full details of where to download the data and how to preprocess them are included in the Supplementary Material [1].
Due to its general approach, BPRMeth
can be used for a wide range of technologies such as array based, bulk sequencing (both RRBS and WGBS) and single cell sequencing methylation datasets. It is only required that the data are in the right format and we call the appropriate observation model depending on the type of technology. Since the encode data are generated from bulk RRBS experiments, the observation model for the BPRMeth
package will be Binomial.
The BPRMeth
package provides methods for reading files for methylation, annotation and expression data, however they do not cover all possible data formats available for describing biological datasets. The user can implement his own methods for reading files with different formats, provided that he can create an object similar to what is described below. First, we load the preprocessed methylation data as follows:
library(BPRMeth) # Load package
met_region <- encode_met
The met_region
object contains the following important slots:
met
: A list containing the methylation region, where each element of the list is a different genomic region. More specifically, each methylation promoter region is an \(L_{i} \times 3\) dimensional matrix, where \(L_{i}\) denotes the number of CpGs found in region \(i\). The columns contain the following information:
anno
: Corresponding annotation data for each genomic region as GRanges
object. The anno
slot is only required for downstream analysis, e.g. to predict gene expression levels from methylation profiles, since we need to map the genomic regions with the gene IDs.Now let’s observe the format of the met_region
object. For example, we can access the \(20^{th}\) promoter methylation region as follows
head(met_region$met[[20]])
## [,1] [,2] [,3]
## [1,] -0.8787 42 1
## [2,] -0.8779 42 0
## [3,] -0.8770 42 2
## [4,] -0.8759 42 0
## [5,] -0.8754 42 0
## [6,] -0.8667 9 1
Below are the annotation data
head(met_region$anno)
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | id centre
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr12 562529-576529 + | ENSG00000139044 569529
## [2] chr12 745147-759147 + | ENSG00000177406 752147
## [3] chr12 1051888-1065888 - | ENSG00000002016 1058888
## [4] chr12 1696331-1710331 - | ENSG00000171823 1703331
## [5] chr12 2020870-2034870 - | ENSG00000151062 2027870
## [6] chr12 2897118-2911118 + | ENSG00000004478 2904118
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
and the total number of genomic regions are
NROW(met_region$anno)
## [1] 339
In the previous section for simplicity we provided a pre-processed object which we then can directly use for inferring methylation profiles. The steps required to create this object are really simple and follow the diagram in Figure 2. First we need to read the methylation data using the read_met
function. Also we need read annotation data using the read_anno
file. Note that the annotation file can contain any genomic context: from promoters and gene bodies to enhancers, Nanog regulatory regions and CTCF regions; hence BPRMeth
can be used for a plethora of analyses that want to take spatial genomic correlations into account. Finally, the create_region_object
will create the methylation regions object which is the main object for storing methylation data. The following lines of code provide an example of creating this object ( Do not run )
# Read bulk methylation data
met_dt <- read_met(file = "met_file", type = "bulk_seq")
# Read annotation file, which will create annotation regions as well
anno_dt <- read_anno(file = "anno_file")
# Create methylation regions that have CpG coverage
met_region <- create_region_object(met_dt = met_dt, anno_dt = anno_dt)
If the goal is to relate methylation profiles to gene expression levels, we need also to store expression data. Again, here we have pre-processed expression data which can be loaded as follows
expr <- encode_expr
The structure of the expression data is rather simple, it is a data.table with two columns as shown below
head(expr)
## id expr
## 1: ENSG00000230032 -3.321928
## 2: ENSG00000226210 3.833477
## 3: ENSG00000225143 -1.967380
## 4: ENSG00000234465 -3.321928
## 5: ENSG00000206114 -3.321928
## 6: ENSG00000235591 -3.321928
To create the expression data (which are \(log_{2}\) transformed) we could call ( Do not run )
# Read expression data and log2 transform them
expr <- read_expr(file = "expr_file", log2_transf = TRUE)
For each genomic region, we will learn a methylation profile using the Binomial observation model with a specified number of Radial Basis Functions (RBFs) \(M\). For a single input variable \(x\), the RBF takes the form \(h_{j}(x) = exp(-\gamma || x - \mu_{j} ||^2)\), where \(\mu_{j}\) denotes the location of the \(j^{th}\) basis function in the input space and \(\gamma\) controls the spatial scale. The case when \(M = 0\) is equivalent to learning the mena methylation rate for the given region (i.e. learn a constant function).
For our running example, we will create two RBF objects, one with 5 basis functions and the other with 0 basis functions denoting the mean methylation rate
# Create RBF basis object with 5 RBFs
basis_profile <- create_rbf_object(M = 5)
# Create RBF basis object with 0 RBFs, i.e. constant function
basis_mean <- create_rbf_object(M = 0)
The rbf
object contains information such as the centre locations \(\mu_{j}\) and the value of the spatial scale parameter \(\gamma\)
# Show the slots of the 'rbf' object
basis_profile
## $M
## [1] 5
##
## $mus
## [1] -0.6666667 -0.3333333 0.0000000 0.3333333 0.6666667
##
## $gamma
## [1] 6.25
##
## $eq_spaced_mus
## [1] TRUE
##
## $whole_region
## [1] TRUE
##
## attr(,"class")
## [1] "rbf"
The \(\gamma\) is computed by the number of bases \(M\), however the user can tune it according to his liking. Except from RBF basis, the BPRMeth
pacakge provides polynomial and Fourier basis functions which can be created with the create_polynomial_object
and create_fourier_object
functions, respectively.
Learning the methylation profiles is equivalent to inferring the model parameters \(\mathbf{W}\) of the generalised linear regression model. These parameters can be considered as the extracted features which quantitate precisely notions of shape of a methylation profile.
For performing parameter inference, the BPRMeth
package is enhanced to provide both maximum likelihood estimation and Bayesian inference for the parameters. Specifically, for Bayesian estimation it provides an efficient mean-field variational inference (Variational Bayes) and a Gibbs sampling algorithm. For computational efficiency one should choose the VB implementation (Note that for Beta observation models, one should use MLE to infer the profiles, since currently the package does not provide Bayesian treatment for Beta likelihood).
To infer the methylation profiles using VB we need to call the infer_profiles_vb
function
fit_profiles <- infer_profiles_vb(X = met_region$met, model = "binomial",
basis = basis_profile, is_parallel = FALSE)
fit_mean <- infer_profiles_vb(X = met_region$met, model = "binomial",
basis = basis_mean, is_parallel = FALSE)
This results in a infer_profiles
object whose most important slot is the inferred weight matrix W
, below we show the structure of the fit_profiles
object
# Show structure of fit_profiles object
str(fit_profiles, max.level = 1)
## List of 9
## $ W : num [1:339, 1:6] 0.546 6.128 -0.235 -0.389 -0.172 ...
## ..- attr(*, "dimnames")=List of 2
## $ W_Sigma :List of 339
## $ alpha : num [1:339, 1] 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 ...
## $ beta : num [1:339, 1] 35 35 3.887 1.03 0.393 ...
## $ basis :List of 5
## ..- attr(*, "class")= chr "rbf"
## $ nll_feat : num [1:339, 1] 61.6 101 88.9 90 30.2 ...
## $ rmse_feat : num [1:339, 1] 0.1055 0.0922 0.2108 0.1311 0.2783 ...
## $ coverage_feat: int [1:339, 1] 29 26 17 41 13 47 16 16 38 21 ...
## $ lb_feat : num [1:339, 1] -506 -185.3 -149.6 -229 -98.7 ...
## - attr(*, "class")= chr [1:3] "infer_profiles_vb_binomial" "infer_profiles_vb" "infer_profiles"
Below we show an example promoter region together with the fitted methylation profiles, where the points represent the DNA methylation level of each CpG site. The shape of the methylation profiles is captured by the blue curve, whereas the red line denotes the mean methylation rate
# Choose promoter region 21 -> i.e. LEPREL2 gene
gene_id <- met_region$anno$id[21]
p <- plot_infer_profiles(region = 21, obj_prof = fit_profiles,
obj_mean = fit_mean, obs = met_region$met,
title = paste0("Gene ID ", gene_id))
print(p)
To quantitatively predict expression at each region, we construct a regression model by taking as input the higher-order methylation features learned from the infer_profiles_vb
. In addition to these features, we consider two supplementary sources of information: (1) the goodness of fit in RMSE and (2) the CpG density. For our analysis an SVM regression model is considered. We will use \(70\%\) of the data for training, and we will test the model’s performance on the remaining \(30\%\).
All the aforementioned steps are assembled in the predict_expr
wrapper function:
# Set seed for reproducible results
set.seed(22)
# Perform predictions using methylation profiles
pred_profile <- predict_expr(prof_obj = fit_profiles, expr = expr,
anno = met_region$anno, model_name = "svm",
fit_feature = "RMSE", cov_feature = TRUE,
is_summary = FALSE)
# Perform predictions using mean methylation rate
pred_mean <- predict_expr(prof_obj = fit_mean, expr = expr,
anno = met_region$anno, model_name = "svm",
is_summary = FALSE)
We can now compare the Pearson’s correlation coefficient \(r\) for both models and observe that the higher-order methylation features achieve test correlations twice as large compared to mean methylation rate. Note that someone should use cross-validaion to get a better estimate of the correlation performance.
print(paste("Profile r: ", round(pred_profile$test_errors$pcc, 2)))
## [1] "Profile r: 0.66"
print(paste("Mean r:", round(pred_mean$test_errors$pcc, 2)))
## [1] "Mean r: 0.31"
Below in Figure 4 we show a scatter plot of the predicted and measured expression values for the and of the K562 cell line.
g1 <- plot_predicted_expr(pred_obj = pred_profile,
title = "Methylation profile")
g2 <- plot_predicted_expr(pred_obj = pred_mean,
title = "Methylation rate")
g <- cowplot::plot_grid(g1, g2, ncol = 2, nrow = 1)
print(g)
Another application of the BPRMeth
model is to use the higher-order methylation features to cluster DNA methylation patterns across genomic regions and examine whether distinct methylation profiles are associated to different gene expression levels. To cluster methylation profiles, a mixture modelling approach is considered and we perform inference either via EM algorithm (MLE estimates) or using the mean-field variational inference (Variational Bayes VB) model. In this example we will use the VB model to perform variational inference.
The BPRMeth
package provides the cluster_profiles_vb
function for performing Bayesian clustering, where the user needs to provide the number of clusters \(K\), the methylation regions \(X\), the observation model and a basis object. For efficiency we run on 20 VB iterations.
# Set seed for reproducible results
set.seed(12)
# Create basis object
basis_obj <- create_rbf_object(M = 3)
# Set number of clusters K
K <- 4
# Perform clustering
cl_obj <- cluster_profiles_vb(X = met_region$met, K = K, model = "binomial",
alpha_0 = .5, beta_0 = .1,
basis = basis_obj, vb_max_iter = 20)
Figure 5 shows the fitted methylation profiles for each cluster.
cluster_plot <- plot_cluster_profiles(cluster_obj = cl_obj)
print(cluster_plot)
The structure of the package makes it straighforward to work with methylation data generated from different technologies. For example, with methylation array data, the most common approach is to use the Beta distribution to capture the methylation level from the array experiment. Below we provide a minimal analysis step for inferring methylation profiles on synthetic data; note that we just need to define that model = beta
. (In some cases, researchers work with M-values, where they transform the Beta values to \((-\inf, \inf)\), i.e. they make them Gaussian random variables, if that is the case the BPRMeth
package can handle this by setting model = gaussian
.) Note that for the Beta model, we currently provide only maximum likelihood estimation (MLE).
basis_obj <- create_rbf_object(M = 5)
# Infer beta profiles
beta_prof <- infer_profiles_mle(X = beta_data, model = "beta",
basis = basis_obj, is_parallel = FALSE)
p <- plot_infer_profiles(region = 15, obj_prof = beta_prof,
obs = beta_data, title = "Beta profile")
print(p)
Similarly, someone might analyse single cell methylation data, where now would need to set model = bernoulli
as shown in the code snippet below using again synthetic data.
basis_prof <- create_rbf_object(M = 5)
basis_mean <- create_rbf_object(M = 0)
bern_prof <- infer_profiles_vb(X = bernoulli_data, model = "bernoulli",
basis = basis_prof, is_parallel = FALSE)
bern_mean <- infer_profiles_vb(X = bernoulli_data, model = "bernoulli",
basis = basis_mean, is_parallel = FALSE)
p <- plot_infer_profiles(region = 50, obj_prof = bern_prof,
obj_mean = bern_mean, obs = bernoulli_data,
title = "Bernoulli profile")
print(p)
We can finally cluster the single-cell methylation profiles as follows
# Perform clustering
cl_obj <- cluster_profiles_vb(X = bernoulli_data, K = 3, model = "bernoulli",
basis = basis_obj, vb_max_iter = 40)
cluster_plot <- plot_cluster_profiles(cluster_obj = cl_obj)
print(cluster_plot)
This vignette was compiled using:
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] BPRMeth_1.8.1 GenomicRanges_1.34.0 GenomeInfoDb_1.18.1
## [4] IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0
## [7] BiocStyle_2.10.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.4 purrr_0.2.5
## [4] colorspace_1.3-2 htmltools_0.3.6 yaml_2.2.0
## [7] rlang_0.3.0.1 e1071_1.7-0 pillar_1.3.1
## [10] glue_1.3.0 RColorBrewer_1.1-2 bindrcpp_0.2.2
## [13] GenomeInfoDbData_1.2.0 foreach_1.4.4 plyr_1.8.4
## [16] bindr_0.1.1 stringr_1.3.1 zlibbioc_1.28.0
## [19] munsell_0.5.0 gtable_0.2.0 codetools_0.2-16
## [22] evaluate_0.12 labeling_0.3 knitr_1.21
## [25] class_7.3-15 highr_0.7 Rcpp_1.0.0
## [28] scales_1.0.0 BiocManager_1.30.4 XVector_0.22.0
## [31] ggplot2_3.1.0 png_0.1-7 digest_0.6.18
## [34] stringi_1.2.4 bookdown_0.9 dplyr_0.7.8
## [37] grid_3.5.2 cowplot_0.9.3 tools_3.5.2
## [40] bitops_1.0-6 magrittr_1.5 lazyeval_0.2.1
## [43] RCurl_1.95-4.11 tibble_2.0.0 crayon_1.3.4
## [46] pkgconfig_2.0.2 data.table_1.11.8 matrixcalc_1.0-3
## [49] assertthat_0.2.0 rmarkdown_1.11 iterators_1.0.10
## [52] R6_2.3.0 compiler_3.5.2
[1] Kapourani, C. A., & Sanguinetti, G. (2016). Higher order methylation features for clustering and prediction in epigenomic studies. Bioinformatics, 32(17), i405-i412.
[2] Kapourani, C. A. and Sanguinetti, G. (2018). BPRMeth: a flexible Bioconductor package for modelling methylation profiles. Bioinformatics, DOI: https://doi.org/10.1093/bioinformatics/bty129
[3] Clark, S.J., Argelaguet, R., Kapourani, C.A., Stubbs, T.M., Lee, H.J., Alda-Catalinas, C., Krueger, F., Sanguinetti, G., Kelsey, G., Marioni, J.C., Stegle, O. and Reik, W., (2018). scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nature communications, 9(1), p.781.
[4] ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57-74.
This package was developed at the University of Edinburgh in the School of Informatics, with support from Guido Sanguinetti.
This study was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh, and by the European Research Council through grant MLCS306999.