igvR 1.2.0
The igvR package provides easy programmic access in R to the web-based javascript library igv.js in order to create and display genome tracks in its richly interactive web browser visual interface. I am grateful to Jim Robinson, Douglass Turner and colleagues for their fine work.
We introduce igvR via a case study of genetic variants associated with Alzheimer’s Disease as described in the 2014 Nature Genetics paper
Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease1
Theses 11 loci bring the total susceptibility loci count to 20, of which MEF2C is one. We focus on a 1 megabase region surrounding the MEF2C gene which contains 208 variants whose presence is associated with the expression of MEF2C in 17,008 Alzheimer’s disease cases and 37,154 controls. Interestingly, these variants do not occur in coding regions of the gene. Some possible insight into their function – of a highly speculative and preliminary sort - is provided by this case study. We pursue the question: could these non-coding variants alter MEF2C gene regulation by disrupting transcription factor binding sites?
We will proceed as follows:
load the igvR package
connect to the igv.js genome browser
set the reference genome (hg19)
zoom into the MEF2C region
expand to display the entire 1MB region of interest
create our first track - the 208 SNPs - using the most primitive approach: from a base R data.frame
employ Bioconductor’s AnnotationHub to retrieve three methylation datasets - which are associated with open chromatin and transcription factor binding sites. From each of these datasets we create an igvR track of an approriate type - igvR’s Track class has a variety of subclasses depending on data type and data source location. Annotation, Quantitative and Variant tracks are supported here in this first (1.0) version of igvR.
for good measure we conclude by loading a very small slice of a different variant dataset, in VCF format, from another Alzheimer’s study of 818 participants. VCF files typically contain fine-grained variant data, as is seen in the igv.js popup dialog.
The result from running the the code shown:
library(igvR)
library(VariantAnnotation)
library(AnnotationHub)
Create the igvR instance, with all default parameters (portRange, quiet, title). Javascript and HTML is loaded into your browser, igv.js is initialized, a websocket connection between your R process and that web page is constructed, over which subsequent commands and data will travel.
Note that, until the next uprev of the httpuv library, from RStudio, a non-consequential error message appears on your console.
igv <- igvR()
setBrowserWindowTitle(igv, "MEF2C")
setGenome(igv, "hg19")
Display MEF2C
Sys.sleep(5) # wait a few seconds before zooming into MEF2C
showGenomicRegion(igv, "MEF2C")
chrom <- "chr5"
shoulder <- 50000
start.loc <- 88013975 - shoulder
end.loc <- 88199922 + shoulder
mef2c.region <- GRanges(seqnames=chrom, IRanges(start=start.loc, end=end.loc))
showGenomicRegion(igv, list(chrom=chrom, start=start.loc, end=end.loc))
These next few steps expose the nuts and bolts of creating and displaying tracks from tabular data. None of the power and convenience of the Biocondcutor AnnotationHub, of rtracklayer and GenomicRanges is used here - that comes later.
Add the IGAP GWAS track twice, first as an annotation track, so that names are avaialble by clicking, then again as a quantitative track, using a -log10(pValue) score, conveying strength of association of each SNP with the expression of the target gene, MEF2C.
IGAP, the International Genomics of Alzheimer’s Project, provide these data. See Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease, Nature Genetics, Dec 2013.
load(system.file(package="igvR", "extdata", "tbl.mef2cGWAS.variants.RData"))
tbl.mef2cGWAS.variants.bed <- tbl.mef2cGWAS.variants[, c("CHR", "oldPos", "oldPos", "SNP", "P")]
tbl.mef2cGWAS.variants.bed$P <- -log10(tbl.mef2cGWAS.variants.bed$P)
colnames(tbl.mef2cGWAS.variants.bed) <- c("chrom", "start", "end", "name", "score")
track.gwas <- DataFrameAnnotationTrack("IGAP.gwas", tbl.mef2cGWAS.variants.bed, trackHeight=20, color="darkBlue")
displayTrack(igv, track.gwas)
tbl.mef2cGWAS.variants.bedGraph <- tbl.mef2cGWAS.variants.bed[, -4]
track.gwas.numeric <- DataFrameQuantitativeTrack("IGAP.gwas.scored",tbl.mef2cGWAS.variants.bedGraph)
displayTrack(igv, track.gwas.numeric)
A location-only calculation is used to identify gene promoters, rather than more nuanced experimental data, of which we will seem more later in this vignette. We use default, conservative parameters (2kb upstream of TSS, 200bp downstream).
ah <- AnnotationHub()
ah.human <- subset(ah, species == "Homo sapiens")
#----------------------------------------------------------------------------------------------------
# add refseq promoters, available from RefSeq for each transcript which has been identified
#----------------------------------------------------------------------------------------------------
ah.human.refseq <- query(ah.human, "RefSeq", "hg19", "RefSeq Genes")
# download the first set
human.refseq <- ah.human.refseq[[1]]
gr.promoters <- promoters(human.refseq, upstream=2000, downstream=200)
# get rid of score, itemRgb, thick, blocks columns in the mcols, keeping just the transcript name.
# these attributes are meaningful for transcript tracks since those include the represenation
# of UTRs, introns and exons. but a promoter is a stretch of DNA for which those distinctions
# do not apply
mcols(gr.promoters) <- mcols(gr.promoters)[,1]
colnames(mcols(gr.promoters)) <- "name"
ov <- findOverlaps(gr.promoters, mef2c.region)
gr.mef2c.promoters <- gr.promoters[queryHits(ov)]
track.promoters <- UCSCBedAnnotationTrack("promoters", gr.mef2c.promoters, color="darkGreen")
displayTrack(igv, track.promoters)
Non-coding SNPs which fall in promoter regions are plausible candidates for further examination. They may disrupt or create a transcription factor binding site, and thus possibly contribute to the disease under study.
gr.variants <- GRanges(tbl.mef2cGWAS.variants.bed)
ov <- findOverlaps(gr.variants, gr.promoters)
gr.variantsInPromoters <- gr.variants[queryHits(ov)]
track.variantsInPromoters <-GRangesAnnotationTrack("snpInPromoters", gr.variantsInPromoters,
color="red", displayMode="EXPANDED")
displayTrack(igv, track.variantsInPromoters)
H3K4me3 histone methylation marks are often associated with transcription initiation in promoter regions. We will add three tracks from the AnnotationHub for H3K4me3 methylation in Gm12878 (lymphoblastoid) cells.
histone.tracks <- query(ah.human, c("H3K4me3", "Gm12878", "Peak", "narrow")) # 3 tracks
descriptions <- histone.tracks$description
titles <- histone.tracks$title
colors <- rep(terrain.colors(6), 4)
color.index <- 0
for(i in seq_len(length(histone.tracks))){
name <- names(histone.tracks)[i]
color.index <- color.index + 1
gr <- histone.tracks[[name]]
ov <- findOverlaps(gr, mef2c.region)
mef2c.histones <- gr[queryHits(ov)]
track.histones <- GRangesQuantitativeTrack(titles[i], mef2c.histones[, "pValue"],
color=colors[color.index], trackHeight=50)
displayTrack(igv, track.histones)
Sys.sleep(5)
} # for track
We include a very small slice of the ADNI whole genome sequencing data. 349 samples (filtered from a larger set of 128 AD, 415 Mild Cognitive Impairment, 267 controls, 8 uncertains) are included, all in a ~4kb region including methylation marks, and IGAP GWAS SNPs in promoter regions.
In order to illustrate some of genomic data exploration capabilities of igvR we are therefore mixing apples and oranges. We have GWAS variants from IGAP (a very large meta-analysis), promoters from RefSeq (conservatively defined), methylation marks from a lymphoblastoid cell line, and WGS variants from ADNI. Note also that the ADNI variants in the VCF track are not stratified by disease phenotype. Therefore, examining rsid rs80043958 at chr5:88,179,576 with an IGAP GWAS pvalue of 0.002, we see the ADNI heterozygosity is 0.269 - this and other information about that variant is displayed in a popup by clicking on the pinkish “anchor block” - keep in mind that that number would be more informative if broken out by phenotype.
vcfFilename <- system.file(package="igvR", "extdata", "mef2c-4kb.vcf")
vcf <- readVcf(vcfFilename, "hg19")
track.vcf <- VariantTrack("AMPAD VCF", vcf, trackHeight=1000)
Sys.sleep(5)
displayTrack(igv, track.vcf)
Sys.sleep(10) # allow plenty of time for this to complete, and the track to appear, before ending this chunk
sessionInfo()