CAGEr 1.24.0
This document briefly how to use CAGEr CAGEr, a Bioconductor package designed to process, analyse and visualise Cap Analysis of Gene Expression (CAGE)sequencing data. CAGE (Kodzius et al. 2006) is a high-throughput method for transcriptome analysis that utilizes cap trapping (Carninci et al. 1996), a technique based on the biotinylation of the 7-methylguanosine cap of Pol II transcripts, to pulldown the 5′-complete cDNAs reversely transcribed from the captured transcripts. A linker sequence is ligated to the 5′ end of the cDNA and a specific restriction enzyme is used to cleave off a short fragment from the 5′ end. Resulting fragments are then amplified and sequenced using massive parallel high-throughput sequencing technology, which results in a large number of short sequenced tags that can be mapped back to the referent genome to infer the exact position of the transcription start sites (TSSs) used for transcription of captured RNAs (Figure 1). The number of CAGE tags supporting each TSS gives the information on the relative frequency of its usage and can be used as a measure of expression from that specific TSS. Thus, CAGE provides information on two aspects of capped transcriptome: genome-wide 1bp-resolution map of TSSs and transcript expression levels. This information can be used for various analyses, from 5′ centered expression profiling (Takahashi et al. 2012) to studying promoter architecture (Carninci et al. 2006).
CAGE samples derived from various organisms (genomes) can be analysed by CAGEr and the only limitation is the availability of the referent genome as a BSgenome package in case when raw mapped CAGE tags are processed. CAGEr provides a comprehensive workflow that starts from mapped CAGE tags and includes reconstruction of TSSs and promoters and their visualisation, as well as more specialized downstream analyses like promoter width, expression profiling and differential TSS usage. It can use both Binary Sequence Alignment Map (BAM) files of aligned CAGE tags or files with genomic locations of TSSs and number of supporting CAGE tags as input. If BAM files are provided CAGEr constructs TSSs from aligned CAGE tags and counts the number of tags supporting each TSS, while allowing filtering out low-quality tags and removing technology-specific bias. It further performs normalization of raw CAGE tag count, clustering of TSSs into tag clusters (TC) and their aggregation across multiple CAGE experiments into promoters to construct the promoterome. Various methods for normalization and clustering of TSSs are supported. Exporting data into different types of track files allows various visualisations of TSSs and clusters (promoters) in the UCSC Genome Browser, which facilitate generation of hypotheses. CAGEr manipulates multiple CAGE experiments at once and performs analyses across datasets, including expression profiling and detection of differential TSS usage (promoter shifting). Multicore option for parallel processing is supported on Unix-like platforms, which significantly reduces computing time.
Here are some of the functionalities provided in this package:
Reading in multiple CAGE datasets from various sources; user provided BAM or TSS input files, public CAGE datasets from accompanying data package.
Correcting systematic G nucleotide addition bias at the 5′ end of CAGE tags.
Plotting pairwise scatter plots, calculating correlation between datasets and merging datasets.
Normalizing raw CAGE tag count: simple tag per million (tpm) or power-law based normalization (Balwierz et al. 2009).
Clustering individual TSSs into tag clusters (TCs) and aggregating clusters across multiple CAGE datasets to create a set of consensus promoters.
Making bedGraph or BED files of individual TSSs or clusters for visualisation in the genome browser.
Expression clustering of individual TSSs or consensus promoters into distinct expression profiles using common clustering algorithms.
Calculating promoter width based on the cumulative distribution of CAGE signal along the promoter.
Scoring and statistically testing differential TSS usage (promoter shifting) and detecting promoters that shift between two samples.
Several data packages are accompanying CAGEr package. They contain majority of the up-to-date publicly available CAGE data produced by major consortia including FANTOM and ENCODE. These include FANTOM3and4CAGE package available from Bioconductor, as well as ENCODEprojectCAGE and ZebrafishDevelopmentalCAGE packages available from http://promshift.genereg.net/CAGEr/. In addition, direct fetching of TSS data from FANTOM5 web resource (the largest collection of TSS data for human and mouse) from within CAGEr is also available. These are all valuable resources of genome-wide TSSs in various tissue/cell types for various model organisms that can be used directly in R. Section 5 in this vignette describes how these public datasets can be included into a workflow provided by CAGEr. For further information on the content of the data packages and the list of available CAGE datasets please refer to the vignette of the corresponding data package.
For further details on the implemented methods and for citing the CAGEr package in your work please refer to (Haberle et al. 2015).
CAGEr package supports three types of CAGE data input:
Sequenced CAGE tags mapped to the genome: either BAM (Binary Sequence Alignment Map) files of sequenced CAGE tags aligned to the referent genome (including the paired-end data such as CAGEscan) or BED files of CAGE tags (fragments).
Publicly available CAGE datasets from R data package: Several data packages containing CAGE data for various organisms produced by major consortia are accompanying this package. Selected subset of these data can be used as input for .
The type and the format of the input files is specified at the beginning of the workflow, when the
CAGEset
object is created (section 4.2). This is done by setting the inputFilesType
argument,
which accepts the following self-explanatory options referring to formats mentioned above:
"bam", "bamPairedEnd", "bed", "ctss", "CTSStable"
.
In addition, the package provides a method for coercing a data.frame
object containing single
base-pair TSS information into a CAGEset
object (as described in section 4.2.1), which can be
further used in the workflow described below.
We start the workflow by creating a CAGEexp object, which is a container for storing CAGE datasets and all the results that will be generated by applying specific functions. The CAGEexp objects are an extension of the MultiAssayExperiment class, and therefore can use all their methods. The expression data is stored in CAGEexp using SummarizedExperiment objects, and can also access their methods.
To load the CAGEr package and the other libraries into your R envirnoment type:
library("MultiAssayExperiment")
library("SummarizedExperiment")
library(CAGEr)
In this tutorial we will be using data from zebrafish Danio rerio that was
mapped to the danRer7
assembly of the genome. Therefore, the corresponding
genome package BSgenome.Drerio.UCSC.danRer7 has to be installed.
It will be automatically loaded by CAGEr commands when needed.
In case the data is mapped to a genome that is not readily available through
BSgenome package (not in the list returned by BSgenome::available.genomes()
function), a custom BSgenome package has to be build and installed first.
(See the vignette within the BSgenome package for instructions on how to build
a custom genome package). The genomeName
argument can then be set to the name
of the build genome package when creating a CAGEexp
object (see the section
Creating CAGEexp
object below).
The BSgenome package is used to access information about the chromosomes
(sequence, length, circularity). By default, CAGEr will discard alignments
that are not on chromosomes named in the BSgenome package. The genomeName
argument can be set to NULL
in order to suppress this behaviour, or as a last
resort when no BSgenome package is available. However, CAGEr functions that
need access to the genome sequence, for instance for G-correction will not
work in that case.
The subset of zebrafish (Danio rerio) developmental time-series CAGE data generated by (Nepal et al. 2013) will be used in the following demonstration of the CAGEr workflow.
Files with genomic coordinates of TSSs detected by CAGE in 4 zebrafish
developmental stages are included in this package in the extdata
subdirectory.
The files contain TSSs from a part of chromosome 17 (26,000,000-46,000,000), and
there are two files for one of the developmental stages (two independent
replicas). The data in files is organized in four tab-separated columns as
described above in section 2.
inputFiles = list.files( system.file("extdata", package = "CAGEr")
, "ctss$"
, full.names = TRUE)
The CAGEexp object is crated with the CAGEexp
contstructor, that requries
information on file path an type, sample names and reference genome name.
ce <- CAGEexp( genomeName = "BSgenome.Drerio.UCSC.danRer7"
, inputFiles = inputFiles
, inputFilesType = "ctss"
, sampleLabels = sub( ".chr17.ctss", "", basename(inputFiles))
)
To display the created object type:
ce
## A CAGEexp object of 0 listed
## experiments with no user-defined names and respective classes.
## Containing an ExperimentList class object of length 0:
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
The supplied information can be seen in the Input data information section, whereas all other slots are still empty, since no data has been read yet and no analysis conducted.
In case when the CAGE / TSS data is to be read from input files, an empty CAGEexp object with
information about the files is first created as described above in section 3.2.
To actually read in the data into the object we use getCTSS()
function, that will add
an experiment called tagCountMatrix
to the CAGEexp object.
getCTSS(ce)
ce
## A CAGEexp object of 1 listed
## experiment with a user-defined name and respective class.
## Containing an ExperimentList class object of length 1:
## [1] tagCountMatrix: RangedSummarizedExperiment with 23343 rows and 5 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
This function reads the provided files in the order they were specified in the
inputFiles
argument. It creates a single set of all TSSs detected across all
input datasets (union of TSSs) and a table with counts of CAGE tags supporting
each TSS in every dataset. (Note that in case when a CAGEr object is
created by coercion from an existing expression table there is no need to call
getCTSS()
).
Genomic coordinates of all TSSs and numbers of supporting CAGE tags in every input
sample can be retrieved using the CTSStagCountSE()
function. CTSScoordinatesGR()
accesses
the CTSS coordinates and CTSStagCountDF()
accesses the CTSS expression values.1 Data can also
be accessed directly using the native methods of the MultiAssayExperiment
and
SummarizedExperiment
classes, for example ce[["tagCountMatrix"]]
,
rowRanges(ce[["tagCountMatrix"]])
and assay(ce[["tagCountMatrix"]])
.
CTSStagCountSE(ce)
## class: RangedSummarizedExperiment
## dim: 23343 5
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames(5): Zf.30p.dome Zf.high Zf.prim6.rep1 Zf.prim6.rep2
## Zf.unfertilized.egg
## colData names(0):
CTSScoordinatesGR(ce)
## CTSS object with 23343 positions and 0 metadata columns:
## seqnames pos strand
## <Rle> <integer> <Rle>
## [1] chr17 26027430 +
## [2] chr17 26050540 +
## [3] chr17 26118088 +
## [4] chr17 26142853 +
## [5] chr17 26166954 +
## ... ... ... ...
## [23339] chr17 45975041 -
## [23340] chr17 45975540 -
## [23341] chr17 45975544 -
## [23342] chr17 45982697 -
## [23343] chr17 45999921 -
## -------
## seqinfo: 26 sequences (1 circular) from danRer7 genome
CTSStagCountDF(ce)
## DataFrame with 23343 rows and 5 columns
## Zf.30p.dome Zf.high Zf.prim6.rep1 Zf.prim6.rep2 Zf.unfertilized.egg
## <Rle> <Rle> <Rle> <Rle> <Rle>
## 1 0 0 1 0 0
## 2 0 0 0 0 1
## 3 0 0 1 0 0
## 4 0 0 0 1 0
## 5 0 0 1 0 0
## ... ... ... ... ... ...
## 23339 1 0 0 0 0
## 23340 0 2 0 0 0
## 23341 0 1 0 0 0
## 23342 0 0 1 0 0
## 23343 1 0 0 0 0
For compatiblity with earlier CAGEr works using CAGEset objects, and to provide simpler data
formats, the coordinates and expression values can also be accessed as simple data.frames
.
Note howerver that with large data sets it can cause extreme performance issues.
head(CTSScoordinates(ce))
## chr pos strand
## 1 chr17 26027430 +
## 2 chr17 26050540 +
## 3 chr17 26118088 +
## 4 chr17 26142853 +
## 5 chr17 26166954 +
## 6 chr17 26222417 +
head(CTSStagCountDf(ce))
## Zf.30p.dome Zf.high Zf.prim6.rep1 Zf.prim6.rep2 Zf.unfertilized.egg
## 1 0 0 1 0 0
## 2 0 0 0 0 1
## 3 0 0 1 0 0
## 4 0 0 0 1 0
## 5 0 0 1 0 0
## 6 1 1 0 0 0
head(CTSStagCount(ce))
## chr pos strand Zf.30p.dome Zf.high Zf.prim6.rep1 Zf.prim6.rep2
## 1 chr17 26027430 + 0 0 1 0
## 2 chr17 26050540 + 0 0 0 0
## 3 chr17 26118088 + 0 0 1 0
## 4 chr17 26142853 + 0 0 0 1
## 5 chr17 26166954 + 0 0 1 0
## 6 chr17 26222417 + 1 1 0 0
## Zf.unfertilized.egg
## 1 0
## 2 1
## 3 0
## 4 0
## 5 0
## 6 0
Note that the samples are ordered in the way they were supplied when creating the CAGEexp object and will be presented in that order in all the results and plots. To check sample labels and their ordering type:
sampleLabels(ce)
## #FF0000FF #CCFF00FF #00FF66FF
## "Zf.30p.dome" "Zf.high" "Zf.prim6.rep1"
## #0066FFFF #CC00FFFF
## "Zf.prim6.rep2" "Zf.unfertilized.egg"
In addition, a colour is assigned to each sample, which is consistently used to depict that sample
in all the plots. By default a rainbow palette of colours is used and the hexadecimal format of
the assigned colours can be seen as names attribute of sample labels shown above. The colours can
be changed to taste at any point in the workflow using the setColors()
function.
By design, CAGE tags map transcription start sites and therefore detect promoters. Quantitatively, the proportion of tags that map to promoter regions will depend both on the quality of the libraries and the quality of the genome annotation, which may be incomplete. Nevertheless, strong variations between libraries prepared in the same experiment may be used for quality controls.
CAGEr can intersect CTSSes with reference transcript models and annotate them with
the name(s) of the models, and the region categories promoter, exon, intron and
unknown, by using the annotateCTSS
function. The reference models can be GENCODE
loaded with the import.gff
function of the rtracklayer package,
or any other input that has the same structure, see help("annotateCTSS")
for details.
In this example, we will use a sample annotation for zebrafish (see help("exampleZv9_annot")
).
annotateCTSS(ce, exampleZv9_annot)
The annotation results are stored as tag counts in the sample metadata, and as new columns in the CTSS genomic ranges
colData(ce)[,c("librarySizes", "promoter", "exon", "intron", "unknown")]
## DataFrame with 5 rows and 5 columns
## librarySizes promoter exon intron unknown
## <integer> <integer> <integer> <integer> <integer>
## Zf.30p.dome 41814 37843 2352 594 1025
## Zf.high 45910 41671 2848 419 972
## Zf.prim6.rep1 34053 29531 2714 937 871
## Zf.prim6.rep2 34947 30799 2320 834 994
## Zf.unfertilized.egg 56140 51114 2860 400 1766
CTSScoordinatesGR(ce)
## CTSS object with 23343 positions and 2 metadata columns:
## seqnames pos strand | genes annotation
## <Rle> <integer> <Rle> | <Rle> <Rle>
## [1] chr17 26027430 + | unknown
## [2] chr17 26050540 + | grid1a promoter
## [3] chr17 26118088 + | grid1a exon
## [4] chr17 26142853 + | grid1a intron
## [5] chr17 26166954 + | grid1a exon
## ... ... ... ... . ... ...
## [23339] chr17 45975041 - | unknown
## [23340] chr17 45975540 - | unknown
## [23341] chr17 45975544 - | unknown
## [23342] chr17 45982697 - | unknown
## [23343] chr17 45999921 - | unknown
## -------
## seqinfo: 26 sequences (1 circular) from danRer7 genome
A function plotAnnot
is provided to plot the annotations as stacked bar plots.
Here, all the CAGE libraries look very promoter-specific.
plotAnnot(ce, "counts")
## Warning: Removed 20 rows containing missing values (geom_segment).
## Warning: Removed 20 rows containing missing values (geom_point).
As part of the basic sanity checks, we can explore the data by looking at the
correlation between the samples. The plotCorrelation2()
function will plot
pairwise scatter plots of expression scores per TSS or consensus cluster and
calculate correlation coefficients between all possible pairs of
samples2 Alternatively, the plotCorrelation()
function does the same and
colors the scatterplots according to point density, but is much slower.. A
threshold can be set, so that only regions with an expression score (raw or
normalized) above the threshold (either in one or both samples) are
considered when calculating correlation. Three different correlation measures
are supported: Pearson’s, Spearman’s and Kendall’s correlation coefficients.
Note that while the scatterplots are on a logarithmic scale with pseudocount
added to the zero values, the correlation coefficients are calculated on
untransformed (but thresholded) data.
corr.m <- plotCorrelation2( ce, samples = "all"
, tagCountThreshold = 1, applyThresholdBoth = FALSE
, method = "pearson")
Based on calculated correlation we might want to merge and/or rearrange some of the datasets. To
rearrange the samples in the temporal order of the zebrafish development (unfertilized egg -> high
-> 30 percent dome -> prim6) and to merge the two replicas for the prim6 developmental stage we use
the mergeSamples()
function:
mergeSamples(ce, mergeIndex = c(3,2,4,4,1),
mergedSampleLabels = c("zf_unfertilized_egg", "zf_high", "zf_30p_dome", "zf_prim6"))
annotateCTSS(ce, exampleZv9_annot)
The mergeIndex
argument controls which samples will be merged and how the final dataset will be
ordered. Samples labeled by the same number (in our case samples three and four) will be merged
together by summing number of CAGE tags per TSS. The final set of samples will be ordered in the
ascending order of values provided in mergeIndex
and will be labeled by the labels provided in
the mergedSampleLabels
argument. Note that mergeSamples
function resets all slots with results
of downstream analyses, so in case there were any results in the CAGEexp object prior to merging,
they will be removed. Thus, annotation has to be redone.
Library sizes (number of total sequenced tags) of individual experiments differ, thus
normalization is required to make them comparable. The librarySizes
function returns the total
number of CAGE tags in each sample:
librarySizes(ce)
## [1] 56140 45910 41814 69000
The CAGEr package supports both simple tags per million normalization and power-law based normalization. It has been shown that many CAGE datasets follow a power-law distribution (Balwierz et al. 2009). Plotting the number of CAGE tags (X-axis) against the number of TSSs that are supported by <= of that number of tags (Y-axis) results in a distribution that can be approximated by a power-law. On a log-log scale this reverse cumulative distribution will manifest as a monotonically decreasing linear function, which can be defined as
\[y = -1 * \alpha * x + \beta\]
and is fully determined by the slope \(\alpha\) and total number of tags T (which together with \(\alpha\) determines the value of \(\beta\)).
To check whether our CAGE datasets follow power-law distribution and in which range of values, we
can use the plotReverseCumulatives
function:
plotReverseCumulatives(ce, fitInRange = c(5, 1000), onePlot = TRUE)
In addition to the reverse cumulative plots (Figure 3), a power-law distribution will be fitted to each reverse cumulative using values in the specified range (denoted with dashed lines in Figure 3) and the value of \(\alpha\) will be reported for each sample (shown in the brackets in the Figure 3 legend). The plots can help in choosing the optimal parameters for power-law based normalization. We can see that the reverse cumulative distributions look similar and follow the power-law in the central part of the CAGE tag counts values with a slope between -1.1 and -1.3. Thus, we choose a range from 5 to 1000 tags to fit a power-law, and we normalize all samples to a referent power-law distribution with a total of 50,000 tags and slope of -1.2 (\(\alpha = 1.2\)).3 Note that since this example dataset contains only data from one part of chromosome 17 and the total number of tags is very small, we normalize to a referent distribution with a similarly small number of tags. When analyzing full datasets it is reasonable to set total number of tags for referent distribution to one million to get normalized tags per million values.
To perform normalization we pass these parameters to the normalizeTagCount
function.
normalizeTagCount(ce, method = "powerLaw", fitInRange = c(5, 1000), alpha = 1.2, T = 5*10^4)
ce[["tagCountMatrix"]]
## class: RangedSummarizedExperiment
## dim: 23343 4
## metadata(0):
## assays(2): counts normalizedTpmMatrix
## rownames: NULL
## rowData names(2): genes annotation
## colnames(4): zf_unfertilized_egg zf_high zf_30p_dome zf_prim6
## colData names(0):
The normalization is performed as described in (Balwierz et al. 2009):
alpha
(slope in the
log-log representation) and T
(total number of tags) parameters. Setting T
to
1 million results in normalized tags per million (tpm) values.In addition to the two provided normalization methods, a pass-through option none
can be set as
method
parameter to keep using raw tag counts in all downstream steps. Note that
normalizeTagCount()
has to be applied to CAGEr
object before moving to next steps. Thus, in
order to keep using raw tag counts run the function with method="none"
. In that case, all
results and parameters in the further steps that would normally refer to normalized CAGE signal
(denoted as tpm), will actually be raw tag counts.
Transcription start sites are found in the promoter region of a gene and reflect the transcriptional activity of that promoter (Figure 5). TSSs in the close proximity of each other give rise to a functionally equivalent set of transcripts and are likely regulated by the same promoter elements. Thus, TSSs can be spatially clustered into larger transcriptional units, called tag clusters (TCs) that correspond to individual promoters. CAGEr supports three methods for spatial clustering of TSSs along the genome, two ab initio methods driven by the data itself, as well as assigning TSSs to predefined genomic regions:
Simple distance-based clustering in which two neighbouring TSSs are joined together if they are closer than some specified distance (greedy algorithm);
Parametric clustering of data attached to sequences based on the density of the signal (Frith et al. 2007), http://www.cbrc.jp/paraclu/;
Counting TSSs and their signal in a set of user supplied genomic regions (e.g. annotation derived promoter regions or other regions of interest).
These functionalities are provided in the function clusterCTSS()
, which accepts additional
arguments for controlling which CTSSs will be included in the clustering as well as for
refining the final set of tag clusters.
We will perform a simple distance-based clustering using 20 bp as a maximal allowed distance between two neighbouring TSSs. Prior to clustering we will filter out low-fidelity TSSs - the ones supported by less than 1 normalized tag counts in all of the samples.
clusterCTSS( object = ce
, threshold = 1
, thresholdIsTpm = TRUE
, nrPassThreshold = 1
, method = "distclu"
, maxDist = 20
, removeSingletons = TRUE
, keepSingletonsAbove = 5)
Our final set of tag clusters will not include singletons (clusters with only one TSS), unless the
normalized signal is above 5, it is a reasonably supported TSS. The clusterCTSS
function creates a set of clusters for each sample separately; for each cluster it returns the
genomic coordinates, counts the number of TSSs within the cluster, determines the position of the
most frequently used (dominant) TSS, calculates the total CAGE signal within the cluster and CAGE signal supporting the dominant TSS only. We can extract tag clusters for a desired sample from CAGEexp
object by calling the tagClusters
function:
head(tagClusters(ce, sample = "zf_unfertilized_egg"))
## cluster chr start end strand tpm nr_ctss dominant_ctss
## 1 1 chr17 26453631 26453708 + 27.0 12 26453667
## 2 2 chr17 26564507 26564610 + 128.6 24 26564585
## 3 3 chr17 26595636 26595793 + 217.0 35 26595750
## 4 4 chr17 26596032 26596091 + 10.4 9 26596070
## 5 5 chr17 26596117 26596127 + 12.2 4 26596118
## 6 6 chr17 26596149 26596175 + 11.0 5 26596153
## tpm.dominant_ctss
## 1 8.25
## 2 29.28
## 3 100.97
## 4 3.22
## 5 5.74
## 6 3.85
Genome-wide mapping of TSSs using CAGE has initially revealed two major classes of promoters in mammals (Carninci et al. 2006), with respect to the number and distribution of TSSs within the promoter. They have been further correlated with differences in the underlying sequence and the functional classes of the genes they regulate, as well as the organization of the chromatin around them.
Thus, the width of the promoter is an important characteristic that distinguishes different
functional classes of promoters. CAGEr analyzes promoter width across all samples present
in the CAGEexp
object. It defines promoter width by taking into account both the positions
and the CAGE signal at TSSs along the tag cluster, thus making it more robust with respect
to total expression and local level of noise at the promoter. Width of every tag cluster is
calculated as following:
qLow
) and an “upper” (qUp
) quantile are selected by the user.The procedure is schematically shown in Figure 4.
Required computations are done using cumulativeCTSSdistribution
and quantilePositions
functions, which calculate cumulative distribution for every tag cluster in each of the
samples and determine the positions of selected quantiles, respectively:
cumulativeCTSSdistribution(ce, clusters = "tagClusters", useMulticore = T)
quantilePositions(ce, clusters = "tagClusters", qLow = 0.1, qUp = 0.9)
Tag clusters and their interquantile width can be retrieved by calling tagClusters
function:
tagClustersGR( ce, "zf_unfertilized_egg"
, returnInterquantileWidth = TRUE, qLow = 0.1, qUp = 0.9)
## TagClusters object with 517 ranges and 7 metadata columns:
## seqnames ranges strand | score nr_ctss dominant_ctss
## <Rle> <IRanges> <Rle> | <Rle> <integer> <integer>
## 1 chr17 26453632-26453708 + | 27.0 12 26453667
## 2 chr17 26564508-26564610 + | 128.6 24 26564585
## 3 chr17 26595637-26595793 + | 217.0 35 26595750
## 4 chr17 26596033-26596091 + | 10.4 9 26596070
## 5 chr17 26596118-26596127 + | 12.2 4 26596118
## ... ... ... ... . ... ... ...
## 513 chr17 45700182-45700187 - | 9.62 3 45700182
## 514 chr17 45901329-45901330 - | 1.96 2 45901329
## 515 chr17 45901698-45901710 - | 27.65 4 45901698
## 516 chr17 45901732-45901784 - | 119.97 15 45901749
## 517 chr17 45901814-45901816 - | 3.25 2 45901816
## tpm.dominant_ctss q_0.1 q_0.9 interquantile_width
## <Rle> <Rle> <Rle> <Rle>
## 1 8.25 36 72 37
## 2 29.28 17 81 65
## 3 100.97 37 114 78
## 4 3.22 1 50 50
## 5 5.74 1 10 10
## ... ... ... ... ...
## 513 6.37 1 6 6
## 514 1.30 1 2 2
## 515 23.75 1 2 2
## 516 83.45 2 21 20
## 517 1.94 1 3 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Once the cumulative distributions and the positions of quantiles have been calculated, the histograms of interquantile width can be plotted to globally compare the promoter width across different samples (Figure ??:
plotInterquantileWidth(ce, clusters = "tagClusters", tpmThreshold = 3, qLow = 0.1, qUp = 0.9)
Significant difference in the promoter width might indicate global differences in the modes of gene regulation between the two samples. The histograms can also help in choosing an appropriate width threshold for separating sharp and broad promoters.
Tag clusters are created for each sample individually and they are often sample-specific, thus can
be present in one sample but absent in another. In addition, in many cases tag clusters do not
coincide perfectly within the same promoter region, or there might be two clusters in one sample
and only one larger in the other. To be able to compare genome-wide transcriptional activity
across samples and to perform expression profiling, a single set of consensus clusters needs to
be created. This is done using the aggregateTagClusters
function, which aggregates tag clusters
from all samples into a single set of non-overlapping consensus clusters:
aggregateTagClusters(ce, tpmThreshold = 5, qLow = 0.1, qUp = 0.9, maxDist = 100)
ce$outOfClusters / ce$librarySizes *100
## zf_unfertilized_egg zf_high zf_30p_dome
## 33.3 32.9 32.1
## zf_prim6
## 34.4
Tag clusters can be aggregated using their full span (from start to end) or using positions of
previously calculated quantiles as their boundaries. Only tag clusters above given tag count
threshold will be considered and two clusters will be aggregated together if their boundaries
(i.e. either starts and ends or positions of quantiles) are <= maxDist
apart. Final set
of consensus clusters can be retrieved by:
consensusClustersGR(ce)
## ConsensusClusters object with 253 ranges and 3 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## chr17:26379388-26379679:- chr17 26379388-26379679 - |
## chr17:26446559-26446651:- chr17 26446559-26446651 - |
## chr17:26452451-26452571:- chr17 26452451-26452571 - |
## chr17:26453648-26453757:+ chr17 26453648-26453757 + |
## chr17:26555294-26555359:- chr17 26555294-26555359 - |
## ... ... ... ... .
## chr17:45700093-45700189:- chr17 45700093-45700189 - |
## chr17:45806843-45806894:+ chr17 45806843-45806894 + |
## chr17:45901696-45901758:- chr17 45901696-45901758 - |
## chr17:45905581-45905664:+ chr17 45905581-45905664 + |
## chr17:45975253-45975297:+ chr17 45975253-45975297 + |
## score consensus.cluster
## <numeric> <integer>
## chr17:26379388-26379679:- 179.803895280803 196
## chr17:26446559-26446651:- 48.1654641622959 197
## chr17:26452451-26452571:- 184.204291454949 198
## chr17:26453648-26453757:+ 129.733525462687 1
## chr17:26555294-26555359:- 19.2388260848655 199
## ... ... ...
## chr17:45700093-45700189:- 9.21466887879838 368
## chr17:45806843-45806894:+ 108.34520829738 190
## chr17:45901696-45901758:- 789.81397211728 372
## chr17:45905581-45905664:+ 321.125346126617 192
## chr17:45975253-45975297:+ 35.8163066635267 193
## tpm
## <numeric>
## chr17:26379388-26379679:- 287.247912339468
## chr17:26446559-26446651:- 85.7757116833182
## chr17:26452451-26452571:- 269.689488829474
## chr17:26453648-26453757:+ 185.967802102194
## chr17:26555294-26555359:- 47.6676600011559
## ... ...
## chr17:45700093-45700189:- 23.2038219025773
## chr17:45806843-45806894:+ 173.614212890553
## chr17:45901696-45901758:- 875.831472947622
## chr17:45905581-45905664:+ 392.236067622669
## chr17:45975253-45975297:+ 55.6241013192924
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
which will return genomic coordinates and sum of CAGE signal across all samples for each consensus
cluster (the tpm
column).
Analogously to tag clusters, analysis of promoter width can be performed for consensus clusters
as well, using the same cumulativeCTSSdistribution
, quantilePositions
and plotInterquantileWidth
functions as described above, but by setting
the clusters
parameter to "consensusClusters"
. Like the CTSS, the consensus clusters can
also be annotated:
annotateConsensusClusters(ce, exampleZv9_annot)
cumulativeCTSSdistribution(ce, clusters = "consensusClusters", useMulticore = TRUE)
quantilePositions(ce, clusters = "consensusClusters", qLow = 0.1, qUp = 0.9, useMulticore = TRUE)
Although consensus clusters are created to represent consensus across all samples, they obviously have different CAGE signal and can have different width or position of the dominant TSS in the different samples. Sample-specific information on consensus clusters can be retrieved with the function, by specifying desired sample name (analogous to retrieving tag clusters):
consensusClustersGR( ce, sample = "zf_unfertilized_egg"
, returnInterquantileWidth = TRUE, qLow = 0.1, qUp = 0.9)
## ConsensusClusters object with 253 ranges and 8 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## chr17:26379388-26379679:- chr17 26379388-26379679 - |
## chr17:26446559-26446651:- chr17 26446559-26446651 - |
## chr17:26452451-26452571:- chr17 26452451-26452571 - |
## chr17:26453648-26453757:+ chr17 26453648-26453757 + |
## chr17:26555294-26555359:- chr17 26555294-26555359 - |
## ... ... ... ... .
## chr17:45700093-45700189:- chr17 45700093-45700189 - |
## chr17:45806843-45806894:+ chr17 45806843-45806894 + |
## chr17:45901696-45901758:- chr17 45901696-45901758 - |
## chr17:45905581-45905664:+ chr17 45905581-45905664 + |
## chr17:45975253-45975297:+ chr17 45975253-45975297 + |
## score consensus.cluster
## <numeric> <integer>
## chr17:26379388-26379679:- 49.514684740936 196
## chr17:26446559-26446651:- 10.9138520447868 197
## chr17:26452451-26452571:- 61.8658559845496 198
## chr17:26453648-26453757:+ 19.1354399087033 1
## chr17:26555294-26555359:- 6.37015756741174 199
## ... ... ...
## chr17:45700093-45700189:- 6.37015756741174 368
## chr17:45806843-45806894:+ 23.0404872540061 190
## chr17:45901696-45901758:- 132.654636120977 372
## chr17:45905581-45905664:+ 106.7961601026 192
## chr17:45975253-45975297:+ 0 193
## tpm annotation genes
## <numeric> <Rle> <Rle>
## chr17:26379388-26379679:- 49.514684740936 promoter CCSER2 (1 of 2)
## chr17:26446559-26446651:- 10.9138520447868 promoter fam149b1
## chr17:26452451-26452571:- 61.8658559845496 promoter mrp63
## chr17:26453648-26453757:+ 19.1354399087033 promoter ttc7b
## chr17:26555294-26555359:- 6.37015756741174 exon calm1a
## ... ... ... ...
## chr17:45700093-45700189:- 6.37015756741174 exon susd4
## chr17:45806843-45806894:+ 23.0404872540061 promoter vcpkmt
## chr17:45901696-45901758:- 132.654636120977 promoter arf6b
## chr17:45905581-45905664:+ 106.7961601026 promoter aspg
## chr17:45975253-45975297:+ 0 unknown
## q_0.1 q_0.9 interquantile_width
## <Rle> <Rle> <Rle>
## chr17:26379388-26379679:- 20 56 37
## chr17:26446559-26446651:- 31 65 35
## chr17:26452451-26452571:- 48 77 30
## chr17:26453648-26453757:+ 37 271 235
## chr17:26555294-26555359:- 3 109 107
## ... ... ... ...
## chr17:45700093-45700189:- 1 1 1
## chr17:45806843-45806894:+ 26 74 49
## chr17:45901696-45901758:- 25 31 7
## chr17:45905581-45905664:+ 3 92 90
## chr17:45975253-45975297:+ 3 54 52
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
This will, in addition to genomic coordinates of the consensus clusters, which are constant across all samples, also return the position of the dominant TSS, the CAGE signal (tpm) and the interquantile width specific for a given sample. Note that when specifying individual sample, only the consensus clusters that have some CAGE signal in that sample will be returned (which will be a subset of all consensus clusters).
CAGE data can be visualized in the genomic context by exporting raw or normalized CAGE tag counts
to a bedGraph (or BigWig) file and uploading (or linking) the file to a genome browser. Positions
of TSSs and tag counts supporting them are exported using exportCTSStoBedGraph()
4 Note that
the ZENBU genome browser can display natively data from BAM
or BED files as coverage tracks.:
exportCTSStoBedGraph(ce, values = "normalized", format = "bedGraph", oneFile = TRUE)
This will produce a single bedGraph file with multiple annotated tracks that can be directly visualized as custom tracks in the genome browser (Figure 5).
There are two tracks per sample; one for TSSs on the plus strand and the other for the minus strand. Values for TSSs on minus strand are shown as negative and are pointing downwards in the browser.
Alternatively, the tracks can be exported to a binary BigWig format:
exportCTSStoBedGraph(ce, values = "normalized", format = "BigWig")
which will produce two BigWig files per sample (one for TSSs on the plus strand and the other for the minus strand) and an accompanying text file with track headers.
Interquantile width can also be visualized in a gene-like representation in the UCSC genome browser by exporting the data into a BED file:
exportToBed(object = ce, what = "tagClusters", qLow = 0.1, qUp = 0.9, oneFile = TRUE)
In this gene-like representation (Figure 6), the oriented line shows the full span of the cluster, filled block marks the interquantile width and a single base-pair thick block denotes the position of the dominant TSS.
Since CAGE signal reflects level of transcription from a given TSS or promoter it can be used for 5′ centered expression profiling. Expression clustering can be done at level of individual CTSSs or at level of entire promoters (consensus clusters). In the former case, feature vector containing log transformed and scaled normalized CAGE signal at individual TSS across multiple samples is used as input for clustering algorithm, whereas in the latter case CAGE signal within the entire consensus cluster is used. CAGEr supports two unsupervised clustering algorithms: kmeans and self-organizing maps (SOM). Both algorithms require to specify a number of clusters in advance.
We will perform expression clustering at the level of entire promoter using SOM algorithm and applying it only to promoters with normalized CAGE signal \(>= 15\) in at least one sample.
# Not supported yet for CAGEexp objects, sorry.
# getExpressionProfiles(ce, what = "consensusClusters", tpmThreshold = 10,
# nrPassThreshold = 1, method = "som", xDim = 4, yDim = 2)
Distribution of expression across samples for 8 clusters returned by SOM (4 \(\times\) 2 map) can
be visualized using the plotExpressionProfiles
function
as shown in Figure 7:
# Not supported yet for CAGEexp objects, sorry.
# plotExpressionProfiles(ce, what = "consensusClusters")
Each cluster is shown in different color and is marked by its label and the number of elements (promoters) in the cluster. We can extract promoters belonging to a specific cluster by typing:
# Not supported yet for CAGEexp objects, sorry.
# class3_1 <- extractExpressionClass(ce,
# what = "consensusClusters", which = "3_1")
# head(class3_1)
Consensus clusters and information on their expression profile can be exported to a BED file, which allows visualization of the promoters in the genome browser colored in the color of the expression cluster they belong to (Figure 8:
# Not supported yet for CAGEexp objects, sorry.
# exportToBed(ce, what = "consensusClusters",
# colorByExpressionProfile = TRUE)
Expression profiling of individual TSSs is done using the same procedure as
described above for consensus clusters, only by setting wha = "CTSS"
in all
of the functions.
The raw expression table for the consensus clusters can be exported to the DESeq2 package for differential expression analysis. For this, the column data needs to contain factors that can group the samples. They can have any name.
ce$group <- factor(c("a", "a", "b", "b"))
dds <- consensusClustersDESeq2(ce, ~group)
As shown in Figure 6, TSSs within the same promoter region can be used differently in different samples. Thus, although the overall transcription level from a promoter does not change between the samples, the differential usage of TSSs or promoter shifting may indicate changes in the regulation of transcription from that promoter, which cannot be detected by expression profiling. To detect this promoter shifting, a method described in @[Haberle:2014] has been implemented in CAGEr. Shifting can be detected between two individual samples or between two groups of samples. In the latter case, samples are first merged into groups and then compared in the same way as two individual samples. For all promoters a shifting score is calculated based on the difference in the cumulative distribution of CAGE signal along that promoter in the two samples. In addition, a more general assessment of differential TSS usage is obtained by performing Kolmogorov-Smirnov test on the cumulative distributions of CAGE signal, as described below. Thus, prior to shifting score calculation and statistical testing, we have to calculate cumulative distribution along all consensus clusters:
cumulativeCTSSdistribution(ce, clusters = "consensusClusters")
Next, we calculate a shifting score and P-value of Kolmogorov-Smirnov test for all promoters comparing two specified samples:
# Not supported yet for CAGEexp objects, sorry.
# scoreShift(ce, groupX = "zf_unfertilized_egg", groupY = "zf_prim6",
# testKS = TRUE, useTpmKS = FALSE)
This function will calculate shifting score as illustrated in
Figure 9. Values of shifting score are in range between
-Inf
and 1
. Positive values can be interpreted as the proportion of
transcription initiation in the sample with lower expression that is happening
“outside” (either upstream or downstream) of the region used for transcription
initiation in the other sample. In contrast, negative values indicate no
physical separation, i.e. the region used for transcription initiation in the
sample with lower expression is completely contained within the region used for
transcription initiation in the other sample. Thus, shifting score detects only
the degree of upstream or downstream shifting, but does not detect more general
changes in TSS rearrangement in the region, e.g. narrowing or broadening of
the region used for transcription.
To assess any general change in the TSS usage within the promoter region,
a two-sample Kolmogorov-Smirnov (K-S) test on cumulative sums of CAGE signal
along the consensus cluster is performed. Cumulative sums in both samples are
scaled to range between 0 and 1 and are considered to be empirical cumulative
distribution functions (ECDF) reflecting sampling of TSS positions during
transcription initiation. K-S test is performed to assess whether the two
underlying probability distributions differ. To obtain a P-value i.e. the
level at which the null-hypothesis can be rejected), sample sizes that generated
the ECDFs are required, in addition to actual K-S statistics calculated from
ECDFs. These are derived either from raw tag counts, i.e. exact number of
times each TSS in the cluster was sampled during sequencing (when
useTpmKS = FALSE
), or from normalized tpm values (when useTpmKS = TRUE
).
P-values obtained from K-S tests are further corrected for multiple testing
using Benjamini and Hochenberg (BH) method and for each P-value a corresponding
false-discovery rate (FDR) is also reported.
We can select a subset of promoters with shifting score and/or FDR above specified threshold:
# Not supported yet for CAGEexp objects, sorry.
# shifting.promoters <- getShiftingPromoters(ce,
# tpmThreshold = 5, scoreThreshold = 0.6,
# fdrThreshold = 0.01)
# head(shifting.promoters)
The getShiftingPromoters
function returns genomic coordinates, shifting score
and P-value (FDR) of the promoters, as well as the value of CAGE signal and
position of the dominant TSS in the two compared (groups of) samples.
Figure 10 shows the difference in the CAGE signal
between the two compared samples for one of the selected high-scoring shifting
promoters.
CAGEexp
object by coercing a data frameA CAGEexp object can also be created directly by coercing a data frame containing single base-pair TSS information. To be able to do the coercion into a CAGEexp, the data frame must conform with the following:
The data frame must have at least 4 columns;
the first three columns must be named chr
, pos
and strand
, and contain chromosome name,
1-based genomic coordinate of the TSS (positive integer) and TSS strand information (+
or
-
), respectively;
these first three columns must be of the class character
, integer
and character
,
respectively;
all additional columns must be of the class integer
and should contain raw CAGE tag counts
(non-negative integer) supporting each TSS in different samples (columns). At least one such
column with tag counts must be present;
the names of the columns containing tag counts must begin with a letter, and these column names are used as sample labels in the resulting CAGEexp object.
An example of such data frame is shown below:
TSS.df <- read.table(system.file( "extdata/Zf.unfertilized.egg.chr17.ctss"
, package = "CAGEr"))
# make sure the column names are as required
colnames(TSS.df) <- c("chr", "pos", "strand", "zf_unfertilized_egg")
# make sure the column classes are as required
TSS.df$chr <- as.character(TSS.df$chr)
TSS.df$pos <- as.integer(TSS.df$pos)
TSS.df$strand <- as.character(TSS.df$strand)
TSS.df$zf_unfertilized_egg <- as.integer(TSS.df$zf_unfertilized_egg)
head(TSS.df)
## chr pos strand zf_unfertilized_egg
## 1 chr17 26050540 + 1
## 2 chr17 26074127 - 2
## 3 chr17 26074129 - 3
## 4 chr17 26222545 - 1
## 5 chr17 26322780 - 1
## 6 chr17 26322832 - 2
This data.frame can now be coerced to a CAGEexp object, which will fill the corresponding slots of the object with provided TSS information:
ce.coerced <- as(TSS.df, "CAGEexp")
ce.coerced
## A CAGEexp object of 1 listed
## experiment with a user-defined name and respective class.
## Containing an ExperimentList class object of length 1:
## [1] tagCountMatrix: RangedSummarizedExperiment with 8395 rows and 1 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
Originally there was one accessor per slot in CAGEset objects (the original CAGEr format), but now that I added the CAGEexp class, that uses different underlying formats, the number of accessors increased because a) I provide the old ones for backwards compatibility and b) there multiple possible output formats.
Before releasing this CAGEr update in Bioconductor, I would like to be sure that the number of accessors and the name scheme are good enough.
Please let me know your opinion about the names and scope of the accessors below:
Name | Output |
---|---|
CTSScoordinates | Coordinate table in ad-hoc data.frame format. |
CTSScoordinatesGR | Coordinate table in GRanges format. |
CTSStagCount | Raw CTSS counts in ad-hoc data.frame format (with coordinates). |
CTSStagCountDA | Raw CTSS counts in DelayedArray format wrapping a integer Rle DataFrame. |
CTSStagCountDF | Raw CTSS counts in integer Rle DataFrame format. |
CTSStagCountDf | Raw CTSS counts in data.frame format (without coordinates). |
CTSStagCountGR | Raw CTSS counts in GRanges format (single samples). |
CTSStagCountSE | RangedSummarizedExperiment containing an assay for the Raw CTSS counts. |
CTSStagCountTable | Returns CTSStagCount for CAGEsets and CTSStagCountDF for CAGEexps. |
CTSSnormalizedTpm | Normalised CTSS values in ad-hoc data.frame format (with coordinates). |
CTSSnormalizedTpmDF | Normalised CTSS values in Rle DataFrame format. |
CTSSnormalizedTpmDf | Normalised CTSS values in ad-hoc data.frame format (without coordinates). |
CTSSnormalizedTpmGR | Normalised CTSS values in GRanges format (single samples). |
Name | Output |
---|---|
consensusClusters | Consensus cluster coordinates in ad-hoc data.frame format. |
consensusClustersDESeq2 | Consensus cluster coordinates and expression matrix in DESeq2 format. |
consensusClustersGR | Consensus cluster coordinates in GRanges format. |
consensusClustersSE | Consensus cluster coordinates and expression matrix in RangedSummarizedExperiment format. |
consensusClustersTpm | Consensus cluster coordinates and raw expression matrix. |
tagClusters | Per-sample list of tag cluster coordinates in ad-hoc data.frame format. |
tagClustersGR | Per-sample GRangesList of tag cluster coordinates. |
Name | Output |
---|---|
GeneExpDESeq2 | Gene expression data in DESeq2 format. |
GeneExpSE | Gene expression data in SummarizedExperiment format. |
A CAGEexp object can contain the following experiments.
Please let me know your opinion about the names
Name | Assays | Comment |
---|---|---|
tagCountMatrix | counts, normalizedTpmMatrix | RangedSummarizedExperiment |
seqNameTotals | counts, norm | SummarizedExperiment |
consensusClusters | counts, normalized, q_x, q_y | RangedSummarizedExperiment |
geneExpMatrix | counts | SummarizedExperiment |
Name | Experiment | Comment |
---|---|---|
counts | tagCountMatrix | Integer Rle DataFrame of CTSS raw counts. |
counts | seqNameTotals | Numeric matrix of total counts per chromosome. |
counts | consensusClusters | Integer matrix of consensus cluster expression counts. |
counts | geneExpMatrix | Integer matrix of gene expression counts. |
normalizedTpmMatrix | tagCountMatrix | Numeric matrix of normalised CTSS expression scores. |
norm | seqNameTotals | Numeric matrix of percent total counts per chromosome. |
normalized | consensusClusters | Numeric matrix of normalised CC expression scores. |
q_x, q_y, q_z, … | consensusClusters | Interger Rle DataFrame of relative quantile positions |
The CTSS, CTSS.chr, TagCluster and ConsensusClsuters are mostly used internally or type safety and preventing me (Charles) from mixing up inputs. They are visible from the outside. Should they be used more extensively ? Can they be replaced by more “core” Bioconductor classes ?
Name | Comment |
---|---|
CAGEset | The original CAGEr class, based on data frames and matrices. |
CAGEexp | The new CAGEr class, based on GRanges, DataFrames, etc. |
CAGEr | Union classs for functions that accept both CAGEset and CAGEexp. |
CTSS | Wraps GRanges and guarantees that width equals 1. |
CTSS.chr | Same as CTSS but also guaranteers the there is only one chromosome (useful in some loops) |
TagClusters | Wraps GRanges, represents the fact that each sample has their own tag clusters. |
ConsensusClusters | Wraps GRanges, represents the fact that they are valid for all the samples. |
CAGErCluster | Union class for functions that accept both TagClusters and ConsensusClusters. |
Note: this still uses the CAGEset
class.
There are several large collections of CAGE data available that provide single base-pair resolution TSSs for numerous human and mouse primary cells, cell lines and tissues. Together with several minor datasets for other model organisms (Drosophila melanogaster, Danio rerio) they are a valuable resource that provides cell/tissue type and developmental stage specific TSSs essential for any type of promoter centred analysis. By enabling direct and user-friendly import of TSSs for selected samples into R, CAGEr facilitates the integration of these precise TSS data with other genomic data types. Each of the available CAGE data resources accessible from within CAGEr is explained in more detail further below.
FANTOM consortium provides single base-pair resolution TSS data for numerous
human and mouse primary cells, cell lines and tissues. The main FANTOM5
publication (Consortium 2014) released ~1000 human and ~400 mouse CAGE
samples that cover the vast majority of human primary cell types, mouse
developmental tissues and a large number of commonly used cell lines. These data
is available from FANTOM web resource http://fantom.gsc.riken.jp/5/data/ in
the form of TSS files with genomic coordinates and number of tags mapping to
each TSS detected by CAGE. The list of all available samples for both human and
mouse (as presented in the Supplementary Table 1 of the publication) has been
included in CAGEr to facilitate browsing, searching and selecting samples
of interest. TSSs for selected samples are then fetched directly form the web
resource and imported into a CAGEset
object enabling their further
manipulation with CAGEr.
Previous FANTOM projects (3 and 4) (Consortium 2005,Faulkner et al. (2009),Suzuki et al. (2009)))
produced CAGE datasets for multiple human and mouse tissues as well as several
timecourses, including differentiation of a THP-1 human myeloid leukemia cell
line. All this TSS data has been grouped into datasets by the organism and
tissue of origin and has been collected into an R data package named
FANTOM3and4CAGE, which is available from Bioconductor https://bioconductor.org/packages/FANTOM3and4CAGE.
The vignette accompanying the package provides information on available datasets
and lists of samples. When the data package is installed, CAGEr can import the
TSSs for selected samples directly into a CAGEset
object for further
manipulation.
ENCODE consortium produced CAGE data for common human cell lines (Djebali et al. 2012), which were used by ENCODE for various other types of genome-wide analyses. The advantage of this dataset is that it enables the integration of precise TSSs from a specific cell line with many other genome-wide data types provided by ENCODE for the same cell line. However, the format of CAGE data provided by ENCODE at UCSC (http://genome.ucsc.edu/ENCODE/dataMatrix/encodeDataMatrixHuman.html) includes only raw mapped CAGE tags and their coverage along the genome, and coordinates of enriched genomic regions (peaks), which do not take advantage of the single base-pair resolution provided by CAGE. To address this, we have used the raw CAGE tags to derive single base-pair resolution TSSs and collected them into an R data package named ENCODEprojectCAGE. This data package is available for download from CAGEr web site at http://promshift.genereg.net/CAGEr and includes TSSs for 36 different cell lines fractionated by cellular compartment. The vignette accompanying the package provides information on available datasets and lists of individual samples. Once the package has been downloaded and installed, CAGEr can access it to import TSS data for selected subset of samples for further manipulation and integration.
Precise TSSs are also available for zebrafish (Danio Rerio) from CAGE data
published by (Nepal et al. 2013). The timecourse covering early
embryonic development of zebrafish includes 12 developmental stages. The TSS
data has been collected into an R data package named ZebrafishDevelopmentalCAGE,
which is available for download from CAGEr web site at http://promshift.genereg.net/CAGEr.
As with other data packages mentioned above, once the package is installed
CAGEr can use it to import stage-specific single base pair TSSs into a
CAGEset
object.
Note: this still uses the CAGEset
class.
The data from above mentioned resources can be imported into a CAGEset
object
using the importPublicData()
function. It function has four arguments: source
,
dataset
, group
and sample
. Argument source
accepts one of the following
values: "FANTOM5"
, "FANTOM3and4"
, "ENCODE"
, or "ZebrafishDevelopment"
,
which refer to one of the four resources listed above. The following sections
explain how to utilize this function for each of the four currently supported
resources.
Lists of all human and mouse CAGE samples produced within FANTOM5 project are available in CAGEr. To load the information on human samples type:
data(FANTOM5humanSamples)
head(FANTOM5humanSamples)
## sample type
## 143 acantholytic_squamous_carcinoma_cell_line_HCC1806 cell line
## 46 acute_lymphoblastic_leukemia__B-ALL__cell_line_BALL-1 cell line
## 75 acute_lymphoblastic_leukemia__B-ALL__cell_line_NALM-6 cell line
## 24 acute_lymphoblastic_leukemia__T-ALL__cell_line_HPB-ALL cell line
## 48 acute_lymphoblastic_leukemia__T-ALL__cell_line_Jurkat cell line
## 250 acute_myeloid_leukemia__FAB_M0__cell_line_Kasumi-3 cell line
## description library_id
## 143 acantholytic squamous carcinoma cell line:HCC1806 CNhs11844
## 46 acute lymphoblastic leukemia (B-ALL) cell line:BALL-1 CNhs11251
## 75 acute lymphoblastic leukemia (B-ALL) cell line:NALM-6 CNhs11282
## 24 acute lymphoblastic leukemia (T-ALL) cell line:HPB-ALL CNhs10746
## 48 acute lymphoblastic leukemia (T-ALL) cell line:Jurkat CNhs11253
## 250 acute myeloid leukemia (FAB M0) cell line:Kasumi-3 CNhs13241
## data_url
## 143 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.cell_line.hCAGE/acantholytic%2520squamous%2520carcinoma%2520cell%2520line%253aHCC1806.CNhs11844.10717-109I6.hg19.ctss.bed.gz
## 46 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.cell_line.hCAGE/acute%2520lymphoblastic%2520leukemia%2520%2528B-ALL%2529%2520cell%2520line%253aBALL-1.CNhs11251.10455-106G5.hg19.ctss.bed.gz
## 75 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.cell_line.hCAGE/acute%2520lymphoblastic%2520leukemia%2520%2528B-ALL%2529%2520cell%2520line%253aNALM-6.CNhs11282.10534-107G3.hg19.ctss.bed.gz
## 24 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.cell_line.hCAGE/acute%2520lymphoblastic%2520leukemia%2520%2528T-ALL%2529%2520cell%2520line%253aHPB-ALL.CNhs10746.10429-106D6.hg19.ctss.bed.gz
## 48 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.cell_line.hCAGE/acute%2520lymphoblastic%2520leukemia%2520%2528T-ALL%2529%2520cell%2520line%253aJurkat.CNhs11253.10464-106H5.hg19.ctss.bed.gz
## 250 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.cell_line.LQhCAGE/acute%2520myeloid%2520leukemia%2520%2528FAB%2520M0%2529%2520cell%2520line%253aKasumi-3.CNhs13241.10789-110H6.hg19.ctss.bed.gz
nrow(FANTOM5humanSamples)
## [1] 988
There are 988 human samples in total and for each the following information is provided:
sample
: a unique name/label of the sample which should be provided to
importPublicData()
function to retrieve given sample
type
: type of sample, which can be “cell line”, “primary cell” or “tissue”
description
: short description of the sample as provided in FANTOM5 main
publication (Consortium 2014)
library_id
: unique ID of the CAGE library within FANTOM5
data_url
: URL to corresponding CTSS file at FANTOM5 web resource from which
the data is fetched
Provided information facilitates searching for samples of interest, e.g. we can search for astrocyte samples:
astrocyteSamples <- FANTOM5humanSamples[grep("Astrocyte",
FANTOM5humanSamples[,"description"]),]
astrocyteSamples
## sample type
## 343 Astrocyte_-_cerebellum__donor1 primary cell
## 537 Astrocyte_-_cerebellum__donor2 primary cell
## 562 Astrocyte_-_cerebellum__donor3 primary cell
## 286 Astrocyte_-_cerebral_cortex__donor1 primary cell
## 429 Astrocyte_-_cerebral_cortex__donor2 primary cell
## 472 Astrocyte_-_cerebral_cortex__donor3 primary cell
## description library_id
## 343 Astrocyte - cerebellum, donor1 CNhs11321
## 537 Astrocyte - cerebellum, donor2 CNhs12081
## 562 Astrocyte - cerebellum, donor3 CNhs12117
## 286 Astrocyte - cerebral cortex, donor1 CNhs10864
## 429 Astrocyte - cerebral cortex, donor2 CNhs11960
## 472 Astrocyte - cerebral cortex, donor3 CNhs12005
## data_url
## 343 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.primary_cell.hCAGE/Astrocyte%2520-%2520cerebellum%252c%2520donor1.CNhs11321.11500-119F6.hg19.ctss.bed.gz
## 537 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.primary_cell.hCAGE/Astrocyte%2520-%2520cerebellum%252c%2520donor2.CNhs12081.11580-120F5.hg19.ctss.bed.gz
## 562 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.primary_cell.hCAGE/Astrocyte%2520-%2520cerebellum%252c%2520donor3.CNhs12117.11661-122F5.hg19.ctss.bed.gz
## 286 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.primary_cell.hCAGE/Astrocyte%2520-%2520cerebral%2520cortex%252c%2520donor1.CNhs10864.11235-116D2.hg19.ctss.bed.gz
## 429 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.primary_cell.hCAGE/Astrocyte%2520-%2520cerebral%2520cortex%252c%2520donor2.CNhs11960.11316-117D2.hg19.ctss.bed.gz
## 472 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.primary_cell.hCAGE/Astrocyte%2520-%2520cerebral%2520cortex%252c%2520donor3.CNhs12005.11392-118C6.hg19.ctss.bed.gz
data(FANTOM5mouseSamples)
head(FANTOM5mouseSamples)
## sample
## 144 J2E_erythroblastic_leukemia_response_to_erythropoietin__00hr00min__biol_rep1
## 145 J2E_erythroblastic_leukemia_response_to_erythropoietin__00hr00min__biol_rep2
## 146 J2E_erythroblastic_leukemia_response_to_erythropoietin__00hr00min__biol_rep3
## 91 Atoh1+_Inner_ear_hair_cells_-_organ_of_corti__pool1
## 111 CD326+_enterocyte_isolated_from_mice__treated_with_RANKL__day03__pool1
## 109 CD326++_enterocyte_isolated_from_mice__treated_with_RANKL__day03__pool1
## type
## 144 cell line
## 145 cell line
## 146 cell line
## 91 primary cell
## 111 primary cell
## 109 primary cell
## description
## 144 J2E erythroblastic leukemia response to erythropoietin, 00hr00min, biol_rep1
## 145 J2E erythroblastic leukemia response to erythropoietin, 00hr00min, biol_rep2
## 146 J2E erythroblastic leukemia response to erythropoietin, 00hr00min, biol_rep3
## 91 Atoh1+ Inner ear hair cells - organ of corti, pool1
## 111 CD326+ enterocyte isolated from mice, treated with RANKL, day03, pool1
## 109 CD326++ enterocyte isolated from mice, treated with RANKL, day03, pool1
## library_id
## 144 CNhs12449
## 145 CNhs12668
## 146 CNhs12770
## 91 CNhs12533
## 111 CNhs13242
## 109 CNhs13236
## data_url
## 144 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/mouse.timecourse.hCAGE/J2E%2520erythroblastic%2520leukemia%2520response%2520to%2520erythropoietin%252c%252000hr00min%252c%2520biol_rep1.CNhs12449.13063-139I3.mm9.ctss.bed.gz
## 145 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/mouse.timecourse.hCAGE/J2E%2520erythroblastic%2520leukemia%2520response%2520to%2520erythropoietin%252c%252000hr00min%252c%2520biol_rep2.CNhs12668.13129-140G6.mm9.ctss.bed.gz
## 146 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/mouse.timecourse.hCAGE/J2E%2520erythroblastic%2520leukemia%2520response%2520to%2520erythropoietin%252c%252000hr00min%252c%2520biol_rep3.CNhs12770.13195-141E9.mm9.ctss.bed.gz
## 91 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/mouse.primary_cell.LQhCAGE/Atoh1%252b%2520Inner%2520ear%2520hair%2520cells%2520-%2520organ%2520of%2520corti%252c%2520pool1.CNhs12533.12158-128G7.mm9.ctss.bed.gz
## 111 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/mouse.primary_cell.LQhCAGE/CD326%252b%2520enterocyte%2520isolated%2520from%2520mice%252c%2520treated%2520with%2520RANKL%252c%2520day03%252c%2520pool1.CNhs13242.11850-124I5.mm9.ctss.bed.gz
## 109 http://fantom.gsc.riken.jp/5/datafiles/latest/basic/mouse.primary_cell.LQhCAGE/CD326%252b%252b%2520enterocyte%2520isolated%2520from%2520mice%252c%2520treated%2520with%2520RANKL%252c%2520day03%252c%2520pool1.CNhs13236.11852-124I7.mm9.ctss.bed.gz
nrow(FANTOM5mouseSamples)
## [1] 395
To import TSS data for samples of interest from FANTOM5 we use the importPublicData()
function and set the argument source = "FANTOM5"
. The dataset
argument can
be set to either "human"
or "mouse"
, and the sample
argument is provided
by a vector of sample lables/names. For example, names of astrocyte samples from
above are:
astrocyteSamples[,"sample"]
## [1] "Astrocyte_-_cerebellum__donor1"
## [2] "Astrocyte_-_cerebellum__donor2"
## [3] "Astrocyte_-_cerebellum__donor3"
## [4] "Astrocyte_-_cerebral_cortex__donor1"
## [5] "Astrocyte_-_cerebral_cortex__donor2"
## [6] "Astrocyte_-_cerebral_cortex__donor3"
and to import first three samples type:
astrocyteCAGEset <- importPublicData(source = "FANTOM5", dataset = "human",
sample = astrocyteSamples[1:3,"sample"])
astrocyteCAGEset
##
## S4 Object of class CAGEset
##
## =======================================
## Input data information
## =======================================
## Reference genome (organism): BSgenome.Hsapiens.UCSC.hg19
## Input file type: FANTOM5
## Input file names: FANTOM5__Astrocyte_-_cerebellum__donor1, FANTOM5__Astrocyte_-_cerebellum__donor2, FANTOM5__Astrocyte_-_cerebellum__donor3
## Sample labels: Astrocyte_-_cerebellum__donor1, Astrocyte_-_cerebellum__donor2, Astrocyte_-_cerebellum__donor3
## =======================================
## CTSS information
## =======================================
## CTSS chromosome: chr1, chr1, chr1, ...
## CTSS position: 564634, 564637, 564645, ...
## CTSS strand: +, +, +, ...
## Tag count:
## -> Astrocyte_-_cerebellum__donor1: 0, 0, 0, ...
## -> Astrocyte_-_cerebellum__donor2: 0, 0, 0, ...
## -> Astrocyte_-_cerebellum__donor3: 1, 1, 2, ...
## Normalized tpm:
## =======================================
## Tag cluster (TC) information
## =======================================
## CTSS clustering method:
## Number of TCs per sample:
## =======================================
## Consensus cluster information
## =======================================
## Number of consensus clusters:
## Consensus cluster chromosome:
## Consensus cluster start:
## Consensus cluster end:
## Consensus cluster strand:
## Normalized tpm:
## =======================================
## Expression profiling
## =======================================
## Expression clustering method:
## Expression clusters for consensus clusters:
## =======================================
## Promoter shifting
## =======================================
## GroupX:
## GroupY:
## Shifting scores:
## KS p-values (FDR adjusted):
The resulting astrocyteCAGEset
is a CAGEset
object that can be included in
the CAGEr workflow described above to perform normalisation, clustering,
visualisation, etc.
To use TSS data from FANTOM3 and FANTOM4 projects, a data package FANTOM3and4CAGE has to be installed and loaded. This package is available from Bioconductor and can be installed by calling:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("FANTOM3and4CAGE")
For the list of available datasets with group and sample labels for specific human or mouse samples load the data package and get list of samples:
library(FANTOM3and4CAGE)
data(FANTOMhumanSamples)
head(FANTOMhumanSamples)
## dataset group sample
## 1 FANTOMtissueCAGEhuman cerebrum cerebrum
## 2 FANTOMtissueCAGEhuman renal_artery renal_artery
## 3 FANTOMtissueCAGEhuman ureter ureter
## 4 FANTOMtissueCAGEhuman urinary_bladder urinary_bladder
## 5 FANTOMtissueCAGEhuman kidney malignancy
## 6 FANTOMtissueCAGEhuman kidney kidney
data(FANTOMmouseSamples)
head(FANTOMmouseSamples)
## dataset group
## 1 FANTOMtissueCAGEmouse brain
## 2 FANTOMtissueCAGEmouse brain
## 3 FANTOMtissueCAGEmouse brain
## 4 FANTOMtissueCAGEmouse brain
## 5 FANTOMtissueCAGEmouse cerebral_cortex
## 6 FANTOMtissueCAGEmouse hippocampus
## sample
## 1 brain
## 2 CCL-131_Neuro-2a_control
## 3 CCL-131_Neuro-2a_treatment_for_6hr_with_MPP+
## 4 CCL-131_Neuro-2a_treatment_for_12hr_with_MPP+
## 5 cerebral_cortex
## 6 hippocampus
In the above data frames, the columns dataset
, group
and sample
provide
values that should be passed to corresponding arguments in importPublicData()
function. For example to import human kidney normal and malignancy samples call:
kidneyCAGEset <- importPublicData(source = "FANTOM3and4",
dataset = "FANTOMtissueCAGEhuman",
group = "kidney", sample = c("kidney", "malignancy"))
kidneyCAGEset
##
## S4 Object of class CAGEset
##
## =======================================
## Input data information
## =======================================
## Reference genome (organism): BSgenome.Hsapiens.UCSC.hg18
## Input file type: FANTOM3and4
## Input file names: FANTOM3and4__kidney__kidney, FANTOM3and4__kidney__malignancy
## Sample labels: kidney__kidney, kidney__malignancy
## =======================================
## CTSS information
## =======================================
## CTSS chromosome: chr1, chr1, chr1, ...
## CTSS position: 10006, 554450, 703878, ...
## CTSS strand: -, +, -, ...
## Tag count:
## -> kidney__kidney: 1, 1, 1, ...
## -> kidney__malignancy: 0, 0, 0, ...
## Normalized tpm:
## =======================================
## Tag cluster (TC) information
## =======================================
## CTSS clustering method:
## Number of TCs per sample:
## =======================================
## Consensus cluster information
## =======================================
## Number of consensus clusters:
## Consensus cluster chromosome:
## Consensus cluster start:
## Consensus cluster end:
## Consensus cluster strand:
## Normalized tpm:
## =======================================
## Expression profiling
## =======================================
## Expression clustering method:
## Expression clusters for consensus clusters:
## =======================================
## Promoter shifting
## =======================================
## GroupX:
## GroupY:
## Shifting scores:
## KS p-values (FDR adjusted):
When the samples belong to different groups or different datasets, it is necessary to provide the dataset and group assignment for each sample separately:
mixedCAGEset <- importPublicData(source = "FANTOM3and4",
dataset = c("FANTOMtissueCAGEmouse", "FANTOMtissueCAGEmouse",
"FANTOMtimecourseCAGEmouse"), group = c("liver", "liver",
"liver_under_constant_darkness"),
sample = c("cloned_mouse", "control_mouse", "4_hr"))
mixedCAGEset
##
## S4 Object of class CAGEset
##
## =======================================
## Input data information
## =======================================
## Reference genome (organism): BSgenome.Mmusculus.UCSC.mm9
## Input file type: FANTOM3and4
## Input file names: FANTOM3and4__liver__cloned_mouse, FANTOM3and4__liver__control_mouse, FANTOM3and4__liver_under_constant_darkness__4_hr
## Sample labels: liver__cloned_mouse, liver__control_mouse, liver_under_constant_darkness__4_hr
## =======================================
## CTSS information
## =======================================
## CTSS chromosome: chr1, chr1, chr1, ...
## CTSS position: 3153104, 3218269, 3467930, ...
## CTSS strand: +, +, +, ...
## Tag count:
## -> liver__cloned_mouse: 1, 1, 1, ...
## -> liver__control_mouse: 0, 0, 0, ...
## -> liver_under_constant_darkness__4_hr: 0, 0, 0, ...
## Normalized tpm:
## =======================================
## Tag cluster (TC) information
## =======================================
## CTSS clustering method:
## Number of TCs per sample:
## =======================================
## Consensus cluster information
## =======================================
## Number of consensus clusters:
## Consensus cluster chromosome:
## Consensus cluster start:
## Consensus cluster end:
## Consensus cluster strand:
## Normalized tpm:
## =======================================
## Expression profiling
## =======================================
## Expression clustering method:
## Expression clusters for consensus clusters:
## =======================================
## Promoter shifting
## =======================================
## GroupX:
## GroupY:
## Shifting scores:
## KS p-values (FDR adjusted):
For more details about datasets available in the data package please refer to the vignette accompanying the package.
TSS data derived from ENCODE CAGE datasets has been collected into
ENCODEprojectCAGE data package, which is available for download from the
CAGEr web site (http://promshift.genereg.net/CAGEr/). Downloaded package can
be installed from local using install.packages()
function from within R and
used with CAGEr as described below.
List of datasets available in this data package can be obtained like this:
library(ENCODEprojectCAGE)
data(ENCODEhumanCellLinesSamples)
The information provided in this data frame is analogous to the one in previously discussed data package and provides values to be used with importPublicData()
function. The command to import whole cell CAGE samples for three different cell lines would look like this:
ENCODEset <- importPublicData(source = "ENCODE",
dataset = c("A549", "H1-hESC", "IMR90"),
group = c("cell", "cell", "cell"), sample = c("A549_cell_rep1",
"H1-hESC_cell_rep1", "IMR90_cell_rep1"))
For more details about datasets available in the ENCODEprojectCAGE data package please refer to the vignette accompanying the package.
The zebrafish TSS data for 12 developmental stages is collected in
ZebrafishDevelopmentalCAGE data package, which is also available for download
from the CAGEr web site (http://promshift.genereg.net/CAGEr/). It can be
installed from local using install.packages()
function. To get a list of
samples within the package type:
library(ZebrafishDevelopmentalCAGE)
data(ZebrafishSamples)
In this package there is only one dataset called ZebrafishCAGE
and all samples
belong to the same group called development
. To import selected samples from
this dataset type:
zebrafishCAGEset <- importPublicData(source = "ZebrafishDevelopment",
dataset = "ZebrafishCAGE", group = "development",
sample = c("zf_64cells", "zf_prim6"))
For more details please refer to the vignette accompanying the data package.
Importing TSS data from any of the four above explained resources results in the
CAGEset
object that can be directly included into the workflow provided by
CAGEr to perform normalisation, clustering, promoter width analysis,
visualisation, etc. This high-resolution TSS data can then easily be
integrated with other genomic data types to perform various promoter-centred
analyses, which does not rely on annotation but uses precise and matched
cell/tissue type TSSs.
sessionInfo()
## R version 3.5.1 Patched (2018-07-12 r74967)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] FANTOM3and4CAGE_1.18.0 CAGEr_1.24.0
## [3] MultiAssayExperiment_1.8.1 SummarizedExperiment_1.12.0
## [5] DelayedArray_0.8.0 BiocParallel_1.16.0
## [7] matrixStats_0.54.0 Biobase_2.42.0
## [9] GenomicRanges_1.34.0 GenomeInfoDb_1.18.0
## [11] IRanges_2.16.0 S4Vectors_0.20.0
## [13] BiocGenerics_0.28.0 BiocStyle_2.10.0
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-137 bitops_1.0-6
## [3] bit64_0.9-7 RColorBrewer_1.1-2
## [5] rprojroot_1.3-2 BSgenome.Drerio.UCSC.danRer7_1.4.0
## [7] tools_3.5.1 backports_1.1.2
## [9] R6_2.3.0 vegan_2.5-3
## [11] rpart_4.1-13 KernSmooth_2.23-15
## [13] DBI_1.0.0 Hmisc_4.1-1
## [15] lazyeval_0.2.1 mgcv_1.8-25
## [17] colorspace_1.3-2 nnet_7.3-12
## [19] permute_0.9-4 gridExtra_2.3
## [21] tidyselect_0.2.5 DESeq2_1.22.0
## [23] bit_1.1-14 compiler_3.5.1
## [25] htmlTable_1.12 rtracklayer_1.42.0
## [27] labeling_0.3 bookdown_0.7
## [29] checkmate_1.8.5 scales_1.0.0
## [31] genefilter_1.64.0 stringr_1.3.1
## [33] digest_0.6.18 Rsamtools_1.34.0
## [35] foreign_0.8-71 rmarkdown_1.10
## [37] stringdist_0.9.5.1 XVector_0.22.0
## [39] base64enc_0.1-3 pkgconfig_2.0.2
## [41] htmltools_0.3.6 BSgenome_1.50.0
## [43] highr_0.7 htmlwidgets_1.3
## [45] rlang_0.3.0.1 RSQLite_2.1.1
## [47] rstudioapi_0.8 VGAM_1.0-6
## [49] bindr_0.1.1 gtools_3.8.1
## [51] acepack_1.4.1 dplyr_0.7.7
## [53] RCurl_1.95-4.11 magrittr_1.5
## [55] GenomeInfoDbData_1.2.0 Formula_1.2-3
## [57] Matrix_1.2-15 Rcpp_0.12.19
## [59] munsell_0.5.0 stringi_1.2.4
## [61] yaml_2.2.0 MASS_7.3-51.1
## [63] zlibbioc_1.28.0 plyr_1.8.4
## [65] blob_1.1.1 grid_3.5.1
## [67] formula.tools_1.7.1 crayon_1.3.4
## [69] lattice_0.20-35 Biostrings_2.50.0
## [71] splines_3.5.1 annotate_1.60.0
## [73] locfit_1.5-9.1 knitr_1.20
## [75] beanplot_1.2 pillar_1.3.0
## [77] geneplotter_1.60.0 codetools_0.2-15
## [79] XML_3.98-1.16 glue_1.3.0
## [81] evaluate_0.12 latticeExtra_0.6-28
## [83] data.table_1.11.8 BiocManager_1.30.3
## [85] operator.tools_1.6.3 gtable_0.2.0
## [87] purrr_0.2.5 reshape_0.8.8
## [89] assertthat_0.2.0 ggplot2_3.1.0
## [91] xfun_0.4 xtable_1.8-3
## [93] survival_2.43-1 tibble_1.4.2
## [95] som_0.3-5.1 AnnotationDbi_1.44.0
## [97] GenomicAlignments_1.18.0 memoise_1.1.0
## [99] bindrcpp_0.2.2 cluster_2.0.7-1
Balwierz, Piotr J, Piero Carninci, Carsten O Daub, Jun Kawai, Yoshihide Hayashizaki, Werner Van Belle, Christian Beisel, and Erik van Nimwegen. 2009. “Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data.” Genome Biology 10 (7):R79.
Carninci, Piero, Albin Sandelin, Boris Lenhard, Shintaro Katayama, Kazuro Shimokawa, Jasmina Ponjavic, Colin A M Semple, et al. 2006. “Genome-wide analysis of mammalian promoter architecture and evolution.” Nature Genetics 38 (6):626–35.
Carninci, P, C Kvam, A Kitamura, T Ohsumi, Y Okazaki, M Itoh, M Kamiya, et al. 1996. “High-efficiency full-length cDNA cloning by biotinylated CAP trapper.” Genomics 37 (3):327–36.
Consortium, The FANTOM. 2005. “The Transcriptional Landscape of the Mammalian Genome.” Science 309 (5740):1559–63.
———. 2014. “A promoter-level mammalian expression atlas.” Nature 507 (7493):462–70.
Djebali, Sarah, Carrie A Davis, Angelika Merkel, Alex Dobin, Timo Lassmann, Ali Mortazavi, Andrea Tanzer, et al. 2012. “Landscape of transcription in human cells.” Nature 488 (7414):101–8.
Faulkner, Geoffrey J, Yasumasa Kimura, Carsten O Daub, Shivangi Wani, Charles Plessy, Katharine M Irvine, Kate Schroder, et al. 2009. “The regulated retrotransposon transcriptome of mammalian cells.” Nature Genetics 41 (5):563–71.
Frith, M C, E Valen, A Krogh, Y Hayashizaki, P Carninci, and A Sandelin. 2007. “A code for transcription initiation in mammalian genomes.” Genome Research 18 (1):1–12.
Haberle, Vanja, Alistair R R Forrest, Yoshihide Hayashizaki, Piero Carninci, and Boris Lenhard. 2015. “CAGEr: precise TSS data retrieval and high-resolution promoterome mining for integrative analyses.” Nucleic Acids Research Epub ahead of print (2015 Feb 4). https://doi.org/10.1093/nar/gkv054.
Kodzius, Rimantas, Miki Kojima, Hiromi Nishiyori, Mari Nakamura, Shiro Fukuda, Michihira Tagami, Daisuke Sasaki, et al. 2006. “CAGE: cap analysis of gene expression.” Nature Methods 3 (3):211–22.
Nepal, Chirag, Yavor Hadzhiev, Christopher Previti, Vanja Haberle, Nan Li, Hazuki Takahashi, Ana Maria S. Suzuki, et al. 2013. “Dynamic regulation of coding and non-coding transcription initiation landscape at single nucleotide resolution during vertebrate embryogenesis.” Genome Research 23 (11):1938–50.
Suzuki, Harukazu, Alistair R R Forrest, Erik van Nimwegen, Carsten O Daub, Piotr J Balwierz, Katharine M Irvine, Timo Lassmann, et al. 2009. “The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line.” Nature Genetics 41 (5):553–62.
Takahashi, Hazuki, Timo Lassmann, Mitsuyoshi Murata, and Piero Carninci. 2012. “5’ end-centered expression profiling using cap-analysis gene expression and next-generation sequencing.” Nature Protocols 7 (3):542–61.