if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
devtools::install_github("gaolabtools/Hypatia")Background
RNA isoforms diversify the coding capacity of the genome and have significant implications in health and disease. Long-read single-cell RNA-sequencing (LR-scRNAseq) enables the investigation of RNA isoforms at cell type resolution. Hypatia is a computational suite in R for data analysis across 3 modules: 1) isoform usage, which identifies differential isoform usage events (DIUs), 2) isoform diversity, which identifies differential isoform diversity events (DIVs), and 3) isoform expression, which identifies differentially expressed isoforms (DEIs).
Check out our paper for more details on Hypatia. If you found this tool useful, please remember to cite us:
[Citation]Setup
Installation
Hypatia is an R package available through Github:
The following dependencies will be installed if they are not already present: checkmate, dplyr (>= 1.0.0), ggnewscale, ggplot2, Matrix, patchwork, purrr, S4Vectors, SingleCellExperiment, SummarizedExperiment, tidyr, and matrixTests.
Preparing the input data
Hypatia requires 3 inputs:
Single-cell isoform expression matrix
A data frame or matrix of raw isoform counts. Row names are transcript IDs and column names are cell IDs.Cell-level metadata
A data frame with at least one column of cell annotations (cell types, clusters) necessary for differential analysis. Row names are cell IDs matching the count matrix.Transcript-level metadata
A data frame with at least one column containing gene identifiers associated with transcripts in the count matrix. Row names are transcript IDs matching the count matrix.
Tutorial data
For the tutorial, we analyze single-cell long-read RNA-seq data of a human glioblastoma (GBM) tumor. Raw reads were processed using scNanoGPS, in which IsoQuant performed isoform quantification. Gene expression data were analyzed beforehand in Seurat, revealing 8 cell types: astrocytes, oligodendrocyte precursor cells (OPCs), tumor, microglia, vascular, GABAergic neurons (GABA neuron), and glutamatergic neurons (Glut neuron):
The following input data for the tutorial is available (lazy-loaded) after the package is loaded into the environment:
library(Hypatia)
gbm_countData
gbm_colData
gbm_rowData
gbm_gtfThe raw single-cell isoform expression matrix gbm_countData consists of 56,642 isoforms from 5,043 cells.
gbm_countData
str(gbm_countData)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:934559] 77 189 393 406 499 577 719 1340 1394 1566 ...
## ..@ p : int [1:5044] 0 256 985 1365 1416 1554 1828 2021 2145 2671 ...
## ..@ Dim : int [1:2] 56642 5043
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:56642] "ENST00000008440" "ENST00000040877" "ENST00000054666" "ENST00000060969" ...
## .. ..$ : chr [1:5043] "GCCACAACATCGGTTC" "TCGGGCATCACCAGTA" "CCGAATTGTAGCTCGC" "CTGCATGGTTGGATAG" ...
## ..@ x : num [1:934559] 1 1 1 1 1 1 1 1 1 1 ...
## ..@ factors : list()The cell-level metadata gbm_colData contains the columns cell_type, umap_1, and umap_2, which were all extracted from the Seurat object from gene expression analysis. The gene expression-derived umap_1 and umap_2 will be used for visualizations later.
gbm_colData
str(gbm_colData)
## 'data.frame': 5043 obs. of 3 variables:
## $ cell_type: chr "Microglia" "Glut neuron" "GABA neuron" "Oligodendrocyte" ...
## $ umap_1 : num 7.58 -14.63 -12.36 5.86 8.17 ...
## $ umap_2 : num 1.97 -6.25 5.92 -7.23 7.03 ...The transcript-level metadata gbm_rowData has the column gene_id that contains the associated genes for each transcript. Additional columns gene_name and transcript_name are included for easy viewing in the downstream analysis. The remaining columns hold various isoform QC metrics from running the SQANTI3 pipeline. We strongly recommend providing these, or similar metrics, to improve downstream isoform filtering.
gbm_rowData
str(gbm_rowData)
## 'data.frame': 56642 obs. of 7 variables:
## $ structural_category: chr "full-splice_match" "full-splice_match" "full-splice_match" "full-splice_match" ...
## $ gene_id : chr "ENSG00000010072" "ENSG00000059588" "ENSG00000049245" "ENSG00000052723" ...
## $ RTS_stage : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ within_CAGE_peak : logi TRUE TRUE TRUE TRUE FALSE TRUE ...
## $ within_polyA_site : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ gene_name : chr "SPRTN" "TARBP1" "VAMP3" "SIKE1" ...
## $ transcript_name : chr "SPRTN-201" "TARBP1-201" "VAMP3-201" "SIKE1-201" ...For the detailed code used to prepare the individual metadata, see Utilities.
1. Create the SCE object
Hypatia utilizes the SingleCellExperiment. object, which provides a standardized framework for storing and organizing data for streamlined single-cell genomics analysis.
CreateSCE
After the input data are prepared, the SCE object can be created using the function CreateSCE().
library(Hypatia)
# Create the SCE
gbm <- CreateSCE(
countData = gbm_countData,
colData = gbm_colData,
rowData = gbm_rowData
)gbm
## class: SingleCellExperiment
## dim: 56642 5043
## metadata(3): active.gene.id active.group.id active.transcript.id
## assays(1): counts
## rownames(56642): ENST00000008440 ENST00000040877 ...
## transcript29.KI270742.1.nnic transcript71.KI270742.1.nnic
## rowData names(8): nCell structural_category ... gene_name
## transcript_name
## colnames(5043): GCCACAACATCGGTTC TCGGGCATCACCAGTA ... AAGTCACCATTAACCT
## CCCGATAGTGATTCCG
## colData names(7): project nCount ... umap_1 umap_2
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):- By default,
CreateSCE()searches for a column called “gene_id” inrowDatafor associated genes. This can be changed in the paramteractive.gene.id. - The row name orders in
rowDataandcolDatamust match the row and column name orders incountData, respectively.
Object slots
Data within SCE objects are stored inside ‘slots’. The main slots that are utilized by Hypatia are assays (count data), colData (cell-level metadata), and rowData (transcript-level metadata). They can be accessed from the object as shown:
# Raw count matrix
assay(gbm, "counts")
# Cell-level metadata
colData(gbm)
# Transcript-level metadata
rowData(gbm)Hypatia’s CreateSCE() will also create the following new variables in colData and rowData:
- (colData)
project: Project name (default:"Project"). - (colData)
nCount: Number of isoform counts for the cell. - (colData)
nTranscript: Number of unique isoforms detected in the cell. - (colData)
nGene: Number of unique genes detected in the cell. - (rowData)
nCell: Number of cells with detection of the isoform.
colData
colData(gbm) |> head()
## DataFrame with 6 rows and 7 columns
## project nCount nTranscript nGene cell_type
## <character> <numeric> <integer> <integer> <character>
## GCCACAACATCGGTTC Project 269 256 248 Microglia
## TCGGGCATCACCAGTA Project 804 729 683 Glut neuron
## CCGAATTGTAGCTCGC Project 392 380 369 GABA neuron
## CTGCATGGTTGGATAG Project 53 51 50 Oligodendrocyte
## ATCCAATAGCAGGGTA Project 157 138 135 Microglia
## GGGATATAGTATGGAA Project 286 274 264 GABA neuron
## umap_1 umap_2
## <numeric> <numeric>
## GCCACAACATCGGTTC 7.57581 1.96721
## TCGGGCATCACCAGTA -14.63275 -6.25013
## CCGAATTGTAGCTCGC -12.36459 5.92043
## CTGCATGGTTGGATAG 5.86348 -7.22947
## ATCCAATAGCAGGGTA 8.16642 7.03476
## GGGATATAGTATGGAA -11.20872 5.81996rowData
rowData(gbm) |> head()
## DataFrame with 6 rows and 8 columns
## nCell structural_category gene_id RTS_stage
## <integer> <character> <character> <logical>
## ENST00000008440 2 full-splice_match ENSG00000010072 FALSE
## ENST00000040877 72 full-splice_match ENSG00000059588 FALSE
## ENST00000054666 60 full-splice_match ENSG00000049245 FALSE
## ENST00000060969 51 full-splice_match ENSG00000052723 FALSE
## ENST00000164247 0 full-splice_match ENSG00000069424 FALSE
## ENST00000166244 8 full-splice_match ENSG00000070886 FALSE
## within_CAGE_peak within_polyA_site gene_name transcript_name
## <logical> <logical> <character> <character>
## ENST00000008440 TRUE TRUE SPRTN SPRTN-201
## ENST00000040877 TRUE TRUE TARBP1 TARBP1-201
## ENST00000054666 TRUE TRUE VAMP3 VAMP3-201
## ENST00000060969 TRUE TRUE SIKE1 SIKE1-201
## ENST00000164247 FALSE TRUE KCNAB2 KCNAB2-201
## ENST00000166244 TRUE TRUE EPHA8 EPHA8-201Object metadata
The metadata() function can be used to access the metadata of the SCE object, and dictates how the object interacts with Hypatia’s functions. These include important settings such as the default cell groups and the gene/transcript names to report. By default, the only defined setting is active.gene.id, which was set by CreateSCE().
metadata(gbm)
## $active.gene.id
## [1] "gene_id"
##
## $active.group.id
## [1] ""
##
## $active.transcript.id
## [1] ""An empty active.transcript.id indicates that the row names of the count matrix (e.g., Ensembl transcript IDs in the tutorial data) should be used. An empty active.groups.id means that cell identities for grouping (e.g., cell_type) must be specified in downstream functions. The active.groups.id can be set in advance to avoid repeatedly defining the grouping variable later on.
Below, we update the metadata for active gene IDs, transcript IDs, and cell groups.
# Set active.gene.id
gbm <- SetGenes(gbm, "gene_name")
# Set active.transcript.id
gbm <- SetTranscripts(gbm, "transcript_name")
# Set active.group.id
gbm <- SetGroups(gbm, "cell_type")metadata(gbm)
## $active.gene.id
## [1] "gene_name"
##
## $active.group.id
## [1] "cell_type"
##
## $active.transcript.id
## [1] "transcript_name"gene_name and transcript_name were both provided in the object’s rowData slot. The active cell grouping variable is set to cell_type of the object’s colData slot.
The object metadata can be changed at any point in the workflow. Providing alternative gene and transcript IDs are optional.
2. Quality control
Transcript filtering
To filter isoforms, SubsetTranscripts() can be used to subset the SCE object by the object’s rowData (transcript-level) variables. Isoforms that are not expressed in at least 3 cells are removed according to the variable nCell below.
gbm <- SubsetTranscripts(gbm, subset = nCell >= 3)Isoforms can also filtered based on the provided transcript-level SQANTI metrics:
gbm <- SubsetTranscripts(
gbm,
subset = ( (structural_category == "full-splice_match" & !RTS_stage) |
(!RTS_stage & within_CAGE_peak & within_polyA_site) )
)Cell filtering
Users can run PlotCellQC() to visualize the distributions of the default variables nTranscript, nGene, and nCount (where nCount is the number of isoform counts per cell).
PlotCellQC(gbm)Afterwards, SubsetCells() can filter cells using these or other colData (cell-level) variables. Cells without detectable expression in more than 50 isoforms are removed below.
gbm <- SubsetCells(object = gbm, subset = nTranscript >= 50)SubsetCells() can also be used to subset the object for reasons unrelated to QC. The following are examples of isolating and removing tumor cells from the cell_type annotation.
# Isolate tumor cells
gbm_tumor <- SubsetCells(object = gbm, subset = cell_type == "Tumor")
# Remove tumor cells
gbm_notumor <- SubsetCells(object = gbm, subset = cell_type == "Tumor", invert = TRUE)Vectors of cell IDs or active transcript IDs can also be used in through the arguments cells and transcripts in their respective subset functions.
3. Differential analysis
Hypatia provides 3 modules for differential isoform analysis:
Isoform Usage: Identifies differential isoform usage (DIU) events, marked by significant differences in isoform ratios of genes across cell groups.
Isoform Diversity: Computes isoform diversity, classifies genes as ‘monoform’ (low diversity) or ‘polyform’ (high diversity), and performs pairwise comparisons of classifications. Genes having diverging classifications across cell groups are considered DIVs.
Isoform Expression: Identifies differentially expressed isoforms (DEIs) by comparing the mean normalized expression of isoforms across groups of cells.
| Method | Analysis | Visualization | Data retrieval |
|---|---|---|---|
| Isoform Usage | RunDIU() |
PlotUsage() |
GetUsage() |
| Isoform Diversity | RunDIV() |
PlotDiversity() |
GetDiversity() |
| Isoform Expression | NormalizeCounts(), RunDEI() |
PlotExpression() |
GetExpression() |
Isoform Usage
Isoform usage refers to the relative abundances of transcript isoforms produced by a single gene. The usage of alternative isoforms is reported to vary across cell types, conditions, and developmental stages. Differential isoform usage (DIU) events can reveal regulatory changes, functional diversity, or shifts that are not apparent from gene-level expression alone.
RunDIU
RunDIU() identifies significant differences in isoform proportions per gene. Genes are first filtered for adequate detection and the Chi-square test is subsequently applied to identify DIU events from single-cell isoform expression data (see ?RunDIU() for more details). Group comparisons can be specified using group.1 and group.2 as shown below:
# Each cell group vs all other cells
all_diu <- RunDIU(gbm)
# Glut neurons vs all other cells
glut_diu <- RunDIU(gbm, group.1 = "Glut neuron")
# Glut and GABA neurons vs all other cells
neur_diu <- RunDIU(gbm, group.1 = c("Glut neuron", "GABA neuron"))
# Glut vs GABA neurons
glutvsgaba_diu <- RunDIU(gbm, group.1 = "Glut neuron", group.2 = "GABA neuron")RunDIU() returns a list of two data frames:
$data contains the summarized isoform usage data used to generate the statistical output:
group.1&group.2: The two cell groups being compared.gene: The gene being tested.gene.pct.1: Percentage of cells ingroup.1with expression of the gene.gene.pct.2: Percentage of cells ingroup.2with expression of the gene.transcript: The associated transcript.cts.1: Total counts of the transcript across all cells ingroup.1.cts.2: Total counts of the transcript across all cells ingroup.2.prop.1: Transcript proportion forgroup.1.prop.2: Transcript proportion forgroup.2.delta: The difference in transcript proportions between groups (group.1-group.2).
$stats contains results from statistical tests:
group.1&group.2: The two cell groups being compared.gene: The gene being tested.max.delta: The largest absolute difference in transcript proportions betweengroup.1andgroup.2.transcript: Transcript associated with the maximum proportion difference.pval: P-value from the statistical test specified inmethod.use(default: Chi-square test).padj: Adjusted p-value, calculated separately for each group comparison (default: Benjamini-Hochberg).effect.size: Effect size of the test, measured as Cramer’s V.approx: Indicates whether the Chi-square approximation is valid ("valid") or potentially unreliable ("warning"), based on whether at least 80% of transcript counts of the contingency table exceed 5.
Example (data)
head(glutvsgaba_diu$data)
## group.1 group.2 gene gene.pct.1 gene.pct.2 transcript
## 1 Glut neuron GABA neuron AAK1 0.1751592 0.2091837 AAK1-208
## 2 Glut neuron GABA neuron AAK1 0.1751592 0.2091837 AAK1-211
## 3 Glut neuron GABA neuron ABCC5 0.1735669 0.1734694 ABCC5-204
## 4 Glut neuron GABA neuron ABCC5 0.1735669 0.1734694 ABCC5-205
## 5 Glut neuron GABA neuron ABCC5 0.1735669 0.1734694 transcript38904.chr3.nic
## 6 Glut neuron GABA neuron ABCC5 0.1735669 0.1734694 transcript39489.chr3.nic
## cts.1 cts.2 prop.1 prop.2 delta
## 1 13 7 0.09848485 0.07291667 0.025568182
## 2 119 89 0.90151515 0.92708333 -0.025568182
## 3 62 40 0.48818898 0.50000000 -0.011811024
## 4 31 19 0.24409449 0.23750000 0.006594488
## 5 5 1 0.03937008 0.01250000 0.026870079
## 6 24 20 0.18897638 0.25000000 -0.061023622Example (stats)
head(glutvsgaba_diu$stats)
## group.1 group.2 gene max.prop.diff transcript pval
## 1 Glut neuron GABA neuron MEG3 -0.14658436 MEG3-209 1.007387e-37
## 2 Glut neuron GABA neuron GOLGA8A -0.22523062 GOLGA8A-202 1.889232e-08
## 3 Glut neuron GABA neuron SNAP25 -0.16139657 SNAP25-201 3.765722e-05
## 4 Glut neuron GABA neuron PCLO -0.29875887 PCLO-201 1.851101e-04
## 5 Glut neuron GABA neuron CCDC144BP 0.02809296 CCDC144BP-208 5.272233e-04
## 6 Glut neuron GABA neuron CNTN3 0.16000000 CNTN3-201 1.253123e-03
## padj effect.size approx
## 1 8.764269e-35 0.17843568 warning
## 2 8.218158e-06 0.19139313 valid
## 3 1.092059e-02 0.30064122 warning
## 4 4.026144e-02 0.31372925 warning
## 5 9.173685e-02 0.07727502 valid
## 6 1.817028e-01 0.30218983 warningPlotUsage
After DIU events are identified, the PlotUsage() function can be run to visualize differences in isoform proportions. Available settings for plot.type are ‘stackedbar’, ‘bar’, ‘pie’ and ‘heatmap’.
Let’s look the transcript proportions of the gene MEG3, which contains top DIU events from several comparisons.
# stackedbar (default)
p_stackedbar <- PlotUsage(
gbm,
gene = "MEG3",
group.subset = c("GABA neuron", "Glut neuron", "Microglia", "Oligodendrocyte", "OPC",
"Tumor"),
min.tx.prop = 0.10
)
# heatmap
p_heatmap <- PlotUsage(
gbm,
gene = "MEG3",
plot.type = "heatmap",
group.subset = c("GABA neuron", "Glut neuron", "Microglia", "Oligodendrocyte", "OPC",
"Tumor"),
min.tx.prop = 0.10,
show.prop = TRUE
)
# pie
p_pie <- PlotUsage(
gbm,
gene = "MEG3",
plot.type = "pie",
group.subset = c("GABA neuron", "Glut neuron", "Microglia", "Oligodendrocyte", "OPC",
"Tumor"),
min.tx.prop = 0.10,
nrow = 3
)
# bar
p_bar <- PlotUsage(
gbm,
gene = "MEG3",
plot.type = "bar",
group.subset = c("GABA neuron", "Glut neuron", "Microglia", "Oligodendrocyte", "OPC",
"Tumor"),
min.tx.prop = 0.10,
nrow = 3
)Patchwork
patchwork::wrap_plots(p_stackedbar, p_heatmap, p_pie, p_bar, ncol = 2)For visualization purposes, transcripts with less than 10% proportion were collapsed into ‘Other’ by setting the parameter min.tx.prop, and the smaller cell populations (astrocytes, vascular) were not shown.
GetUsage
GetUsage() quickly obtains isoform counts and relative abundance from genes of interest.
# Usage of MEG3 isoforms for all groups
meg3_usage <- GetUsage(gbm, genes = "MEG3")
# Usage of MEG3 isoforms in Glut and GABA neurons only
meg3_usage_neu <- GetUsage(gbm, genes = "MEG3", group.subset = c("Glut neuron", "GABA neuron"))Example output
head(meg3_usage_neu)
## group gene gene.pct transcript cts prop
## 1 GABA neuron MEG3 0.9438776 MEG3-203 4 0.0019753086
## 2 GABA neuron MEG3 0.9438776 MEG3-206 13 0.0064197531
## 3 GABA neuron MEG3 0.9438776 MEG3-209 1266 0.6251851852
## 4 GABA neuron MEG3 0.9438776 MEG3-219 1 0.0004938272
## 5 GABA neuron MEG3 0.9438776 MEG3-220 247 0.1219753086
## 6 GABA neuron MEG3 0.9438776 MEG3-222 2 0.0009876543Column descriptions:
group: The group being queried.gene: The gene being queried.gene.pct: Percentage of cells ingroupwith expression of the gene.transcript: The associated transcript.cts: Total counts of the transcript across cells ingroup.prop: Transcript proportion.
Isoform Diversity
Isoform diversity is a measure of the heterogeneity of transcript isoforms produced by a gene. It provides a useful way to quantify the level of isoform complexity within a gene and to compare how this complexity varies across different cell groups or conditions.
RunDIV
RunDIV() computes isoform diversity at the gene level and performs comparisons. By default, the Tsallis entropy is calculated. Genes are then classified as monoform or polyform and classifications are compared using the pairwise McNemar’s test across cell groups, and the pairwise Wilcoxon test is used to compare isoform diversity values. Refer to the help page (?RunDIV) for more details on parameters. The usage is similar to Hypatia’s other ‘Run’ functions:
# Each cell group vs all other cells
gbm_div <- RunDIV(gbm)
# Glut neurons vs all other cells
glut_div <- RunDIV(gbm, group.1 = "Glut neuron")
# Glut and GABA neurons vs all other cells
neur_div <- RunDIV(gbm, group.1 = c("Glut neuron", "GABA neuron"))
# Glut vs GABA neurons
glutvsgaba_div <- RunDIV(gbm, group.1 = "Glut neuron", group.2 = "GABA neuron")RunDIV() returns a list of two data frames:
$data contains the summarized isoform diversity data used to generate the statistical output.
group.1&group.2: The two cell groups being compared.gene: The gene for which the isoform diversity is computed.gene.pct.1: Percentage of cells ingroup.1with expression of the gene.gene.pct.2: Percentage of cells ingroup.2with expression of the gene.n.transcripts: Number of transcripts associated with the gene.div.1: Isoform diversity of the gene for cells ingroup.1.div.2: Isoform diversity of the gene for cells ingroup.2.div.diff: The difference in isoform diversity between groups (group.1-group.2).div.class.1: Isoform diversity classification of the gene forgroup.1.div.class.2: Isoform diversity classification of the gene forgroup.2.div.class.diff: A logical indicating difference between isoform diversity classifications betweengroup.1andgroup.2.
$stats contains results from the statistical tests.
group.1&group.2: The two cell groups being compared.avgDiv.1: Average isoform diversity of genes from cells ingroup.1.avgDiv.2: Average isoform diversity of genes from cells ingroup.2.wilcox.pval: P-value from the paired Wilcoxon rank sum test comparing isoform diversity across the two groups.mono.poly.1: Number of genes classified monoform to polyform from cells ingroup.1.mono.poly.2: Number of genes classified monoform to polyform from cells ingroup.2.mcnemar.pval: P-value from the pairwise McNemar test comparing isoform diversity classifications across the two groups.
Example (data)
head(gbm_div$data)
## group.1
## 1 Astrocyte
## 2 Astrocyte
## 3 Astrocyte
## 4 Astrocyte
## 5 Astrocyte
## 6 Astrocyte
## group.2
## 1 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Vascular
## 2 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Vascular
## 3 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Vascular
## 4 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Vascular
## 5 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Vascular
## 6 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Vascular
## gene gene.pct.1 gene.pct.2 n.transcripts div.1 div.2
## 1 RSRP1 0.1764706 0.11548048 7 0.4110000 0.43044440
## 2 MACF1 0.2941176 0.10329549 13 0.4556327 0.49008379
## 3 MIR9-1HG 0.4588235 0.10855719 19 0.3803472 0.39146140
## 4 GLUL 0.5411765 0.08418721 4 0.1499853 0.18711772
## 5 NOTCH2NLA 0.4588235 0.06729438 3 0.1065089 0.24033209
## 6 ZC3H11A 0.1647059 0.08474107 3 0.0000000 0.05079701
## div.diff div.class.1 div.class.2 div.class.diff
## 1 -0.01944440 polyform polyform FALSE
## 2 -0.03445107 polyform polyform FALSE
## 3 -0.01111418 polyform polyform FALSE
## 4 -0.03713238 monoform monoform FALSE
## 5 -0.13382321 monoform monoform FALSE
## 6 -0.05079701 monoform monoform FALSEExample (stats)
head(gbm_div$stats)
## group.1
## 1 GABA neuron
## 2 Astrocyte
## 3 Glut neuron
## 4 Microglia
## 5 Oligodendrocyte
## 6 OPC
## group.2
## 1 Microglia,Glut neuron,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular
## 2 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Vascular
## 3 Microglia,GABA neuron,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular
## 4 Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular
## 5 Microglia,Glut neuron,GABA neuron,OPC,Tumor,Astrocyte,Vascular
## 6 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,Tumor,Astrocyte,Vascular
## avgDiv.1 avgDiv.2 wilcox.pval mono.poly.1 mono.poly.2 mcnemar.pval
## 1 0.2285638 0.2404675 2.531951e-06 173/179 156/196 0.004057136
## 2 0.1673591 0.1856375 1.081485e-02 60/29 54/35 0.113846298
## 3 0.2270427 0.2358884 7.343045e-05 123/134 116/141 0.168668619
## 4 0.2144754 0.2169131 3.201307e-01 63/55 59/59 0.342781711
## 5 0.2381169 0.2292375 5.660336e-03 66/79 70/75 0.342781711
## 6 0.2413945 0.2389667 5.864317e-01 127/153 123/157 0.583882421PlotDiversity
PlotDiversity() can be used to visualize isoform diversity across groups of cells. The plot.type parameter can be one of ‘lollipop’ (default), ‘pcoord’ for parallel coordinate, or ‘density’. The lollipop and pcoord plot types are designed to visualize the diversity of a few genes, while the density plots the distribution of isoform diversity of several genes.
Example genes
set.seed(1996)
random_genes <- gbm_div$data |>
dplyr::filter(n.transcripts > 1) |>
dplyr::slice_sample(n = 5) |>
dplyr::pull(gene)
random_genes
## [1] "NAPEPLD" "NPEPPS" "GNG7" "ANKRD10" "PNISR"# Lollipop (default)
p_loll <- PlotDiversity(
gbm,
genes = random_genes,
group.subset = c("GABA neuron", "Glut neuron", "Microglia", "Oligodendrocyte", "OPC",
"Tumor")
)
# Parallel coordinates
p_para <- PlotDiversity(
gbm,
genes = random_genes,
plot.type = "pcoord",
group.subset = c("GABA neuron", "Glut neuron", "Microglia", "Oligodendrocyte", "OPC",
"Tumor")
)
# Density
p_dens <- PlotDiversity(
gbm,
genes = unique(gbm_div$data$gene),
plot.type = "density",
group.subset = c("GABA neuron", "Glut neuron", "Microglia", "Oligodendrocyte", "OPC",
"Tumor"),
nrow = 2
)Patchwork
patchwork::wrap_plots(p_loll, p_para, p_dens, ncol = 1)GetDiversity
GetDiversity() quickly retrieves the isoform diversity data for genes of interest.
# Isoform diversity of gene MT3 for all groups
mt3_div <- GetDiversity(gbm, genes = "MT3")
# Isoform diversity of gene MT3 in OPC cells
mt3_div_opc <- GetDiversity(gbm, genes = "MT3", group.subset = "OPC")Example output
head(mt3_div_opc)
## group gene gene.pct n.transcripts transcript cts prop div
## 1 OPC MT3 0.1516854 2 MT3-201 43 0.641791 0.3448429
## 2 OPC MT3 0.1516854 2 MT3-204 24 0.358209 0.3448429
## div.class
## 1 polyform
## 2 polyformColumn descriptions:
group: The cell group being queried.gene: The gene being queried.gene.pct: Percentage of cells ingroupwith expression of the gene.n.transcripts: Number of associated transcripts for the gene.transcript: The associated transcript.cts: Total counts of the transcript ingroup.prop: The transcript proportion ingroup.div: Isoform diversity of the gene ingroup.
Isoform Expression
Isoform expression is a measure of the abundance of individual transcript isoforms produced by a gene. Measuring isoform expression allows for a higher-resolution view of gene expression activity, revealing the specific isoforms being produced in different cell types, cell states, conditions, etc. Hypatia adopts log-transformation and Wilcoxon-based testing by default.
NormalizeCounts
To compare single-cell isoform expression levels, raw counts must first be normalized with NormalizeCounts(). By default, counts are divided by the total counts per cell, multiplied by a scale factor, and log1p transformed.
gbm <- NormalizeCounts(gbm)The normalized matrix is stored as “logcounts” in the assays slot. Use assayNames() to view available count matrices and assay() to fetch a specific matrix.
RunDEI
After normalization, RunDEI() can be used to compare the isoform expression between two cell groups. Group comparisons can be specified as shown:
# Each cell group vs all other cells
all_dei <- RunDEI(gbm)
# Glut neurons vs all other cells
glut_dei <- RunDEI(gbm, group.1 = "Glut neuron")
# Glut and GABA neurons vs all other cells
neur_dei <- RunDEI(gbm, group.1 = c("Glut neuron", "GABA neuron"))
# Glut vs GABA neurons
glutvsgaba_dei <- RunDEI(gbm, group.1 = "Glut neuron", group.2 = "GABA neuron")RunDEI() outputs a data frame containing results from the group comparisons with the following columns:
group.1&group.2: The two cell groups being compared.gene: The gene associated with the transcript being tested.transcript: The transcript being tested.pct.1: Percentage of cells ingroup.1with expression of the transcript.pct.2: Percentage of cells ingroup.2with expression of the transcript.avgExpr.1: Average expression of the transcript across all cells ingroup.1.avgExpr.2: Average expression of the transcript across all cells ingroup.2.log2FC: The log2 fold change in transcript expression between the two groups (group.1-group.2).pval: P-value from the Wilcoxon rank-sum test.padj: Adjusted p-value, calculated separately for each group comparison (default: Bonferroni).
Example output
head(neur_dei)
## group.1
## 1 Glut neuron,GABA neuron
## 2 Glut neuron,GABA neuron
## 3 Glut neuron,GABA neuron
## 4 Glut neuron,GABA neuron
## 5 Glut neuron,GABA neuron
## 6 Glut neuron,GABA neuron
## group.2 gene
## 1 Microglia,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular MEG3
## 2 Microglia,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular MEG3
## 3 Microglia,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular PKD1
## 4 Microglia,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular MIR124-2HG
## 5 Microglia,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular MEG3
## 6 Microglia,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular RAP1GAP
## transcript pct.1 pct.2 avgExpr.1 avgExpr.2 log2FC
## 1 MEG3-209 0.8392157 0.09940209 3.767626 0.45517617 3.049159
## 2 MEG3-220 0.5509804 0.01121076 2.089258 0.04719830 5.468112
## 3 PKD1-201 0.5774510 0.04932735 2.212113 0.22108323 3.322763
## 4 MIR124-2HG-212 0.4117647 0.01046338 1.471338 0.04539883 5.018330
## 5 MEG3-225 0.3813725 0.01718984 1.391619 0.07286330 4.255428
## 6 RAP1GAP-211 0.3941176 0.02167414 1.418704 0.09420505 3.912625
## pval padj
## 1 0.000000e+00 0.000000e+00
## 2 0.000000e+00 0.000000e+00
## 3 1.401706e-265 2.287584e-262
## 4 6.703508e-240 1.094013e-236
## 5 4.589525e-202 7.490106e-199
## 6 3.770644e-199 6.153691e-196PlotExpression
PlotExpression() can be used to visualize isoform expression. The plot.type setting can be set to ‘violin’ (default), ‘reducedDim’ or ‘heatmap’. Let’s take a look at the top 3 DEIs expressed by the neuronal cells in a violin plot.
Top 3 neuron DEIs
# Get the top 3 DEIs of neurons
neur_top3_dei <- neur_dei |>
dplyr::filter(log2FC >= 0.6 & padj <= 0.05) |>
dplyr::slice_min(padj, n = 3, with_ties = FALSE) |>
dplyr::pull(transcript)
neur_top3_dei
## [1] "MEG3-209" "MEG3-220" "PKD1-201"# Violin plot
PlotExpression(
gbm,
transcripts = neur_top3_dei,
plot.type = "violin"
)For the reducedDim plot, 2D embeddings are required. Recall that umap_1 and umap_2 were included in the cell-level metadata when preparing the input data. However, these embeddings are best placed in the reducedDims slot, native to SCE objects. Afterwards, the specific embeddings must be specified in the dim.use parameter of PlotExpression().
# Method 1: add UMAP embeddings from the prepared colData
reducedDim(gbm, "UMAP") <- colData(gbm)[, c("umap_1", "umap_2")]
# Method 2: add UMAP embeddings to the SCE object directly from a Seurat object
reducedDim(gbm, "UMAP") <- seurat_object@reductions$umap@cell.embeddings[colnames(gbm), ]# reducedDim plot
PlotExpression(
gbm,
transcripts = neur_top3_dei,
plot.type = "reducedDim",
dim.use = "UMAP"
)The heatmap is best used to visualize multiple DEIs. Let’s plot the top 20 DEIs from the Glut vs GABA neurons comparison.
Top 20 Glut vs GABA DEIs
# Get the top 20 Glut vs GABA DEIs
glutvsgaba_top20 <- glutvsgaba_dei |>
dplyr::filter(abs(log2FC) >= 0.6 & padj <= 0.05) |>
dplyr::slice_min(padj, n = 20, with_ties = FALSE) |>
dplyr::arrange(desc(log2FC)) |>
dplyr::pull(transcript)
glutvsgaba_top20
## [1] "MAST3-209" "NPTX1-201" "AGRN-202" "SYT7-201"
## [5] "GRM2-202" "SLC6A7-201" "NECAB1-201" "NIPAL2-201"
## [9] "KCNIP4-IT1-201" "KIAA0930-212" "KIAA0930-201" "IGSF9B-206"
## [13] "GOLGA8A-204" "RAP1GAP-211" "MEG3-220" "ADCYAP1R1-202"
## [17] "DNER-201" "DGKD-201" "NRIP3-201" "SLC6A1-211"# Heatmap
PlotExpression(
gbm,
transcripts = glutvsgaba_top20,
plot.type = "heatmap",
group.subset = c("Glut neuron", "GABA neuron")
)GetExpression
GetExpression() quickly obtains an expression summary for isoforms or genes of interest.
# Expression of MEG3-209 for all groups
meg3_209 <- GetExpression(gbm, transcripts = "MEG3-209")
# Expression of MEG3-209 in neurons
meg3_209_neu <- GetExpression(gbm, transcripts = "MEG3-209", group.subset = c("Glut neuron", "GABA neuron"))
# Expression of all MEG3 isoforms in neurons
meg3_neu <- GetExpression(gbm, genes = "MEG3", group.subset = c("Glut neuron", "GABA neuron"))Example output
head(meg3_neu)
## group gene transcript transcript.pct avgExpr
## 1 GABA neuron MEG3 MEG3-203 0.010204082 0.03797111
## 2 GABA neuron MEG3 MEG3-206 0.028061224 0.09446307
## 3 GABA neuron MEG3 MEG3-209 0.869897959 3.97246360
## 4 GABA neuron MEG3 MEG3-219 0.002551020 0.00815719
## 5 GABA neuron MEG3 MEG3-220 0.436224490 1.58492992
## 6 GABA neuron MEG3 MEG3-222 0.005102041 0.01797501Column descriptions:
group: The cell group being queried.gene: The (associated) gene being queried.transcript: The transcripts being queried.transcript.pct: Percentage of cells ingroupwith expression of the transcript.avgExpr: Average expression of the transcript across all cells ingroup.
To retrieve the expression matrix without summary statistics, users can use combination of assay() with SubsetCells() and SubsetFeatures(). The following example demonstrates how to retrieve the normalized expression matrix for isoforms of the gene MBP in oligodendrocytes:
# Subset the object
oligod <- SubsetCells(gbm, cell_type == "Oligodendrocyte")
oligod_mbp <- SubsetFeatures(oligod, gene_name == "MBP")
# Get the matrix
assay(oligod_mbp, "logcounts")Utilities
Preparing cell-level metadata (extended)
The following shows how the cell-level metadata (gbm_colData from the example) was prepared from a Seurat object from gene expression analysis.
# Get the cell_type ident
gbm_colData <- seurat_object@metadata["cell_type"]
# Get umap_1 and umap_2
gbm_colData <- cbind(gbm_colData, seurat_object@reductions$umap@cell.embeddings)
# Match row name order with count matrix
gbm_colData <- gbm_colData[colnames(gbm_countData), ]Preparing transcript-level metadata (extended)
Below shows how the transcript-level metadata (gbm_rowData from the example) was prepared. The GTF from the IsoQuant output, reference genome GTF, and SQANTI results were used.
# Read in SQANTI classifications
sqanti_class <- readr::read_delim("data/gbm/sqanti/OUT.extended_annotation_classification.txt")
# Read in reference GTF
ref <- rtracklayer::import("data/ref/GRCh38-2024-A.gtf")
ref <- ref |>
as.data.frame() |>
dplyr::filter(type == "transcript") |>
dplyr::select(transcript_id, transcript_name, gene_id, gene_name) |>
dplyr::distinct()
# build rowData
gbm_rowData <- sqanti_class |>
dplyr::select(isoform, structural_category, associated_gene, RTS_stage, within_CAGE_peak, within_polyA_site) |>
dplyr::filter(isoform %in% rownames(gbm_countData)) |>
dplyr::rename("gene_id" = "associated_gene",
"transcript_id" = "isoform") |>
dplyr::left_join(., dplyr::distinct(ref[, c("gene_name", "gene_id")]), by = "gene_id") |>
dplyr::left_join(., dplyr::distinct(ref[, c("transcript_name", "transcript_id")]), by = "transcript_id") |>
dplyr::mutate(transcript_name = ifelse(is.na(transcript_name), transcript_id, transcript_name),
gene_name = ifelse(is.na(gene_name), gene_id, gene_name)) |>
tibble::column_to_rownames(var = "transcript_id")
# Match row name order with count matrix
gbm_rowData <- gbm_rowData[rownames(gbm_countData), ]Data integration
To analyze multiple samples in Hypatia, raw isoform count matrices can be directly merged and used to create the SCE object. The cell- and transcript-level metadata should contain data for all cells and isoforms in the correct orders. A cell metadata column containing the sample origin of each cell would be ideal, as the group.by setting in several Hypatia functions can accept vectors such that group.by = c("cell_type", "sample_id") or group.by = c("cell_type", "condition") would provide the additional stratification when making comparisons and visualizations.
Session Info
sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.0.1
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Hypatia_0.1.0
##
## loaded via a namespace (and not attached):
## [1] tidyr_1.3.1 generics_0.1.4
## [3] SparseArray_1.10.2 lattice_0.22-7
## [5] digest_0.6.39 magrittr_2.0.4
## [7] evaluate_1.0.5 grid_4.5.2
## [9] RColorBrewer_1.1-3 fastmap_1.2.0
## [11] jsonlite_2.0.0 Matrix_1.7-4
## [13] ggnewscale_0.5.2 backports_1.5.0
## [15] purrr_1.2.0 SingleCellExperiment_1.32.0
## [17] scales_1.4.0 abind_1.4-8
## [19] cli_3.6.5 rlang_1.1.6
## [21] XVector_0.50.0 Biobase_2.70.0
## [23] withr_3.0.2 DelayedArray_0.36.0
## [25] yaml_2.3.11 S4Arrays_1.10.0
## [27] tools_4.5.2 checkmate_2.3.3
## [29] dplyr_1.1.4 ggplot2_4.0.1
## [31] matrixTests_0.2.3.1 SummarizedExperiment_1.40.0
## [33] BiocGenerics_0.56.0 vctrs_0.6.5
## [35] R6_2.6.1 matrixStats_1.5.0
## [37] stats4_4.5.2 lifecycle_1.0.4
## [39] Seqinfo_1.0.0 S4Vectors_0.48.0
## [41] htmlwidgets_1.6.4 IRanges_2.44.0
## [43] pkgconfig_2.0.3 pillar_1.11.1
## [45] gtable_0.3.6 glue_1.8.0
## [47] xfun_0.54 tibble_3.3.0
## [49] GenomicRanges_1.62.0 tidyselect_1.2.1
## [51] rstudioapi_0.17.1 MatrixGenerics_1.22.0
## [53] knitr_1.50 farver_2.1.2
## [55] htmltools_0.5.8.1 patchwork_1.3.2
## [57] labeling_0.4.3 rmarkdown_2.30
## [59] compiler_4.5.2 S7_0.2.1