Background

RNA isoforms diversify the coding capacity of the genome and have significant implications in health and disease. Long-read single-cell RNA-sequencing (LR-scRNAseq) enables the investigation of RNA isoforms at cell type resolution. Hypatia is a computational suite in R for data analysis across 3 modules: 1) isoform usage, which identifies differential isoform usage events (DIUs), 2) isoform diversity, which identifies differential isoform diversity events (DIVs), and 3) isoform expression, which identifies differentially expressed isoforms (DEIs).

Check out our paper for more details on Hypatia. If you found this tool useful, please remember to cite us:

[Citation]

Setup

Installation

Hypatia is an R package available through Github:

if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

devtools::install_github("gaolabtools/Hypatia")

The following dependencies will be installed if they are not already present: checkmate, dplyr (>= 1.0.0), ggnewscale, ggplot2, Matrix, patchwork, purrr, S4Vectors, SingleCellExperiment, SummarizedExperiment, tidyr, and matrixTests.

Preparing the input data

Hypatia requires 3 inputs:

Single-cell isoform expression matrix
A data frame or matrix of raw isoform counts. Row names are transcript IDs and column names are cell IDs.
Cell-level metadata
A data frame with at least one column of cell annotations (cell types, clusters) necessary for differential analysis. Row names are cell IDs matching the count matrix.
Transcript-level metadata
A data frame with at least one column containing gene identifiers associated with transcripts in the count matrix. Row names are transcript IDs matching the count matrix.

Tutorial data

For the tutorial, we analyze single-cell long-read RNA-seq data of a human glioblastoma (GBM) tumor. Raw reads were processed using scNanoGPS, in which IsoQuant performed isoform quantification. Gene expression data were analyzed beforehand in Seurat, revealing 8 cell types: astrocytes, oligodendrocyte precursor cells (OPCs), tumor, microglia, vascular, GABAergic neurons (GABA neuron), and glutamatergic neurons (Glut neuron):

The following input data for the tutorial is available (lazy-loaded) after the package is loaded into the environment:

library(Hypatia)

gbm_countData
gbm_colData
gbm_rowData
gbm_gtf

The raw single-cell isoform expression matrix gbm_countData consists of 56,642 isoforms from 5,043 cells.

gbm_countData

str(gbm_countData)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:934559] 77 189 393 406 499 577 719 1340 1394 1566 ...
##   ..@ p       : int [1:5044] 0 256 985 1365 1416 1554 1828 2021 2145 2671 ...
##   ..@ Dim     : int [1:2] 56642 5043
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:56642] "ENST00000008440" "ENST00000040877" "ENST00000054666" "ENST00000060969" ...
##   .. ..$ : chr [1:5043] "GCCACAACATCGGTTC" "TCGGGCATCACCAGTA" "CCGAATTGTAGCTCGC" "CTGCATGGTTGGATAG" ...
##   ..@ x       : num [1:934559] 1 1 1 1 1 1 1 1 1 1 ...
##   ..@ factors : list()

The cell-level metadata gbm_colData contains the columns cell_type, umap_1, and umap_2, which were all extracted from the Seurat object from gene expression analysis. The gene expression-derived umap_1 and umap_2 will be used for visualizations later.

gbm_colData

str(gbm_colData)
## 'data.frame':    5043 obs. of  3 variables:
##  $ cell_type: chr  "Microglia" "Glut neuron" "GABA neuron" "Oligodendrocyte" ...
##  $ umap_1   : num  7.58 -14.63 -12.36 5.86 8.17 ...
##  $ umap_2   : num  1.97 -6.25 5.92 -7.23 7.03 ...

The transcript-level metadata gbm_rowData has the column gene_id that contains the associated genes for each transcript. Additional columns gene_name and transcript_name are included for easy viewing in the downstream analysis. The remaining columns hold various isoform QC metrics from running the SQANTI3 pipeline. We strongly recommend providing these, or similar metrics, to improve downstream isoform filtering.

gbm_rowData

str(gbm_rowData)
## 'data.frame':    56642 obs. of  7 variables:
##  $ structural_category: chr  "full-splice_match" "full-splice_match" "full-splice_match" "full-splice_match" ...
##  $ gene_id            : chr  "ENSG00000010072" "ENSG00000059588" "ENSG00000049245" "ENSG00000052723" ...
##  $ RTS_stage          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ within_CAGE_peak   : logi  TRUE TRUE TRUE TRUE FALSE TRUE ...
##  $ within_polyA_site  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ gene_name          : chr  "SPRTN" "TARBP1" "VAMP3" "SIKE1" ...
##  $ transcript_name    : chr  "SPRTN-201" "TARBP1-201" "VAMP3-201" "SIKE1-201" ...

For the detailed code used to prepare the individual metadata, see Utilities.

1. Create the SCE object

Hypatia utilizes the SingleCellExperiment. object, which provides a standardized framework for storing and organizing data for streamlined single-cell genomics analysis.

CreateSCE

After the input data are prepared, the SCE object can be created using the function CreateSCE().

library(Hypatia)

# Create the SCE
gbm <- CreateSCE(
  countData = gbm_countData,
  colData = gbm_colData,
  rowData = gbm_rowData
  )

gbm
## class: SingleCellExperiment 
## dim: 56642 5043 
## metadata(3): active.gene.id active.group.id active.transcript.id
## assays(1): counts
## rownames(56642): ENST00000008440 ENST00000040877 ...
##   transcript29.KI270742.1.nnic transcript71.KI270742.1.nnic
## rowData names(8): nCell structural_category ... gene_name
##   transcript_name
## colnames(5043): GCCACAACATCGGTTC TCGGGCATCACCAGTA ... AAGTCACCATTAACCT
##   CCCGATAGTGATTCCG
## colData names(7): project nCount ... umap_1 umap_2
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

By default, CreateSCE() searches for a column called “gene_id” in rowData for associated genes. This can be changed in the paramter active.gene.id.
The row name orders in rowData and colData must match the row and column name orders in countData, respectively.

Object slots

Data within SCE objects are stored inside ‘slots’. The main slots that are utilized by Hypatia are assays (count data), colData (cell-level metadata), and rowData (transcript-level metadata). They can be accessed from the object as shown:

# Raw count matrix
assay(gbm, "counts") 

# Cell-level metadata
colData(gbm)

# Transcript-level metadata
rowData(gbm)

Hypatia’s CreateSCE() will also create the following new variables in colData and rowData:

(colData) project: Project name (default: "Project").
(colData) nCount: Number of isoform counts for the cell.
(colData) nTranscript: Number of unique isoforms detected in the cell.
(colData) nGene: Number of unique genes detected in the cell.
(rowData) nCell: Number of cells with detection of the isoform.

colData

colData(gbm) |> head()
## DataFrame with 6 rows and 7 columns
##                      project    nCount nTranscript     nGene       cell_type
##                  <character> <numeric>   <integer> <integer>     <character>
## GCCACAACATCGGTTC     Project       269         256       248       Microglia
## TCGGGCATCACCAGTA     Project       804         729       683     Glut neuron
## CCGAATTGTAGCTCGC     Project       392         380       369     GABA neuron
## CTGCATGGTTGGATAG     Project        53          51        50 Oligodendrocyte
## ATCCAATAGCAGGGTA     Project       157         138       135       Microglia
## GGGATATAGTATGGAA     Project       286         274       264     GABA neuron
##                     umap_1    umap_2
##                  <numeric> <numeric>
## GCCACAACATCGGTTC   7.57581   1.96721
## TCGGGCATCACCAGTA -14.63275  -6.25013
## CCGAATTGTAGCTCGC -12.36459   5.92043
## CTGCATGGTTGGATAG   5.86348  -7.22947
## ATCCAATAGCAGGGTA   8.16642   7.03476
## GGGATATAGTATGGAA -11.20872   5.81996

rowData

rowData(gbm) |> head()
## DataFrame with 6 rows and 8 columns
##                     nCell structural_category         gene_id RTS_stage
##                 <integer>         <character>     <character> <logical>
## ENST00000008440         2   full-splice_match ENSG00000010072     FALSE
## ENST00000040877        72   full-splice_match ENSG00000059588     FALSE
## ENST00000054666        60   full-splice_match ENSG00000049245     FALSE
## ENST00000060969        51   full-splice_match ENSG00000052723     FALSE
## ENST00000164247         0   full-splice_match ENSG00000069424     FALSE
## ENST00000166244         8   full-splice_match ENSG00000070886     FALSE
##                 within_CAGE_peak within_polyA_site   gene_name transcript_name
##                        <logical>         <logical> <character>     <character>
## ENST00000008440             TRUE              TRUE       SPRTN       SPRTN-201
## ENST00000040877             TRUE              TRUE      TARBP1      TARBP1-201
## ENST00000054666             TRUE              TRUE       VAMP3       VAMP3-201
## ENST00000060969             TRUE              TRUE       SIKE1       SIKE1-201
## ENST00000164247            FALSE              TRUE      KCNAB2      KCNAB2-201
## ENST00000166244             TRUE              TRUE       EPHA8       EPHA8-201

Object metadata

The metadata() function can be used to access the metadata of the SCE object, and dictates how the object interacts with Hypatia’s functions. These include important settings such as the default cell groups and the gene/transcript names to report. By default, the only defined setting is active.gene.id, which was set by CreateSCE().

metadata(gbm)
## $active.gene.id
## [1] "gene_id"
## 
## $active.group.id
## [1] ""
## 
## $active.transcript.id
## [1] ""

An empty active.transcript.id indicates that the row names of the count matrix (e.g., Ensembl transcript IDs in the tutorial data) should be used. An empty active.groups.id means that cell identities for grouping (e.g., cell_type) must be specified in downstream functions. The active.groups.id can be set in advance to avoid repeatedly defining the grouping variable later on.

Below, we update the metadata for active gene IDs, transcript IDs, and cell groups.

# Set active.gene.id
gbm <- SetGenes(gbm, "gene_name")

# Set active.transcript.id
gbm <- SetTranscripts(gbm, "transcript_name")

# Set active.group.id
gbm <- SetGroups(gbm, "cell_type")

metadata(gbm)
## $active.gene.id
## [1] "gene_name"
## 
## $active.group.id
## [1] "cell_type"
## 
## $active.transcript.id
## [1] "transcript_name"

gene_name and transcript_name were both provided in the object’s rowData slot. The active cell grouping variable is set to cell_type of the object’s colData slot.

The object metadata can be changed at any point in the workflow. Providing alternative gene and transcript IDs are optional.

2. Quality control

Transcript filtering

To filter isoforms, SubsetTranscripts() can be used to subset the SCE object by the object’s rowData (transcript-level) variables. Isoforms that are not expressed in at least 3 cells are removed according to the variable nCell below.

gbm <- SubsetTranscripts(gbm, subset = nCell >= 3)

Isoforms can also filtered based on the provided transcript-level SQANTI metrics:

gbm <- SubsetTranscripts(
  gbm, 
  subset = ( (structural_category == "full-splice_match" & !RTS_stage) |
             (!RTS_stage & within_CAGE_peak & within_polyA_site) )
  )

Cell filtering

Users can run PlotCellQC() to visualize the distributions of the default variables nTranscript, nGene, and nCount (where nCount is the number of isoform counts per cell).

PlotCellQC(gbm)

Afterwards, SubsetCells() can filter cells using these or other colData (cell-level) variables. Cells without detectable expression in more than 50 isoforms are removed below.

gbm <- SubsetCells(object = gbm, subset = nTranscript >= 50)

SubsetCells() can also be used to subset the object for reasons unrelated to QC. The following are examples of isolating and removing tumor cells from the cell_type annotation.

# Isolate tumor cells
gbm_tumor <- SubsetCells(object = gbm, subset = cell_type == "Tumor")

# Remove tumor cells
gbm_notumor <- SubsetCells(object = gbm, subset = cell_type == "Tumor", invert = TRUE)

Vectors of cell IDs or active transcript IDs can also be used in through the arguments cells and transcripts in their respective subset functions.

3. Differential analysis

Hypatia provides 3 modules for differential isoform analysis:

Isoform Usage: Identifies differential isoform usage (DIU) events, marked by significant differences in isoform ratios of genes across cell groups.
Isoform Diversity: Computes isoform diversity, classifies genes as ‘monoform’ (low diversity) or ‘polyform’ (high diversity), and performs pairwise comparisons of classifications. Genes having diverging classifications across cell groups are considered DIVs.
Isoform Expression: Identifies differentially expressed isoforms (DEIs) by comparing the mean normalized expression of isoforms across groups of cells.

Method	Analysis	Visualization	Data retrieval
Isoform Usage	`RunDIU()`	`PlotUsage()`	`GetUsage()`
Isoform Diversity	`RunDIV()`	`PlotDiversity()`	`GetDiversity()`
Isoform Expression	`NormalizeCounts()`, `RunDEI()`	`PlotExpression()`	`GetExpression()`

Isoform Usage

Isoform usage refers to the relative abundances of transcript isoforms produced by a single gene. The usage of alternative isoforms is reported to vary across cell types, conditions, and developmental stages. Differential isoform usage (DIU) events can reveal regulatory changes, functional diversity, or shifts that are not apparent from gene-level expression alone.

RunDIU

RunDIU() identifies significant differences in isoform proportions per gene. Genes are first filtered for adequate detection and the Chi-square test is subsequently applied to identify DIU events from single-cell isoform expression data (see ?RunDIU() for more details). Group comparisons can be specified using group.1 and group.2 as shown below:

# Each cell group vs all other cells
all_diu <- RunDIU(gbm)

# Glut neurons vs all other cells
glut_diu <- RunDIU(gbm, group.1 = "Glut neuron")

# Glut and GABA neurons vs all other cells
neur_diu <- RunDIU(gbm, group.1 = c("Glut neuron", "GABA neuron"))

# Glut vs GABA neurons
glutvsgaba_diu <- RunDIU(gbm, group.1 = "Glut neuron", group.2 = "GABA neuron")

RunDIU() returns a list of two data frames:

$data contains the summarized isoform usage data used to generate the statistical output:

group.1 & group.2: The two cell groups being compared.
gene: The gene being tested.
gene.pct.1: Percentage of cells in group.1 with expression of the gene.
gene.pct.2: Percentage of cells in group.2 with expression of the gene.
transcript: The associated transcript.
cts.1: Total counts of the transcript across all cells in group.1.
cts.2: Total counts of the transcript across all cells in group.2.
prop.1: Transcript proportion for group.1.
prop.2: Transcript proportion for group.2.
delta: The difference in transcript proportions between groups (group.1 - group.2).

$stats contains results from statistical tests:

group.1 & group.2: The two cell groups being compared.
gene: The gene being tested.
max.delta: The largest absolute difference in transcript proportions between group.1 and group.2.
transcript: Transcript associated with the maximum proportion difference.
pval: P-value from the statistical test specified in method.use (default: Chi-square test).
padj: Adjusted p-value, calculated separately for each group comparison (default: Benjamini-Hochberg).
effect.size: Effect size of the test, measured as Cramer’s V.
approx: Indicates whether the Chi-square approximation is valid ("valid") or potentially unreliable ("warning"), based on whether at least 80% of transcript counts of the contingency table exceed 5.

Example (data)

head(glutvsgaba_diu$data)
##       group.1     group.2  gene gene.pct.1 gene.pct.2               transcript
## 1 Glut neuron GABA neuron  AAK1  0.1751592  0.2091837                 AAK1-208
## 2 Glut neuron GABA neuron  AAK1  0.1751592  0.2091837                 AAK1-211
## 3 Glut neuron GABA neuron ABCC5  0.1735669  0.1734694                ABCC5-204
## 4 Glut neuron GABA neuron ABCC5  0.1735669  0.1734694                ABCC5-205
## 5 Glut neuron GABA neuron ABCC5  0.1735669  0.1734694 transcript38904.chr3.nic
## 6 Glut neuron GABA neuron ABCC5  0.1735669  0.1734694 transcript39489.chr3.nic
##   cts.1 cts.2     prop.1     prop.2        delta
## 1    13     7 0.09848485 0.07291667  0.025568182
## 2   119    89 0.90151515 0.92708333 -0.025568182
## 3    62    40 0.48818898 0.50000000 -0.011811024
## 4    31    19 0.24409449 0.23750000  0.006594488
## 5     5     1 0.03937008 0.01250000  0.026870079
## 6    24    20 0.18897638 0.25000000 -0.061023622

Example (stats)

head(glutvsgaba_diu$stats)
##       group.1     group.2      gene max.prop.diff    transcript         pval
## 1 Glut neuron GABA neuron      MEG3   -0.14658436      MEG3-209 1.007387e-37
## 2 Glut neuron GABA neuron   GOLGA8A   -0.22523062   GOLGA8A-202 1.889232e-08
## 3 Glut neuron GABA neuron    SNAP25   -0.16139657    SNAP25-201 3.765722e-05
## 4 Glut neuron GABA neuron      PCLO   -0.29875887      PCLO-201 1.851101e-04
## 5 Glut neuron GABA neuron CCDC144BP    0.02809296 CCDC144BP-208 5.272233e-04
## 6 Glut neuron GABA neuron     CNTN3    0.16000000     CNTN3-201 1.253123e-03
##           padj effect.size  approx
## 1 8.764269e-35  0.17843568 warning
## 2 8.218158e-06  0.19139313   valid
## 3 1.092059e-02  0.30064122 warning
## 4 4.026144e-02  0.31372925 warning
## 5 9.173685e-02  0.07727502   valid
## 6 1.817028e-01  0.30218983 warning

PlotUsage

After DIU events are identified, the PlotUsage() function can be run to visualize differences in isoform proportions. Available settings for plot.type are ‘stackedbar’, ‘bar’, ‘pie’ and ‘heatmap’.

Let’s look the transcript proportions of the gene MEG3, which contains top DIU events from several comparisons.

# stackedbar (default)
p_stackedbar <- PlotUsage(
  gbm, 
  gene = "MEG3", 
  group.subset = c("GABA neuron", "Glut neuron", "Microglia", "Oligodendrocyte", "OPC",
                   "Tumor"),
  min.tx.prop = 0.10
  )

# heatmap
p_heatmap <- PlotUsage(
  gbm, 
  gene = "MEG3", 
  plot.type = "heatmap", 
  group.subset = c("GABA neuron", "Glut neuron", "Microglia", "Oligodendrocyte", "OPC",
                   "Tumor"),
  min.tx.prop = 0.10,
  show.prop = TRUE
  )

# pie 
p_pie <- PlotUsage(
  gbm, 
  gene = "MEG3", 
  plot.type = "pie",
  group.subset = c("GABA neuron", "Glut neuron", "Microglia", "Oligodendrocyte", "OPC",
                   "Tumor"),
  min.tx.prop = 0.10,
  nrow = 3
  )

# bar 
p_bar <- PlotUsage(
  gbm, 
  gene = "MEG3", 
  plot.type = "bar",
  group.subset = c("GABA neuron", "Glut neuron", "Microglia", "Oligodendrocyte", "OPC",
                   "Tumor"),
  min.tx.prop = 0.10,
  nrow = 3
  )

Patchwork

patchwork::wrap_plots(p_stackedbar, p_heatmap, p_pie, p_bar, ncol = 2)

For visualization purposes, transcripts with less than 10% proportion were collapsed into ‘Other’ by setting the parameter min.tx.prop, and the smaller cell populations (astrocytes, vascular) were not shown.

GetUsage

GetUsage() quickly obtains isoform counts and relative abundance from genes of interest.

# Usage of MEG3 isoforms for all groups
meg3_usage <- GetUsage(gbm, genes = "MEG3")

# Usage of MEG3 isoforms in Glut and GABA neurons only
meg3_usage_neu <- GetUsage(gbm, genes = "MEG3", group.subset = c("Glut neuron", "GABA neuron"))

Example output

head(meg3_usage_neu)
##         group gene  gene.pct transcript  cts         prop
## 1 GABA neuron MEG3 0.9438776   MEG3-203    4 0.0019753086
## 2 GABA neuron MEG3 0.9438776   MEG3-206   13 0.0064197531
## 3 GABA neuron MEG3 0.9438776   MEG3-209 1266 0.6251851852
## 4 GABA neuron MEG3 0.9438776   MEG3-219    1 0.0004938272
## 5 GABA neuron MEG3 0.9438776   MEG3-220  247 0.1219753086
## 6 GABA neuron MEG3 0.9438776   MEG3-222    2 0.0009876543

Column descriptions:

group: The group being queried.
gene: The gene being queried.
gene.pct: Percentage of cells in group with expression of the gene.
transcript: The associated transcript.
cts: Total counts of the transcript across cells in group.
prop: Transcript proportion.

Isoform Diversity

Isoform diversity is a measure of the heterogeneity of transcript isoforms produced by a gene. It provides a useful way to quantify the level of isoform complexity within a gene and to compare how this complexity varies across different cell groups or conditions.

RunDIV

RunDIV() computes isoform diversity at the gene level and performs comparisons. By default, the Tsallis entropy is calculated. Genes are then classified as monoform or polyform and classifications are compared using the pairwise McNemar’s test across cell groups, and the pairwise Wilcoxon test is used to compare isoform diversity values. Refer to the help page (?RunDIV) for more details on parameters. The usage is similar to Hypatia’s other ‘Run’ functions:

# Each cell group vs all other cells
gbm_div <- RunDIV(gbm)

# Glut neurons vs all other cells
glut_div <- RunDIV(gbm, group.1 = "Glut neuron")

# Glut and GABA neurons vs all other cells
neur_div <- RunDIV(gbm, group.1 = c("Glut neuron", "GABA neuron"))

# Glut vs GABA neurons
glutvsgaba_div <- RunDIV(gbm, group.1 = "Glut neuron", group.2 = "GABA neuron")

RunDIV() returns a list of two data frames:

$data contains the summarized isoform diversity data used to generate the statistical output.

group.1 & group.2: The two cell groups being compared.
gene: The gene for which the isoform diversity is computed.
gene.pct.1: Percentage of cells in group.1 with expression of the gene.
gene.pct.2: Percentage of cells in group.2 with expression of the gene.
n.transcripts: Number of transcripts associated with the gene.
div.1: Isoform diversity of the gene for cells in group.1.
div.2: Isoform diversity of the gene for cells in group.2.
div.diff: The difference in isoform diversity between groups (group.1 - group.2).
div.class.1: Isoform diversity classification of the gene for group.1.
div.class.2: Isoform diversity classification of the gene for group.2.
div.class.diff: A logical indicating difference between isoform diversity classifications between group.1 and group.2.

$stats contains results from the statistical tests.

group.1 & group.2: The two cell groups being compared.
avgDiv.1: Average isoform diversity of genes from cells in group.1.
avgDiv.2: Average isoform diversity of genes from cells in group.2.
wilcox.pval: P-value from the paired Wilcoxon rank sum test comparing isoform diversity across the two groups.
mono.poly.1: Number of genes classified monoform to polyform from cells in group.1.
mono.poly.2: Number of genes classified monoform to polyform from cells in group.2.
mcnemar.pval: P-value from the pairwise McNemar test comparing isoform diversity classifications across the two groups.

Example (data)

head(gbm_div$data)
##     group.1
## 1 Astrocyte
## 2 Astrocyte
## 3 Astrocyte
## 4 Astrocyte
## 5 Astrocyte
## 6 Astrocyte
##                                                                group.2
## 1 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Vascular
## 2 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Vascular
## 3 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Vascular
## 4 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Vascular
## 5 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Vascular
## 6 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Vascular
##        gene gene.pct.1 gene.pct.2 n.transcripts     div.1      div.2
## 1     RSRP1  0.1764706 0.11548048             7 0.4110000 0.43044440
## 2     MACF1  0.2941176 0.10329549            13 0.4556327 0.49008379
## 3  MIR9-1HG  0.4588235 0.10855719            19 0.3803472 0.39146140
## 4      GLUL  0.5411765 0.08418721             4 0.1499853 0.18711772
## 5 NOTCH2NLA  0.4588235 0.06729438             3 0.1065089 0.24033209
## 6   ZC3H11A  0.1647059 0.08474107             3 0.0000000 0.05079701
##      div.diff div.class.1 div.class.2 div.class.diff
## 1 -0.01944440    polyform    polyform          FALSE
## 2 -0.03445107    polyform    polyform          FALSE
## 3 -0.01111418    polyform    polyform          FALSE
## 4 -0.03713238    monoform    monoform          FALSE
## 5 -0.13382321    monoform    monoform          FALSE
## 6 -0.05079701    monoform    monoform          FALSE

Example (stats)

head(gbm_div$stats)
##           group.1
## 1     GABA neuron
## 2       Astrocyte
## 3     Glut neuron
## 4       Microglia
## 5 Oligodendrocyte
## 6             OPC
##                                                                      group.2
## 1         Microglia,Glut neuron,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular
## 2       Microglia,Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Vascular
## 3         Microglia,GABA neuron,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular
## 4       Glut neuron,GABA neuron,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular
## 5             Microglia,Glut neuron,GABA neuron,OPC,Tumor,Astrocyte,Vascular
## 6 Microglia,Glut neuron,GABA neuron,Oligodendrocyte,Tumor,Astrocyte,Vascular
##    avgDiv.1  avgDiv.2  wilcox.pval mono.poly.1 mono.poly.2 mcnemar.pval
## 1 0.2285638 0.2404675 2.531951e-06     173/179     156/196  0.004057136
## 2 0.1673591 0.1856375 1.081485e-02       60/29       54/35  0.113846298
## 3 0.2270427 0.2358884 7.343045e-05     123/134     116/141  0.168668619
## 4 0.2144754 0.2169131 3.201307e-01       63/55       59/59  0.342781711
## 5 0.2381169 0.2292375 5.660336e-03       66/79       70/75  0.342781711
## 6 0.2413945 0.2389667 5.864317e-01     127/153     123/157  0.583882421

PlotDiversity

PlotDiversity() can be used to visualize isoform diversity across groups of cells. The plot.type parameter can be one of ‘lollipop’ (default), ‘pcoord’ for parallel coordinate, or ‘density’. The lollipop and pcoord plot types are designed to visualize the diversity of a few genes, while the density plots the distribution of isoform diversity of several genes.

Example genes

set.seed(1996)

random_genes <- gbm_div$data |>
  dplyr::filter(n.transcripts > 1) |>
  dplyr::slice_sample(n = 5) |>
  dplyr::pull(gene)

random_genes
## [1] "NAPEPLD" "NPEPPS"  "GNG7"    "ANKRD10" "PNISR"

# Lollipop (default)
p_loll <- PlotDiversity(
  gbm,
  genes = random_genes,
  group.subset = c("GABA neuron", "Glut neuron", "Microglia", "Oligodendrocyte", "OPC",
                   "Tumor")
  )

# Parallel coordinates
p_para <- PlotDiversity(
  gbm,
  genes = random_genes,
  plot.type = "pcoord",
  group.subset = c("GABA neuron", "Glut neuron", "Microglia", "Oligodendrocyte", "OPC",
                   "Tumor")
  )

# Density 
p_dens <- PlotDiversity(
  gbm, 
  genes = unique(gbm_div$data$gene),
  plot.type = "density",
  group.subset = c("GABA neuron", "Glut neuron", "Microglia", "Oligodendrocyte", "OPC",
                   "Tumor"),
  nrow = 2
)

Patchwork

patchwork::wrap_plots(p_loll, p_para, p_dens, ncol = 1)

GetDiversity

GetDiversity() quickly retrieves the isoform diversity data for genes of interest.

# Isoform diversity of gene MT3 for all groups
mt3_div <- GetDiversity(gbm, genes = "MT3")

# Isoform diversity of gene MT3 in OPC cells
mt3_div_opc <- GetDiversity(gbm, genes = "MT3", group.subset = "OPC")

Example output

head(mt3_div_opc)
##   group gene  gene.pct n.transcripts transcript cts     prop       div
## 1   OPC  MT3 0.1516854             2    MT3-201  43 0.641791 0.3448429
## 2   OPC  MT3 0.1516854             2    MT3-204  24 0.358209 0.3448429
##   div.class
## 1  polyform
## 2  polyform

Column descriptions:

group: The cell group being queried.
gene: The gene being queried.
gene.pct: Percentage of cells in group with expression of the gene.
n.transcripts: Number of associated transcripts for the gene.
transcript: The associated transcript.
cts: Total counts of the transcript in group.
prop: The transcript proportion in group.
div: Isoform diversity of the gene in group.

Isoform Expression

Isoform expression is a measure of the abundance of individual transcript isoforms produced by a gene. Measuring isoform expression allows for a higher-resolution view of gene expression activity, revealing the specific isoforms being produced in different cell types, cell states, conditions, etc. Hypatia adopts log-transformation and Wilcoxon-based testing by default.

NormalizeCounts

To compare single-cell isoform expression levels, raw counts must first be normalized with NormalizeCounts(). By default, counts are divided by the total counts per cell, multiplied by a scale factor, and log1p transformed.

gbm <- NormalizeCounts(gbm)

The normalized matrix is stored as “logcounts” in the assays slot. Use assayNames() to view available count matrices and assay() to fetch a specific matrix.

RunDEI

After normalization, RunDEI() can be used to compare the isoform expression between two cell groups. Group comparisons can be specified as shown:

# Each cell group vs all other cells
all_dei <- RunDEI(gbm)

# Glut neurons vs all other cells
glut_dei <- RunDEI(gbm, group.1 = "Glut neuron")

# Glut and GABA neurons vs all other cells
neur_dei <- RunDEI(gbm, group.1 = c("Glut neuron", "GABA neuron"))

# Glut vs GABA neurons
glutvsgaba_dei <- RunDEI(gbm, group.1 = "Glut neuron", group.2 = "GABA neuron")

RunDEI() outputs a data frame containing results from the group comparisons with the following columns:

group.1 & group.2: The two cell groups being compared.
gene: The gene associated with the transcript being tested.
transcript: The transcript being tested.
pct.1: Percentage of cells in group.1 with expression of the transcript.
pct.2: Percentage of cells in group.2 with expression of the transcript.
avgExpr.1: Average expression of the transcript across all cells in group.1.
avgExpr.2: Average expression of the transcript across all cells in group.2.
log2FC: The log2 fold change in transcript expression between the two groups (group.1 - group.2).
pval: P-value from the Wilcoxon rank-sum test.
padj: Adjusted p-value, calculated separately for each group comparison (default: Bonferroni).

Example output

head(neur_dei)
##                   group.1
## 1 Glut neuron,GABA neuron
## 2 Glut neuron,GABA neuron
## 3 Glut neuron,GABA neuron
## 4 Glut neuron,GABA neuron
## 5 Glut neuron,GABA neuron
## 6 Glut neuron,GABA neuron
##                                                  group.2       gene
## 1 Microglia,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular       MEG3
## 2 Microglia,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular       MEG3
## 3 Microglia,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular       PKD1
## 4 Microglia,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular MIR124-2HG
## 5 Microglia,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular       MEG3
## 6 Microglia,Oligodendrocyte,OPC,Tumor,Astrocyte,Vascular    RAP1GAP
##       transcript     pct.1      pct.2 avgExpr.1  avgExpr.2   log2FC
## 1       MEG3-209 0.8392157 0.09940209  3.767626 0.45517617 3.049159
## 2       MEG3-220 0.5509804 0.01121076  2.089258 0.04719830 5.468112
## 3       PKD1-201 0.5774510 0.04932735  2.212113 0.22108323 3.322763
## 4 MIR124-2HG-212 0.4117647 0.01046338  1.471338 0.04539883 5.018330
## 5       MEG3-225 0.3813725 0.01718984  1.391619 0.07286330 4.255428
## 6    RAP1GAP-211 0.3941176 0.02167414  1.418704 0.09420505 3.912625
##            pval          padj
## 1  0.000000e+00  0.000000e+00
## 2  0.000000e+00  0.000000e+00
## 3 1.401706e-265 2.287584e-262
## 4 6.703508e-240 1.094013e-236
## 5 4.589525e-202 7.490106e-199
## 6 3.770644e-199 6.153691e-196

PlotExpression

PlotExpression() can be used to visualize isoform expression. The plot.type setting can be set to ‘violin’ (default), ‘reducedDim’ or ‘heatmap’. Let’s take a look at the top 3 DEIs expressed by the neuronal cells in a violin plot.

Top 3 neuron DEIs

# Get the top 3 DEIs of neurons
neur_top3_dei <- neur_dei |> 
  dplyr::filter(log2FC >= 0.6 & padj <= 0.05) |>
  dplyr::slice_min(padj, n = 3, with_ties = FALSE) |>
  dplyr::pull(transcript)

neur_top3_dei
## [1] "MEG3-209" "MEG3-220" "PKD1-201"

# Violin plot
PlotExpression(
  gbm, 
  transcripts = neur_top3_dei,
  plot.type = "violin"
  )

For the reducedDim plot, 2D embeddings are required. Recall that umap_1 and umap_2 were included in the cell-level metadata when preparing the input data. However, these embeddings are best placed in the reducedDims slot, native to SCE objects. Afterwards, the specific embeddings must be specified in the dim.use parameter of PlotExpression().

# Method 1: add UMAP embeddings from the prepared colData
reducedDim(gbm, "UMAP") <- colData(gbm)[, c("umap_1", "umap_2")]

# Method 2: add UMAP embeddings to the SCE object directly from a Seurat object
reducedDim(gbm, "UMAP") <- seurat_object@reductions$umap@cell.embeddings[colnames(gbm), ]

# reducedDim plot
PlotExpression(
  gbm, 
  transcripts = neur_top3_dei,
  plot.type = "reducedDim", 
  dim.use = "UMAP"
  )

The heatmap is best used to visualize multiple DEIs. Let’s plot the top 20 DEIs from the Glut vs GABA neurons comparison.

Top 20 Glut vs GABA DEIs

# Get the top 20 Glut vs GABA DEIs 
glutvsgaba_top20 <- glutvsgaba_dei |>
  dplyr::filter(abs(log2FC) >= 0.6 & padj <= 0.05) |>
  dplyr::slice_min(padj, n = 20, with_ties = FALSE) |>
  dplyr::arrange(desc(log2FC)) |>
  dplyr::pull(transcript)

glutvsgaba_top20
##  [1] "MAST3-209"      "NPTX1-201"      "AGRN-202"       "SYT7-201"      
##  [5] "GRM2-202"       "SLC6A7-201"     "NECAB1-201"     "NIPAL2-201"    
##  [9] "KCNIP4-IT1-201" "KIAA0930-212"   "KIAA0930-201"   "IGSF9B-206"    
## [13] "GOLGA8A-204"    "RAP1GAP-211"    "MEG3-220"       "ADCYAP1R1-202" 
## [17] "DNER-201"       "DGKD-201"       "NRIP3-201"      "SLC6A1-211"

# Heatmap
PlotExpression(
  gbm, 
  transcripts = glutvsgaba_top20,
  plot.type = "heatmap",
  group.subset = c("Glut neuron", "GABA neuron")
  )

GetExpression

GetExpression() quickly obtains an expression summary for isoforms or genes of interest.

# Expression of MEG3-209 for all groups
meg3_209 <- GetExpression(gbm, transcripts = "MEG3-209") 

# Expression of MEG3-209 in neurons
meg3_209_neu <- GetExpression(gbm, transcripts = "MEG3-209", group.subset = c("Glut neuron", "GABA neuron"))

# Expression of all MEG3 isoforms in neurons
meg3_neu <- GetExpression(gbm, genes = "MEG3", group.subset = c("Glut neuron", "GABA neuron"))

Example output

head(meg3_neu)
##         group gene transcript transcript.pct    avgExpr
## 1 GABA neuron MEG3   MEG3-203    0.010204082 0.03797111
## 2 GABA neuron MEG3   MEG3-206    0.028061224 0.09446307
## 3 GABA neuron MEG3   MEG3-209    0.869897959 3.97246360
## 4 GABA neuron MEG3   MEG3-219    0.002551020 0.00815719
## 5 GABA neuron MEG3   MEG3-220    0.436224490 1.58492992
## 6 GABA neuron MEG3   MEG3-222    0.005102041 0.01797501

Column descriptions:

group: The cell group being queried.
gene: The (associated) gene being queried.
transcript: The transcripts being queried.
transcript.pct: Percentage of cells in group with expression of the transcript.
avgExpr: Average expression of the transcript across all cells in group.

To retrieve the expression matrix without summary statistics, users can use combination of assay() with SubsetCells() and SubsetFeatures(). The following example demonstrates how to retrieve the normalized expression matrix for isoforms of the gene MBP in oligodendrocytes:

# Subset the object
oligod <- SubsetCells(gbm, cell_type == "Oligodendrocyte")
oligod_mbp <- SubsetFeatures(oligod, gene_name == "MBP")

# Get the matrix
assay(oligod_mbp, "logcounts")

Utilities

Preparing cell-level metadata (extended)

The following shows how the cell-level metadata (gbm_colData from the example) was prepared from a Seurat object from gene expression analysis.

# Get the cell_type ident
gbm_colData <- seurat_object@metadata["cell_type"]

# Get umap_1 and umap_2 
gbm_colData <- cbind(gbm_colData, seurat_object@reductions$umap@cell.embeddings)

# Match row name order with count matrix
gbm_colData <- gbm_colData[colnames(gbm_countData), ]

Preparing transcript-level metadata (extended)

Below shows how the transcript-level metadata (gbm_rowData from the example) was prepared. The GTF from the IsoQuant output, reference genome GTF, and SQANTI results were used.

# Read in SQANTI classifications
sqanti_class <- readr::read_delim("data/gbm/sqanti/OUT.extended_annotation_classification.txt")

# Read in reference GTF
ref <- rtracklayer::import("data/ref/GRCh38-2024-A.gtf")
ref <- ref |> 
  as.data.frame() |>
  dplyr::filter(type == "transcript") |>
  dplyr::select(transcript_id, transcript_name, gene_id, gene_name) |>
  dplyr::distinct()

# build rowData
gbm_rowData <- sqanti_class |>
  dplyr::select(isoform, structural_category, associated_gene, RTS_stage, within_CAGE_peak, within_polyA_site) |>
  dplyr::filter(isoform %in% rownames(gbm_countData)) |>
  dplyr::rename("gene_id" = "associated_gene",
                "transcript_id" = "isoform") |>
  dplyr::left_join(., dplyr::distinct(ref[, c("gene_name", "gene_id")]), by = "gene_id") |>
  dplyr::left_join(., dplyr::distinct(ref[, c("transcript_name", "transcript_id")]), by = "transcript_id") |>
  dplyr::mutate(transcript_name = ifelse(is.na(transcript_name), transcript_id, transcript_name),
                gene_name = ifelse(is.na(gene_name), gene_id, gene_name)) |>
  tibble::column_to_rownames(var = "transcript_id")

# Match row name order with count matrix
gbm_rowData <- gbm_rowData[rownames(gbm_countData), ]

Data integration

To analyze multiple samples in Hypatia, raw isoform count matrices can be directly merged and used to create the SCE object. The cell- and transcript-level metadata should contain data for all cells and isoforms in the correct orders. A cell metadata column containing the sample origin of each cell would be ideal, as the group.by setting in several Hypatia functions can accept vectors such that group.by = c("cell_type", "sample_id") or group.by = c("cell_type", "condition") would provide the additional stratification when making comparisons and visualizations.

Session Info

sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.0.1
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Hypatia_0.1.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyr_1.3.1                 generics_0.1.4             
##  [3] SparseArray_1.10.2          lattice_0.22-7             
##  [5] digest_0.6.39               magrittr_2.0.4             
##  [7] evaluate_1.0.5              grid_4.5.2                 
##  [9] RColorBrewer_1.1-3          fastmap_1.2.0              
## [11] jsonlite_2.0.0              Matrix_1.7-4               
## [13] ggnewscale_0.5.2            backports_1.5.0            
## [15] purrr_1.2.0                 SingleCellExperiment_1.32.0
## [17] scales_1.4.0                abind_1.4-8                
## [19] cli_3.6.5                   rlang_1.1.6                
## [21] XVector_0.50.0              Biobase_2.70.0             
## [23] withr_3.0.2                 DelayedArray_0.36.0        
## [25] yaml_2.3.11                 S4Arrays_1.10.0            
## [27] tools_4.5.2                 checkmate_2.3.3            
## [29] dplyr_1.1.4                 ggplot2_4.0.1              
## [31] matrixTests_0.2.3.1         SummarizedExperiment_1.40.0
## [33] BiocGenerics_0.56.0         vctrs_0.6.5                
## [35] R6_2.6.1                    matrixStats_1.5.0          
## [37] stats4_4.5.2                lifecycle_1.0.4            
## [39] Seqinfo_1.0.0               S4Vectors_0.48.0           
## [41] htmlwidgets_1.6.4           IRanges_2.44.0             
## [43] pkgconfig_2.0.3             pillar_1.11.1              
## [45] gtable_0.3.6                glue_1.8.0                 
## [47] xfun_0.54                   tibble_3.3.0               
## [49] GenomicRanges_1.62.0        tidyselect_1.2.1           
## [51] rstudioapi_0.17.1           MatrixGenerics_1.22.0      
## [53] knitr_1.50                  farver_2.1.2               
## [55] htmltools_0.5.8.1           patchwork_1.3.2            
## [57] labeling_0.4.3              rmarkdown_2.30             
## [59] compiler_4.5.2              S7_0.2.1