2 Biological Context

Where does the sequencing data come from? Once we have analysed our sequencing data, how does it translate back to cells, tissues, and organisms?

2.1 What is Single Cell RNA Seq (scRNA-seq)?

To understand what single cell RNA Seq is, we can compare and contrast with Bulk RNA Seq. Let’s walk through the bulk RNA Seq workflow below:

Bulk RNASeq

  • An estimation of the average expression level for each gene within a population of cells
  • Population-level resolution
  • RNA from every cell in the sample is mixed together for sequencing
  • Expression levels from every cell type are lumped together

Now let’s walk through the Single Cell RNA Seq workflow:

Single Cell RNASeq

First Isolate Cells

  • Emphasizes the differences and variability between individual cells
  • Individual-level resolution
  • RNA from only one cell is sequenced
  • Expression levels from individual cell types are kept separate

2.1.1 How is a single tiny cell isolated from other cells in a sample?

Cell Isolation Example

There are many technologies available commercially to perform single cell sorting.

Cell Sorting Tools

A common protocol comes from a company called 10x Genomics. Below is an overview of the cell sorting options available to researchers just to give you a quick impression of the variety of approaches.

Cell Sorting Tools

2.1.2 Once the single cells are isolated, how do we keep track of them? How do we know which cells belong to which sequences?

Droplets and UMIs

Cell Sorting Tools

What does the UMI do?

UMIs are oligonucleotides consisting of random bases.

  • There are far more UMIs in the library than target DNA molecules, so each target DNA molecule gets tagged with a unique UMI
  • Target DNA molecule + UMI gets carried forward to amplification
  • Reads can later be grouped by UMI to generate a consensus sequence from all of the reads that contain one specific UMI (see the sketch below)
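To make the UMI grouping concrete, here is a minimal Python sketch (the reads and UMI sequences are invented purely for illustration): reads sharing a UMI are collapsed into a consensus sequence by a per-position majority vote, which irons out isolated sequencing errors.

```python
from collections import Counter, defaultdict

# Toy reads: (umi, sequence) pairs. All values are made up for illustration.
reads = [
    ("AAGT", "ACGTACGT"),
    ("AAGT", "ACGTACGA"),  # sequencing error in the last base
    ("AAGT", "ACGTACGT"),
    ("CCTG", "TTGACCGA"),
]

# Group reads by their UMI.
by_umi = defaultdict(list)
for umi, seq in reads:
    by_umi[umi].append(seq)

# Collapse each UMI group into a consensus sequence by per-position majority vote.
def consensus(seqs):
    return "".join(Counter(bases).most_common(1)[0][0] for bases in zip(*seqs))

for umi, seqs in by_umi.items():
    print(umi, consensus(seqs), f"({len(seqs)} reads)")
```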

https://www.youtube.com/watch?v=gMzzXFMyM5g

UMIs

What does the Barcode do?

  • The barcode marks which cell a read came from, so reads can be assigned back to individual cells.

What does the Poly dT do?

  • The poly(dT) sequence base-pairs with the poly(A) tail of mRNA, capturing the mRNA and priming reverse transcription.

2.2 Analyzing Sequencing Data

Biological Context for interpreting scRNA-Seq Data and Data Analysis

This is an introduction to concepts. The purpose is to familiarize you with the steps in an scRNA-Seq analysis workflow. Above all, I hope you come away understanding what each data processing step is for.

We are going to approach our learning using a journal-club model. I see that all of you are researchers and are likely very familiar with academic research publications and journal clubs. We will dissect a review paper from 2019 published in Molecular Systems Biology, titled “Current best practices in single cell RNA-seq analysis: a tutorial.”

This paper is paired with a tutorial freely available on GitHub. It is not the same tutorial that we will cover in the course, so it provides an opportunity for extra practice as homework during the week of the workshop or afterward.

Luecken, M. D., & Theis, F. J. (2019). Current best practices in single‐cell RNA‐seq analysis: a tutorial. Molecular Systems Biology, 15(6), e8746.

https://www.github.com/theislab/single-cell-tutorial.

2.2.1 Objectives:

1.) scRNA-Seq Workflow:

  • To be able to recognize, select, and design an scRNA-Seq workflow

2.) Pre-Processing and Visualization:

3.) Quality Control:

  • To distinguish between count depth per cell, genes detected per cell, count depth distribution, and mitochondrial read fractions.

  • To be able to interpret quality control metrics

4.) Normalization

5.) Data correction and integration

  • Regressing out biological effects

  • Regressing out technical effects

  • Batch effects and data integration

  • Expression recovery

6.) Feature selection, dimensionality reduction and visualization

  • Feature Selection

  • Dimensionality reduction

  • Visualization

7.) Stages of Pre-Processed data

  • Five stages of data processing

  • Three pre-processing layers

  • Downstream analysis

8.) Clustering analysis

  • Clustering

  • Cluster Annotation

  • Compositional Analysis

9.) Trajectory analysis

  • Trajectory inference

  • Gene expression dynamics

  • Metastable states

10.) Cell Level Analysis unification

11.) Gene-level analysis

  • Differential Expression testing

  • Gene set analysis

  • Gene regulatory networks

Keep an eye out for references to the following analysis tools, because you will be using them later on in the workshop:

Cell Ranger

Seurat

2.2.2 scRNA-Seq Workflow:

To be able to recognize, select, and design an scRNA-Seq workflow to answer your own question.

Analysis Workflow

Questions:

1.) To get from a cluster to an annotated cluster, what do we need to match up to the cluster data? See Figure 1.

Answer: Marker Identifiers

2.) The barcode details listed below indicate that a cell has a broken membrane. Why?

  • Barcodes with a low count depth

  • few detected genes

  • high fraction of mitochondrial counts

Answer: When the cell membrane is broken, cytoplasmic mRNA leaks out of the cell, while the mRNA enclosed in the mitochondria is retained, so mitochondrial transcripts dominate the remaining counts.

Keep in mind that there are always other implications to consider. For example: Cells involved in respiration ALSO have a high mitochondrial count.

2.2.3 Processing and Visualization

1.) Raw Data

2.) Processed data

  • molecular counts (count matrices)

  • read counts (read matrices)

  • UMIs

  • Cell Ranger

Read processing steps that turn raw data into count matrices:

  • Quality Control (QC)
  • Assigning reads to cellular barcodes (demultiplexing); a toy sketch follows below
  • Assigning reads to mRNA molecules of origin (demultiplexing)
  • Genome alignment
  • Quantification
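As a rough illustration of demultiplexing, the toy sketch below pulls the cell barcode and UMI out of read 1, assuming a 10x Chromium v3-style read layout (a 16 bp cell barcode followed by a 12 bp UMI); the sequences and barcode whitelist are invented for the example.

```python
# Demultiplexing sketch: extract the cell barcode and UMI from read 1,
# assuming a 10x Chromium v3-style layout (16 bp barcode + 12 bp UMI).
BARCODE_LEN = 16
UMI_LEN = 12

def split_read1(seq):
    barcode = seq[:BARCODE_LEN]
    umi = seq[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    return barcode, umi

# Toy whitelist of known valid cell barcodes.
whitelist = {"AAACCCAAGAAACACT"}

read1 = "AAACCCAAGAAACACT" + "ACGTACGTACGT" + "TTTTTTTT"  # barcode + UMI + poly(dT)
barcode, umi = split_read1(read1)
if barcode in whitelist:
    print("cell:", barcode, "molecule:", umi)
```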

3.) Doublets

  • An error whereby one barcode ends up tagging multiple cells instead of one
  • e.g., two cells captured in the same droplet

4.) Empty Droplet/Well

  • An error where a barcode tags no cell at all
  • i.e., zero cells in the droplet or well

2.2.4 Quality Control:

To distinguish between count depth per cell, genes detected per cell, count depth distribution, and mitochondrial read fractions. See Figure 2.

We will be regularly referring to “covariates” so it is important to keep in mind what a covariate is.

With respect to quality control in scRNA-seq, covariates are:

  • counts per barcode (count depth)
  • genes per barcode
  • fractions of counts from mitochondrial genes

Examine all three QC covariates jointly and cross-reference the results, because the covariates depend on one another (a sketch of computing them follows below).
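Here is a minimal NumPy sketch of how those three covariates can be computed from a toy count matrix (gene names and counts are invented; in practice tools such as Seurat report these metrics for you):

```python
import numpy as np

# Toy count matrix: rows = cells (barcodes), columns = genes; values are UMI counts.
# "MT-" marks mitochondrial genes, following human gene annotation.
genes = np.array(["GENE1", "GENE2", "MT-ND1", "MT-CO1"])
counts = np.array([
    [120,  35, 10,  5],
    [  2,   0,  8,  6],   # low depth, few genes, high mito fraction -> suspect barcode
    [300, 180, 20, 15],
])

count_depth = counts.sum(axis=1)            # counts per barcode (count depth)
genes_per_cell = (counts > 0).sum(axis=1)   # genes detected per barcode
is_mito = np.char.startswith(genes, "MT-")
mito_fraction = counts[:, is_mito].sum(axis=1) / count_depth

for i in range(counts.shape[0]):
    print(f"cell {i}: depth={count_depth[i]}, genes={genes_per_cell[i]}, "
          f"mito={mito_fraction[i]:.2f}")
```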

Questions:

What about barcodes that have

  • unexpectedly high counts
  • large number of detected genes?

What might that indicate?

  • Answer: They could be doublets
  • A doublet is where multiple cells are captured together in one microfluidic droplet

What might a quiescent cell population’s barcode details look like?

  • Answer:
  • low counts
  • low genes

What might the QC covariates show for cells that are larger in size?

  • Answer:
  • high count depth

Should thresholds be set permissively?

  • Answer: yes
  • as permissively as possible

Why? - Don’t want to filter out viable cell populations.

What if you don’t filter?

  • At a minimum, barcodes with the lowest count depth and the fewest genes per barcode should be considered non-viable (see the filtering sketch below)
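Here is a sketch of permissive filtering on those covariates. The thresholds are placeholders only; real cutoffs come from inspecting the covariate distributions of your own dataset.

```python
import numpy as np

# Placeholder thresholds -- choose real values by looking at your QC plots.
MIN_COUNTS = 500
MIN_GENES = 200
MAX_MITO_FRACTION = 0.2

def qc_mask(count_depth, genes_per_cell, mito_fraction):
    """Return True for barcodes that pass all three (permissive) filters."""
    return (
        (count_depth >= MIN_COUNTS)
        & (genes_per_cell >= MIN_GENES)
        & (mito_fraction <= MAX_MITO_FRACTION)
    )

# Toy covariate values for three barcodes.
count_depth = np.array([5200, 310, 12000])
genes_per_cell = np.array([1800, 150, 3200])
mito_fraction = np.array([0.05, 0.45, 0.08])
print(qc_mask(count_depth, genes_per_cell, mito_fraction))  # [ True False  True]
```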

Quality Control

2.2.5 Normalization

With normalization we are trying to get the correct relative gene expression abundances between cells.

Gene expression between cells is based on count data.

What does a count in a count matrix represent?

  • mRNA Capture
  • Reverse transcription of mRNA
  • sequencing of a molecule of mRNA

The most common normalization protocol is:

  • count depth scaling
  • aka CPM or counts per million
  • it assumes that all cells in the dataset initially contain an equal number of mRNA molecules
  • it assumes that count depth differences arise from sampling (see the sketch below)
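A minimal NumPy sketch of that count-depth scaling: each cell is scaled so its counts sum to one million, and a log transform is commonly applied afterwards. The toy matrix is invented; in practice your analysis toolkit handles this step.

```python
import numpy as np

# Toy count matrix: rows = cells, columns = genes.
counts = np.array([
    [100, 300, 600],
    [ 10,  30,  60],   # same relative abundances at a tenth of the depth
], dtype=float)

count_depth = counts.sum(axis=1, keepdims=True)
cpm = counts / count_depth * 1e6   # counts per million (count depth scaling)
log_cpm = np.log1p(cpm)            # common follow-up transformation

print(cpm)       # both cells become identical after depth scaling
print(log_cpm)
```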

Normalization complete

  • But wait!
  • We still have unwanted variability in the data.
  • What kind of unwanted variability?
  • What is the solution? Data Correction.

2.2.6 Data correction and integration

Sources of unwanted variability (covariates)

  • Biological covariates: e.g., Cell-Cycle effects
  • Technical covariates: e.g., Batch, Dropout

Which Covariates to Consider?

  • Depends on downstream analysis
  • Biological and technical covariates are considered (and corrected for) separately
  • Corrections are used for different purposes
  • Each approach to correction presents unique challenges

What are the Correction methods?

  • Regressing out biological effects
  • Regressing out technical effects
  • Batch effects and data integration
  • Expression recovery

2.2.6.1 Regressing out biological effects

Uncovering underlying biological signal

What is most common correction?

  • Removing effects of the cell cycle on the transcriptome is the most common correction.

What kind of effects does the cell cycle have on the transcriptome? First, let’s refresh our memory of what a transcriptome is and what happens in the cell cycle, drawing on our prior knowledge.

What happens in the cell cycle?

https://www.youtube.com/watch?v=U5vAO_f2LDQ

What’s a transcriptome and how does cell-cycle affect the transcriptome?

https://www.youtube.com/watch?v=Dp8WT94RmuQ

Why regress out cell-cycle markers?

Cell Cycle markers are conserved

How to regress out Cell-Cycle markers (a regression sketch follows below)

  • simple linear regression
  • against a cell cycle score

Where does this score come from?

  • Lists of marker genes in literature
  • Macosko et al, 2015
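A toy sketch of that idea: fit each gene's expression against the cell-cycle score with simple linear regression and keep the residuals as the corrected values. The score here is a random stand-in; in practice it would be derived from the marker-gene lists mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 100, 5
cc_score = rng.normal(size=n_cells)   # stand-in per-cell cell-cycle score

# Simulated expression with a cell-cycle contribution mixed in.
expr = rng.normal(size=(n_cells, n_genes)) + np.outer(cc_score, rng.normal(size=n_genes))

design = np.column_stack([np.ones(n_cells), cc_score])   # intercept + score
coef, *_ = np.linalg.lstsq(design, expr, rcond=None)     # fit all genes at once
corrected = expr - design @ coef                         # residuals = corrected expression

# The corrected values are (numerically) uncorrelated with the score.
print(np.corrcoef(cc_score, corrected[:, 0])[0, 1])
```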

Some argue that normalizing for Cell Size already accounts for Cell-Cycle effects

Is correcting for covariates always helpful in scRNA-Seq?

  • No
  • Again, maybe you want to study Cell-Cycle related phenomena

2.2.6.2 Regressing out technical effects

Most common technical covariates

  • Count Depth
  • Batch

Count Depth

  • Normalization makes gene counts comparable between cells
  • Even so, a residual count depth effect often remains in the data
  • Count depth can be both a biological effect and a technical effect
  • Different sized cells yield different numbers of mRNA molecules (biological)
  • Correcting for count depth can improve the performance of downstream trajectory inference algorithms that look for transitions between cells (we will see this later)
  • A large variation in count depth per cell can mask the heterogeneity between cells

2.2.6.3 Batch effects and data integration

What is a Batch effect?

Transcript measurements may be affected by variation in environment experienced by the cell.

  • Cells on different chips
  • Cells in different sequencing lanes
  • Cells harvested at different time points

Multi-level effects

  • between groups of cells in same experiment (batch correction)
  • between experiments in same laboratory (batch corrections)
  • between datasets from different laboratories (data integration)

Batch correction

  • uses linear methods
  • tries to solve the problem of variation between “batches” of cells exposed to specific experimental and/or laboratory conditions
  • methods come from bulk RNA-seq
  • works for scRNA-seq if the experiment is of low-to-medium complexity
  • ComBat (Fig 3)
  • uses the mean and variance of the data (a simplified sketch follows below)
  • the best option is good experimental design in the first place, to avoid batch effects
  • cell tagging is also a good idea because it is always possible to demultiplex later
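As a greatly simplified illustration of the mean-and-variance idea (ComBat itself adds empirical Bayes shrinkage across genes, which this sketch does not), the code below standardizes each gene within each batch and then restores the overall mean and variance:

```python
import numpy as np

def simple_batch_center(expr, batches):
    """expr: cells x genes matrix; batches: per-cell batch labels."""
    corrected = np.empty_like(expr, dtype=float)
    grand_mean = expr.mean(axis=0)
    grand_std = expr.std(axis=0) + 1e-8
    for b in np.unique(batches):
        idx = batches == b
        mu = expr[idx].mean(axis=0)
        sd = expr[idx].std(axis=0) + 1e-8
        corrected[idx] = (expr[idx] - mu) / sd * grand_std + grand_mean
    return corrected

# Toy data: two batches of cells, with an artificial shift in batch "B".
rng = np.random.default_rng(0)
expr = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
batches = np.array(["A"] * 50 + ["B"] * 50)
corrected = simple_batch_center(expr, batches)
print(corrected[batches == "A"].mean(), corrected[batches == "B"].mean())  # now comparable
```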

Data integration

  • uses non-linear methods
  • Tries to solve the problem of compositional differences between datasets.
  • If cells types or cell states are not shared between datasets, we cannot use cells from a batch to fit batch parameters for the differing datasets because we wouldn’t be able to distinguish between batch effects and biological differences.
  • Be wary of overcorrection
  • More studies are needed

Figure 3

In this visualization, you don’t want to see regions defined by batch.

  • Because the defined regions are the result of batch effects
  • The cells are aggregating by similarities based on their batch, not based on variation in expression levels between cell types and cell states.
  • Just tells us that cells were in the same experiment or lab, which we already know.
  • We need to know how the cells’ states and types vary

UMAP

2.2.6.4 Expression recovery

Correcting for technical effects

  • Trying to eliminate noise that is produced when measuring single cell transcriptomes
  • Common example of noise is dropout
  • Dropout is where an expressed gene is observed as a zero count due to sampling
  • So we are trying to recover those expression values that should not have been zero (see the toy sketch below)
  • Using appropriate normalization methods helps with dropout
  • Improve the estimate of gene-gene correlations
  • Sometimes expression recovery methods run on raw data too, not normalized
  • Warning: False correlation signals have been reportedly caused by expression recovery. (Andrews & Hemberg, 2018)
  • Better to just use expression recovery for visualization of data
  • Better not to use expression recovery for exploratory analysis or hypothesis generation
  • Experimentally verify!!
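Purely to illustrate the idea (this is not any specific published method), the toy sketch below "recovers" expression by averaging each cell with its nearest neighbors, which fills dropout zeros for genes that similar cells do express. Keeping the warning above in mind, smoothing like this can also manufacture correlations, so treat it as a visualization aid only.

```python
import numpy as np

def knn_smooth(expr, k=3):
    """Average each cell's profile over its k nearest neighbors (toy denoising)."""
    # Pairwise Euclidean distances between cells.
    diff = expr[:, None, :] - expr[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    smoothed = np.empty_like(expr, dtype=float)
    for i in range(expr.shape[0]):
        neighbors = np.argsort(dist[i])[:k]   # includes the cell itself
        smoothed[i] = expr[neighbors].mean(axis=0)
    return smoothed

# Toy normalized expression: cells 0-2 are similar; cell 0 has a dropout zero.
expr = np.array([
    [5.0, 0.0, 2.0],   # dropout: the second gene reads as zero here
    [5.1, 3.9, 2.1],
    [4.8, 4.2, 1.9],
    [0.2, 0.1, 9.0],
])
print(knn_smooth(expr, k=3).round(2))   # the zero is filled in from neighbors
```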

2.2.7 Feature selection, dimensionality reduction and visualization

2.2.7.1 Feature Selection

You can get expression values for up to 25,000 genes in a human single-cell RNA-seq dataset.

Many genes won’t be informative for any given experimental inquiry.

So you can get rid of some of those non-useful expression values.

Can get rid of zero counts (often many of these) in QC, but there are still too many expression values. There are still too many “dimensions.”

Feature Space (gene expression vectors) can have over 15,000 dimensions. How many dimensions does an xy plot have? How many dimensions is an xyz plot? As humans, can we visualize 4+ dimensions? Not really.

So we want to reduce Dimensionality. Why?

  • To ease downstream computational burden
  • To reduce noise in data
  • To visualize data (need it in 2-3 dimensions)

Feature Selection is a common first step in dimensionality reduction. In this example, the “feature” we are selecting for is high variability of genes. We only want the genes that are highly variable, because our aim is to compare cell types/cell states via expression levels of genes. Genes that have a low level of variability between cells can’t tell us much about variation between the cells.

The data are filtered to keep only genes that are informative of the variability in the data.

Typically between 1,000 and 5,000 Highly Variable Genes (HVGs) are selected for downstream analysis.

  • 200 to 2400 HVGs have similar low-dimensional representations in the PCA space.
  • so you might as well go for higher number of HVGs

How to select HVGs?

Seurat

  • Genes binned by mean expression
  • Genes with highest variance-to-mean ratio selected as HVGs
  • expects count data as input

Cell Ranger

  • expects log-transformed data to select HVGs
  • Select for this feature (HVGs) only after correcting for technical effects

Why?

  • So you don’t select HVGs based on batch effects (a selection sketch follows below).
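A simplified sketch of dispersion-based selection: rank genes by their variance-to-mean ratio and keep the top N. (Seurat additionally bins genes by mean expression and normalizes the dispersion within each bin, which this sketch skips.)

```python
import numpy as np

def top_hvgs(counts, n_top=2):
    """counts: cells x genes matrix; returns column indices of the selected HVGs."""
    mean = counts.mean(axis=0)
    var = counts.var(axis=0)
    dispersion = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    return np.argsort(dispersion)[::-1][:n_top]

# Toy count matrix: columns 0 and 3 are flat, columns 1 and 2 vary between cells.
counts = np.array([
    [1, 50,  0, 3],
    [1,  2, 40, 3],
    [1, 45,  1, 3],
    [1,  3, 38, 3],
], dtype=float)
print(top_hvgs(counts, n_top=2))   # selects the two highly variable genes: [1 2]
```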

Visualization

2.2.7.2 Dimensionality reduction

The authors say that Feature Selection is the first step in Dimensionality Reduction, but it can also be seen as a preparation step. In feature selection, we identify the HVGs and filter the data down to them. Those roughly 1,000-5,000 HVGs then undergo dimensionality reduction, which captures the underlying structure of the data in as few dimensions as possible.

scRNA-Seq data are inherently low dimensional

Cellular Expression Profile manifolds can be described by far fewer dimensions than the number of genes

Algorithms embed expression matrix into a low-dimensional space

Let’s remind ourselves what we are reducing dimension for:

  • To ease downstream computational burden
  • To reduce noise in data
  • To visualize data (need it in 2-3 dimensions)

Gene expression vectors are the feature space dimensions. Linear and non-linear combinations of these feature space dimensions (gene expression vectors) generate the reduced dimensions.

Popular dimensionality reduction techniques:

1.) Principal Component Analysis (PCA)

  • Linear Approach to reduce dimensions.
  • Does not capture data structure in few dimensions as well as non-linear methods, but it is common in analysis tools for clustering and for trajectory inference.
  • Sometimes used as pre-processing for non-linear dimensionality reduction approaches.
  • The “elbow” in Fig 4 determines the top N principal components to keep (a PCA sketch follows after this list).

2.) Diffusion Maps

  • popular for scRNA-Seq
  • non-linear approach
  • emphasize transitions in data
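As an illustration of the PCA approach above, here is a minimal NumPy sketch using the SVD: center the matrix, project onto the leading components, and inspect the explained variance per component to find the elbow. The random matrix stands in for a normalized cells-by-HVGs matrix.

```python
import numpy as np

def pca(expr, n_components=2):
    """Project a cells x genes matrix onto its top principal components."""
    centered = expr - expr.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = (s ** 2) / (expr.shape[0] - 1)
    return centered @ vt[:n_components].T, explained_variance

rng = np.random.default_rng(1)
expr = rng.normal(size=(200, 50))        # stand-in for a normalized HVG matrix
embedding, ev = pca(expr, n_components=2)
print(embedding.shape)                   # (200, 2) low-dimensional coordinates
print((ev / ev.sum())[:5].round(3))      # variance ratio per component (elbow plot)
```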

Visualization

2.2.7.3 Visualization

What does a reduced dimension become?

  • A coordinate in 2-3 dimensions
  • We need it in 2-3 dimensions to see it and interpret it.

A scatter plot is made of points at coordinates. A coordinate is a representation of reduced dimensions.

It is standard practice to use non-linear dimension reduction in visualization.

Figure 4 shows the following methods:

t-SNE

  • t-distributed stochastic neighbor embedding
  • non-linear dimension reduction
  • most Common
  • captures local similarity at expense of global structure
  • has an adjustable perplexity parameter, but changing it changes the number of apparent clusters
  • may exaggerate differences or overlook connections between cell populations

UMAP

  • non-linear dimension reduction
  • common option
  • speedy
  • scales to large numbers of cells
  • Best for visualization of exploratory data

PCA

  • linear dimension reduction

Diffusion Map (non-linear dimension reduction)

Force-directed graph layout

  • best approximation of underlying topology
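To show what this looks like in code, here is a sketch of a common visualization route using scikit-learn (an assumption of this example, not one of the workshop tools): reduce the expression matrix with PCA first, then embed the top principal components in two dimensions with t-SNE. The random data only stands in for a real HVG matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
expr = rng.normal(size=(300, 50))   # stand-in for a normalized cells x HVGs matrix

pcs = PCA(n_components=20).fit_transform(expr)                      # linear pre-reduction
embedding = TSNE(n_components=2, perplexity=30).fit_transform(pcs)  # non-linear embedding
print(embedding.shape)              # (300, 2) coordinates ready to scatter-plot

# UMAP is used the same way via the separate umap-learn package, e.g.:
#   import umap
#   embedding = umap.UMAP(n_components=2).fit_transform(pcs)
```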

2.2.8 Stages of Pre-Processed data

Table 1

  • Five stages of data processing
  • Three pre-processing layers
  • Downstream analysis

Downstream Workflow Overview

Downstream Applications Table

2.2.9 Clustering analysis

  • Clustering
  • Cluster Annotation
  • Compositional Analysis

Cluster Analysis

2.2.10 Trajectory analysis

  • Trajectory inference
  • Gene expression dynamics
  • Metastable states

Trajectory Analysis

2.2.11 Cell Level Analysis unification

2.2.12 Gene-level analysis

  • Differential Expression testing
  • Gene set analysis
  • Gene regulatory networks