2 Biological Context

Where does the sequencing data come from? Once we have analysed our sequencing data, how does it translate back to cells, tissues, and organisms?

2.1 What is Single Cell RNA Seq (scRNA-seq)?

To understand what single cell RNA Seq is, we can compare and contrast with Bulk RNA Seq. Let’s walk through the bulk RNA Seq workflow below:

Bulk RNASeq

  • An estimation of the average expression level for each gene within a population of cells
  • Population-level resolution
  • RNA from every cell in the sample is mixed together for sequencing
  • Expression levels from every cell type are lumped together

Now let’s walk through the Single Cell RNA Seq workflow:

Single Cell RNASeq

First Isolate Cells

  • Emphasizes the differences and variability between individual cells
  • Individual-level resolution
  • RNA from only one cell is sequenced
  • Expression levels from individual cell types are kept separate

2.1.1 How is a single tiny cell isolated from other cells in a sample?

Cell Isolation Example

There are many technologies available commercially to perform single cell sorting.

Cell Sorting Tools

A common protocol comes from a company called 10x Genomics. Below is an overview of the cell sorting options available to researchers just to give you a quick impression of the variety of approaches.

Cell Sorting Tools

2.1.2 Once the single cells are isolated, how do we keep track of them? How do we know which cells belong to which sequences?

Droplets and UMIs

Cell Sorting Tools

What does the UMI do?

UMIs are oligonucleotides consisting of random bases.

  • There are far more UMIs in the library than target DNA molecules, so each target DNA molecule gets tagged with a unique UMI
  • Target DNA molecule + UMI gets carried forward to amplification
  • Reads can later be grouped by UMI to generate a consensus sequence from all of the reads that contain one specific UMI (see the sketch below)
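To make the UMI grouping concrete, here is a minimal Python sketch (the reads and UMI sequences are invented purely for illustration): reads sharing a UMI are collapsed into a consensus sequence by a per-position majority vote, which irons out isolated sequencing errors.

```python
from collections import Counter, defaultdict

# Toy reads: (umi, sequence) pairs. All values are made up for illustration.
reads = [
    ("AAGT", "ACGTACGT"),
    ("AAGT", "ACGTACGA"),  # sequencing error in the last base
    ("AAGT", "ACGTACGT"),
    ("CCTG", "TTGACCGA"),
]

# Group reads by their UMI.
by_umi = defaultdict(list)
for umi, seq in reads:
    by_umi[umi].append(seq)

# Collapse each UMI group into a consensus sequence by per-position majority vote.
def consensus(seqs):
    return "".join(Counter(bases).most_common(1)[0][0] for bases in zip(*seqs))

for umi, seqs in by_umi.items():
    print(umi, consensus(seqs), f"({len(seqs)} reads)")
```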

https://www.youtube.com/watch?v=gMzzXFMyM5g

UMIs

What does the Barcode do?

  • The barcode marks which cell a read came from, so reads can be assigned back to individual cells.

What does the Poly dT do?

  • The poly(dT) sequence base-pairs with the poly(A) tail of mRNA, capturing the mRNA and priming reverse transcription.

2.2 Analyzing Sequencing Data

Biological Context for interpreting scRNA-Seq Data and Data Analysis

This is an introduction to concepts. The purpose is to familiarize you with the steps in an scRNA-Seq analysis workflow. Above all, I hope you come away understanding what each data processing step is for.

We are going to approach our learning using a journal-club model. I see that all of you are researchers and are likely very familiar with academic research publications and journal clubs. We will dissect a review paper from 2019 published in Molecular Systems Biology, titled “Current best practices in single cell RNA-seq analysis: a tutorial.”

This paper is paired with a tutorial freely available on GitHub. It is not the same tutorial that we will cover in the course, so it provides an opportunity for extra practice as homework during the week of the workshop or afterward.

Luecken, M. D., & Theis, F. J. (2019). Current best practices in single‐cell RNA‐seq analysis: a tutorial. Molecular Systems Biology, 15(6), e8746.

https://www.github.com/theislab/single-cell-tutorial.

2.2.1 Objectives:

1.) scRNA-Seq Workflow:

  • To be able to recognize, select, and design an scRNA-Seq workflow

2.) Pre-Processing and Visualization:

3.) Quality Control:

  • To distinguish between count depth per cell, genes detected per cell, count depth distribution, and mitochondrial read fractions.

  • To be able to interpret quality control metrics

4.) Normalization

5.) Data correction and integration

  • Regressing out biological effects

  • Regressing out technical effects

  • Batch effects and data integration

  • Expression recovery

6.) Feature selection, dimensionality reduction and visualization

  • Feature Selection

  • Dimensionality reduction

  • Visualization

7.) Stages of Pre-Processed data

  • Five stages of data processing

  • Three pre-processing layers

  • Downstream analysis

8.) Clustering analysis

  • Clustering

  • Cluster Annotation

  • Compositional Analysis

9.) Trajectory analysis

  • Trajectory inference

  • Gene expression dynamics

  • Metastable states

10.) Cell Level Analysis unification

11.) Gene-level analysis

  • Differential Expression testing

  • Gene set analysis

  • Gene regulatory networks

Keep an eye out for references to the following analysis tools, because you will be using them later on in the workshop:

Cell Ranger

Seurat

2.2.2 scRNA-Seq Workflow:

To be able to recognize, select, and design an scRNA-Seq workflow to answer your own question.

Analysis Workflow

Questions:

1.) To get from a cluster to an annotated cluster, what do we need to match up to the cluster data? See Figure 1.

Answer: Marker Identifiers

2.) The barcode details listed below indicate that a cell has a broken membrane. Why?

  • Barcodes with a low count depth

  • few detected genes

  • high fraction of mitochondrial counts

Answer: When the cell membrane is broken, cytoplasmic mRNA leaks out of the cell, while the mRNA enclosed in the mitochondria is retained, so mitochondrial transcripts dominate the remaining counts.

Keep in mind that there are always other implications to consider. For example: Cells involved in respiration ALSO have a high mitochondrial count.

2.2.3 Processing and Visualization

1.) Raw Data

2.) Processed data

  • molecular counts (count matrices)

  • read counts (read matrices)

  • UMIs

  • Cell Ranger

Read processing steps that turn raw data into count matrices:

  • Quality Control (QC)
  • Assigning reads to cellular barcodes (demultiplexing); a toy sketch follows below
  • Assigning reads to mRNA molecules of origin (demultiplexing)
  • Genome alignment
  • Quantification
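As a rough illustration of demultiplexing, the toy sketch below pulls the cell barcode and UMI out of read 1, assuming a 10x Chromium v3-style read layout (a 16 bp cell barcode followed by a 12 bp UMI); the sequences and barcode whitelist are invented for the example.

```python
# Demultiplexing sketch: extract the cell barcode and UMI from read 1,
# assuming a 10x Chromium v3-style layout (16 bp barcode + 12 bp UMI).
BARCODE_LEN = 16
UMI_LEN = 12

def split_read1(seq):
    barcode = seq[:BARCODE_LEN]
    umi = seq[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    return barcode, umi

# Toy whitelist of known valid cell barcodes.
whitelist = {"AAACCCAAGAAACACT"}

read1 = "AAACCCAAGAAACACT" + "ACGTACGTACGT" + "TTTTTTTT"  # barcode + UMI + poly(dT)
barcode, umi = split_read1(read1)
if barcode in whitelist:
    print("cell:", barcode, "molecule:", umi)
```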

3.) Doublets

  • An error whereby one barcode ends up tagging multiple cells instead of one
  • e.g., two cells captured in the same droplet

4.) Empty Droplet/Well

  • An error where a barcode tags no cell at all
  • i.e., zero cells in the droplet or well

2.2.4 Quality Control:

To distinguish between count depth per cell, genes detected per cell, count depth distribution, and mitochondrial read fractions. See Figure 2.

We will be regularly referring to “covariates” so it is important to keep in mind what a covariate is.

With respect to quality control in scRNA-seq, covariates are:

  • counts per barcode (count depth)
  • genes per barcode
  • fractions of counts from mitochondrial genes

Examine all three QC covariates jointly and cross-reference the results, because the covariates depend on one another (a sketch of computing them follows below).
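Here is a minimal NumPy sketch of how those three covariates can be computed from a toy count matrix (gene names and counts are invented; in practice tools such as Seurat report these metrics for you):

```python
import numpy as np

# Toy count matrix: rows = cells (barcodes), columns = genes; values are UMI counts.
# "MT-" marks mitochondrial genes, following human gene annotation.
genes = np.array(["GENE1", "GENE2", "MT-ND1", "MT-CO1"])
counts = np.array([
    [120,  35, 10,  5],
    [  2,   0,  8,  6],   # low depth, few genes, high mito fraction -> suspect barcode
    [300, 180, 20, 15],
])

count_depth = counts.sum(axis=1)            # counts per barcode (count depth)
genes_per_cell = (counts > 0).sum(axis=1)   # genes detected per barcode
is_mito = np.char.startswith(genes, "MT-")
mito_fraction = counts[:, is_mito].sum(axis=1) / count_depth

for i in range(counts.shape[0]):
    print(f"cell {i}: depth={count_depth[i]}, genes={genes_per_cell[i]}, "
          f"mito={mito_fraction[i]:.2f}")
```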

Questions:

What about barcodes that have

  • unexpectedly high counts
  • large number of detected genes?

What might that indicate?

  • Answer: They could be doublets
  • A doublet is where multiple cells are captured together in one microfluidic droplet

What might a quiescent cell population’s barcode details look like?

  • Answer:
  • low counts
  • low genes

What might the QC covariates show for cells that are larger in size?

  • Answer:
  • high count depth

Should thresholds be set permissively?

  • Answer: yes
  • as permissively as possible

Why? - Don’t want to filter out viable cell populations.

What if you don’t filter?

  • At a minimum, barcodes with the lowest count depth and the fewest genes per barcode should be considered non-viable (see the filtering sketch below)
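Here is a sketch of permissive filtering on those covariates. The thresholds are placeholders only; real cutoffs come from inspecting the covariate distributions of your own dataset.

```python
import numpy as np

# Placeholder thresholds -- choose real values by looking at your QC plots.
MIN_COUNTS = 500
MIN_GENES = 200
MAX_MITO_FRACTION = 0.2

def qc_mask(count_depth, genes_per_cell, mito_fraction):
    """Return True for barcodes that pass all three (permissive) filters."""
    return (
        (count_depth >= MIN_COUNTS)
        & (genes_per_cell >= MIN_GENES)
        & (mito_fraction <= MAX_MITO_FRACTION)
    )

# Toy covariate values for three barcodes.
count_depth = np.array([5200, 310, 12000])
genes_per_cell = np.array([1800, 150, 3200])
mito_fraction = np.array([0.05, 0.45, 0.08])
print(qc_mask(count_depth, genes_per_cell, mito_fraction))  # [ True False  True]
```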

Quality Control

2.2.5 Normalization

With normalization we are trying to get the correct relative gene expression abundances between cells.

Gene expression between cells is based on count data.

What does a count in a count matrix represent?

  • mRNA Capture
  • Reverse transcription of mRNA
  • sequencing of a molecule of mRNA

The most common normalization protocol is:

  • count depth scaling
  • aka CPM or counts per million
  • it assumes that all cells in the dataset initially contain an equal number of mRNA molecules
  • it assumes that count depth differences arise from sampling (see the sketch below)
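A minimal NumPy sketch of that count-depth scaling: each cell is scaled so its counts sum to one million, and a log transform is commonly applied afterwards. The toy matrix is invented; in practice your analysis toolkit handles this step.

```python
import numpy as np

# Toy count matrix: rows = cells, columns = genes.
counts = np.array([
    [100, 300, 600],
    [ 10,  30,  60],   # same relative abundances at a tenth of the depth
], dtype=float)

count_depth = counts.sum(axis=1, keepdims=True)
cpm = counts / count_depth * 1e6   # counts per million (count depth scaling)
log_cpm = np.log1p(cpm)            # common follow-up transformation

print(cpm)       # both cells become identical after depth scaling
print(log_cpm)
```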

Normalization complete

  • But wait!
  • We still have unwanted variability in the data.
  • What kind of unwanted variability?
  • What is the solution? Data Correction.

2.2.6 Data correction and integration

Sources of unwanted variability (covariates)

  • Biological covariates: e.g., Cell-Cycle effects
  • Technical covariates: e.g., Batch, Dropout

Which Covariates to Consider?

  • Depends on downstream analysis
  • Biological and technical covariates are considered (and corrected for) separately
  • Corrections are used for different purposes
  • Each approach to correction presents unique challenges

What are the Correction methods?

  • Regressing out biological effects
  • Regressing out technical effects
  • Batch effects and data integration
  • Expression recovery

2.2.6.1 Regressing out biological effects

Uncovering underlying biological signal

What is most common correction?

  • Removing effects of the cell cycle on the transcriptome is the most common correction.

What kind of effects does the cell cycle have on the transcriptome? First, let’s refresh our memory of what a transcriptome is and what happens in the cell cycle, drawing on our prior knowledge.

What happens in the cell cycle?

https://www.youtube.com/watch?v=U5vAO_f2LDQ

What’s a transcriptome and how does cell-cycle affect the transcriptome?

https://www.youtube.com/watch?v=Dp8WT94RmuQ

Why regress out cell-cycle markers?

Cell Cycle markers are conserved

How to regress out Cell-Cycle markers (a regression sketch follows below)

  • simple linear regression
  • against a cell cycle score

Where does this score come from?

  • Lists of marker genes in literature
  • Macosko et al, 2015
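A toy sketch of that idea: fit each gene's expression against the cell-cycle score with simple linear regression and keep the residuals as the corrected values. The score here is a random stand-in; in practice it would be derived from the marker-gene lists mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 100, 5
cc_score = rng.normal(size=n_cells)   # stand-in per-cell cell-cycle score

# Simulated expression with a cell-cycle contribution mixed in.
expr = rng.normal(size=(n_cells, n_genes)) + np.outer(cc_score, rng.normal(size=n_genes))

design = np.column_stack([np.ones(n_cells), cc_score])   # intercept + score
coef, *_ = np.linalg.lstsq(design, expr, rcond=None)     # fit all genes at once
corrected = expr - design @ coef                         # residuals = corrected expression

# The corrected values are (numerically) uncorrelated with the score.
print(np.corrcoef(cc_score, corrected[:, 0])[0, 1])
```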

Some argue that normalizing for Cell Size already accounts for Cell-Cycle effects

Is correcting for covariates always helpful in scRNA-Seq?

  • No
  • Again, maybe you want to study Cell-Cycle related phenomena

2.2.6.2 Regressing out technical effects

Most common technical covariates

  • Count Depth
  • Batch

Count Depth

  • Normalization makes gene counts comparable between cells
  • Even so, a residual count depth effect often remains in the data
  • Count depth can be both a biological effect and a technical effect
  • Different sized cells yield different numbers of mRNA molecules (biological)
  • Correcting for count depth can improve the performance of downstream trajectory inference algorithms that look for transitions between cells (we will see this later)
  • A large variation in count depth per cell can mask the heterogeneity between cells

2.2.6.3 Batch effects and data integration

What is a Batch effect?

Transcript measurements may be affected by variation in environment experienced by the cell.

  • Cells on different chips
  • Cells in different sequencing lanes
  • Cells harvested at different time points

Multi-level effects

  • between groups of cells in same experiment (batch correction)
  • between experiments in same laboratory (batch corrections)
  • between datasets from different laboratories (data integration)

Batch correction

  • uses linear methods
  • tries to solve the problem of variation between “batches” of cells exposed to specific experimental and/or laboratory conditions
  • methods come from bulk RNA-seq
  • works for scRNA-seq if the experiment is of low-to-medium complexity
  • ComBat (Fig 3)
  • uses the mean and variance of the data (a simplified sketch follows below)
  • the best option is good experimental design in the first place, to avoid batch effects
  • cell tagging is also a good idea because it is always possible to demultiplex later
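As a greatly simplified illustration of the mean-and-variance idea (ComBat itself adds empirical Bayes shrinkage across genes, which this sketch does not), the code below standardizes each gene within each batch and then restores the overall mean and variance:

```python
import numpy as np

def simple_batch_center(expr, batches):
    """expr: cells x genes matrix; batches: per-cell batch labels."""
    corrected = np.empty_like(expr, dtype=float)
    grand_mean = expr.mean(axis=0)
    grand_std = expr.std(axis=0) + 1e-8
    for b in np.unique(batches):
        idx = batches == b
        mu = expr[idx].mean(axis=0)
        sd = expr[idx].std(axis=0) + 1e-8
        corrected[idx] = (expr[idx] - mu) / sd * grand_std + grand_mean
    return corrected

# Toy data: two batches of cells, with an artificial shift in batch "B".
rng = np.random.default_rng(0)
expr = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
batches = np.array(["A"] * 50 + ["B"] * 50)
corrected = simple_batch_center(expr, batches)
print(corrected[batches == "A"].mean(), corrected[batches == "B"].mean())  # now comparable
```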

Data integration

  • uses non-linear methods
  • Tries to solve the problem of compositional differences between datasets.
  • If cells types or cell states are not shared between datasets, we cannot use cells from a batch to fit batch parameters for the differing datasets because we wouldn’t be able to distinguish between batch effects and biological differences.
  • Be wary of overcorrection
  • More studies are needed

Figure 3

In this visualization, you don’t want to see regions defined by batch.

  • Because the defined regions are the result of batch effects
  • The cells are aggregating by similarities based on their batch, not based on variation in expression levels between cell types and cell states.
  • Just tells us that cells were in the same experiment or lab, which we already know.
  • We need to know how the cells’ states and types vary

UMAP

2.2.6.4 Expression recovery

Correcting for technical effects

  • Trying to eliminate noise that is produced when measuring single cell transcriptomes
  • Common example of noise is dropout
  • Dropout is where an expressed gene is observed as a zero count due to sampling
  • So we are trying to recover those expression values that should not have been zero (see the toy sketch below)
  • Using appropriate normalization methods helps with dropout
  • Improve the estimate of gene-gene correlations
  • Sometimes expression recovery methods run on raw data too, not normalized
  • Warning: False correlation signals have been reportedly caused by expression recovery. (Andrews & Hemberg, 2018)
  • Better to just use expression recovery for visualization of data
  • Better not to use expression recovery for exploratory analysis or hypothesis generation
  • Experimentally verify!!
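Purely to illustrate the idea (this is not any specific published method), the toy sketch below "recovers" expression by averaging each cell with its nearest neighbors, which fills dropout zeros for genes that similar cells do express. Keeping the warning above in mind, smoothing like this can also manufacture correlations, so treat it as a visualization aid only.

```python
import numpy as np

def knn_smooth(expr, k=3):
    """Average each cell's profile over its k nearest neighbors (toy denoising)."""
    # Pairwise Euclidean distances between cells.
    diff = expr[:, None, :] - expr[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    smoothed = np.empty_like(expr, dtype=float)
    for i in range(expr.shape[0]):
        neighbors = np.argsort(dist[i])[:k]   # includes the cell itself
        smoothed[i] = expr[neighbors].mean(axis=0)
    return smoothed

# Toy normalized expression: cells 0-2 are similar; cell 0 has a dropout zero.
expr = np.array([
    [5.0, 0.0, 2.0],   # dropout: the second gene reads as zero here
    [5.1, 3.9, 2.1],
    [4.8, 4.2, 1.9],
    [0.2, 0.1, 9.0],
])
print(knn_smooth(expr, k=3).round(2))   # the zero is filled in from neighbors
```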

2.2.7 Feature selection, dimensionality reduction and visualization

2.2.7.1 Feature Selection

You can get expression values for up to 25,000 genes in a human single-cell RNA-seq dataset.

Many genes won’t be informative for any given experimental inquiry.

So you can get rid of some of those non-useful expression values.

Can get rid of zero counts (often many of these) in QC, but there are still too many expression values. There are still too many “dimensions.”

Feature Space (gene expression vectors) can have over 15,000 dimensions. How many dimensions does an xy plot have? How many dimensions is an xyz plot? As humans, can we visualize 4+ dimensions? Not really.

So we want to reduce Dimensionality. Why?

  • To ease downstream computational burden
  • To reduce noise in data
  • To visualize data (need it in 2-3 dimensions)

Feature Selection is a common first step in dimensionality reduction. In this example, the “feature” we are selecting for is high variability of genes. We only want the genes that are highly variable, because our aim is to compare cell types/cell states via expression levels of genes. Genes that have a low level of variability between cells can’t tell us much about variation between the cells.

The data are filtered to keep only genes that are informative of the variability in the data.

Typically between 1,000 and 5,000 Highly Variable Genes (HVGs) are selected for downstream analysis.

  • 200 to 2400 HVGs have similar low-dimensional representations in the PCA space.
  • so you might as well go for higher number of HVGs

How to select HVGs?

Seurat

  • Genes binned by mean expression
  • Genes with highest variance-to-mean ratio selected as HVGs
  • expects count data as input

Cell Ranger

  • expects log-transformed data to select HVGs
  • Select for this feature (HVGs) only after correcting for technical effects

Why?

  • So you don’t select HVGs based on batch effects (a selection sketch follows below).
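A simplified sketch of dispersion-based selection: rank genes by their variance-to-mean ratio and keep the top N. (Seurat additionally bins genes by mean expression and normalizes the dispersion within each bin, which this sketch skips.)

```python
import numpy as np

def top_hvgs(counts, n_top=2):
    """counts: cells x genes matrix; returns column indices of the selected HVGs."""
    mean = counts.mean(axis=0)
    var = counts.var(axis=0)
    dispersion = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    return np.argsort(dispersion)[::-1][:n_top]

# Toy count matrix: columns 0 and 3 are flat, columns 1 and 2 vary between cells.
counts = np.array([
    [1, 50,  0, 3],
    [1,  2, 40, 3],
    [1, 45,  1, 3],
    [1,  3, 38, 3],
], dtype=float)
print(top_hvgs(counts, n_top=2))   # selects the two highly variable genes: [1 2]
```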

Visualization

2.2.7.2 Dimensionality reduction

The authors say that Feature Selection is the first step in Dimensionality Reduction, but it can also be seen as a preparation step. In feature selection, we identify the HVGs and filter the data down to them. Those roughly 1,000-5,000 HVGs then undergo dimensionality reduction, which captures the underlying structure of the data in as few dimensions as possible.

scRNA-Seq data are inherently low dimensional

Cellular Expression Profile manifolds can be described by far fewer dimensions than the number of genes

Algorithms embed expression matrix into a low-dimensional space

Let’s remind ourselves what we are reducing dimension for:

  • To ease downstream computational burden
  • To reduce noise in data
  • To visualize data (need it in 2-3 dimensions)

Gene expression vectors are the feature space dimensions. Linear and non-linear combinations of these feature space dimensions (gene expression vectors) generate the reduced dimensions.

Popular dimensionality reduction techniques:

1.) Principal Component Analysis (PCA)

  • Linear Approach to reduce dimensions.
  • Does not capture data structure in few dimensions as well as non-linear methods, but it is common in analysis tools for clustering and for trajectory inference.
  • Sometimes used as pre-processing for non-linear dimensionality reduction approaches.
  • The “elbow” in Fig 4 determines the top N principal components to keep (a PCA sketch follows after this list).

2.) Diffusion Maps

  • popular for scRNA-Seq
  • non-linear approach
  • emphasize transitions in data
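As an illustration of the PCA approach above, here is a minimal NumPy sketch using the SVD: center the matrix, project onto the leading components, and inspect the explained variance per component to find the elbow. The random matrix stands in for a normalized cells-by-HVGs matrix.

```python
import numpy as np

def pca(expr, n_components=2):
    """Project a cells x genes matrix onto its top principal components."""
    centered = expr - expr.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = (s ** 2) / (expr.shape[0] - 1)
    return centered @ vt[:n_components].T, explained_variance

rng = np.random.default_rng(1)
expr = rng.normal(size=(200, 50))        # stand-in for a normalized HVG matrix
embedding, ev = pca(expr, n_components=2)
print(embedding.shape)                   # (200, 2) low-dimensional coordinates
print((ev / ev.sum())[:5].round(3))      # variance ratio per component (elbow plot)
```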

Visualization

2.2.7.3 Visualization

What does a reduced dimension become?

  • A coordinate in 2-3 dimensions
  • We need it in 2-3 dimensions to see it and interpret it.

A scatter plot is made of points at coordinates. A coordinate is a representation of reduced dimensions.

It is standard practice to use non-linear dimension reduction in visualization.

Figure 4 shows the following methods:

t-SNE

  • t-distributed stochastic neighbor embedding
  • non-linear dimension reduction
  • most Common
  • captures local similarity at expense of global structure
  • has an adjustable perplexity parameter, but changing it changes the number of apparent clusters
  • may exaggerate differences or overlook connections between cell populations

UMAP

  • non-linear dimension reduction
  • common option
  • speedy
  • scales to large numbers of cells
  • Best for visualization of exploratory data

PCA

  • linear dimension reduction

Diffusion Map (non-linear dimension reduction)

Force-directed graph layout

  • best approximation of underlying topology
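To show what this looks like in code, here is a sketch of a common visualization route using scikit-learn (an assumption of this example, not one of the workshop tools): reduce the expression matrix with PCA first, then embed the top principal components in two dimensions with t-SNE. The random data only stands in for a real HVG matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
expr = rng.normal(size=(300, 50))   # stand-in for a normalized cells x HVGs matrix

pcs = PCA(n_components=20).fit_transform(expr)                      # linear pre-reduction
embedding = TSNE(n_components=2, perplexity=30).fit_transform(pcs)  # non-linear embedding
print(embedding.shape)              # (300, 2) coordinates ready to scatter-plot

# UMAP is used the same way via the separate umap-learn package, e.g.:
#   import umap
#   embedding = umap.UMAP(n_components=2).fit_transform(pcs)
```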

2.2.8 Stages of Pre-Processed data

Table 1

  • Five stages of data processing
  • Three pre-processing layers
  • Downstream analysis

Downstream Workflow Overview

Downstream Applications Table

2.2.9 Clustering analysis

  • Clustering
  • Cluster Annotation
  • Compositional Analysis

Cluster Analysis

2.2.10 Trajectory analysis

  • Trajectory inference
  • Gene expression dynamics
  • Metastable states

Trajectory Analysis

2.2.11 Cell Level Analysis unification

2.2.12 Gene-level analysis

  • Differential Expression testing
  • Gene set analysis
  • Gene regulatory networks