Section 5 Normalization

Correct technical biases and prepare your datasets for further analyses:

  • Basic transformations (log transformation, scaling to unit variance, quantile normalization)
  • Normalizations specific to certain data types (for count data such as RNA-seq or for compositional data such as metagenomics)
  • Normalization for the correction of an explicit batch effect
  • All normalization methods come with diagnostic plots.

5.1 Standard transformation

5.1.1 Logarithm transformation

5.1.1.1 Interpretation

The logarithm transformation aims at making variable distributions closer to a Gaussian distribution. This is a requirement for parametric tests (ANOVA, Student's t-test, …) and is advised for exploratory analyses. In particular, it is common to apply this transformation to microarray data before all subsequent analyses, or to TMM / TMMwsp-normalized RNA-seq data before subsequent exploratory analyses (PCA, clustering, …).

On the before/after boxplot, check that the variable distributions are symmetric, with only a few outliers.

For RNA-seq data, see the corresponding menu, which provides an option to combine normalization and log-transformation.

5.1.1.2 Default parameters

Functions used and default parameters:

The log transformation corresponds to the following computation:

log(df + prior.count), with the log function being either log, log2 or log10.
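A minimal sketch of this computation in R, assuming a small numeric data frame df (individuals in rows, variables in columns) and a pseudo-count prior.count (the toy values below are purely illustrative):

  # Toy dataset: individuals in rows, variables in columns
  df <- data.frame(var1 = c(0, 12, 250), var2 = c(3, 40, 7))
  prior.count <- 1

  # log2 is shown here; log (natural) or log10 can be used instead
  df_log <- log2(df + prior.count)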

5.1.2 Scaling

5.1.2.1 Interpretation

Centering and scaling are useful for bringing variables that are measured on different scales to a comparable scale. Scaling is also performed by default in most exploratory analyses (clustering, PCA, …), without the need to use the transformed dataset.

5.1.2.2 Default parameters

Functions used and default parameters:

The scale transformation uses the following function (see the sketch after this list):

  • If scale to unit variance is unchecked: scale(df, center = TRUE, scale = FALSE),

  • If scale to unit variance is checked: scale(df, center = TRUE, scale = TRUE).
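A minimal sketch of both options in R, assuming a numeric data frame df with individuals in rows and variables in columns (toy values chosen for illustration):

  # Toy dataset with variables on very different scales
  df <- data.frame(weight = c(60, 75, 82), height = c(1.62, 1.80, 1.75))

  # "Scale to unit variance" unchecked: centering only
  df_centered <- scale(df, center = TRUE, scale = FALSE)

  # "Scale to unit variance" checked: centering and scaling to unit variance
  df_scaled <- scale(df, center = TRUE, scale = TRUE)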

5.2 Quantile normalization

5.2.1 Interpretation

Quantile normalization aims at aligning the distributions of all individuals (over all variable values). This type of normalization is usually advised to correct technical biases in experiments that produce a large number of continuous measurements (such as microarray data, after they have been log2-transformed). Check that all density plots are well aligned after quantile normalization.

5.2.2 Default parameters

Functions used and default parameters:

The quantile transformation uses the following function:

preprocessCore::normalize.quantiles(x = t(df)).
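A minimal sketch in R, assuming a numeric matrix with individuals in rows and variables in columns (toy data; normalize.quantiles expects individuals in columns, hence the transpositions):

  library(preprocessCore)

  # Toy dataset: 5 individuals (rows) x 6 variables (columns)
  mat <- matrix(rnorm(30, mean = 8, sd = 2), nrow = 5)

  # normalize.quantiles() works on a matrix with individuals in columns
  qnorm_mat <- normalize.quantiles(x = t(mat))

  # Back to individuals in rows
  mat_normalized <- t(qnorm_mat)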

5.3 RNA-seq (and other count) data

5.3.1 Interpretation

Two different types of normalization are implemented: TMM (which is advised in most cases) and TMMwsp, both as implemented in the R package edgeR. These normalizations aim at correcting the sequencing-depth bias and at making counts comparable between individuals. TMMwsp is to be preferred over TMM when the number of zeros in your data is large.
Check that the boxplots are similar after normalization.

These normalizations are designed for count data. If your RNA-seq data are already normalized, you must not apply these methods (otherwise the results of your tests might be incorrect, because tests for count data use both raw and normalized data).

If your data are already log-transformed, you can undo this log-transformation and perform the TMM or TMMwsp normalization. Log-transformed RNA-seq counts have to be used for exploratory analyses (PCA, clustering, …) but should not be used for tests.

5.3.2 Default parameters

Functions used and default parameters:

The normalization of count data consists of up to three steps, depending on which ones the user has chosen (see the sketch after this list):

  • filter: remove values lower than or equal to a given threshold,

  • TMM: edgeR::calcNormFactors with option method = 'TMM',

  • log(cpm(x) + prior.count): a log transformation applied to CPM values computed with edgeR::cpm(y = df, lib.size = lib.size).
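A minimal sketch of these three steps in R with edgeR, using a toy count matrix (note that edgeR expects genes in rows and individuals in columns; the filtering threshold and prior.count values below are illustrative, not necessarily the tool's defaults):

  library(edgeR)

  # Toy count matrix: genes in rows, individuals in columns
  counts <- matrix(rpois(60, lambda = 50), nrow = 10,
                   dimnames = list(paste0("gene", 1:10), paste0("ind", 1:6)))

  # Filter: one possible rule, dropping genes whose counts never exceed 5
  counts <- counts[rowSums(counts > 5) > 0, , drop = FALSE]

  # TMM normalization factors (use method = "TMMwsp" when zeros are frequent)
  dge <- DGEList(counts = counts)
  dge <- calcNormFactors(dge, method = "TMM")

  # Log transformation of CPM values, as in log(cpm(x) + prior.count)
  prior.count <- 1
  logcpm <- log2(cpm(y = dge) + prior.count)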

5.4 Compositional data

5.4.1 Interpretation

Compositional data are data for which the sum of all variable values is the same for every individual (e.g., percentages are compositional data because they sum to 1). Such data are not suited to direct use in standard statistical analyses. We propose different types of transformations that correct this problem. Barplots are used to represent the value of each variable for each individual; after the transformation, the bars corresponding to individuals are expected to have varying heights.

5.4.2 Default parameters

Functions used and default parameters:

If the dataset is treated as count data, the normalization consists of the three steps described for RNA-seq data (Section 5.3.2). If it is treated as compositional data, the normalization consists of three steps (see the sketch after this list):

  • offset: add a value to all cells in the dataset (df + offset),

  • filter: remove values lower than or equal to a given threshold,

  • normalization: one of the following methods, chosen by the user:

    • CLR: compositions::clr with the default parameters,

    • ILR: compositions::ilr with the default parameters,

    • CSS: metagenomeSeq::cumNormStatFast, with the default parameters.
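A minimal sketch of the offset and CLR/ILR steps in R (the filter step is omitted), assuming the compositions package and a toy proportion matrix (the offset value is illustrative; CSS additionally requires building a metagenomeSeq MRexperiment object and is not shown here):

  library(compositions)

  # Toy compositional dataset: each row (individual) sums to 1
  props <- matrix(c(0.20, 0.50, 0.30,
                    0.10, 0.60, 0.30,
                    0.40, 0.40, 0.20),
                  nrow = 3, byrow = TRUE)

  # Offset: add a small value so that zeros do not break the log-ratios
  props <- props + 1e-6

  # CLR and ILR transformations with the default parameters
  props_clr <- clr(props)
  props_ilr <- ilr(props)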

5.5 Metagenomic data

5.5.1 Interpretation

Metagenomic data can be considered as count data (like RNA-seq) or as compositional data. Both situations are handled with different normalization methods (similar to the ones described in the corresponding tabs).

When considering metagenomic data as count data, TMMwsp normalization might be a better choice than TMM because it accounts for the excess of zeros that metagenomic data usually have.

Be careful to check whether your metagenomic data are already log-transformed before choosing a normalization.

5.5.2 Default parameters

Functions used and default parameters:

When metagenomic data are treated as compositional data, the normalization uses one of the following functions, depending on the chosen method:

  • For CLR normalization: compositions::clr with the default parameters,

  • For ILR normalization: compositions::ilr with the default parameters.
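For the count-data option mentioned in the interpretation above, the normalization is similar to the RNA-seq steps of Section 5.3.2. A minimal sketch with edgeR and the TMMwsp method (toy, zero-inflated counts chosen for illustration):

  library(edgeR)

  # Toy OTU table with many zeros: taxa in rows, individuals in columns
  otu <- matrix(rnbinom(80, size = 0.3, mu = 20), nrow = 10,
                dimnames = list(paste0("otu", 1:10), paste0("ind", 1:8)))

  # TMMwsp handles the excess of zeros typical of metagenomic counts
  dge <- DGEList(counts = otu)
  dge <- calcNormFactors(dge, method = "TMMwsp")

  # log-CPM values for exploratory analyses (PCA, clustering, ...)
  logcpm <- log2(cpm(y = dge) + 1)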

5.6 Correcting batch effect with ComBat

5.6.1 Interpretation

The ComBat approach, as implemented in the R package sva, is designed to correct an explicit batch effect (described by a factor that must be provided along with the dataset to normalize). It can be applied to continuous data (like microarray data) or to count data (before TMM or TMMwsp normalization).

5.6.2 Default parameters

Functions used and default parameters (see the sketch after this list):

  • In case of count data: sva::ComBat_seq(counts = as.matrix(df), batch = batches),
  • otherwise: sva::ComBat(dat = df, batch = batches).
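A minimal usage sketch for both cases, assuming a toy dataset and a toy batch factor (note that sva expects variables in rows and individuals in columns):

  library(sva)

  # Batch factor: two runs of three individuals each
  batches <- factor(c("run1", "run1", "run1", "run2", "run2", "run2"))

  # Continuous data (e.g. log2 microarray intensities): variables x individuals
  df <- matrix(rnorm(60, mean = 8), nrow = 10,
               dimnames = list(paste0("var", 1:10), paste0("ind", 1:6)))
  df_corrected <- ComBat(dat = df, batch = batches)

  # Count data (apply before TMM / TMMwsp normalization)
  counts <- matrix(rpois(60, lambda = 30), nrow = 10)
  counts_corrected <- ComBat_seq(counts = as.matrix(counts), batch = batches)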