vignettes/BIRSBIO2020.scSpatial.sPLS-DA_MINT.Rmd
BIRSBIO2020.scSpatial.sPLS-DA_MINT.Rmd
Tasic et al. (2016) used scRNA sequencing to explore the extent of cell types of mouse primal visual cortex. In the study, 1723 high-quality cells with thousands of genes in primal visual cortex in male adult mice were sequenced on a single cell resolution for cell classification. This study identified 49 cell types, including 23 GABAergic, 19 glutamatergic and 7 non-neural types in the 1723 cells.
The cell classification result from Tasic et al. (2016) was used by Zhu et al. (2018) to design a study to distinguish the difference between intrinsic and extrinsic effect on gene expression. Intrinsic effect means the regulatory gene network, while extrinsic means the cellular microenvironment. The study was conducted by combining scRNA sequencing data and smFISH data (single-molecule fluorescence in situ hybridization). The former one has high molecular resolution of transcriptomics (thousands of genes) without spatial information, while the latter keeps the spatial information but loses the high resolution (only a few hundred genes).
Zhu et al. (2018) mapped the sRNA sequencing data to the seqFISH data to enhance the molecular resolution of the cells using SVM (support vector machine). The model was trained to identify the major cell type difference by training on scRNA data of 8 groups, GABAergic and Glutamatergic are the major neuron types, and other non-neuronal types, including Astrocytes, Mndothelial cells, microcytes, and three types of Oligocytes. The selected features (genes) were the top 43 differentially expressed (DE) genes in the identified 113 genes. The classification result was FURTHER validated by different evidence like cell type specific staining and previously reported marker genes.
Now, given a dataset of gene expression levels of 1723 cells and 113 genes and the cell types (from Tasic et al. 2016), our task is to predict the cells provided by Zhu et al. (2018), 1597 cells with 113 genes.
Outline of analysis:
We will adopt a semi-supervised approach to classify the cells of seqFISH data.
This semi-supervised method can borrow the information from the seqFISH data and make the model customised. The combination of the data will be used to train a MINT.sPLS-DA model. MINT (Multivariate INTegrative method) (Rohart et al. 2017) is robust for integrating data from different sources regardless of the batch effect. The top discriminative genes identified from the model will be validated using previous literature as evidence. The determination of the minimal number of genes will be done by restricting the values of keepX and the performance will be monitored by balanced error rate (BER). BER is the average of the proportion of wrong classifications in each class. The lower the BER, the more accurate the model is.
First we import the data, including the training data (scRNA
), the training labels (scRNA_label
), the test data (seqFISH
) and the predicted test labels by Zhu’s paper (seqFISH_label
).
##### import the data #####
scRNA1 = read.delim("data/tasic_training_b2.txt",header = FALSE, sep ='\t')
scRNA <- scRNA1[,-1]
rownames(scRNA) <- scRNA1[,1]
scRNA = data.frame(t(scRNA))
rownames(scRNA) <- paste(rep("scRNA",dim(scRNA)[1]),1:dim(scRNA)[1],sep = "_")
seqFISH1 = read.delim("data/seqfish_cortex_b2_testing.txt",header = FALSE, sep ='\t')
seqFISH <- seqFISH1[,-1]
rownames(seqFISH) <- seqFISH1[,1]
seqFISH = data.frame(t(seqFISH))
rownames(seqFISH) <- paste(rep("seqFISH",dim(seqFISH)[1]),1:dim(seqFISH)[1],sep = "_")
scRNA_label = read.table("data/tasic_labels.tsv", sep = "\t")[,1]
seqFISH_label = read.table("data/seqfish_labels.tsv", sep = "\t")[,3]
scRNA
is from Taisc et el. 2016. It contains the gene expression level of 1723 cells with 113 genes.
dim(scRNA)
## [1] 1723 113
#Let's take a look at the first 6 cells.
head(scRNA[,1:6])
## abca15 abca9 acta2 adcy4 aldh3b2 amigo2
## scRNA_1 11 22 15 12 27 23
## scRNA_2 42 46 47 45 49 43
## scRNA_3 17 22 15 12 27 101
## scRNA_4 42 46 42 45 49 43
## scRNA_5 35 39 34 38 42 72
## scRNA_6 35 39 42 38 42 43
n_genes <- 113
n_scRNA <- 1723
seqFISH
is from Zhu et el. 2018. It contains the gene expression level of 1597 cells with the 113 genes shared with scRNA
.
dim(seqFISH)
## [1] 1597 113
# Let's take a look at the first 6 cells.
head(seqFISH[,1:6])
## abca15 abca9 acta2 adcy4 aldh3b2 amigo2
## seqFISH_1 68 41 25 39 101 93
## seqFISH_2 49 42 23 54 47 64
## seqFISH_3 50 38 16 37 41 93
## seqFISH_4 39 36 21 18 52 93
## seqFISH_5 31 47 29 37 101 93
## seqFISH_6 59 18 36 26 50 93
n_seqFISH <- 1597
scRNA_label
has the class of the cells corresponding to scRNA
data. The phenotyping was done by Tasic et al. and the labels are treated as ground truth in our analysis and are used for training models.
length(scRNA_label)
## [1] 1723
# Let's take a look at the number of cells in each type of the scRNA data
table(scRNA_label)
## scRNA_label
## Astrocyte Endothelial Cell GABA-ergic Neuron
## 43 29 761
## Glutamatergic Neuron Microglia Oligodendrocyte.1
## 812 22 19
## Oligodendrocyte.2 Oligodendrocyte.3
## 6 31
There are 8 classes in scRNA_label
. The numbers of cells in each type are not balanced.
seqFISH_label
has the classes predicted for seqFISH
data by Zhu et al. The predicitons have not been benchmarked, thus seqFISH_label
will not be used in the prediction. It will be used to compare my prediction with Zhu’s prediction.
length(seqFISH_label)
## [1] 1597
# the number of cells in each cell type of the seqFISH data predicted by Zhu's paper
table(seqFISH_label)
## seqFISH_label
## Astrocyte Endothelial Cell GABA-ergic Neuron
## 97 11 556
## Glutamatergic Neuron Microglia Oligodendrocyte.1
## 859 22 8
## Oligodendrocyte.2 Oligodendrocyte.3
## 21 23
First, we check the gene expression distribution of each gene in scRNA
and seqFISH
.
The histograms show that the features (genes) are of the same scale and the distributions are almost normal. It means that there’s no need for further feature normalisation.
Now we check the gene expression distribution of the 40 random samples in each dataset.
par(mfrow = c(1,1))
random_scRNA = sample(1:n_scRNA, 40, replace=FALSE)
boxplot(t(scRNA)[,random_scRNA], names = random_scRNA,
main = "gene expression boxplot of 40 random scRNA samples",
xlab = "sample ID (scRNA)",cex.main = 1)
random_seqFISH = sample(1:n_seqFISH, 40, replace=FALSE)
boxplot(t(seqFISH)[,random_seqFISH], names = random_seqFISH,
main = "gene expression boxplot of 40 random seqFISH samples",
xlab = "sample ID (seqFISH)",cex.main = 1)
Figure 1. Gene expression distribution of each sample in scRNA data and seqFISH data.
The boxplots show that the the distributions of scRNA data have some sample-wise variation, while seqFISH samples have more consistent distributions.
Some other analyses in the hackalthon did quantile normalisation for paired genes to try to eliminate the differences across datasets. However, we proposed to do quantile normalisation on samples because although different samples have different differentially expressed (DE) genes, the majority of the genes should express at the same level. That means whichever genes are DE, the mean and quantile of each sample should be of the similar level.
Therefore, We unify the quantile normalisation of samples between the datasets to maximumly eliminate the differences between the datasets. The normalisation is done by normalize.quantiles
function from preprocessCore
package.
## QUANTILE normalisation on all the samples
all_data = rbind(scRNA,seqFISH)
qt_all_data = t(normalize.quantiles(t(all_data)))
colnames(qt_all_data) <- colnames(scRNA)
# get the quantile normalised scRNA and seqFISH data
qt_scRNA = data.frame(qt_all_data[1:dim(scRNA)[1],])
qt_seqFISH = data.frame(qt_all_data[(1+dim(scRNA)[1]):(dim(scRNA)[1]+dim(seqFISH)[1]),])
Then check 40 random samples from each dataset again.
Figure 2. Gene expression distribution of 40 random samples in scRNA data and seqFISH data after quantile normalisation.
qt_scRNA
is quantile normalised scRNA data and qt_seqFISH
is quantile normalised seqFISH data.
Let’s check the number of samples in each class.
table(scRNA_label)
## scRNA_label
## Astrocyte Endothelial Cell GABA-ergic Neuron
## 43 29 761
## Glutamatergic Neuron Microglia Oligodendrocyte.1
## 812 22 19
## Oligodendrocyte.2 Oligodendrocyte.3
## 6 31
The cells of each class are heavily imbalanced in the training data. There is not enough training data for the minor group and the major groups will dominate the classification. Then the model will not have enough power to classify minor groups. To solve the problem, I used two strategies. First, I combined the three subgroups of Oligodendroctyes into one group. This is because in the Tasic paper, some cells were not identified as “core” cells for any Oligo but identified as “intermediate” cells and these three types were identified as close to each other. Also, Oligo.2 has too few samples. The second strategy is to oversample the minor groups.
First we combine the three Oligo subgroups.
scRNA_relabel = scRNA_label
scRNA_relabel[c(which(scRNA_relabel == "Oligodendrocyte.1"),
which(scRNA_relabel == "Oligodendrocyte.2"),
which(scRNA_relabel == "Oligodendrocyte.3"))] = "Oligodendrocyte"
table(scRNA_relabel)
## scRNA_relabel
## Astrocyte Endothelial Cell GABA-ergic Neuron
## 43 29 761
## Glutamatergic Neuron Microglia Oligodendrocyte
## 812 22 56
qt_scRNA_Astro = qt_scRNA[which(scRNA_label == "Astrocyte"),]
qt_scRNA_Endo = qt_scRNA[which(scRNA_label == "Endothelial Cell"),]
qt_scRNA_GABA = qt_scRNA[which(scRNA_label == "GABA-ergic Neuron"),]
qt_scRNA_Glu = qt_scRNA[which(scRNA_label == "Glutamatergic Neuron"),]
qt_scRNA_Micro = qt_scRNA[which(scRNA_label == "Microglia"),]
qt_scRNA_O1 = qt_scRNA[which(scRNA_label == "Oligodendrocyte.1"),]
qt_scRNA_O2 = qt_scRNA[which(scRNA_label == "Oligodendrocyte.2"),]
qt_scRNA_O3 = qt_scRNA[which(scRNA_label == "Oligodendrocyte.3"),]
qt_scRNA_Oligo = rbind(qt_scRNA_O1, qt_scRNA_O2,qt_scRNA_O3)
Now, we oversample the minor groups.
com_qt_scRNA_label = c(rep("Astrocyte", 43*15), rep("Endothelial Cell",29*25),
rep("GABA-ergic Neuron",761), rep("Glutamatergic Neuron", 812),
rep("Microglia",22*30), rep("Oligodendrocyte", 56*12))
com_qt_scRNA_label = as.factor(com_qt_scRNA_label)
We end up with around 700 samples in each class. The total number of cells after oversampling is 4275.
length(com_qt_scRNA_label)
## [1] 4275
table(com_qt_scRNA_label)
## com_qt_scRNA_label
## Astrocyte Endothelial Cell GABA-ergic Neuron
## 645 725 761
## Glutamatergic Neuron Microglia Oligodendrocyte
## 812 660 672
com_qt_scRNA_label
is the new scRNA label vector. Correspondingly, the scRNA
data should also be resized. com_qt_scRNA
is the oversampled scRNA data.
com_qt_scRNA_Astro = qt_scRNA_Astro %>% slice(rep(1:n(), each = 15))
com_qt_scRNA_Endo = qt_scRNA_Endo %>% slice(rep(1:n(), each = 25))
com_qt_scRNA_GABA = qt_scRNA_GABA
com_qt_scRNA_Glu = qt_scRNA_Glu
com_qt_scRNA_Micro = qt_scRNA_Micro %>% slice(rep(1:n(), each = 30))
com_qt_scRNA_Oligo = qt_scRNA_Oligo %>% slice(rep(1:n(), each = 12))
com_qt_scRNA = rbind(com_qt_scRNA_Astro, com_qt_scRNA_Endo, com_qt_scRNA_GABA, com_qt_scRNA_Glu,
com_qt_scRNA_Micro, com_qt_scRNA_Oligo)
dim(com_qt_scRNA)
## [1] 4275 113
com_qt_scRNA
has 113 cells.
Sparse PLS discriminant analysis (sPLS-DA) is a supervised machine learning algorithm (Lê Cao et al. 2011). Its performance in public microarray and SNP data sets is similar to other algorithms and it’s very efficient. The function is available in mixOmics
package.
Now let’s train the sPLS-DA model on preprocessed scRNA data. First we tune the number of components.
# The cross validation is stochastic so need to set seed for reproducibility
set.seed(0610)
# build the model
com_qt_splsda = splsda(com_qt_scRNA, com_qt_scRNA_label, ncomp = 10)
# hyperparameter tuning
com_qt_perf.splsda <- perf(com_qt_splsda, validation = "Mfold",
folds = 5, nrepeat = 8)
plot(com_qt_perf.splsda, overlay = 'measure')
n_comp <- 6 # 6 is the elbow in the error rate plot
error_rate <- com_qt_perf.splsda$error.rate$BER[n_comp,"max.dist"]
The error rate curve for tuning the number of components shows that the elbow is at comp 6, and at comp 6 max distance has the lowest error rate 0.0517155. Note that after oversampling, the balanced error rate (BER) is similar to overall error rate. We still use BER to evaluate the performance since the numbers of each class are still different.
Then, we tune the number of genes of each compnent. We use max distance as the distance method. (You can skip the next chunk and directly tune the model with fewer genes on each component using the code in “gene selection” section.)
# tune keepX (the number of genes selected in each component)
set.seed(0610)
list.keepX <- c(1:10, seq(20, 100, 5))
com_qt_splsda.tune <- tune.splsda(com_qt_scRNA, com_qt_scRNA_label, ncomp = 6, validation = 'Mfold', folds = 5, dist = 'max.dist',progressBar = F , test.keepX = list.keepX, nrepeat = 8)
plot(com_qt_splsda.tune, optimal = TRUE, sd = TRUE)
# the number of genes selected in each component
com_qt_splsda.tune$choice.keepX
## comp1 comp2 comp3 comp4 comp5 comp6
## 5 1 65 1 75 10
keepX <- com_qt_splsda.tune$choice.keepX
# the training error rate of the tuned model
com_qt_splsda.tune$error.rate[27,n_comp]
## [1] 0.1037554
error_rate <- com_qt_splsda.tune$error.rate[27,n_comp]
The number of selected genes in each component is 5, 1, 65, 1, 75, 10. The BER (balanced error rate, max distance) is 0.1037554.
We can try using fewer genes for each component. Hence we restrict the number of genes of each component within 5.
# tune keepX
set.seed(0610)
list.keepX <- c(1:5)
select_genes.tune <- tune.splsda(com_qt_scRNA, com_qt_scRNA_label, ncomp = 6, validation = 'Mfold',folds = 5, dist = 'max.dist',progressBar = F , test.keepX = list.keepX, nrepeat = 8)
# the number of genes selected in each component
select_genes.tune$choice.keepX
## comp1 comp2 comp3 comp4 comp5 comp6
## 5 1 5 5 5 5
keepX <- select_genes.tune$choice.keepX
# the training error rate of the tuned model
select_genes.tune$error.rate[5,n_comp]
## [1] 0.1014857
error_rate <- select_genes.tune$error.rate[5,n_comp]
The tuned keepX is 5, 1, 5, 5, 5, 5. The BER (max distance) is 0.1014857, even lower than the model with more genes.
select_genes.tune$error.rate.class[,6]
## Astrocyte Endothelial Cell GABA-ergic Neuron
## 0.00000000 0.07413793 0.18971748
## Glutamatergic Neuron Microglia Oligodendrocyte
## 0.11961207 0.00000000 0.22544643
The error rate (training set) of all the classes are low. The training error rate of Astro and Micro are 0.0. This may be explained by the presence of evidently representative marker genes in these two types.
# this is the final sPLS-DA model (using the tuned hyperparameters)
splsda <- mixOmics::splsda
final_sPLSDA.train <- splsda(com_qt_scRNA, com_qt_scRNA_label, ncomp = 6, keepX = c(5,1,5,5,5,5))
# predict seqFISH
final_sPLSDA.predict <- predict(final_sPLSDA.train, qt_seqFISH)
# Take a look at the number of cells predicted in each class
table(final_sPLSDA.predict$class$max.dist[, n_comp])
##
## Astrocyte Endothelial Cell GABA-ergic Neuron
## 114 21 539
## Glutamatergic Neuron Microglia Oligodendrocyte
## 797 55 71
The predictions of the cell types with probability higher than 50% are considered as confident classifications. Hence we keep the cells that have prediction probability over 50%.
pred_prob = final_sPLSDA.predict[["predict"]][,,n_comp]
low_prob = which(rowMaxs(pred_prob)<0.5)
# the seqFISH cells with high probabilities (predictions considered "ground truth")
seqFISH_data_predicted = qt_seqFISH[-low_prob,]
# the labels of the seqFISH cells with high probabilities
seqFISH_label_predicted = final_sPLSDA.predict$class$max.dist[, n_comp][-low_prob]
table(seqFISH_label_predicted)
## seqFISH_label_predicted
## Astrocyte Endothelial Cell GABA-ergic Neuron
## 84 7 376
## Glutamatergic Neuron Microglia Oligodendrocyte
## 663 29 35
length(low_prob)
## [1] 403
n_low_prob <- length(low_prob)
403 cells are with low prediction prob (< 0.5).
We compare the sPLSDA predictions with Zhu paper.
# my prediction
predictions = final_sPLSDA.predict$class$max.dist[, n_comp]
# Zhu's prediction
seqFISH_relabel = seqFISH_label
seqFISH_relabel[c(which(seqFISH_relabel == "Oligodendrocyte.1"),
which(seqFISH_relabel == "Oligodendrocyte.2"),
which(seqFISH_relabel == "Oligodendrocyte.3"))] = "Oligodendrocyte"
# the percentage of cells that are the same with Zhu's prediction
sum(seqFISH_relabel == as.character(predictions))/length(predictions)
## [1] 0.7069505
converge <- sum(seqFISH_relabel == as.character(predictions))/length(predictions)
71% of the final predictions of splsda converge with Zhu paper.
The above model used max distance when tuning the hyperparameters. We also tuned the model with other distances. The code for tuning is shown below.
#centroids.tune <- tune.splsda(com_qt_scRNA, com_qt_scRNA_label, ncomp = 6, validation = 'Mfold',
# folds = 5, dist = 'centroids.dist',progressBar = F,
# test.keepX = list.keepX, nrepeat = 8)
#mahalanobis.tune <- tune.splsda(com_qt_scRNA, com_qt_scRNA_label, ncomp = 6, validation = 'Mfold',
# folds = 5, dist = 'mahalanobis.dist',progressBar = F ,
# test.keepX = list.keepX, nrepeat = 8)
#centroids.tune$choice.keepX
#centroids.tune$error.rate[5,6]
#mahalanobis.tune$choice.keepX
#mahalanobis.tune$error.rate[5,6]
You can use these two models to predict the seqFISH data simply by modifying the code from line 294. The results will show that the convergence with Zhu’s paper of the two models is the same to the model with max distance.
Now we calculate the total number of genes selected in the sPLS-DA model.
# calculate the number of selected features
features = c()
for (i in 1:n_comp){
new_features = rownames(plotLoadings(final_sPLSDA.train, plot = F,
contrib = "max", method = "median",comp = i))
features = c(features,new_features)
}
features = unique(features)
length(features)
## [1] 24
With only 24 genes the model is able to accurately classify the training data and make the predictions converge with Zhu’s paper up to 0.7069505.
The following figures are the sample plots of the sPLS-DA model. The plots are also overlaid with the predicted cell types of seqFISH data.
##### sampleplots ####
# extract prediction labels
label.predict <-final_sPLSDA.predict$class$max.dist[, 'comp6']
label.predict <- factor(label.predict, levels = levels(as.factor(com_qt_scRNA_label)))
summary(label.predict)
## Astrocyte Endothelial Cell GABA-ergic Neuron
## 114 21 539
## Glutamatergic Neuron Microglia Oligodendrocyte
## 797 55 71
col.predict <- color.mixo(as.numeric(label.predict))
# comp 1&2 with legend
plotIndiv(final_sPLSDA.train, ellipse = TRUE,comp = c(1,2),legend = TRUE, legend.title = 'Cell type',
ind.names = FALSE, style = "graphics", pch = 16, ellipse.level = 0.90,
X.label = 'sPLS-DA comp1', Y.label = 'sPLS-DA comp2')
points(final_sPLSDA.predict$variates[,1], final_sPLSDA.predict$variates[,2], pch = 2,
cex = 0.5, col = col.predict)
legend("topleft",legend = c("scRNA training","seqFISH prediction"), pch = c(16, 2))
# comp 1&3 with legend
plotIndiv(final_sPLSDA.train, ellipse = TRUE,comp = c(3,4),legend = TRUE, legend.title = 'Cell type',
ind.names = FALSE, style = "graphics", pch = 16, ellipse.level = 0.90,
X.label = 'sPLS-DA comp1', Y.label = 'sPLS-DA comp3')
points(final_sPLSDA.predict$variates[,1], final_sPLSDA.predict$variates[,3], pch = 2,
cex = 0.5, col = col.predict)
legend("topleft",legend = c("scRNA training","seqFISH prediction"), pch = c(16, 2))
The sample plots show that on the first 3 components, the sPLS-DA model can discriminate Astrocyte, Microglia and Endothelial cell types in scRNA data (circles) well. Overlaying seqFish data (triangels) shows that the predictions of cell types for seqFISH by sPLS-DA model are also in agreement with the cell types from scRNA data.
The codes for sample plots of other components can be adapted from the codes for the first three components.
Let’s plot the loading factors on each component. It shows how much the genes contribute to the class discrimination of the training data on each component.For example in the following plots, the contribution on Comp 1 shows that gene gja1 is over expressed in Astrocyte class according to its median value compared to the other cell types.
## loading factors for the trained splsda model
par(mfrow=c(1,2))
plotLoadings(final_sPLSDA.train, comp = 1, method = 'median', contrib = 'max',cex.main = 2)
plotLoadings(final_sPLSDA.train, comp = 2, method = 'median', contrib = 'max',legend = FALSE, title = "comp2",cex.main = 3)
plotLoadings(final_sPLSDA.train, comp = 3, method = 'median', contrib = 'max',legend = FALSE, title = "comp3",cex.main = 3)
plotLoadings(final_sPLSDA.train, comp = 4, method = 'median', contrib = 'max',legend = FALSE, title = "comp4",cex.main = 3)
plotLoadings(final_sPLSDA.train, comp = 5, method = 'median', contrib = 'max',legend = FALSE, title = "comp5",cex.main = 3)
plotLoadings(final_sPLSDA.train, comp = 6, method = 'median', contrib = 'max',legend = FALSE, title = "comp6",cex.main = 3)
par(mfrow = c(1,1))
The following table lists the most indictive genes for each cell type. The table also shows which genes are selected on which sPLS-DA component..
Regarding the list of genes selected, several were found to be associated with specific cell types: Gja1 was reported to regulate astrocytic migration and proliferation (Homkajorn et al. 2010) and was identified as a marker gene of astrocytes in the original study of Zhu et al. (2018) ; Cldn5 encodes a protein that forms tight junction strands, as a physical barrier to prevent solutes and water from passing through endothelial cell sheets (Barmeyer et al. 2017); Omg (oligodendrocyte myelin glycoprotein) is present in Oligodendrocytes; Laptm5 was chosen in Bonham et al. 2019 to illustrate the expression profiles for the microglial gene module and Tbr1 was found to be a common genetic determinant for the differentiation of early-born glutamatergic neocortical neurons (Hevner et al. 2001).
In general, the genes found with large loading factors by the sPLS-DA model are backed up with previous studies. The classifier has a biological meaning.
The PCA sample plot shows that there is batch effect between training and test datasets. To account for the batch effect, we will use a semi-supervised approach to borrow the information from seqFISH data. MINT algorithm is able to consider the batch effect between datasets. The predictions with high probabilities by sPLS-DA model will be considered as accurate labels here. I will train a MINT model on the combination of the training data and the predictions of seqFISH with high probabilities and predict the seqFISH data that have low probabilities.
# pca sample plot for seqFISH and scRNA
X <- rbind(qt_scRNA, qt_seqFISH)
group <- c(rep("scRNA",n_scRNA),rep("seqFISH",n_seqFISH))
pca <- pca(X, ncomp = 2)
plotIndiv(pca, group = group, legend = T, ellipse = T)
We combine the “accurate” seqFISH data and scRNA data. The combined data still need oversampling and undersampling due to the imbalance of classes.
## [1] 2917 113
## [1] 2917
## [1] 2917
## Y
## Astrocyte Endothelial Cell GABA-ergic Neuron
## 127 36 1137
## Glutamatergic Neuron Microglia Oligodendrocyte
## 1475 51 91
Now we oversample the minor groups and undersample the major groups. The number of samples in each groups will be around 500.
## [1] 2998 113
## Y
## Astrocyte Endothelial Cell GABA-ergic Neuron
## 508 576 500
## Glutamatergic Neuron Microglia Oligodendrocyte
## 500 459 455
## [1] 2998
We train a MINT.sPLSDA model on the combined data.
set.seed(0610)
mint = mint.splsda(X, Y, study = study, ncomp = 10)
# tune the number of component
perf.mint.splsda <- perf(mint, validation = "Mfold", folds = 5,progressBar = FALSE, nrepeat = 8)
n_comp = 6
plot(perf.mint.splsda, overlay = 'measure')
The error rate curve for tuning the number of components shows that 6 component is enough to reach a low error rate. Now we use the number of genes selected by each component. We restrict the number of each component no more than 5.
# tune keepX
list.keepX <- c(1:5)
set.seed(0610)
mint.splsda.tune <- tune.mint.splsda(X, Y, ncomp = 6, dist = 'max.dist', study = as.factor(study), measure = "BER", test.keepX = list.keepX)
plot(mint.splsda.tune, optimal = TRUE, sd = TRUE)
# the chosen keepX after tuning
mint.splsda.tune$choice.keepX
## comp1 comp2 comp3 comp4 comp5 comp6
## 5 1 1 2 5 3
# the
mint.splsda.tune$error.rate[5,n_comp]
## [1] 0.1565505
Here is the final model after tuning.
## train mint
mint.train <- mint.splsda(X, Y, ncomp = n_comp, study = study,
keepX=c(5,1,4,2,5,3))
We plot the loading factors on each component.
par(mfrow=c(1,2))
plotLoadings(mint.train, comp = 1, method = 'median', contrib = 'max',title = "contribution on comp1")
plotLoadings(mint.train, comp = 2, method = 'median', contrib = 'max' ,legend = FALSE, title = "comp2",cex.main = 3)
plotLoadings(mint.train, comp = 3, method = 'median', contrib = 'max' ,legend = FALSE, title = "comp3",cex.main = 3)
plotLoadings(mint.train, comp = 4, method = 'median', contrib = 'max' ,legend = FALSE, title = "comp4",cex.main = 3)
plotLoadings(mint.train, comp = 5, method = 'median', contrib = 'max' ,legend = FALSE, title = "comp5",cex.main = 3)
plotLoadings(mint.train, comp = 6, method = 'median', contrib = 'max' ,legend = FALSE, title = "comp6",cex.main = 3)
par(mfrow = c(1,1))
The most predictive genes significantly overlap with the genes identified by sPLS-DA model. It means that the gene expression has a larger effect on the prediction rather than the batch effect. It also means that the MINT classifier also has biological meaning. Note that for GABA, there are no significantly high loading-factor genes.
plotIndiv(mint.train, ellipse = TRUE,comp = c(2,3),legend = TRUE, title ="sample plot by mint", cex.main = 3)
#predict the rest of seqFISH
mint.predict <- predict(mint.train, seqFISH_data_tobe_predicted,
study.test = rep(2, dim(seqFISH_data_tobe_predicted)[1]))
seqFISH_the_rest = cbind(seqFISH_data_tobe_predicted, mint.predict$class$max.dist[, 6])
colnames(seqFISH_the_rest)[114] = "type"
seqFISH_predicted = cbind(seqFISH_data_predicted,seqFISH_label_predicted)
colnames(seqFISH_predicted)[114] = "type"
final_prediction = rbind(seqFISH_the_rest, seqFISH_predicted)
final_prediction = final_prediction[ order(as.numeric(row.names(final_prediction))), ]
final_labels = final_prediction$type
The following figure shows the mean distribution of each cell type. The predicted cell types show distinct distribution patterns. Note that the mean distribution of GABA cells do not have DE genes. This is aligned with the observation that the predictive genes for GABA do not have very high loading coefficients.
#plot the mean distribution of each class
distribution = rep(0, n_genes+1)
for (i in levels(as.factor(final_prediction$type))){
mean = colMeans(final_prediction[final_prediction$type==i,1:n_genes])
distribution = rbind(distribution ,c(mean,i))
}
distribution = distribution[-1,]
for (i in 1:n_comp){
plot(distribution[i,-1],ylim = c(30,70), main = paste("gene distribution of",
distribution[i, n_genes+1]),
xlab = "gene_id", ylab = "normalised gene count")
}
seqFISH_relabel = seqFISH_label
seqFISH_relabel[c(which(seqFISH_relabel == "Oligodendrocyte.1"),
which(seqFISH_relabel == "Oligodendrocyte.2"),
which(seqFISH_relabel == "Oligodendrocyte.3"))] = "Oligodendrocyte"
sum(seqFISH_relabel == as.character(final_labels))/length(final_labels)
## [1] 0.6537257
converge_mint <- sum(seqFISH_relabel == as.character(final_labels))/length(final_labels)
65% of the predictions converge with Zhu et al. prediction. It is lower than the prediction made by pure sPLS-DA model. Note that Zhu’s prediction is not benchmarked. Here is not used for evaluation. For studies with no certain answers, we should compare results from different methods.
The provided datasets have been well preprocessed. We further did quantile normalisation across all samples to try to eliminate batch effect before the supervised stage of learning. We also oversampled the data since the classes were imbalanced. The learning algorithm was divided into two analysis stages. First was to learn a classifier on scRNA data using sPLS-DA model and predict seqFISH data using the tuned sPLS-DA
model. The second stage was to select cells with high probability of prediction as the new training data and combine them with scRNA data. The combined data were used to train a mint.splsda
model to predict the rest of seqFISH data. Both models were cross validated (M-fold).
The selection of features were done by restricting the number of genes selected on each component. The final sPLS-Da model selected 24 genes to predict the cell types of the cells from the scRNA-seq data. The genes that contribute most to each classes were summarised and many of them were already referenced in the literature. The predictive top genes selected by MINT algorithms were in agreement with the top genes selected with sPLS-DA. It means that the batch effect does not influence much of the model training. The following table summarises the final prediction of seqFISH data. Finally, we visualised the mean gene distribution of the samples in each class. Each of the class presents a very distinct distribution pattern.
Around 71% of the sPLS-DA predictions converged with Zhu’s predictions. The semi-supervised approach by MINT had a lower convergence with Zhu’s paper (65%). However, these values are based on the assumption that Zhu’s prediction are considered as ‘ground truth’.
Here is the final prediction.
table(final_labels)
## final_labels
## Astrocyte Endothelial Cell GABA-ergic Neuron
## 176 30 449
## Glutamatergic Neuron Microglia Oligodendrocyte
## 770 103 69
Barmeyer, C., Erko, I., Awad, K., Fromm, A., Bojarski, C., Meissner, S., Loddenkemper, C., Kerick, M., Siegmund, B., Fromm, M., Schweiger, M. R., & Schulzke, J. D. (2017). Epithelial barrier dysfunction in lymphocytic colitis through cytokine-dependent internalization of claudin-5 and -8. Journal of gastroenterology, 52(10), 1090–1100. https://doi.org/10.1007/s00535-017-1309-2
Bonham, L. W., Sirkis, D. W., & Yokoyama, J. S. (2019). The Transcriptional Landscape of Microglial Genes in Aging and Neurodegenerative Disease. Frontiers in immunology, 10, 1170. https://doi.org/10.3389/fimmu.2019.01170
Lê Cao, K. A., Boitard, S., & Besse, P. (2011). Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC bioinformatics, 12, 253. https://doi.org/10.1186/1471-2105-12-253
Hevner, R. F., Shi, L., Justice, N., Hsueh, Y., Sheng, M., Smiga, S., Bulfone, A., Goffinet, A. M., Campagnoni, A. T., & Rubenstein, J. L. (2001). Tbr1 regulates differentiation of the preplate and layer 6. Neuron, 29(2), 353–366. https://doi.org/10.1016/s0896-6273(01)00211-2
Homkajorn, B., Sims, N. R., & Muyderman, H. (2010). Connexin 43 regulates astrocytic migration and proliferation in response to injury. Neuroscience letters, 486(3), 197–201. https://doi.org/10.1016/j.neulet.2010.09.051
Rohart F., Gautier B., Singh A., and Le Cao K-A. (2017). mixOmics: An R package for ’omics feature selection and multiple data integration. PLoS computational biology 13(11):e1005752
Tasic, B., Menon, V., Nguyen, T. N., Kim, T. K., Jarsky, T., Yao, Z., Levi, B., Gray, L. T., Sorensen, S. A., Dolbeare, T., Bertagnolli, D., Goldy, J., Shapovalova, N., Parry, S., Lee, C., Smith, K., Bernard, A., Madisen, L., Sunkin, S. M., Hawrylycz, M., … Zeng, H. (2016). Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nature neuroscience, 19(2), 335–346. https://doi.org/10.1038/nn.4216
Zhu, Q., Shah, S., Dries, R., Cai, L., & Yuan, G. C. (2018). Identification of spatially associated subpopulations by combining scRNAseq and sequential fluorescence in situ hybridization data. Nature biotechnology, 10.1038/nbt.4260. Advance online publication.
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.0.6 preprocessCore_1.52.1 mixOmics_6.14.1
## [4] ggplot2_3.3.3 lattice_0.20-41 MASS_7.3-53.1
## [7] matrixStats_0.58.0
##
## loaded via a namespace (and not attached):
## [1] ggrepel_0.9.1 Rcpp_1.0.6 tidyr_1.1.3
## [4] corpcor_1.6.9 rprojroot_2.0.2 digest_0.6.27
## [7] utf8_1.2.1 RSpectra_0.16-0 R6_2.5.0
## [10] plyr_1.8.6 ellipse_0.4.2 evaluate_0.14
## [13] highr_0.9 pillar_1.6.0 rlang_0.4.11
## [16] Matrix_1.3-2 rmarkdown_2.8 pkgdown_1.6.1
## [19] labeling_0.4.2 textshaping_0.3.3 desc_1.3.0
## [22] rARPACK_0.11-0 BiocParallel_1.24.1 stringr_1.4.0
## [25] igraph_1.2.6 munsell_0.5.0 compiler_4.0.5
## [28] xfun_0.22 pkgconfig_2.0.3 systemfonts_1.0.1
## [31] htmltools_0.5.1.1 tidyselect_1.1.1 tibble_3.1.1
## [34] gridExtra_2.3 fansi_0.4.2 crayon_1.4.1
## [37] withr_2.4.2 grid_4.0.5 gtable_0.3.0
## [40] lifecycle_1.0.0 magrittr_2.0.1 scales_1.1.1
## [43] stringi_1.5.3 cachem_1.0.4 farver_2.1.0
## [46] reshape2_1.4.4 fs_1.5.0 ellipsis_0.3.2
## [49] ragg_1.1.2 generics_0.1.0 vctrs_0.3.8
## [52] RColorBrewer_1.1-2 tools_4.0.5 glue_1.4.2
## [55] purrr_0.3.4 parallel_4.0.5 fastmap_1.1.0
## [58] yaml_2.2.1 colorspace_2.0-1 memoise_2.0.0
## [61] knitr_1.33