spatial Transcription region deg genemodule heatmap GO analysis

After the spatial transcriptome is annotated manually, the regional annotation information of the spatial group is obtained. At this time, we want to obtain the functional annotation information of its differentially expressed genes. There is no universal script here. I only provide an idea. ,For reference.

Since most of our spatial group data is in the h5ad file format of anndata, we need to convert it into the rds file format that can be read by seurat. Here we use a convenient tool that we talked about in our previous blog Interaction between rds and h5ad Conversion, the approximate code is as follows:

sceasy::convertFormat('/data/*.h5ad', 
                      from="anndata", 
                      to="seurat", 
                      main_layer = 'count',
                      outFile='/data/*.rds')

So we got the corresponding rds file, and then we used the findallmarkers function in seurat to screen differentially expressed genes and remove some genes we don’t need, such as mitochondrial genes, ribosomal protein genes, hemoglobin genes, and cytoplasmic genes. , collagen genes, etc. Here we can filter through regular expressions and sort avglog2fc to filter out the top 50 differentially expressed genes of each group, and then calculate them with genemodule. What needs to be noted here is that when I was calculating deg, I did not normalize the data of the spatial group, because it is bin50 data. Of course, it is not impossible to normalize it. You can decide whether you want to do this. Carry out normalization; in addition, when calculating genemodule, I chose to select the raw matrix and the matrix after normalization to calculate genemodule. You can decide whether to normalize based on the results. Although addgenemodule’s The documentation states that the input matrix is a normalized matrix.

There are also some strange-looking codes in the script. The matrix input to calculate deg and the matrix used to calculate genemodule can be different. The reason is that we may capture certain positions on the space group chip differently from other places, and we need to remove them. At this time, we can select the matrix with certain areas removed to calculate deg, and then use the normal chip to calculate genemodule.

# Rscript your_script.R path_to_rds_file.rds refined_pred_column_name path_to_output_csv_file.csv

# Read command line parameters
args <- commandArgs(trailingOnly = TRUE)

# Check that the correct number of parameters is provided
if (length(args) != 4) {
  stop("Three parameters need to be provided：rds_deg_path, rds_module_path, refined_pred, def_save_path")
}

# Parse command line parameters
rds_deg_path <- args[1]
rds_module_path <- args[2]
cluster <- args[3]
save_path <- args[4]

# Import required libraries
library(Seurat)
library(tidyverse)

# Read  data
rds <- readRDS(rds_deg_path)

# Set the cell region label of the rds object
Idents(rds) <- rds@meta.data[[cluster]]

# Find differentially expressed genes
df <- FindAllMarkers(rds, only.pos = TRUE)

# Remove genes starting with MT|HB|RP|AC|COL
#MT Mitochondria
# RP ribosomal protein
# HB hemoglobin
# AC Cytoplasm
# COL COLLAGEN
deg_save_path = paste(save_path,'deg/deg.csv', sep = '')
write_csv(df, deg_save_path)

filtered_df <- df[!grepl("^(MT|HB|RP|AC|COL)", df$gene), ]

# Create a list indexed by cluster, each cluster corresponds to a sub-data frame
cluster_list <- split(filtered_df, filtered_df$cluster)
sort_and_select <- function(df) {
  sorted_df <- df[order(df$avg_log2FC, decreasing = TRUE), ]
  return(sorted_df)
}

# Apply sorting and filtering functions to each subdataframe
sorted_dfs <- lapply(cluster_list, sort_and_select)

# Merge all sorted sub-data frames into a new data frame
filtered_df <- do.call(rbind, sorted_dfs)

deg_save_path = paste(save_path,'deg/filtered_deg.csv', sep = '')
write_csv(filtered_df, deg_save_path)

# Create a list indexed by cluster, each cluster corresponds to a sub-data frame
cluster_list <- split(filtered_df, filtered_df$cluster)

# Define a function that sorts a subdataframe by the avglog2fc column and filters the first 50 rows
sort_and_select <- function(df) {
  sorted_df <- df[order(df$avg_log2FC, decreasing = TRUE), ]
  if (nrow(sorted_df) >= 50) {
    return(sorted_df[1:50, ])
  } else {
    return(sorted_df)
  }
}

# Apply sorting and filtering functions to each subdataframe
sorted_dfs <- lapply(cluster_list, sort_and_select)

# Merge all sorted sub-data frames into a new data frame
new_dataframe <- do.call(rbind, sorted_dfs)

# output final result
# print(new_dataframe)
deg_save_path = paste(save_path,'deg/filtered_top50_deg.csv', sep = '')
write_csv(new_dataframe, deg_save_path)

# Save unique values in cluster column into unique_cluster
unique_cluster <- unique(new_dataframe$cluster)
# Loop over each unique cluster value
for (cluster_value in unique_cluster) {
    print(cluster_value)
    subset_df <- subset(filtered_df, cluster == cluster_value)
    gene_list <- subset_df$gene
    gene_list = list(gene_list)
    gene_modules = paste(cluster_value ,'_module', sep = '')
    rds <- readRDS(rds_module_path)
    rds = AddModuleScore(rds, features=gene_list,name='gene_modules', nbin = 12)
    gene_module_path = paste(save_path, 'genemodules/csv/',cluster_value, '_raw_genemodule.csv', sep = "")
    write.csv(rds@meta.data, gene_module_path)

    rds = NormalizeData(rds)
    rds = AddModuleScore(rds, features=gene_list,name='gene_modules', nbin = 12)
    
    gene_module_path = paste(save_path, 'genemodules/csv/',cluster_value, '_lognor_genemodule.csv', sep = "")
    
    write.csv(rds@meta.data, gene_module_path)
}

After calculating the information of genemodule, we need to draw it on the spatial group to see whether these calculated deg are enriched in the areas we divided. Scanpy is used here. The reason for using scanpy is the spatial drawing function of scanpy. More versatile and convenient.

# python your_script.py --adata_file_path path_to_adata_file --csv_file_path path_to_csv_file --gene_modules module_name

import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
import os
import argparse
import scanpy as sc

# Create parameter parser
parser = argparse.ArgumentParser()
# Add command line options
parser.add_argument('--adata_file_path', type=str, help='Path to the AnnData file')
parser.add_argument('--save_path', type=str, help='Name of the gene module column')

# Parse command line parameters
args = parser.parse_args()

# Assign parameter values to corresponding variables
adata_file_path = args.adata_file_path
save_path = args.save_path

adata = sc.read(adata_file_path)

csv_path_dir = os.listdir(save_path+'genemodules/csv')
print(csv_path_dir)

for i in csv_path_dir:
    if 'genemodule' in i:
        print(i)
        name = i.replace('.csv', '')
        gene_modules = 'gene_modules1' 

        csv = pd.read_csv(save_path +'genemodules/csv/'+ i)
        gene_modules = csv[gene_modules].tolist()

        adata.obs[name] = gene_modules
        sc.pl.spatial(adata, color = name, spot_size = 100, show = False, frameon = False)
        plt.savefig(save_path + 'genemodules/fig/'+i.replace('.csv', '.pdf'))

    else:
        continue

print('genemodule plot over!')

So far, deg has been calculated, genemodule has been calculated, and drawing has been completed. Next, draw the heatmap. Here we use Python for calculation and R language for drawing. We treat each area as a psudo bulk. For each gene, calculate its average expression in each psudo bulk, and then Save it as csv and then draw it using R language.

import pandas as pd
import anndata as ad
import os
import argparse
import scanpy as sc
parser = argparse.ArgumentParser()

parser.add_argument('--adata_file_path', type=str, help='Path to the AnnData file')
parser.add_argument('--save_path', type=str, help='Name of the gene module column')
parser.add_argument('--cluster', type=str, help='Name of the cluster column')

args = parser.parse_args()

adata_file_path = args.adata_file_path
save_path = args.save_path
cluster = args.cluster

listdir = os.listdir(save_path + 'deg/')

print(listdir)

for i in listdir:
    if 'filtered' in i:
        adata = ad.read(adata_file_path)
        name = i.replace('.csv', '')
        print(name)
        csv = pd.read_csv(save_path + 'deg/' + i)
        deg_df_cluster = list(set(csv['cluster']))
        all_genes_ = []
        for j in deg_df_cluster:
            temp = csv[csv['cluster'] ==j].sort_values('avg_log2FC', ascending = False)[:50].gene.tolist()
            all_genes_ += temp
        all_genes = []
        
        for j in all_genes_:
            if j in all_genes:
                continue
            else:
                all_genes.append(j)
        heatmap = pd.DataFrame(index = all_genes)
        for j in deg_df_cluster:
            
            adata_temp = adata[adata.obs[cluster] == j]
            adata_temp = adata_temp[:,all_genes].X.todense()
            adata_temp = pd.DataFrame(adata_temp,columns = all_genes)
            heatmap[j] = adata_temp.mean()
        heatmap.to_csv(save_path + 'heatmap/' + name +'_raw_heatmap.csv')
        
        sc.pp.normalize_total(adata, target_sum=1e4)
        sc.pp.log1p(adata)
        adata = adata[:,all_genes]
        sc.pp.scale(adata, max_value=10)
        
        heatmap = pd.DataFrame(index = all_genes)
        for j in deg_df_cluster:
            
            adata_temp = adata[adata.obs[cluster] == j]
            # print(adata_temp.X)
            adata_temp = adata_temp.X
            adata_temp = pd.DataFrame(adata_temp,columns = all_genes)
            heatmap[j] = adata_temp.mean()
        heatmap.to_csv(save_path + 'heatmap/' + name +'_lognor_heatmap.csv')
        
        
    else:
        continue

The plotting function looks like this:

# Rscript 1.R gene_list_value rds_path_value gene_modules_value gene_module_path_value gene_module_normalized_path_value

args <- commandArgs(trailingOnly = TRUE)

# Check that the correct number of parameters is provided
if (length(args) != 1) {
  stop("A parameter needs to be provided：save_path")
}

# Parse command line parameters
save_path = args[1]

library(pheatmap)
library(ggplot2)
library(viridis)
library(tidyverse)

raw = paste(save_path, 'heatmap/filtered_deg_raw_heatmap.csv', sep = '')

data <- read.csv(raw, encoding="UTF-8")
rownames(data) = data$X
data$X =NULL
mycolors = viridis(11)

options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,color = colorRampPalette(mycolors)(100),)

raw_save_path = paste(save_path, 'heatmap/filtered_deg_raw_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)


options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,scale = "row",color = colorRampPalette(mycolors)(100),)

raw_save_path = paste(save_path, 'heatmap/filtered_deg_raw_scale_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)


raw = paste(save_path, 'heatmap/filtered_deg_lognor_heatmap.csv', sep = '')

data <- read.csv(raw, encoding="UTF-8")
rownames(data) = data$X
data$X =NULL
mycolors = viridis(11)

options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,color = colorRampPalette(mycolors)(100),)

raw_save_path = paste(save_path, 'heatmap/filtered_deg_lognor_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)


options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,scale = "row",color = colorRampPalette(mycolors)(100),)

raw_save_path = paste(save_path, 'heatmap/filtered_deg_lognor_scale_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)


raw = paste(save_path, 'heatmap/filtered_top50_deg_raw_heatmap.csv', sep = '')

data <- read.csv(raw, encoding="UTF-8")
rownames(data) = data$X
data$X =NULL
mycolors = viridis(11)

options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,color = colorRampPalette(mycolors)(100),)

raw_save_path = paste(save_path, 'heatmap/filtered_top50_deg_raw_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)


options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,scale = "row",color = colorRampPalette(mycolors)(100),)

raw_save_path = paste(save_path, 'heatmap/filtered_top50_deg_raw_scale_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)


raw = paste(save_path, 'heatmap/filtered_top50_deg_lognor_heatmap.csv', sep = '')

data <- read.csv(raw, encoding="UTF-8")
rownames(data) = data$X
data$X =NULL
mycolors = viridis(11)

options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,color = colorRampPalette(mycolors)(100),)

raw_save_path = paste(save_path, 'heatmap/filtered_top50_deg_lognor_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)


options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,scale = "row",color = colorRampPalette(mycolors)(100),)

raw_save_path = paste(save_path, 'heatmap/filtered_top50_deg_lognor_scale_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)

The heatmap of the R function draws more than one graph, and multiple graphs. You can choose the one you need. The last step is to perform GO enrichment analysis. Here I use clusterProfiler to perform GO enrichment analysis. The code here is also relatively simple. You can refer to it yourself.

# Rscript your_script.R path_to_rds_file.rds refined_pred_column_name path_to_output_csv_file.csv

args <- commandArgs(trailingOnly = TRUE)

if (length(args) != 1) {
  stop("A parameter needs to be provided：save_path")
}

save_path <- args[1]

library("clusterProfiler")
#install.packages("BiocManager")
#BiocManager::install("clusterProfiler")
#BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)
library(DOSE)
library(enrichplot)
library("tidyverse")
#install.packages("tidyverse")
library(RColorBrewer)
library(tidyverse)


#### go
cal_go = function(genelist, fig_name, csv_name) {
  go = enrichGO(genelist, 
                OrgDb = org.Hs.eg.db,
                ont = 'ALL',
                pAdjustMethod = "BH",
                pvalueCutoff = 1,
                qvalueCutoff = 1,
                keyType = "SYMBOL")
  go.count = data.frame(go)
  go.count_sort = go.count[order(go.count$pvalue, decreasing = FALSE),]
  
  top10 = function(i) {
    k = head(i[order(i$pvalue), ], 10);
    return(k)
  }
  
  dp = rbind(top10(go.count[go.count['ONTOLOGY'] == "BP", ]),
             top10(go.count[go.count['ONTOLOGY'] == "CC", ]),
             top10(go.count[go.count['ONTOLOGY'] == "MF", ]))
  d1 = dp[which(dp[, 1] == "BP"), ]
  d2 = dp[which(dp[, 1] == "CC"), ]
  d3 = dp[which(dp[, 1] == "MF"), ]
  
  d_l = rbind(d1, d2, d3)
  
  mt = d_l[, c(1, 3, 10)]
  
  colnames(mt)<-c("Type","name","number")
  mt$name <- factor(mt$name, levels = rev(mt$name))
  
  ggplot(data=mt, aes(x=name,y=number,fill=Type))+ theme_bw()+
    geom_bar(stat = "identity", position=position_dodge())+coord_flip()+xlab("GO term") + ylab("Number of genes")+
    scale_fill_brewer(palette="Set2")+
    theme(panel.border = element_blank(),
          plot.title = element_text(face = "bold",size = 12,vjust =1),
          axis.title = element_text(face = "bold",size = 12, vjust = 0.3),
          axis.text = element_text(size = 8, vjust = 0.8), 
          axis.line = element_line(), 
          panel.grid = element_blank(),
          legend.position=c(0.8,0.75),
          legend.key.size = unit(4, "mm"))
  ggsave(fig_name, height = 5, width = 10)
  #write.csv(go.count_sort, csv_name, row.names = F, quote = F)
  go.count_sort %>% write_csv(csv_name)
}


dataframe_path = paste(save_path, 'deg/', 'filtered_deg.csv',sep = '')

dataframe = read_csv(dataframe_path, col_names = FALSE)
dataframe <- dataframe[-1, ]

unique_clusters  <- unique(dataframe$X6)

for (cluster_value  in unique_clusters ) {
    print(cluster_value)
    subset_df <- dataframe[dataframe$X6 == cluster_value, ]
    
    genelist = subset_df$X7
    genelist = genelist[!is.na(genelist)]
    genelist = genelist[-1]
    
    csv_result = paste(save_path, 'go/raw_results/', cluster_value,'.csv', sep = '')
    pdf_result = paste(save_path, 'go/raw_results/', cluster_value,'.pdf', sep = '')
    cal_go(genelist, pdf_result, csv_result)
}


dataframe_path = paste(save_path, 'deg/', 'filtered_top50_deg.csv',sep = '')

dataframe = read_csv(dataframe_path, col_names = FALSE)
dataframe <- dataframe[-1, ]

unique_cluster <- unique(dataframe$X6)

for (cluster_value in unique_cluster) {
    subset_df <- dataframe[dataframe$X6 == cluster_value, ]
    genelist = subset_df$X7
    genelist = genelist[!is.na(genelist)]
    genelist = genelist[-1]
    csv_result = paste(save_path, 'go/raw_results/', cluster_value,'_top50.csv', sep = '')
    pdf_result = paste(save_path, 'go/raw_results/', cluster_value,'_top50.pdf', sep = '')
    cal_go(genelist, pdf_result, csv_result)
}

This code can give a rough reference. Sometimes we only want BP-related data. We can also read the table generated above and redraw it. The code is as follows:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import os
import argparse

parser = argparse.ArgumentParser()

parser.add_argument('--save_path', type=str, help='Name of the gene module column')

args = parser.parse_args()

save_path = args.save_path

csv_file_dir_path = save_path + 'go/raw_results/'
save_dir_path = save_path + 'go/filtered_results/'
csv_file_dir = os.listdir(csv_file_dir_path)
csv_file = []
for i in csv_file_dir:
    if '.csv' in i:
        csv_file.append(i)

def go_bp_plot(csv_file_path, save_path):
    csv = pd.read_csv(csv_file_path)
    csv = csv[csv['ONTOLOGY']=='BP']
    # csv = csv.drop_duplicates(subset=['geneID'])
    csv.index = range(len(csv))


    plt.rcParams['font.size'] = 8  

    ax = csv[:10].plot.barh(y = 'Count', x = 'Description', width=0.92, color='#66c2a5')

    ax.set_xlabel('Number of genes', fontsize='12')
    ax.set_ylabel('GO term', fontsize='12')

    ax.set_yticklabels([]) # Remove y-axis tick labels

    # Add Description labels on top of each bar
    max_v = max(csv[:10]['Count'])
    for i, v in enumerate(csv[:10]['Count']):
        Description = csv['Description'][i]
        ax.text(0.025*max_v, i, Description, color='black', va='center')
    # csv['Description'][index]

    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # Set x-axis tick locator
    ax.xaxis.set_major_locator(ticker.MaxNLocator(integer=True))

    # Set x-axis tick formatter
    ax.xaxis.set_major_formatter(ticker.FormatStrFormatter('%d'))

    ax.legend().remove()
    plt.savefig(save_path, format='pdf')
    plt.close()

    
for i in csv_file:
    name = i.split('.')[0]
    go_bp_plot(csv_file_dir_path + name + '.csv', save_dir_path + name + '.pdf')

In general, we have implemented the calculation of deg of spatial group data, the calculation and drawing of genemodule, the calculation of the expression amount of each gene in each spatial annotation region and the drawing of heatmap, as well as the GO analysis of each region.

Finally, there is a simple sh script to run these 6 python or R scripts in a unified manner, as shown below:

# Save address of all files
save_path="/data/gw21_scp_0103/"
# plot genemodule heatmap 's h5ad path
h5ad_path="/data/*_0.01.h5ad"
# deg's rds path
rds_deg_path="/data/*_0103.rds"
# module rds path
rds_module_path="/data/*_0102.rds"
# Annotation information for brain regions
region='Ribbon'

folder=${save_path}""

if [ -d "$folder" ]; then
    echo "folder already exists!"
else
    mkdir "$folder"
fi

deg_folder=${save_path}"deg/"

if [ -d "$deg_folder" ]; then
    echo "deg_folder already exists!"
else
    mkdir "$deg_folder"
fi


genemodules_folder=${save_path}"genemodules/"

if [ -d "$genemodules_folder" ]; then
    echo "genemodules_folder already exists!"
else
    mkdir "$genemodules_folder"
fi

genemodules_folder=${save_path}"genemodules/csv/"

if [ -d "$genemodules_folder" ]; then
    echo "genemodules_folder already exists!"
else
    mkdir "$genemodules_folder"
fi

genemodules_folder=${save_path}"genemodules/fig/"

if [ -d "$genemodules_folder" ]; then
    echo "genemodules_folder already exists!"
else
    mkdir "$genemodules_folder"
fi

heatmap_folder=${save_path}"heatmap/"

if [ -d "$heatmap_folder" ]; then
    echo "genemodule_folder already exists!"
else
    mkdir "$heatmap_folder"
fi

go_folder=${save_path}"go/"

if [ -d "$go_folder" ]; then
    echo "go_folder already exists!"
else
    mkdir "$go_folder"
fi

go_folder=${save_path}"go/raw_results/"

if [ -d "$go_folder" ]; then
    echo "go_raw_results_folder already exists!"
else
    mkdir "$go_folder"
fi

go_folder=${save_path}"go/filtered_results/"

if [ -d "$go_folder" ]; then
    echo "go_filtered_results_folder already exists!"
else
    mkdir "$go_folder"
fi

# deg and genemodule calculations
Rscript /data/work/17.Script/addgenemodule/1.R ${rds_deg_path} ${rds_module_path} ${region} ${save_path}
# genemodule plot
python /data/work/17.Script/addgenemodule/2.py --adata_file_path ${h5ad_path} --save_path ${save_path}
# heatmap calculations
python /data/work/17.Script/addgenemodule/3.py --adata_file_path ${h5ad_path} --save_path ${save_path} --cluster ${region}
# heatmap plot
Rscript /data/work/17.Script/addgenemodule/4.R ${save_path}
# go calculations
Rscript /data/work/17.Script/addgenemodule/5.R ${save_path}
# go plot
python /data/work/17.Script/addgenemodule/6.py --save_path ${save_path}

that’s all.