甲基化特异性区域的计算鉴别_wangyunpeng_bio的博客-爱代码爱编程
多形性成胶质细胞瘤(GBM)甲基化区域的计算鉴别
目的:找出胶质细胞瘤特异性甲基化区域,为临床诊断提供理论依据
步骤:
1、 查找数据:下载TCGA中GBM的RNA-seq和甲基化数据
2、 甲基化数据分析,正常肿瘤对比,进行差异甲基化分析,找出肿瘤样本中高甲基化区域
3、 对RNA-seq数据进行分析,正常肿瘤对比,差异表达基因的筛选,找出肿瘤样本中低表达基因。
4、 结合甲基化和RNA-seq数据,将高甲基化和低表达基因取交集,这些基因很可能属于抑癌基因,与抑癌基因取交集,再结合promoter区域的CpG整合分析,寻找候选靶标。
5、 对找出的靶标进行验证,利用pubmed以及其他数据库,反向验证靶标的可靠性
整体流程
一、数据下载
首先进入TCGA下载数据GBM的RNA-seq和甲基化数据,从下表可见GBM共有172套RNA-seq数据以及437套DNA甲基化数据,由于TCGA提供Infinium HumanMethylation27 BeadChip和Infinium HumanMethylation450 BeadChip两种芯片平台的数据,为了避免后续不同芯片平台间数据合并的困难,仅下载HumanMethylation450的芯片数据,共计154套。
图表 1TCGA数据汇总
二、初步整理数据
使用TCGA-Assembler.2.0.5进行GBM数据批量下载与初步整理,并且绘制RNA-seq基因表达量盒型图以及甲基化芯片数据盒型图,由于数据量较大,此处不贴图。
三、整体可视化
首先对于甲基化数据,选取ID为TCGA.06.AABW.11A.31D.A368.05的数据,查看总体甲基化程度。由于每个位点真实情况只存在:甲基化/非甲基化两种,所以对全部位点甲基化程度进行统计,也应该是大部分位点处于“完全甲基化”(Methylation state=1)和“完全非甲基化”(Methylation state=0)两种状态,下图绘制了数据的频数柱状图,可以明显看出形状处于“两头高,中间低”,反向说明芯片数据质量较好。
图表 2单个样本CpG甲基化程度统计
接下来,对多个样本绘制CpG甲基化程度小提琴图,同一行是同一个病人,左边样本来源于Primary Solid Tumor,右边样本来源于Recurrent Solid Tumor,除了甲基化程度大部分分布于0和1附近外,还能看出来源于同一病人肿瘤的甲基化程度依旧会有略微差异。
TCGA barcode:https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode
图表 3小提琴图
同样的,对于RNA-seq数据也可以进行一些初步可视化,除了数据下载后绘制的盒型图,亦可以进行PCA初步查看数据分布,下图左为PCA陡坡图,反映了第一主成分、第二主成分…等等所拥有信息量的比例,下图右为使用PCA1和PCA2绘制的散点图,可以发现5个正常样本距离较近,从侧面反映数据可信度较好。
最后,对于RNA-seq表达谱数据,使用系统聚类方法,绘制树状图,可以发现5个正常样本距离也是很近,数据质量还行。
四、差异甲基化区域筛选
为了更加科学高效地筛选差异甲基化位点,参考bioconductor中甲基化芯片的分析流程,使用minfi包进行差异甲基化分析,得到差异甲基化位点。
http://www.bioconductor.org/help/workflows/methylationArrayAnalysis/
在检测的526733个CpG位点中,共有4927个CpG位点P值<0.01,且在肿瘤样本中保持着甲基化程度高于0.7,对应2054个基因。
五、差异表达基因筛选
由于数据源自RNA-seq,最主流的分析方法当然是基于负二项分布模型的DESeq2包。
先用MA-plot查看差异表达基因大致分布。意外的是,图形左侧有大概七条线状条纹,最初我怀疑这是sample之间有batch effect导致,需要用其他更好normalize的方法,后来用identify方法挨个找出每条线上的基因名及其对应的表达量,发现这些基因在172套样本中表达量几乎全为0,仅有一两个样本有一点点表达,这种数据的存在导致这些线状条纹的产生。
图表 4 MA-plot
然后, 选取p值最小的差异表达基因,绘制其在不同组间表达量,确实差异很显著。
图表 5表达量散点图
接着,绘制差异表达基因在不同组间的表达量热图,正常样本是图片最左边的五列,当然如果需要解释具体的生物学问题,需要将聚类出来的每一类,将差异表达基因进行GO以及KEGG注释,结合有关的生物学表型,探讨其分子机制及意义
图表 6表达量热图
最后,选取筛选条件为p值小于0.01且log2FoldChange<-2的差异表达基因,在肿瘤样本低表达的基因,共计1657个
六、抑癌基因的获取
在pubmed中查询研究人员整理的tumor suppressor genes,果然在Nucleic Acids Res发表过TSGene数据库,存储了抑癌基因的列表。https://bioinfo.uth.edu/TSGene/download.cgi
下载全部1217个人类抑癌基因的列表。
All the 1217 human tumor suppressor genes with basic annotations
图表 7 TSGene database
七、数据整合
对于甲基化数据中,肿瘤样本高甲基化CpG附近的基因,RNA-seq中肿瘤样本低表达的基因,以及TSGene数据库中下载的抑癌基因列表,三者做overlap,找出特异性的候选靶标,为后续分析做准备,下图为三者overlap的韦恩图。
图表 8数据整合韦恩图
共计找出12个候选靶标基因。
八、靶标筛选
之前筛选选择的单个CpG的差异甲基化,而实际临床检测应用时候,可能需要多个CpG作为对照,因此统计了12个候选靶标基因TSS前1.5kb内所有CpG的甲基化程度,然后绘制热图,可以明显发现,虽然当初用CpG的差异甲基化位点筛出来的基因都是肿瘤样本高甲基化的,可是统计TSS前1.5kb内所有CpG的甲基化程度,这些基因却有很多在所有样本中都是低甲基化状态,而看上去很靠谱的是NUAK1基因,其正常样本在TSS前1.5kb内低甲基化,肿瘤样本中对应区域高甲基化。
图表 9 methylation heatmap
NUAK1基因TSS前1.5kb内共检测了7个CpG,这7个CpG在154个样本中检测出来的甲基化程度如下图,可以明显看出来这7个CpG在Tumor组织中甲基化程度都相对高,而在Normal组织中甲基化程度相对较低。
图表 10 NUAK1的TSS区CpG甲基化程度
使用Cpgplot预测CpG island位置,这7个CpG在promoter5’到3’序列图上相对位置如下
1035 1094 1408 1413 1444 1448 1471
图表 11 CpG island预测
参数使用:
NUAK1promoter from 1 to 1500
Observed/Expected ratio > 0.40
Percent C + Percent G > 40.00
Length > 100
CpG island详细信息: Length 101 (1086..1186) Length 105 (1366..1470)
这七个CpG基本都在CpGisland中,具体序列见附录
九、靶标基因相关讨论
进入Gene数据库搜索NUAK1相关内容,可以发现基因全称NUAK family kinase 1,还是个激酶,激酶的话就对调控会有很大作用了,而在HPA RNA-seq normal tissues项目中,又看出来这个激酶在脑中表达量明显高于其他组织,这又与发生在脑部的GBM不谋而合。
图表 12 NUAK1相关讨论
十、分子机制探讨
对于肿瘤组织中高甲基化CpG附近的,并且在肿瘤样本中低表达的intersect共计274个基因,使用Gene Ontology进行富集分析,可以明显发现在GO biological process生物学过程中的“神经系统发育”、“化学性突触传递”和“细胞膜的组织”等部分里面有着富集,特别是“中枢神经系统的髓鞘形成”,富集程度达到26.95倍,这又与研究的多发生于脑补的GBM有着密切的联系,反向验证实验结果的正确性。
图表 13 GO富集分析
十一、FurtherMore
根据生物学知识可以得到,CpG的甲基化会调控基因的转录,因此,Transcript Start Site(TSS)附近的甲基化程度值得进行一番深入研究,选用人类基因组hg19版本,对23056基因共计46489个转录起始位点,进行转录起始位点富集甲基化程度统计。
统计TSS前后5000bp内CpG甲基化程度,并且使用曲线进行拟合,可以发现TSS处的CpG Methylation水平明显降低,这也与科学常识相吻合。
图表 14 TSS附近甲基化程度
附录
NUAK1promoter区CpG island用蓝色标注,检测的CpG用加粗横线下标标注。
>NUAK1promoter
TATGAAAGGAGAAGGGGGAGCTTTGGAACTGGACAGGTAGGGTTTAAATCCTGGTCCTGC
CATTTACAAACTGTGTAACCTCTGGGAAATTACTTAACCTTTCTGATACGGTTTCTTCAA
TTGAAAATAGGGATTGTAAACAGCTACTTTACAGAGGAGGGTTTACTGTCATAAAACAGT
ACCAGCCTATGGTAGATGATGCTGTTGATTCAATAGATACTGATGAAGTCCCACATATCT
GGGAGTAACACTATCAGCCAGAATAAGCCAGGTTATGCTGCTTTAACAAATCCCACTGAC
TTAACATAAATAAAGTTTAATATTTGCTGACACTACTTTTCCAATGCAGATTGGCTGGGA
TTTTCTGCCATCTTTACACAGAGACTCAGGCTGGCTTGGGGAGGCTCCAACTTTGTACCA
CCATGATTGTCAAGATAGGAAAAAGAAGACATGGGGATTTGCTCACTGGCTCCTAAAGGT
TCCAAGTGAAAGCAACACAATCACTTCTGCTCTCATTCCATTGGCCTAAGCAAGTCCGAT
GGTCTCATCTAACTTCAAGAGGATGGGGTAGTGGATTTCTACCACATCGCAGGGGAGGGG
AAGAAGTAGAAAATATTTTAAAATACTACTGTAGACACAATGTGTTTATCTCTACAATAG
ATCTGCTAAATCCATATCTTAAGAGAAAACCGAAGCTCTGAGAGGTTATGTCATTTGACC
AAGGCCACACAGCCAATCCTCTGGGAAGCCAGGACTCAACCTGCATGCCACTCATCCTGA
CTGTGGGATCTATAGGCACAACGTCCACAAACTGTATAGAAGCAGAACAGTTTCAGGATG
GGGGTGGGTGGCAGGGAGGCCCTGCAAATGCGGTGGACAGAATAGGAATTGAATAGGCAT
GATGCGCCTTTGTGTCTGTGTTCTGTCATTGCTCCTGGGGAAGGGAAGAAGGGGCAGGGA
AGTTTGTGTGAAATATCAGTAGAAGGAAAGTTTGGACAGGTAAGAATATCACGCGTCAGA
GTACAGCAGAGACACGTGTGGAGGATGAGGGCAGTGTTTCAGGCCATTACTCTGGCAGCA
GTGAGGAGGCTCCCGGGGAGGGTTGGGGAGAGCGGGCTGTTGCTGGAGTCCGGGGTAAGG
TGACCAGGGGTAGACAGGAATGAAGGCGGCGGGAATGTAGAGGAAGGGCTGCACCTGAGA
GCTGCAGAGGAGGCCCCAGTGAGGTCCACTGGTGGAAAAGAACCGCCATGCGATCTGCAG
AACACCAGACCTTCTTCAGCCTTCACTCTTCCCACCCTAGTCTGGTACATTGCACCACTT
GTTAAAAAAAAAAATTTCCCTAAGACCTCTTTTTCTCTAGCCTTCTTCCTTATTTTTCAT
GGTCTCTTCTTAGAACGGCGGCAGCCACGCCGCGGTGGGAGGCCCTTCCTGCCTGACCCT
TACCGTGCGGGGTACCGTTCCTGTCACCATCGCCAGGATCTGGCCCTTTCAGTGAAAGGA
diffFind.R
setwd("G:/AllShare/SkillTrainHomework/BasicDataProcessingResult")
load("GBM__methylation_450.rda")
setwd("G:/AllShare/SkillTrainHomework")
DATAFRAME<- data.frame(Data)
TCGASampleName<- colnames(DATAFRAME)
TCGASamplelist <- strsplit(TCGASampleName,split=".",fixed = TRUE)
TCGASampletpye <- c()
TCGASampleID <- c()
for(i in 1:length(TCGASamplelist)){
TCGASampletpye <- c(TCGASampletpye,TCGASamplelist[[i]][4])
TCGASampleID <- c(TCGASampleID,paste(TCGASamplelist[[i]][2],TCGASamplelist[[i]][3],sep = "."))
}
TCGASampletpyeResult <- c()
for(i in 1:length(TCGASampletpye)){
if(as.integer(substr(TCGASampletpye[i],1,2))<10){
TCGASampletpyeResult <- c(TCGASampletpyeResult,"Tumor")
}else{
TCGASampletpyeResult <- c(TCGASampletpyeResult,"Normal")
}
}
library(ggplot2)
#h<- DATAFRAME$TCGA.06.0125.01A.01D.A45W.05
h<- DATAFRAME$TCGA.06.AABW.11A.31D.A368.05
pdf("CpGmethlation.pdf")
ggplot(NULL,aes(x=h))+
geom_histogram(binwidth = 0.02)+
labs(x = "% methylation per base",title = "Histogram of % CpG methylation\nSampleID:TCGA.06.AABW.11A.31D.A368.05")
dev.off()
library(reshape2)
tmp <- Data[,1:8]
tmp <- data.frame(tmp)
colnames(tmp) <- TCGASampleID[1:8]
tmp2 <- stack(tmp)
tmp2 <- data.frame(tmp2)
colnames(tmp2) <- c("CpG methylation","SampleID")
pdf("MethlationViolin.pdf",width =12,height = 10 )
ggplot(tmp2,aes(x=SampleID,y=`CpG methylation`,fill=SampleID))+
geom_violin()+
facet_wrap(~SampleID,ncol=2)
dev.off()
library(minfi)
dmp <- dmpFinder(Data, pheno = TCGASampletpyeResult, type = "categorical")
save(dmp,file = "diff.Rdata")
load("diff.Rdata")
meanTumorData<- apply(Data[,TCGASampletpyeResult=="Tumor"],1,mean)
#&!is.na(meanTumorData)
# head(meanTumorData)
# head(dmp)
# head(signifidmp,10)
# summary(dmp)
# rownames(dmp)
dmp$genelistID <- as.integer(rownames(dmp))
o<- order(dmp[,"genelistID"])
dmp<- dmp[o,]
#筛选出p<0.01且无空值的CpG,并且正常样本甲基化程度<0.3,即筛选肿瘤中高甲基化的基因
signifidmp <- dmp[meanTumorData>0.7 & !is.na(meanTumorData) & dmp$pval<0.01
&!is.na(dmp$pval) & !is.na(dmp$qval) & !is.na(dmp$intercept),]
#signifidmp <- signifidmp[signifidmp$intercept<0.3,]dmp <-
signifiData<- Data[as.integer(rownames(signifidmp)),]
signifiDes<- Des[as.integer(rownames(signifidmp)),]
# Data[Des[,1]=="ch.3.2438620R",]
# meanTumorData[Des[,1]=="ch.3.2438620R"]>0.7
# mean(Data[103429,TCGASampletpyeResult=="Normal"])
# mean(Data[103429,TCGASampletpyeResult=="Tumor"])
# summary(Data[Des[,1]=="ch.3.2438620R",])
# apply((head(signifiData,10)[,TCGASampletpyeResult=="Tumor"]),1,mean)
signifiGene <- signifiDes[,2]
length(unique(signifiGene))
head(signifidmp)
#intercept正常样本甲基化程度
# Data[362831,118]
# Data[362831,48]
# Data[362831,]
#
# Data[310369,118]
# Data[310369,48]
# Data[310369,]
#以下是关于CpG的一些计算
#因为Des有部分空缺,取出非空部分,生成Desfull
Desfull<- Des[!is.na(Des[,3]),]
#同样取出Data里非空部分,计算mean
methtstat<- apply(Data[!is.na(Des[,3]),],1,mean,na.rm=TRUE)
#去掉计算mean值以后NA的部分对应的Desfull
Desfull<- Desfull[!is.na(methtstat),]
#再去掉methtstat的mean值中NA的部分
methtstatresult<- methtstat[!is.na(methtstat)]
grCpG <- GRanges(seqnames = paste("chr",Desfull[,3],sep = ""),
ranges = IRanges(start = as.integer(Desfull[,4]), width = 1))
grCpG$value <- methtstatresult
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
trans <- as.data.frame(transcripts(txdb_hg19))
trans<- trans[trans$seqnames %in% c("chrX","chrY",paste("chr",1:22,sep="")),]
grTSS <- GRanges(seqnames = trans$seqnames,
ranges = IRanges(start = trans$start-5000,end = trans$start+5000))
grCpG
grTSS
hitObj<- findOverlaps(grTSS,grCpG)
CpGRelativeSite <- c()
CpGRelativeMeth <- c()
for(i in 1:length(hitObj)){
#取出对应CpG真实位置
CpGsite <- grCpG[hitObj[i]@to]@ranges@start
#计算回TSS真实位置
TSSsite<- mean(grTSS[hitObj[i]@from]@ranges)
#计算CpG相对位置并且存储
CpGRelativeSite <- c(CpGRelativeSite,TSSsite - CpGsite)
#取出CpG平均甲基化程度
CpGRelativeMeth <- c(CpGRelativeMeth,grCpG[hitObj[i]@to]$value)
}
save.image(file = "diffresult.Rdata")
load("diffresult.Rdata")
CpGResult <- c()
for(i in -5000:5000){
CpGResult <- c(CpGResult,mean(CpGframe[CpGframe$CpGRelativeSite==i,"CpGRelativeMeth"]))
}
CpGframe <- data.frame(-5000:5000,CpGResult)
colnames(CpGframe) <- c("CpGRelativeSite","MethState")
ggplot(CpGframe, aes(x=CpGRelativeSite, y=MethState))+
#geom_point()+
geom_smooth()
ggsave(filename="TSS附近甲基化程度.pdf",width = 12,height=8)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
trans <- as.data.frame(transcripts(txdb_hg19))
grTSS <- GRanges(seqnames = trans$seqnames,
ranges = IRanges(start = trans$start-5000,end = trans$start+5000))
genesall<- genes(txdb_hg19)
SERPIND1<- genesall[genesall$gene_id=="3053"]
SERPIND1promoter<- promoters(SERPIND1,upstream = 1500,downstream = 0)
SERPIND1promoter
SERPIND1hit<- findOverlaps(SERPIND1promoter,grCpG)
SERPIND1hit@to
grCpG[SERPIND1hit@to]
#将Data里对应数据取出
Datafull<- Data[!is.na(Des[,3]),]
Datafull<- Datafull[!is.na(methtstat),]
NUAK1<- genesall[genesall$gene_id=="9891"]
NUAK1promoter<- promoters(NUAK1,upstream = 1500,downstream = 0)
NUAK1promoter
NUAK1hit<- findOverlaps(NUAK1promoter,grCpG)
NUAK1hit@to
grCpG[NUAK1hit@to]
#将Data里对应数据取出
Datafull<- Data[!is.na(Des[,3]),]
Datafull<- Datafull[!is.na(methtstat),]
apply(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Normal"],1,mean)
apply(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Tumor"],1,mean)
library(ggplot2)
normalNUAK1<- data.frame(stack(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Normal"]))
tumorNUAK1 <- data.frame(stack(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Tumor"]))
normalNUAK1$sampletype <- "Normal"
normalNUAK1$site <- 1:7
tumorNUAK1$sampletype <- "Tumor"
tumorNUAK1$site <- 1:7
frame1<- data.frame(normalNUAK1$NA..1,normalNUAK1$sampletype,normalNUAK1$site)
colnames(frame1) <- c("MethState","sampletype","site")
frame2<- data.frame(tumorNUAK1$NA..1,tumorNUAK1$sampletype,tumorNUAK1$site)
colnames(frame2) <- c("MethState","sampletype","site")
resultframe <- rbind(frame1,frame2)
ggplot(resultframe, aes(x=site, y=MethState,colour=sampletype)) +
scale_x_discrete(limits=1:7)+
geom_point(position="jitter")
ggsave(filename="TargetCpGNUAK1.pdf",width = 8,height=6)
apply(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Normal"],1,mean)
apply(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Tumor"],1,mean)
library(ggplot2)
normalSERPIND1<- data.frame(stack(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Normal"]))
tumorSERPIND1 <- data.frame(stack(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Tumor"]))
normalSERPIND1$sampletype <- "Normal"
normalSERPIND1$site <- 1:2
tumorSERPIND1$sampletype <- "Tumor"
tumorSERPIND1$site <- 1:2
frame1<- data.frame(normalSERPIND1$NA..1,normalSERPIND1$sampletype,normalSERPIND1$site)
colnames(frame1) <- c("MethState","sampletype","site")
frame2<- data.frame(tumorSERPIND1$NA..1,tumorSERPIND1$sampletype,tumorSERPIND1$site)
colnames(frame2) <- c("MethState","sampletype","site")
resultframe <- rbind(frame1,frame2)
ggplot(resultframe, aes(x=site, y=MethState,colour=sampletype)) +
scale_x_discrete(limits=1:2)+
geom_point(position="jitter")
ggsave(filename="TargetCpGSERPIND1.pdf",width = 8,height=6)
library("BSgenome.Hsapiens.UCSC.hg19")
library(seqinr)
genome <- BSgenome.Hsapiens.UCSC.hg19
grCpG[NUAK1hit@to]
promoterSeq<- getSeq(genome,NUAK1promoter)
write.fasta(promoterSeq$`9891`,"NUAK1promoter","promoterSeq.txt")
normalNUAK1$NA..1
tumorNUAK1$NA..1
dim(Datafull)
dim(Desfull)
# 尝试用mice包补缺失值,由于数据量太大而取消
# library(mice)
# library(VIM)
# #md.pattern(Data)
# tmp <- Data[,1:3]
# aggr_plot <- aggr(tmp, col = c('navyblue', 'red'), numbers=TRUE, sortVars=TRUE,
# labels=names(tmp),cex.axis=.7, gap=2,
# ylab=c("Histogram of missing data", "Pattern"))
# tempData <- mice(Data,m=1,maxit=50,meth='pmm',seed=500)
#
# 尝试T检验,由于有缺失值而取消
# Ttest<- t.test(Data[1,1:3],Data[1,c(48,118)])
#
# Data[1,118]
# Data[1,48]
# tmp<- Data[1,]
# myfun <-function(x){
# Ttest<- t.test(x[c(1:47,49:117,119:154)],x[c(48,118)])
# return(Ttest$p.value)
# }
# myfun(Data[1,])
# result<- apply(Data,1,myfun)
# group <- factor(TCGASampletpyeResult,levels=c("Normal","Tumor"))
# design <- model.matrix(~-1+group)
# design
# fit.reduced <- lmFit(Data,design)
# fit.reduced <- eBayes(fit.reduced)
# summary(decideTests(fit.reduced))
#
# 尝试使用missMethyl包,最后designMatrix设置有错误而差异区域识别错误,差异不明显
# top<-topTable(fit.reduced,coef=1)
# top
# cpgs <- as.integer(rownames(top))
# Data[cpgs[10],]
# Data[cpgs[10],48]
# Data[cpgs[10],118]
#
# par(mfrow=c(2,2))
# for(i in 1:4){
# stripchart(Data[rownames(Data)==cpgs[i],]~design[,4],method="jitter",
# group.names=c("Normal","Tumor"),pch=16,cex=1.5,col=c(4,2),ylab="Beta values",
# vertical=TRUE,cex.axis=1.5,cex.lab=1.5)
# title(cpgs[i],cex.main=1.5)
# }
Downloaded.R
setwd("G:/AllShare/SkillTrainHomework/TCGA-Assembler.2.0.5/TCGA-Assembler")
#' Load functions
source("Module_A.R")
source("Module_B.R")
setwd("G:/AllShare/SkillTrainHomework")
#' choose a cancer type
#' 可查看网址https://tcga-data.nci.nih.gov/docs/publications/tcga/
sCancer <- "GBM"
sPath1 <- "./DownloadedData"
sPath2 <- "./BasicDataProcessingResult"
sPath3 <- "./AdvancedDataProcessingResult"
#下载RNA-Seq数据
path_geneExp <- DownloadRNASeqData(cancerType = sCancer,
assayPlatform = "gene.normalized_RNAseq",
saveFolderName = sPath1)
#' Download DNA methylation 450 data
#' 使用Illumina的甲基化分析芯片测出来的甲基化数据
path_methylation_450 <- DownloadMethylationData(cancerType = sCancer,
assayPlatform = "methylation_450",
saveFolderName = sPath1)
# 下载出来的格式说明:
# TCGA-3C-AAAU-01A-11R-A41B-07
# 前三位TCGA-3C-AAAU表示病人ID
# 第四位01A表示肿瘤类型Tumor types range from 01 - 09,
# normal types from 10 - 19 and control samples from 20 - 29.
# 第五位Order of portion in a sequence of 100 - 120 mg sample portions
# 可查看网址https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode
#' Process gene expression data
#' 对RNA-seq下载结果进行处理,将基因名进行处理
list_geneExp <-
ProcessRNASeqData(inputFilePath = path_geneExp[1],
outputFileName = paste(sCancer,
"geneExp",
sep = "__"),
dataType = "geneExp",
outputFileFolder = sPath2)
#处理methylation数据
list_methylation_450 <-
ProcessMethylation450Data(inputFilePath =
path_methylation_450[1], outputFileName = paste(sCancer,
"methylation_450", sep = "__"), outputFileFolder = sPath2)
#'Perform advanced data processing using Module B functions
#'Advanced进一处理methylation数据
list_methylation_450_OverallAverage <-
CalculateSingleValueMethylationData(input = list_methylation_450,
regionOption = c("TSS1500", "TSS200"), DHSOption = "Both",
outputFileName = paste(sCancer, "methylation_450", sep = "__"),
outputFileFolder = sPath3,
chipAnnotationFile = "./SupportingFiles/MethylationChipAnnotation.rda")
save.image("tmpdata.Rdata")
load("tmpdata.Rdata")
RNAdiff.R
#setwd("G:/AllShare/SkillTrainHomework/BasicDataProcessingResult")
#load("GBM__geneExp.rda")
setwd("G:/AllShare/SkillTrainHomework")
load("RNAdiff.RData")
DATAFRAME<- data.frame(Data)
TCGASampleName<- colnames(DATAFRAME)
TCGASamplelist <- strsplit(TCGASampleName,split=".",fixed = TRUE)
TCGASampletpye <- c()
TCGASampleID <- c()
for(i in 1:length(TCGASamplelist)){
TCGASampletpye <- c(TCGASampletpye,TCGASamplelist[[i]][4])
TCGASampleID <- c(TCGASampleID,paste(TCGASamplelist[[i]][2],TCGASamplelist[[i]][3],sep = "."))
}
TCGASampletpyeResult <- c()
for(i in 1:length(TCGASampletpye)){
if(as.integer(substr(TCGASampletpye[i],1,2))<10){
TCGASampletpyeResult <- c(TCGASampletpyeResult,"Tumor")
}else{
TCGASampletpyeResult <- c(TCGASampletpyeResult,"Normal")
}
}
phenotype <- data.frame(TCGASampletpyeResult)
rownames(phenotype) <- colnames(Data)
rownames(Data) <- Des[,2]
library(DESeq2)
Data<- floor(Data)
dds <- DESeqDataSetFromMatrix(countData = Data,
colData = phenotype,
design = ~ TCGASampletpyeResult)
dds <- DESeq(dds)
res <- results(dds)
resOrdered <- res[order(res$padj),]
resSig <- subset(resOrdered, padj < 0.01)
res001 <- results(dds, alpha=0.01)
pdf(file = "MA-plot.pdf",width = 15,height = 9)
plotMA(res001, ylim=c(-3,3))
idx <- identify(res$baseMean, res$log2FoldChange)
rownames(res)[idx]
res[idx,]
Data[c("100131561","677811","414899","677846"),]
Data[c("9271","140893"),]
Data[c("390077","144742"),]
dev.off()
d <- plotCounts(dds, gene=which.min(res$padj), intgroup="TCGASampletpyeResult",
returnData=TRUE)
library("ggplot2")
ggplot(d, aes(x=TCGASampletpyeResult, y=count)) +
geom_point(position=position_jitter(w=0.1,h=0)) +
scale_y_log10(breaks=c(25,100,400))+
labs(title=paste("EntrezID:",res@rownames[which.min(res$padj)]))+
theme(plot.title = element_text(hjust = 0.5))
ggsave(filename="p值最小的基因不同表达量.pdf",width = 8,height=6)
ntd <- normTransform(dds)
library("pheatmap")
select <- rownames(resOrdered)[1:20]
df <- as.data.frame(colData(dds)[,"TCGASampletpyeResult"])
pdf(file = "热图.pdf",width = 20,height = 9)
pheatmap(assay(ntd)[select,],show_rownames=TRUE)
dev.off()
RnaSigGeneID<- rownames(resSig)
RnaSigGeneName<- Des[Des[,2]%in%RnaSigGeneID,1]
intersect(RnaSigGeneName,signifiGene)
#log2FoldChange为负数,正常高表达;log2FoldChange为正数,肿瘤高表达
#筛选log2FoldChange为负数,即肿瘤低表达的部分
resSigdown<- resSig[resSig$log2FoldChange<(-2),]
#dim(resSigdown)
RnaSigGeneIDdown<- rownames(resSigdown)
RnaSigGeneNamedown<- Des[Des[,2]%in%RnaSigGeneIDdown,1]
# Data["2498",c(38:41,76)]
# Data["2498",]
# Data["7153",c(38:41,76)]
# Data["7153",]
#与抑癌基因取交集
Human_TSGs <- read.table("Human_TSGs.txt",header = TRUE,sep = "\t")
#TSGene <- read.table("TSGene-LOFdataset.txt",header = TRUE,sep = "\t")
#sig_exp <- read.table("sig_exp.txt",header = TRUE,sep = "\t")
TSGs <- Human_TSGs$GeneSymbol
#TSGs <- TSGene$GeneName
#TSGs <- sig_exp$Symbol
result1<- intersect(RnaSigGeneNamedown,TSGs)
result2<- intersect(result1,signifiGene)#三者交集
resultmuch <- intersect(RnaSigGeneNamedown,signifiGene)
write.table(resultmuch,"候选靶标基因多多多.txt",quote = FALSE,row.names = FALSE,col.name = FALSE)
#"CCHCR1"%in%signifiGene
#此段代码需要methylation的TCGASampletpyeResult对象
setwd("G:/AllShare/SkillTrainHomework/AdvancedDataProcessingResult")
load("GBM__methylation_450__TSS1500-TSS200__Both.rda")
MethTssFrame<- data.frame(Data)
rownames(MethTssFrame) <- Des[,1]
targetFrame<- MethTssFrame[result2,]
colnames(targetFrame) <- TCGASampletpyeResult
targetFrame<-na.omit(targetFrame)
library(pheatmap)
pdf(file = "target热图.pdf",width = 20,height = 6)
pheatmap(targetFrame,show_rownames=TRUE)
dev.off()
library (VennDiagram)
draw.triple.venn(area1=5, area2=5, area3=5
,n12=3, n23=3, n13=3, n123=3
,category = c('A','B','C'))
pdf("交集.pdf",width = 10,height = 10)
T<-venn.diagram(list(A=RnaSigGeneNamedown,B=TSGs,C=signifiGene),filename=NULL
,lwd=1,lty=2,category = c('RNA-seq down','Tumor suppressor gene','Hypermethylation'))
grid.draw(T)
dev.off()
#save.image("RNAdiff.RData")
PCAdata<-t(Data)
#作主成分分析
PCAdata.pr<-prcomp(PCAdata,scale=FALSE)
#作预测
PCA_eset<- predict(PCAdata.pr)
colnames(PCA_eset)
pdf(file = "陡坡图.pdf",width = 14,height = 7)
screeplot(PCAdata.pr,type="lines")
dev.off()
data.hc <- hclust( dist(PCAdata))
pdf(file = "树状图.pdf",width = 22,height = 12)
plot(data.hc, hang = -1)
dev.off()
#plot(PCA_eset[,1:2])
library("ggplot2")
ggplot(NULL, aes(x=PCA_eset[,1], y=PCA_eset[,2], colour=TCGASampletpyeResult)) +
geom_point() +
guides(color=guide_legend(title=NULL)) +
labs(x = "PCA1 34.7%",y = "PCA2 15.3%",title = "RNA-seq Principal components analysis")
ggsave(filename="PCA RNA-seq.pdf",width = 8,height=6)
GOterm <- read.table("analysis.txt",header = TRUE,sep = "\t")
library(ggplot2)
ggplot(GOterm,aes(x=GO.biological.process,y=upload_1..over.under.,fill = P.value))+
geom_bar(stat="identity")+
labs(x = "GO terms",y = "Fold Enrichment",title = "GO biological process analysis")+
coord_flip()
ggsave(filename="GO analysis.pdf",width = 8,height=6)