CIMP analysis

22 Jul 2015

Introduction

This page contains the pipeline analysis for the "CIMP analysis". Source code is available here. The whole analysis can be computed by running the run_all.R after data processing.

Data processing

Data processing is done by running the run_all.R located in data/src. The procedure is standard (see report).

Data analysis

Part A: Analysis of average CGI+SS patterns

DiseaseList <- c(('BLCA','BRCA','COAD','LUAD','STAD')

1) We first assess CIMP in each tissue using the same methodology on genome-wide methylation profiles by performing hierarchical clustering on the top 5% most variant probes in each disease.

source('fun/analyze_CIMP_all_CGIs.R')
for (DiseaseName in DiseaseList)
{
        print(DiseaseName)
        out <- analyze_CIMP_all_CGIs(DiseaseName=DiseaseName,CIMP.Number = 2, calc.Var= T)
}

2) We assess the robustness of the clusters by varying the number of CGIs considered from 1 to 10 percent. At the same time, we also look at the stability of 3 clusters to assess the existence of a CIMP-low phenotype.

var.list <- c(1,2,5,10)
CIMP.list <- c(2,3)
source('fun/analyze_CIMP_all_CGIs.R')
for (CIMP.Number in CIMP.list)
{
        for (var.thresh in var.list)
        {
                for (DiseaseName in DiseaseList)
                {
                        print(paste0('Analyzing ',DiseaseName,'...', ' with var=',var.thresh,'% and CIMP.number=',CIMP.Number))
                        out <- analyze_CIMP_all_CGIs(DiseaseName=DiseaseName,CIMP.Number = CIMP.Number, calc.Var= F, var.thresh = var.thresh)
                }
        }
}

### 2.A) Look at cluster robustness
source('fun/cluster_analysis.R')
out <- cluster_analysis(DiseaseList=DiseaseList, var.thresh=var.thresh)

### 2.B) Look at cluster robustness given the var.thresh
source('fun/cluster_analysis_var.R')
var.list <- c(1,2,5,10)

for (Disease in DiseaseList)
{
        out <- cluster_analysis_var(DiseaseName=Disease, CIMP.Number=3, var.list= var.list)
}

3) We fix the top 5% CGIs to define the CIMP-signature instead of another cutoff as a tradeoff between relevant probes and having a wide enough coverage. We then analyze whether there is a common panel of probes between the tissue-specific CIMP-signature.

var.thresh <- 5
source('fun/compare_panel_all_CGIs.R')
DiseaseList <- c('BRCA','BLCA','COAD','LUAD','STAD')
out <- compare_panel_all_CGIs(DiseaseList, var.thresh=var.thresh)

We obtain a subset of 89 CGIs common between all the CIMP-signatures.

4) By combining the samples from the different tissues, we then perform clustering on this common CIMP-signature:

source('fun/analyze_CIMP_all_CGIs_bis.R')
out <- analyze_CIMP_all_CGIs_bis(DiseaseList,CIMP.Number = 2)

4) We then analyze whether the methylation aberrations can be associated with transcriptomic or genetic variations:

4.A) Can we assess CIMP from gene expression variations i.e CIMP=f(Gene Expression)?

We propose to tackle this problem using a sparse logistic regression with different formulations:

i. In the first case we predict the CIMP status for each tissue separately:

source('fun/predict_CIMP_GE_glmnet.R')
for (DiseaseName in DiseaseList)
{
        print(DiseaseName)
        out <- predict_CIMP_GE(DiseaseName, var.thresh=var.thresh, CIMP.Number=2, centered=T, scaled=T, intercept=T, n.folds=3, bootstrap=100, cores=10, log_exp=T, balanced=T)
}

ii. In the second case we compute a single classifier for all datasets:

source('fun/predict_CIMP_GE_all.R')
out <- predict_CIMP_GE_all(DiseaseList, var.thresh=var.thresh, CIMP.Number=2, centered=T, scaled=T, intercept=T, n.folds=3, bootstrap=100, cores=10, log_exp=T, balanced=T)

iii. Finally, in the last case we relax the previous constraint (single classifier) by forcing each tissue-specific predictor to have the same non-zero coefficients but allowing the coefficients to vary:

source('fun/predict_CIMP_GE_MT_par.R')
out <- predict_CIMP_GE_MT(DiseaseList, var.thresh=var.thresh, CIMP.Number=2, centered=T, scaled=T, intercept=T,  n.folds=3, bootstrap=100, cores=10, balanced=T)

iv. Summary of the results:

source('fun/analyze_predict_CIMP_GE_MT.R')

4.B) Analysis of the mutations associated with CIMP:

i) We analyzed the the association between CIMP and known reported mutations associated with tissue-specific CIMPs (e.g BRAF, KRAS, IDH1, IDH2, TET2).

source('fun/analyze_mutations.R')
Mutation.List <- c('BRAF','KRAS','IDH1','IDH2','TET2')

analyze_mutations(DiseaseList, Mutation.List=Mutation.List)

ii) We then also searched for other mutations significantly associated with CIMP in all diseases: {r} source('fun/analyze_mutations.R') analyze_mutations(DiseaseList, Mutation.List=Mutation.List)

5) Survival analysis

source('fun/compare_clinical.R')
for (DiseaseName in DiseaseList)
{
        print(DiseaseName)
        out <- compare_clinical(DiseaseName=DiseaseName, var.thresh=var.thresh, CIMP.Number= CIMP.Number)
}