This document specifies a novel methodology for generating a new set of descriptors, referred to as GO descriptors, which summarize omics data while integrating Gene Ontology (GO) information. Our goal is to enrich the data using gene set information whilst emphasizing the importance of omics data in modelling engineered nanomaterial (ENM) toxicity. GO was selected as the gold standard for annotation in three ontology branches, namely Cellular Components (CC), Molecular Functions (MF), and Biological Processes (BP), containing 41,694 classes.
GO descriptors summarize the omics data by their GO classes; our particular use case was protein corona proteomics data (Walkey et al., 2014) integrated with pathway information from the GO database. Summarization is performed by clustering the GO classes of the data, so that the number of GO descriptors equals the number of clusters of those GO groups found to be highly enriched in the supplied data set. Clustering is performed using hierarchical clustering or bi-clustering algorithms (Parmeet et al., 2014; Aggarwal and Reddy, 2013), and the number of clusters is a user-specified parameter for both. By defining the number of clusters, the user decides both to what extent the data should be summarized and in what depth the relationships between the significant GO ids should be detailed.
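The summarization step described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the packages' implementation: the abundance matrix, the binary GO annotation matrix, and every object name are invented for the example, and stats::hclust stands in for the clustering step.

```r
# Toy proteomics matrix (assumption): 4 samples (rows) by 6 proteins (columns).
set.seed(1)
abund <- matrix(runif(24), nrow = 4,
                dimnames = list(paste0("sample", 1:4), paste0("P", 1:6)))

# Hypothetical binary annotation: rows are enriched GO classes, columns are proteins.
go_annot <- matrix(c(1, 1, 0, 0, 0, 0,
                     1, 1, 1, 0, 0, 0,
                     0, 0, 0, 1, 1, 0,
                     0, 0, 0, 1, 1, 1),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(paste0("GOclass", 1:4), colnames(abund)))

# Cluster proteins by their GO annotation profile; k is the user-specified
# number of clusters and therefore the number of descriptors produced.
k <- 2
hc <- hclust(dist(t(go_annot), method = "binary"), method = "average")
memb <- cutree(hc, k = k)

# One descriptor per cluster: the mean abundance of its member proteins per sample.
go_descr <- sapply(split(colnames(abund), memb),
                   function(p) rowMeans(abund[, p, drop = FALSE]))
dim(go_descr)  # samples by descriptors
```

Increasing k yields more, finer-grained descriptors; decreasing it summarizes the data more aggressively.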
GO descriptors can be used in nanoQSAR models, possibly together with physicochemical descriptors, and be further exploited for their biological relevance and functional similarity.
Two separate R packages have been created, GOdescrPred and GOdescrCalculus. Both follow the above procedure, but they differ structurally so that the user can either store a clustering model and reuse it on new data, or calculate GO descriptors directly for a given data set.
In particular, GOdescrPred saves the cluster memberships of the proteins for a particular data set and, based on them, produces GO descriptors for a new set of data. In this case, it is a prerequisite that the new data given by the user include the same protein/gene/etc. ids as the original. Alternatively, GOdescrCalculus can be employed when the user wants to calculate GO descriptors for a particular set of omics data and does not have the 'clustering model', i.e. that set of proteins has not previously been used to create a set of GO descriptors. If the objective is just to create a set of new descriptors, then GOdescrCalculus should be used; its output is a data matrix (or array) whose columns correspond to the different GO descriptors.
GOdescrPred includes the following functions: generate.biclust.model, generate.hierar.model, pred.descr, and read.in.json.for.pred.
GOdescrCalculus includes the following functions: generate.descr.biclust, generate.descr.hierar, generate.param.model, and read.in.json.for.pred.
This package expects an omics data set (e.g. proteomics/genomics) and creates an R raw model that stores the cluster memberships based on the data and the algorithm selected; bi-clustering and hierarchical clustering algorithms are currently implemented. The stored cluster memberships can then be used for 'prediction', i.e. to construct new descriptors for similar data. By similar, we mean that the new data need to include the genes/proteins/etc. used for the initial clustering.
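The idea of storing memberships as a raw model and reusing them on similar data can be sketched with base R serialization. This is an illustration only: the membership vector, the new data, and all object names are assumptions, and how the package itself lays out its raw model is not specified in this document.

```r
# Hypothetical cluster memberships learned on an original data set.
memb <- c(P1 = 1L, P2 = 1L, P3 = 2L)

# Store the memberships as a raw vector (an "R raw model").
raw_model <- serialize(memb, connection = NULL)
is.raw(raw_model)

# Later: restore the memberships and apply them to new data.
restored <- unserialize(raw_model)
new_data <- data.frame(P1 = c(0.2, 0.4), P2 = c(0.6, 0.8), P3 = c(1, 3))

# The new data must contain every protein id used for the initial clustering.
stopifnot(all(names(restored) %in% names(new_data)))

# One descriptor per stored cluster, here summarized with the mean.
new_descr <- sapply(split(names(restored), restored),
                    function(p) rowMeans(new_data[, p, drop = FALSE]))
new_descr
```

The explicit id check mirrors the prerequisite stated above: prediction is only meaningful when the new data cover the proteins that defined the clusters.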
Package: GOdescrPred
Type: Package
Version: 1.0
Date: 2015-04-15
License: GPL-2
Generates GO descriptors from an omics data set. Important functions are generate.biclust.model and generate.hierar.model.
Georgia Tsiliki
Maintainer: Georgia Tsiliki <gtsiliki@central.ntua.gr>
1. blockcluster 2. vegan 3. GSEABase 4. GOstats
data("dat1p")
data("dat1i")
data("dat1m")
data.file<- read.in.json.for.pred(dat1p, dat1m, dat1i)
The dataset for this test is a data frame
data("dat1")
A list of two objects:
datasetURI: a character vector, the Ambit data set URI.
dataEntry: a data frame containing two columns, compound and values. compound is a character vector with all compound Ambit URIs, and values is a data frame with all numeric values of the proteomics data set (compounds by features). Only dependent features are included.
Please find more about the data set in $dataEntry at http://pubs.acs.org/doi/abs/10.1021/nn406018q?journalCode=ancac3
Walkey et al., 2014
data(dat1)
## maybe str(dat1) ; plot(dat1) ...
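To make the documented format concrete, the following sketch builds a toy object with the same shape as dat1. All URIs and numeric values are invented placeholders, and the two column names are UniProt accessions chosen only for illustration; the real data set is not reproduced here.

```r
# Numeric proteomics values: compounds (rows) by features (columns).
values <- data.frame(P02768 = c(0.12, 0.30),
                     P01024 = c(0.05, 0.22))

# dataEntry: a data frame with a 'compound' column and a nested 'values' data frame.
data_entry <- data.frame(
  compound = c("http://example.org/compound/1",   # placeholder URIs (assumption)
               "http://example.org/compound/2"),
  stringsAsFactors = FALSE
)
data_entry$values <- values

# The full data set object: a list of two objects.
toy_dat <- list(datasetURI = "http://example.org/dataset/1",  # placeholder URI
                dataEntry  = data_entry)

str(toy_dat, max.level = 2)
```

Storing values as a nested data frame keeps the two-column layout of dataEntry that the functions in this package expect.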
Here we specify the function used to summarize the dat1 data set according to the GO descriptor cluster memberships stored in dat1m. Typical functions that can be provided are, for example, mean, sd, and var.
data("dat1i")
A data frame with one observation on one variable: X.mean., a character vector giving the name of the function used for the summary.
Example function to summarize dat1 using dat1m cluster memberships
There are no references
data(dat1i)
## maybe str(dat1i) ; plot(dat1i) ...
A character string holding a serialized GO descriptors model, i.e. it provides the cluster memberships for all proteins in the proteomics data dat1.
data("dat1m")
A character string
Example GO descriptors model based on dat1
There are no references
data(dat1m)
## maybe str(dat1m) ; plot(dat1m) ...
The dataset for this test is a data frame
data("dat1p")
A list of two objects:
datasetURI: a character vector, the Ambit data set URI.
dataEntry: a data frame containing two columns, compound and values. compound is a character vector with all compound Ambit URIs, and values is a proteomics data frame with all numeric values of the data set (compounds by features).
Data set for prediction with dat1m
Walkey et al., 2014
data(dat1p)
## maybe str(dat1p) ; plot(dat1p) ...
The dataset for this test is a data frame
data("dat1p")
A list of two objects:
datasetURI: a character vector, the Ambit data set URI.
dataEntry: a data frame containing two columns, compound and values. compound is a character vector with all compound Ambit URIs, and values is a proteomics data frame with all numeric values of the data set (compounds by features).
Data set for prediction with dat1m
Walkey et al., 2014
data(dat4h)
## maybe str(dat4h) ; plot(dat4h) ...
This function estimates the cluster memberships for omics data, based on the GO ontology. Data are clustered using the bi-clustering algorithm from the blockcluster R package with default values. The user needs to specify the number of clusters for both axes.
generate.biclust.model(dataset, predictionFeature, parameters)
dataset: a list of 2 objects: datasetURI, a character string giving the code name of the data set, and dataEntry, a data frame with 2 columns.
predictionFeature: a character string specifying which is the prediction feature in dataEntry.
parameters: a list with parameter values for the ontology and bi-clustering. 5 objects should be included:
  'key': a character string for the gene/protein/etc. id type (for dat1, 'UNIPROT' is the right key);
  'onto': a character vector giving the ontology and sub-ontology used (e.g. c('GO','MF'));
  'pvalCutoff': a numeric value for the hypergeometric p-value cutoff (e.g. 0.05);
  'nclust': a numeric vector giving the number of clusters for GOs (x axis) and genes/proteins (y axis) (e.g. c(5,4));
  'FUN': a string naming the R function used to summarize each group of values (e.g. 'mean').
More details can be found in https://cran.r-project.org/web/packages/blockcluster/index.html.
A list with the following objects:
rawModel: a serialized GO descriptors object (class raw) giving the cluster memberships of proteins/genes in the data.
pmmlModel: a PMML GO descriptors object, currently empty.
predictedFeatures: a character vector with names for the new descriptors.
independentFeatures: a list with Ambit names for all gene/protein features included in the model.
additionalInfo: a data frame with all independent features included in the model and their dummy names in the model, here empty.
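The role of 'pvalCutoff' can be illustrated with a one-sided hypergeometric test computed in base R. This is a sketch of the general over-representation test, not the GOstats code path the package relies on, and all counts below are invented for the example.

```r
# Hypothetical counts (assumptions for illustration only).
n_universe <- 1000  # all annotated proteins considered
n_in_class <- 50    # proteins annotated to one GO class
n_selected <- 100   # proteins present in the supplied data set
n_overlap  <- 12    # selected proteins that belong to the GO class

# P(X >= n_overlap) when drawing n_selected proteins without replacement.
p_value <- phyper(n_overlap - 1,
                  n_in_class,
                  n_universe - n_in_class,
                  n_selected,
                  lower.tail = FALSE)

# A GO class is retained as enriched if its p-value falls below the cutoff.
pval_cutoff <- 0.05
enriched <- p_value < pval_cutoff
enriched
```

A stricter cutoff retains fewer GO classes, which in turn leaves fewer classes to be grouped into descriptors.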
Georgia Tsiliki
1.blockcluster 2.GSEABase 3.GOstats
generate.hierar.model
predF<- list()
required.param <- list(key = "UNIPROT", onto = c("GO", "MF"), pvalCutoff = 0.05, nclust = c(3, 2), FUN = "mean")
clust.memb <- generate.biclust.model(dat1, predF, required.param)
This function estimates the cluster memberships for omics data, based on the GO ontology. Data are clustered using the hierarchical clustering algorithm from the vegan R package with default values. The user needs to specify the number of clusters or the height of the dendrogram.
generate.hierar.model(dataset, predictionFeature, parameters)
dataset: a list of 2 objects: datasetURI, a character string giving the code name of the data set, and dataEntry, a data frame with 2 columns.
predictionFeature: a character string specifying which is the prediction feature in dataEntry.
parameters: a list with parameter values for the ontology and hierarchical clustering. 7 objects should be included:
  'key': a character string for the gene/protein/etc. id type (for dat1, 'UNIPROT' is the right key);
  'onto': a character vector giving the ontology and sub-ontology used (e.g. c('GO','MF'));
  'pvalCutoff': a numeric value for the hypergeometric p-value cutoff (e.g. 0.05);
  'distMethod': the distance method (one of those provided via the vegan R package);
  'hclustMethod': the hierarchical clustering method (one of those provided via the vegan R package);
  'nORh': either a numeric value giving the number of clusters, or a character string naming a function used to define the height;
  'FUN': a string naming the R function used to summarize each group of values (e.g. 'mean').
The hierarchical clustering algorithm implementation was taken from the vegan CRAN package, https://cran.r-project.org/web/packages/vegan/index.html
A list with the following objects:
rawModel: a serialized GO descriptors object (class raw) giving the cluster memberships of proteins/genes in the data.
pmmlModel: a PMML GO descriptors object, currently empty.
predictedFeatures: a character vector with names for the new descriptors.
independentFeatures: a list with Ambit names for all gene/protein features included in the model.
additionalInfo: a data frame with all independent features included in the model and their dummy names in the model, here empty.
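The two modes of 'nORh' can be sketched with base R. Several points here are assumptions rather than documented behaviour: stats::dist and stats::hclust stand in for the distance and clustering calls, the helper cut_tree is hypothetical, and a character 'nORh' is interpreted as a function applied to the dendrogram merge heights to obtain the cut height.

```r
# Toy data (assumption): 10 proteins described by 4 numeric features.
set.seed(42)
x <- matrix(rnorm(40), nrow = 10, dimnames = list(paste0("P", 1:10), NULL))

hc <- hclust(dist(x, method = "euclidean"), method = "ward.D2")

# Hypothetical helper: cut by number of clusters (numeric) or by height (character).
cut_tree <- function(hc, nORh) {
  if (is.numeric(nORh)) {
    cutree(hc, k = nORh)
  } else {
    # Assumed interpretation: apply the named function to the merge heights.
    h <- match.fun(nORh)(hc$height)
    cutree(hc, h = h)
  }
}

memb_k <- cut_tree(hc, 3)       # exactly 3 clusters
memb_h <- cut_tree(hc, "mean")  # cut at the mean merge height
table(memb_k)
table(memb_h)
```

Cutting by height lets the data determine the number of clusters, whereas a numeric value fixes the number of descriptors in advance.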
Georgia Tsiliki
1.vegan 2.GSEABase 3.GOstats
generate.biclust.model
predF<- list()
required.param <- list(key = "UNIPROT", onto = c("GO", "MF"), pvalCutoff = 0.05, distMethod = "euclidean", hclustMethod = "ward.D2", nORh = "mean", FUN = "mean")
clust.memb<- generate.hierar.model(dat4h,predF,required.param)
Produces GO descriptors based on the data and model supplied. Also a summary statistic needs to be specified.
pred.descr(dataset, rawModel, additionalInfo)
dataset: data for prediction. A list of two objects: datasetURI (a character string) and dataEntry (a data frame).
rawModel: a serialized R model (GO cluster memberships for proteins/genes).
additionalInfo: any additional information needed for rawModel. Here a data frame with one cell giving the function used to summarize dataset based on rawModel (e.g. 'mean').
This function reads in the information provided by either generate.biclust.model() or generate.hierar.model() and converts their output into a form that can easily be read in JSON format.
A data.frame with the new GO descriptors given the function's parameters. The number of columns is the number of new GO descriptors, and the number of rows is the number of features used.
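How a summary function supplied as a string in a one-cell data frame might be applied to cluster memberships can be sketched as follows. This is an assumption-laden illustration, not the body of pred.descr: the memberships, the data matrix, and all object names are invented, and the column name FUN is a placeholder.

```r
# additionalInfo-like object: a data frame with one cell naming the function.
info <- data.frame(FUN = "sd", stringsAsFactors = FALSE)
summarise_fun <- match.fun(info[1, 1])

# Hypothetical cluster memberships, as would be restored from a raw model.
memb <- c(P1 = 1L, P2 = 1L, P3 = 2L, P4 = 2L)

# Toy data for prediction: 2 compounds (rows) by 4 proteins (columns).
x <- matrix(c(1, 3, 10, 14,
              2, 6, 20, 20),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("np1", "np2"), names(memb)))

# Apply the chosen function within each cluster, separately for each compound.
descr <- sapply(split(names(memb), memb),
                function(p) apply(x[, p, drop = FALSE], 1, summarise_fun))
descr
```

Passing the function by name keeps the stored object plain text, so the same memberships can be summarized with mean, sd, var, or any other function that accepts a numeric vector.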
Georgia Tsiliki
jsonlite
data("dat1p")
data("dat1i")
data("dat1m")
pred.res<- pred.descr(dat1p, dat1m, dat1i)
This function reads in a json data file and produces a list with independent features, GO clustering memberships raw model.
read.in.json.for.pred(dataset, rawModel, additionalInfo)
dataset: data for prediction. A list of two objects: datasetURI (a character string) and dataEntry (a data frame).
rawModel: a serialized R model (GO cluster memberships for proteins/genes).
additionalInfo: any additional information needed for rawModel. Here a data frame with one cell giving the function used to summarize dataset based on rawModel (e.g. 'mean').
No further details required
A list including:
x.mat: a data frame with the independent variables (proteins/genes/etc.).
model: the R model (GO cluster memberships for proteins/genes).
additionalInfo: any additional information needed for rawModel. Here a data frame with one cell giving the function used to summarize dataset based on rawModel (e.g. 'mean').
Georgia Tsiliki
jsonlite
data("dat1p")
data("dat1i")
data("dat1m")
data.file<- read.in.json.for.pred(dat1p, dat1m, dat1i)
GOdescrCalculus-package dat1 dat1i1 dat1i2 dat1m1 dat1m2 dat1p generate.descr.biclust generate.descr.hierar generate.param.model read.in.json.for.pred
This package expects an omics data set (e.g. proteomics/genomics) and produces a set of GO descriptors given the data, the parameters supplied, and the algorithm selected; bi-clustering and hierarchical clustering algorithms are currently implemented. The clustering algorithm parameters can be stored as an R raw model and used for 'prediction', i.e. to construct GO descriptors for the same data.
Package: GOdescrCalculus
Type: Package
Version: 1.0
Date: 2015-04-16
License: GPL-2
Generates GO descriptors from an omics data set. Important functions are generate.descr.biclust and generate.descr.hierar.
Georgia Tsiliki
Maintainer: Georgia Tsiliki <gtsiliki@central.ntua.gr>
1. blockcluster 2. vegan 3. GSEABase 4. GOstats
data("dat1p")
data("dat1m1")
data("dat1i1")
res1<- read.in.json.for.pred(dat1p, dat1m1, dat1i1)
The dataset for this test is a data frame
data("dat1")
A list of two objects:
datasetURI: a character vector, the Ambit data set URI.
dataEntry: a data frame containing two columns, compound and values. compound is a character vector with all compound Ambit URIs, and values is a data frame with all numeric values of the proteomics data set (compounds by features). Only dependent features are included.
Please find more about the data set in $dataEntry at http://pubs.acs.org/doi/abs/10.1021/nn406018q?journalCode=ancac3
Walkey et al., 2014
data(dat1)
## maybe str(dat1) ; plot(dat1) ...
Additional information, i.e. the names of the predicted new features as generated by the bi-clustering algorithm.
data("dat1i1")
A list with one object, 'predictedFeatures', containing the names of the new descriptors.
Example of new descriptor names as generated by generate.param.model
There are no references
data(dat1i1)
## maybe str(dat1i1) ; plot(dat1i1) ...
Additional information, i.e. the names of the predicted new features as generated by the hierarchical clustering algorithm.
data("dat1i2")
A list with one object, 'predictedFeatures', containing the names of the new descriptors.
Example of new descriptor names as generated by generate.param.model
There are no references
data(dat1i2)
## maybe str(dat1i2) ; plot(dat1i2) ...
A character string holding a serialized parameters list, i.e. the set of values needed by the bi-clustering algorithm in order to produce the GO descriptors. The list includes 'key' (e.g. 'UNIPROT'), 'onto' (e.g. c('GO','MF')), 'pvalCutoff' for the hypergeometric test (e.g. 0.05), 'nclust' for the number of clusters on the x and y axes respectively (e.g. c(4,2)), and 'FUN' to specify the function used to summarize the omics data (e.g. 'mean').
data("dat1m1")
A character string
Example set of parameters needed for generate.descr.biclust function
There are no references
data(dat1m1)
## maybe str(dat1m1) ; plot(dat1m1) ...
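One way a parameters list can be turned into a character string and back is ASCII serialization in base R. Whether dat1m1 was produced in exactly this way is an assumption; the sketch only shows that such a round trip is possible, using the parameter names documented above.

```r
# Parameters list with the five bi-clustering entries described above.
params <- list(key = "UNIPROT",
               onto = c("GO", "MF"),
               pvalCutoff = 0.05,
               nclust = c(4, 2),
               FUN = "mean")

# Serialize to an ASCII raw vector, then view it as a single character string.
param_string <- rawToChar(serialize(params, connection = NULL, ascii = TRUE))
is.character(param_string)

# Restore the list from the string.
restored <- unserialize(charToRaw(param_string))
isTRUE(all.equal(restored, params))
```

A character representation is convenient when the model has to travel inside a JSON document, since binary raw vectors cannot be embedded there directly.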
A character string holding a serialized parameters list, i.e. the set of values needed by the hierarchical clustering algorithm in order to produce the GO descriptors. The list includes 'key' (e.g. 'UNIPROT'), 'onto' (e.g. c('GO','MF')), 'pvalCutoff' for the hypergeometric test (e.g. 0.05), 'distMethod' (e.g. 'euclidean' or other alternatives from the vegan package), 'hclustMethod' (e.g. 'ward.D2' or other alternatives from the vegan package), 'nORh' for the number of clusters in the data (e.g. 10), and 'FUN' to specify the function used to summarize the omics data (e.g. 'mean').
data("dat1m2")
A character string
Example set of parameters needed for generate.descr.hierar function
There are no references
data(dat1m2)
## maybe str(dat1m2) ; plot(dat1m2) ...
The dataset for this test is a data frame
data("dat1p")
A list of two objects:
datasetURI: a character vector, the Ambit data set URI.
dataEntry: a data frame containing two columns, compound and values. compound is a character vector with all compound Ambit URIs, and values is a proteomics data frame with all numeric values of the data set (compounds by features).
Data set for prediction with dat1m
Walkey et al., 2014
data(dat1p)
## maybe str(dat1p) ; plot(dat1p) ...
This function estimates GO descriptors for omics data, based on the GO ontology. Data are clustered using the bi-clustering algorithm from the blockcluster R package with default values. The user needs to specify the number of clusters for each axis.
generate.descr.biclust(dataset, rawModel, additionalInfo)
dataset: data for prediction. A list of two objects: datasetURI (a character string) and dataEntry (a data frame).
rawModel: a serialized list of parameters for bi-clustering, as produced by generate.param.model.
additionalInfo: any additional information needed for rawModel. A list with one object called predictedFeatures, which holds the character names of the new descriptors.
More details can be found in https://cran.r-project.org/web/packages/blockcluster/index.html.
A list giving the new GO descriptors given the function's parameters. The number of columns is the number of new GO descriptors, and the number of rows is the number of features used. The new descriptors are given as data frames per feature.
Georgia Tsiliki
1.blockcluster 2.GSEABase 3.GOstats
data("dat1p")
data("dat1m1")
data("dat1i1")
pred.res<- generate.descr.biclust(dat1p, dat1m1, dat1i1)
This function estimates GO descriptors for omics data, based on the GO ontology. Data are clustered using the hierarchical clustering algorithm from the vegan R package with default values. The user needs to specify the number of clusters, the distance matrix method, and the hierarchical clustering method.
generate.descr.hierar(dataset, rawModel, additionalInfo)
dataset: data for prediction. A list of two objects: datasetURI (a character string) and dataEntry (a data frame).
rawModel: a serialized list of parameters for hierarchical clustering, as produced by generate.param.model.
additionalInfo: any additional information needed for rawModel. A list with one object called predictedFeatures, which holds the character names of the new descriptors.
The hierarchical clustering algorithm implementation was taken from the vegan CRAN package, https://cran.r-project.org/web/packages/vegan/index.html
A list giving the new GO descriptors given the function's parameters. The number of columns is the number of new GO descriptors, and the number of rows is the number of features used. The new descriptors are given as data frames per feature.
Georgia Tsiliki
1.vegan 2.GSEABase 3.GOstats
data("dat1p")
data("dat1m2")
data("dat1i2")
pred.res<- generate.descr.hierar(dat1p, dat1m2, dat1i2)
This function passes on all the necessary information to the generate.descr.biclust and generate.descr.hierar functions.
generate.param.model(dataset, predictionFeature, parameters)
dataset: a list of 2 objects: datasetURI, a character string giving the code name of the data set, and dataEntry, a data frame with 2 columns.
predictionFeature: a character string specifying which is the prediction feature in dataEntry, here empty.
parameters: a list with parameter values for the ontology and the clustering algorithm. 5 or 7 objects should be included, depending on whether generate.descr.biclust or generate.descr.hierar will be used afterwards, respectively.
In the first case:
  'key': a character string for the gene/protein/etc. id type (for dat1, 'UNIPROT' is the right key);
  'onto': a character vector giving the ontology and sub-ontology used (e.g. c('GO','MF'));
  'pvalCutoff': a numeric value for the hypergeometric p-value cutoff (e.g. 0.05);
  'nclust': a numeric vector giving the number of clusters for GOs (x axis) and genes/proteins (y axis) (e.g. c(5,4));
  'FUN': a string naming the R function used to summarize each group of values (e.g. 'mean').
In the second case:
  'key': a character string for the gene/protein/etc. id type (for dat1, 'UNIPROT' is the right key);
  'onto': a character vector giving the ontology and sub-ontology used (e.g. c('GO','MF'));
  'pvalCutoff': a numeric value for the hypergeometric p-value cutoff (e.g. 0.05);
  'distMethod': the distance method (one of those provided via the vegan R package);
  'hclustMethod': the hierarchical clustering method (one of those provided via the vegan R package);
  'nORh': either a numeric value giving the number of clusters, or a character string naming a function used to define the height;
  'FUN': a string naming the R function used to summarize each group of values (e.g. 'mean').
Parameters are structured in a suitable format so that they can be read in by other functions of the package.
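The distinction between the 5-entry and 7-entry parameter lists can be checked before any clustering is attempted. The helper param_kind below is hypothetical and not part of the package; it only encodes the entry names documented in the arguments above.

```r
# Entry names documented for the two clustering routes.
biclust_names <- c("key", "onto", "pvalCutoff", "nclust", "FUN")
hierar_names  <- c("key", "onto", "pvalCutoff", "distMethod",
                   "hclustMethod", "nORh", "FUN")

# Hypothetical helper: report which descriptor function a parameter list fits.
param_kind <- function(parameters) {
  if (setequal(names(parameters), biclust_names)) {
    "biclust"
  } else if (setequal(names(parameters), hierar_names)) {
    "hierar"
  } else {
    stop("parameters must hold either the 5 bi-clustering entries ",
         "or the 7 hierarchical clustering entries")
  }
}

p5 <- list(key = "UNIPROT", onto = c("GO", "MF"), pvalCutoff = 0.05,
           nclust = c(3, 2), FUN = "mean")
p7 <- list(key = "UNIPROT", onto = c("GO", "MF"), pvalCutoff = 0.05,
           distMethod = "euclidean", hclustMethod = "ward.D2",
           nORh = 10, FUN = "mean")

param_kind(p5)
param_kind(p7)
```

Failing early on a malformed list gives a clearer error than one raised later inside the clustering call.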
rawModel: a serialized object of the parameters list supplied.
pmmlModel: a PMML GO descriptors object, currently empty.
independentFeatures: a list with Ambit names for all gene/protein features included in the model.
predictedFeatures: a character vector with dummy names for the GO descriptors that will be produced by the functions generate.descr.biclust or generate.descr.hierar.
additionalInfo: a list with one object called predictedFeatures, which holds the character names of the new descriptors.
Georgia Tsiliki
No references for this function.
predF<- list()
required.param <- list(key = "UNIPROT", onto = c("GO", "MF"), pvalCutoff = 0.05, nclust = c(3, 2), FUN = "mean")
params1<- generate.param.model(dat1,predF,required.param)
This function reads in a JSON data file and produces a list with the independent features and the parameters list for GO clustering, saved as a raw model.
read.in.json.for.pred(dataset, rawModel, additionalInfo)
dataset: data for prediction. A list of two objects: datasetURI (a character string) and dataEntry (a data frame), which should include an omics data set (e.g. proteomics/genomics/etc.).
rawModel: the raw model for prediction. Here a serialized list of parameters to be used by the generate.descr.biclust or generate.descr.hierar functions.
additionalInfo: a list with one object called predictedFeatures, which holds the character names of the new descriptors.
No further details required
A list including:
x.mat: a data frame with the values of the independent variables (proteins/genes/etc.).
model: the R model (a list of parameters to be used by the generate.descr.biclust or generate.descr.hierar functions).
additionalInfo: any additional information needed for rawModel. Here a list with two objects: a data frame giving the Ambit names of the independent features included in x.mat, and the predictedFeatures as given in the input.
Georgia Tsiliki
jsonlite
data("dat1p")
data("dat1m1")
data("dat1i1")
res1<- read.in.json.for.pred(dat1p, dat1m1, dat1i1)