RRegrs is a collection of R regression tools based on the caret package. It is used to find the best regression models for any numerical dataset. The initial use of the script is aimed at finding QSAR models for chemoinformatics / nanotoxicology for eNanoMapper European project.
RRegrs represents a simple tool to screen any dataset for the best regression model using ten implemented regression method 1.Linear Multi-regression (L)
2.Generalized Linear Model with Stepwise Feature Selection (GL)
3.Partial Least Squares Regression (PL)
4.Lasso regressi
5.Elastic Net regression (ENE)
6.Support vector machine using radial functions (SVM radia)
7.Neural Networks regression (NN)
8.Random Forest (RF)
9.Random Forest-Recursive Feature Elimination (RF-RFE)
10.Support Vector Machines Recursive Feature Elimination (SVM-RFE)
RRegrs permits you to run all the methods by using only one function call. The main func-tion of the package (RRegrs) contains several sections: loading parameters and dataset, remove near zero variance features, scaling dataset, remove correlated features, dataset splitting, run the 10 regression methods, summary of statistics for all methods and splittings, averages for each method and cross-validation type for all splittings, automatic best model statistics, best model Y-randomization. Assessment of Applicability Domain was included in each method.
In order to use RRegrs function, it is needed to specify a minimum set of parameters (the others will get default values). All the parameters will be written into a CSV file (ex: Parameters.csv), in the same working folder where it should be present the input dataset file and the outputs files.The dataset needs to have CSV format, with the first column as the dependent variable. The input and output files will be placed into a working folder. The output files are CSV statistics files, PDF and PNG plots.
The prerequisite libraries needed are data.table, corrplot, caret, kernlab, pls, randomForest,
RSNNS, doSNOW, foreach, doMC. The minimal call for the RRegrs() function could be:
> library(RRegrs)
> #Run RRegrs with all default parameters
> #(use default data set file and working folder,
> #run all regression methods, without wrappers,
> #10 splitings, 100 times Y−randomization,
> #no parallel calculation= 1 CPU core)
>RRegrs Results= RRegrs()
> #Run RRegrs for a specific data set file and the rest
> #default parameters
>RRegrs Results = RRegrs(DataFileName=”MyDataSet.csv”)
> #Run RRegrs for a specific data set file ,
> #working folder(it should exist and contains data set file)
> # and the rest default parameters
>RRegrs Results= RRegrs(DataFileName=”MyDataSet.csv”,
> PathDataSet=”MyResultsFolder”)
The default values could be found into the RRegrs definition:
>RRegrs <− function(DataFileName=” ds.House.csv ” ,
> PathDataSet=” Data Results ” ,
> noCores =1,
> ResAvgs=”RRegsResAvgs.csv ” ,
> ResBySplits=” RRegrsResAllSplits.csv ” ,
> ResBest=” RRegrsResBest.csv ” ,
> fD et=”T” , fFilters=”F” , fScaling=”T” ,
> fRemNear0Var=”T” , fRemCorr=”T” ,
> fLM=”T” ,fGLM=”T” , fPLS=”T” , fLASSO=”T” ,
> fENET=”T” ,fSVRM=”T” , fNN=”T” ,
> fRF=”T” ,fRFREF=”T” ,fSVMRFE=”T” ,
> RFE_SVM_C=” 1 ; 5 ; 1 5 ; 5 0 ” ,
> RFE_SVM_epsilon=” 0 . 0 1 ; 0 . 1 ; 0 . 3 ” ,
> cut off =0.9 , iScaling =1, iScalCol =1,
> TrainFrac =0.75 , iSplitTimes =10 , noYrand =100 ,
> CVtypes=” repeatedcv ;LOOCV” ,
> No0NearVarFile=”ds.No0Var.csv”,
> ScaledFile=”ds.scaled.csv”,
> NoCorrFile=”ds.scaled.NoCorrs.csv”,
> lmFile=”LM.details.csv”,
> glmFile=”GLM.details.csv”,
> plsFile=”PLS.details.csv”,
> lassoFile=”Lasso.details.csv”,
> svrmFile=”SVMRadial.details.csv”,
> nnFile=”NN.details.csv”,
> rfFile=”RF.details.csv”,
> rfrefFile=”RFREF.details.csv”,
> svmrfeFile=”SVMRFE.details.csv”,
> enetFile=”ENET.details.csv”,
> fR2rule=”T”)
The calculations need to be done using a specific folder where all the input, output files can be found. RRegrs main function is using an extended set of parameters*
Even if RRegrs will create outputs files with all statistics and plots, the function return a list with model’s statistics and the the full regression model (resulted from caret traini
Each regression function:
– a CSV file with detailed statistics about the regression mo
– 5 − 12 plots for fitting statistics as a PDF file for each splitting and cross-validation me
The outputs can differ depending on the regression method used.
After filtering the dataset for correlated variables, near-zero variance features and splitting the dataset into training and test sets, the user’s selected regression methods will be executed for each splitting and cross-validation type. Some of the complex regression methods (RF, SVM-RFE, RF-RFE) are using only 10-fold cross-validation (other validation methods could be very slow for these complex functions). The parallel support for calculations is presented only for the complex functi
In the next step, a CSV output file will be created with all the basic statistics (17 values) for each method type, splitting and cross-validation type. These summary statistics are used to generate another CSV file with the averaged statistics by all splittings, for each regression method and cross-validation type. The best regression model is chosen based on the following criteria: from the best test R-squared (+/- 0.05), the model with minimum RMSE is the final one. If fR2rule is False, adjR2 will be used to select the best model. For the best model, an additional CSV file is generated containing detailed statistics as well as PDF plots for important statist
Please note that the Cross-validation types available are the resampling methods available from trainControl caret and rfeControl ca
If the wrapper function are selected, SVMRFEreg and RFRFEreg functions could be used. The following section is presenting the code flow with details about the data objects, input and output files, and functions used into the main RRegrs function.
This section presents datails about variable names, input and output files, order of the sections:
1.Load parameters and dataset
Load Parameters from the function call as data frame: Params.df
Write all parameters into a CSV file (default: Parameters.csv)
2.Remove the NA values
3.Remove near zero variance columns using RemNear0VarCols = ds - No0Var CSV
4.Scaling dataset (normalization - default, standardization, etc.) using ScalingDS = ds - ScaledCSV
5.Remove correlated features using RemCorrs = ds
Dataset without correlated features: Scaled NoCorrs CSV
Correlation matrix: Scaled NoCorrs CorrMAT CSV
Correlation plot before removal of features: Scaled NoCorrs Corrs PNG
6.Dataset splitting: Training and Test sets using DsSplit = ds.train, ds.test - CSVs for train and test; for each dataset splitting (default = 10) repeat steps 7 − 9 = dfMod
7.Regression Methods
Executed for each cross-validation type (non-wrapper or wrapper)
Resulted PDF plots for each method, split and cross-validation type
Resulted CSV files for each method with detailed statistics
Each method return a list containing: statistics values and the fitter model obtained with caret package
All models are memorized into variable dfMod
(a)Basic LM using LMreg ! lm.model appended to dfMod
(b)GLM based on AIC regression using GLMreg ! glm.model appended to dfMod
(c)PLS using PLSreg ! pls.model appended to dfMod
(d)Lasso using LASSOreg ! lasso.model appended to dfMod
(e)Elastic Nets ENETreg ! enet.model appended to dfMod
(f)SVM radial regression using SVLMreg ! SVRM.model appended to dfMod
(g)Neural Networks Regression using NNreg ! nn.model appended to dfMod
(h)Random Forest Regression RFreg ! rf.model appended to dfMod
(i)Random Forest-Recursive Feature Elimination (RF-RFE)
(j)Support Vector Machines Recursive Feature Elimination SVMRFEreg ! svmrfe.model appended to dfMod
8.Fitted models comparison plots.
Models are compared in terms of their resampling results. The resamples() caret function
is used to collect, summarize and contrast the resampling results. Plots are produced per training set (i.e. per data split), and for di↵erent estimated R2 and RMSE values, e.g. ’DifModels.RMSE.iSplits.1.pdf’, ’DifModels.R2.iSplits.1.pdf’. Note that only models with the same resampling scheme are compared.
9.Results
All statistics results (not ordered) = df.res - RRegrsResBySplit.csv
Averaged statistics of the results by each Regression Method and CV type
All results as data table - dt.res
Averaged results = dt.mean
Ordered averaged results = dt.mean.ord - RRegsResAvgs.csv
10.Best model selection – max R2.ts (+/ − 0.005), min RMSE
Max R2.ts = Best model statistics - best.dt
R2.ts for the best model = best.R2 .ts
Add new conditions (max adjR2.ts (+/ − 0.005), min RMSE) = best.dt
Regression method for the best model = best.reg
11.Best model detailed statistics
Detailed statistics for the best model = RRegrsResBest.csv
Run the caret function with the method from the best method = my.stats.reg
Plots for best model with 10-fold CV and last split: RRegrsResBest.csv.repeatedcv.split2.pdf
Regression method for the best model = best.reg
12.Y-randomization for best model – default = 100 times
4 Resampling methods
Model performance is estimated using resampling techniques, namely k-fold cross-validation (CV), leave-one-out (LOO) CV, bootstrapping. Particularly, repeated data splitting is performed (de-fault value 10), whereas during the procedure of building the model a set of modified data sets are created from the training samples based on the options o↵ered by train() and rfe() functions in caret package- NAMES in RRgres. Both functions consider a grid of candidate tunning param-eters, and the final tunning parameter set is chosen based on aggegating resampling performance estimates for each of the hold-out sample set. These performance estimates are used to evaluate which combination(s) of the tuning parameters are appropriate. Once the final tuning values are assigned, the final model is refit using the entire training set. Finally the performance of the model is evaluated on the test set.
By default the root mean square error (RMSE) is used to calculate performance but R2 and adjusted-R2 options are also available.
5 RRegrs package functions
5.1 RRegrs function
RRegrs is the main function of the current package and it permits to execute all regression methods for any dataset in only one call.
RRegrs function is based on 11 regression methods that use caret package. The following subsections will present the regression methods which can also be used individually. In order to do that we need to specify all parameters and the data set used, by either defining the parameters file as we have done before:
>#flag to calculate and print details for all the functions
>fDet <− as.logical(
> Param.df[which(Param.df$RRegrs.Parameters==”fDet”),2])
>
>#to reapeat the ds splitting, different values of seed will be used
>iSeed <− i
>#the fraction of training set from the entire dataset;
>#trainFrac=therestof dataset, the test set
> trainFrac<−as.numeric(
> as.character(
> Param.df[which(Param.df$RRegrs.Parameters==”trainFrac”),2]
> )
>)
>
>#data set folder for input and out put files
>PathDataSet <− as.character(
> Param.df[which(Param.df$RRegrs.Parameters==”PathDataSet”),2]
>)
>
>#input step1 = ds original file name
>DataFileName <− as.character(
> Param.df[which(Param.df$RRegrs.Parameters==”DataFileName”),2]
>)
># Generate path + file name = original dataset
> inFile <− file.path(PathDataSet, DataFileName)
>
> #scaleddsfile name(in the same folder)
> ScaledFile = as.character (
> Param.df[which (Param.df$RRegrs.Parameters==” ScaledFile” ),2]
>)
>
>ds<−read.csv(inFile,header=T)
>
>#returnalistwith2datasets=dsList$train,dsList$test
>dsList<−DsSplit(ds,trainFrac,fDet,PathDataSet,iSeed)
>
>#gettrainandtestfromtheresultedlist
>ds.train<−dsList$train
>ds.test<−dsList$test
>
>#typesofcross−validationmethods
>CVtypes<−strsplit(as.character(
> Param.df[which(Param.df$RRegrs.Parameters==”CVtypes”),2]),”;”
>)[[1]]
Or by individually defining all parameters:
>#flag to calculate and print details for all the functions
>fDet<−FALSE
>
>#to reapeat the ds splitting, different values of seed will be used
>iSeed<−i
>
>#the fraction of training set from the entire dataset;
>trainFrac<− 0.75
>
>
>#dataset folder for input and out put files
>PathDataSet<−’DataResults’
>
>#upload data set
>ds<−read.csv(ds.Housing,header=T)#!!!!
>
>#return a list with 2 data sets= dsList$train, dsList$test
>dsList<−DsSplit(ds, trainFrac, fDet, PathDataSet, iSeed)
>
>#get train and test from the resulted list
>ds.train<−dsList$train
>ds.test<−dsList$test
>
>#types of cross− validation methods
>CVtypes<−c(’repeatedcv’,’LOOCV’)
For this particular example we are looking at the Boston Housing data [1].
5.2 Basic Linear regression (LM) function
LM is called via train() caret function having no extra tuning parameters. RMSE was chosen as the summary metric used to select the optimal model. trainControl() caret function is used to set the resampling method used and its parameters, namely for the k-fold CV we set k=10, and the default value is to repeat the resampling procedure 10 times.
>#define the output file where all results(CSV,PDFfiles)
>#will be stored
>outLM<−’LMoutput.csv’
>LM.fit<−LMreg(
>ds.train,ds.test,CVtypes[1],iSplit=1,fDet=F,outFile=outLM
>)
If the details are used, the function is creating several output files such as a CSV file with all calculation details and PDF files for each cross-validation type and split.
5.3 Generalized Linear Model with Stepwise Feature Selection (GLM) function
GLM is called via train() caret function using the glmStepAIC function from the MASS package. No tuning parameters are set. RMSE was chosen as the summary metric used to select the optimal model. trainControl() caret function is used to set the resampling method used and its parameters, namely for the k-fold CV we set k=10, and the default value is to repeat the resampling procedure 10 times.
>#definetheoutputfilewhereallresults(CSV,PDFfiles)>#willbestored
>outGLM<−’GLMoutput.csv’
>GLM.fit<−GLMreg(
> ds.train,ds.test,CVtype[1],iSplit=1,fDet=F,outFile=outGLM
>)
If the details are used, the function is creating several output files such as a CSV file with all calculation details and PDF files for each cross-validation type and split.
5.4 Partial Least Squares Regression (PLS) function
PLS is called via train() caret function using the mvr function of the pls package. RMSE was chosen as the summary metric used to select the optimal model. The number of components is the tuning parameter of the model, which we set to a sequence of integers from 1 to one fifth of the number of features in the training data set. (If the later is smaller than 1, tuning parameter is set to 1.)
trainControl()
caret function is used to set the resampling method used and its parameters, namely for the k-fold CV we set k=10, and the default value is to repeat the resampling procedure 10 times.
>#define the output file where all results(CSV, PDFfiles)
>#will bestored
>outPLS <− ’PLSoutput.csv’
>PLS.fit <− PLSreg(
> ds.train,ds.test,CVtype[1],iSplit=1,fDet=F,outFile=outPLS
>)
If the details are used, both functions are creating several output files such as a CSV file with all calculation details and PDF files for each cross-validation type and split.
5.5 Lasso regression function
Lasso is called via train() caret function using the enet function of the elasticnet package. RMSE was chosen as the summary metric used to select the optimal model. Fraction is the tuning parameter of the model which is the ratio of the L1 norm of the coeﬃcient vector, relative to the norm at the full LS solution. We have set fraction to vary in a sequence of 10 values between zero and one.
trainControl()
caret function is used to set the resampling method used and its parameters, namely for the k-fold CV we set k=10, and the default value is to repeat the resampling procedure 10 times.
>#define the output file where all results(CSV,PDFfiles)
>#will bestored
>outLASSO<−’LASSOoutput.csv’
>LASSO.fit<−LASSOreg(
>ds.train, ds.test, CVtype[1], iSplit=1, fDet=F, outFile=outLASSO
>)
Following the guidelines by the elasticnet package, LASSOreg only runs for k-fold CV schemes.
If the details are used, the function is creating several output files such as a CSV file with all calculation details and PDF files for each cross-validation type and split.
5.6 Elastic Net regression (ENET) function
Elastic net from glmnet package have mainly two parameters, alpha and lambda. Instead of using the standard caret package sCV parameterization, the proper alpha value is chosen by sCV (alpha =1 lasso, alpha =0 ridge), and labda is chosen using the glmnet package. RMSE was chosen as the summary metric used to select the optimal model.
trainControl() caret function is used to set the resampling method used and its parameters, namely for the k-fold CV we set k=10, and the default value is to repeat the resampling procedure 10 times.
sCV can take the following values: boot, boot632, cv, repeatedcv, LOOCV, LGOCV (for repeated training/test splits), none (only fits one model to the entire training set), oob (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models), adaptive cv, adaptive boot or adaptive LGOCV.
>#define the output file where all results(CSV,PDFfiles)
>#will be stored
>outENET<−’ENEToutput.csv’
>ENET.fit<−ENETreg(my.datf.train,my.datf.test,sCV,iSplit1,fDet=F,outFile=outENET)
If the details are used, the function is creating several output files such as a CSV file with all calculation details and PDF files for each cross-validation type and split.
5.7 Support vector machine using radial functions (SVM radial) regres-sion function
SVM radial is called via train() caret function using the ksvm function of the kernlab package (the kernel function used is ’rbfdot’ Radial Basis Gaussian kernel). RMSE was chosen as the summary metric used to select the optimal model. Tuning parameters in this case are sigma (inverse kernel width) and a regularization parameter C cost (controls how much the regression line can adapt to the data smaller values result in more linear, i.e. flat surfaces.). We have set sigma to be estimated by sigest from kernlab package, and C to vary in c(1,5,10,15,20).
trainControl()
caret function is used to set the resampling method used and its parameters, namely for the k-fold CV we set k=10, and the default value is to repeat the resampling procedure 10 times.
>#define the output file where all results(CSV,PDFfiles)
>#will bestored
>outSVRM<−’SVRMoutput.csv’
>SVRM.fit<−SVRMreg(
> ds.train,ds.test,CVtype[1],iSplit=1,fDet=F,outFile=outSVRM
>)
If the details are used, the function is creating several output files such as a CSV file with all calculation details and PDF files for each cross-validation type and split.
5.8 Neural Networks regression (NN) function
ANis called via train()
caret function using the nnet function of the nnet package. RMSE was chosen as the summary metric used to select the optimal model. Tuning parameters in this case are size and decay, where size refers to the number of units in the hidden layer and decay to the weight decay. Size is set to vary in c(1, 5, 10, 15) and decay within a sequence of values in [0, 0.001].
trainControl()
caret function is used to set the resampling method used and its parameters, namely for the k-fold CV we set k=10, and the default value is to repeat the resampling procedure 10 times.
>#define the output file where all results(CSV,PDFfiles)
>#will be stored
>outNN<−’NNoutput.csv’
>NN.fit<−NNreg(
> ds.train,ds.test,CVtype[1],iSplit=1,fDet=F,outFile=outNN
>)
5.9 Random Forest (RF) regression function
RF (random forest) is called via train() caret function using the randomForest function of the randomForest package. RMSE was chosen as the summary metric used to select the optimal model. Tuning parameters in this case are the number of features selected randomly for each tree in the forest. The recommended value is the square root of the total number of features, however, we recommend a set of values between this value and the total number of features. Of course, the bigger this number, the lower the algorithm. The adjustment values are TO COMPLETE
trainControl()
caret function is used to set the resampling method used and its parameters, namely for the k-fold CV we set k=10, and the default value is to repeat the resampling procedure 10 times. Each random forest grows 1500 trees.
During the model selection process, the sCV method try to find the best number of features automaticaly chosen in each tree of the RF. The possible values are: numberFeatures/3 (default in randomForest Package), numberFeatures and numberFeatures/2.
sCV can take the following values: boot, boot632, cv, repeatedcv, LOOCV, LGOCV (for repeated training/test splits), none (only fits one model to the entire training set), oob (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models), adaptive cv, adaptive boot or adaptive LGOCV.
>#definetheoutputfilewhereallresults(CSV,PDFfiles)>#willbestored
>outRF<−’RFoutput.csv’
>RF.fit<−RFreg(
> my.datf.train,my.datf.test,sCV,iSplit1,fDet=F,outFile=outRF
>)
If the details are used, the function is creating several output files such as a CSV file with all calculation details and PDF files for each cross-validation type and split.
5.10 Support Vector Machines Recursive Feature Elimination (SVM-RFE) regression function
SVM is called via eps-svr function of the kernlab package and uses the RFE function of the caret package to obtaing the best SVM model with the best feature set. RMSE was chosen as the summary metric used to select the optimal model.
trainControl() caret function is used to set the resampling method used and its parameters, namely for the k-fold CV we set k=10, and the default value is to repeat the resampling procedure 10 times.
sCV can take the following values: boot, boot632, cv, repeatedcv, LOOCV, LGOCV (for repeated training/test splits), none (only fits one model to the entire training set), oob (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models), adaptive cv, adaptive boot or adaptive LGOCV.
>#define the output file where all results(CSV,PDFfiles)
>#will bestored
>outSVMRFE<−’SVMRFEoutput.csv’
>SVMRFE.fit<−SVMRFEreg(
> my.datf.train,my.datf.test,sCV,iSplit1,
> fDet=F,outFile=outSVMRFE,cs=c(1,5,15,50),eps=c(0.01,0.1,0.3)
>)
If the details are used, the function is creating several output files such as a CSV file with all calculation details and PDF files for each cross-validation type and split.
5.11 Random Forest-Recursive Feature Elimination (RF-RFE) regres-sion function
RF-RFE represents a wrapper version of RF and, therefore, it will be executed only of feature selection flag was choose.
RF (random forest) is called via train() caret function using the randomForest function of the randomForest package and uses the RFE function of the caret package to obtaing the best SVM model withe the best feature set. RMSE was chosen as the summary metric used to select the optimal model. Tuning parameters in this case are the number of features selected randomly for each tree in the forest. The recommended value is the square root of the total number of features, however, we recommend a set of values between this value and the total number of features. Of course, the bigger this number, the lower the algorithm. The adjustment values are TO COMPLETE
trainControl()
caret function is used to set the resampling method used and its parameters, namely for the k-fold CV we set k=10, and the default value is to repeat the resampling procedure 10 times. Each random forest grows 1500 trees.
During the model selection process, the sCV method try to find the best number of features automaticaly chosen in each tree of the RF. The possible values are: numberFeatures/3 (default in randomForest Package), numberFeatures and numberFeatures/2.
sCV can take the following values: boot, boot632, cv, repeatedcv, LOOCV, LGOCV (for repeated training/test splits), none (only fits one model to the entire training set), oob (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models), adaptive cv, adaptive boot or adaptive LGOCV.
>#define the output file where all results(CSV,PDFfiles)
>#will be stored
>outRFRFE<−’RFRFEoutput.csv’
>RFRFE.fit<−RFRFEreg(
> my.datf.train,my.datf.test,sCV,iSplit1,
> fDet=F,outFile=outRFRFE
>)
If the details are used, the function is creating several output files such as a CSV file with all calculation details and PDF files for each cross-validation type and split.
Several function have been created in order to split the dataset, remove near-zero variance features, remove correlated features, etc. (the the code flow section).
5.12 Removal of near zero variance columns
This function is based on nearZeroVar function from caret and it has several parameters as input: ds = dataset features without predicted variable (as data frame), fDet = if details (default is FALSE), outFile = output file with modified dataset (default is ”ds3.No0Var.csv”):
>#Removal of near zero variance columns and add the predicted variable
>#=>ds.new=newdataset(asdataframe)
>ds.new<−cbind(
> ”net.c”=ds[,1],RemNear0VarCols(ds[,2:dim(ds)[2]],fDet,outFile)
>)
5.13 Scaling dataset
This function is based on scale function from caret and it has several parameters as input: ds = dataset features (as data frame), s = 1,2,3 - type of scaling: 1 = normalization, 2 = standard-ization, 3 = other (default = 1 = Normalization), c = the number of column into the dataset to start scaling (default = 1; if c = 1, included the dependent variable; if c = 2, only the features will be scaled), fDet = if details (default is FALSE), outFile = output file with modified dataset (default is ”ds4.scaled.csv”). If s di↵erent of 1,2,3 is used, there is no scaling.
>#Scaling dataset=>ds.new=newdataset (as data frame)
>ds.new<−ScalingDS(ds,iScaling,iScalCol,fDet,outFile)
5.14 Remove correlated features
This function is based on scale functions from caret and corrplot packages. It has several param-eters as input: ds = dataset frame, fDet = flag for details (TRUE/FALSE), cutoff= correlation cut o↵(ex: 0.9), outFileName = file name with the corrected dataset (it could include the path).
>#Remove the correlated columns => ds.new=new
>#(as data frame)and rebuild the new data frame
>#predicted variable
>ds.new <− cbind(data set with the
> ”net.c”=ds[,1],RemCorrs(ds[,2:dim(ds)[2]],fDet,cutoff,outFile)
>)
Additional files are generated if fDet is TRUE such as correlation matrix (CSV) and plot the correlation plot before correlation removal (PDF).
5.15 Dataset splitting in Training and Test
This function is based on createDataPartition function from caret and it has several parameters as input: ds = frame dataset object, trainFrac = training set ratio from the entire dataset (default=3/4 = 75%), fDet = flag for detais (default = FALSE), PathDataSet = pathway for results, iSeed = number to be used as seed.
>#Dataset spliting in Training and Test => dsList=list with
>#two data frames for trainingand test
>dsList <− DsSplit(ds,trainFrac,fDet,PathDataSet,iSeed)
5.16 Y-randomization for the best model
It uses only one splitting, 10-cross validation (repeatedcv) and the best method. Best R2 for test (best.R2.ts) will be compared with R2 in test with Y-randomization (Yrand.R2.ts) and it returns ratios between the di↵erence of R2 and best R2 (Di↵sR2/bestR2). DsSplit is using for splitting the dataset.
The function is appending outputs to the general CSV statistics file and a histogram as PDF plot.
It is based on createDataPartition function from caret and it has several parameters as input: ds = frame dataset object, trainFrac = training set ratio from the entire dataset, best.reg = label of the method, best.r2.ts = value of R2 for the best model in test set, noYrand = number of randomization (optimal = 100), ResBetF = file name for the best model output CSV file.
>#Y−randomizationtest
>#parallelsupportusing2CPUcores
>R2Diff.Yrand<−Yrandom(
> ds,trainFrac,best.reg,best.R2.ts,
> noYrand,ResBestF
>)
This function calls the best model regression method and, if it is a complex method, there is a need of parallel calculation.
5.17 Auxiliary functions
Several functions have been developed in order to print specific data types to text or CSV files or to calculate several statistics:
r2.adj.funct = calculates adjusted R2
r2.adj.lm.funct = calculates adjusted R2 for LM
rmse.funct = calculates RMSE
r2.funct = calculates R2
AppendList2CSv = writes a list to CSV file
AppendList2txt = writes a list to TXT file
findResamps.funct = find the number of re-samples for caret, rfe or sbf objects from caret package
svmFuncsW$fit = calculates the best model with the best feature set (c and epsilon)
svmFuncsW$rank = calculates the k w2 k as ranking criterion for measuring the importance of a particular featur in the RFE process.
svmFuncsW$pred = [to be completed]
svmFuncsGradW$rank = calculates gradient w, as proposed by Rakotomamonjy et. al based on the gradient of SVM coeﬃcients.
6 Final model
When all models are build based on the resampling scheme discussed above, the best model is selected given RMSE values on the test set, averaged over the 10 data splits. In fact only one winning model is reported at the end of the process with all model parameters and performance statistics. Nevertheless, the user can ask for full details in the working folder.
7 Output
RRegrs returns a list with three items: the name of the best method, the statistics for the best model, the list with all the fitted models based on caret functions (including the best model).
8 Example: Regression model for Boston House dataset
The following examples show simple calls of the RRegrs() function using a specific dataset file entitled ”MyDataSet.csv” that it should be provided by the user:
>library(RRegrs)
>
>#Run RRegrs with all default parameters
>#default data set file(”ds.House.csv”) and
>#working directory (”DataResults”)
>#run all regression methods
>#10 splittings, 100 timesY−randomization,
>#no parallel support for CPUcores
>RRegrsResults=RRegrs()
>
>#Run RRegrs for a specific data set file with default parameters
>#including the default directory(”DataResults”)
>RRegrsResults=RRegrs(DataFileName=”MyDataSet.csv”)
>
>#Run RRegrs for aspecific data set file(”MyDataSet.csv”)and
>#working folder (”MyResultsFolder”); both should exist
>#the rest of RRegrs parameters have default values
>RRegrsResults=RRegrs(DataFileName=”MyDataSet.csv”,
> PathDataSet=”MyResultsFolder”)
Method | repeatedcv | LOOCV |
---|---|---|
LM | 18.50 | 1.65 |
GLM | 3.02 | 8.86 |
PLS | 1.14 | 1.58 |
Lasso | 1.30 | - |
ENET | 13.74 | 49.45 |
SVM radial | 5.55 | 14.61 |
NN | 12.70 | 49.97 |
RF | 92.44 | - |
RF-RFE | 3.65 | - |
SVM-RFE | 60.75 | - |
Table 1: RRegrs execution time for a split of Boston House dataset.
The output variable RRegrsResults is a complex object which contains the object of the fitted models and the main statistics for each regression model. Details about each function are presented into the tutorial of the RRegrs package.
The following example could be used to test the RRegrs package using a the Boston housing dataset [1] from RRegrs GitHub URL. It has 13 features and 506 cases:
>library(RRegrs)
>
>#Create defaul tworking directory ”DataResults”
>dir.create(”DataResults”)
>
>#Get Housing dataset ”ds.House.csv” from RRegrs GitHub
>#in the default RRegrs directory ”DataResults”
>download.file(”https://raw.githubusercontent.com/enanomapper/RRegrs/master/TEST/d
>”DataResults/ds.House.csv”,method=”auto”,quiet=FALSE)
>
>#RRegrs call with default parameters
>RRegrsResults=RRegrs()
If you already have this dataset locally in the default working directory of RRegrs (”DataRe-sults”), the following call could be used:
>#Search for the best regression model using parameter file
>ComplexOutput<− RRegrs(DataFileName=”ds.House.csv”,
> noCores=0,iSplitTimes=2,noYrand=2)
The RRegrs call uses all available CPU cores for the complex methods, 2 splittings, 2 Y-randomizations, and all the regression methods.Table 1 presents the execution times on an Win-dows 8.1 64bit with i7-4790 CPU (3.60GHz, 4 cores, 8 logical cores), 16G RAM. repeatedcv rep-resents the 10-fold cross-validation. The total execution time was 11.63 minutes. The most important files into the working folder are:
Data set file = ds.House.csv (Fig. 1). First column represents the predicted variable (depen-dent variable) and the other 13 columns are the original features.
Parameter file = Parameters.csv (Fig. 2). It contains all the parameters used to run RRegrs function.
RRegrsResAllSplits.csv = Statistics for all data set splitting, method and CV type (Fig. 3)
RRegsResAvgs.csv = Averages statistics by method/CV type (Fig. 4)
RRegrsResBest.csv = Detailed statistics for the automatically best model (best R2 in test dataset with minimum RMSE)
Comparison plots = DifModels.R2.iSplits.1.pdf (Fig. 5), DifModels.RMSE.iSplits.1.pdf (Fig. 6), ModelsComp.iSplits.1.pdf (Fig. 7), DifModels.R2.iSplits.2.pdf, DifModels.RMSE.iSplits.2.pdf, ModelsComp.iSplits.2.pdf.
ResBestF,”.repeatedcv.split”,i,”.pdf” = Best model plots only for repeatedcv
ResBestF,”.Yrand.Hist.pdf” = Best model Y-randomization plot
Additional files are presented for each regression method such as CSV detailed statistics and PDF plots. One PNG is created (ds.scaled.NoCorrs.csv.corrs.png) in order to show the correlated features in the original dataset (Fig. 8). The initial names of the features have been replaced with V2-V14.
Figure 1: Boston House data set header.