% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/utilities.R
\name{data_process}
\alias{data_process}
\title{data_process: Process tabular data into the format required by the PIE model}
\usage{
data_process(
  X,
  y,
  num_col,
  cat_col,
  y_col,
  k = 5,
  validation_rate = 0.2,
  spline_num = 5,
  random_seed = 1
)
}
\arguments{
\item{X}{Feature columns of the dataset.}

\item{y}{Target column of the dataset.}

\item{num_col}{Indices of the columns that are numerical features.}

\item{cat_col}{Indices of the columns that are categorical features.}

\item{y_col}{Index of the column that is the response.}

\item{k}{Number of folds for the cross-validation split. Defaults to \code{k = 5}.}

\item{validation_rate}{Fraction of the training data held out as validation data. Defaults to \code{validation_rate = 0.2}.}

\item{spline_num}{Degrees of freedom for the natural splines. Defaults to \code{spline_num = 5}.}

\item{random_seed}{Random seed for the cross-validation data split. Defaults to \code{random_seed = 1}.}
}
\value{
A list containing:
\item{spl_train_X}{A list of splined training datasets, in which each numerical feature is expanded
into \code{spline_num} columns. The list has \code{k} elements, one per fold.}
\item{orig_train_X}{A list of original training datasets, in which the numerical features remain in
their original format. The list has \code{k} elements, one per fold.}
\item{train_y}{A list of vectors holding the target variable for the training datasets. The list has
\code{k} elements, one per fold.}
\item{spl_validation_X}{A list of splined validation datasets, in which each numerical feature is expanded
into \code{spline_num} columns. The list has \code{k} elements, one per fold.
It is \code{NULL} when \code{validation_rate == 0}.}
\item{orig_validation_X}{A list of original validation datasets, in which the numerical features remain in
their original format. The list has \code{k} elements, one per fold.
It is \code{NULL} when \code{validation_rate == 0}.}
\item{validation_y}{A list of vectors holding the target variable for the validation datasets. The list has
\code{k} elements, one per fold. It is \code{NULL} when \code{validation_rate == 0}.}
\item{spl_test_X}{A list of splined testing datasets, in which each numerical feature is expanded
into \code{spline_num} columns. The list has \code{k} elements, one per fold.}
\item{orig_test_X}{A list of original testing datasets, in which the numerical features remain in
their original format. The list has \code{k} elements, one per fold.}
\item{test_y}{A list of vectors holding the target variable for the testing datasets. The list has
\code{k} elements, one per fold.}
\item{lasso_group}{A vector of consecutive integers describing the grouping of the coefficients.}
}
\description{
This function takes a tabular dataset and its metadata (such as which columns are numerical and which are categorical), then outputs a k-fold cross-validation dataset with
splines on the numerical features in order to capture non-linear relationships among numerical features. Within this function, the numerical features and the target
variable are normalized, and the columns are reorganized into the order (numerical features, categorical features, target).
}
\details{
The function generates a cross-validation dataset suitable for the PIE model. The output contains training,
validation, and testing datasets, along with the group indicator for group lasso. When \code{k = 5}, each fold
splits the data 80/20 into training and testing sets. When \code{validation_rate = 0.2}, 20\% of the training data is held out as validation data.
Setting \code{validation_rate = 0} generates only training and testing data, with no validation data.
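
As a rough illustration of the split sizes implied by the defaults, the sketch below works through the
arithmetic for a hypothetical input of 200 rows, which gives roughly 128 training, 32 validation, and 40
testing rows per fold. The variable names are for illustration only and are not part of the package, and the
exact counts returned by \code{data_process} may differ slightly depending on how it rounds fold sizes.

\preformatted{
# Hypothetical sizes for illustration; these values are not computed by the package
n <- 200
k <- 5
validation_rate <- 0.2

n_test <- n / k                                        # rows held out for testing in each fold
n_validation <- round((n - n_test) * validation_rate)  # rows moved from training to validation
n_train <- n - n_test - n_validation                   # rows left for training
c(train = n_train, validation = n_validation, test = n_test)
}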
}
\examples{
\donttest{
# Load the training data
data("winequality")

# Which columns are numerical?
num_col <- 1:11
# Which columns are categorical?
cat_col <- 12
# Which column is the response?
y_col <- ncol(winequality)

# Data processing (only the first 200 rows are used for demonstration)
dat <- data_process(X = as.matrix(winequality[1:200, -y_col]), 
  y = winequality[1:200, y_col], 
  num_col = num_col, cat_col = cat_col, y_col = y_col)
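
# A minimal sketch of inspecting the result, based only on the components
# listed under Value; the exact printed output is not shown here.
# Top-level components of the returned list
names(dat)
# Per-fold lists are documented to have one element per fold (k = 5 by default)
length(dat$spl_train_X)
# Group indicator for group lasso
dat$lasso_group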
}
}
