\name{DDPstar}
\alias{DDPstar}
\title{
Density Regression via Dirichlet Process Mixtures (DDP) of Normal Structured Additive Regression (STAR) Models
}
\description{
This function estimates the DDPstar model proposed by Rodriguez-Alvarez, Inacio, et al. (2025).
}
\usage{
DDPstar(formula, data, standardise = TRUE, compute.lpml = FALSE, compute.WAIC = FALSE, 
  compute.DIC = FALSE, prior = priorcontrol(), mcmc = mcmccontrol(), verbose = FALSE)
}
\arguments{
\item{formula}{A \code{\link[stats]{formula}} object specifying the regression function associated to each component of the DDPstar model used to estimate the conditional density function of the response variable. Regarding the modelling of continuous covariates, both linear and nonlinear effects are allowed, with nonlinear effects being modelled through P-splines (see Note)}
\item{data}{A data frame representing the data and containing all needed variables.}
\item{standardise}{A logical value. If TRUE both the responses and the continuous covariates are standardised (i.e., the resulting variables have mean zero and standard deviation of one). The default is TRUE.}
\item{compute.lpml}{A logical value. If TRUE, the log pseudo marginal likelihood (LPML, Geisser and Eddy, 1979) and the conditional predictive ordinates (CPO) are computed.}
\item{compute.WAIC}{A logical value. If TRUE, the widely applicable information criterion (WAIC, Gelman et al., 2014; Watanabe, 2010) is computed.}
\item{compute.DIC}{A logical value. If TRUE, the deviance information criterion (DIC)(Celeux et al., 2006, Spiegelhalter et al., 2002) is computed.}
\item{prior}{Hyparameter specification. A list of control values to replace the default values returned by the function \code{\link{priorcontrol}}. See \code{\link{priorcontrol}} for details.}
\item{mcmc}{A list of control values to replace the default values returned by the function \code{\link{mcmccontrol}}. See \code{\link{mcmccontrol}} for details.}
\item{verbose}{A logical value indicating whether additional information should be displayed while the MCMC algorithm is running. . The default is FALSE}
}
\details{
Our DDPstar model for the conditional density takes the following form 
\deqn{p(y|\boldsymbol{x}) = \sum_{l=1}^{L}\omega_{l}\phi(y\mid\mu_{l}(\boldsymbol{x}),\sigma_{l}^2),}
where \eqn{\phi(y|\mu, \sigma^2)} denotes the density function of the normal distribution, evaluated at \eqn{y}, with mean \eqn{\mu} and variance \eqn{\sigma^2}. The regression function \eqn{\mu_{l}(\boldsymbol{x})} can incorportate both linear and nonlinear (through P-splines) effects of continuous covariates, random effects, categorical covariates (factors) as well as interactions. Interactions between categorical and (nonlinear) continuous covariates are also allowed (factor-by curve interactions). 

The mean (regression function) of each mixture component, \eqn{\mu_l(\cdot)}, is compactly specified (see Rodriguez-Alvarez, Inacio, et al. (2025) for details) as
\deqn{
\mu_{l}(\boldsymbol{x}) = \underbrace{\mathbf{x}^{\top}\boldsymbol{\beta}_{l}}_{\mbox{Parametric}} + \underbrace{\sum_{r = 1}^{R}\mathbf{z}_{r}(\boldsymbol{v}_r)^{\top}\boldsymbol{\gamma}_{lr}}_{\mbox{Smooth/Random}}}

where \eqn{\mathbf{x}^{\top}\boldsymbol{\beta}_{l}} contains all parametric effects, including, by default, the intercept, parametric effects of categorical covariates and linear or parametric effects of continuous covariates and their interactions, as well as, the unpenalised terms associated with the P-splines formulation, and \eqn{\sum_{r = 1}^{R}\mathbf{z}_{r}(\boldsymbol{v}_r)^{\top}\boldsymbol{\gamma}_{lr}} contains the P-splines' penalised terms, as well as, the random effects.

In our DDPstar model, \eqn{L} is a pre-specified upper bound on the number of mixture components. The \eqn{\omega_{l}}'s result from a truncated version of the stick-breaking construction (\eqn{\omega_{1} = \eta_{1}}; \eqn{\omega_{l} = \eta_{l}\prod_{r<l}(1-\eta{r})}, \eqn{l=2,\ldots,L}; \eqn{\eta_{1},\ldots,\eta_{L-1}\sim} Beta \eqn{(1,\alpha)}; \eqn{\eta_{L} = 1}). Further, for \eqn{\boldsymbol{\beta}_{l}} we consider
\deqn{
p\left(\boldsymbol{\beta}_{l} \mid \mathbf{m}, \mathbf{S}\right) \propto \exp\left(-\frac{1}{2}\left(\boldsymbol{\beta}_{l} - \mathbf{m}\right)^{\top}\mathbf{S}^{-1}\left(\boldsymbol{\beta}_{l} - \mathbf{m}\right)\right),
}
with conjugate hyperpriors \eqn{\mathbf{m} \sim N_{Q}(\mathbf{m}_{0},\mathbf{S}_{0})} and \eqn{\mathbf{S}^{-1}\sim W_{Q}(\nu,(\nu\Psi)^{-1})}, where \eqn{W(\nu,(\nu\Psi)^{-1})} denotes a Wishart distribution with \eqn{\nu} degrees of freedom and expectation \eqn{\Psi^{-1}}, and \eqn{Q} denotes the length of vector \eqn{\boldsymbol{\beta}_{l}}. Regarding the prior distributions for the P-splines' penalised coefficients and random effect coefficients we have
\deqn{
p(\boldsymbol{\gamma}_{lr}\mid \tau_{lr}^{2}) \propto \exp\left(-\frac{1}{2\tau_{lr}^{2}}\boldsymbol{\gamma}_{lr}^{\top}\mathbf{K}_{r}\boldsymbol{\gamma}_{lr}\right),
}
where the specific forms of \eqn{\mathbf{K}_{r}} depend on the components type (see Rodriguez-Alvarez, Inacio, et al. (2025) for details). Finally, \eqn{\sigma_{l}^{-2}\sim\Gamma(a_{\sigma^2},b_{\sigma^2})}, \eqn{\tau_{lr}^{-2}\sim\Gamma(a_{\tau^2},b_{\tau^2})} and \eqn{\alpha \sim \Gamma(a_{\alpha}, b_{\alpha})}, where \eqn{\Gamma(a,b)} denotes a Gamma distribution with shape parameter \eqn{a} and rate parameter \eqn{b}. For a detailed description, we refer to Rodriguez-Alvarez, Inacio, et al. (2025).

It is worth referring that with respect to the computation of the DIC, when \eqn{L=1}, it is computed as in Spiegelhalter et al. (2002), and when \eqn{L>1}, DIC3 as described in Celeux et al. (2006) is computed. Also, for the computation of the conditional predictive ordinates (CPO) we follow the stable version proposed by Gelman et al. (2014).
}
\value{
An object of class \code{DDPstar}. It is a list containing the following objects:
\item{call}{The matched call.}
\item{data}{The original supplied data argument.}
\item{data_model}{A list with the data used in the fit: response variable and design matrix.}
\item{missing.ind}{A logical value indicating whether for each pair of observations (responses and covariates) missing values occur.}
\item{mcmc}{A list of control values for the Markov chain Monte Carlo (MCMC)}
\item{fit}{Named list with the following information: (1) \code{formula}: the value of the argument \code{formula} used in the call. (2) \code{mm}: information needed to construct the model matrix associated with the DDPstar model. (3) \code{beta}: array of dimension \code{nsave}\eqn{\times}\code{L}\eqn{\times}\code{K} with the sampled regression coefficients. (4) \code{sd}: matrix of dimension \code{nsave}\eqn{\times}\code{L} with the sampled variances. (5) \code{probs}: matrix of dimension \code{nsave}\eqn{\times}\code{L} with the sampled components' weights. (6) \code{tau_sm}: matrix of dimension \code{nsave}\eqn{\times}\code{L}\eqn{\times}\code{R} with the sampled variances for smooth/random terms. (7)\code{S}: matrix of dimension \code{n}\eqn{\times}\code{nsave} with the sampled allocations. (8) \code{alpha}: vector of length \code{nsave} with the sampled precision parameter
of the Dirichlet Process. Here, \code{nsave} is the number of Gibbs sampler iterations saved, \code{L} is the maximum number of mixture components, \code{K} is the dimension of the coefficient vector (including both parametric and nonlinear/smooth/random effects), \code{R} is the number of nonlinear/smooth/random components, and \code{n} is the number of observations.}
\item{lpml}{If computed, a list with two components: the log pseudo marginal likelihood (LPML) and the conditional predictive ordinates (CPO).}
\item{WAIC}{If computed, a list with two components: widely applicable information criterion (WAIC) and associated complexity penalty (pW).}
\item{DIC}{If computed, a list with two components: deviance information criterion (DIC) and associated complexity penalty (pD).}
}
\references{
Celeux, G., Forbes, F., Robert C. P., and Titerrington, D. M. (2006). Deviance information criteria for missing data models. Bayesian Analysis, \bold{1}, 651-674.

Eilers, P.H.C. and Marx, B.D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, \bold{11}(2), 89-121.

Eilers, P.H.C. and Marx, B.D. (2003). Multidimensional calibration with temperature interaction using two- dimensional penalized signal regression. Chemometrics and Intelligence Laboratory Systems, \bold{66}, 159-174.

De Iorio, M., Muller, P., Rosner, G. L., and MacEachern, S. N. (2004). An anova model for dependent random measures. Journal of the American Statistical Association, \bold{99}(465), 205-215

De Iorio, M., Johnson, W. O., Muller, P., and Rosner, G. L. (2009). Bayesian nonparametric nonproportional hazards survival modeling. Biometrics, \bold{65}, 762--775.

Geisser, S. and Eddy, W.F. (1979) A Predictive Approach to Model Selection, Journal of the American Statistical Association, \bold{74}, 153-160.

Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., and Rubin, D.B. (2014). Bayesian Data Analysis, 3rd ed. CRC Press: Boca Raton, FL.

Gelman, A., Hwang, J., and Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, \bold{24}, 997-1010.

Lang, S. and Brezger, A. (2004). Bayesian P-splines. Journal of Computational and Graphical Statistics, \bold{13}(1), 183-212.

Lee, D.-J., Durban, M., and Eilers, P. (2013). Efficient two-dimensional smoothing with P-
spline ANOVA mixed models and nested bases. Computational Statistics & Data Analysis, \bold{61}, 22-37.

Rodriguez-Alvarez, M. X, Inacio, V. and Klein N. (2025). Density regression via Dirichlet process mixtures of normal structured additive regression models. Accepted for publication in Statistics and Computing (DOI: 10.1007/s11222-025-10567-0).

Speigelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002). Bayesian measures of model comparison and fit. Journal of the Royal Statistical Society, Ser. B, \bold{64}, 583-639.

Watanabe, S. (2010). Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory. Journal of Machine Learning Research, \bold{11}, 3571-3594.
}
\note{
The input argument \code{formula} is similar to that used for the \code{\link{glm}} function, except that flexible specifications can be added by means of function \code{\link{f}}. For instance, specification \code{y ~ x1 + f(x2)} would assume a linear effect of \code{x1} (if \code{x1} is continuous) and the effect of \code{x2} (assumed continuous)) would be modeled using Bayesian P-splines. Categorical variables (factors) can be also incorporated, as well as random effects (see \code{\link{rae}}) and interaction terms. For example, to include the factor-by-curve interaction between \code{x4} (continuous covariate) and \code{x3} (categorical covariate) we need to specify \code{y ~ x3 + f(x4, by = x3)}.}

\seealso{
\code{\link{f}}, \code{\link{rae}}, \code{\link{predict.DDPstar}}
}
\examples{
library(DDPstar)
data(dde)
dde$GAD <- dde$GAD/7 # GAD in weeks
\donttest{
formula.dde <- GAD ~ f(DDE, bdeg = 3, nseg = 20, pord = 2, atau = 1, btau = 0.005)
mcmc <- mcmccontrol(nburn = 20000, nsave = 15000, nskip = 1)
prior <- priorcontrol(a = 2, b = 0.5, aalpha = 2, balpha = 2, L = 20)

set.seed(10) # For reproducibility
fit_dde <- DDPstar(formula = formula.dde, 
  data = dde, mcmc = mcmc, prior = prior, 
  standardise = TRUE, compute.lpml = TRUE, compute.WAIC = TRUE, compute.DIC = TRUE)
summary(fit_dde)
}
\dontshow{
mcmc <- list(nburn = 20, nsave = 15, nskip = 1)

set.seed(10) # For reproducibility
fit_dde <- DDPstar(formula = GAD ~ f(DDE, bdeg = 3, nseg = 20, pord = 2, atau = 1, btau = 0.005), 
data = dde, mcmc = mcmc, prior = list(a = 2, b = 0.5, aalpha = 2, balpha = 2, L = 20), 
standardise = TRUE, compute.lpml = TRUE, compute.WAIC = TRUE)
summary(fit_dde)
}
}