% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/diagnose.R
\name{diagnose_category}
\alias{diagnose_category}
\alias{diagnose_category.data.frame}
\title{Diagnose data quality of categorical variables}
\usage{
diagnose_category(.data, ...)

\method{diagnose_category}{data.frame}(.data, ..., top = 10,
  add_character = TRUE)
}
\arguments{
\item{.data}{a data.frame or a \code{\link{tbl_df}}.}

\item{...}{one or more unquoted expressions separated by commas.
You can treat variable names like they are positions.
Positive values select variables; negative values drop variables.
If the first expression is negative, diagnose_category() will automatically
start with all variables.
These arguments are automatically quoted and evaluated in a context where
column names represent column positions.
They support unquoting and splicing.}

\item{top}{an integer. The number of top-ranked levels to extract.
Default is 10.}

\item{add_character}{logical. Whether to include character variables in the
diagnosis of categorical data. The default is TRUE, which includes character variables.}
}
\value{
An object of class tbl_df.
}
\description{
diagnose_category() produces information for diagnosing the quality
of the categorical variables of a data.frame or tbl_df.
}
\details{
The diagnosis covers how observations are distributed across the levels
of each categorical variable. If a single level accounts for close to
100\% of the observations, consider removing the variable from the
predictive model. Conversely, if every level accounts for close to 0\%
of the observations, the variable is likely to be an identifier.
}
\section{Categorical diagnostic information}{

The information derived from the categorical data diagnosis is as follows.

\itemize{
\item variables : variable names
\item levels : level names
\item N : number of observations
\item freq : number of observations at each level
\item ratio : percentage of observations at each level
\item rank : rank of the occupancy ratio of the levels
}

See vignette("diagonosis") for an introduction to these concepts.
}

\examples{
# Generate data for the example
carseats <- ISLR::Carseats
carseats[sample(seq(NROW(carseats)), 20), "Income"] <- NA
carseats[sample(seq(NROW(carseats)), 5), "Urban"] <- NA

# Diagnosis of categorical variables
diagnose_category(carseats)

# Select the variable to diagnose
diagnose_category(carseats, ShelveLoc, Urban)
diagnose_category(carseats, -ShelveLoc, -Urban)
diagnose_category(carseats, "ShelveLoc", "Urban")
diagnose_category(carseats, 7)
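
# A rough sketch of how the diagnostic columns (N, freq, ratio, rank)
# could be derived by hand for one variable with base R.
# Illustration only: the NA handling (useNA = "ifany") and the
# tie-breaking rule (ties.method = "min") are assumptions, not
# necessarily what diagnose_category() does internally.
tab <- table(carseats$ShelveLoc, useNA = "ifany")
manual <- data.frame(
  levels = names(tab),                      # level names
  N = sum(tab),                             # total number of observations
  freq = as.vector(tab),                    # observations at each level
  ratio = as.vector(tab) / sum(tab) * 100   # percentage at each level
)
manual$rank <- rank(-manual$freq, ties.method = "min")
manual[order(manual$rank), ]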

# Using pipes ---------------------------------
library(dplyr)

# Diagnosis of all categorical variables
carseats \%>\%
  diagnose_category()
# Positive values select variables
carseats \%>\%
  diagnose_category(Urban, US)
# Negative values drop variables
carseats \%>\%
  diagnose_category(-Urban, -US)
# Positive positions select variables
carseats \%>\%
  diagnose_category(7)
# Negative positions drop variables
carseats \%>\%
  diagnose_category(-7)
# Top rank levels with top argument
carseats \%>\%
  diagnose_category(top = 2)
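# Exclude character variables with the add_character argument.
# Carseats has no character columns, so ShelveLoc is converted here
# purely for illustration; it should then be left out of the result.
carseats \%>\%
  mutate(ShelveLoc = as.character(ShelveLoc)) \%>\%
  diagnose_category(add_character = FALSE)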

# Using pipes & dplyr -------------------------
# Extract the levels that account for at least 60\% of the observations
carseats \%>\%
  diagnose_category()  \%>\%
  filter(ratio >= 60)
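
# A possible screen for variables dominated by a single level, as
# described in Details. The 95 cutoff is an arbitrary, illustrative
# threshold, not a package default; an empty result means that no
# variable meets it.
carseats \%>\%
  diagnose_category(top = 1) \%>\%
  filter(ratio >= 95)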
}
\seealso{
\code{\link{diagnose_category.tbl_dbi}}, \code{\link{diagnose.data.frame}}, \code{\link{diagnose_numeric.data.frame}}, \code{\link{diagnose_outlier.data.frame}}.
}
