dataProfilerR

Automated exploratory data analysis for R. Point it at a data frame and it returns a structured profile — column types, missingness, distributional statistics, normality tests, outliers, correlations, a data-quality score, and ggplot2 figures — through a single function, profile_data().

The aim is to cover the first hour of EDA that you’d otherwise write by hand for every new dataset, while keeping the result a plain, inspectable object you can build on.

Installation

# install.packages("remotes")
remotes::install_github("mqfarooqi1/dataProfilerR")

The package depends on ggplot2. The Anderson–Darling normality test additionally requires nortest, a suggested (optional) package; if it isn’t installed, only Shapiro–Wilk is run.
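The optional-dependency behaviour described above can be sketched as follows. This is an illustration of the usual "suggested package" pattern, not the package's own code: check_normality() and its max_n argument are hypothetical names. The sketch relies on two documented limits: stats::shapiro.test() accepts 3 to 5000 values, and nortest::ad.test() needs more than 7.

```r
# Hypothetical helper illustrating an optional dependency on nortest.
# Not part of dataProfilerR; the name and arguments are assumptions.
check_normality <- function(x, max_n = 5000) {
  x <- x[!is.na(x)]
  stopifnot(is.numeric(x), length(x) >= 8)

  # shapiro.test() rejects more than 5000 values, so draw a random subsample.
  if (length(x) > max_n) {
    x <- sample(x, max_n)
  }

  out <- list(
    n          = length(x),
    shapiro_p  = stats::shapiro.test(x)$p.value,
    anderson_p = NA_real_          # stays NA when nortest is unavailable
  )

  # Only call into nortest if it is installed; never attach it.
  if (requireNamespace("nortest", quietly = TRUE)) {
    out$anderson_p <- nortest::ad.test(x)$p.value
  }
  out
}

res <- check_normality(rnorm(6000))
res$n            # capped at max_n
res$shapiro_p
```

Because the subsample is random, p-values for large columns vary between runs unless a seed is set beforehand.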

Quick start

library(dataProfilerR)

p <- profile_data(iris)
p                                  # concise overview + quality score
summary(p)                         # numeric summary, missingness, normality, outliers, correlations
plot(p, which = "correlation")     # retrieve a figure
plot(p, which = "distribution", column = "Sepal.Length")

# components are just list elements
p$metadata$column_types
p$diagnostics$quality$score
p$statistics$numeric

# grouped comparison + a self-contained HTML report (needs pandoc)
p <- profile_data(iris, group_by = "Species")
p$diagnostics$groups$numeric_by_group
report(p, "iris_report.html")

See the vignette (vignette("dataProfilerR")) for a full walkthrough on a messy dataset.

Architecture and design decisions

The package is organised as a pipeline of independent, individually-callable functions, with one orchestrator on top:

                       profile_data()                 <- orchestrator
        ┌───────────────────┼───────────────────────────────────┐
   profiling            statistics                          visualization
   ─────────            ──────────                          ─────────────
   infer_column_types   normality_tests                     plot_missing
   analyze_missing      detect_outliers / outlier_summary   plot_distribution
   summarize_columns    correlation_analysis                plot_correlation
   data_quality_score                                       plot_boxplots
                                                            plot_pairs
                          │
                          ▼
                  data_profile (S3 object)  ──  print() / summary() / plot()
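To make the shape of this design concrete, here is a deliberately tiny sketch of the same pattern: small standalone functions, one orchestrator that collects their results in a list, and an S3 class on top. Everything prefixed mini_ is hypothetical and far simpler than the real package; only the element names metadata and diagnostics echo the Quick start.

```r
# Hypothetical miniature of the orchestrator pattern.
# None of these functions are part of dataProfilerR.

# Independent step 1: first class of each column.
mini_types <- function(df) {
  vapply(df, function(col) class(col)[1], character(1))
}

# Independent step 2: fraction of missing values per column.
mini_missing <- function(df) {
  vapply(df, function(col) mean(is.na(col)), numeric(1))
}

# Orchestrator: validate once, run each step, return a classed plain list.
mini_profile <- function(df) {
  stopifnot(is.data.frame(df))
  out <- list(
    metadata    = list(n_rows = nrow(df), n_cols = ncol(df),
                       column_types = mini_types(df)),
    diagnostics = list(missing_rate = mini_missing(df)),
    call        = match.call()
  )
  structure(out, class = "mini_profile")
}

# S3 print method: concise overview, returns the object invisibly.
print.mini_profile <- function(x, ...) {
  cat("<mini_profile>", x$metadata$n_rows, "rows x",
      x$metadata$n_cols, "columns\n")
  invisible(x)
}

# Class predicate, similar in spirit to is_data_profile().
is_mini_profile <- function(x) inherits(x, "mini_profile")

p <- mini_profile(iris)
print(p)
p$metadata$column_types[["Species"]]   # components are ordinary list elements
```

Each step can be called on its own (for example mini_missing(iris)), and the result stays inspectable with str() or $ because the class is only a tag on a list.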

Design choices worth calling out:

Function reference

Profiling

Function                Purpose
──────────────────────  ───────
infer_column_types(df)  Classify each column; character columns split into categorical vs text.
analyze_missing(df)     Per-column and overall missingness; complete-row count.
summarize_columns(df)   Numeric summary (mean, sd, variance, quartiles, IQR, skewness, kurtosis) and categorical cardinality / top level.
data_quality_score(df)  0–100 score and letter grade from completeness, row uniqueness, column variability, and (optionally) outlier rate.
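As a rough idea of what a missingness summary of this kind contains, the three quantities named above (per-column missingness, overall missingness, complete-row count) can be computed in base R as below. This is a sketch on a made-up data frame, not the output format of analyze_missing().

```r
# Base-R sketch of a missingness summary; the toy data is invented
# and the layout is not dataProfilerR's return structure.
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, NA),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

na_mask <- is.na(df)                       # logical matrix, one cell per value

per_column <- data.frame(
  column      = colnames(df),
  n_missing   = unname(colSums(na_mask)),
  pct_missing = unname(100 * colMeans(na_mask))
)

overall_pct   <- 100 * mean(na_mask)       # share of all cells that are NA
complete_rows <- sum(complete.cases(df))   # rows with no NA in any column

per_column
overall_pct
complete_rows
```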

Statistics

Function                          Purpose
────────────────────────────────  ───────
normality_tests(df)               Shapiro–Wilk (and Anderson–Darling if nortest is present) per numeric column; large columns subsampled to 5000.
detect_outliers(x, method)        "iqr", "zscore", or "robust" (median/MAD) on a vector.
outlier_summary(df, method)       Per-column outlier counts and an overall rate.
correlation_analysis(df, method)  Pearson and/or Spearman matrices over numeric columns.
categorical_association(df)       Cramer’s V matrix between categorical columns.
analyze_dates(df)                 Range, unique count, and largest gap for date/datetime columns.
compare_groups(df, group)         Numeric summaries within the levels of a grouping column.
skewness(x), kurtosis(x)          Moment-based, exported for direct use.
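The three outlier rules named above differ mainly in how they measure centre and spread. The sketch below shows one common formulation of each. It is an assumption-laden illustration, not detect_outliers() itself: flag_outliers() is a hypothetical name, and the default cut-offs (1.5 x IQR, 3 standard deviations, 3.5 MADs) are conventional choices that the package may or may not share.

```r
# Sketch of three common outlier rules. Function name and thresholds are
# assumptions for illustration, not dataProfilerR's internals.
flag_outliers <- function(x, method = c("iqr", "zscore", "robust"),
                          k = NULL) {
  method <- match.arg(method)
  stopifnot(is.numeric(x))
  flagged <- switch(method,
    iqr = {
      # Tukey fences: beyond k * IQR from the quartiles.
      if (is.null(k)) k <- 1.5
      q <- stats::quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
      spread <- q[2] - q[1]
      x < q[1] - k * spread | x > q[2] + k * spread
    },
    zscore = {
      # Classical z-score: more than k standard deviations from the mean.
      if (is.null(k)) k <- 3
      abs(x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE) > k
    },
    robust = {
      # Median and MAD replace mean and sd.
      # stats::mad() already rescales by 1.4826 for consistency with sd.
      if (is.null(k)) k <- 3.5
      abs(x - stats::median(x, na.rm = TRUE)) / stats::mad(x, na.rm = TRUE) > k
    }
  )
  flagged & !is.na(flagged)   # missing values are never flagged
}

x <- c(10, 11, 12, 11, 10, 12, 11, 95)
which(flag_outliers(x, "iqr"))      # position 8
which(flag_outliers(x, "zscore"))   # nothing flagged
which(flag_outliers(x, "robust"))   # position 8
```

The z-score rule misses the obvious outlier here because the outlier itself inflates the standard deviation; with only 8 values, no z-score can exceed (n - 1) / sqrt(n), about 2.47. That masking effect is the usual argument for offering the IQR and median/MAD variants alongside it.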

Visualization (ggplot2)

Function                       Purpose
─────────────────────────────  ───────
plot_missing(df)               Missing-value heatmap (rows subsampled when large).
plot_distribution(df, column)  Histogram + density (numeric) or bar chart (categorical).
plot_correlation(df, method)   Annotated correlation heatmap.
plot_association(df)           Cramer’s V heatmap for categorical columns.
plot_boxplots(df)              Faceted boxplots for the numeric columns.
plot_pairs(df, columns)        Scatterplot matrix for selected numeric columns.
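For readers unfamiliar with the statistic behind the association heatmap, the following base-R sketch computes a pairwise Cramer's V matrix on a toy data frame. It is not the package's code: cramers_v() and cramers_v_matrix() are hypothetical names, the data is invented, and no bias correction is applied.

```r
# Base-R sketch of a pairwise Cramer's V matrix (no bias correction).
# Function names are illustrative, not dataProfilerR's API.
cramers_v <- function(a, b) {
  tab <- table(a, b)                       # pairs containing NA are dropped
  tab <- tab[rowSums(tab) > 0, colSums(tab) > 0, drop = FALSE]
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  if (n == 0 || k < 2) return(NA_real_)    # undefined for constant columns
  expected <- outer(rowSums(tab), colSums(tab)) / n
  chi_sq <- sum((tab - expected)^2 / expected)
  sqrt(chi_sq / (n * (k - 1)))
}

cramers_v_matrix <- function(df) {
  cols <- names(df)
  # Diagonal is set to 1 by convention.
  m <- matrix(1, length(cols), length(cols), dimnames = list(cols, cols))
  for (i in seq_along(cols)) {
    for (j in seq_along(cols)) {
      if (i < j) {
        m[i, j] <- m[j, i] <- cramers_v(df[[i]], df[[j]])
      }
    }
  }
  m
}

toy <- data.frame(
  colour = c("red", "red", "blue", "blue", "red", "blue"),
  shade  = c("warm", "warm", "cool", "cool", "warm", "cool"),  # mirrors colour
  size   = c("S", "L", "S", "L", "S", "L")
)
m <- cramers_v_matrix(toy)
round(m, 3)
```

Values run from 0 (no association) to 1 (one column determines the other), so colour and shade score 1 here while colour and size score about 0.333.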

Pipeline, reporting & object

Function                        Purpose
──────────────────────────────  ───────
profile_data(df, ...)           Run everything; return a data_profile. Options include group_by and distributions.
report(x, file)                 Render the profile to a self-contained HTML file (needs pandoc).
print / summary / plot methods  Overview / detail / figures (plot() adds which = "association").
is_data_profile(x)              Class predicate.

The data_profile object

profile_data() returns an S3 list with four parts plus the call:

Folder structure

dataProfilerR/
├── DESCRIPTION
├── NAMESPACE                # generated by roxygen2
├── LICENSE
├── NEWS.md
├── R/
│   ├── dataProfilerR-package.R
│   ├── utils.R              # validation + skewness/kurtosis
│   ├── profiling.R          # types, missingness, summaries, quality score
│   ├── statistics.R         # normality, outliers, correlation
│   ├── association.R        # Cramer's V for categoricals
│   ├── dates.R              # date/datetime profiling
│   ├── groups.R             # grouped comparison
│   ├── visualization.R      # ggplot2 functions
│   ├── report.R             # HTML report (rmarkdown)
│   ├── profile_data.R       # orchestrator + S3 constructor
│   └── methods.R            # print / summary / plot
├── man/                     # generated by roxygen2
├── tests/testthat/          # unit + edge-case tests
└── vignettes/dataProfilerR.Rmd

Testing

testthat (edition 3) covers each function plus edge cases — empty frames, wrong types, all-NA columns, single-column frames, missing-column plot requests, and output-shape consistency. Run with devtools::test().

Limitations and future improvements

Added in 0.2.0: report() (HTML), categorical_association() (Cramer’s V), analyze_dates(), compare_groups(), and a distributions = FALSE switch to avoid eager per-column plots on wide data. See NEWS.md.

Still open / honest gaps:

License

MIT © Muhammad Farooqi