integrity
Increasing concerns about the trustworthiness of research have prompted calls to scrutinise studies’ Individual Participant Data (IPD), but guidance on how to do this has been lacking. The integrity package has been developed to screen randomised controlled trials (RCTs) for integrity issues. The software guides decision-making by determining whether a trial has no concerns, some concerns requiring further information, or major concerns warranting exclusion from evidence synthesis or publication.
Since the functionality is implemented in R, please import the data set into R. There are a variety of functions in R or CRAN packages to do this:
- read.csv and read.table functions to import comma-separated and tab-separated text files.
- read_sas for SAS, read_sav for SPSS and read_dta for Stata in the CRAN package haven.
- read_excel function for Microsoft Excel in the CRAN package readxl.

An accompanying YAML file also needs to be written to describe the expected characteristics of each column.
The top-level elements are required to be named:
- participantID: The name of the column which corresponds to the unique participant identifier. Mandatory.
- enrollment: A list with three mandatory elements named start, randomisation and end, specifying the column names of the date of enrollment, date of randomisation and date of the end of participation.
- baseline: Lists named dichotomous, polytomous and numeric, specifying the column name(s) of the column(s) which correspond(s) to baseline measurements.
- intervention: The name of the column specifying the intervention applied to the individuals.
- outcome: A list of length up to two, named common and rare, with sublists named dichotomous or polytomous, containing the names of columns of those data types.
- correlated: A named list of two entries of column names that are expected to be correlated.
- unexpected: A named list of column names with values that are not expected to be seen. days is a special sublist of this list and applies to date columns, which are converted into days of the week before comparison. It must have two elements: names, which are the unexpected day names, and locale, which is the locale of the unexpected day names specified.

participantID, enrollment, baseline, intervention and outcome are mandatory. Others only need to be specified if there is a column to annotate.
View the YAML file corresponding to the bundled dataset for an example of the expected contents and structure. Its location on your computer is returned by system.file("extdata", "variables.yaml", package = "integrity").
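As a rough illustration of the structure described above, the following sketch parses a small description from a string and confirms that the mandatory top-level elements are present. The column names are taken from the bundled data set, but which column is assigned to which element (for example, which outcomes are treated as common or rare) is an assumption made for illustration. The bundled variables.yaml remains the authoritative example.

```r
library(yaml)

# Hypothetical description; the nesting follows the prose above, and the
# assignment of columns to elements is assumed, not copied from the package.
yaml_text <- "
participantID: infant_ID
enrollment:
  start: enrol_start
  randomisation: rand_date
  end: enrol_end
baseline:
  dichotomous: [sex]
  numeric: [mat_age, GA_weeks, birthweight]
intervention: treatment_cat
outcome:
  common:
    dichotomous: [inf_transfusion_any]
    polytomous: [CLD]
  rare:
    dichotomous: [NEC, inf_death]
"

# Parse the text into a nested R list.
info <- yaml.load(yaml_text)

# Confirm that every mandatory top-level element has been supplied.
mandatory <- c("participantID", "enrollment", "baseline",
               "intervention", "outcome")
missing_elements <- setdiff(mandatory, names(info))
if (length(missing_elements) > 0) {
  stop("Missing mandatory elements: ",
       paste(missing_elements, collapse = ", "))
}

str(info, max.level = 2)
```

In practice the description would be saved to a file and loaded with read_yaml, as is done for the bundled example later in this vignette.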
The checks are categorised into several domains.
Item 1: Repeating patterns across baseline variables. Item 2: Repeating patterns within baseline variables. Item 3: Repeating patterns across baseline variables for rare outcome. Item 4: Bias in terminal digits.
Item 5: Excessively homogeneous distribution of binary baseline variables. Item 6: Excessive imbalances of continuous baseline variables between groups. Item 7: Excessive imbalances of categorical baseline variables between groups. Item 8: Differential variability of numerical baseline characteristics between groups.
Item 9: Expected correlations between variables (e.g. height and weight).
Item 10: Randomisation dates outside of the study period.
Item 11: Deviation from randomness of allocation of participants to treatments over time. Item 12: Deviation from randomness of allocation on days of the week.
Item 13: Impossible or implausible values, e.g. Age at Menarche for a male participant.
Item 14: Discrepancies between summary statistics calculated from data set and those presented in the corresponding journal article.
Item 15: Too few missing data values or missing data overly similar between treatment groups. Item 16: Implausible event rates based on expert knowledge.
Based on the YAML file, only checks that are relevant to the data set will be executed.
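To give a sense of what one of these checks involves, here is a minimal sketch of a terminal digit test in the spirit of Item 4. It is not the package's implementation: the function name terminal_digit_test and the simulated values are hypothetical, and the use of a chi-squared goodness-of-fit test against equally likely last digits is an assumption.

```r
# Hypothetical helper, not part of the integrity package.
# Tests whether the last digit of whole-number measurements is uniform on 0-9.
terminal_digit_test <- function(x) {
  x <- x[!is.na(x)]
  digits <- factor(abs(round(x)) %% 10, levels = 0:9)
  # For a one-dimensional table, chisq.test() defaults to equal
  # expected probabilities, i.e. a uniform distribution of digits.
  chisq.test(table(digits))
}

set.seed(42)
# Simulated measurements with no digit preference.
measured <- sample(1000:4000, 500, replace = TRUE)
# The same values rounded to the nearest hundred, so every last digit is 0.
rounded <- round(measured, -2)

terminal_digit_test(measured)$p.value
terminal_digit_test(rounded)$p.value
```

A very small p-value for the rounded values reflects a strong digit preference, which is the kind of pattern this item is intended to flag.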
The data set bundled with this package is an extract from the iCOMP study. The main goal was to determine the optimal umbilical cord management strategy at preterm birth, such as milking or delayed cord clamping.
The data is in a Microsoft Excel file. There is one sheet.
library(readxl)
examplePath <- system.file("extdata", "dataset.xlsx", package = "integrity")
dataset <- read_excel(examplePath)
dataset[1:5, ]
## # A tibble: 5 × 18
## infant_ID rand_date mat_age blood_loss treatment_cat GA_weeks
## <dbl> <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 1 2019-03-21 00:00:00 36 200 2 30
## 2 2 2020-07-17 00:00:00 18 200 1 28
## 3 3 2019-06-14 00:00:00 20 300 1 32
## 4 4 2019-10-08 00:00:00 30 500 2 29
## 5 5 2019-03-02 00:00:00 34 400 1 32
## # ℹ 12 more variables: birthweight <dbl>, sex <dbl>, hospital_days <dbl>,
## # temp <dbl>, inf_transfusion_any <dbl>, Hct <dbl>, CLD <dbl>, IVH <dbl>,
## # NEC <dbl>, inf_death <dbl>, enrol_start <dttm>, enrol_end <dttm>
The sample identifiers can be seen, as well as the first few clinical covariates. At this stage, categorical variables which only have one distinct value should be removed. This data has no such variables.
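One way to look for such variables is sketched below on a small made-up data frame. The data frame toy, the vector categorical and the column site are hypothetical; with real data, categorical would list the categorical columns of the imported data set.

```r
# Made-up data: 'site' is a categorical column with a single distinct value.
toy <- data.frame(
  id   = 1:4,
  sex  = c(1, 2, 1, 2),
  site = c("A", "A", "A", "A")
)

# Columns to treat as categorical (an assumption for this example).
categorical <- c("sex", "site")

# Count the distinct non-missing values in each categorical column.
n_distinct <- vapply(
  toy[categorical],
  function(column) length(unique(column[!is.na(column)])),
  integer(1)
)

# Identify and drop the columns with only one distinct value.
constant_columns <- names(n_distinct)[n_distinct == 1]
toy_clean <- toy[, setdiff(names(toy), constant_columns), drop = FALSE]

constant_columns
names(toy_clean)
```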
The variable types and expectations need to be defined. The metadata representation language YAML is used for this purpose.
library(yaml)
example_path <- system.file("extdata", "variables.yaml", package = "integrity")
dataset_info <- read_yaml(example_path)
On your computer, the bundled data set is located at the path stored in examplePath and the YAML file at the path stored in example_path. Both depend on where the package is installed.
Simply provide the data frame and data information to
run_checks. The first step which automatically happens is
data checking and cleaning, which ensures that all variables defined in
the YAML file are present in the dataset, converts any variables
annotated as factors but not factors into factors, and removes any
columns that are entirely missing values.
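The three cleaning steps can be pictured with the following base R sketch. It only illustrates the behaviour described above and is not the code that run_checks uses; the function clean_for_checks, its arguments and the toy data frame are all hypothetical.

```r
# Hypothetical illustration of the cleaning steps; not the package's code.
clean_for_checks <- function(data, required_columns, factor_columns) {
  # 1. Every column named in the description must exist in the data.
  missing_columns <- setdiff(required_columns, names(data))
  if (length(missing_columns) > 0) {
    stop("Columns described but absent from the data: ",
         paste(missing_columns, collapse = ", "))
  }
  # 2. Convert columns annotated as categorical into factors if needed.
  for (column in intersect(factor_columns, names(data))) {
    if (!is.factor(data[[column]])) {
      data[[column]] <- factor(data[[column]])
    }
  }
  # 3. Drop columns that consist entirely of missing values.
  all_missing <- vapply(data, function(column) all(is.na(column)), logical(1))
  data[, !all_missing, drop = FALSE]
}

# Toy data: 'notes' is entirely missing and 'sex' is numeric, not a factor.
toy <- data.frame(id = 1:3, sex = c(1, 2, 1), notes = NA)
cleaned <- clean_for_checks(toy,
                            required_columns = c("id", "sex"),
                            factor_columns = "sex")
str(cleaned)
```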
library(integrity)
result <- run_checks(dataset, dataset_info)
## Repeating pattern within each baseline algorithm in development
## No duplicate combinations found of: sex, mat_age, GA_weeks, birthweight
names(result)
## [1] "check_table" "images" "summary_table"
This creates a list of three result types.
Firstly, there is a check table with Pass or Fail statuses based on appropriate statistical tests.
head(result[["check_table"]])
## Domain Item Status
## 1 Unusual or Repeated Patterns Repeated Baselines Fail
## 3 Unusual or Repeated Patterns Consecutive Baseline Binary Fail
## 7 Correlations Unexpectedly Uncorrelated Fail
## 10 Date Violations Implausible Randomisation Date Fail
## 11 Internal Inconsistency Implausible Day Fail
## 12 Internal Inconsistency Implausible Day Fail
## Details
## 1 sex:1, mat_age:30, GA_weeks:33, birthweight:1568 occurs 2 times.
## 3 Variable sex has statistically significant runs of values using χ² test.
## 7 GA_weeks, birthweight
## 10 Participants 38, 49
## 11 All participants start on Saturday
## 12 5 randomisation on Saturday
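Because the check table is an ordinary data frame, it can be filtered and tabulated with base R. The sketch below rebuilds the six rows printed above as a stand-alone data frame so that it runs without the package; with real results, result[["check_table"]] would be used instead.

```r
# Stand-in for result[["check_table"]], copied from the rows printed above.
check_table <- data.frame(
  Domain = c("Unusual or Repeated Patterns", "Unusual or Repeated Patterns",
             "Correlations", "Date Violations",
             "Internal Inconsistency", "Internal Inconsistency"),
  Item = c("Repeated Baselines", "Consecutive Baseline Binary",
           "Unexpectedly Uncorrelated", "Implausible Randomisation Date",
           "Implausible Day", "Implausible Day"),
  Status = rep("Fail", 6)
)

# Keep only the failed checks and count them per domain.
failures <- check_table[check_table$Status == "Fail", ]
fails_by_domain <- table(failures$Domain)
fails_by_domain
```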
There are some interesting issues which may be examined further. Next is a list of four images. Here, the unexpected lack of correlation between gestational age and birthweight is shown.
names(result[["images"]])
## [1] "Terminal Digits" "timeAndSize" "Cumulative Allocation"
## [4] "Days"
result[["images"]][["timeAndSize"]]
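The idea behind an expected-correlation check can be sketched with cor.test from base R. This is an assumption about the general approach, not a description of the test the package applies, and the values below are constructed for illustration rather than taken from the data set.

```r
# Constructed gestational ages: five infants at each of 28 to 33 weeks.
ga_weeks <- rep(28:33, each = 5)
offsets <- rep(c(-200, -100, 0, 100, 200), times = 6)

# Birthweights that rise with gestational age, as would be expected.
birthweight_related <- 150 * ga_weeks - 3000 + offsets
# Birthweights that vary but do not depend on gestational age.
birthweight_unrelated <- 1500 + offsets

related <- cor.test(ga_weeks, birthweight_related)
unrelated <- cor.test(ga_weeks, birthweight_unrelated)

# A pair expected to be correlated but showing no significant correlation,
# like the second pair here, is the pattern such a check would flag.
related$p.value < 0.05
unrelated$p.value < 0.05
```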
Finally, there is a list of clinical summary tables: one for the measurements and one for the missingness.
result[["summary_table"]]
| Characteristic | 1, N = 50¹ | 2, N = 70¹ |
|---|---|---|
| infant_ID | 66 (35) | 57 (34) |
| rand_date | 2019-12-07 06:43:12 (14840940.1535606) | 2019-10-29 01:42:51.428571 (13740220.8211391) |
| mat_age | 29 (7) | 30 (7) |
| blood_loss | 298 (169) | 266 (171) |
| GA_weeks | ||
| 28 | 3 (6.0%) | 6 (8.6%) |
| 29 | 3 (6.0%) | 3 (4.3%) |
| 30 | 3 (6.0%) | 10 (14%) |
| 31 | 6 (12%) | 10 (14%) |
| 32 | 19 (38%) | 13 (19%) |
| 33 | 16 (32%) | 28 (40%) |
| birthweight | 1,835 (421) | 1,757 (361) |
| sex | ||
| 1 | 27 (54%) | 35 (50%) |
| 2 | 23 (46%) | 35 (50%) |
| hospital_days | 30 (20) | 36 (24) |
| temp | 36.72 (0.65) | 36.84 (0.58) |
| inf_transfusion_any | 9 (18%) | 14 (20%) |
| Hct | 53.7 (5.8) | 54.0 (5.7) |
| CLD | ||
| 0 | 40 (80%) | 45 (64%) |
| 1 | 1 (2.0%) | 16 (23%) |
| 2 | 9 (18%) | 7 (10%) |
| 3 | 0 (0%) | 2 (2.9%) |
| IVH | ||
| 0 | 35 (73%) | 52 (74%) |
| 1 | 13 (27%) | 18 (26%) |
| Unknown | 2 | 0 |
| NEC | ||
| 0 | 46 (92%) | 67 (96%) |
| 1 | 4 (8.0%) | 3 (4.3%) |
| inf_death | ||
| 0 | 49 (98%) | 65 (94%) |
| 1 | 1 (2.0%) | 4 (5.8%) |
| Unknown | 0 | 1 |
| enrol_start | ||
| 2019-03-02 | 50 (100%) | 70 (100%) |
| enrol_end | ||
| 2020-08-09 | 50 (100%) | 70 (100%) |

¹ Mean (SD); n (%)
This vignette was executed on the following computing system:
sessionInfo()
## R Under development (unstable) (2026-02-04 r89376 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=C LC_CTYPE=English_Australia.utf8
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.utf8
##
## time zone: Australia/Sydney
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] integrity_1.0 yaml_2.3.12 readxl_1.4.5
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.57 bslib_0.10.0 ggplot2_4.0.2
## [5] rstatix_0.7.3 lattice_0.22-9 vctrs_0.7.1 tools_4.6.0
## [9] generics_0.1.4 tibble_3.3.1 pkgconfig_2.0.3 Matrix_1.7-4
## [13] RColorBrewer_1.1-3 S7_0.2.1 gt_1.3.0 lifecycle_1.0.5
## [17] compiler_4.6.0 farver_2.1.2 stringr_1.6.0 janitor_2.2.1
## [21] carData_3.0-6 snakecase_0.11.1 litedown_0.9 htmltools_0.5.9
## [25] sass_0.4.10 Formula_1.2-5 pillar_1.11.1 car_3.1-5
## [29] ggpubr_0.6.3 jquerylib_0.1.4 tidyr_1.3.2 cachem_1.1.0
## [33] abind_1.4-8 nlme_3.1-168 commonmark_2.0.0 tidyselect_1.2.1
## [37] digest_0.6.39 stringi_1.8.7 gtsummary_2.5.0 dplyr_1.2.0
## [41] purrr_1.2.1 labeling_0.4.3 splines_4.6.0 fastmap_1.2.0
## [45] grid_4.6.0 cli_3.6.5 magrittr_2.0.4 cards_0.7.1
## [49] broom_1.0.12 withr_3.0.2 scales_1.4.0 backports_1.5.0
## [53] cardx_0.3.2 lubridate_1.9.5 timechange_0.4.0 rmarkdown_2.30
## [57] otel_0.2.0 ggsignif_0.6.4 cellranger_1.1.0 evaluate_1.0.5
## [61] knitr_1.51 markdown_2.0 mgcv_1.9-4 rlang_1.1.7
## [65] glue_1.8.0 xml2_1.5.2 rstudioapi_0.18.0 jsonlite_2.0.0
## [69] R6_2.6.1 fs_1.6.7