---
title: "Advanced Features: EWMA, Quality Control, and Cohort Analysis"
author: "Athlytics Package"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Advanced Features: EWMA, Quality Control, and Cohort Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 10,
  fig.height = 7,
  out.width = "100%",
  dpi = 150
)
```

# Introduction

This vignette demonstrates the advanced features of Athlytics for sports physiology analysis:

1. **Data Quality Control**: Detecting and flagging HR/power spikes, GPS drift
2. **ACWR with EWMA**: Exponentially weighted moving average with confidence bands
3. **Cohort Reference Analysis**: Multi-athlete percentile comparisons

These features make Athlytics suitable for both individual training analysis and multi-athlete research studies.

---

# 1. Data Quality Control

Before calculating physiological metrics, it's crucial to ensure data quality. The `flag_quality()` function detects common issues in activity stream data.

## Basic Quality Checking

```{r eval=FALSE}
library(Athlytics)
library(dplyr)

# Parse stream data from a FIT/TCX/GPX file
stream_data <- parse_activity_file("path/to/activity.fit")

# Flag quality issues
flagged_data <- flag_quality(
  stream_data,
  sport = "Run",
  hr_range = c(30, 220), # Valid HR range
  pw_range = c(0, 1500), # Valid power range
  max_run_speed = 7.0, # Max speed (m/s) ≈ 2:23/km pace
  max_hr_jump = 10, # Max HR change per second
  max_pw_jump = 300, # Max power change per second
  min_steady_minutes = 20, # Min duration for steady-state
  steady_cv_threshold = 0.08 # CV fraction (0-1) threshold for steady-state
)

# View quality summary
summary_stats <- summarize_quality(flagged_data)
print(summary_stats)
```

**Output interpretation:**
- `quality_score`: Overall quality (0-1), where 1 = perfect
- `flagged_pct`: Percentage of data points with issues
- `steady_state_pct`: Percentage suitable for EF/decoupling calculations

## Why Quality Control Matters

**Without quality control:**
- HR spikes from sensor errors inflate HRSS calculations
- GPS drift creates artificially high speeds, affecting pace metrics
- Non-steady-state segments contaminate Efficiency Factor (EF) calculations

**With quality control:**
- Only clean data contributes to metrics
- Steady-state detection ensures valid EF/decoupling calculations
- Researchers can report data quality in publications

---

# 2. ACWR with EWMA Method

Traditional ACWR uses rolling averages (RA). The EWMA method, introduced by Williams et al. (2017), offers:
- **Greater sensitivity** to recent training changes
- **Customizable decay rates** via half-life parameters
- **Confidence bands** via bootstrap to quantify uncertainty

> **Reference:** Williams, S., et al. (2017). Better way to determine the acute:chronic workload ratio? *British Journal of Sports Medicine*, 51(3), 209-210. [DOI: 10.1136/bjsports-2016-096589](https://doi.org/10.1136/bjsports-2016-096589)

## Comparing RA vs EWMA

```{r eval=FALSE}
# Load activities
activities <- load_local_activities("export_12345678.zip")

# Calculate ACWR using Rolling Average (traditional)
acwr_ra <- calculate_acwr_ewma(
  activities,
  activity_type = "Run",
  method = "ra",
  acute_period = 7,
  chronic_period = 28,
  load_metric = "duration_mins"
)

# Calculate ACWR using EWMA
acwr_ewma <- calculate_acwr_ewma(
  activities,
  activity_type = "Run",
  method = "ewma",
  half_life_acute = 3.5, # Acute load half-life (days)
  half_life_chronic = 14, # Chronic load half-life (days)
  load_metric = "duration_mins"
)

# Compare the two methods
library(ggplot2)
ggplot() +
  geom_line(
    data = acwr_ra, aes(x = date, y = acwr_smooth),
    color = "blue", linewidth = 1, linetype = "solid"
  ) +
  geom_line(
    data = acwr_ewma, aes(x = date, y = acwr_smooth),
    color = "red", linewidth = 1, linetype = "dashed"
  ) +
  labs(
    title = "ACWR: Rolling Average vs EWMA",
    subtitle = "Blue = RA (7:28d) | Red = EWMA (3.5:14d half-life)",
    x = "Date", y = "ACWR"
  ) +
  theme_minimal()
```

**Key differences:**
- **RA**: Equal weight to all days in window; sudden jumps when old workouts exit window
- **EWMA**: Exponential decay; smooth transitions; more responsive to recent changes

## Adding Confidence Bands

Confidence bands quantify uncertainty in ACWR estimates, essential for research reporting.

```{r eval=FALSE}
# Calculate EWMA with 95% confidence bands
acwr_ewma_ci <- calculate_acwr_ewma(
  activities,
  activity_type = "Run",
  method = "ewma",
  half_life_acute = 3.5,
  half_life_chronic = 14,
  load_metric = "duration_mins",
  ci = TRUE, # Enable confidence intervals
  B = 200, # Bootstrap iterations
  block_len = 7, # Block length (days) for bootstrap
  conf_level = 0.95 # 95% CI
)

# Plot with confidence bands
ggplot(acwr_ewma_ci, aes(x = date)) +
  geom_ribbon(aes(ymin = acwr_lower, ymax = acwr_upper),
    fill = "gray70", alpha = 0.5
  ) +
  geom_line(aes(y = acwr_smooth), color = "black", linewidth = 1) +
  geom_hline(yintercept = c(0.8, 1.3), linetype = "dotted", color = "green") +
  geom_hline(yintercept = 1.5, linetype = "dotted", color = "red") +
  labs(
    title = "ACWR with 95% Confidence Bands",
    subtitle = "Gray band = bootstrap confidence interval",
    x = "Date", y = "ACWR"
  ) +
  theme_minimal()
```

**Interpretation:**
- **Gray band**: 95% confidence interval from moving-block bootstrap
- **Wider bands** = more uncertainty (e.g., during high load variability)
- **Narrower bands** = more stable estimates

### Technical Note: Bootstrap Method

The confidence bands use **moving-block bootstrap**:
1. Daily load sequence is resampled in overlapping weekly blocks (preserves within-week correlation)
2. ACWR is recalculated for each bootstrap sample (B = 200 iterations)
3. 2.5th and 97.5th percentiles form the 95% confidence band

This approach is computationally efficient and doesn't require distributional assumptions.

---

# 3. Cohort Reference Analysis

When analyzing multiple athletes (teams, research cohorts), it's valuable to compare individuals to group norms.

## Multi-Athlete Setup

```{r eval=FALSE}
# Load data for multiple athletes
athlete1 <- load_local_activities("athlete1_export.zip") |>
  mutate(athlete_id = "athlete1", sex = "M", age_band = "25-35")

athlete2 <- load_local_activities("athlete2_export.zip") |>
  mutate(athlete_id = "athlete2", sex = "F", age_band = "25-35")

athlete3 <- load_local_activities("athlete3_export.zip") |>
  mutate(athlete_id = "athlete3", sex = "M", age_band = "35-45")

# Combine
cohort_data <- bind_rows(athlete1, athlete2, athlete3)

# Calculate ACWR for each athlete
cohort_acwr <- cohort_data |>
  group_by(athlete_id) |>
  group_modify(~ calculate_acwr_ewma(.x, activity_type = "Run", method = "ewma")) |>
  ungroup() |>
  mutate(type = "Run") |>
  left_join(
    cohort_data |> select(athlete_id, sex, age_band) |> distinct(),
    by = "athlete_id"
  )
```

## Calculate Reference Percentiles

```{r eval=FALSE}
# Calculate cohort reference by activity type and sex
reference <- calculate_cohort_reference(
  cohort_acwr,
  metric = "acwr_smooth",
  by = c("type", "sex"), # Group by activity type and sex
  probs = c(0.05, 0.25, 0.5, 0.75, 0.95), # Percentiles
  min_athletes = 3 # Minimum athletes per group
)

# View reference structure
head(reference)
```

**Output:**
```
# A tibble: 6 × 5
  date       type  sex   percentile value n_athletes
  <date>     <chr> <chr> <chr>      <dbl>      <int>
1 2024-01-01 Run   M     p05        0.65          5
2 2024-01-01 Run   M     p25        0.82          5
3 2024-01-01 Run   M     p50        0.98          5
4 2024-01-01 Run   M     p75        1.15          5
5 2024-01-01 Run   M     p95        1.38          5
```

## Plot Individual vs Cohort

```{r eval=FALSE}
# Extract one athlete's data
individual <- cohort_acwr |>
  filter(athlete_id == "athlete1")

# Plot with cohort reference
p <- plot_with_reference(
  individual = individual,
  reference = reference |> filter(sex == "M"), # Match athlete's group
  metric = "acwr_smooth",
  bands = c("p25_p75", "p05_p95", "p50")
)

print(p)
```

**Interpretation:**
- **Black line**: Individual athlete's ACWR trend
- **Dark shaded area (P25-P75)**: "Normal" range (50% of cohort)
- **Light shaded area (P5-P95)**: Extended range (90% of cohort)
- **Dashed line (P50)**: Median (cohort center)

**Use cases:**
- **Outlier detection**: Athletes consistently above P95 or below P5
- **Load matching**: Ensure similar load profiles when comparing injury rates
- **Benchmarking**: Compare individual's trajectory to team norms

---

# 4. Integrated Workflow Example

Here's a complete research workflow combining all features:

```{r eval=FALSE}
library(Athlytics)
library(dplyr)
library(ggplot2)

# --- Step 1: Load multi-athlete cohort ---
cohort_files <- c("athlete1_export.zip", "athlete2_export.zip", "athlete3_export.zip")
cohort_data <- lapply(cohort_files, function(file) {
  load_local_activities(file) |>
    mutate(athlete_id = tools::file_path_sans_ext(basename(file)))
}) |> bind_rows()

# --- Step 2: Quality control (conceptual - for stream data) ---
# If you have stream data, flag quality before calculating metrics
# stream <- parse_activity_file("path/to/activity.fit")
# flagged <- flag_quality(stream, sport = "Run")
# summary <- summarize_quality(flagged)

# --- Step 3: Calculate ACWR with EWMA + CI ---
cohort_acwr <- cohort_data |>
  group_by(athlete_id) |>
  group_modify(~ calculate_acwr_ewma(
    .x,
    activity_type = "Run",
    method = "ewma",
    half_life_acute = 3.5,
    half_life_chronic = 14,
    load_metric = "duration_mins",
    ci = TRUE,
    B = 100 # Reduced for speed in example
  )) |>
  ungroup() |>
  mutate(type = "Run")

# --- Step 4: Calculate cohort reference ---
reference <- calculate_cohort_reference(
  cohort_acwr,
  metric = "acwr_smooth",
  by = c("type"),
  probs = c(0.05, 0.25, 0.5, 0.75, 0.95),
  min_athletes = 3
)

# --- Step 5: Visualize individual with confidence + cohort ---
individual <- cohort_acwr |> filter(athlete_id == "athlete1")

p <- plot_with_reference(individual, reference, metric = "acwr_smooth") +
  # Add individual's confidence bands
  geom_ribbon(
    data = individual,
    aes(x = date, ymin = acwr_lower, ymax = acwr_upper),
    fill = "blue", alpha = 0.2
  )

print(p)

# --- Step 6: Export for statistical analysis ---
write.csv(cohort_acwr, "cohort_acwr_with_ci.csv", row.names = FALSE)
write.csv(reference, "cohort_reference.csv", row.names = FALSE)
```

---

# 5. Methodological Notes for Research

## ACWR Method Selection

| Method | Pros | Cons | Recommended Use |
|--------|------|------|----------------|
| **RA (Rolling Average)** | Stable, well-established, simple interpretation | Sudden jumps, equal weight to all days | **Descriptive studies**, replicating prior work |
| **EWMA** | Smooth, responsive, customizable decay | More parameters, less established norms | **Monitoring**, sensitive change detection |

**Default parameters:**
- RA: 7-day acute, 28-day chronic (most common in literature)
- EWMA: 3.5-day acute half-life, 14-day chronic half-life (equivalent decay)

## Confidence Bands Limitations

- **Not causal inference**: Bands reflect sampling variability, not prediction of injury
- **Assumes stationarity**: If training pattern changes fundamentally, past data may not be representative
- **Block length matters**: Default 7-day blocks preserve weekly structure; adjust if training cycle differs

## Cohort Reference Interpretation

⚠️ **Critical distinction**: Percentile bands are **descriptive population variability**, NOT **confidence intervals** for an individual's "true" value.

**Correct interpretation:**
- "Athlete was above the 75th percentile of the cohort" ✅
- "Athlete's ACWR was significantly higher than cohort" ❌ (requires statistical test)

**Minimum sample sizes:**
- P25-P75: n ≥ 5 athletes
- P5-P95: n ≥ 20 athletes (for stable extremes)

---

# 6. FAQ for Advanced Features

**Q: How do I choose EWMA half-life?**  
A: Start with defaults (3.5:14d ≈ 7:28d RA). For more responsive monitoring, reduce both proportionally (e.g., 2.5:10d). For smoother trends, increase (e.g., 5:20d).

**Q: Bootstrap is slow. Can I reduce B?**  
A: Yes. B = 50–100 is often sufficient for visual inspection. For publication, use B = 200–500.

**Q: What if my cohort has < 5 athletes?**  
A: Set `min_athletes = 3`, but report this limitation. Percentiles with n < 5 are unstable.

**Q: Can I use these methods for injury prediction?**  
A: **No**. These are **descriptive monitoring tools**. Injury prediction requires proper case-control studies, controlling for confounders, and validation. See: Carey, D. L., et al. (2018). Predictive modelling of training loads and injury in Australian football. *International Journal of Computer Science in Sport*, 17(1), 49-66. [DOI: 10.2478/ijcss-2018-0002](https://doi.org/10.2478/ijcss-2018-0002); Impellizzeri, F. M., et al. (2021). What role do chronic workloads play in the acute to chronic workload ratio? Time to dismiss ACWR and its underlying theory. *Sports Medicine*, 51(3), 581-592. [DOI: 10.1007/s40279-020-01378-6](https://doi.org/10.1007/s40279-020-01378-6)

**Q: How do I cite Athlytics?**  
A: Use the standard software citation format with version number and repository URL. See the main tutorial for the complete citation.

---


# Session Info

```{r}
sessionInfo()
```