---
title: "Comparing topics over time"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Comparing topics over time}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
# The plotting chunks below need ggplot2 (a suggested package); skip them
# gracefully when it is not installed.
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
```

```{r setup}
library(scopusflow)
```

A common bibliometric question is not how large a literature is, but how its
internal emphasis shifts over time. Within deep-learning research, say, is the
share of work that also concerns medical imaging growing faster than the share
about computer vision? `scopus_compare_topics()` answers exactly this, and
`plot_scopus_comparison()` shows the answer. The comparison itself contacts the
API, so it is shown but not run; the plotting is reproduced offline from an object
of the same shape.

## What the comparison measures

For each year and each comparison term, the function counts the records matching
the reference topic *and* that term, and expresses that count as a percentage of
the records matching the reference *alone*. A value of 30% for "computer vision"
in 2020 means that 30% of the deep-learning records that year also mention
computer vision. The reference is the denominator, so it sits at 100% by
construction and is not drawn.

```{r eval = FALSE}
cmp <- scopus_compare_topics(
  reference_query  = "deep learning",
  comparison_terms = c("computer vision", "natural language processing",
                       "medical imaging", "drug discovery"),
  years            = 2013:2021,
  field            = "TITLE-ABS-KEY"
)
```
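The arithmetic behind a single value can be sketched with made-up counts. The
numbers are illustrative assumptions rather than 'Scopus' results, and
`reference_n_toy`, `both_n` and `share_toy` are hypothetical names used only
here.

```{r}
# Made-up counts for one year (illustrative only, not real Scopus data)
reference_n_toy <- 1000  # records matching the reference topic alone
both_n <- 300            # records matching the reference AND the comparison term

# Share of the reference literature that also mentions the comparison term
share_toy <- 100 * both_n / reference_n_toy
share_toy
```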

## The shape of the result

The result is a tidy table with one row per topic and year, the reference topic
included. We build one here with the same columns so the rest of the article runs
without an API key. The reference set grows over the period, which the
uncertainty band will reflect.

```{r}
years <- 2013:2021
ref_n <- round(seq(400, 1600, length.out = length(years)))
mk <- function(from, to) round(seq(from, to, length.out = length(years)))
counts <- list(
  "computer vision" = mk(140, 720),
  "natural language processing" = mk(90, 540),
  "medical imaging" = mk(15, 260),
  "drug discovery" = mk(8, 170)
)
cmp <- tibble::tibble(
  query = "q",
  query_type = c(rep("reference", length(years)),
                 rep("comparison", length(counts) * length(years))),
  abridged_query = c(rep("deep learning", length(years)),
                     rep(names(counts), each = length(years))),
  year = rep(years, length(counts) + 1),
  n = c(ref_n, unlist(counts, use.names = FALSE)),
  reference_n = rep(ref_n, length(counts) + 1),
  comparison_percentage = 100 * c(ref_n, unlist(counts, use.names = FALSE)) /
    rep(ref_n, length(counts) + 1),
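  # Pooled over the whole period: total matching records / total reference records
  average_comparison_percentage = rep(
    100 * c(sum(ref_n), unname(vapply(counts, sum, numeric(1)))) / sum(ref_n),
    each = length(years)
  )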
)
class(cmp) <- c("scopus_comparison", class(cmp))
cmp
```

The `comparison_percentage` column is the per-year share, and
`average_comparison_percentage` is the same ratio computed over the whole period,
which is what orders the topics. A year in which the reference has no records has
no defined share and is recorded as `NA` rather than as a misleading zero.
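A minimal sketch of how the two columns relate to the raw counts, assuming the
whole-period figure pools the counts before dividing; the package's own
implementation may differ in detail. The vectors are made up and the `toy_*`
names are hypothetical.

```{r}
# Made-up yearly counts; the first year has an empty reference set
toy_ref   <- c(0, 200, 400)  # reference records per year
toy_match <- c(0,  50, 130)  # records that also match the comparison term

# Per-year share: undefined (NA) when the reference has no records
toy_share <- ifelse(toy_ref == 0, NA_real_, 100 * toy_match / toy_ref)
toy_share

# Whole-period share: pool the counts first, then divide
toy_average <- 100 * sum(toy_match) / sum(toy_ref)
toy_average
```

Pooling weights each year by the size of its reference set, so years with more
records count for more than they would in a plain mean of the yearly shares.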

## A first plot

```{r eval = has_ggplot2, fig.alt = "Four application areas' share of the deep-learning literature from 2013 to 2021, with shaded uncertainty bands", fig.width = 8, fig.height = 4.6}
plot_scopus_comparison(cmp)
```

The chart uses whole-number year breaks and a colour-blind-safe palette. Because
there are only a few topics, it labels the lines directly, so the reader need not
match colours to a legend. Each label carries the topic's total record count. The
shaded band around each line is a Wilson stability range: it is wide in the early
years, when the reference set is small and the share would move easily, and
narrows as the literature grows. Because 'Scopus' returns exact counts rather than
a sample, the band is illustrative rather than a confidence interval, a point the
`plot_scopus_comparison()` help page sets out.
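To see why the band narrows, here is a sketch of the standard Wilson score
interval for a proportion. It is not taken from the package source:
`wilson_band()` is a hypothetical helper, and the 95% level is an assumption made
for illustration.

```{r}
# Wilson score interval for a proportion x / n, returned in percent.
# `wilson_band()` is a hypothetical helper, not a scopusflow function.
wilson_band <- function(x, n, level = 0.95) {
  z <- stats::qnorm(1 - (1 - level) / 2)
  p <- x / n
  denom <- 1 + z^2 / n
  centre <- (p + z^2 / (2 * n)) / denom
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  100 * c(lower = centre - half, upper = centre + half)
}

# The same 30% share with a small and a large reference set
small <- wilson_band(30, 100)
large <- wilson_band(480, 1600)
unname(diff(small))  # band width with 100 reference records
unname(diff(large))  # band width with 1600 reference records
```

The share is identical in both calls; only the size of the reference set changes,
and the larger set gives the narrower band.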

## Drawing the eye to one topic

When one topic is the focus of a figure, `highlight` draws it in an accent colour
and greys the rest, which keeps the context visible without letting it compete.

```{r eval = has_ggplot2, fig.alt = "The same chart with the medical-imaging topic highlighted against the others in grey", fig.width = 8, fig.height = 4.6}
plot_scopus_comparison(cmp, highlight = "medical imaging")
```

## Adjusting the labels

The count suffix on each label can be turned off, and the uncertainty band can be
removed, when a cleaner look is wanted.

```{r eval = has_ggplot2, fig.alt = "The comparison chart without record counts or bands", fig.width = 8, fig.height = 4.6}
plot_scopus_comparison(cmp, pub_count_in_legend = FALSE, interval = FALSE)
```

The return value is an ordinary [ggplot2](https://ggplot2.tidyverse.org) object,
so further adjustments are close at hand: a different theme is one `+` away, and
a saved file is one `ggplot2::ggsave()` call away.
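As a sketch of that workflow, the chunk below applies a theme to a plot and saves
it. To stay self-contained it uses a stand-in plot, `p_demo`, built from made-up
values; in practice the object returned by `plot_scopus_comparison()` would take
its place. The file goes to a temporary path chosen for illustration.

```{r, eval = requireNamespace("ggplot2", quietly = TRUE)}
# Stand-in plot built from made-up values (hypothetical object, not package output)
demo_data <- data.frame(year = 2013:2021, share = seq(10, 30, length.out = 9))
p_demo <- ggplot2::ggplot(demo_data, ggplot2::aes(x = year, y = share)) +
  ggplot2::geom_line()

# Swap the theme and relabel the axes with `+`
p_styled <- p_demo +
  ggplot2::theme_minimal() +
  ggplot2::labs(x = NULL, y = "Share of reference records (%)")

# Save to a temporary PDF at the same size as the figures in this article
out_file <- tempfile(fileext = ".pdf")
ggplot2::ggsave(out_file, plot = p_styled, width = 8, height = 4.6)
```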

## Reading the result as a table

Sometimes the numbers matter more than the picture. Because the output is a
tibble, the usual tools apply: here are the topics ranked by their average share.

```{r}
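comp <- cmp[cmp$query_type == "comparison", ]
# One row per topic, sorted so the largest average share comes first
ranked <- unique(comp[, c("abridged_query", "average_comparison_percentage")])
ranked[order(-ranked$average_comparison_percentage), ]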
```
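A year-by-topic layout is often easier to scan than the long format. The sketch
below reshapes a small made-up table with base R's `xtabs()`; `toy_long` and
`toy_wide` are hypothetical names, and the values are illustrative rather than
'Scopus' results.

```{r}
# Made-up long-format shares (illustrative values, hypothetical object name)
toy_long <- data.frame(
  abridged_query = rep(c("medical imaging", "drug discovery"), each = 3),
  year = rep(2019:2021, times = 2),
  comparison_percentage = c(12, 14, 16, 8, 9, 11)
)

# One row per year, one column per topic; each cell holds a single share,
# so summing within cells leaves the values unchanged
toy_wide <- xtabs(comparison_percentage ~ year + abridged_query, data = toy_long)
toy_wide
```

The same call should work on the comparison rows of the real result, provided
each topic and year pair appears only once.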