% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/lexpand.R
\name{lexpand}
\alias{lexpand}
\title{Split case-level observations}
\usage{
lexpand(data, birth = bi_date, entry = dg_date, exit = ex_date,
  event = NULL, status = status != 0, entry.status = NULL,
  breaks = list(fot = c(0, Inf)), id = NULL, overlapping = TRUE,
  aggre = NULL, aggre.type = c("unique", "cross-product"), drop = TRUE,
  pophaz = NULL, pp = TRUE, subset = NULL, merge = TRUE,
  verbose = FALSE, ...)
}
\arguments{
\item{data}{dataset of e.g. cancer cases as rows}

\item{birth}{birth time in date format
or fractional years; quoted or unquoted}

\item{entry}{entry time in date format
or fractional years; quoted or unquoted}

\item{exit}{exit from follow-up time in date
format or fractional years; quoted or unquoted}

\item{event}{advanced: time of possible event differing from \code{exit};
typically only used in certain SIR/SMR calculations - see Details;
keep \code{NULL} if \code{exit} is the time of the event; quoted or unquoted}

\item{status}{variable indicating type of event at \code{exit} or \code{event};
e.g. \code{status = status != 0}; expression or quoted variable name}

\item{entry.status}{input in the same way as \code{status};
status at \code{entry}; see Details}

\item{breaks}{a named list of vectors of time breaks;
e.g. \code{breaks = list(fot=0:5, age=c(0,45,65,Inf))}; see Details}

\item{id}{an id variable; e.g. \code{id = my_id}; quoted or unquoted}

\item{overlapping}{advanced, logical; if \code{FALSE} AND if \code{data} contains
multiple rows per subject AND \code{event} is defined,
ensures that the timelines of \code{lex.id}-specific rows do not overlap;
this ensures e.g. that person-years are only computed once per subject
in a multi-state paradigm}

\item{aggre}{e.g. \code{aggre = list(sex, fot)};
a list of unquoted variables and/or expressions thereof,
which are interpreted as factors; events and person-years in the data will
be aggregated by the unique combinations of these; see Details}

\item{aggre.type}{either \code{"unique"} or \code{"cross-product"};
can be abbreviated;
state transitions and person-years will be calculated either for all
existing levels of expressions in \code{aggre}, or
for the cross-product of all possible existing levels (with some
possibly having zero person-years and transitions); see Details}

\item{drop}{logical; if \code{TRUE}, drops all resulting rows
after splitting that reside outside
the time window as defined by the given breaks (all time scales)}

\item{pophaz}{a dataset of population hazards to merge
with the split data; see Details}

\item{pp}{logical; if \code{TRUE}, computes Pohar-Perme weights using
\code{pophaz}; adds variable with reserved name \code{pp};
see Details for computing method}

\item{subset}{a logical vector or any logical condition; the data is subset
accordingly before splitting}

\item{merge}{logical; if \code{TRUE}, retains all
original variables from the data}

\item{verbose}{logical; if \code{TRUE}, the function is chatty and
returns some messages along the way}

\item{...}{e.g. \code{fot = 0:5}; instead of specifying a \code{breaks} list,
correctly named breaks vectors can be given
for \code{fot}, \code{age}, and \code{per}; these override any breaks in the
\code{breaks} list}
}
\value{
If \code{aggre = NULL}, returns
a \code{data.table} or \code{data.frame}
(depending on \code{options("popEpi.datatable")}; see \code{?popEpi})
object expanded to accommodate split observations with time scales as
fractional years and \code{pophaz} merged in if given. Population
hazard levels are stored in the new variable \code{pop.haz}, and Pohar-Perme
weights in the new variable \code{pp} if requested.

If \code{aggre} is defined, returns a long-format
\code{data.table}/\code{data.frame} with the variable \code{pyrs} (person-years),
and variables for the counts of transitions in state or state at end of
follow-up formatted \code{fromXtoY}, where \code{X} and \code{Y} are
the states transitioned from and to, respectively.
}
\description{
Splits subject-level data
by calendar time (\code{per}), \code{age}, and follow-up
time (\code{fot}, from 0 to the end of follow-up)
into subject-time-interval rows according to
given \code{breaks}, and processes the result further if requested.
}
\details{
\strong{Basics}

\code{\link{lexpand}} splits a given data set (with e.g. cancer diagnoses
as rows) to subintervals of time over
calendar time, age, and follow-up time with given time breaks
using \code{\link{splitMulti}}.

The dataset must contain appropriate
\code{Date} / \code{IDate} / \code{date} format or
other numeric variables that can be used
as the time variables.

You may take a look at a simulated cohort
\code{\link{sire}} as an example of the
minimum required information for processing data with \code{lexpand}.

\strong{Breaks}

You should define all breaks as left inclusive and right exclusive
time points (e.g. \code{[a,b)})
for 1-3 time dimensions so that the last member of a breaks vector
is a meaningful "final upper limit",
e.g. \code{per = c(2002,2007,2012)}
to create a last subinterval of the form \code{[2007,2012)}.

All breaks are explicit, i.e. if \code{drop = TRUE},
any data beyond the outermost break points are dropped.
If one wants to have unspecified upper / lower limits on one time scale,
use \code{Inf}: e.g. \code{breaks = list(fot = 0:5, age = c(0,45,Inf))}.
Breaks for \code{per} can also be given in
\code{Date}/\code{IDate}/\code{date} format, whereupon
they are converted to fractional years before being used in splitting.
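
As a minimal base-R sketch of this interval convention (it does not call
\code{lexpand}, and the follow-up times are made up for illustration),
left-inclusive, right-exclusive intervals with an open-ended last interval
can be reproduced with \code{cut}:

\preformatted{
## hypothetical follow-up times in fractional years
fot <- c(0, 0.99, 1, 4.5, 5, 12)
fot_breaks <- c(0:5, Inf)

## right = FALSE gives intervals of the form [a,b);
## Inf as the last break leaves the upper limit unspecified
fot_int <- cut(fot, breaks = fot_breaks, right = FALSE)
table(fot_int)
}

A value exactly at a break point (e.g. \code{1} or \code{5} above) belongs to
the interval that starts at that break.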

\strong{Time variables}

If any of the given time variables
(\code{birth}, \code{entry}, \code{exit}, \code{event})
is in any kind of date format, it is first coerced to
fractional years before splitting
using \code{\link{get.yrs}} (with \code{year.length = "actual"}).

Sometimes in e.g. SIR/SMR calculation one may want the event time to differ
from the time of exit from follow-up, if the subject is still considered
to be at risk of the event. If \code{event} is specified, the transition to
 \code{status} is moved to \code{event} from \code{exit}
 using \code{\link[Epi]{cutLexis}}. See Examples.

\strong{The status variable}

The statuses in the expanded output (\code{lex.Cst} and \code{lex.Xst})
are determined by using either only \code{status} or both \code{status}
and \code{entry.status}. If \code{entry.status = NULL}, the status at entry
is guessed according to the type of variable supplied via \code{status}:
for numeric variables it is zero, for factors the first level
(\code{levels(status)[1]}), and otherwise the first unique value in alphabetical
order (\code{sort(unique(status))[1]}).
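
The guessing rule described above can be sketched in base R as follows.
The helper \code{guess_entry_status} is hypothetical and only mirrors the
description; it is not the code used internally by \code{lexpand}.

\preformatted{
## hypothetical helper mirroring the documented rule
guess_entry_status <- function(status) {
  if (is.numeric(status)) return(0)
  if (is.factor(status)) return(levels(status)[1])
  sort(unique(status))[1]
}

es_num <- guess_entry_status(c(0L, 1L, 2L))
es_fac <- guess_entry_status(factor(c("dead", "alive"),
                                    levels = c("alive", "dead")))
es_chr <- guess_entry_status(c("dead", "alive"))
}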

Using numeric or factor status
variables is strongly recommended. Logical expressions are also allowed
(e.g. \code{status = my_status != 0L}) and are converted to integer internally.

\strong{Merging population hazard information}

To enable computing relative/net survivals with \code{\link{survtab}}
and \code{\link{relpois}}, \code{lexpand} merges appropriate
population hazard data (\code{pophaz}) to the expanded data
before dropping rows outside the specified
time window (if \code{drop = TRUE}). For the merge to work, \code{pophaz} must
contain at a minimum the variables named
\code{agegroup}, \code{year}, and \code{haz}. \code{pophaz} may contain additional variables to specify
different population hazard levels in different strata; e.g. \code{popmort} includes \code{sex}.
All the strata-defining variables must be present in the supplied \code{data}. \code{lexpand} will
automatically detect variables with common names in the two datasets and merge using them.

Currently \code{year} must be an integer variable specifying the appropriate year. \code{agegroup}
must currently also specify one-year age groups, e.g. \code{popmort} specifies 101 age groups
of length 1 year. In both
\code{year} and \code{agegroup} variables the values are interpreted as the lower bounds of intervals
(and passed on to a \code{cut} call). The mandatory variable \code{haz}
must specify the appropriate average rate at the person-year level;
e.g. \code{haz = -log(survProb)}, where \code{survProb} is a one-year conditional
survival probability, is the correct hazard specification.
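
As a small worked example (the survival probabilities are invented for
illustration), one-year conditional survival probabilities can be converted
into hazards at the person-year level, and back again, like this:

\preformatted{
## hypothetical one-year conditional survival probabilities
survProb <- c(0.999, 0.95, 0.70)

## average hazard over one person-year
haz <- -log(survProb)

## surviving one year under a constant hazard recovers survProb
back <- exp(-haz * 1)
all.equal(back, survProb)
}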

The corresponding \code{pophaz} population hazard value is merged by using the mid-points
of the records after splitting as reference values. E.g. if \code{age=89.9} at the start
of a 1-year interval, then the reference age value is \code{90.4} for merging.
This way we get a "typical" population hazard level for each record.
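
The reference values can be sketched in base R as below. The variable names
and values are hypothetical, and \code{floor} is used here only as a simple
stand-in for assigning one-year age groups by their lower bounds.

\preformatted{
## hypothetical split records: age at interval start and interval length
age_start <- c(89.9, 45.2)
lex_dur <- c(1, 0.5)

## reference value is the mid-point of each record
age_ref <- age_start + lex_dur / 2

## one-year age groups identified by their lower bounds
agegroup <- floor(age_ref)
}

Here the first record is merged with the hazard of age group \code{90}
rather than \code{89}.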

\strong{Computing Pohar-Perme weights}

If \code{pp = TRUE}, Pohar-Perme weights
(the inverse of cumulative population survival) are computed. This will
create the new \code{pp} variable in the expanded data. \code{pp} is a
reserved name and \code{lexpand} throws an error if a variable with that name
exists in \code{data}.

When a survival interval contains one or several rows per subject
(e.g. due to splitting by the \code{per} scale),
\code{pp} is cumulated from the beginning of the first record in a survival
interval for each subject to the mid-point of the remaining time within that
survival interval, and that value is given for every other record
that a given person has within the same survival interval.

E.g. with 5 rows of duration \code{1/5} within a survival interval
\code{[0,1)}, \code{pp} is determined for all records by a cumulative
population survival from \code{0} to \code{0.5}. The existing accuracy is used,
so that the weight is cumulated first up to the end of the second row
and then over the remaining distance to the mid-point (first to 0.4, then to
0.5). This ensures that more accurately merged population hazards are fully
used.
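
The example above can be sketched numerically in base R. This is only an
illustration under the assumption of a constant population hazard within each
row; the hazard values are made up, and this is not the code used internally
by \code{lexpand}.

\preformatted{
## hypothetical population hazards for 5 rows of duration 1/5 in [0,1)
pop.haz <- c(0.010, 0.010, 0.012, 0.012, 0.012)
dur <- rep(1/5, 5)
row_end <- cumsum(dur)
row_start <- row_end - dur

## time each row contributes up to the mid-point 0.5 of the interval:
## rows 1 and 2 in full, row 3 from 0.4 to 0.5, rows 4 and 5 not at all
mid <- 0.5
exposure <- pmax(pmin(row_end, mid) - row_start, 0)

## cumulative population survival up to the mid-point and its inverse
cum_surv <- exp(-sum(pop.haz * exposure))
pp_weight <- 1 / cum_surv
}

The same \code{pp_weight} value would then be given to all five rows.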

\strong{Aggregating}

Certain analyses such as SIR/SMR calculations require tables of events and
person-years by the unique combinations (interactions) of several variables.
For this, \code{aggre} can be specified as a list of such variables
(preferably \code{factor} variables, but this is not mandatory)
and any arbitrary functions of the
variables at one's disposal. E.g.

\code{aggre = list(sex, agegr = cut(dg_age, 0:100))}

would tabulate events and person-years by sex and an ad-hoc age group
variable. Every ad-hoc-created variable should be named.

\code{fot}, \code{per}, and \code{age} are special reserved variables which,
when present in the \code{aggre} list, are output as categories of the
corresponding time scale variables by using
e.g.

\code{cut(fot, breaks$fot, right=FALSE)}.

This only works if
the corresponding breaks are defined in \code{breaks} or via \code{...}.
E.g.

\code{aggre = list(sex, fot.int = fot)} with

\code{breaks = list(fot=0:5)}.

The output variable \code{fot.int} in the above example will have
the lower limits of the appropriate intervals as values.

\code{aggre} as a named list will output numbers of events and person-years
with the given new names as categorizing variable names, e.g.
\code{aggre = list(follow_up = fot, gender = sex, agegroup = age)}.

The output table has person-years (\code{pyrs}) and event (state transition) counts
(e.g. \code{from0to1}) as columns. Event counts are the numbers of transitions
(\code{lex.Cst != lex.Xst}) or the \code{lex.Xst} value at a subject's
last record (subject possibly defined by \code{id}).

If \code{aggre.type = "unique"}, the above results are computed for existing
combinations of expressions given in \code{aggre}, but also for non-existing
combinations if \code{aggre.type = "cross-product"}. E.g. if a
factor variable has levels \code{"a", "b", "c"} but the data is limited
to only have levels \code{"a", "b"} present
(more than zero rows have these level values), the former setting only
computes results for \code{"a", "b"}, and the latter also for \code{"c"}
and any combination with other variables or expressions given in \code{aggre}.
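
The difference between the two settings can be illustrated with a small
base-R sketch. The variables below are hypothetical, and the code only shows
which combinations would be tabulated, not how \code{lexpand} computes them.

\preformatted{
## hypothetical data: level "c" of grp is defined but never observed
grp <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))
sex <- c(0L, 1L, 1L)

## "unique": only combinations that exist in the data
comb_unique <- unique(data.frame(grp, sex))

## "cross-product": all combinations of the possible levels,
## including those with zero person-years and transitions
comb_cross <- expand.grid(grp = levels(grp), sex = sort(unique(sex)))
}

Here \code{comb_unique} has 3 rows and \code{comb_cross} has 6, the latter
including both combinations involving \code{"c"}.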
}
\examples{
\dontrun{
## prepare data for e.g. 5-year cohort survival calculation
x <- lexpand(sire, breaks=list(fot=seq(0, 5, by = 1/12)),
             status =  status != 0, pophaz=popmort)

## prepare data for e.g. 5-year "period analysis" for 2008-2012
BL <- list(fot = seq(0, 5, by = 1/12), per = c("2008-01-01", "2013-01-01"))
x <- lexpand(sire, breaks = BL, pophaz=popmort, status =  status != 0)

## aggregating
BL <- list(fot = 0:5, per = c("2003-01-01","2008-01-01", "2013-01-01"))
ag <- lexpand(sire, breaks = BL, status = status != 0,
              aggre=list(sex, period = per, surv.int = fot))

## using "..."
x <- lexpand(sire, fot=0:5, pophaz=popmort, status =  status != 0)

x <- lexpand(sire, fot=0:5, status =  status != 0,
             aggre=list(sex, surv.int = fot))

## using the "event" argument: it just places the transition to given "status"
## at the "event" time instead of at the end, if possible using cutLexis
x <- lexpand(sire, status = status, event = dg_date, birth=bi_date, entry=bi_date, exit=ex_date)

## aggregating with custom "event" time

x <- lexpand(sire, status = status, event = dg_date, birth=bi_date, entry=bi_date, exit=ex_date,
             per = 1970:2014, age = c(0:100,Inf),
             aggre = list(sex, year = per, agegroup = age))

}
}
\author{
Joonas Miettinen
}
\seealso{
\code{\link{splitMulti}}, \code{\link[Epi]{Lexis}}, \code{\link{survtab}}, \code{\link{relpois}}, \code{\link{popmort}}, \code{\link{sir}}
}

