% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tar_repository_cas.R
\name{tar_repository_cas}
\alias{tar_repository_cas}
\title{Define a custom content-addressable storage
(CAS) repository (an experimental feature).}
\usage{
tar_repository_cas(
  upload,
  download,
  exists = NULL,
  list = NULL,
  consistent = FALSE,
  substitute = base::list()
)
}
\arguments{
\item{upload}{A function with arguments \code{key} and \code{path}, in that order.
This function should upload the file or directory from \code{path}
to the CAS system.
\code{path} is where the file is originally saved to disk outside the CAS
system. It could be a staging area or a custom \code{format = "file"}
location. \code{key} denotes the name of the destination data object
in the CAS system.

To differentiate between
\code{format = "file"} targets and non-\code{"file"} targets, the \code{upload}
method can use \code{\link[=tar_format_get]{tar_format_get()}}. For example, to make
\code{\link[=tar_repository_cas_local]{tar_repository_cas_local()}} efficient, \code{upload} moves the file
if \code{targets::tar_format_get() == "file"} and copies it otherwise.

See the "Repository functions" section for more details.}

\item{download}{A function with arguments \code{key} and \code{path}, in that order.
This function should download the data object at \code{key} from
the CAS system to the file or directory at \code{path}.
\code{key} denotes the name of the data object in the CAS system.
\code{path} is a temporary staging area and not the final destination.

Please be careful to avoid deleting the object at \code{key} from the CAS
system. If the CAS system is a local file system, for example,
\code{download} should copy the file and not simply move it
(e.g. please avoid \code{file.rename()}).
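
As a minimal sketch of this advice, assuming a CAS that is just a local
folder, a copy-based \code{download} leaves the stored object in place.
The temporary \code{cas} folder, the key \code{"abc123"}, and the
\code{staging} path are made up for illustration:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{# Hypothetical local CAS folder with one stored object.
cas <- tempfile("cas_")
dir.create(cas)
writeLines("contents", file.path(cas, "abc123"))

# Copy the object to the staging path instead of moving it.
# In a real repository, the folder would be hard-coded or injected
# with the substitute argument, not read from a global variable.
download <- function(key, path) {
  file.copy(file.path(cas, key), path, overwrite = TRUE)
}

staging <- tempfile("staging_")
download("abc123", staging)

# The object is still available in the CAS after the download.
file.exists(file.path(cas, "abc123"))
}\if{html}{\out{</div>}}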

See the "Repository functions" section for more details.}

\item{exists}{A function with a single argument \code{key},
where \code{key} is a single character string (\code{length(key)} is 1)
to identify a single object in the CAS system.

The \code{exists} function should check if there is a single object at
a single \code{key} in the CAS system.
It is ignored if \code{list} is given and \code{consistent} is \code{TRUE}.

See the "Repository functions" section for more details.}

\item{list}{Either \code{NULL} or an optional function with a single
argument named \code{keys}.

The \code{list} function increases efficiency by reducing repeated calls
to the \code{exists} function (see above) or entirely avoiding them
if \code{consistent} is \code{TRUE}.

The \code{list} function should return a character vector of keys that
already exist in the CAS system.
The \code{keys} argument of \code{list} is a character vector of
CAS keys (hashes) which are already recorded in the pipeline metadata
(\code{tar_meta()}).
For greater efficiency, the \code{list} function can restrict its query
to these existing keys instead of trying to list the billions of keys
that could exist in a CAS system.
See the source code of \code{\link[=tar_cas_l]{tar_cas_l()}}
for an example of how this can work for a local file system CAS.

See the "Repository functions" section for more details.}

\item{consistent}{Logical. Set to \code{TRUE} if the storage platform is
strongly read-after-write consistent. Set to \code{FALSE} if the platform
is not necessarily strongly read-after-write consistent.

A data storage system is said to have strong read-after-write consistency
if a new object is fully available for reading as soon as the write
operation finishes. Many modern cloud services like Amazon S3 and
Google Cloud Storage have strong read-after-write consistency,
meaning that if you upload an object with a PUT request, then a
GET request immediately afterwards will retrieve the precise
version of the object you just uploaded.

Some storage systems do not have strong read-after-write consistency.
One example is network file systems (NFS). On a computing cluster,
if one node creates a file on an NFS, then there is a delay before
other nodes can access the new file. \code{targets} handles this situation
by waiting for the new file to appear with the correct hash
before attempting to use it in downstream computations.
\code{consistent = FALSE} imposes a waiting period in which \code{targets}
repeatedly calls the \code{exists} method until the file becomes available
(at time intervals configurable with \code{\link[=tar_resources_network]{tar_resources_network()}}).
These extra calls to \code{exists} may add a
little latency and computational burden,
but on systems which are not strongly read-after-write consistent,
it is the only way \code{targets} can safely use the current results
for downstream computations.}

\item{substitute}{Named list of values to be inserted into the
body of each custom function in place of symbols in the body.
For example, if
\code{upload = function(key, path) do_upload(key, path, bucket = X)}
and \code{substitute = list(X = "my_aws_bucket")}, then
the \code{upload} function will actually end up being
\code{function(key, path) do_upload(key, path, bucket = "my_aws_bucket")}.

Please do not include temporary or sensitive information
such as authentication credentials.
If you do, then \code{targets} will write them
to metadata on disk, and a malicious actor could
steal and misuse them. Instead, pass sensitive information
as environment variables using \code{\link[=tar_resources_repository_cas]{tar_resources_repository_cas()}}.
These environment variables only exist in the transient memory
spaces of the R sessions of the local and worker processes.}
}
\description{
Define a custom storage repository that uses
content-addressable storage (CAS).
}
\section{Content-addressable storage}{

Normally, \code{targets} organizes output data
based on target names. For example,
if a pipeline has a single target \code{x} with default settings,
then \code{\link[=tar_make]{tar_make()}} saves the output data to the file
\verb{_targets/objects/x}. When the output of \code{x} changes, \code{\link[=tar_make]{tar_make()}}
overwrites \verb{_targets/objects/x}.
In other words, no matter how many changes happen to \code{x},
the data store always looks like this:

\if{html}{\out{<div class="sourceCode">}}\preformatted{_targets/
    meta/
        meta
    objects/
        x
}\if{html}{\out{</div>}}

By contrast, with content-addressable storage (CAS),
\code{targets} organizes outputs based on the hashes of their contents.
The name of each output file is its hash, and the
metadata maps these hashes to target names. For example, suppose
target \code{x} has \code{repository = tar_repository_cas_local("my_cas")}.
When the output of \code{x} changes, \code{\link[=tar_make]{tar_make()}} creates a new file
inside \verb{my_cas/} without overwriting or deleting any other files
in that folder. If you run \code{\link[=tar_make]{tar_make()}} three different times
with three different values of \code{x}, then storage will look like this:

\if{html}{\out{<div class="sourceCode">}}\preformatted{_targets/
    meta/
        meta
my_cas/
    1fffeb09ad36e84a
    68328d833e6361d3
    798af464fb2f6b30
}\if{html}{\out{</div>}}

The next call to \code{tar_read(x)} uses \code{tar_meta(x)$data}
to look up the current hash of \code{x}. If \code{tar_meta(x)$data} returns
\code{"1fffeb09ad36e84a"}, then \code{tar_read(x)} returns the data from
\verb{my_cas/1fffeb09ad36e84a}. Files \verb{my_cas/68328d833e6361d3} and
\verb{my_cas/798af464fb2f6b30} are left over from previous values of \code{x}.
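
The general idea can be sketched in base R. This is only an illustration
and not the internals of \code{targets}: the \code{store()} helper, the
\code{meta} list, and the use of \code{tools::md5sum()} as the hash are
all assumptions made for the example.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{# Hypothetical CAS folder and metadata, for illustration only.
cas <- tempfile("my_cas_")
dir.create(cas)
meta <- list()

# Save a value to a staging file, then file it under its hash.
# tools::md5sum() stands in for whatever hash a pipeline records.
store <- function(value) {
  staging <- tempfile()
  saveRDS(value, staging)
  key <- unname(tools::md5sum(staging))
  file.copy(staging, file.path(cas, key))
  key
}

# Three successive values of x: the metadata keeps only the latest hash.
meta$x <- store(1)
meta$x <- store(2)
meta$x <- store(3)

# The CAS folder keeps one file per distinct value.
list.files(cas)

# Reading x means looking up its current hash in the metadata.
readRDS(file.path(cas, meta$x))
}\if{html}{\out{</div>}}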

Because CAS accumulates historical data objects,
it is ideal for data versioning and collaboration.
If you commit the \verb{_targets/meta/meta} file to version control
alongside the source code,
then you can revert to a previous state of your pipeline with all your
targets up to date, and a colleague can leverage your hard-won
results using a fork of your code and metadata.

The downside of CAS is the cost of accumulating many data objects
over time. Most pipelines that use CAS
should have a garbage collection system or retention policy
to remove data objects when they are no longer needed.
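
A retention policy can be as simple as deleting every object whose key
is no longer referenced. The sketch below assumes a local folder CAS and
a hypothetical vector \code{in_use} of keys to keep. In a real pipeline
those keys could come from the \code{data} column of \code{tar_meta()}.
Removing an object also removes the ability to revert to the pipeline
state that used it.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{# Hypothetical CAS folder with three historical objects.
cas <- tempfile("my_cas_")
dir.create(cas)
keys <- c("1fffeb09ad36e84a", "68328d833e6361d3", "798af464fb2f6b30")
file.create(file.path(cas, keys))

# Assumption: the metadata currently references only this key.
in_use <- "1fffeb09ad36e84a"

# Delete every object that the metadata no longer references.
stale <- setdiff(list.files(cas), in_use)
unlink(file.path(cas, stale))

list.files(cas)
}\if{html}{\out{</div>}}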

The \code{\link[=tar_repository_cas]{tar_repository_cas()}} function lets you create your own CAS system
for \code{targets}. You can supply arbitrary custom methods to upload,
download, and check for the existence of data objects. Your custom
CAS system can exist locally on a shared file system or remotely
on the cloud (e.g. in an AWS S3 bucket).
See the "Repository functions" section and the documentation
of individual arguments for advice on how
to write your own methods.

The \code{\link[=tar_repository_cas_local]{tar_repository_cas_local()}} function has an example
CAS system based on a local folder on disk.
It uses \code{\link[=tar_cas_u]{tar_cas_u()}} for uploads,
\code{\link[=tar_cas_d]{tar_cas_d()}} for downloads, and
\code{\link[=tar_cas_l]{tar_cas_l()}} for listing keys.
}

\section{Repository functions}{

In \code{\link[=tar_repository_cas]{tar_repository_cas()}}, functions \code{upload}, \code{download},
\code{exists}, and \code{list} must be completely pure and self-sufficient.
They must load or namespace all their own packages,
and they must not depend on any custom user-defined
functions or objects in the global environment of your pipeline.
\code{targets} converts each function to and from text,
so it must not rely on any data in the closure.
This disqualifies functions produced by \code{Vectorize()},
for example.
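
The base R sketch below shows why closure data is a problem. It uses
\code{deparse()} and \code{parse()} as a rough stand-in for the
conversion to and from text, which is an assumption for illustration and
not necessarily what \code{targets} does internally. The names
\code{make_exists()} and \code{cas_folder_in_closure} are hypothetical.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{# A function factory keeps cas_folder_in_closure in the closure.
make_exists <- function(cas_folder_in_closure) {
  function(key) file.exists(file.path(cas_folder_in_closure, key))
}
exists_closure <- make_exists("my_cas")

# Round trip through text: only the code survives, not the closure.
exists_text <- eval(parse(text = deparse(exists_closure)))

# The reconstructed function cannot find cas_folder_in_closure.
result <- try(exists_text("abc123"), silent = TRUE)
inherits(result, "try-error")

# A self-sufficient version keeps everything in the function body
# and namespaces any package functions it needs.
exists_pure <- function(key) base::file.exists(base::file.path("my_cas", key))
exists_pure("abc123")
}\if{html}{\out{</div>}}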

\code{upload} and \code{download} can assume \code{length(path)} is 1, but they should
account for the possibility that \code{path} could be a directory. To
avoid supporting directories altogether, \code{upload} could call an assertion:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{targets::tar_assert_not_dir(
  path,
  msg = "This CAS upload method does not support directories."
)
}\if{html}{\out{</div>}}

Otherwise, support for directories may require handling them as a
special case. For example, \code{upload} and \code{download} could copy
all the files in the given directory,
or they could manage the directory as a zip archive.
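
As a rough sketch of the first option, assuming a local folder CAS, the
functions below copy directory contents recursively and fall back to a
plain file copy otherwise. The \code{folder} variable is a placeholder
defined globally only so the sketch runs on its own. In a real
repository it would be hard-coded or supplied through \code{substitute}.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{# Hypothetical local CAS folder.
folder <- tempfile("cas_")
dir.create(folder)

upload <- function(key, path) {
  to <- file.path(folder, key)
  if (dir.exists(path)) {
    # Recreate the directory under the key and copy its contents.
    dir.create(to)
    files <- list.files(path, all.files = TRUE, full.names = TRUE, no.. = TRUE)
    file.copy(files, to, recursive = TRUE)
  } else {
    file.copy(path, to)
  }
  invisible()
}

download <- function(key, path) {
  from <- file.path(folder, key)
  if (dir.exists(from)) {
    # Copy, never move, so the object stays in the CAS.
    dir.create(path, showWarnings = FALSE)
    files <- list.files(from, all.files = TRUE, full.names = TRUE, no.. = TRUE)
    file.copy(files, path, recursive = TRUE)
  } else {
    file.copy(from, path)
  }
  invisible()
}

# Try it on a small directory with a nested file.
output <- tempfile("output_")
dir.create(file.path(output, "nested"), recursive = TRUE)
writeLines("a", file.path(output, "a.txt"))
writeLines("b", file.path(output, "nested", "b.txt"))
upload("abc123", output)
staging <- tempfile("staging_")
download("abc123", staging)
list.files(staging, recursive = TRUE)
}\if{html}{\out{</div>}}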

Some functions may need to be adapted and configured based on other
inputs. For example, you may want to define
\verb{upload = \\(key, path) file.rename(path, file.path(folder, key))}
but do not want to hard-code a value of \code{folder} when you write the
underlying function. The \code{substitute} argument handles this situation.
For example, if \code{substitute} is \code{list(folder = "my_folder")},
then \code{upload} will end up as
\verb{\\(key, path) file.rename(path, file.path("my_folder", key))}.
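
The effect is similar to the following base R sketch, which uses
\code{base::substitute()} directly. It is meant only to show what the
resulting function body looks like and is not necessarily how
\code{targets} performs the replacement. The \code{values} list is a
name made up for the example.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{# The function is written with a placeholder symbol named folder.
upload <- function(key, path) file.rename(path, file.path(folder, key))

# Values to insert in place of symbols in the function body.
values <- list(folder = "my_folder")

# Replace the symbols in the body with the supplied values.
body(upload) <- do.call(substitute, list(body(upload), values))

# The body now contains the string "my_folder" instead of the symbol.
upload
}\if{html}{\out{</div>}}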

Temporary or sensitive information such as authentication credentials
should not be injected into the function body
this way. Instead, pass it as environment
variables using \code{\link[=tar_resources_repository_cas]{tar_resources_repository_cas()}}.
}

\examples{
if (identical(Sys.getenv("TAR_EXAMPLES"), "true")) { # for CRAN
tar_dir({ # tar_dir() runs code from a temp dir for CRAN.
tar_script({
  library(targets)
  library(tarchetypes)
  repository <- tar_repository_cas(
    upload = function(key, path) {
      if (dir.exists(path)) {
        stop("This CAS repository does not support directory outputs.")
      }
      if (!file.exists("cas")) {
        dir.create("cas", recursive = TRUE)
      }
      file.rename(path, file.path("cas", key))
    },
    download = function(key, path) {
      file.copy(file.path("cas", key), path)
    },
    exists = function(key) {
      file.exists(file.path("cas", key))
    },
    list = function(keys) {
      keys[file.exists(file.path("cas", keys))]
    },
    consistent = FALSE
  )
  write_file <- function(object) {
    writeLines(as.character(object), "file.txt")
    "file.txt"
  }
  list(
    tar_target(x, c(2L, 4L), repository = repository),
    tar_target(
      y,
      x,
      pattern = map(x),
      format = "qs",
      repository = repository
    ),
    tar_target(z, write_file(y), format = "file", repository = repository)
  )
})
tar_make()
tar_read(y)
tar_read(z)
list.files("cas")
tar_meta(any_of(c("x", "z")), fields = any_of("data"))
})
}
}
\seealso{
Other content-addressable storage: 
\code{\link{tar_repository_cas_local}()},
\code{\link{tar_repository_cas_local_gc}()}
}
\concept{content-addressable storage}
