% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/clean_dataset.R
\name{clean_dataset}
\alias{clean_dataset}
\alias{CleanCoordinatesDS}
\title{Coordinate Cleaning using Dataset Properties}
\usage{
clean_dataset(x, lon = "decimallongitude", lat = "decimallatitude",
  ds = "dataset", tests = c("ddmm", "periodicity"),
  value = "dataset", verbose = TRUE, ...)
}
\arguments{
\item{x}{data.frame. Containing geographical coordinates and species
names.}

\item{lon}{character string. The column with the longitude coordinates.
Default = \dQuote{decimallongitude}.}

\item{lat}{character string. The column with the latitude coordinates.
Default = \dQuote{decimallatitude}.}

\item{ds}{a character string. The column with the dataset of each record. If
\code{x} should be treated as a single dataset, this column must be identical
for all records. Default = \dQuote{dataset}.}

\item{tests}{a vector of character strings, indicating which tests to run.
See details for all tests available. Default = c("ddmm", "periodicity")}

\item{value}{a character string. Defines the output value. See the Value
section. Default = \dQuote{dataset}.}

\item{verbose}{logical. If TRUE, reports the name of the test and the number
of records flagged.}

\item{...}{additional arguments to be passed to \code{\link{cd_ddmm}} and
\code{\link{cd_round}} to customize test sensitivity.}
}
\value{
Depending on the \sQuote{value} argument:
\describe{
\item{\dQuote{dataset}}{a \code{data.frame} with
the test summary statistics for each dataset in \code{x}}
\item{\dQuote{clean}}{a \code{data.frame} containing only
records from datasets in \code{x} that passed the tests}
\item{\dQuote{flagged}}{a logical vector with one element per
row in \code{x}, with TRUE = test passed and
FALSE = test failed/potentially problematic.}
}
}
\description{
Tests for problems associated with coordinate conversions and rounding,
based on dataset properties. Includes tests to identify contributing datasets
with potential errors in the conversion from ddmm to dd.dd, and
periodicity in the data decimals that indicates rounding or a raster basis
linked to low coordinate precision. Specifically:
\itemize{
\item ddmm tests for erroneous conversion from a degree
minute format (ddmm) to a decimal degree (dd.dd) format
\item periodicity tests for periodicity in the data,
which can indicate imprecise coordinates due to rounding or rasterization.
}
}
\details{
These tests are based on the statistical distribution of coordinates and
their decimals within
datasets of geographic distribution records to identify datasets with
potential errors/biases. Three potential error sources can be identified.
The ddmm flag tests for the particular pattern that emerges if geographical
coordinates in a degree minute notation are transferred into decimal
degrees by simply replacing the degree symbol with the decimal point. This
kind of problem has been observed in older datasets first recorded on
paper using typewriters, where, for example, a decimal point was used as the
symbol for degrees. The function uses a binomial test to check if more records
than expected have decimals below 0.6 (which is the maximum that can be
obtained in minutes, as one degree has 60 minutes) and if the number of these
records is higher than those above 0.59 by a certain proportion. The
periodicity test uses rate estimation in a Poisson process to estimate if
there is periodicity in the decimals of a dataset (as would be expected from,
for example, rounding or data that was collected in a raster format) and if
there is a disproportionate number of records with the decimal 0 (full
degrees), which indicates rounding and thus low precision. The default values
were empirically optimized with GBIF data, but should probably be adapted.
}
\note{
See \url{https://ropensci.github.io/CoordinateCleaner/} for more details
and tutorials.
}
\examples{
#Create test dataset
clean <- data.frame(dataset = rep("clean", 1000),
                    decimallongitude = runif(min = -43, max = -40, n = 1000),
                    decimallatitude = runif(min = -13, max = -10, n = 1000))
                    
bias.long <- c(round(runif(min = -42, max = -40, n = 500), 1),
               round(runif(min = -42, max = -40, n = 300), 0),
               runif(min = -42, max = -40, n = 200))
bias.lat <- c(round(runif(min = -12, max = -10, n = 500), 1),
              round(runif(min = -12, max = -10, n = 300), 0),
              runif(min = -12, max = -10, n = 200))
bias <- data.frame(dataset = rep("biased", 1000),
                   decimallongitude = bias.long,
                   decimallatitude = bias.lat)
test <- rbind(clean, bias)
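
# Illustration only (base R, not the package's internal implementation):
# a minimal sketch of the decimal pattern that the ddmm test looks for.
# The names dm_degrees, dm_minutes, misconverted and frac are hypothetical.
# A degree-minute value such as 41 degrees 30 minutes, mistyped as 41.30,
# can never have a decimal part of 0.60 or more, as one degree has 60 minutes.
set.seed(1)
dm_degrees <- sample(40:42, 1000, replace = TRUE)
dm_minutes <- sample(0:59, 1000, replace = TRUE)
misconverted <- dm_degrees + dm_minutes / 100
frac <- misconverted - floor(misconverted)
# Share of decimals below 0.6: all of them for the misconverted values,
# compared with roughly 0.6 for uniformly distributed decimals
mean(frac < 0.6)
mean(runif(1000) < 0.6)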

\dontrun{                  
#run clean_dataset
flags <- clean_dataset(test)

#check problems
#clean
hist(test[test$dataset == rownames(flags[flags$summary,]), "decimallongitude"])
#biased
hist(test[test$dataset == rownames(flags[!flags$summary,]), "decimallongitude"])

}
}
\seealso{
\code{\link{cd_ddmm}}, \code{\link{cd_round}}

Other Wrapper functions: \code{\link{clean_coordinates}},
  \code{\link{clean_fossils}}
}
\concept{Wrapper functions}
\keyword{Coordinate}
\keyword{cleaning}
\keyword{wrapper}
