% Generated by roxygen2 (4.0.1): do not edit by hand
\name{preprocessData}
\alias{preprocessData}
\title{Check and preprocess an allele dataset}
\usage{
preprocessData(adata, numLoci, ploidy, dataType, dioecious,
  selfCompatible = NULL, mothersOnly = NULL, lociMin = 1)
}
\arguments{
\item{adata}{data frame: an allele dataset.}

\item{numLoci}{integer: the number of loci in the allele dataset.}

\item{ploidy}{integer: the species' ploidy, one of \code{2},
\code{4}, \code{6}, or \code{8}.}

\item{dataType}{character: either \code{"genotype"} or
\code{"phenotype"}.}

\item{dioecious}{logical: is the species dioecious (\code{TRUE}) or
monoecious (\code{FALSE})?}

\item{selfCompatible}{logical: in monoecious species
(\code{dioecious=FALSE}), can individuals self-fertilise?  When
\code{dioecious=FALSE}, this argument may be left at its
default value of \code{NULL}; it will be set to \code{FALSE} by
\code{preprocessData}.}

\item{mothersOnly}{logical: in dioecious species, should females
without progeny present be removed from the dataset?  If
\code{dioecious=TRUE}, then \code{mothersOnly} must be set to
either \code{TRUE} or \code{FALSE}.  If \code{dioecious=FALSE},
argument \code{mothersOnly} should be left at its default value of
\code{NULL}.}

\item{lociMin}{integer: the minimum number of loci in an individual
that must have alleles present for the individual (and its
progeny, if any) to be retained in the dataset (default 1).}
}
\value{
A data frame, containing the checked and pre-processed
allele data, ready for further analysis by other \pkg{PolyPatEx}
functions.  All columns in the returned data frame will be of mode
\code{character}.
}
\description{
Check and preprocess the input allele data frame prior to
subsequent analysis.
}
\details{
If \code{\link{inputData}} is used to load the allele data set
into R, then \code{preprocessData} will be called automatically on
the data frame before it is returned by \code{\link{inputData}}.
However, if the user loads their data into R by some means other
than \code{\link{inputData}}, then \code{preprocessData} MUST be
called on the data frame prior to using any other PolyPatEx
functions to analyse the allele data---\code{preprocessData}
performs a series of critical checks and preprocessing steps on
the data frame, without which other analysis functions in
PolyPatEx will fail.

Note that \code{\link{inputData}} strips leading or trailing
spaces (whitespace) from each entry in the allele dataset as it is
read in.  If you load your data by a means other than
\code{\link{inputData}}, you should ensure that you perform this
step yourself, as \code{preprocessData} will not carry out this
necessary step.

Note also that you should not use spaces in any of your allele
codes: PolyPatEx functions use spaces to separate allele codes as
they process the data, so if allele codes already contain spaces,
errors will occur in this processing. If you need a separator, I
recommend using either \sQuote{\code{.}} (a period) or
\sQuote{\code{_}} (an underscore) rather than a space.

\code{preprocessData} first performs a number of simple checks on
the format and validity of the data set.  These checks look for
the presence of certain required columns and correct naming and
content of these columns.  \code{preprocessData} will usually stop
with an error message should the data fail these basic checks.
Correct the indicated problem in the CSV file or R allele data
frame, then call \code{\link{inputData}} or \code{preprocessData}
again as appropriate.  If you use a spreadsheet to edit the CSV
file, don't forget that you may also need to call
\code{\link{fixCSV}} on the CSV file, prior to calling
\code{\link{inputData}} again.

If the data is \sQuote{genotypic} data, PolyPatEx requires that all
\eqn{p} alleles must be present in each allele set, where \eqn{p}
is the species' ploidy.  If an allele set contains fewer than
\eqn{p} alleles, then it is reset to contain no alleles and is
subsequently ignored by other PolyPatEx functions.  ID and locus
information is printed to the R terminal, to help the user locate
these cases in their original dataset.

Further checks look for mismatches between progeny and their
mothers' allele sets at each locus---these are situations where a
progeny's allele set could not have arisen from any gamete that
the mother can provide.  When only one such mismatching locus
occurs in a mother-progeny pair, the offending allele set in the
progeny is reset to contain no alleles (we term these
\sQuote{missing} allele sets).  When mismatches occur in more than
one locus, the progeny is removed entirely from the dataset.
Information is printed to the R terminal to assist the user in
identifying the affected individuals and loci---in particular,
note that removal of several (or all) of a single mother's progeny
may indicate an error in the mother's allele data, rather than in
her progeny.

After the mother/progeny mismatch check above, a subsequent check
removes individuals from the dataset that have fewer than
\code{lociMin} non-missing allele sets remaining.  The default
value for \code{lociMin} is 1---an individual must have at least
one non-missing locus to remain in the dataset.  If a mother is
removed from the dataset at this stage, all of her progeny are
removed also.  Again, information about these removals is printed
to the R terminal.

Note that in the data frame that is returned by
\code{preprocessData}, the alleles in each allele set (i.e.,
corresponding to each locus) will be sorted into alphanumeric
order---this sort order is necessary for the correct operation of
other PolyPatEx routines, and should not be altered.

PolyPatEx needs to know the characteristics of the dataset being
analysed.  These are specified in the \code{\link{inputData}} or
\code{preprocessData} calls and are invisibly attached to the
allele data frame that is returned, for use by other PolyPatEx
functions. The required characteristics are:
\itemize{
 \item \code{numLoci}: the number of loci in the dataset
 \item \code{ploidy}: the ploidy (\eqn{p}) of the species
       (currently allowed to be 4, 6, or 8.  \code{ploidy} can
       also be 2, provided \code{dataType="genotype"})
 \item \code{dataType}: whether the data is genotypic (all \eqn{p}
       alleles at each locus are observed) or phenotypic (only
       the distinct allele states at a locus are observed -
       alleles that appear more than once in the genotype of
       a locus only appear once in the phenotype)
 \item \code{dioecious}: whether the species is dioecious
       or monoecious
 \item \code{selfCompatible}: whether a monoecious species is
       self compatible (i.e., whether an individual can
       fertilise itself)
 \item \code{mothersOnly}: whether a dioecious dataset should
       retain only adult females that are mothers of progeny
       in the dataset.
}
}
\examples{
## If using inputData to input the allele dataset from CSV file,
## preprocessData() is applied automatically before the dataset is
## returned by inputData().

## Otherwise, if the allele dataset is created or loaded into R
## by other means, then preprocessData() must be applied before
## other PolyPatEx analysis routines are applied:

## Using the example dataset 'GF_Phenotype':
data(GF_Phenotype)

## Since we did not load this dataset using inputData(), we must
## first process it with preprocessData() before doing anything
## else:
pData <- preprocessData(GF_Phenotype,
                        numLoci=7,
                        ploidy=6,
                        dataType="phenotype",
                        dioecious=FALSE,
                        selfCompatible=FALSE)

head(pData)  ## Checked and cleaned version of GF_Phenotype

## pData is now ready to be passed to other PolyPatEx analysis
## functions...
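
## A minimal sketch (an illustration only, not part of PolyPatEx):
## if an allele data frame is built without inputData(), leading and
## trailing whitespace can be stripped from every entry before
## preprocessData() is called.  The toy data frame 'rawData' and its
## column names are hypothetical, chosen only for this illustration.
rawData <- data.frame(id = c(" A1", "A2 "),
                      Locus1a = c(" 101 ", "103"),
                      stringsAsFactors = FALSE)

## Convert each column to character and trim surrounding whitespace:
trimmed <- as.data.frame(lapply(rawData,
                                function(x) trimws(as.character(x))),
                         stringsAsFactors = FALSE)

## Flag any column that still holds an entry with an embedded space,
## since spaces must not appear within allele codes:
hasSpace <- sapply(trimmed,
                   function(x) any(grepl(" ", x, fixed = TRUE)))
hasSpace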
}
\author{
Alexander Zwart (alec.zwart at csiro.au)
}

