% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/vcfR_to_tidy_functions.R
\name{Convert to tidy data frames}
\alias{Convert to tidy data frames}
\alias{vcfR2tidy}
\alias{extract_info_tidy}
\alias{extract_gt_tidy}
\alias{vcf_field_names}
\title{Convert vcfR objects to tidy data frames}
\usage{
vcfR2tidy(
  x,
  info_only = FALSE,
  single_frame = FALSE,
  toss_INFO_column = TRUE,
  ...
)

extract_info_tidy(x, info_fields = NULL, info_types = TRUE, info_sep = ";")

extract_gt_tidy(
  x,
  format_fields = NULL,
  format_types = TRUE,
  dot_is_NA = TRUE,
  alleles = TRUE,
  allele.sep = "/",
  gt_column_prepend = "gt_",
  verbose = TRUE
)

vcf_field_names(x, tag = "INFO")
}
\arguments{
\item{x}{an object of class vcfR}

\item{info_only}{if TRUE return a list with only a \code{fix} component
(a single data frame that has the parsed INFO information) and 
a \code{meta} component. Don't extract any of the FORMAT fields.}

\item{single_frame}{return a single tidy data frame in list component
\code{dat} rather than returning it in components
\code{fix} and/or \code{gt}.}

\item{toss_INFO_column}{if TRUE (the default) the INFO column will be removed from output as
its constituent parts will have been parsed into separate columns.}

\item{...}{more options to pass to \code{\link{extract_info_tidy}} and 
\code{\link{extract_gt_tidy}}.  See parameters listed below.}

\item{info_fields}{names of the fields to be extracted from the INFO column
into a long format data frame.  If this is left as NULL (the default) then
the function returns a column for every INFO field listed in the metadata.}

\item{info_types}{named vector of "i" or "n" if you want the fields extracted from the INFO column to be converted to integer or numeric types, respectively.
When set to NULL they will be characters.  
The names have to be the exact names of the fields.  
For example \code{info_types = c(AF = "n", DP = "i")} will convert column AF to numeric and DP to integer.
If you would like the function to try to figure out the conversion from the metadata information, then set \code{info_types = TRUE}.  
Anything with Number == 1 and (Type == Integer or Type == Numeric) will then be converted accordingly.}

\item{info_sep}{the delimiter used in the data portion of the INFO fields to 
separate different entries.  By default it is ";", but earlier versions of the VCF
standard apparently used ":" as a delimiter.}

\item{format_fields}{names of the fields in the FORMAT column to be extracted from 
each individual in the vcfR object into 
a long format data frame.  If left as NULL, the function will extract all the FORMAT
columns that were documented in the meta section of the VCF file.}

\item{format_types}{named vector of "i" or "n" if you want the fields extracted according to the FORMAT column to be converted to integer or numeric types, respectively.
When set to TRUE an attempt to determine their type will be made from the meta information.
When set to NULL they will be characters.  
The names have to be the exact names of the format_fields.  
Works equivalently to the \code{info_types} argument in 
\code{\link{extract_info_tidy}}, i.e., if you set it to TRUE then it uses the information in the
meta section of the VCF to coerce to types as indicated.}

\item{dot_is_NA}{if TRUE then a single "." in a character field will be set to NA.  If FALSE
no conversion is done.  Note that "." in a numeric or integer field 
(according to format_types) with Number == 1 is always
going to be set to NA.}

\item{alleles}{if TRUE (the default) then this will return a column, \code{gt_GT_alleles}, that
has the genotype of the individual expressed as the alleles rather than as 0/1.}

\item{allele.sep}{character which delimits the alleles in a genotype (/ or |) to be passed to
\code{\link{extract.gt}}. Here this is not used for a regex (as it is in other functions), but merely
for output formatting.}

\item{gt_column_prepend}{string to prepend to the names of the FORMAT columns
in the output so that they
do not conflict with any INFO columns in the output.  Default is "gt_". Should be a 
valid R name (i.e. it should not start with a number, contain a space, etc.).}

\item{verbose}{logical to specify if verbose output should be produced.}

\item{tag}{name of the lines in the metadata section of the VCF file to parse out.
Default is "INFO".  The only other one currently tested and supported is "FORMAT".}
}
\value{
An object of class \code{tbl_df} or a list where every element is of class \code{tbl_df}.
}
\description{
Convert the information in a vcfR object to a long-format data frame
suitable for analysis or use with Hadley Wickham's packages, 
\href{https://cran.r-project.org/package=dplyr}{dplyr},
\href{https://cran.r-project.org/package=tidyr}{tidyr}, and
\href{https://cran.r-project.org/package=ggplot2}{ggplot2}.
These packages have been
optimized for operation on large data frames, and, though they can bog down
with very large data sets, they provide a good framework for handling and filtering
large variant data sets.  For some background
on the benefits of such "tidy" data frames, see 
\href{https://www.jstatsoft.org/article/view/v059i10}{this article}.

For some filtering operations, such as those where one wants to filter genotypes
upon GT fields in combination with INFO fields, or more complex 
operations in which one wants to filter
loci based upon the number of individuals having greater than a certain quality score,
it will be advantageous to put all the information into a long format data frame 
and use \code{dplyr} to perform the operations.  Additionally, a long data format is
required for using \code{ggplot2}.  These functions convert vcfR objects to long format
data frames.
}
\details{
The function \strong{vcfR2tidy} is the main function in this series.  It takes a vcfR
object and converts the information to a list of long-format data frames.  The user can
specify whether only the INFO or both the INFO and the FORMAT columns should be extracted, and also
which INFO and FORMAT fields to extract.  If no specific INFO or FORMAT fields are asked
for, then they will all be returned.  If \code{single_frame == FALSE} and 
\code{info_only == FALSE} (the default), 
the function returns a list with three components: \code{fix}, \code{gt}, and \code{meta} as follows:
\enumerate{
\item \code{fix} A data frame of the fixed information columns and the parsed INFO columns, and 
an additional column, \code{ChromKey}---an integer identifier
for each chromosome (CHROM), ordered by their appearance in the original data frame---that serves
together with POS as a key back to rows in \code{gt}.  
\item \code{gt} A data frame of the genotype-related fields. Column names are the names of the 
FORMAT fields with \code{gt_column_prepend} (by default, "gt_") prepended to them.  Additionally
there are columns \code{ChromKey}, and \code{POS} that can be used to associate
each row in \code{gt} with a row in \code{fix}.
\item \code{meta} The meta-data associated with the columns that were extracted from the INFO and FORMAT
columns in a tbl_df-ed data frame.  
}
This is the default return object because it might be space-inefficient to
return a single tidy data frame if there are many individuals and the CHROM names are
long and/or there are many INFO fields.  However, if
\code{single_frame = TRUE}, then the results are returned as a list with component \code{meta}
as before, but rather than having \code{fix} and \code{gt} as before, both those data frames
have been joined into component \code{dat} and a ChromKey column is not returned, because
the CHROM column is available.

If \code{info_only == TRUE}, then just the fixed columns and the parsed INFO columns are 
returned, and the FORMAT fields are not parsed at all.  The return value is a list with
components \code{fix} and \code{meta}.  No column ChromKey appears.

The following functions are called by \strong{vcfR2tidy} but are documented below because
they may be useful individually.

The function \strong{extract_info_tidy} lets you pass in a vector of the INFO fields that
you want extracted to a long format data frame. If you don't tell it which fields to 
extract it will extract all the INFO columns detailed in the VCF meta section.
The function returns a tbl_df data frame of the INFO fields along with an additional
integer column \code{Key} that associates
each row in the output data frame with each row (i.e. each CHROM-POS combination) 
in the original vcfR object \code{x}.  

The function \strong{extract_gt_tidy} lets you pass in a vector of the FORMAT fields that
you want extracted to a long format data frame. If you don't tell it which fields to 
extract it will extract all the FORMAT columns detailed in the VCF meta section.
The function returns a tbl_df data frame of the FORMAT fields with an additional
integer column \code{Key} that associates
each row in the output data frame with each row (i.e. each CHROM-POS combination)
in the original vcfR object \code{x}, and an additional column \code{Indiv} that gives
the name of the individual.  

The function \strong{vcf_field_names} is a helper function that
parses information from the metadata section of the
VCF file to return a data frame with the \emph{metadata} information about either the INFO 
or FORMAT tags.  It
returns a \code{tbl_df}-ed data frame with column names: "Tag", "ID", "Number", "Type",
"Description", "Source", and "Version".
}
\note{
To run all the examples, you can issue this:
\code{example("vcfR2tidy")}
}
\examples{
# load the data
data("vcfR_test")
vcf <- vcfR_test


# extract all the INFO and FORMAT fields into a list of tidy
# data frames: fix, gt, and meta. Because info_types and format_types
# default to TRUE, columns are coerced to integer or numeric types
# according to the metadata.
Z <- vcfR2tidy(vcf)
names(Z)


# here is the meta data in a table
Z$meta


# here is the fixed info
Z$fix


# here are the GT fields.  Note that ChromKey and POS are keys
# back to Z$fix
Z$gt
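
# A minimal sketch (not part of the original examples) of how the
# ChromKey and POS keys can be used: each genotype row in the gt
# component is joined to its locus in the fix component.  This assumes
# the dplyr package is installed; the object names Z_keys and
# gt_with_fix are made up for illustration.
library(vcfR)
data("vcfR_test")
Z_keys <- vcfR2tidy(vcfR_test)
gt_with_fix <- dplyr::left_join(Z_keys$gt, Z_keys$fix,
                                by = c("ChromKey", "POS"))
# one row per individual per locus, now with the fixed columns attached
dim(gt_with_fix)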


# Note that if you wanted to tidy this data set even further
# you could break up the comma-delimited columns easily
# using tidyr::separate
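
# A hedged sketch of that idea.  It assumes the tidyr package is
# installed and that the data contain a comma-delimited FORMAT field
# named HQ with two values per genotype (as vcfR_test is expected to).
# The new column names gt_HQ_1 and gt_HQ_2 are made up for illustration.
library(vcfR)
data("vcfR_test")
gt_sep <- vcfR2tidy(vcfR_test)$gt
if (requireNamespace("tidyr", quietly = TRUE) &&
    is.element("gt_HQ", names(gt_sep))) {
  # split the single comma-delimited column into two columns
  gt_sep <- tidyr::separate(gt_sep, col = "gt_HQ",
                            into = c("gt_HQ_1", "gt_HQ_2"),
                            sep = ",", convert = TRUE)
}
gt_sep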




# here we put the data into a single, joined data frame (list component
# dat in the returned list) and the meta data.  Let's just pick out a 
# few fields:
vcfR2tidy(vcf, 
          single_frame = TRUE, 
          info_fields = c("AC", "AN", "MQ"), 
          format_fields = c("GT", "PL"))


# note that the "gt_GT_alleles" column is always returned when any
# FORMAT fields are extracted.




# Here we extract a single frame with all fields but we automatically change
# types of the columns according to the entries in the metadata.
vcfR2tidy(vcf, single_frame = TRUE, info_types = TRUE, format_types = TRUE)
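
# With the columns coerced to integer or numeric types, genotypes can be
# filtered on FORMAT fields in combination with INFO fields.  This is a
# sketch only: it assumes the dplyr package is installed and that the
# data have an INFO field DP and a FORMAT field GQ (as vcfR_test is
# expected to).  The thresholds and the names dat, kept, n_pass and
# n_indiv are arbitrary choices for illustration.
library(vcfR)
data("vcfR_test")
dat <- vcfR2tidy(vcfR_test, single_frame = TRUE)$dat
if (all(is.element(c("DP", "gt_GQ"), names(dat)))) {
  # keep genotypes using an INFO field together with a FORMAT field
  kept <- dplyr::filter(dat, DP >= 10, gt_GQ >= 40)
  # count, for each locus, the individuals that pass the filter
  n_pass <- dplyr::count(kept, CHROM, POS, name = "n_indiv")
  print(n_pass)
}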




# note that info_types and format_types both default to TRUE, so
# this call coerces the column types in the same way; set them to
# NULL to leave the extracted fields as character ("chr")
vcfR2tidy(vcf, single_frame = TRUE)





# Below are some examples with the vcfR2tidy "subfunctions"


# extract the AC, AN, and MQ fields from the INFO column into
# a data frame and convert the AN values to integers and the MQ
# values to numerics.
extract_info_tidy(vcf,
                  info_fields = c("AC", "AN", "MQ"),
                  info_types = c(AN = "i", MQ = "n"))

# extract all fields from the INFO column but leave 
# them as character vectors (info_types = NULL)
extract_info_tidy(vcf, info_types = NULL)

# extract all fields from the INFO column and coerce 
# types according to metadata info
extract_info_tidy(vcf, info_types = TRUE)
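
# The Key column indexes rows of the original vcfR object.  A minimal
# sketch, using the fix slot of the vcfR object to look up the CHROM
# and POS that each Key refers to (the name info is illustrative):
library(vcfR)
data("vcfR_test")
info <- extract_info_tidy(vcfR_test)
cbind(vcfR_test@fix[info$Key, c("CHROM", "POS"), drop = FALSE],
      Key = info$Key)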

# get the INFO field metadata in a data frame
vcf_field_names(vcf, tag = "INFO")

# get the FORMAT field metadata in a data frame
vcf_field_names(vcf, tag = "FORMAT")
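
# extract the GT and DP fields from the FORMAT column into a long
# format data frame with the columns Key and Indiv.  This is a sketch
# that assumes those two fields are present in the data (as they are
# expected to be in vcfR_test); the name gt_tidy is illustrative.
library(vcfR)
data("vcfR_test")
gt_tidy <- extract_gt_tidy(vcfR_test,
                           format_fields = c("GT", "DP"),
                           verbose = FALSE)
gt_tidy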



}
\seealso{
\href{https://cran.r-project.org/package=dplyr}{dplyr},
\href{https://cran.r-project.org/package=tidyr}{tidyr}.
}
\author{
Eric C. Anderson <eric.anderson@noaa.gov>
}
