
<!-- README.md is generated from README.Rmd. Please edit that file -->

# ProActive

<!-- badges: start -->
<!-- badges: end -->

**`ProActive` automatically detects regions of gapped and elevated read
coverage using a 2D pattern-matching algorithm. `ProActive` detects,
characterizes and visualizes read coverage patterns in both genomes and
metagenomes. Optionally, users may provide gene annotations associated
with their genome or metagenome in the form of a .gff file. In this
case, `ProActive` will generate an additional output table containing
the gene annotations found within the detected regions of gapped and
elevated read coverage. Additionally, users can search for gene
annotations of interest in the output read coverage plots.**

Visualizing read coverage data is important because gaps and elevations
in coverage can be indicators of a variety of biological and
non-biological scenarios, for example:

- Elevations and gaps in read coverage may be caused by some types of
  structural variants. Deletions can cause gaps while duplications can
  cause elevations in read coverage \[1\].
- Highly active and/or abundant mobile genetic elements, such as
  transposable elements \[2\] and prophages \[3\], can create
  elevations in read coverage at their respective integration sites.
- Genetic regions with high mutation rates and/or high variability
  within the population can generate gaps in read coverage \[4\].
- Poor quality sequencing reads and chimeric reference sequences may
  cause gaps and elevations in read coverage.

**Since the cause of gaps and elevations in read coverage can be
ambiguous, ProActive is best used as a screening method to identify
genetic regions for further investigation with other tools!**

**References:**

1.  Tattini L., D’Aurizio R., & Magi A. (2015). Detection of Genomic
    Structural Variants from Next-Generation Sequencing Data. Frontiers
    in bioengineering and biotechnology, 3, 92.
    <https://doi.org/10.3389/fbioe.2015.00092>
2.  Kleiner M., Bushnell B., Sanderson K.E. et al. (2020)
    Transductomics: sequencing-based detection and analysis of
    transduced DNA in pure cultures and microbial communities.
    Microbiome 8, 158. <https://doi.org/10.1186/s40168-020-00935-5>
3.  Kieft K., Anantharaman K. (2022). Deciphering Active Prophages from
    Metagenomes. mSystems 7:e00084-22.
    <https://doi.org/10.1128/msystems.00084-22>
4.  Fogarty E., Moore R. (2019). Visualizing contig coverages to better
    understand microbial population structure.
    <https://merenlab.org/2019/11/25/visualizing-coverages/>

### Input files

#### Pileup file:

ProActive detects read coverage patterns using a pattern-matching
algorithm that operates on pileup files. In a pileup file, each row
summarizes the ‘pileup’ of reads at a specific genomic location. Pileup
files can be used to generate a rolling mean of read coverage at the
associated base pair positions, which reduces data size while preserving
read coverage patterns. **ProActive requires that input pileup files be
generated using a 100 bp window/bin size.**
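
To illustrate why windowed coverage keeps the overall pattern while
shrinking the data, here is a minimal sketch that averages a toy per-base
coverage vector into 100 bp bins. The contig, the coverage values and the
column names are all made up for illustration, and labelling each bin by
its last base is an arbitrary choice; this is not how ProActive or BBMap
compute their values internally.

``` r
# Toy per-base coverage for a hypothetical 1,000 bp contig:
# baseline coverage of 10 with an elevated stretch from 401 to 600 bp.
perBaseCov <- rep(10, 1000)
perBaseCov[401:600] <- 50

binSize <- 100

# Assign every base to a 100 bp bin (bases 1-100 -> bin 1, 101-200 -> bin 2, ...).
binIndex <- ceiling(seq_along(perBaseCov) / binSize)

# Average the coverage within each bin and label the bin by its last base.
binnedCov <- data.frame(
  position = sort(unique(binIndex)) * binSize,
  coverage = as.numeric(tapply(perBaseCov, binIndex, mean))
)

print(binnedCov)
```

The 1,000 per-base values collapse to 10 rows, and the elevated region is
still visible as two bins with higher mean coverage.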

Pileup files can be generated by mapping sequencing reads to a
metagenome or genome fasta. **Read mapping should be performed using a
high minimum identity (0.97 or higher) and random mapping of ambiguous
reads.** The pileup files needed for ProActive are generated using the
.bam files produced during read mapping. Some read mappers,
like
[BBMap](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/),
can generate pileup files directly from the
[`bbmap.sh`](https://github.com/BioInfoTools/BBMap/blob/master/sh/bbmap.sh)
command by using the `bincov` output with the `covbinsize=100`
parameter. **Otherwise, BBMap’s
[`pileup.sh`](https://github.com/BioInfoTools/BBMap/blob/master/sh/pileup.sh)
can convert .bam files produced by any read mapper to pileup files
compatible with ProActive using the `bincov` output with
`binsize=100`.**
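
As a rough sketch of the two routes described above, the R snippet below
only assembles the command lines and prints them; it does not run BBMap.
All file names are hypothetical placeholders. The `minid=0.97` and
`ambiguous=random` options are our reading of the mapping recommendations
in BBMap terms, so please check them against the BBMap documentation for
your installed version.

``` r
# Hypothetical file names; replace with your own.
reference  <- "assembly.fasta"
reads      <- "reads.fastq.gz"
bamFile    <- "mapped.bam"
pileupFile <- "sample.pileup.txt"

# Option 1: map with BBMap and write the binned coverage in the same command.
bbmapArgs <- c(
  paste0("ref=", reference),
  paste0("in=", reads),
  paste0("out=", bamFile),
  "minid=0.97",          # high minimum identity
  "ambiguous=random",    # place ambiguous reads at random
  paste0("bincov=", pileupFile),
  "covbinsize=100"       # 100 bp bins, as ProActive requires
)
bbmapCmd <- paste("bbmap.sh", paste(bbmapArgs, collapse = " "))

# Option 2: convert a .bam file from any read mapper with pileup.sh.
pileupArgs <- c(
  paste0("in=", bamFile),
  paste0("bincov=", pileupFile),
  "binsize=100"          # 100 bp bins, as ProActive requires
)
pileupCmd <- paste("pileup.sh", paste(pileupArgs, collapse = " "))

cat(bbmapCmd, pileupCmd, sep = "\n")
```

If the BBMap scripts are on your `PATH`, the same argument vectors could be
passed to `system2()`, for example `system2("pileup.sh", args = pileupArgs)`.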

**NOTE:** For detailed information on input file format, please see the
vignette. Users may also use the ‘sampleMetagenomePileup’ and
‘sampleGenomePileup’ files that come pre-loaded with ProActive as a
reference.
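
Before running ProActive on your own data, it may help to confirm that the
pileup really uses 100 bp bins. The sketch below does this on a toy table.
The column names (`contig`, `coverage`, `position`) are hypothetical and
chosen for readability; the vignette and the bundled sample pileups show
the layout ProActive actually expects.

``` r
# Toy pileup-like table with two hypothetical contigs.
toyPileup <- data.frame(
  contig   = rep(c("contig_1", "contig_2"), each = 5),
  coverage = c(12, 15, 14, 40, 38, 9, 0, 0, 11, 10),
  position = rep(seq(100, 500, by = 100), times = 2)
)

# Distance between consecutive positions within each contig.
steps <- unlist(tapply(toyPileup$position, toyPileup$contig, diff))

# TRUE only if every step between neighbouring rows is exactly 100 bp.
usesHundredBpBins <- all(steps == 100)
usesHundredBpBins
```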

#### gffTSV:

ProActive optionally accepts a .gff file as input. The .gff file must be
associated with the same metagenome or genome used to create your pileup
file. The .gff file should be a TSV and should follow the same general
format described
[here](https://en.wikipedia.org/wiki/General_feature_format#:~:text=In%20bioinformatics%2C%20the%20general%20feature,DNA%2C%20RNA%20and%20protein%20sequences.).
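
For orientation, here is a minimal sketch of reading a tab-separated .gff
file into R with base functions. The toy file contents, the column names
and the choice of `read.delim()` options are assumptions for illustration;
the vignette describes how ProActive expects the `gffTSV` input to be
prepared.

``` r
# Write a two-feature toy GFF to a temporary file (tab-separated, nine columns).
gffLines <- c(
  "##gff-version 3",
  paste("contig_1", "toySource", "CDS", "120", "980", ".", "+", "0",
        "ID=gene_1;product=ABC transporter permease", sep = "\t"),
  paste("contig_1", "toySource", "CDS", "1100", "1900", ".", "-", "0",
        "ID=gene_2;product=chemotaxis protein", sep = "\t")
)
gffPath <- tempfile(fileext = ".gff")
writeLines(gffLines, gffPath)

# Read the file as a TSV, skipping '#' header lines and leaving quotes untouched.
toyGff <- read.delim(
  gffPath,
  header = FALSE,
  sep = "\t",
  comment.char = "#",
  quote = "",
  stringsAsFactors = FALSE,
  col.names = c("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")
)

str(toyGff)
```

Setting `quote = ""` keeps apostrophes or quotation marks in product
descriptions from being interpreted as string delimiters.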

## Installation

Install ProActive from CRAN with:

``` r
install.packages("ProActive")
library(ProActive)
```

Install the development version of ProActive from
[GitHub](https://github.com/) with:

``` r
# Install devtools only if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

devtools::install_github("jlmaier12/ProActive")
library(ProActive)
```

## Quick start

``` r
library(ProActive)


## Metagenome mode

MetagenomeProActive <- ProActiveDetect(
  pileup = sampleMetagenomePileup,
  mode = "metagenome",
  gffTSV = sampleMetagenomegffTSV
)
#> Preparing input file for pattern-matching...
#> Starting pattern-matching...
#> A quarter of the way done with pattern-matching
#> Half of the way done with pattern-matching
#> Almost done with pattern-matching!
#> Summarizing pattern-matching results
#> Finding gene predictions in elevated or gapped regions of read coverage...
#> Finalizing output
#> Execution time: 2.09secs
#> 0 contigs were filtered out based on low read coverage
#> 0 contigs were filtered out based on length (< minContigLength)
#> 
#> Elevation       Gap NoPattern 
#>         3         3         1

MetagenomePlots <- plotProActiveResults(pileup = sampleMetagenomePileup,
                                        ProActiveResults = MetagenomeProActive)

MetagenomeGeneMatches <- geneAnnotationSearch(ProActiveResults = MetagenomeProActive, 
                                              pileup = sampleMetagenomePileup, 
                                              gffTSV = sampleMetagenomegffTSV,
                                              geneOrProduct = "product",
                                              keyWords = c("transport", "chemotaxis"))
#> Cleaning gff file...
#> Cleaning pileup file...
#> Searching for matching annotations...
#> 3 contigs/chunks have gene annotations that match one or more of the provided keyWords


## Genome mode

GenomeProActive <- ProActiveDetect(
  pileup = sampleGenomePileup,
  mode = "genome",
  gffTSV = sampleGenomegffTSV
)
#> Preparing input file for pattern-matching...
#> Starting pattern-matching...
#> A quarter of the way done with pattern-matching
#> Half of the way done with pattern-matching
#> Almost done with pattern-matching!
#> Summarizing pattern-matching results
#> Finding gene predictions in elevated or gapped regions of read coverage...
#> Finalizing output
#> Execution time: 29.7secs
#> 0 contigs were filtered out based on low read coverage
#> 0 contigs were filtered out based on length (< minContigLength)
#> 
#> Elevation       Gap NoPattern 
#>        25         3        21

GenomePlots <- plotProActiveResults(pileup = sampleGenomePileup,
                                    ProActiveResults = GenomeProActive)

GenomeGeneMatches <- geneAnnotationSearch(ProActiveResults = GenomeProActive, 
                                          pileup = sampleGenomePileup, 
                                          gffTSV = sampleGenomegffTSV,
                                          geneOrProduct = "product",
                                          keyWords = c("ribosomal"), 
                                          inGapOrElev = TRUE,
                                          bpRange = 5000)
#> Cleaning gff file...
#> Cleaning pileup file...
#> Searching for matching annotations...
#> 8 contigs/chunks have gene annotations that match one or more of the provided keyWords
```
