% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/feamiR_runningpython.R
\name{preparedataset}
\alias{preparedataset}
\title{Dataset preparation
Performs all preparation needed for feamiR analysis: takes a set of mRNAs, a set of miRNAs and an interaction dataset, and creates the corresponding positive and negative datasets for ML modelling.}
\usage{
preparedataset(
  pythonversion = "python",
  mRNA_3pUTR = "",
  miRNA_full = "",
  interactions = "",
  annotations = "",
  fullchromosomes = "",
  seed = 1,
  nonseed_miRNA = 0,
  flankingmRNA = 0,
  UTR_output = "",
  chr = "",
  o = "feamiR_",
  positiveset = "",
  negativeset = "",
  sreformatpath = "sreformat",
  patmanpath = "patman",
  patmanoutput = "",
  minvalidationentries = 40,
  num_runs = 100,
  check_python = TRUE
)
}
\arguments{
\item{pythonversion}{File path of the installed Python executable to use (default: python)}

\item{mRNA_3pUTR}{Fasta file of only 3'UTRs, with gene name as name attribute (e.g. Serpinb8)}

\item{miRNA_full}{Fasta file of full mature miRNA hairpins, with miRNA ID as name attribute (e.g. hsa-miR-576-3p)}

\item{interactions}{CSV file containing only validated interactions between miRNA and mRNA (e.g. from miRTarBase). Must have columns miRNA (e.g. hsa-miR-576-3p), Target Gene (e.g. Serpinb8) and optionally Experiments (e.g. qRT-PCR) and/or Support Type (with values Functional MTI, Functional MTI (Weak), Non-Functional MTI, Non-Functional MTI (Weak))}

\item{annotations}{GTF file (e.g. from Ensembl) with attributes seqname (chromosome), feature (with 3'UTRs labelled exactly 'three_prime_utr'), transcript_id, gene_id and gene_name matching fullchromosomes and interactions}

\item{fullchromosomes}{Fasta file (e.g. top level file from Ensembl) containing full sequence for each chromosome with name as chromosome (e.g. 1, matching seqname from annotations)}

\item{seed}{Binary, 1 if full miRNA seed features should be included in statistical analysis. Default: 1.}

\item{nonseed_miRNA}{Binary, 1 if full miRNA features should be included in statistical analysis. Seed features are always included. Default: 0.}

\item{flankingmRNA}{Binary, 1 if flanking region mRNA features should be included in statistical analysis. Seed features are always included. Default: 0.}

\item{UTR_output}{String. File name under which the 3'UTR fasta file should be saved (used when the annotations and fullchromosomes files are supplied)}

\item{chr}{Number of chromosomes for species in question.}

\item{o}{Output prefix for any files created and saved.}

\item{positiveset}{CSV file containing validated pairs of miRNAs and mRNAs, as output by the initial stage of analysis. If both positiveset and negativeset are supplied, analysis begins at the final statistical analysis stage.}

\item{negativeset}{CSV file containing non-validated pairs of miRNAs and mRNAs, as output by the initial stage of analysis. If both positiveset and negativeset are supplied, analysis begins at the final statistical analysis stage.}

\item{sreformatpath}{File path for installed sreformat (default: sreformat)}

\item{patmanpath}{File path for installed patman (default: patman)}

\item{patmanoutput}{TXT file containing patman output (saved by the function as the output prefix o followed by patman_seed.txt). If supplied, analysis begins at the patman output processing stage.}

\item{minvalidationentries}{Minimum number of entries for a validation category to be considered separately in statistical analysis (default: 40)}

\item{num_runs}{Number of subsamples to create (default: 100)}

\item{check_python}{Whether the Python version should be checked (default: TRUE)}
}
\value{
CSV containing the full positive and negative sets. A folder statistical_analysis of heatmaps showing the significance of various features under Fisher exact and Chi-squared tests. Seed analysis is always run; full miRNA and flanking analyses are run if the respective parameters are set to 1. A folder subsamples containing CSVs for num_runs subsamples (100 by default), each with equal numbers of positive and negative samples, for use in classifiers and feature selection.
}
\description{
PLEASE NOTE:
This analysis is run in Python, so Python must be installed and its location specified if it is not on PATH.
Both sreformat and PaTMaN must also be installed and their paths specified if they are not on PATH.
Python >= 3.6 is required to use the necessary packages.
The Python component requires the following libraries: os, Bio, gtfparse, pandas, numpy, math, scipy.stats, matplotlib.pyplot, seaborn, statistics, logging. Please ensure these are installed for the version of Python you supply.
}
\details{
The function saves various intermediate files (named using the output prefix o). To start preparation from one of these intermediate files, supply it through the corresponding argument and preparation will skip to that point. This should only be done with files output by the function.
}
\examples{
preparedataset(
   pythonversion = Sys.which('python'),
   positiveset = system.file('samples', 'test_seed_positive.csv', package = 'feamiR'),
   negativeset = system.file('samples', 'test_seed_negative.csv', package = 'feamiR'),
   o = 'examples_',
   num_runs = 0,
   check_python = FALSE)
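
# A minimal sketch of a run that starts from raw inputs rather than from
# precomputed positive and negative sets. The three input file names and the
# output prefix below are hypothetical placeholders, not files shipped with
# the package. The call assumes sreformat and patman are on PATH, so their
# default paths are used. It is wrapped in dontrun because it needs those
# external tools and real input data.
\dontrun{
preparedataset(
   pythonversion = Sys.which('python'),
   mRNA_3pUTR = 'utr3.fa',              # hypothetical 3'UTR fasta file
   miRNA_full = 'mature_miRNA.fa',      # hypothetical miRNA fasta file
   interactions = 'interactions.csv',   # hypothetical validated interactions
   o = 'fullrun_',                      # hypothetical output prefix
   num_runs = 100)
}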
}
