% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/computeTfIdf.R
\name{computeTfIdf}
\alias{computeTfIdf}
\title{Compute Term Frequency - Inverse Document Frequency on a corpus.}
\usage{
computeTfIdf(channel, tableName, docId, textColumns, parser, top = NULL,
  rankFunction = "rank", idSep = "-", idNull = "(null)",
  adjustDocumentCount = FALSE, where = NULL, stopwords = NULL,
  test = FALSE)
}
\arguments{
\item{channel}{connection object as returned by \code{\link{odbcConnect}}}

\item{tableName}{Aster table name}

\item{docId}{vector with one or more column names that together form a unique document id. 
Values are concatenated with \code{idSep}. Database NULLs are replaced with the
\code{idNull} string.}

\item{textColumns}{one or more names of columns with text. Multiple columns are
concatenated into a single text field first.}

\item{parser}{type of parser to use on text. For example, the \code{nGram(2)} parser
generates 2-grams (ngrams of length 2), while the \code{token(2)} parser generates 2-word 
combinations of terms within documents.}

\item{top}{threshold for cutting off terms ranked below the \code{top} value. If the value
is greater than 0 then only the top ranking terms are included, otherwise all terms are returned 
(also see parameter \code{rankFunction}). Terms are always ordered by their term frequency -
inverse document frequency (tf-idf) within each document. Filtered out terms have a 
rank arithmetically greater than the threshold \code{top} (see details): the smaller the 
rank value, the more important the term.}

\item{rankFunction}{one of \code{rownumber, rank, denserank, percentrank}. A rank is computed and
returned for each term within each document. The function determines which SQL window function computes 
the term rank value (default \code{rank} corresponds to the SQL \code{RANK()} window function). 
When threshold \code{top} is greater than 0 the ranking function is used to limit the number of 
terms returned (see details).}

\item{idSep}{separator when concatenating 2 or more document id columns (see \code{docId}).}

\item{idNull}{string to replace NULL value in document id columns.}

\item{adjustDocumentCount}{logical: if TRUE then the number of documents will be increased by 1.}

\item{where}{criteria that the table rows must satisfy before the computation 
is applied. The criteria are expressed as SQL predicates (as inside a 
\code{WHERE} clause).}

\item{stopwords}{character vector with stop words. Stop words are removed in R after 
the results are computed and returned from Aster.}

\item{test}{logical: if TRUE only show what would be done (similar to parameter \code{test} in \link{RODBC} 
functions \link{sqlQuery} and \link{sqlSave}).}
}
\description{
Compute Term Frequency - Inverse Document Frequency on a corpus.
}
\details{
By default the function computes and returns all terms. When a large number of terms is expected,
use parameter \code{top} to limit the number of terms returned by 
keeping only the top ranked terms for each document. For example, with \code{top=1000} and 
100 documents roughly 100,000 terms (rows) will be returned, provided each document contains 
at least 1000 terms. The result size can exceed this number when a \code{rankFunction} other 
than \code{rownumber} is used, because tied terms share the same rank:
\itemize{
    \item \emph{\code{rownumber}} applies a sequential row number, starting at 1, to each term in a document.
      The tie-breaker behavior is as follows: Rows that compare as equal in the sort order will be
      sorted arbitrarily within the scope of the tie, and all terms will be given unique row numbers.
    \item \emph{\code{rank}} function assigns the current row-count number as the term's rank, provided the 
      term does not sort as equal (tie) with another term. The tie-breaker behavior is as follows: 
      terms that compare as equal in the sort order are sorted arbitrarily within the scope of the tie, 
      and the sorted-as-equal terms get the same rank number.
    \item \emph{\code{denserank}} behaves like the \code{rank} function, except that it never places 
      gaps in the rank sequence. The tie-breaker behavior is the same as that of RANK(), in that 
      the sorted-as-equal terms receive the same rank. With \code{denserank}, however, the next term after 
      the set of equally ranked terms gets a rank 1 higher than preceding tied terms.
    \item \emph{\code{percentrank}} assigns a relative rank to each term, using the formula: 
      \code{(rank - 1) / (total rows - 1)}. The tie-breaker behavior is as follows: Terms that compare 
      as equal are sorted arbitrarily within the scope of the tie, and the sorted-as-equal rows 
      get the same percent rank number.
}
The ordering of the rows is always by their tf-idf value within each document.
}
\examples{
if(interactive()){
# initialize connection to Dallas database in Aster 
conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
                         server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>")

# compute term-document-matrix of all 2-word Ngrams of Dallas police crime reports
# for each 4-digit zip
tdm1 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall", 
                    docId="substr(offensezip, 1, 4)", 
                    textColumns=c("offensedescription", "offensenarrative"),
                    parser=nGram(2, ignoreCase=TRUE, 
                                 punctuation="[-.,?\\\\!:;~()]+"))
                    
# compute term-document-matrix of all 2-word combinations of Dallas police crime reports
# for each type of offense status
tdm2 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall", docId="offensestatus", 
                    textColumns=c("offensedescription", "offensenarrative", "offenseweather"),
                    parser=token(2), 
                    where="offensestatus NOT IN ('System.Xml.XmlElement', 'C')")
                    
# include only top 100 ranked 2-word ngrams for each 4-digit zip into resulting 
# term-document-matrix using rank function  
tdm3 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall", 
                    docId="substr(offensezip, 1, 4)", 
                    textColumns=c("offensedescription", "offensenarrative"),
                    parser=nGram(2), top=100)
                    
# same but get top 10\% ranked terms using percent rank function
tdm4 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall", 
                    docId="substr(offensezip, 1, 4)", 
                    textColumns=c("offensedescription", "offensenarrative"),
                    parser=nGram(1), top=0.10, rankFunction="percentrank")
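
# The two calls below are additional sketches based only on the argument
# descriptions above; the stop word list is made up for illustration.
# Keep the top 100 ranked single-word terms per 4-digit zip prefix, then drop
# stop words. Since stop words are removed in R after the ranked results come
# back from Aster, fewer than 100 terms per document may remain.
tdm5 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall", 
                    docId="substr(offensezip, 1, 4)", 
                    textColumns=c("offensedescription", "offensenarrative"),
                    parser=nGram(1), top=100,
                    stopwords=c("a", "an", "and", "of", "the", "to"))

# Dry run: with test=TRUE the call is expected to only show what would be done
# rather than run it, so this sketch assumes no live connection is required.
dryRun = computeTfIdf(channel=NULL, tableName="public.dallaspoliceall", 
                      docId="substr(offensezip, 1, 4)", 
                      textColumns=c("offensedescription", "offensenarrative"),
                      parser=nGram(2), top=100, test=TRUE)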

}
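
# Local base R sketch of the four ranking schemes described in Details.
# This is an illustration only, not the SQL that computeTfIdf runs in Aster.
# The tf-idf scores are made up for a single hypothetical document, and the
# base R calls are stand-ins chosen to mimic the SQL window functions.
tfidf = c(alpha=0.9, beta=0.7, gamma=0.7, delta=0.2)
n = length(tfidf)

# a higher tf-idf score means a more important term, so rank the negated score
rownumber   = rank(-tfidf, ties.method="first")  # unique numbers, ties broken by position
rnk         = rank(-tfidf, ties.method="min")    # ties share a rank, gap after the tie
denserank   = match(tfidf, sort(unique(tfidf), decreasing=TRUE))  # ties share a rank, no gap
percentrank = (rnk - 1) / (n - 1)                # (rank - 1) / (total rows - 1)

ranks = data.frame(term=names(tfidf), tfidf=unname(tfidf), 
                   rownumber=unname(rownumber), rank=unname(rnk), 
                   denserank=denserank, percentrank=unname(percentrank))
print(ranks)

# number of terms kept for a threshold of 2, that is terms with rank value <= 2:
# rownumber keeps exactly 2 terms, while rank keeps 3 because of the tie
keptByRownumber = sum(rownumber <= 2)
keptByRank      = sum(rnk <= 2)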
}
\seealso{
\code{computeTf}, \code{\link{nGram}}, \code{\link{token}}
}

