.. _library_isolation_forest:

``isolation_forest``
====================

This library implements the Extended Isolation Forest (EIF) algorithm
for anomaly detection as described by Hariri et al. (2019). The Extended
Isolation Forest improves upon the original Isolation Forest algorithm
(Liu et al., 2008) by using random hyperplane cuts instead of
axis-aligned cuts, eliminating bias artifacts in anomaly scores along
coordinate axes.

The algorithm builds an ensemble of isolation trees (iTrees) by
recursively partitioning data using random hyperplanes. Anomalous
points, being few and different from normal points, require fewer
partitions (shorter path lengths) to be isolated. The anomaly score for
an instance is computed based on the average path length across all
trees in the forest.
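
The branching test behind each partition, ``(x - p) * n =< 0`` for a
point ``x``, intercept ``p``, and normal vector ``n``, can be sketched in
a few lines of Logtalk. This is only an illustration, not the library's
source code: the ``hyperplane_sketch`` object, the ``side/4`` predicate,
and the ``left``/``right`` labels are hypothetical names, and points,
intercepts, and normal vectors are assumed to be lists of numbers of
equal length.

::

   :- object(hyperplane_sketch).

       :- public(side/4).

       % side(+Point, +Intercept, +Normal, -Side)
       % a point goes left when (Point - Intercept) . Normal =< 0
       side(Point, Intercept, Normal, Side) :-
           dot(Point, Intercept, Normal, 0.0, Dot),
           (   Dot =< 0.0 ->
               Side = left
           ;   Side = right
           ).

       % accumulate the dot product of (Point - Intercept) and Normal
       dot([], [], [], Dot, Dot).
       dot([X| Xs], [P| Ps], [N| Ns], Dot0, Dot) :-
           Dot1 is Dot0 + (X - P) * N,
           dot(Xs, Ps, Ns, Dot1, Dot).

   :- end_object.

A normal vector with a single non-zero component, such as
``[1.0, 0.0]``, yields the axis-aligned cut of the original algorithm.
For example, the following query binds ``Side`` to ``left``:

::

   | ?- hyperplane_sketch::side([0.5, 0.5], [1.0, 1.0], [1.0, 0.0], Side).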

Datasets are represented as objects implementing the
``dataset_protocol`` protocol from the ``classifier_protocols`` library.
See the ``test_datasets`` directory for examples.
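
As a rough sketch of what such a dataset object might look like, the
following assumes that ``dataset_protocol`` declares the
``attribute_values/2``, ``class/1``, ``class_values/1``, and
``example/3`` predicates, and that continuous attributes are declared
using the atom ``continuous``. The ``tiny_points`` object and its data
are made up for illustration; check the protocol API documentation and
the ``test_datasets`` directory for the authoritative form.

::

   :- object(tiny_points,
       implements(dataset_protocol)).

       % two continuous attributes (assumed declaration form)
       attribute_values(x, continuous).
       attribute_values(y, continuous).

       class(label).
       class_values([normal, anomaly]).

       % example(Id, Class, AttributeValuePairs)
       % the last example has a missing value for the y attribute
       example(1, normal,  [x-0.10, y-0.20]).
       example(2, normal,  [x-0.15, y-0.18]).
       example(3, anomaly, [x-4.50, y-4.20]).
       example(4, normal,  [x-0.12, y-_]).

   :- end_object.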

API documentation
-----------------

Open the
`../../apis/library_index.html#isolation-forest <../../apis/library_index.html#isolation-forest>`__
link in a web browser.

Loading
-------

To load all entities in this library, load the ``loader.lgt`` file:

::

   | ?- logtalk_load(isolation_forest(loader)).

Testing
-------

To test this library's predicates, load the ``tester.lgt`` file:

::

   | ?- logtalk_load(isolation_forest(tester)).

Implemented features
--------------------

- Extended Isolation Forest with random hyperplane cuts: splits are
  defined by random normal vectors and intercept points drawn from the
  data range, producing ``(x - p) * n =< 0`` partitions that generalize
  to arbitrary orientations
- Configurable extension level: level 0 corresponds to the original
  axis-aligned Isolation Forest; higher levels allow increasingly
  general hyperplane orientations, up to ``d - 1`` (the default), which
  uses fully extended random hyperplanes, where ``d`` is the number of
  dimensions
- Anomaly score computation following Liu et al. (2008):
  ``s(x) = 2^(-E(h(x)) / c(psi))`` where ``E(h(x))`` is the average path
  length across all trees, ``c(psi)`` is the average path length of
  unsuccessful searches in a binary search tree (BST), and ``psi`` is
  the subsample size
- Handling of both continuous (numeric) and discrete (categorical)
  attributes: discrete attributes are mapped to numeric indices based on
  their position in the attribute value list declared by the dataset
- Handling of missing attribute values (represented using anonymous
  variables): during tree construction, missing values are replaced with
  random values drawn from the observed range of the corresponding
  attribute; during scoring, missing dimensions are excluded from the
  hyperplane dot product computation so that routing decisions at each
  tree node are based entirely on the known attribute values
- Configurable parameters via options:

  - ``number_of_trees/1`` (default: ``100``): number of isolation trees
  - ``subsample_size/1`` (default: ``256`` or number of instances if
    smaller): subsample size for each tree
  - ``extension_level/1`` (default: ``d - 1``): controls how freely the
    random hyperplanes are oriented, from ``0`` (axis-aligned cuts) to
    ``d - 1`` (fully extended)
  - ``anomaly_threshold/1`` (default: ``0.5``): anomaly score threshold
    used to separate anomalies from normal instances in predictions

- Scoring all dataset instances with results sorted by descending
  anomaly score for easy identification of top anomalies
- Pretty-printing of learned models with tree depth and node count
  summaries
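
The anomaly score formula listed above can be sketched directly. In Liu
et al. (2008), ``c(psi) = 2 H(psi - 1) - 2 (psi - 1) / psi`` for
``psi > 2``, with ``H(i)`` approximated by ``ln(i) + 0.5772156649`` (the
Euler-Mascheroni constant), ``c(2) = 1``, and ``c(psi) = 0`` otherwise.
The ``score_sketch`` object and its predicate names are hypothetical and
only illustrate the arithmetic; the library's internal implementation
may differ.

::

   :- object(score_sketch).

       :- public([average_path_length/2, anomaly_score/3]).

       % average_path_length(+Psi, -C)
       % C is the average path length of an unsuccessful BST search
       average_path_length(Psi, C) :-
           (   Psi > 2 ->
               H is log(Psi - 1) + 0.5772156649,
               C is 2.0 * H - 2.0 * (Psi - 1) / Psi
           ;   Psi =:= 2 ->
               C = 1.0
           ;   C = 0.0
           ).

       % anomaly_score(+AveragePathLength, +Psi, -Score)
       % Score is 2^(-E(h(x)) / c(psi)); fails when c(psi) is zero
       anomaly_score(AveragePathLength, Psi, Score) :-
           average_path_length(Psi, C),
           C > 0.0,
           Score is 2.0 ** (-(AveragePathLength / C)).

   :- end_object.

For the default subsample size of ``256``, ``c(psi)`` is approximately
``10.24``. An instance whose average path length equals ``c(psi)``
scores exactly ``0.5``; shorter average path lengths push the score
toward ``1``.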

Limitations
-----------

- No incremental learning (the forest must be rebuilt from scratch when
  new examples are added)
- No streaming or online variant

References
----------

- Liu, F.T., Ting, K.M. and Zhou, Z.-H. (2008). Isolation Forest.
  *Proceedings of the 2008 Eighth IEEE International Conference on Data
  Mining*, 413-422. https://doi.org/10.1109/ICDM.2008.17

- Hariri, S., Kind, M.C. and Brunner, R.J. (2019). Extended Isolation
  Forest. *IEEE Transactions on Knowledge and Data Engineering*, 33(4),
  1479-1489. https://doi.org/10.1109/TKDE.2019.2947676

Usage
-----

To learn an isolation forest model from a dataset with default options:

::

   | ?- isolation_forest::learn(gaussian_anomalies, Model).

To learn with custom options:

::

   | ?- isolation_forest::learn(gaussian_anomalies, Model, [
            number_of_trees(200),
            subsample_size(128),
            extension_level(1),
            anomaly_threshold(0.6)
        ]).

To compute the anomaly score for a new instance:

::

   | ?- isolation_forest::learn(gaussian_anomalies, Model),
        isolation_forest::score(Model, [x-0.12, y-0.34], Score).
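
Because missing attribute values are represented using anonymous
variables, an instance with an unknown value can be scored the same way.
A sketch, assuming the same ``x`` and ``y`` attributes as in the query
above:

::

   | ?- isolation_forest::learn(gaussian_anomalies, Model),
        isolation_forest::score(Model, [x-_, y-0.34], Score).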

To predict whether an instance is an anomaly or normal:

::

   | ?- isolation_forest::learn(gaussian_anomalies, Model),
        isolation_forest::predict(Model, [x-4.50, y-4.20], Prediction).

To compute and rank anomaly scores for all instances in a dataset:

::

   | ?- isolation_forest::learn(gaussian_anomalies, Model),
        isolation_forest::score_all(gaussian_anomalies, Model, Scores).

The ``Scores`` list contains ``Id-Class-Score`` triples sorted by
descending anomaly score. This makes it easy to inspect top anomalies:

::

   | ?- isolation_forest::learn(gaussian_anomalies, Model),
        isolation_forest::score_all(gaussian_anomalies, Model, [Top1, Top2, Top3| _]).
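
When only instances above a given score matter, the sorted list can be
filtered at the top-level. A sketch, assuming that the backend Prolog
system provides the ``member/2`` and ``forall/2`` predicates and using
``0.6`` as an arbitrary cutoff:

::

   | ?- isolation_forest::learn(gaussian_anomalies, Model),
        isolation_forest::score_all(gaussian_anomalies, Model, Scores),
        forall(
            (member(Id-Class-Score, Scores), Score > 0.6),
            (write(Id-Class-Score), nl)
        ).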

To print a summary of the learned model:

::

   | ?- isolation_forest::learn(gaussian_anomalies, Model),
        isolation_forest::print_model(Model).

To use the original (non-extended) Isolation Forest, set the extension
level to 0:

::

   | ?- isolation_forest::learn(gaussian_anomalies, Model, [extension_level(0)]).
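
To see the effect of the extension level on a given instance, both
variants can be learned and queried side by side. A sketch reusing only
the predicates shown above; as each forest is built from random
subsamples and random cuts, the two scores are expected to vary between
runs:

::

   | ?- isolation_forest::learn(gaussian_anomalies, Standard, [extension_level(0)]),
        isolation_forest::learn(gaussian_anomalies, Extended, [extension_level(1)]),
        isolation_forest::score(Standard, [x-4.50, y-4.20], StandardScore),
        isolation_forest::score(Extended, [x-4.50, y-4.20], ExtendedScore).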
