OutlierDetection.jl

February 8, 2025

OutlierDetection.jl is a Julia toolkit for detecting outlying objects, also known as anomalies. This package is an effort to make Julia a first-class citizen in the Outlier- and Anomaly-Detection community. Why should you use this package?

  • Provides a unified API for outlier detection in Julia
  • Provides access to state-of-the-art outlier detection algorithms
  • Seamlessly integrates with Julia's existing machine learning ecosystem

Citing

If you use OutlierDetection.jl in a scientific publication, we appreciate citations to:

@article{muhr2022outlierdetection,
  title={OutlierDetection.jl: A modular outlier detection ecosystem for the Julia programming language},
  author={Muhr, David and Affenzeller, Michael and Blaom, Anthony D},
  journal={arXiv preprint arXiv:2211.04550},
  year={2022}
}

or

Muhr, David, Michael Affenzeller, and Anthony D. Blaom. "OutlierDetection.jl: A modular outlier detection ecosystem for the Julia programming language." arXiv preprint arXiv:2211.04550 (2022).

Installation

We recommend installing with Pkg.jl. Run the commands below to install the latest official release, or use ] add OutlierDetection in the Julia REPL.

import Pkg
Pkg.add("OutlierDetection")

If you would like to modify the package locally, you can use Pkg.develop("OutlierDetection") or ] dev OutlierDetection in the Julia REPL. This fetches a full clone of the package to ~/.julia/dev/ (the path can be changed by setting the environment variable JULIA_PKG_DEVDIR).

Usage

OutlierDetection.jl is built on top of MLJ and provides many Detector implementations for MLJ. A Detector assigns a real-valued score to each sample; higher scores indicate greater outlierness. The detectors live in sub-packages of OutlierDetectionJL, e.g. OutlierDetectionNeighbors, and can be loaded directly with MLJ, as shown below.

using MLJ
using OutlierDetection
using OutlierDetectionData: ODDS

# download and open the thyroid benchmark dataset
X, y = ODDS.load("thyroid")

# split the row indices in half; only `test` is used below, because the
# detector is fit on all data
train, test = partition(eachindex(y), 0.5, shuffle=true)

# load the detector
KNN = @iload KNNDetector pkg=OutlierDetectionNeighbors

# instantiate a detector with default parameters, returning scores
knn = KNN()

# bind the detector to data and learn a model with all data
knn_raw = machine(knn, X) |> fit!

# transform data to raw outlier scores based on the test data; note that there
# is no `predict` defined for raw detectors
transform(knn_raw, rows=test)

# OutlierDetection.jl provides helper functions to normalize the scores,
# for example using min-max scaling based on the training scores
knn_probas = machine(ProbabilisticDetector(knn), X) |> fit!

# predict outlier probabilities based on the test data
predict(knn_probas, rows=test)

# OutlierDetection.jl also provides helper functions to turn scores into classes,
# for example by imposing a threshold based on the training data percentiles
knn_classifier = machine(DeterministicDetector(knn), X) |> fit!

# predict outlier classes based on the test data
predict(knn_classifier, rows=test)
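
To make the notion of a raw score concrete, the following is a minimal sketch in plain Julia of one scoring rule, the distance to the k-th nearest neighbor. It is not the code behind KNNDetector: the function name kth_neighbor_scores and the toy data are made up for illustration, and the brute-force search is only suitable for small inputs.

```julia
using LinearAlgebra: norm

# Hypothetical helper (not part of OutlierDetection.jl): score each column of
# `Xtest` by its Euclidean distance to the k-th nearest column of `Xtrain`.
# Data is column-major, i.e. one observation per column.
function kth_neighbor_scores(Xtrain::AbstractMatrix, Xtest::AbstractMatrix, k::Integer)
    1 <= k <= size(Xtrain, 2) || throw(ArgumentError("k must be in 1:size(Xtrain, 2)"))
    map(eachcol(Xtest)) do x
        dists = [norm(x - t) for t in eachcol(Xtrain)]
        partialsort!(dists, k)  # k-th smallest distance
    end
end

# four points near the origin serve as training data
Xtrain = [0.0 0.1 0.0 0.1;
          0.0 0.0 0.1 0.1]

# one test point inside the cluster, one far away from it
Xtest = [0.05 5.0;
         0.05 5.0]

scores = kth_neighbor_scores(Xtrain, Xtest, 2)
```

The far-away point receives a much larger score than the point inside the cluster, which is the property a raw detector score is meant to have.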

It is also possible to use OutlierDetection.jl without MLJ; note, however, that more explicit steps are necessary.

using OutlierDetection: fit, transform, scale_minmax, classify_quantile, outlier_fraction
using OutlierDetectionNeighbors: KNNDetector # explicitly import detector
using OutlierDetectionData: ODDS

X, y = ODDS.load("thyroid")
knn = KNNDetector()

# explicit conversion to a native array is necessary
# note that we are using the transposed data, because column-major data is expected
Xmatrix = Matrix(X)'

# explicit fit result and training scores
model, scores_train = fit(knn, Xmatrix[:, 11:end]; verbosity = 0)

# transform the first 10 points to scores (not used for training)
scores_test = transform(knn, model, Xmatrix[:, 1:10])

# explicitly normalize train and test scores
proba_train, proba_test = scale_minmax((scores_train, scores_test))

# explicitly convert scores to labels (> 95th percentile would be an outlier)
labels_train, labels_test = classify_quantile(0.95)((scores_train, scores_test))
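
The two post-processing steps above, min-max scaling and quantile thresholding, can be sketched in plain Julia as follows. This only illustrates the idea under stated assumptions and is not the package's implementation: the helper names minmax_by_train and label_by_train_quantile are hypothetical, clamping test scores to [0, 1] is a choice made for this sketch, and the label strings are chosen for illustration. The sketch also assumes the training scores are not all identical.

```julia
using Statistics: quantile

# Hypothetical helpers illustrating the idea; not OutlierDetection.jl's code.

# Min-max scaling with the minimum and maximum learned from the training
# scores; test scores outside that range are clamped to [0, 1].
function minmax_by_train(train_scores, test_scores)
    lo, hi = extrema(train_scores)
    rescale(s) = clamp((s - lo) / (hi - lo), 0.0, 1.0)
    return rescale.(train_scores), rescale.(test_scores)
end

# Threshold at a quantile of the training scores; everything above the
# threshold is labeled "outlier", everything else "normal".
function label_by_train_quantile(train_scores, test_scores, q)
    threshold = quantile(train_scores, q)
    label(s) = s > threshold ? "outlier" : "normal"
    return label.(train_scores), label.(test_scores)
end

train_scores = [0.1, 0.2, 0.3, 0.4, 2.0]
test_scores = [0.25, 3.0]

scaled_train, scaled_test = minmax_by_train(train_scores, test_scores)
classes_train, classes_test = label_by_train_quantile(train_scores, test_scores, 0.75)
```

Both helpers learn their parameters (minimum, maximum, threshold) from the training scores only and then apply them unchanged to the test scores.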

Algorithms (also known as Detectors)

Algorithms marked with '✓' are implemented in Julia. Algorithms marked with '✓ (py)' are implemented in Python (thanks to the wonderful PyOD library) with an existing Julia interface through PyCall. If you would like to know more, open the detector reference.

| Name    | Description                                   | Year | Status | Authors               |
|---------|-----------------------------------------------|------|--------|-----------------------|
| CD      | Cook's distance                               | 1977 | ✓ (py) | R. Dennis Cook        |
| LMDD    | Linear deviation-based outlier detection      | 1996 | ✓ (py) | Arning et al.         |
| KNN     | Distance-based outliers                       | 1997 | ✓      | Knorr and Ng          |
| MCD     | Minimum covariance determinant                | 1999 | ✓ (py) | Rousseeuw and Driessen |
| KNN     | Distance to the k-th nearest neighbor         | 2000 | ✓      | Ramaswamy             |
| LOF     | Local outlier factor                          | 2000 | ✓      | Breunig et al.        |
| OCSVM   | One-class support vector machine              | 2001 | ✓ (py) | Schölkopf et al.      |
| KNN     | Sum of distances to the k-nearest neighbors   | 2002 | ✓      | Angiulli and Pizzuti  |
| COF     | Connectivity-based outlier factor             | 2002 | ✓      | Tang et al.           |
| LOCI    | Local correlation integral                    | 2003 | ✓ (py) | Papadimitriou et al.  |
| CBLOF   | Cluster-based local outliers                  | 2003 | ✓ (py) | He et al.             |
| PCA     | Principal component analysis                  | 2003 | ✓ (py) | Shyu et al.           |
| KDE     | Kernel density estimation                     | 2007 | ✓ (py) | Latecki et al.        |
| IForest | Isolation forest                              | 2008 | ✓ (py) | Liu et al.            |
| ABOD    | Angle-based outlier detection                 | 2009 | ✓      | Kriegel et al.        |
| SOD     | Subspace outlier detection                    | 2009 | ✓ (py) | Kriegel et al.        |
| HBOS    | Histogram-based outlier score                 | 2012 | ✓ (py) | Goldstein and Dengel  |
| SOS     | Stochastic outlier selection                  | 2012 | ✓ (py) | Janssens et al.       |
| AE      | Auto-encoder reconstruction loss outliers     | 2015 | ✓      | Aggarwal              |
| ABOD    | Stable angle-based outlier detection          | 2015 | ✓      | Li et al.             |
| GMM     | Gaussian mixture model                        | 2015 | ✓ (py) | Aggarwal and Sathe    |
| LODA    | Lightweight on-line detector of anomalies     | 2016 | ✓ (py) | Pevný                 |
| INNE    | Isolation-based nearest neighbors             | 2018 | ✓ (py) | Bandaragoda et al.    |
| DeepSAD | Deep semi-supervised anomaly detection        | 2019 | ✓      | Ruff et al.           |
| COPOD   | Copula-based outlier detection                | 2020 | ✓ (py) | Li et al.             |
| ROD     | Rotation-based outlier detection              | 2020 | ✓ (py) | Almardeny et al.      |
| ESAD    | End-to-end semi-supervised anomaly detection  | 2020 | ✓      | Huang et al.          |
| ECOD    | Empirical cumulative distribution functions   | 2022 | ✓ (py) | Li et al.             |

With so many algorithms already available in Python, why Julia, you might ask? Let's have some fun!

using OutlierDetection, MLJ
using BenchmarkTools: @benchmark
X = rand(10, 100000)
LOF = @iload LOFDetector pkg=OutlierDetectionNeighbors
PyLOF = @iload LOFDetector pkg=OutlierDetectionPython
lof = machine(LOF(k=5, algorithm=:kdtree, leafsize=30, parallel=true), X) |> fit!
pylof = machine(PyLOF(n_neighbors=5, algorithm="kd_tree", leaf_size=30, n_jobs=-1), X) |> fit!

Julia enables you to implement your favorite algorithm in no time, and it will be fast, blazingly fast.

@benchmark transform(lof, X)
> median time:      341.464 ms (0.00% GC)

Interoperating with Python is easy!

@benchmark transform(pylof, X)
> median time:      7.934 s (0.00% GC)
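
To give a feel for how compact such an algorithm can be in Julia, here is a brute-force sketch of the local outlier factor. It is not the LOFDetector benchmarked above: the function name lof_scores is hypothetical, the sketch uses exactly k neighbors per point and ignores distance ties, and it builds a full pairwise distance matrix, so it is only meant for small inputs.

```julia
using LinearAlgebra: norm
using Statistics: mean

# Hypothetical helper (not the package's LOFDetector): brute-force local outlier
# factor for the columns of `X`, using exactly `k` neighbors per point.
function lof_scores(X::AbstractMatrix, k::Integer)
    n = size(X, 2)
    1 <= k < n || throw(ArgumentError("k must satisfy 1 <= k < size(X, 2)"))
    # pairwise Euclidean distances between all columns
    D = [norm(X[:, i] - X[:, j]) for i in 1:n, j in 1:n]
    # indices of the k nearest neighbors of each point, excluding the point itself
    neighbors = [partialsortperm([i == j ? Inf : D[i, j] for j in 1:n], 1:k) for i in 1:n]
    # distance from each point to its k-th nearest neighbor
    kdist = [D[i, neighbors[i][end]] for i in 1:n]
    # local reachability density: inverse of the mean reachability distance
    lrd = [1 / mean(max(kdist[o], D[i, o]) for o in neighbors[i]) for i in 1:n]
    # LOF: mean ratio of the neighbors' density to the point's own density
    return [mean(lrd[o] / lrd[i] for o in neighbors[i]) for i in 1:n]
end

# four corners of a unit square plus one distant point (one observation per column)
Xtoy = [0.0 1.0 0.0 1.0 10.0;
        0.0 0.0 1.0 1.0 10.0]

lof = lof_scores(Xtoy, 2)
```

Points inside the square get a factor of about 1, while the distant point gets a much larger factor, because its neighbors are far denser than it is.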

Contributing

OutlierDetection.jl is a community effort and your help is extremely welcome! See our contribution guide for more information on how to contribute to the project.

Contributors ✨

Thanks go to these wonderful people (emoji key):

  • David Muhr: 💻 ⚠️ 📖 🚧
  • Páll Haraldsson: 📖
  • Anthony Blaom, PhD: 💻
  • Pietro Monticone: 📖
  • Petr Mukhachev: 📖
  • Tyler Thomas: 💻

This project follows the all-contributors specification. Contributions of any kind welcome!