Outliers
========

Outlier detection widget.

**Inputs**

- Data: input dataset

**Outputs**

- Outliers: instances scored as outliers
- Inliers: instances not scored as outliers
- Data: input dataset appended *Outlier* variable

The **Outliers** widget applies one of the four methods for outlier detection. All methods apply classification to the dataset. *One-class SVM with non-linear kernels (RBF)* performs well with non-Gaussian distributions, while *Covariance estimator* works only for data with Gaussian distribution. One efficient way to perform outlier detection on moderately high dimensional datasets is to use the *Local Outlier Factor* algorithm. The algorithm computes a score reflecting the degree of abnormality of the observations. It measures the local density deviation of a given data point with respect to its neighbors. Another efficient way of performing outlier detection in high-dimensional datasets is to use random forests (*Isolation Forest*).

![](images/Outliers-stamped.png)

1. Method for outlier detection:
   - [One Class SVM](http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html)
   - [Covariance Estimator](http://scikit-learn.org/stable/modules/generated/sklearn.covariance.EllipticEnvelope.html)
   - [Local Outlier Factor](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html)
   - [Isolation Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)
2. Set parameters for the method:
   - **One class SVM with non-linear kernel (RBF)**: classifies data as similar or different from the core class:
      - *Nu* is a parameter for the upper bound on the fraction of training errors and a lower bound of the fraction of support vectors
      - *Kernel coefficient* is a gamma parameter, which specifies how much influence a single data instance has
    - **Covariance estimator**: fits ellipsis to central points with Mahalanobis distance metric:
      - *Contamination* is the proportion of outliers in the dataset
      - *Support fraction* specifies the proportion of points included in the estimate
   - **Local Outlier Factor**: obtains local density from the k-nearest neighbors:
      - *Contamination* is the proportion of outliers in the dataset
      - *Neighbors* represents number of neighbors
      - *Metric* is the distance measure
   - **Isolation Forest**: isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature:
     - *Contamination* is the proportion of outliers in the dataset
     - *Replicabe training* fixes random seed
3. If *Apply automatically* is ticked, changes will be propagated automatically. Alternatively, click *Apply*.
4. Produce a report.
5. Number of instances on the input, followed by number of instances scored as inliers.


Example
-------

Below is an example of how to use this widget. We used subset (*versicolor* and *virginica* instances) of the *Iris* dataset to detect the outliers. We chose the *Local Outlier Factor* method, with *Euclidean* distance. Then we observed the annotated instances in the [Scatter Plot](../visualize/scatterplot.md) widget. In the next step we used the *setosa* instances to demonstrate novelty detection using [Apply Domain](../data/applydomain.md) widget. After concatenating both outputs we examined the outliers in the *Scatter Plot (1)*.

![](images/Outliers-Example.png)