Outliers
========

Outlier detection widget.

**Inputs**

- Data: input dataset

**Outputs**

- Outliers: instances scored as outliers
- Inliers: instances not scored as outliers
- Data: input dataset with an appended *Outlier* variable

The **Outliers** widget applies one of four methods for outlier detection. Each method classifies the instances as inliers or outliers. *One-class SVM with non-linear kernels (RBF)* performs well with non-Gaussian distributions, while *Covariance estimator* is suitable only for data with a Gaussian distribution. One efficient way to perform outlier detection on moderately high-dimensional datasets is to use the *Local Outlier Factor* algorithm. The algorithm computes a score reflecting the degree of abnormality of each observation by measuring the local density deviation of a data point with respect to its neighbors. Another efficient way of performing outlier detection in high-dimensional datasets is to use random forests (*Isolation Forest*).

![](images/Outliers-stamped.png)

1. Method for outlier detection:
   - [One Class SVM](http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html)
   - [Covariance Estimator](http://scikit-learn.org/stable/modules/generated/sklearn.covariance.EllipticEnvelope.html)
   - [Local Outlier Factor](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html)
   - [Isolation Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)
2. Set parameters for the method:
   - **One class SVM with non-linear kernel (RBF)**: classifies data as similar to or different from the core class:
     - *Nu* is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors
     - *Kernel coefficient* is the gamma parameter, which specifies how much influence a single data instance has
   - **Covariance estimator**: fits an ellipse to the central points using the Mahalanobis distance metric:
     - *Contamination* is the proportion of outliers in the dataset
     - *Support fraction* specifies the proportion of points included in the estimate
   - **Local Outlier Factor**: obtains local density from the k-nearest neighbors:
     - *Contamination* is the proportion of outliers in the dataset
     - *Neighbors* is the number of neighbors
     - *Metric* is the distance measure
   - **Isolation Forest**: isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature:
     - *Contamination* is the proportion of outliers in the dataset
     - *Replicable training* fixes the random seed
3. If *Apply automatically* is ticked, changes will be propagated automatically. Alternatively, click *Apply*.
4. Produce a report.
5. Number of instances on the input, followed by the number of instances scored as inliers.

Example
-------

Below is an example of how to use this widget. We used a subset (*versicolor* and *virginica* instances) of the *Iris* dataset to detect the outliers. We chose the *Local Outlier Factor* method with *Euclidean* distance. Then we observed the annotated instances in the [Scatter Plot](../visualize/scatterplot.md) widget. In the next step we used the *setosa* instances to demonstrate novelty detection using the [Apply Domain](../data/applydomain.md) widget. After concatenating both outputs, we examined the outliers in the *Scatter Plot (1)*.

![](images/Outliers-Example.png)
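
To make the idea behind *Local Outlier Factor* with *Euclidean* distance more concrete, here is a minimal sketch of how such a score can be computed. It is a plain-Python illustration and not the code the widget runs (the widget links to scikit-learn's implementation). The function name `local_outlier_factor`, the toy points, and the choice `k=3` are assumptions made for this example. The sketch also assumes there are no duplicate points and breaks distance ties by taking exactly `k` neighbors.

```python
import math


def local_outlier_factor(points, k=3):
    """Return one LOF score per point (illustrative sketch only).

    Scores near 1 mean a point is about as dense as its neighbors;
    scores well above 1 suggest an outlier.
    Assumes no duplicate points and 1 <= k < len(points).
    """
    n = len(points)
    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability
    # distance from a point to its neighbors.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbors divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical toy data: a tight cluster plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = local_outlier_factor(points, k=3)
for point, score in zip(points, scores):
    print(point, round(score, 2))
```

In this toy data the isolated point `(5, 5)` receives the largest score, while the clustered points stay close to 1. Here `k` plays the role of the *Neighbors* setting and `math.dist` the role of the *Metric* setting.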