Rank ==== .. figure:: icons/rank.png Ranking of attributes in classification or regression data sets. Signals ------- **Inputs**: - **Data** An input data set. **Outputs**: - **Reduced Data** A data set whith selected attributes. Description ----------- The **Rank** widget considers class-labeled data sets (classification or regression) and scores the attributes according to their correlation with the class. .. figure:: images/Rank-stamped.png 1. Select attributes from the data table. 2. Data table with attributes (rows) and their scores by different scoring methods (columns) 3. Produce a report. 4. If '*Send Automatically*' is ticked, the widget automatically communicates changes to other widgets. Scoring methods --------------- 1. Information Gain: the expected amount of information (reduction of entropy) 2. `Gain Ratio `_: a ratio of the information gain and the attribute's intrinsic information, which reduces the bias towards multivalued features that occurs in information gain 3. `Gini `_: the inequality among values of a frequency distribution 4. `ANOVA `_: the difference between average vaules of the feature in different classes 5. `Chi2 `_: dependence between the feature and the class as measure by the chi-square statistice 6. `ReliefF `_: the ability of an attribute to distinguish between classes on similar data instances 7. `FCBF (Fast Correlation Based Filter) `_: entropy-based measure, which also identifies redundancy due to pairwise correlations between features Example: Attribute Ranking and Selection ---------------------------------------- Below, we have used the **Rank** widget immediately after the :doc:`File<../data/file>` widget to reduce the set of data attributes and include only the most informative ones: .. figure:: images/Rank-Select-Schema.png Notice how the widget outputs a data set that includes only the best-scored attributes: .. figure:: images/Rank-Select-Widgets.png Example: Feature Subset Selection for Machine Learning ------------------------------------------------------ What follows is a bit more complicated example. In the workflow below, we first split the data into a training set and a test set. In the upper branch, the training data passes through the **Rank** widget to select the most informative attributes, while in the lower branch there is no feature selection. Both feature selected and original data sets are passed to their own :doc:`Test & Score <../evaluation/testlearners>` widgets, which develop a *Naive Bayes* classifier and score it on a test set. .. figure:: images/Rank-and-Test.png For data sets with many features, a naive Bayesian classifier feature selection, as shown above, would often yield a better predictive accuracy.