.. currentmodule:: Orange.data.storage
##########################
Data Storage (``storage``)
##########################
:obj:`Orange.data.storage.Storage` is an abstract class representing a data object
in which rows represent data instances (examples, in machine learning
terminology) and columns represent variables (features, attributes, classes,
targets, meta attributes).
Data is divided into three parts that represent independent variables (`X`),
dependent variables (`Y`) and meta data (`metas`). If practical, the class
should expose those parts as properties. In the associated domain
(:obj:`Orange.data.Domain`), the three parts correspond to lists of variable
descriptors `attributes`, `class_vars` and `metas`.
Any of those parts may be missing, dense, sparse or sparse boolean. The
difference between the latter two is that sparse data can be seen as a list
of pairs (variable, value), while in sparse boolean data the variable (item) is
either present or absent, as in market basket analysis. The actual storage of
sparse data depends upon the storage type.
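The distinction between the dense, sparse and sparse boolean views can be sketched in plain Python. The lists and sets below are illustrative stand-ins only; they are an assumption made for this example and do not reflect how any Orange storage actually lays out its data.

```python
# Illustrative only: plain-Python stand-ins, not Orange's actual layout.

# One dense row: a value for every variable, in domain order.
dense_row = [0.0, 3.5, 0.0, 0.0, 1.0]

# Sparse: keep only the non-zero entries as (variable index, value) pairs.
sparse_row = [(i, v) for i, v in enumerate(dense_row) if v != 0]

# Sparse boolean: only presence matters, so a set of item indices suffices
# (as in market basket analysis, where an item is in the basket or it is not).
sparse_bool_row = {i for i, v in enumerate(dense_row) if v != 0}


def to_dense(pairs, n_variables):
    """Rebuild a dense row from (variable index, value) pairs."""
    row = [0.0] * n_variables
    for index, value in pairs:
        row[index] = value
    return row
```

The sparse boolean form drops the values entirely, which is why it only suits data where every stored value would be 1.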
There is no uniform constructor signature: every derived class provides one or
more specific constructors.
There are currently two derived classes, :obj:`Orange.data.Table` and
:obj:`Orange.data.sql.Table`; the former stores the data in memory, in numpy
objects, and the latter in SQL (currently, only PostgreSQL is supported).
Derived classes must implement at least the methods for getting rows and the
number of instances (`__getitem__` and `__len__`). To be fast enough
to be practically useful, a storage must also reimplement a number of filters,
preprocessors and aggregators. For instance, the method
`_filter_values(self, filter)` returns a new storage which contains only the
rows that match the criteria given in the filter. :obj:`Orange.data.Table`
implements an efficient method based on numpy indexing, and
:obj:`Orange.data.sql.Table`, which "stores" a table as an SQL query, converts
the filter into a WHERE clause.
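The two strategies can be contrasted with a pair of toy classes. This is a rough sketch under stated assumptions: `ListStorage` and `QueryStorage` are hypothetical names, they do not derive from `Storage`, and the "filter" is reduced to a plain callable or an SQL condition string, which is far simpler than Orange's filter objects.

```python
# Hypothetical toy classes; they do not derive from Orange.data.storage.Storage.

class ListStorage:
    """In-memory storage: rows are kept in a Python list."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return ListStorage(self.rows[index])
        return self.rows[index]

    def _filter_values(self, filter):
        # Here a "filter" is simply a callable taking a row; Orange's filter
        # objects are richer than this.
        return ListStorage(row for row in self.rows if filter(row))


class QueryStorage:
    """Query-backed storage: the table is described by a query, not by rows."""

    def __init__(self, table, conditions=()):
        self.table = table
        self.conditions = tuple(conditions)

    def _filter_values(self, condition):
        # Filtering does not touch any data; it only extends the WHERE clause.
        return QueryStorage(self.table, self.conditions + (condition,))

    def sql(self):
        query = f"SELECT * FROM {self.table}"
        if self.conditions:
            query += " WHERE " + " AND ".join(f"({c})" for c in self.conditions)
        return query
```

In both cases filtering returns a new storage and leaves the original untouched; the query-backed variant defers all work until the query is eventually executed.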
.. attribute:: domain (:obj:`Orange.data.Domain`)
The domain describing the columns of the data
Data access
-----------
.. method:: __getitem__(self, index)
Return one or more rows of the data.
- If the index is an int, e.g. `data[7]`, the corresponding row is
returned as an instance of :obj:`~Orange.data.instance.Instance`. Concrete
implementations of `Storage` use specific derived classes for instances.
- If the index is a slice or a sequence of ints (e.g. `data[7:10]` or
`data[[7, 42, 15]]`), indexing returns a new storage with the selected
rows.
- If there are two indices, where the first is an int (a row number) and
the second can be interpreted as columns, e.g. `data[3, 5]` or
`data[3, 'gender']` or `data[3, y]` (where `y` is an instance of
:obj:`~Orange.data.Variable`), a single value is returned as an instance
of :obj:`~Orange.data.Value`.
- In all other cases, the first index should be a row index, a slice or
a sequence, and the second index, which represents a set of columns,
should be an int, a slice, a sequence or a numpy array. The result is
a new storage with a new domain.
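The four indexing cases can be summarized as a dispatch on the type of the index. The sketch below is an assumption-laden illustration: `describe_index` and `IndexProbe` are hypothetical helpers that return a label naming the kind of result rather than any data, so the example runs without Orange installed.

```python
# Hypothetical helpers: they classify an index following the rules listed
# above and return a label instead of actual data.

def describe_index(index):
    if isinstance(index, tuple) and len(index) == 2:
        row, column = index
        # A real storage would also accept Variable descriptors as columns.
        if isinstance(row, int) and isinstance(column, (int, str)):
            return "single value"
        return "new storage with a new domain"
    if isinstance(index, int):
        return "single row"
    if isinstance(index, (slice, list)):
        return "new storage with selected rows"
    raise TypeError(f"unsupported index: {index!r}")


class IndexProbe:
    """Forwards whatever Python passes to __getitem__ to describe_index."""

    def __getitem__(self, index):
        return describe_index(index)


probe = IndexProbe()
print(probe[7])            # single row
print(probe[7:10])         # new storage with selected rows
print(probe[3, "gender"])  # single value
print(probe[:, [0, 2]])    # new storage with a new domain
```

Note that Python packs `data[3, 5]` into a single tuple argument, which is why the two-index cases are detected by checking for a tuple first.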
.. method:: __len__(self)
Return the number of data instances (rows).
Inspection
----------
.. method:: Storage.X_density, Storage.Y_density, Storage.metas_density
Indicate whether the attributes, classes and meta attributes are dense
(`Storage.DENSE`) or sparse (`Storage.SPARSE`). If they are sparse and
all values are 0 or 1, they are marked as `Storage.SPARSE_BOOL`. The
Storage class provides a default DENSE. If the data has no attributes,
classes or meta attributes, the corresponding method should re
Filters
-------
Storage should define the following methods to optimize the filtering
operations as allowed by the underlying data structure.
:obj:`Orange.data.Table` executes them directly through numpy (or bottleneck
or related) methods, while :obj:`Orange.data.sql.Table` appends them to the
WHERE clause of the query that defines the data.
These methods should not be called directly but through the classes defined in
:obj:`Orange.data.filter`. The classes in :obj:`Orange.data.filter` also provide
slower fallback implementations for the methods that a storage does not define.
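The relationship between a filter class and the storage's optimized method can be sketched as follows. All names here (`IsDefined`, `FastRows`) are hypothetical, rows are plain tuples of floats with NaN standing for a missing value, and the code only mirrors the dispatch idea; it is not the implementation found in :obj:`Orange.data.filter`.

```python
# Hypothetical names throughout; this mirrors the dispatch idea only and is
# not the code in Orange.data.filter.
import math


class IsDefined:
    """Keep rows with no missing (NaN) values; negate to invert."""

    def __init__(self, negate=False):
        self.negate = negate

    def __call__(self, data):
        fast = getattr(data, "_filter_is_defined", None)
        if fast is not None:
            # The storage provides its own method; let it do the work.
            return fast(negate=self.negate)
        # Slow fallback: inspect every row in Python.
        return [row for row in data
                if all(not math.isnan(v) for v in row) != self.negate]


class FastRows(list):
    """A list of rows that supplies its own _filter_is_defined.

    The body is a stand-in; a real storage would use numpy or SQL here.
    """

    calls = 0  # counts how often the storage-specific method was used

    def _filter_is_defined(self, columns=None, negate=False):
        FastRows.calls += 1
        return FastRows(row for row in self
                        if all(v == v for v in row) != negate)
```

Calling `IsDefined()(rows)` on a plain list takes the fallback path, while the same call on a `FastRows` object is routed to the storage's own method, so user code never needs to know which path was taken.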
.. method:: _filter_is_defined(self, columns=None, negate=False)
Extract rows without undefined values.
:param columns: optional list of columns that are checked for unknowns
:type columns: sequence of ints, variable names or descriptors
:param negate: invert the selection
:type negate: bool
:return: a new storage of the same type or :obj:`~Orange.data.Table`
:rtype: Orange.data.storage.Storage
.. method:: _filter_has_class(self, negate=False)
Return rows with a known value of the target attribute. If there are multiple
classes, all must be defined.
:param negate: invert the selection
:type negate: bool
:return: a new storage of the same type or :obj:`~Orange.data.Table`
:rtype: Orange.data.storage.Storage
.. method:: _filter_same_value(self, column, value, negate=False)
Select rows based on a value of the given variable.
:param column: the column that is checked
:type column: int, str or Orange.data.Variable
:param value: the value of the variable
:type value: int, float or str
:param negate: invert the selection
:type negate: bool
:return: a new storage of the same type or :obj:`~Orange.data.Table`
:rtype: Orange.data.storage.Storage
.. method:: _filter_values(self, filter)
Apply the given filter to the data.
:param filter: A filter for selecting the rows
:type filter: Orange.data.Filter
:return: a new storage of the same type or :obj:`~Orange.data.Table`
:rtype: Orange.data.storage.Storage
Aggregators
-----------
Similarly to filters, storage classes should provide several methods for fast
computation of statistics. These methods are not called directly but by modules
within :obj:`Orange.statistics`.
.. method:: _compute_basic_stats(
self, columns=None, include_metas=False, compute_variance=False)
Compute basic statistics for the specified variables: the minimal and maximal
value, the mean, the variance (or a zero placeholder), and the number
of missing and defined values.
:param columns: a list of columns for which the statistics are computed;
if `None`, the function computes the statistics for all variables
:type columns: list of ints, variable names or descriptors of type
:obj:`Orange.data.Variable`
:param include_metas: a flag which tells whether to include meta attributes
(applicable only if `columns` is `None`)
:type include_metas: bool
:param compute_variance: a flag which tells whether to compute the variance
:type compute_variance: bool
:return: a list with a tuple (min, max, mean, variance, #nans, #non-nans)
for each variable
:rtype: list
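The shape of one result tuple can be illustrated for a single column. This is a minimal sketch, not Orange's implementation: `basic_stats` is a hypothetical function, missing values are assumed to be NaN, the variance is assumed to be the population variance, and the values returned for a column with no defined entries are an arbitrary choice made for the example.

```python
import math


def basic_stats(column, compute_variance=False):
    """Return (min, max, mean, variance, #nans, #non-nans) for one column."""
    known = [v for v in column if not math.isnan(v)]
    n_nans = len(column) - len(known)
    if not known:
        # Assumed convention for a column with no defined values.
        return (math.inf, -math.inf, 0.0, 0.0, n_nans, 0)
    mean = sum(known) / len(known)
    variance = 0.0  # zero placeholder unless the variance is requested
    if compute_variance:
        # Population variance; whether Orange uses this form is an assumption.
        variance = sum((v - mean) ** 2 for v in known) / len(known)
    return (min(known), max(known), mean, variance, n_nans, len(known))
```

Skipping the variance by default saves a second pass over the data, which is presumably why it is controlled by a separate flag.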
.. method:: _compute_distributions(self, columns=None)
Compute the distribution for the specified variables. The result is a list
of pairs containing the distribution and the number of rows for which the
variable value was missing.
For discrete variables, the distribution is represented as a vector with
the absolute frequency of each value. For continuous variables, the result is
a 2-d array of shape (2, number-of-distinct-values); the first row contains
the (distinct) values of the variable and the second has their absolute
frequencies.
:param columns: a list of columns for which the distributions are computed;
if `None`, the function runs over all variables
:type columns: list of ints, variable names or descriptors of type
:obj:`Orange.data.Variable`
:return: a list of distributions
:rtype: list of numpy arrays
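The two result formats can be sketched for a single column using only the standard library. The functions below are hypothetical and use nested lists in place of numpy arrays; discrete values are assumed to be stored as value indices, missing values as NaN, and the sorted order of distinct values is an assumption of this example.

```python
import math
from collections import Counter


def discrete_distribution(column, n_values):
    """Return (frequencies, #missing) for a column of value indices."""
    freqs = [0] * n_values
    missing = 0
    for v in column:
        if math.isnan(v):
            missing += 1
        else:
            freqs[int(v)] += 1
    return freqs, missing


def continuous_distribution(column):
    """Return ([distinct values, their counts], #missing)."""
    known = [v for v in column if not math.isnan(v)]
    counts = Counter(known)
    values = sorted(counts)
    # Two parallel rows, i.e. shape (2, number-of-distinct-values).
    return [values, [counts[v] for v in values]], len(column) - len(known)
```

The discrete result has a fixed length given by the number of values of the variable, whereas the continuous result grows with the number of distinct values observed in the data.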
.. automethod:: Orange.data.storage.Storage._compute_contingency