Data Storage (storage)¶
Orange.data.storage.Storage is an abstract class representing a data object
in which rows represent data instances (examples, in machine learning
terminology) and columns represent variables (features, attributes, classes,
targets, meta attributes).
Data is divided into three parts that represent independent variables (X),
dependent variables (Y) and meta data (metas). If practical, the class
should expose those parts as properties. In the associated domain
(Orange.data.Domain), the three parts correspond to lists of variable
descriptors attributes, class_vars and metas.
Any of those parts may be missing, dense, sparse or sparse boolean. The difference between the latter two is that sparse data can be seen as a list of pairs (variable, value), while in sparse boolean data the variable (item) is simply present or absent, as in market basket analysis. The actual storage of sparse data depends upon the storage type.
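As a rough illustration of how the sparse and sparse boolean forms differ, the sketch below encodes one row in each form using plain Python containers. The variable names, the containers and the default of 0.0 for absent entries are assumptions made for this example only; they do not describe how any concrete storage lays out its data.

```python
# Hypothetical rows; all names and values are illustrative only.

# Sparse: only the (variable, value) pairs that are present are stored.
sparse_row = [("age", 42.0), ("income", 3100.0)]

# Sparse boolean: only the presence of an item matters (market basket style).
basket_row = {"bread", "milk"}


def sparse_value(row, variable, default=0.0):
    """Look up a variable in a list of (variable, value) pairs.

    Assumption: a variable that is not listed takes the default value.
    """
    for name, value in row:
        if name == variable:
            return value
    return default


def basket_contains(row, item):
    """A sparse boolean row only answers whether the item is present."""
    return item in row
```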
There is no uniform constructor signature: every derived class provides one or more specific constructors.
There are currently two derived classes, Orange.data.Table and
Orange.data.sql.Table, the former storing the data in-memory, in numpy
objects, and the latter in SQL (currently, only PostgreSQL is supported).
Derived classes must implement at least the methods for getting rows and the
number of instances (__getitem__ and __len__). To make storage fast enough
to be practically useful, it must also reimplement a number of filters,
preprocessors and aggregators. For instance, the method
_filter_values(self, filter) returns a new storage that contains only the
rows that match the criteria given in the filter. Orange.data.Table
implements an efficient method based on numpy indexing, and
Orange.data.sql.Table, which "stores" a table as an SQL query, converts
the filter into a WHERE clause.
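The contrast between the two strategies can be sketched with two toy classes. Both classes, the filter_equal method and the sql helper are hypothetical names invented for this example; they are not part of Orange and only mimic the idea of evaluating a filter eagerly in memory versus recording it as a condition of a query.

```python
class ListStorage:
    """Toy in-memory storage: rows are dicts kept in a Python list."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        return self.rows[index]

    def filter_equal(self, column, value):
        # The condition is evaluated immediately; a new storage is returned.
        return ListStorage(r for r in self.rows if r[column] == value)


class QueryStorage:
    """Toy lazy storage: the 'data' is only a description of a query."""

    def __init__(self, table, conditions=()):
        self.table = table
        self.conditions = tuple(conditions)

    def filter_equal(self, column, value):
        # Nothing is evaluated here; the condition is only recorded.
        return QueryStorage(self.table, self.conditions + ((column, value),))

    def sql(self):
        """Return the query text with placeholders and the list of parameters."""
        query = f"SELECT * FROM {self.table}"
        if self.conditions:
            query += " WHERE " + " AND ".join(
                f"{column} = %s" for column, _ in self.conditions)
        return query, [value for _, value in self.conditions]
```

Chaining filters on the lazy variant only lengthens the WHERE clause, so no rows are fetched until the query is eventually executed.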
- Orange.data.storage.domain(:obj:`Orange.data.Domain`)¶
The domain describing the columns of the data
Data access¶
- Orange.data.storage.__getitem__(self, index)¶
Return one or more rows of the data.
If the index is an int, e.g. data[7], the corresponding row is returned as an instance of Instance. Concrete implementations of Storage use specific derived classes for instances.
If the index is a slice or a sequence of ints, e.g. data[7:10] or data[[7, 42, 15]], indexing returns a new storage with the selected rows.
If there are two indices, where the first is an int (a row number) and the second can be interpreted as a column, e.g. data[3, 5] or data[3, 'gender'] or data[3, y] (where y is an instance of Variable), a single value is returned as an instance of Value.
In all other cases, the first index should be a row index, a slice or a sequence, and the second index, which represents a set of columns, should be an int, a slice, a sequence or a numpy array. The result is a new storage with a new domain.
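The indexing rules above can be mimicked with a small stand-in class. ToyStorage is hypothetical and much simpler than a real storage: plain lists replace Instance and Value objects, and a list of column names stands in for the domain. It is meant only as a sketch of how the index type selects the kind of result.

```python
class ToyStorage:
    """Hypothetical storage over a list of rows; not Orange's implementation."""

    def __init__(self, columns, rows):
        self.columns = list(columns)           # column names stand in for a domain
        self.rows = [list(r) for r in rows]

    def __len__(self):
        return len(self.rows)

    def _column_index(self, col):
        # A column may be given by position or by name.
        return col if isinstance(col, int) else self.columns.index(col)

    def __getitem__(self, index):
        if isinstance(index, tuple):
            row_idx, col_idx = index
            if isinstance(row_idx, int) and isinstance(col_idx, (int, str)):
                # data[3, 5] or data[3, 'gender']: a single value
                return self.rows[row_idx][self._column_index(col_idx)]
            # All other two-index cases: new storage with selected rows and columns
            sub = self[[row_idx]] if isinstance(row_idx, int) else self[row_idx]
            if isinstance(col_idx, slice):
                cols = list(range(len(self.columns)))[col_idx]
            elif isinstance(col_idx, (int, str)):
                cols = [self._column_index(col_idx)]
            else:
                cols = [self._column_index(c) for c in col_idx]
            return ToyStorage([self.columns[c] for c in cols],
                              [[r[c] for c in cols] for r in sub.rows])
        if isinstance(index, int):
            return self.rows[index]                              # data[7]
        if isinstance(index, slice):
            return ToyStorage(self.columns, self.rows[index])    # data[7:10]
        return ToyStorage(self.columns,
                          [self.rows[i] for i in index])         # data[[7, 42, 15]]
```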
- Orange.data.storage.__len__(self)¶
Return the number of data instances (rows).
Inspection¶
- Storage.X_density, Storage.Y_density, Storage.metas_density
Indicate whether the attributes, classes and meta attributes are dense (Storage.DENSE) or sparse (Storage.SPARSE). If they are sparse and all values are 0 or 1, they are marked as sparse boolean (Storage.SPARSE_BOOL). The Storage class provides a default of DENSE. If the data has no attributes, classes or meta attributes, the corresponding method should re
Filters¶
Storage should define the following methods to optimize the filtering
operations as allowed by the underlying data structure.
Orange.data.Table executes them directly through numpy (or bottleneck
or related) methods, while Orange.data.sql.Table appends them to the
WHERE clause of the query that defines the data.
These methods should not be called directly but through the classes defined in
Orange.data.filter. Methods in Orange.data.filter also provide
slower fallback implementations for the methods that a storage does not define.
- Orange.data.storage._filter_is_defined(self, columns=None, negate=False)¶
Extract rows without undefined values.
- Orange.data.storage._filter_has_class(self, negate=False)¶
Return rows with a known value of the target attribute. If there are multiple classes, all must be defined.
- Orange.data.storage._filter_same_value(self, column, value, negate=False)¶
Select rows based on a value of the given variable.
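To make the role of the negate flag concrete, here is a minimal sketch of two such filters over rows stored as plain lists. The function names, the list-of-lists layout and the use of NaN to encode an undefined value are assumptions of this example and are not taken from any concrete storage.

```python
import math


def is_missing(value):
    """Assumption for this sketch: undefined values are encoded as NaN."""
    return isinstance(value, float) and math.isnan(value)


def filter_is_defined(rows, columns=None, negate=False):
    """Keep rows with no undefined value in the given columns (all if None)."""
    def defined(row):
        cols = range(len(row)) if columns is None else columns
        return not any(is_missing(row[c]) for c in cols)
    # With negate=True the selection is inverted.
    return [r for r in rows if defined(r) != negate]


def filter_same_value(rows, column, value, negate=False):
    """Keep rows whose value in the given column equals the given value."""
    return [r for r in rows if (r[column] == value) != negate]
```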
Aggregators¶
Similarly to filters, storage classes should provide several methods for fast
computation of statistics. These methods are not called directly but by modules
within Orange.statistics.
- _compute_basic_stats(self, columns=None, include_metas=False, compute_variance=False)
Compute basic statistics for the specified variables: the minimal and maximal value, the mean and the variance (or a zero placeholder), and the number of missing and defined values.
- Parameters
columns (list of ints, variable names or descriptors of type Orange.data.Variable) -- a list of columns for which the statistics are computed; if None, the function computes the statistics for all variables
include_metas (bool) -- a flag which tells whether to include meta attributes (applicable only if columns is None)
compute_variance (bool) -- a flag which tells whether to compute the variance
- Returns
a list with a tuple (min, max, mean, variance, #nans, #non-nans) for each variable
- Return type
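The shape of one result tuple can be illustrated with a pure-Python sketch for a single numeric column. The helper name basic_stats, the NaN encoding of missing values, the choice of population variance and the handling of a column without defined values are all assumptions of this example; a real storage may differ on each point.

```python
import math


def basic_stats(column, compute_variance=False):
    """Return (min, max, mean, variance, n_nans, n_non_nans) for one column."""
    known = [v for v in column if not math.isnan(v)]
    n_nans = len(column) - len(known)
    if not known:
        # Assumption: with no defined values, min, max and mean are NaN.
        return (math.nan, math.nan, math.nan, 0.0, n_nans, 0)
    mean = sum(known) / len(known)
    variance = 0.0  # zero placeholder when the variance is not requested
    if compute_variance:
        # Assumption: population variance (divide by n, not by n - 1).
        variance = sum((v - mean) ** 2 for v in known) / len(known)
    return (min(known), max(known), mean, variance, n_nans, len(known))
```

Leaving the variance as a zero placeholder keeps the tuple layout fixed regardless of whether the variance was requested.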
- Orange.data.storage._compute_distributions(self, columns=None)¶
Compute the distribution for the specified variables. The result is a list of pairs containing the distribution and the number of rows for which the variable value was missing.
For discrete variables, the distribution is represented as a vector with absolute frequency of each value. For continuous variables, the result is a 2-d array of shape (2, number-of-distinct-values); the first row contains (distinct) values of the variables and the second has their absolute frequencies.
- Parameters
columns (list of ints, variable names or descriptors of type Orange.data.Variable) -- a list of columns for which the distributions are computed; if None, the function runs over all variables
- Returns
a list of distributions
- Return type
list of numpy arrays
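The two distribution formats can be sketched in plain Python, with nested lists standing in for numpy arrays. The function names, the encoding of discrete values as indices 0..n-1, the NaN encoding of missing values and the ascending order of distinct values are assumptions made for this example.

```python
import math
from collections import Counter


def discrete_distribution(column, n_values):
    """Return (frequencies, n_missing) for a column of value indices."""
    freqs = [0] * n_values
    missing = 0
    for v in column:
        if math.isnan(v):
            missing += 1
        else:
            freqs[int(v)] += 1
    return freqs, missing


def continuous_distribution(column):
    """Return ([distinct values, their frequencies], n_missing)."""
    known = [v for v in column if not math.isnan(v)]
    counts = Counter(known)
    values = sorted(counts)  # assumption: distinct values in ascending order
    return [values, [counts[v] for v in values]], len(column) - len(known)
```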
- Storage._compute_contingency(col_vars=None, row_var=None)[source]¶
Compute contingency matrices for one or more discrete or continuous variables against the specified discrete variable.
The resulting list contains a pair for each column variable. The first element contains the contingencies and the second element gives the distribution of the row variable for instances in which the value of the column variable is missing.
The format of contingencies returned depends on the variable type:
for discrete variables, it is a numpy array, where element (i, j) contains the count of rows with the i-th value of the row variable and the j-th value of the column variable.
for continuous variables, contingency is a list of two arrays, where the first array contains ordered distinct values of the column variable and element (i, j) of the second array contains the count of rows with the i-th value of the row variable and the j-th of the ordered values of the column variable.
- Parameters
col_vars (list of ints, variable names or descriptors of type Orange.data.Variable) -- variables whose values will correspond to columns of contingency matrices
row_var (int, variable name or Orange.data.DiscreteVariable) -- a discrete variable whose values will correspond to the rows of contingency matrices
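For the discrete case, the pair described above can be sketched as follows, with nested lists in place of numpy arrays. The function name, the encoding of values as indices, the NaN encoding of missing values and the decision to skip rows whose row variable is unknown are assumptions of this example.

```python
import math


def discrete_contingency(row_col, col_col, n_row_values, n_col_values):
    """Return (counts, missing) for a discrete column variable.

    counts[i][j] is the number of rows with the i-th value of the row
    variable and the j-th value of the column variable; missing[i] counts
    rows with the i-th row value whose column value is missing.
    """
    counts = [[0] * n_col_values for _ in range(n_row_values)]
    missing = [0] * n_row_values
    for r, c in zip(row_col, col_col):
        if math.isnan(r):
            continue  # assumption: rows with an unknown row variable are skipped
        if math.isnan(c):
            missing[int(r)] += 1
        else:
            counts[int(r)][int(c)] += 1
    return counts, missing
```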