Loading and saving data (io)

Orange.data.Table supports loading from several file formats:

  • Comma-separated values (*.csv) file,
  • Tab-separated values (*.tab, *.tsv) file,
  • Excel spreadsheet (*.xls, *.xlsx),
  • Python pickle.

In addition, the text-based files (CSV, TSV) can be compressed with gzip, bzip2 or xz (e.g. *.csv.gz).

Header Format

The data in CSV, TSV, and Excel files can be described in an extended three-line header format, or a condensed single-line header format.

Three-line header format

A three-line header consists of:

  1. Feature names on the first line. Feature names can include any combination of characters.
  2. Feature types on the second line. The type is determined automatically, or, if set, can be any of the following:
    • discrete (or d) — imported as Orange.data.DiscreteVariable,
    • a space-separated list of discrete values, like “male female”, which will result in Orange.data.DiscreteVariable with those values and in that order. If the individual values contain a space character, it needs to be escaped (prefixed) with, as common, a backslash (’') character.
    • continuous (or c) — imported as Orange.data.ContinuousVariable,
    • string (or s, or text) — imported as Orange.data.StringVariable,
    • time (or t) — imported as Orange.data.TimeVariable, if the values parse as ISO 8601 date/time formats,
  3. Flags (optional) on the third header line. Feature’s flag can be empty, or it can contain, space-separated, a consistent combination of:
    • class (or c) — feature will be imported as a class variable. Most algorithms expect a single class variable.
    • meta (or m) — feature will be imported as a meta-attribute, just describing the data instance but not actually used for learning,
    • weight (or w) — the feature marks the weight of examples (in algorithms that support weighted examples),
    • ignore (or i) — feature will not be imported,
    • <key>=<value> are custom attributes recognized in specific contexts, for instance color, which defines the color palette when the variable is visualized, or type=image which signals that the variable contains a path to an image.

Example of iris dataset in Orange’s three-line format (iris.tab).

sepal length	sepal width	petal length	petal width	iris
c	c	c	c	d
5.1	3.5	1.4	0.2	Iris-setosa
4.9	3.0	1.4	0.2	Iris-setosa
4.7	3.2	1.3	0.2	Iris-setosa
4.6	3.1	1.5	0.2	Iris-setosa

Single-line header format

Single-line header consists of feature names prefixed by an optional “<flags>#” string, i.e. flags followed by a hash (‘#’) sign. The flags can be a consistent combination of:

  • c for class feature (also known as a target variable or dependent variable),
  • i for feature to be ignored,
  • m for meta attributes (not used in learning),
  • C for features that are continuous (numeric),
  • D for features that are discrete (categorical),
  • T for features that represent date and/or time in one of the ISO 8601 formats,
  • S for string features.

If some (all) names or flags are omitted, the names, types, and flags are discerned automatically, and correctly (most of the time).