Data Representation in KNIME

The DataTable Concept

KNIME uses DataCells to represent data; each DataCell holds a single entity, for instance a floating point value. An array of DataCells makes up a DataRow. There are several default implementations of DataCell that hold specific types of data, such as StringCell, IntCell, and DoubleCell. All data passed between nodes in the workflow is exposed as a DataTable, which contains a (generally unknown) number of rows of equal length. All elements in a column must be compatible with the type assigned to that column, e.g. they must all be numeric or all be strings. The following figure sketches a DataTable.


The meta information for a DataTable (number of columns, column-specific information) is available through the DataTableSpec, accessible via DataTable.getDataTableSpec. It consists of a set of DataColumnSpecs, one for each column. Each DataColumnSpec comprises a name (a string), a type (a DataType, see below), and some optional properties such as domain information (minimum, maximum, possible values), a color handler, and so on. For further details on these classes, see their class descriptions and the FAQ on how to use them.
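The rules described so far (rows of equal length, every cell compatible with its column's type, domain information kept per column) can be illustrated with a minimal plain-Java sketch. It deliberately does not use KNIME's classes: SketchColumnSpec and SketchTable are hypothetical stand-ins, and a java.lang.Class is assumed as a simplified substitute for a DataType.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a column spec: a name plus the type every cell must match. */
final class SketchColumnSpec {
    final String name;
    final Class<?> type; // assumption: a Java class stands in for KNIME's DataType

    SketchColumnSpec(String name, Class<?> type) {
        this.name = name;
        this.type = type;
    }
}

/** Hypothetical stand-in for a table: a fixed list of column specs plus equal-length rows. */
final class SketchTable {
    private final List<SketchColumnSpec> spec;
    private final List<Object[]> rows = new ArrayList<>();

    SketchTable(SketchColumnSpec... columns) {
        this.spec = new ArrayList<>(Arrays.asList(columns));
    }

    /** Adds a row after checking its length and each cell's type against the spec. */
    void addRow(Object... cells) {
        if (cells.length != spec.size()) {
            throw new IllegalArgumentException(
                "Expected " + spec.size() + " cells but got " + cells.length);
        }
        for (int i = 0; i < cells.length; i++) {
            SketchColumnSpec column = spec.get(i);
            if (!column.type.isInstance(cells[i])) {
                throw new IllegalArgumentException(
                    "Cell " + i + " is not compatible with column '" + column.name + "'");
            }
        }
        rows.add(cells.clone());
    }

    int getRowCount() {
        return rows.size();
    }

    int getColumnCount() {
        return spec.size();
    }

    /** Computes {minimum, maximum} of a numeric column, or null if the table has no rows. */
    double[] numericDomain(int column) {
        if (!Number.class.isAssignableFrom(spec.get(column).type)) {
            throw new IllegalArgumentException("Column " + column + " is not numeric");
        }
        if (rows.isEmpty()) {
            return null;
        }
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (Object[] row : rows) {
            double value = ((Number) row[column]).doubleValue();
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        return new double[] {min, max};
    }
}
```

In this sketch, a table built from a String column and a Double column accepts addRow("a", 1.5) but rejects addRow("c", "text") with an IllegalArgumentException, because the second cell does not match the column type.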

A BufferedDataTable is a special implementation of a DataTable that is used as the data structure to pass data from one node in the workflow to another. This has some major advantages over using the generic DataTable interface:

New Types in KNIME

KNIME allows you to define customized types of data, e.g. a DataCell that carries a representation of a molecule. Specialized data types bring along their own renderer (e.g. to display the molecular structure), a customized icon (displayed, for instance, in a column header to identify the column's type), and a comparator. To implement a new type in KNIME, you generally have to create two different classes/interfaces:

  1. An interface derived from DataValue that defines the access method(s) to the generic objects. This will be the base interface of the new type and will also expose the meta information (renderer and such). Since Java offers no way to require a static member or method through an interface (or any other class definition), KNIME uses reflection to access a member of that interface with the following signature:
      public static final UtilityFactory UTILITY
      
    The class UtilityFactory has methods to retrieve specific information about this type implementation. If no such member is provided, the reflection mechanism uses the information from the super interface, i.e. DataValue (though no customized information is available in this case). It is highly recommended to define the UtilityFactory as an inner class of the interface. The new interface should look similar to:
      public interface MyDataValue extends DataValue {
        
        /** Derived locally. */
        public static final UtilityFactory UTILITY = new UtilityFactory() {
           ...
        };
        
        /** The interface methods. */
        MyValue getMyValue();
      }
     
  2. A class derived from DataCell that implements the interface defined in 1. and any other DataValue interface with which your new type should be compatible (for instance, our new molecule type should also be able to return a simple string representation of the molecule, so it needs to implement StringValue). The first time the new DataCell is instantiated, KNIME parses the list of implemented DataValue interfaces and makes the list of compatible DataValues available through the DataCell's DataType. You associate your DataCell implementation with your newly created DataType by returning it in MyCell#getType(). The DataType is determined at runtime, the first time the method DataType.getType(MyCell.class) is invoked, either directly or implicitly. There are two important issues when defining a new DataCell:
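Stepping back to the reflective lookups described in steps 1 and 2: the following plain-Java sketch illustrates the general technique, not KNIME's actual implementation. All names (SketchUtility, SketchValue, SketchMoleculeValue, SketchPlainValue, SketchMoleculeCell, UtilityLookup) are hypothetical stand-ins. It shows how a static UTILITY member can be read from an interface via reflection with a fallback to the root interface, and how the value interfaces implemented by a cell class can be collected.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for the object that carries a type's meta information. */
class SketchUtility {
    private final String label;
    SketchUtility(String label) { this.label = label; }
    String getLabel() { return label; }
}

/** Hypothetical root interface, standing in for DataValue. */
interface SketchValue {
    SketchUtility UTILITY = new SketchUtility("generic");
}

/** Declares its own UTILITY, so a lookup returns the customized object. */
interface SketchMoleculeValue extends SketchValue {
    SketchUtility UTILITY = new SketchUtility("molecule");
    String getSmiles();
}

/** Declares no UTILITY of its own, so a lookup falls back to the root interface. */
interface SketchPlainValue extends SketchValue {
    String getText();
}

/** Hypothetical cell that is compatible with two value interfaces. */
final class SketchMoleculeCell implements SketchMoleculeValue, SketchPlainValue {
    private final String smiles;
    SketchMoleculeCell(String smiles) { this.smiles = smiles; }
    @Override public String getSmiles() { return smiles; }
    @Override public String getText() { return smiles; }
}

final class UtilityLookup {
    private UtilityLookup() { }

    /** Reads the static UTILITY member declared directly on the given interface, if any. */
    static SketchUtility find(Class<? extends SketchValue> valueInterface) {
        try {
            Field field = valueInterface.getDeclaredField("UTILITY");
            boolean usable = Modifier.isStatic(field.getModifiers())
                && SketchUtility.class.isAssignableFrom(field.getType());
            if (usable) {
                return (SketchUtility) field.get(null);
            }
        } catch (NoSuchFieldException | IllegalAccessException e) {
            // No usable member declared here: fall through to the root's utility.
        }
        return SketchValue.UTILITY;
    }

    /** Collects every SketchValue interface a cell class implements, directly or inherited. */
    static Set<Class<?>> compatibleValues(Class<?> cellClass) {
        Set<Class<?>> result = new LinkedHashSet<>();
        for (Class<?> c = cellClass; c != null; c = c.getSuperclass()) {
            collect(c.getInterfaces(), result);
        }
        return result;
    }

    private static void collect(Class<?>[] interfaces, Set<Class<?>> result) {
        for (Class<?> candidate : interfaces) {
            if (SketchValue.class.isAssignableFrom(candidate) && result.add(candidate)) {
                collect(candidate.getInterfaces(), result);
            }
        }
    }
}
```

The sketch uses getDeclaredField rather than getField on purpose: getDeclaredField only sees members declared on the interface itself, which makes the fallback to the root interface an explicit step.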

The skeleton for the new DataCell will look like:

public class MyCell extends DataCell implements MyDataValue, StringValue {

    /** Returns the value interface that this cell type prefers. */
    public static final Class<? extends DataValue> getPreferredValueClass() {
        return MyDataValue.class;
    }
    
    /** Returns the serializer used to write and restore MyCell instances. */
    public static final DataCellSerializer<MyCell> getCellSerializer() {
        return new DataCellSerializer<MyCell>() {
            public void serialize(MyCell cell, DataOutput output) throws IOException {
                ...
            }
            public MyCell deserialize(DataInput input) throws IOException {
                ...
                return new MyCell(...);
            }
        };
    }
    ...
}
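
The serialize/deserialize pair in the skeleton follows the usual java.io.DataOutput and java.io.DataInput pattern: the fields are written in a fixed order and read back in exactly that order. The following self-contained sketch shows such a round trip using only the Java standard library. SketchSmilesCell and SerializerRoundTrip are hypothetical names, and the two fields (a SMILES string and an atom count) are assumptions made for illustration; the sketch does not implement KNIME's DataCellSerializer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical immutable cell holding a SMILES string and an atom count. */
final class SketchSmilesCell {
    private final String smiles;
    private final int atomCount;

    SketchSmilesCell(String smiles, int atomCount) {
        this.smiles = smiles;
        this.atomCount = atomCount;
    }

    String getSmiles() { return smiles; }
    int getAtomCount() { return atomCount; }

    /** Writes the fields in a fixed order. */
    static void serialize(SketchSmilesCell cell, DataOutput output) throws IOException {
        output.writeUTF(cell.smiles);
        output.writeInt(cell.atomCount);
    }

    /** Reads the fields back in exactly the order serialize wrote them. */
    static SketchSmilesCell deserialize(DataInput input) throws IOException {
        String smiles = input.readUTF();
        int atomCount = input.readInt();
        return new SketchSmilesCell(smiles, atomCount);
    }
}

final class SerializerRoundTrip {
    private SerializerRoundTrip() { }

    /** Serializes a cell into a byte array and restores a new instance from it. */
    static SketchSmilesCell roundTrip(SketchSmilesCell cell) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                SketchSmilesCell.serialize(cell, out);
            }
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return SketchSmilesCell.deserialize(in);
            }
        } catch (IOException e) {
            // In-memory streams should not fail; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
    }
}
```

A round trip through SerializerRoundTrip.roundTrip yields a distinct instance with the same field values. If the read order in deserialize did not mirror the write order in serialize, the restored cell would be corrupt or the read would fail.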

Collection of DataTypes

Documentation is pending... For a quick overview refer to the implementations of CollectionDataValue and ListCell.