org.knime.core.data.container
Class DataContainer

java.lang.Object
  extended by org.knime.core.data.container.DataContainer
All Implemented Interfaces:
RowAppender
Direct Known Subclasses:
BufferedDataContainer

public class DataContainer
extends Object
implements RowAppender

Buffer that collects DataRow objects and creates a DataTable on request. This data structure is useful if the number of rows is not known in advance.

Usage: Create a container with a given spec (matching the rows being added later on), add the data using the addRowToTable(DataRow) method, and finally close it with close(). You can then access the table with getTable().

Note regarding the column domain: This implementation updates the column domain while new rows are added to the table. It keeps the lower and upper bound for all numeric columns, i.e. columns whose type is a subtype of DoubleCell.TYPE. For categorical columns, it keeps the list of possible values as long as the number of distinct values does not exceed 60. (If there are more, the values are forgotten and are therefore not available in the final table.) A categorical column is a column whose type is a subtype of StringCell.TYPE, i.e. StringCell.TYPE.isSuperTypeOf(yourtype) returns true, where yourtype is the given column type.
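
The domain bookkeeping described in this note can be illustrated with a minimal plain-Java sketch. It is not the KNIME implementation: ColumnDomainSketch and all of its members are hypothetical names introduced here, and cell values are simplified to double and String. Only the rules stated above (track lower and upper bound, keep possible values until the threshold is exceeded, then forget them) are taken from the documentation.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for per-column domain bookkeeping; not a KNIME class. */
final class ColumnDomainSketch {
    private final int maxPossibleValues;
    private double lower = Double.POSITIVE_INFINITY;
    private double upper = Double.NEGATIVE_INFINITY;
    // Set to null once more than maxPossibleValues distinct values were seen.
    private Set<String> possibleValues = new LinkedHashSet<>();

    ColumnDomainSketch(int maxPossibleValues) {
        if (maxPossibleValues < 0) {
            throw new IllegalArgumentException("Threshold must not be negative");
        }
        this.maxPossibleValues = maxPossibleValues;
    }

    /** Numeric column: widen the bounds to include the new value. */
    void updateNumeric(double value) {
        lower = Math.min(lower, value);
        upper = Math.max(upper, value);
    }

    /** Categorical column: remember the value unless the list was already dropped. */
    void updateCategorical(String value) {
        if (possibleValues == null) {
            return; // values were forgotten earlier and stay forgotten
        }
        possibleValues.add(value);
        if (possibleValues.size() > maxPossibleValues) {
            possibleValues = null; // threshold exceeded: forget all values
        }
    }

    double getLower() { return lower; }

    double getUpper() { return upper; }

    /** Returns the possible values, or null if the threshold was exceeded. */
    Set<String> getPossibleValues() { return possibleValues; }
}
```

With a threshold of 60, the sketch still reports all values after 60 distinct strings and returns null after the 61st, which mirrors the "does not exceed 60" wording above.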

Author:
Bernd Wiswedel, University of Konstanz

Nested Class Summary
(package private) static class DataContainer.BufferCreator
          Helper class to create a Buffer instance given a binary file and the data table spec.
 
Field Summary
(package private) static int ASYNC_CACHE_SIZE
          Size of buffers.
(package private) static String CFG_TABLESPEC
          Used in write/readFromZip: Config entry: The spec of the table.
static boolean DEF_GZIP_COMPRESSION
          Whether compression is enabled by default.
static int DEF_MAX_CELLS_IN_MEMORY
          The default number of cells to be held in memory.
static int MAX_CELLS_IN_MEMORY
          Number of cells that are cached without being written to the temp file (see Buffer implementation). It defaults to the value defined by DEF_MAX_CELLS_IN_MEMORY but can be changed using the Java property PROPERTY_CELLS_IN_MEMORY.
static String PROPERTY_CELLS_IN_MEMORY
          Java property name to set a different threshold for the number of cells to be held in main memory.
(package private) static boolean SYNCHRONOUS_IO
          Whether to use synchronous IO while adding rows to a buffer or reading from a file iterator.
(package private) static String ZIP_ENTRY_SPEC
          Used in write/readFromZip: Name of the zip entry containing the spec.
 
Constructor Summary
DataContainer(DataTableSpec spec)
          Opens the container so that rows can be added by addRowToTable(DataRow).
DataContainer(DataTableSpec spec, boolean initDomain)
          Opens the container so that rows can be added by addRowToTable(DataRow).
DataContainer(DataTableSpec spec, boolean initDomain, int maxCellsInMemory)
          Opens the container so that rows can be added by addRowToTable(DataRow).
 
Method Summary
protected  void addRowKeyForDuplicateCheck(RowKey key)
          Method being called when addRowToTable(DataRow) is called.
 void addRowToTable(DataRow row)
          Appends a row to the end of a container.
static DataTable cache(DataTable table, ExecutionMonitor exec)
          Convenience method that will buffer the entire argument table.
static DataTable cache(DataTable table, ExecutionMonitor exec, int maxCellsInMemory)
          Convenience method that will buffer the entire argument table.
 void close()
          Closes container and creates table that can be accessed by getTable().
protected  int createInternalBufferID()
          Get an internal id for the buffer being used.
static File createTempFile()
          Creates a temp file called "knime_container_date_xxxx.zip" and marks it for deletion upon exit.
protected  ContainerTable getBufferedTable()
          Returns the table holding the data.
protected  Map<Integer,ContainerTable> getGlobalTableRepository()
          Get the map of buffers that potentially have written blob objects.
protected  Map<Integer,ContainerTable> getLocalTableRepository()
          Get the local repository.
 DataTable getTable()
          Get reference to table.
 DataTableSpec getTableSpec()
          Get the currently set DataTableSpec.
 boolean isClosed()
          Returns true if table has been closed and getTable() will return a DataTable object.
static boolean isContainerTable(DataTable table)
          Returns true if the given argument table has been created by the DataContainer, false otherwise.
protected  boolean isForceCopyOfBlobs()
          Get the property, which has possibly been set by setForceCopyOfBlobs(boolean).
 boolean isOpen()
          Returns true if the container has been initialized with DataTableSpec and is ready to accept rows.
static ContainerTable readFromStream(InputStream in)
          Reads a table from an input stream.
static ContainerTable readFromZip(File zipFile)
          Reads a table from a zip file that has been written using the writeToZip(DataTable, File, ExecutionMonitor) method.
(package private) static ContainerTable readFromZip(ReferencedFile zipFileRef, DataContainer.BufferCreator creator)
          Factory method used to restore table from zip file.
(package private) static ContainerTable readFromZipDelayed(CopyOnAccessTask c, DataTableSpec spec)
          Used in BufferedDataContainer to read the tables from the workspace location.
protected static ContainerTable readFromZipDelayed(ReferencedFile zipFile, DataTableSpec spec, int bufferID, Map<Integer,ContainerTable> bufferRep)
          Used in BufferedDataContainer to read the tables from the workspace location.
protected  void setBufferCreator(DataContainer.BufferCreator bufferCreator)
          Set a buffer creator to be used to initialize the buffer.
protected  void setForceCopyOfBlobs(boolean forceCopyOfBlobs)
          If true, any blob that is not owned by this container will be copied and this container will take ownership.
 void setMaxPossibleValues(int maxPossibleValues)
          Define a new threshold for number of possible values to memorize.
 int size()
          Get the number of rows that have been added so far.
static void writeToStream(DataTable table, OutputStream out, ExecutionMonitor exec)
          Writes a given DataTable permanently to an output stream.
static void writeToZip(DataTable table, File zipFile, ExecutionMonitor exec)
          Writes a given DataTable permanently to a zip file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEF_GZIP_COMPRESSION

public static final boolean DEF_GZIP_COMPRESSION
Whether compression is enabled by default.

See Also:
KNIMEConstants.PROPERTY_TABLE_GZIP_COMPRESSION, Constant Field Values

PROPERTY_CELLS_IN_MEMORY

public static final String PROPERTY_CELLS_IN_MEMORY
Java property name to set a different threshold for the number of cells to be held in main memory. This property is set at startup, usually by adding a line such as -Dorg.knime.container.cellsinmemory=1000 to the knime.ini file in the installation directory.

See Also:
Constant Field Values
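
As a rough illustration of how a threshold can be taken from such a startup property, the following sketch reads the property name shown above and falls back to a caller-supplied default. CellsInMemorySketch and resolve are hypothetical names, the fallback value is a placeholder passed in by the caller (the real default is listed under Constant Field Values and is not repeated here), and treating negative values as invalid is an assumption based on the constructor rejecting maxCellsInMemory < 0.

```java
/** Hypothetical helper; shows one way to read the startup property. */
final class CellsInMemorySketch {
    /** Property name as given in the documentation above. */
    static final String PROPERTY = "org.knime.container.cellsinmemory";

    private CellsInMemorySketch() {
    }

    /**
     * Returns the configured threshold, or the fallback if the property is
     * unset, not a number, or negative (the last rule is an assumption).
     */
    static int resolve(int fallback) {
        // Integer.getInteger returns the fallback for missing or malformed values.
        int value = Integer.getInteger(PROPERTY, fallback);
        return value < 0 ? fallback : value;
    }
}
```

Started with -Dorg.knime.container.cellsinmemory=1000, a call such as CellsInMemorySketch.resolve(5000) would yield 1000; without the property it would yield the placeholder 5000.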

DEF_MAX_CELLS_IN_MEMORY

public static final int DEF_MAX_CELLS_IN_MEMORY
The default number of cells to be held in memory.

See Also:
Constant Field Values

MAX_CELLS_IN_MEMORY

public static final int MAX_CELLS_IN_MEMORY
Number of cells that are cached without being written to the temp file (see Buffer implementation). It defaults to the value defined by DEF_MAX_CELLS_IN_MEMORY but can be changed using the Java property PROPERTY_CELLS_IN_MEMORY.


ASYNC_CACHE_SIZE

static final int ASYNC_CACHE_SIZE
Size of buffers.

See Also:
Constant Field Values

SYNCHRONOUS_IO

static final boolean SYNCHRONOUS_IO
Whether to use synchronous IO while adding rows to a buffer or reading from a file iterator. This is false by default but can be enabled by setting the appropriate Java property at startup.


ZIP_ENTRY_SPEC

static final String ZIP_ENTRY_SPEC
Used in write/readFromZip: Name of the zip entry containing the spec.

See Also:
Constant Field Values

CFG_TABLESPEC

static final String CFG_TABLESPEC
Used in write/readFromZip: Config entry: The spec of the table.

See Also:
Constant Field Values
Constructor Detail

DataContainer

public DataContainer(DataTableSpec spec)
Opens the container so that rows can be added by addRowToTable(DataRow). The table spec of the resulting table (the one being returned by getTable()) will have a valid column domain. That means, while rows are added to the container, the domain of each column is adjusted.

If you prefer to stick with the domain as passed in the argument, use the constructor DataContainer(DataTableSpec, true, DataContainer.MAX_CELLS_IN_MEMORY) instead.

Parameters:
spec - Table spec of the final table. Rows that are added to the container must comply with this spec.
Throws:
NullPointerException - If spec is null.

DataContainer

public DataContainer(DataTableSpec spec,
                     boolean initDomain)
Opens the container so that rows can be added by addRowToTable(DataRow).

Parameters:
spec - Table spec of the final table. Rows that are added to the container must comply with this spec.
initDomain - if set to true, the column domains in the container are initialized with the domains from spec.
Throws:
NullPointerException - If spec is null.

DataContainer

public DataContainer(DataTableSpec spec,
                     boolean initDomain,
                     int maxCellsInMemory)
Opens the container so that rows can be added by addRowToTable(DataRow).

Parameters:
spec - Table spec of the final table. Rows that are added to the container must comply with this spec.
initDomain - if set to true, the column domains in the container are initialized with the domains from spec.
maxCellsInMemory - Maximum count of cells in memory before swapping.
Throws:
IllegalArgumentException - If maxCellsInMemory < 0.
NullPointerException - If spec is null.
Method Detail

setBufferCreator

protected void setBufferCreator(DataContainer.BufferCreator bufferCreator)
Set a buffer creator to be used to initialize the buffer. This method must be called before any rows are added.

Parameters:
bufferCreator - To be used.
Throws:
NullPointerException - If the argument is null.
IllegalStateException - If the buffer has already been created.

setForceCopyOfBlobs

protected final void setForceCopyOfBlobs(boolean forceCopyOfBlobs)
If true, any blob that is not owned by this container will be copied and this container will take ownership. This option is true for loop end nodes, which need to aggregate the data generated in the loop body.

Parameters:
forceCopyOfBlobs - the property described above
Throws:
IllegalStateException - If this buffer has already added rows, i.e. this method must be called right after construction.

isForceCopyOfBlobs

protected final boolean isForceCopyOfBlobs()
Get the property, which has possibly been set by setForceCopyOfBlobs(boolean).

Returns:
this property.

setMaxPossibleValues

public void setMaxPossibleValues(int maxPossibleValues)
Define a new threshold for number of possible values to memorize. It makes sense to call this method before any rows are added.

Parameters:
maxPossibleValues - The new number.
Throws:
IllegalArgumentException - If the value < 0

isOpen

public boolean isOpen()
Returns true if the container has been initialized with DataTableSpec and is ready to accept rows.

This implementation returns !isClosed();

Returns:
true if container is accepting rows.

isClosed

public boolean isClosed()
Returns true if table has been closed and getTable() will return a DataTable object.

Returns:
true if table is available, false otherwise.

close

public void close()
Closes the container and creates the table that can be accessed by getTable(). Subsequent calls to addRowToTable will fail with an exception.

Throws:
IllegalStateException - If container is not open.
DuplicateKeyException - If the final check for duplicate row keys fails.
DataContainerException - If the duplicate check fails because of an unknown I/O problem.

size

public int size()
Get the number of rows that have been added so far, i.e. how often addRowToTable has been called.

Returns:
The number of rows in the container.
Throws:
IllegalStateException - If container is not open.

getTable

public DataTable getTable()
Get a reference to the table. This method throws an exception unless the container is closed and therefore has a table available.

Returns:
Reference to the table that has been built up.
Throws:
IllegalStateException - If isClosed() returns false
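
The open/closed contract described for isOpen(), isClosed(), close() and getTable() can be modeled in a few lines of plain Java. This is only a sketch of the documented state rules: ContainerLifecycleSketch is a hypothetical class, rows are simplified to strings, and the use of IllegalStateException for adding a row after close() is an assumption (the documentation only says the call fails with an exception).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical plain-Java model of the open/closed lifecycle; not a KNIME class. */
final class ContainerLifecycleSketch {
    private final List<String> rows = new ArrayList<>();
    private List<String> table; // stays null until close() is called

    boolean isClosed() {
        return table != null;
    }

    /** Mirrors the documented rule: isOpen() returns !isClosed(). */
    boolean isOpen() {
        return !isClosed();
    }

    void addRow(String row) {
        if (isClosed()) {
            // Exception type is an assumption; the docs only promise a failure.
            throw new IllegalStateException("Container is closed");
        }
        rows.add(row);
    }

    void close() {
        if (!isOpen()) {
            throw new IllegalStateException("Container is not open");
        }
        table = Collections.unmodifiableList(new ArrayList<>(rows));
    }

    List<String> getTable() {
        if (!isClosed()) {
            throw new IllegalStateException("Container is still open");
        }
        return table;
    }
}
```

The point of the model is the ordering: rows are accepted only before close(), and the table is available only afterwards.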

getBufferedTable

protected final ContainerTable getBufferedTable()
Returns the table holding the data. This method is identical to the getTable() method but is more specific with respect to the return type. It is used in derived classes.

Returns:
The table underlying this container.
Throws:
IllegalStateException - If isClosed() returns false

getTableSpec

public DataTableSpec getTableSpec()
Get the currently set DataTableSpec.

Returns:
The current spec.

addRowToTable

public void addRowToTable(DataRow row)
Appends a row to the end of a container. The row must comply with the settings in the DataTableSpec that has been set when the container or table has been constructed.

Specified by:
addRowToTable in interface RowAppender
Parameters:
row - DataRow to be added

createInternalBufferID

protected int createInternalBufferID()
Get an internal ID for the buffer being used. This ID is used in conjunction with blob serialization to locate buffers. Blobs that belong to a Buffer (i.e. they have been created in a particular Buffer) will write this ID when serialized to a file. Subsequent Buffers that also need to serialize Blob cells (which, however, have already been written) can then refer to the respective Buffer object using this ID.

An ID of -1 indicates that the buffer is not intended to be used for sophisticated blob serialization. All blob cells that are added to it will be newly serialized as if they were created for the first time.

This implementation returns -1.

Returns:
-1 or a unique buffer ID.

addRowKeyForDuplicateCheck

protected void addRowKeyForDuplicateCheck(RowKey key)
Method being called when addRowToTable(DataRow) is called. This method will add the given row key to the internal row key hashing structure, which allows for duplicate checking.

This method may be overridden to disable duplicate checks. The overriding class must ensure that there are no duplicates being added whatsoever.

Parameters:
key - Key being added. This implementation extracts the string representation from it and adds it to an internal DuplicateChecker instance.
Throws:
DataContainerException - This implementation may throw a DataContainerException when DuplicateChecker.addKey(String) throws an IOException.
DuplicateKeyException - If a duplicate is encountered.
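
The idea of a row key hashing structure can be sketched with a plain HashSet of key strings. This is a simplified illustration, not the DuplicateChecker used by the class: DuplicateKeySketch and DuplicateKeyFound are hypothetical names, the check here is eager and purely in memory, whereas the documentation of close() mentions a final duplicate check and possible I/O problems, so the real checker may defer part of the work.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory duplicate check on row key strings. */
final class DuplicateKeySketch {

    /** Hypothetical stand-in for the exception type named in the documentation. */
    static final class DuplicateKeyFound extends RuntimeException {
        private static final long serialVersionUID = 1L;

        DuplicateKeyFound(String key) {
            super("Duplicate row key: " + key);
        }
    }

    private final Set<String> seenKeys = new HashSet<>();

    /** Registers the key and fails if the same string was registered before. */
    void addKey(String key) {
        // Set.add returns false when the element is already present.
        if (!seenKeys.add(key)) {
            throw new DuplicateKeyFound(key);
        }
    }

    int size() {
        return seenKeys.size();
    }
}
```

An override that disables the check would simply skip the call to such a structure, which is why the overriding class has to guarantee uniqueness itself.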

getGlobalTableRepository

protected Map<Integer,ContainerTable> getGlobalTableRepository()
Get the map of buffers that potentially have written blob objects. If m_buffer needs to serialize a blob, it will check whether any other buffer has already written the blob and, if so, reference that buffer rather than writing out the blob again.

If used along with the ExecutionContext, this method returns the global table repository (global = in the context of the current workflow).

This implementation does not support sophisticated blob serialization. It will return a new HashMap<Integer, Buffer>().

Returns:
The map bufferID to Buffer.
See Also:
getLocalTableRepository()

getLocalTableRepository

protected Map<Integer,ContainerTable> getLocalTableRepository()
Get the local repository. Overridden in BufferedDataContainer

Returns:
A local repository to which tables are added that have been created during the node's execution.

cache

public static DataTable cache(DataTable table,
                              ExecutionMonitor exec,
                              int maxCellsInMemory)
                       throws CanceledExecutionException
Convenience method that will buffer the entire argument table. This is useful if you have a wrapper table at hand and want to make sure that all calculations are done here.

Parameters:
table - The table to cache.
exec - The execution monitor to report progress to and to check for the cancel status.
maxCellsInMemory - The number of cells to be kept in memory before swapping to disk.
Returns:
A cache table containing the data from the argument.
Throws:
NullPointerException - If the argument is null.
CanceledExecutionException - If the process has been canceled.

cache

public static DataTable cache(DataTable table,
                              ExecutionMonitor exec)
                       throws CanceledExecutionException
Convenience method that will buffer the entire argument table. This is useful if you have a wrapper table at hand and want to make sure that all calculations are done here.

Parameters:
table - The table to cache.
exec - The execution monitor to report progress to and to check for the cancel status.
Returns:
A cache table containing the data from the argument.
Throws:
NullPointerException - If the argument is null.
CanceledExecutionException - If the process has been canceled.

writeToZip

public static void writeToZip(DataTable table,
                              File zipFile,
                              ExecutionMonitor exec)
                       throws IOException,
                              CanceledExecutionException
Writes a given DataTable permanently to a zip file. This also includes all table spec information, such as color, size, and shape properties.

Parameters:
table - The table to write.
zipFile - The file to write to. Will be created or overwritten.
exec - For progress info.
Throws:
IOException - If writing fails.
CanceledExecutionException - If canceled.
See Also:
readFromZip(File)

writeToStream

public static void writeToStream(DataTable table,
                                 OutputStream out,
                                 ExecutionMonitor exec)
                          throws IOException,
                                 CanceledExecutionException
Writes a given DataTable permanently to an output stream. This also includes all table spec information, such as color, size, and shape properties.

The content is saved by instantiating a ZipOutputStream on the argument stream, saving the necessary information in respective zip entries and eventually closing the entire stream. If the stream should not be closed, consider using a NonClosableOutputStream as the argument stream.

Parameters:
table - The table to write.
out - The stream to save to.
exec - For progress info.
Throws:
IOException - If writing fails.
CanceledExecutionException - If canceled.
See Also:
readFromStream(InputStream)
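
The closing behavior described above follows from how java.util.zip.ZipOutputStream works: closing it also closes the wrapped stream. The sketch below shows the general pattern with JDK classes only. KeepOpenOutputStream is a hypothetical wrapper written for this example (it is not the NonClosableOutputStream mentioned above), and the entry name and payload are placeholders, not the entries the method actually writes.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical wrapper that flushes instead of closing the wrapped stream. */
final class KeepOpenOutputStream extends FilterOutputStream {
    KeepOpenOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len); // forward whole blocks instead of single bytes
    }

    @Override
    public void close() throws IOException {
        flush(); // deliberately leave the wrapped stream open
    }
}

/** Writes a single zip entry and then closes the zip stream. */
final class ZipToStreamSketch {
    private ZipToStreamSketch() {
    }

    static void writeEntry(OutputStream target, String entryName, byte[] payload) {
        // try-with-resources closes the ZipOutputStream, which closes target too.
        try (ZipOutputStream zip = new ZipOutputStream(target)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(payload);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Passing new KeepOpenOutputStream(stream) instead of stream leaves the caller's stream usable after the zip content has been written, which is the situation the advice above addresses.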

readFromZip

public static ContainerTable readFromZip(File zipFile)
                                  throws IOException
Reads a table from a zip file that has been written using the writeToZip(DataTable, File, ExecutionMonitor) method.

Parameters:
zipFile - To read from.
Returns:
The table contained in the zip file.
Throws:
IOException - If that fails.
See Also:
writeToZip(DataTable, File, ExecutionMonitor)

readFromStream

public static ContainerTable readFromStream(InputStream in)
                                     throws IOException
Reads a table from an input stream. This is the reverse operation of writeToStream(DataTable, OutputStream, ExecutionMonitor).

The argument stream will be closed. If this is not desired, consider using a NonClosableInputStream as the argument.

Parameters:
in - The stream to read from. It will be closed when reading is finished.
Returns:
The table contained in the stream.
Throws:
IOException - If that fails.
See Also:
writeToStream(DataTable, OutputStream, ExecutionMonitor)

readFromZip

static ContainerTable readFromZip(ReferencedFile zipFileRef,
                                  DataContainer.BufferCreator creator)
                           throws IOException
Factory method used to restore table from zip file.

Parameters:
zipFileRef - To read from.
creator - Factory object to create a buffer instance.
Returns:
The table contained in the zip file.
Throws:
IOException - If that fails.
See Also:
readFromZip(File)

readFromZipDelayed

protected static ContainerTable readFromZipDelayed(ReferencedFile zipFile,
                                                   DataTableSpec spec,
                                                   int bufferID,
                                                   Map<Integer,ContainerTable> bufferRep)
Used in BufferedDataContainer to read the tables from the workspace location.

Parameters:
zipFile - To read from (is going to be copied to temp on access)
spec - The DTS for the table.
bufferID - The buffer's id used for blob (de)serialization
bufferRep - Repository of buffers for blob (de)serialization.
Returns:
Table contained in zipFile.

readFromZipDelayed

static ContainerTable readFromZipDelayed(CopyOnAccessTask c,
                                         DataTableSpec spec)
Used in BufferedDataContainer to read the tables from the workspace location.

Parameters:
c - The factory that creates the Buffer instance that the returned table reads from.
spec - The DTS for the table.
Returns:
Table contained in zipFile.

createTempFile

public static final File createTempFile()
                                 throws IOException
Creates a temp file called "knime_container_date_xxxx.zip" and marks it for deletion upon exit. This method is used to init the file when the data container flushes to disk. It is also used when the nodes are read back in to copy the data to the tmp-directory.

Returns:
A temp file to use. The file is empty.
Throws:
IOException - If that fails for any reason.

isContainerTable

public static final boolean isContainerTable(DataTable table)
Returns true if the given argument table has been created by the DataContainer, false otherwise.

Parameters:
table - The table to check.
Returns:
If the given table was created by a DataContainer.
Throws:
NullPointerException - If the argument is null.


Copyright, 2003 - 2010. All rights reserved.
University of Konstanz, Germany.
Chair for Bioinformatics and Information Mining, Prof. Dr. Michael R. Berthold.
You may not modify, publish, transmit, transfer or sell, reproduce, create derivative works from, distribute, perform, display, or in any way exploit any of the content, in whole or in part, except as otherwise expressly permitted in writing by the copyright owner or as specified in the license file distributed with this product.