org.knime.core.util
Class DuplicateChecker

java.lang.Object
  extended by org.knime.core.util.DuplicateChecker

public class DuplicateChecker
extends Object

This class checks for duplicates in an (almost) arbitrary number of strings, for example to verify that row keys are unique. Checking happens in two stages. First, each new key is added to an in-memory set; if the set already contains the key, an exception is thrown. Once the set grows beyond the maximum chunk size, it is written to disk as a chunk and the set is cleared. Second, after all keys have been added, calling checkForDuplicates() processes all chunks written so far with a merge-sort-like algorithm. If a duplicate key is detected during this merge, an exception is thrown.

Note: This implementation is not thread-safe; it is intended to be used by a single thread only.
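
The two stages described above can be sketched in a self-contained form. This is a minimal illustration under stated assumptions, not the KNIME implementation: the class name ChunkedDuplicateSketch, the temporary file format, and the use of IllegalStateException in place of DuplicateKeyException are all hypothetical, and the sketch opens every chunk file at once instead of honoring a stream limit.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

/** Hypothetical stand-in that illustrates the two-stage idea; not the KNIME class. */
class ChunkedDuplicateSketch {
    /** One pending key together with the chunk stream it was read from. */
    private static final class Head {
        final String key;
        final DataInputStream in;
        Head(String key, DataInputStream in) { this.key = key; this.in = in; }
    }

    private final int maxChunkSize;
    private final TreeSet<String> chunk = new TreeSet<>();   // sorted keys of the current chunk
    private final List<Path> chunkFiles = new ArrayList<>(); // chunks already spilled to disk

    ChunkedDuplicateSketch(int maxChunkSize) { this.maxChunkSize = maxChunkSize; }

    /** Stage 1: reject duplicates within the current chunk, spill the chunk when it is full. */
    void addKey(String s) throws IOException {
        if (!chunk.add(s)) {
            throw new IllegalStateException("Duplicate key: " + s);
        }
        if (chunk.size() >= maxChunkSize) {
            writeChunk();
        }
    }

    /** Stage 2: merge the sorted chunks; two equal keys in a row are a duplicate. */
    void checkForDuplicates() throws IOException {
        if (chunkFiles.isEmpty()) {
            return; // nothing was spilled, so stage 1 already checked every key
        }
        writeChunk(); // flush the keys that are still in memory
        PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> a.key.compareTo(b.key));
        List<DataInputStream> open = new ArrayList<>();
        try {
            for (Path p : chunkFiles) {
                DataInputStream in =
                    new DataInputStream(new BufferedInputStream(Files.newInputStream(p)));
                open.add(in);
                pushNext(heap, in);
            }
            String previous = null;
            while (!heap.isEmpty()) {
                Head h = heap.poll();
                if (h.key.equals(previous)) {
                    throw new IllegalStateException("Duplicate key: " + h.key);
                }
                previous = h.key;
                pushNext(heap, h.in);
            }
        } finally {
            for (DataInputStream in : open) {
                in.close();
            }
        }
    }

    /** Removes all temporary files and all keys in memory. */
    void clear() throws IOException {
        for (Path p : chunkFiles) {
            Files.deleteIfExists(p);
        }
        chunkFiles.clear();
        chunk.clear();
    }

    /** Reads the next key of a chunk into the heap; an exhausted chunk adds nothing. */
    private static void pushNext(PriorityQueue<Head> heap, DataInputStream in) throws IOException {
        try {
            heap.add(new Head(in.readUTF(), in));
        } catch (EOFException e) {
            // end of this chunk
        }
    }

    /** Writes the current chunk in sorted order to a temporary file and empties the set. */
    private void writeChunk() throws IOException {
        if (chunk.isEmpty()) {
            return;
        }
        Path file = Files.createTempFile("dupcheck", ".bin");
        try (DataOutputStream out =
                 new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (String key : chunk) {
                out.writeUTF(key);
            }
        }
        chunkFiles.add(file);
        chunk.clear();
    }
}
```

Because each chunk is written in sorted order, a duplicate that spans two chunks shows up during the merge as two equal keys in direct succession, so only the previously emitted key has to be remembered. The sketch stores keys with writeUTF, which limits a single encoded key to 65535 bytes; a real implementation may use a different format.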

Author:
Thorsten Meinl, University of Konstanz

Field Summary
static int MAX_CHUNK_SIZE
          The default chunk size.
static int MAX_STREAMS
          The default number of streams open during merging.
 
Constructor Summary
DuplicateChecker()
          Creates a new duplicate checker with default parameters.
DuplicateChecker(int maxChunkSize, int maxStreams)
          Creates a new duplicate checker.
 
Method Summary
 void addKey(String s)
          Adds a new key to the duplicate checker.
 void checkForDuplicates()
          Checks for duplicates in all added keys.
 void clear()
          Clears the checker, i.e. removes all temporary files and all keys in memory.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAX_CHUNK_SIZE

public static final int MAX_CHUNK_SIZE
The default chunk size.

See Also:
Constant Field Values

MAX_STREAMS

public static final int MAX_STREAMS
The default number of streams open during merging.

See Also:
Constant Field Values

Constructor Detail

DuplicateChecker

public DuplicateChecker()
Creates a new duplicate checker with default parameters.


DuplicateChecker

public DuplicateChecker(int maxChunkSize,
                        int maxStreams)
Creates a new duplicate checker.

Parameters:
maxChunkSize - the size of each chunk, i.e. the maximum number of elements kept in memory
maxStreams - the maximum number of streams that are kept open during the merge process
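
To get a feel for how the two parameters interact, the following sketch estimates how many merge passes would be needed if a merge may read at most maxStreams chunk files at a time. That the class merges in several passes when there are more chunks than streams is an assumption made here for illustration; the documentation above does not describe this, and the helper mergePasses is hypothetical.

```java
/** Hypothetical helper; the multi-pass behavior is an assumption, not documented. */
class MergePassEstimate {
    /**
     * Estimates the number of passes needed to reduce the given number of sorted
     * chunks to a single sorted run when each merge reads at most maxStreams chunks.
     */
    static int mergePasses(long chunks, int maxStreams) {
        if (maxStreams < 2) {
            throw new IllegalArgumentException("maxStreams must be at least 2");
        }
        int passes = 0;
        while (chunks > 1) {
            // every group of up to maxStreams chunks is merged into one new chunk
            chunks = (chunks + maxStreams - 1) / maxStreams;
            passes++;
        }
        return passes;
    }
}
```

Under this assumption, a larger maxChunkSize produces fewer chunks at the cost of more memory, and a larger maxStreams reduces the number of passes at the cost of more open files.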

Method Detail

addKey

public void addKey(String s)
            throws DuplicateKeyException,
                   IOException
Adds a new key to the duplicate checker.

Parameters:
s - the key
Throws:
DuplicateKeyException - if a duplicate within the current chunk has been detected
IOException - if an I/O error occurs while writing the chunk to disk

checkForDuplicates

public void checkForDuplicates()
                        throws DuplicateKeyException,
                               IOException
Checks for duplicates in all added keys.

Throws:
DuplicateKeyException - if a duplicate key has been detected
IOException - if an I/O error occurs

clear

public void clear()
Clears the checker, i.e. removes all temporary files and all keys in memory.
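
The three methods are meant to be called in sequence: add every key, run the final check, then clear. The sketch below shows one way to arrange these calls so that clear() always runs. To keep it runnable without KNIME on the classpath, it uses a hypothetical KeyChecker interface that mirrors the documented signatures, a local stand-in for DuplicateKeyException, and a trivial in-memory implementation; none of these names come from the library.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

class CheckerUsageSketch {
    /** Local stand-in for the documented exception type (assumption). */
    static class DuplicateKeyException extends Exception {
        DuplicateKeyException(String key) { super("Duplicate key: " + key); }
    }

    /** Hypothetical interface mirroring the three documented methods. */
    interface KeyChecker {
        void addKey(String s) throws DuplicateKeyException, IOException;
        void checkForDuplicates() throws DuplicateKeyException, IOException;
        void clear();
    }

    /** Trivial in-memory stand-in so that the sketch runs on its own. */
    static class InMemoryChecker implements KeyChecker {
        private final Set<String> keys = new HashSet<>();

        @Override
        public void addKey(String s) throws DuplicateKeyException {
            if (!keys.add(s)) {
                throw new DuplicateKeyException(s);
            }
        }

        @Override
        public void checkForDuplicates() {
            // nothing was written to disk, so addKey has already checked every key
        }

        @Override
        public void clear() {
            keys.clear();
        }
    }

    /** Returns true if all keys are unique; the checker is cleared in every case. */
    static boolean allUnique(KeyChecker checker, Iterable<String> keys) throws IOException {
        try {
            for (String key : keys) {
                checker.addKey(key);          // may already fail within the current chunk
            }
            checker.checkForDuplicates();     // catches duplicates that span chunks
            return true;
        } catch (DuplicateKeyException e) {
            return false;
        } finally {
            checker.clear();                  // release temporary files and memory
        }
    }
}
```

Placing clear() in a finally block means temporary files are removed whether the check succeeds, finds a duplicate, or fails with an I/O error.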



Copyright, 2003 - 2010. All rights reserved.
University of Konstanz, Germany.
Chair for Bioinformatics and Information Mining, Prof. Dr. Michael R. Berthold.
You may not modify, publish, transmit, transfer or sell, reproduce, create derivative works from, distribute, perform, display, or in any way exploit any of the content, in whole or in part, except as otherwise expressly permitted in writing by the copyright owner or as specified in the license file distributed with this product.