org.knime.base.node.mine.decisiontree2.learner
Class SplitQualityGini

java.lang.Object
  extended by org.knime.base.node.mine.decisiontree2.learner.SplitQualityMeasure
      extended by org.knime.base.node.mine.decisiontree2.learner.SplitQualityGini
All Implemented Interfaces:
Cloneable

public class SplitQualityGini
extends SplitQualityMeasure

Implements the gini index split quality measure. The computed gini index is subtracted from 1 (the worst value), so a larger value indicates a better split, as is the case for the gain ratio.

Author:
Christoph Sieb, University of Konstanz

Constructor Summary
SplitQualityGini()
           
 
Method Summary
 double getWorstValue()
          Returns the worst value for this quality measure.
 void initQualityMeasure(double[] classFrequencies, double allOverRecords)
          Some quality measures, like the information gain, calculate a quality of a previous distribution compared to a new one.
 boolean isBetter(double quality1, double quality2)
          A gini index is better if it is larger than the other one.
 boolean isBetterOrEqual(double quality1, double quality2)
          A gini index is better than or equal to another one if it is at least as large.
 double measureQuality(double allOverRecords, double[] partitionFrequency, double[][] partitionClassFrequency, double numUnknownRecords)
          Calculates the gini split index.
 double postProcessMeasure(double qualityMeasure, double allOverRecords, double[] partitionFrequency, double numUnknownRecords)
          The gini index does not need to post-process the measure.
 String toString()
          
 
Methods inherited from class org.knime.base.node.mine.decisiontree2.learner.SplitQualityMeasure
clone
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

SplitQualityGini

public SplitQualityGini()
Method Detail

isBetter

public boolean isBetter(double quality1,
                        double quality2)
A gini index is better if it is larger than the other one. Determines if the first passed quality is better compared to the second quality.

Specified by:
isBetter in class SplitQualityMeasure
Parameters:
quality1 - first quality to compare
quality2 - second quality to compare
Returns:
true, iff the first quality is better than the second quality

isBetterOrEqual

public boolean isBetterOrEqual(double quality1,
                               double quality2)
A gini index is better than or equal to another one if it is at least as large. Determines if the first passed quality is better than or equal to the second quality.

Specified by:
isBetterOrEqual in class SplitQualityMeasure
Parameters:
quality1 - first quality to compare
quality2 - second quality to compare
Returns:
true, iff the first quality is better or equal to the second quality

measureQuality

public double measureQuality(double allOverRecords,
                             double[] partitionFrequency,
                             double[][] partitionClassFrequency,
                             double numUnknownRecords)
Calculates the gini split index.

For a dataset T the gini index is: gini(T) = 1 - SUM(pj * pj), summed over all relative class frequencies pj (pj = Pj/|T|). Pj is the absolute frequency of class j and |T| is the number of records in the dataset.

The gini for the split is: giniSplit(T) = SUM(nx/N * gini(Tx)), summed over all partitions Tx, where nx/N is the relative frequency of partition Tx.

Specified by:
measureQuality in class SplitQualityMeasure
Parameters:
allOverRecords - the overall number of records with known values in the partition to split; corresponds to N in the formula
partitionFrequency - the frequencies of the different partitions; corresponds to nx in the formula
partitionClassFrequency - all class frequencies Pj (second dimension) for all partitions Tx (first dimension)
numUnknownRecords - the number of records with unknown (missing) value of the relevant attribute; used to weight the quality measure
Returns:
the gini split index
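
The two formulas above can be sketched in plain Java as follows. This is an illustration of the arithmetic only, not the KNIME implementation: the class name GiniSplitSketch and its methods are hypothetical, the treatment of an empty partition is an assumption, and the weighting by numUnknownRecords is deliberately left out because this page does not specify how it is applied.

```java
/** Illustrative sketch only; names are hypothetical and not part of KNIME. */
final class GiniSplitSketch {

    private GiniSplitSketch() {
    }

    /** gini(T) = 1 - SUM(pj * pj) for the absolute class counts Pj of one partition. */
    static double gini(final double[] classCounts) {
        double total = 0.0;
        for (double count : classCounts) {
            total += count;
        }
        if (total == 0.0) {
            // Assumption: an empty partition contributes no impurity.
            return 0.0;
        }
        double sumOfSquares = 0.0;
        for (double count : classCounts) {
            double p = count / total; // relative class frequency pj
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    /** giniSplit(T) = SUM(nx / N * gini(Tx)) over all partitions Tx. */
    static double giniSplit(final double n, final double[] partitionFrequency,
            final double[][] partitionClassFrequency) {
        double split = 0.0;
        for (int x = 0; x < partitionFrequency.length; x++) {
            split += partitionFrequency[x] / n * gini(partitionClassFrequency[x]);
        }
        return split;
    }

    /** Subtracts the split index from 1 so that a larger value means a better split. */
    static double quality(final double n, final double[] partitionFrequency,
            final double[][] partitionClassFrequency) {
        return 1.0 - giniSplit(n, partitionFrequency, partitionClassFrequency);
    }
}
```

For example, with N = 10 and two partitions holding 4 and 6 records with class counts {4, 0} and {3, 3}, the first partition is pure (gini 0) and the second is evenly mixed (gini 0.5), so the split index is about 0.3 and the resulting quality about 0.7.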

getWorstValue

public double getWorstValue()
Returns the worst value for this quality measure.

Specified by:
getWorstValue in class SplitQualityMeasure
Returns:
the worst value for this quality measure
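
A worst value is typically useful as the starting point when searching for the best of several candidate qualities. The sketch below assumes that usage; it is not taken from the KNIME sources. The class BestSplitSketch and the method indexOfBest are hypothetical, and the worst value is passed in as a parameter because this page does not state the number that getWorstValue() returns.

```java
/** Illustrative sketch only; names are hypothetical and not part of KNIME. */
final class BestSplitSketch {

    private BestSplitSketch() {
    }

    /** Larger is better, as documented for the gini-based quality. */
    static boolean isBetter(final double quality1, final double quality2) {
        return quality1 > quality2;
    }

    /**
     * Returns the index of the best quality, or -1 if no candidate
     * is strictly better than the supplied worst value.
     */
    static int indexOfBest(final double[] qualities, final double worstValue) {
        int bestIndex = -1;
        double bestQuality = worstValue; // seed the search with the worst value
        for (int i = 0; i < qualities.length; i++) {
            if (isBetter(qualities[i], bestQuality)) {
                bestQuality = qualities[i];
                bestIndex = i;
            }
        }
        return bestIndex;
    }
}
```

Because the comparison is strict, ties keep the earliest candidate, and a candidate equal to the worst value is never selected.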

initQualityMeasure

public void initQualityMeasure(double[] classFrequencies,
                               double allOverRecords)
Some quality measures, like the information gain, calculate a quality of a previous distribution compared to a new one. This previous distribution can be reused. For those cases an init method is provided that enables precalculations to increase performance.

Specified by:
initQualityMeasure in class SplitQualityMeasure
Parameters:
classFrequencies - the class frequencies
allOverRecords - the overall count

toString

public String toString()

Specified by:
toString in class SplitQualityMeasure

postProcessMeasure

public double postProcessMeasure(double qualityMeasure,
                                 double allOverRecords,
                                 double[] partitionFrequency,
                                 double numUnknownRecords)
The gini index does not need to post-process the measure. Some quality measures need normalization when compared across different attributes. Because this normalization is not required when qualities are compared within a single attribute, it is deferred to this method, which performs the post-processing (normalization) separately and so avoids many unnecessary calculations.

Specified by:
postProcessMeasure in class SplitQualityMeasure
Parameters:
qualityMeasure - the quality measure to post process
allOverRecords - the overall number of known (non-missing) records
partitionFrequency - the frequencies of the potential split partitions
numUnknownRecords - the number of unknown (missing) records
Returns:
the post processed quality measure


Copyright, 2003 - 2010. All rights reserved.
University of Konstanz, Germany.
Chair for Bioinformatics and Information Mining, Prof. Dr. Michael R. Berthold.
You may not modify, publish, transmit, transfer or sell, reproduce, create derivative works from, distribute, perform, display, or in any way exploit any of the content, in whole or in part, except as otherwise expressly permitted in writing by the copyright owner or as specified in the license file distributed with this product.