GClasses::GDecisionTree Class Reference

This is an efficient learning algorithm. It divides on the attributes that reduce entropy the most, or alternatively can make random divisions. More...

#include <GDecisionTree.h>

Inheritance diagram for GClasses::GDecisionTree:
GClasses::GDecisionTree inherits GClasses::GSupervisedLearner, which inherits GClasses::GTransducer.

Public Types

enum  DivisionAlgorithm { MINIMIZE_ENTROPY, RANDOM }
 

Public Member Functions

 GDecisionTree ()
 General-purpose constructor. See also the comment for GSupervisedLearner::GSupervisedLearner. More...
 
 GDecisionTree (GDomNode *pNode, GLearnerLoader &ll)
 Loads from a DOM. More...
 
virtual ~GDecisionTree ()
 
void autoTune (GMatrix &features, GMatrix &labels)
 Uses cross-validation to find a set of parameters that works well with the provided data. More...
 
virtual void clear ()
 Frees the model. More...
 
bool isBinary ()
 Returns true iff useBinaryDivisions was called. More...
 
size_t leafThresh ()
 Returns the leaf threshold. More...
 
virtual void predict (const double *pIn, double *pOut)
 See the comment for GSupervisedLearner::predict. More...
 
virtual void predictDistribution (const double *pIn, GPrediction *pOut)
 See the comment for GSupervisedLearner::predictDistribution. More...
 
void print (std::ostream &stream, GArffRelation *pFeatureRel=NULL, GArffRelation *pLabelRel=NULL)
 Prints an ASCII representation of the decision tree to the specified stream. pFeatureRel and pLabelRel are optional relations that can be supplied in order to provide better meta-data and make the print-out richer. More...
 
virtual GDomNode * serialize (GDom *pDoc) const
 Marshal this object into a DOM, which can then be converted to a variety of serial formats. More...
 
void setLeafThresh (size_t n)
 Sets the leaf threshold. When the number of samples is <= this value, it will no longer try to divide the data, but will create a leaf node. The default value is 1. For noisy data, a larger value may be advantageous. More...
 
void setMaxLevels (size_t n)
 Sets the max levels. When a path from the root to the current node contains n nodes (including the root), it will no longer try to divide the data, but will create a leaf node. If set to 0, then there is no maximum. 0 is the default. More...
 
size_t treeSize ()
 Returns the number of nodes in this tree. More...
 
void useBinaryDivisions ()
 Specify to only use binary divisions. More...
 
void useRandomDivisions (size_t randomDraws=1)
 Specifies for this decision tree to use random divisions (instead of divisions that reduce entropy). Random divisions make the algorithm train somewhat faster, and also increase model variance, so it is better suited for ensembles, but random divisions also make the decision tree vulnerable to problems with irrelevant attributes. More...
 
- Public Member Functions inherited from GClasses::GSupervisedLearner
 GSupervisedLearner ()
 General-purpose constructor. More...
 
 GSupervisedLearner (GDomNode *pNode, GLearnerLoader &ll)
 Deserialization constructor. More...
 
virtual ~GSupervisedLearner ()
 Destructor. More...
 
void basicTest (double minAccuracy1, double minAccuracy2, double deviation=1e-6, bool printAccuracy=false, double warnRange=0.035)
 This is a helper method used by the unit tests of several model learners. More...
 
virtual bool canGeneralize ()
 Returns true because fully supervised learners have an internal model that allows them to generalize previously unseen rows. More...
 
void confusion (GMatrix &features, GMatrix &labels, std::vector< GMatrix * > &stats)
 Generates a confusion matrix containing the total counts of the number of times each value was expected and predicted. (Rows represent target values, and columns represent predicted values.) stats should be an empty vector. This method will resize stats to the number of dimensions in the label vector. The caller is responsible for deleting all of the matrices that it puts in this vector. For continuous labels, the value will be NULL. More...
 
virtual bool isFilter ()
 Returns false. More...
 
void precisionRecall (double *pOutPrecision, size_t nPrecisionSize, GMatrix &features, GMatrix &labels, size_t label, size_t nReps)
 label specifies which output to measure. (It should be 0 if there is only one label dimension.) The measurement will be performed "nReps" times and the results averaged together. nPrecisionSize specifies the number of points at which the function is sampled. pOutPrecision should be an array big enough to hold nPrecisionSize elements for every possible label value. (If the attribute is continuous, it should just be big enough to hold nPrecisionSize elements.) If bLocal is true, it computes the local precision instead of the global precision. More...
 
const GRelation & relFeatures ()
 Returns a reference to the feature relation (meta-data about the input attributes). More...
 
const GRelation & relLabels ()
 Returns a reference to the label relation (meta-data about the output attributes). More...
 
double sumSquaredError (const GMatrix &features, const GMatrix &labels)
 Computes the sum-squared-error for predicting the labels from the features. For categorical labels, Hamming distance is used. More...
 
void train (const GMatrix &features, const GMatrix &labels)
 Call this method to train the model. More...
 
virtual double trainAndTest (const GMatrix &trainFeatures, const GMatrix &trainLabels, const GMatrix &testFeatures, const GMatrix &testLabels)
 Trains and tests this learner. Returns sum-squared-error. More...
 
- Public Member Functions inherited from GClasses::GTransducer
 GTransducer ()
 General-purpose constructor. More...
 
 GTransducer (const GTransducer &that)
 Copy-constructor. Throws an exception to prevent models from being copied by value. More...
 
virtual ~GTransducer ()
 
virtual bool canImplicitlyHandleContinuousFeatures ()
 Returns true iff this algorithm can implicitly handle continuous features. If it cannot, then the GDiscretize transform will be used to convert continuous features to nominal values before passing them to it. More...
 
virtual bool canImplicitlyHandleContinuousLabels ()
 Returns true iff this algorithm can implicitly handle continuous labels (a.k.a. regression). If it cannot, then the GDiscretize transform will be used during training to convert continuous labels to nominal values, and to convert nominal predictions back to continuous labels. More...
 
virtual bool canImplicitlyHandleMissingFeatures ()
 Returns true iff this algorithm supports missing feature values. If it cannot, then an imputation filter will be used to predict missing values before any feature-vectors are passed to the algorithm. More...
 
virtual bool canImplicitlyHandleNominalFeatures ()
 Returns true iff this algorithm can implicitly handle nominal features. If it cannot, then the GNominalToCat transform will be used to convert nominal features to continuous values before passing them to it. More...
 
virtual bool canImplicitlyHandleNominalLabels ()
 Returns true iff this algorithm can implicitly handle nominal labels (a.k.a. classification). If it cannot, then the GNominalToCat transform will be used during training to convert nominal labels to continuous values, and to convert continuous predictions back to nominal labels. More...
 
virtual bool canTrainIncrementally ()
 Returns false because semi-supervised learners cannot be trained incrementally. More...
 
double crossValidate (const GMatrix &features, const GMatrix &labels, size_t nFolds, RepValidateCallback pCB=NULL, size_t nRep=0, void *pThis=NULL)
 Performs n-fold cross-validation on the provided features and labels. Returns sum-squared error. Uses trainAndTest for each fold. pCB is an optional callback method for reporting intermediate stats. It can be NULL if you don't want intermediate reporting. nRep is just the rep number that will be passed to the callback. pThis is just a pointer that will be passed to the callback for you to use however you want. It doesn't affect this method. More...
 
GTransducer & operator= (const GTransducer &other)
 Throws an exception to prevent models from being copied by value. More...
 
GRand & rand ()
 Returns a reference to the random number generator associated with this object. For example, you could use it to change the random seed, to make this algorithm behave differently. This might be important, for example, in an ensemble of learners. More...
 
double repValidate (const GMatrix &features, const GMatrix &labels, size_t reps, size_t nFolds, RepValidateCallback pCB=NULL, void *pThis=NULL)
 Performs cross-validation "reps" times and returns the average score. pCB is an optional callback method for reporting intermediate stats. It can be NULL if you don't want intermediate reporting. pThis is just a pointer that will be passed to the callback for you to use however you want. It doesn't affect this method. More...
 
virtual bool supportedFeatureRange (double *pOutMin, double *pOutMax)
 Returns true if this algorithm supports any feature value, or if it does not implicitly handle continuous features. If a limited range of continuous values is supported, returns false and sets pOutMin and pOutMax to specify the range. More...
 
virtual bool supportedLabelRange (double *pOutMin, double *pOutMax)
 Returns true if this algorithm supports any label value, or if it does not implicitly handle continuous labels. If a limited range of continuous values is supported, returns false and sets pOutMin and pOutMax to specify the range. More...
 
GMatrix * transduce (const GMatrix &features1, const GMatrix &labels1, const GMatrix &features2)
 Predicts a set of labels to correspond with features2, such that these labels will be consistent with the patterns exhibited by features1 and labels1. More...
 
void transductiveConfusionMatrix (const GMatrix &trainFeatures, const GMatrix &trainLabels, const GMatrix &testFeatures, const GMatrix &testLabels, std::vector< GMatrix * > &stats)
 Makes a confusion matrix for a transduction algorithm. More...
 

Static Public Member Functions

static void test ()
 Performs unit tests for this class. Throws an exception if there is a failure. More...
 
- Static Public Member Functions inherited from GClasses::GSupervisedLearner
static void test ()
 Runs some unit tests related to supervised learning. Throws an exception if any problems are found. More...
 

Protected Member Functions

GDecisionTreeNode * buildBranch (GMatrix &features, GMatrix &labels, std::vector< size_t > &attrPool, size_t nDepth, size_t tolerance)
 A recursive helper method used to construct the decision tree. More...
 
GDecisionTreeLeafNode * findLeaf (const double *pIn, size_t *pDepth)
 Finds the leaf node that corresponds with the specified feature vector. More...
 
double measureInfoGain (GMatrix *pData, size_t nAttribute, double *pPivot)
 InfoGain is defined as the difference in entropy in the data before and after dividing it based on the specified attribute. For continuous attributes it uses the difference between the original variance and the sum of the variances of the two parts after dividing at the point that maximizes this value. More...
 
size_t pickDivision (GMatrix &features, GMatrix &labels, double *pPivot, std::vector< size_t > &attrPool, size_t nDepth)
 
virtual void trainInner (const GMatrix &features, const GMatrix &labels)
 See the comment for GSupervisedLearner::trainInner. More...
 
- Protected Member Functions inherited from GClasses::GSupervisedLearner
GDomNode * baseDomNode (GDom *pDoc, const char *szClassName) const
 Child classes should use this in their implementation of serialize. More...
 
size_t precisionRecallContinuous (GPrediction *pOutput, double *pFunc, GMatrix &trainFeatures, GMatrix &trainLabels, GMatrix &testFeatures, GMatrix &testLabels, size_t label)
 This is a helper method used by precisionRecall. More...
 
size_t precisionRecallNominal (GPrediction *pOutput, double *pFunc, GMatrix &trainFeatures, GMatrix &trainLabels, GMatrix &testFeatures, GMatrix &testLabels, size_t label, int value)
 This is a helper method used by precisionRecall. More...
 
void setupFilters (const GMatrix &features, const GMatrix &labels)
 This method determines which data filters (normalize, discretize, and/or nominal-to-cat) are needed and trains them. More...
 
virtual GMatrix * transduceInner (const GMatrix &features1, const GMatrix &labels1, const GMatrix &features2)
 See GTransducer::transduce. More...
 

Protected Attributes

bool m_binaryDivisions
 
DivisionAlgorithm m_eAlg
 
size_t m_leafThresh
 
size_t m_maxLevels
 
GDecisionTreeNode * m_pRoot
 
size_t m_randomDraws
 
- Protected Attributes inherited from GClasses::GSupervisedLearner
GRelation * m_pRelFeatures
 
GRelation * m_pRelLabels
 
- Protected Attributes inherited from GClasses::GTransducer
GRand m_rand
 

Detailed Description

This is an efficient learning algorithm. It divides on the attributes that reduce entropy the most, or alternatively can make random divisions.
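
For orientation, here is a minimal usage sketch. It assumes the caller has already loaded the training data into two GMatrix objects (one row per sample, features and labels in separate matrices), and that a GMatrix header is available alongside GDecisionTree.h; only members documented on this page (setLeafThresh, train, predict) are used.

    #include <GDecisionTree.h>   // documented above
    #include <GMatrix.h>         // assumed header for GMatrix

    using namespace GClasses;

    // Trains a tree and predicts the label vector for one feature vector.
    // pTestRow must hold one value per feature column, and pOut must have
    // room for one value per label column.
    void trainAndPredict(GMatrix& features, GMatrix& labels,
                         const double* pTestRow, double* pOut)
    {
        GDecisionTree model;
        model.setLeafThresh(2);        // optional: stop splitting small sample sets
        model.train(features, labels); // inherited from GSupervisedLearner
        model.predict(pTestRow, pOut); // fills pOut with the predicted labels
    }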

Member Enumeration Documentation

enum GClasses::GDecisionTree::DivisionAlgorithm

Enumerator
MINIMIZE_ENTROPY 
RANDOM 

Constructor & Destructor Documentation

GClasses::GDecisionTree::GDecisionTree ( )

General-purpose constructor. See also the comment for GSupervisedLearner::GSupervisedLearner.

GClasses::GDecisionTree::GDecisionTree ( GDomNode *  pNode,
GLearnerLoader &  ll 
)

Loads from a DOM.

virtual GClasses::GDecisionTree::~GDecisionTree ( )
virtual

Member Function Documentation

void GClasses::GDecisionTree::autoTune ( GMatrix &  features,
GMatrix &  labels 
)

Uses cross-validation to find a set of parameters that works well with the provided data.
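
For example, one might let the tree tune itself before training. A brief sketch, assuming features and labels are already-loaded GMatrix objects:

    GDecisionTree model;
    model.autoTune(features, labels);  // cross-validates to pick good parameters
    model.train(features, labels);     // then train with the chosen parameters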

GDecisionTreeNode* GClasses::GDecisionTree::buildBranch ( GMatrix &  features,
GMatrix &  labels,
std::vector< size_t > &  attrPool,
size_t  nDepth,
size_t  tolerance 
)
protected

A recursive helper method used to construct the decision tree.

virtual void GClasses::GDecisionTree::clear ( )
virtual

Frees the model.

Implements GClasses::GSupervisedLearner.

GDecisionTreeLeafNode* GClasses::GDecisionTree::findLeaf ( const double *  pIn,
size_t *  pDepth 
)
protected

Finds the leaf node that corresponds with the specified feature vector.

bool GClasses::GDecisionTree::isBinary ( )
inline

Returns true iff useBinaryDivisions was called.

size_t GClasses::GDecisionTree::leafThresh ( )
inline

Returns the leaf threshold.

double GClasses::GDecisionTree::measureInfoGain ( GMatrix *  pData,
size_t  nAttribute,
double *  pPivot 
)
protected

InfoGain is defined as the difference in entropy in the data before and after dividing it based on the specified attribute. For continuous attributes it uses the difference between the original variance and the sum of the variances of the two parts after dividing at the point that maximizes this value.
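
Written as formulas (a restatement of the comment above, not taken from the header; the size-weighted form of the post-split entropy is the standard convention and is assumed here): for a nominal attribute a that splits the data D into subsets D_v,

    \mathrm{InfoGain}(D, a) \;=\; H(D) \;-\; \sum_{v} \frac{|D_v|}{|D|}\, H(D_v)

For a continuous attribute, a candidate pivot t is instead scored by

    \mathrm{Var}(D) \;-\; \bigl(\mathrm{Var}(D_{\le t}) + \mathrm{Var}(D_{> t})\bigr)

maximized over t (the chosen pivot is presumably written to pPivot).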

size_t GClasses::GDecisionTree::pickDivision ( GMatrix &  features,
GMatrix &  labels,
double *  pPivot,
std::vector< size_t > &  attrPool,
size_t  nDepth 
)
protected

virtual void GClasses::GDecisionTree::predict ( const double *  pIn,
double *  pOut 
)
virtual

See the comment for GSupervisedLearner::predict.

virtual void GClasses::GDecisionTree::predictDistribution ( const double *  pIn,
GPrediction *  pOut 
)
virtual

See the comment for GSupervisedLearner::predictDistribution.
void GClasses::GDecisionTree::print ( std::ostream &  stream,
GArffRelation *  pFeatureRel = NULL,
GArffRelation *  pLabelRel = NULL 
)

Prints an ASCII representation of the decision tree to the specified stream. pFeatureRel and pLabelRel are optional relations that can be supplied in order to provide better meta-data and make the print-out richer.
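
A short sketch of how this might be called, assuming model is a trained GDecisionTree and <iostream> has been included; the optional relation pointers would come from however the ARFF data was loaded (not shown here):

    model.print(std::cout);                            // plain print-out
    // model.print(std::cout, pFeatureRel, pLabelRel); // richer, with ARFF meta-data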

virtual GDomNode* GClasses::GDecisionTree::serialize ( GDom *  pDoc) const
virtual

Marshal this object into a DOM, which can then be converted to a variety of serial formats.

Implements GClasses::GSupervisedLearner.
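
A save/load round-trip sketch. The GDom and GLearnerLoader details shown here (setRoot, root, saveJson, and the loader constructor taking a GRand) are assumptions about the surrounding library, not something documented on this page:

    GDom doc;
    doc.setRoot(model.serialize(&doc));   // marshal the trained tree into the DOM
    doc.saveJson("tree.json");            // assumed helper for writing the DOM

    // Later: rebuild the model from the stored DOM.
    GRand rand(0);
    GLearnerLoader ll(rand);              // assumed constructor signature
    GDecisionTree restored(doc.root(), ll);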

void GClasses::GDecisionTree::setLeafThresh ( size_t  n)
inline

Sets the leaf threshold. When the number of samples is <= this value, it will no longer try to divide the data, but will create a leaf node. The default value is 1. For noisy data, a larger value may be advantageous.

void GClasses::GDecisionTree::setMaxLevels ( size_t  n)
inline

Sets the max levels. When a path from the root to the current node contains n nodes (including the root), it will no longer try to divide the data, but will create a leaf node. If set to 0, then there is no maximum. 0 is the default.
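
Both limits are set before training; a brief sketch, assuming features and labels are already-loaded GMatrix objects:

    GDecisionTree model;
    model.setLeafThresh(8);  // make a leaf once 8 or fewer samples remain
    model.setMaxLevels(6);   // never let a root-to-node path exceed 6 nodes
    model.train(features, labels);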

static void GClasses::GDecisionTree::test ( )
static

Performs unit tests for this class. Throws an exception if there is a failure.

virtual void GClasses::GDecisionTree::trainInner ( const GMatrix &  features,
const GMatrix &  labels 
)
protected virtual

See the comment for GSupervisedLearner::trainInner.

size_t GClasses::GDecisionTree::treeSize ( )

Returns the number of nodes in this tree.

void GClasses::GDecisionTree::useBinaryDivisions ( )

Specify to only use binary divisions.

void GClasses::GDecisionTree::useRandomDivisions ( size_t  randomDraws = 1)
inline

Specifies for this decision tree to use random divisions (instead of divisions that reduce entropy). Random divisions make the algorithm train somewhat faster, and also increase model variance, so it is better suited for ensembles, but random divisions also make the decision tree vulnerable to problems with irrelevant attributes.
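
A sketch of the intended use in an ensemble setting: several randomized trees trained on the same data, each seeded differently so their random divisions differ. The committee container (requires <vector>) and the GRand::setSeed call are assumptions; the caller owns, and must eventually delete, the trees.

    std::vector<GDecisionTree*> committee;
    for(size_t i = 0; i < 10; i++)
    {
        GDecisionTree* pTree = new GDecisionTree();
        pTree->useRandomDivisions(1);  // one random draw per division
        pTree->rand().setSeed(i);      // rand() is inherited from GTransducer
        pTree->train(features, labels);
        committee.push_back(pTree);
    }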

Member Data Documentation

bool GClasses::GDecisionTree::m_binaryDivisions
protected

DivisionAlgorithm GClasses::GDecisionTree::m_eAlg
protected

size_t GClasses::GDecisionTree::m_leafThresh
protected

size_t GClasses::GDecisionTree::m_maxLevels
protected

GDecisionTreeNode* GClasses::GDecisionTree::m_pRoot
protected

size_t GClasses::GDecisionTree::m_randomDraws
protected