Random Forest
Random forest is an ensemble learning method for solving classification and regression problems. Random forest training builds a composition (ensemble) of models of one type and combines the answers of the individual models with an aggregation algorithm. Each model is trained on a part of the training dataset, and that part is chosen by the bagging and feature subspace methods. More information about these concepts may be found here: 1, 2, and 3.
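The two sampling ideas mentioned above can be illustrated with a small stand-alone sketch. It does not use the Ignite ML API: the class BaggingSketch and its methods are hypothetical names introduced only for illustration, and the exact sampling procedure used by Ignite may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration of bagging and feature subspace sampling; not Ignite ML code. */
final class BaggingSketch {
    /** Bagging: draws sampleSize row indices uniformly, with replacement, from [0, rowCount). */
    static int[] bootstrapRows(int rowCount, int sampleSize, Random rnd) {
        int[] rows = new int[sampleSize];
        for (int i = 0; i < sampleSize; i++)
            rows[i] = rnd.nextInt(rowCount); // the same row may be drawn more than once
        return rows;
    }

    /** Feature subspace: picks subsetSize distinct feature indices from [0, featureCount). */
    static List<Integer> featureSubset(int featureCount, int subsetSize, Random rnd) {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < featureCount; i++)
            all.add(i);
        Collections.shuffle(all, rnd); // random order, so the first subsetSize entries form a random subset
        return new ArrayList<>(all.subList(0, subsetSize));
    }
}
```

Under these assumptions, each tree in the ensemble would be trained on its own bootstrapRows(...) sample, restricted to its own featureSubset(...) columns, which is what makes the individual models differ from one another.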
There are several implementations of aggregation algorithms in Apache Ignite ML:
- MeanValuePredictionsAggregator - computes the answer of a random forest as the mean value of predictions from all models in the given composition. This is often used for regression tasks.
- OnMajorityPredictionsAggregator - returns the mode of predictions from all models in the given composition. This can be useful for classification tasks. NOTE: This aggregator supports multi-class classification tasks.
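To make the two aggregation rules concrete, here is a minimal stand-alone sketch. It is not the Ignite ML implementation: AggregatorSketch, mean and majority are hypothetical names, and the tie-breaking rule (smallest label wins) is an assumption made only so that the result is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of mean and majority aggregation; not the Ignite ML classes. */
final class AggregatorSketch {
    /** Regression-style aggregation: the mean of all per-model predictions. */
    static double mean(double[] predictions) {
        double sum = 0.0;
        for (double p : predictions)
            sum += p;
        return sum / predictions.length;
    }

    /** Classification-style aggregation: the most frequent prediction (the mode). */
    static double majority(double[] predictions) {
        Map<Double, Integer> counts = new HashMap<>();
        for (double p : predictions)
            counts.merge(p, 1, Integer::sum); // count votes per label

        double best = Double.NaN;
        int bestCnt = -1;
        for (Map.Entry<Double, Integer> e : counts.entrySet()) {
            int c = e.getValue();
            // Assumption: on a tie, prefer the smallest label.
            if (c > bestCnt || (c == bestCnt && e.getKey() < best)) {
                best = e.getKey();
                bestCnt = c;
            }
        }
        return best;
    }
}
```

Because the majority rule only counts votes per label, it works for any number of classes, which matches the note above about multi-class tasks.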
Model
The random forest algorithm is implemented in Ignite ML as a special case of a model composition with specific aggregators for different problems (MeanValuePredictionsAggregator for regression, OnMajorityPredictionsAggregator for classification).
Here is an example of model usage:
ModelsComposition randomForest = ….
double prediction = randomForest.apply(featuresVector);
Trainer
The random forest training algorithm is implemented by the RandomForestRegressionTrainer and RandomForestClassifierTrainer trainers, which take the following parameters:
- meta - features meta, a list of feature type descriptions, each consisting of:
  - featureId - index in the features vector.
  - isCategoricalFeature - flag, true if a feature is categorical.
  - feature name.
  This meta-information is important for random forest training algorithms because they build feature histograms, and categorical features should be represented in histograms for all feature values.
- featuresCountSelectionStrgy - sets the strategy that defines the count of random features used for learning one tree. The SQRT, LOG2, ALL and ONE_THIRD strategies are implemented in the FeaturesCountSelectionStrategies class.
- maxDepth - sets the maximum tree depth.
- minImpurityDelta - minimum impurity decrease required for a split: a node in a decision tree is split into two nodes only if the impurity of the resulting nodes falls below the impurity of the unsplit node by this margin.
- subSampleSize - value lying in the [0; MAX_DOUBLE] interval. This parameter defines the count of sample repetitions in uniform sampling with replacement.
- seed - seed value used in random generators.
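The feature count strategies and the impurity threshold described above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the Ignite ML code: ForestParamsSketch and its members are hypothetical, the rounding of each strategy (truncation, with a minimum of one feature) is assumed, Gini impurity is chosen only as an example impurity measure, and the strict comparison against minImpurityDelta is one possible reading of the parameter.

```java
/** Hypothetical illustration of feature count strategies and an impurity-based split check. */
final class ForestParamsSketch {
    enum Strategy { SQRT, LOG2, ALL, ONE_THIRD }

    /** Number of random features offered to one tree (assumed: truncate, at least 1). */
    static int featuresPerTree(Strategy s, int featureCount) {
        int n;
        switch (s) {
            case SQRT:      n = (int) Math.sqrt(featureCount); break;
            case LOG2:      n = (int) (Math.log(featureCount) / Math.log(2)); break;
            case ONE_THIRD: n = featureCount / 3; break;
            default:        n = featureCount; // ALL
        }
        return Math.max(1, n);
    }

    /** Gini impurity of a node given the number of samples per class. */
    static double gini(int[] classCounts) {
        int total = sum(classCounts);
        if (total == 0)
            return 0.0;
        double sumSq = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sumSq += p * p;
        }
        return 1.0 - sumSq;
    }

    /** Assumed rule: split only if the weighted impurity drops by more than minImpurityDelta. */
    static boolean shouldSplit(int[] parent, int[] left, int[] right, double minImpurityDelta) {
        int nl = sum(left), nr = sum(right), n = nl + nr;
        if (n == 0)
            return false;
        double weighted = (nl * gini(left) + nr * gini(right)) / n;
        return gini(parent) - weighted > minImpurityDelta;
    }

    private static int sum(int[] xs) {
        int s = 0;
        for (int x : xs)
            s += x;
        return s;
    }
}
```

For example, with 13 features this sketch offers 3 features per tree under SQRT and 4 under ONE_THIRD. A split that separates a 5/5 node into two pure children passes the check with a threshold of 0, while a split that leaves the class proportions unchanged does not.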
Random forest training may be used as follows:
RandomForestClassifierTrainer trainer = new RandomForestClassifierTrainer(featuresMeta)
    .withCountOfTrees(101)
    .withFeaturesCountSelectionStrgy(FeaturesCountSelectionStrategies.ONE_THIRD)
    .withMaxDepth(4)
    .withMinImpurityDelta(0.)
    .withSubSampleSize(0.3)
    .withSeed(0);

ModelsComposition rf = trainer.fit(
    datasetBuilder,
    featureExtractor,
    labelExtractor
);
Example
To see how the Random Forest Classifier can be used in practice, try this example, available on GitHub and delivered with every Apache Ignite distribution. The example uses the Wine recognition dataset. A description of this dataset and the data itself are available from the UCI Machine Learning Repository.