waffles_learn
A command-line tool that wraps supervised and semi-supervised learning algorithms. Here's the usage information:
Full Usage Information
[Square brackets] are used to indicate required arguments.
<Angled brackets> are used to indicate optional arguments.
waffles_learn [command]
Supervised learning, transduction, cross-validation, etc.
autotune [dataset] <data_opts> [algname]
Use cross-validation to automatically determine a good set of parameters
for the specified algorithm with the specified data. The selected
parameters are printed to stdout.
[dataset]
The filename of a dataset.
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
[algname]
The name of the algorithm that you wish to automatically tune.
agglomerativetransducer
An agglomerative transducer
decisiontree
A decision tree
graphcuttransducer
A graph-cut transducer
knn
A k-nearest-neighbor instance-based learner
meanmarginstree
A mean margins tree
neuralnet
A feed-forward neural network (a.k.a. multi-layer perceptron)
naivebayes
A naive Bayes model
naiveinstance
A naive instance model
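For example, a hypothetical autotune invocation (mydata.arff is a
placeholder filename, and the label column index is illustrative) might be:
waffles_learn autotune mydata.arff -labels 4 decisiontree
The selected parameters are printed to stdout.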
train <options> [dataset] <data_opts> [algorithm]
Trains a supervised learning algorithm. The trained model-file is printed
to stdout. (Typically, you will want to pipe this to a file.)
<options>
-seed [value]
Specify a seed for the random number generator. (Use this option to
ensure that your results are reproducible.)
-calibrate
Calibrate the model after it is trained, such that predicted
distributions will approximate the distributions represented in the
training data. This switch is typically used only if you plan to
predict distributions (by calling predictdistribution) instead of
just class labels or regression values. Calibration will not affect
the predictions made by regular calls to 'predict', which is used
by most other tools.
-embed
Escape the output model such that it can easily be embedded in C or
C++ code.
[dataset]
The filename of a dataset.
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
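For example, a hypothetical training run (mydata.arff and model.json
are placeholder filenames) might be:
waffles_learn train -seed 0 mydata.arff decisiontree > model.json
The redirection saves the model printed to stdout so it can be used
later with the predict and test commands.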
predict <options> [model-file] [dataset] <data_opts>
Predict labels for all of the patterns in [dataset]. Results are printed
in the form of a ".arff" file (including both features and predictions)
to stdout.
<options>
-seed [value]
Specify a seed for the random number generator. (Use this option to
ensure that your results are reproducible.)
[model-file]
The filename of a trained model. (This is the file to which you saved
the output when you trained a supervised learning algorithm.)
[dataset]
The filename of a dataset. (There should already be placeholder labels
in this dataset. The placeholder labels will be replaced in the output
by the labels that the model predicts.)
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
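For example, assuming model.json was saved by a previous train command
and newdata.arff contains patterns with placeholder labels (both
filenames are placeholders):
waffles_learn predict model.json newdata.arff > predictions.arff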
predictdistribution <options> [model-file] <data_opts> [pattern]
Predict a distribution for a single feature vector and print it to
stdout. (Typically, the '-calibrate' switch should be used when training
the model. If the model is not calibrated, then the predicted
distribution may not be a very good estimated distribution. Also, some
models cannot be used to predict a distribution.)
<options>
-seed [value]
Specify a seed for the random number generator. (Use this option to
ensure that your results are reproducible.)
[model-file]
The filename of a trained model. (This is the file to which you saved
the output when you trained a supervised learning algorithm.)
[pattern]
A list of feature values separated by spaces. (A "?" may be used for
unknown feature values if the model supports using unknown feature
values.)
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
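For example, a hypothetical invocation for a model trained with the
-calibrate switch on a dataset with four continuous features (the
filename and feature values are placeholders):
waffles_learn predictdistribution model.json 5.1 3.5 1.4 0.2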
test <options> [model-file] [dataset] <data_opts>
Test a trained model using some test data. Results are printed to stdout
for each dimension in the label vector. Predictive accuracy is reported
for nominal label dimensions, and mean-squared-error is reported for
continuous label dimensions.
<options>
-seed [value]
Specify a seed for the random number generator. (Use this option to
ensure that your results are reproducible.)
-confusion
Print a confusion matrix for each nominal label attribute.
-confusioncsv
Print a confusion matrix in comma-separated value format for each
nominal label attribute.
[model-file]
The filename of a trained model. (This is the file to which you saved
the output when you trained a supervised learning algorithm.)
[dataset]
The filename of a test dataset. (This dataset must have the same
number of columns as the dataset with which the model was trained.)
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
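For example, a hypothetical invocation (model.json and testdata.arff
are placeholder filenames):
waffles_learn test -confusion model.json testdata.arff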
transduce <options> [labeled-set] <data_opts1> [unlabeled-set] <data_opts2> [algorithm]
Predict labels for [unlabeled-set] based on the examples in
[labeled-set]. For most algorithms, this is the same as training on
[labeled-set] and then predicting labels for [unlabeled-set]. Some
algorithms, however, have no models. These can transduce, even though
they cannot be trained. The predicted labels are printed to stdout as a
".arff" file.
<options>
-seed [value]
Specify a seed for the random number generator. (Use this option to
ensure that your results are reproducible.)
[labeled-set]
The filename of a dataset. The labels in this dataset are used to
infer labels for the unlabeled set.
[unlabeled-set]
The filename of a dataset. This dataset must have placeholder labels,
but these will be ignored when predicting new labels.
<data_opts1>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
<data_opts2>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
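For example, a hypothetical invocation (labeled.arff and unlabeled.arff
are placeholder filenames):
waffles_learn transduce labeled.arff unlabeled.arff graphcuttransducer > predicted.arff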
transacc <options> [training-set] <data_opts1> [test-set] <data_opts2> [algorithm]
Measure the transductive accuracy of [algorithm] with respect to the
specified training and test sets. Results are printed to stdout for each
dimension in the label vector. Predictive accuracy is reported for
nominal labels, and mean-squared-error is reported for continuous labels.
<options>
-seed [value]
Specify a seed for the random number generator. (Use this option to
ensure that your results are reproducible.)
[training-set]
The filename of a dataset. The labels in this dataset are used to
infer labels for the unlabeled set.
[test-set]
The filename of a dataset. This dataset must have placeholder labels.
The placeholder labels will be replaced in the output with the new
predicted labels.
<data_opts1>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
<data_opts2>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
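For example, a hypothetical invocation (train.arff and test.arff are
placeholder filenames):
waffles_learn transacc -seed 0 train.arff test.arff neighbortransducer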
splittest <options> [dataset] <data_opts> [algorithm]
This shuffles the data, then splits it into two parts, trains with one
part, and tests with the other. (This also works with model-free
algorithms.) Results are printed to stdout for each dimension in the
label vector. Predictive accuracy is reported for nominal labels, and
mean-squared-error is reported for continuous labels.
<options>
-seed [value]
Specify a seed for the random number generator. (Use this option to
ensure that your results are reproducible.)
-trainratio [value]
Specify the amount of the data (between 0 and 1) to use for
training. The rest will be used for testing.
-reps [value]
Specify the number of repetitions to perform. If not specified, the
default is 1.
-writelastmodel [filename]
Write the model generated on the last repetition to the given
filename. Note that this only works when the learner being used
has an internal model.
[dataset]
The filename of a dataset.
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
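For example, a hypothetical invocation that trains on 70% of the data
and tests on the rest, repeated 10 times (mydata.arff is a placeholder
filename):
waffles_learn splittest -seed 0 -trainratio 0.7 -reps 10 mydata.arff decisiontree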
crossvalidate <options> [dataset] <data_opts> [algorithm]
Perform cross-validation with the specified dataset and algorithm.
Results are printed to stdout.
<options>
-seed [value]
Specify a seed for the random number generator. (Use this option to
ensure that your results are reproducible.)
-reps [value]
Specify the number of repetitions to perform. If not specified, the
default is 5.
-folds [value]
Specify the number of folds to use. If not specified, the default
is 2.
-succinct
Just report the average mean squared error. Do not report results
at each fold.
[dataset]
The filename of a dataset.
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
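For example, a hypothetical 5-repetition, 10-fold cross-validation run
(mydata.arff is a placeholder filename):
waffles_learn crossvalidate -seed 0 -reps 5 -folds 10 mydata.arff naivebayes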
precisionrecall <options> [dataset] <data_opts> [algorithm]
Compute the precision/recall for a dataset and algorithm.
<options>
-seed [value]
Specify a seed for the random number generator. (Use this option to
ensure that your results are reproducible.)
-labeldims [n]
Specify the number of dimensions in the label (output) vector. The
default is 1. (Don't confuse this with the number of class labels.
It only takes one dimension to specify a class label, even if there
are k possible labels.)
-reps [n]
Specify the number of reps to perform. More reps means it will take
longer, but results will be more accurate. The default is 5.
-samples [n]
Specify the granularity at which to measure recall. If not
specified, the default is 100.
[dataset]
The filename of a dataset.
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
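For example, a hypothetical invocation (mydata.arff is a placeholder
filename, and the neighbor count is illustrative):
waffles_learn precisionrecall -reps 5 mydata.arff knn -neighbors 5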
sterilize <options> [dataset] <data_opts> [algorithm]
Perform cross-validation to generate a new dataset that contains only the
correctly-classified instances. The new sterilized data is printed to
stdout.
<options>
-seed [value]
Specify a seed for the random number generator. (Use this option to
ensure that your results are reproducible.)
-folds [n]
Specify the number of cross-validation folds to perform.
-diffthresh [d]
Specify a threshold of absolute difference for continuous labels.
Predictions with an absolute difference less than this threshold
are considered to be "correct".
[dataset]
The filename of a dataset to sterilize.
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
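For example, a hypothetical invocation (mydata.arff and sterile.arff
are placeholder filenames):
waffles_learn sterilize -folds 10 mydata.arff decisiontree > sterile.arff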
trainrecurrent <options> [method] [obs-data] [action-data] [context-dims] [algorithm] [algorithm]
Train a recurrent model of a dynamical system with the specified training
[method]. The training data is specified by [obs-data], which specifies
the sequence of observations, and [action-data], which specifies the
sequence of actions. [context-dims] specifies the number of dimensions in
the state-space of the system. The two algorithms specify the two
functions of a model of a dynamical system. The first [algorithm] models
the transition function. The second [algorithm] models the observation
function.
<options>
-seed [value]
Specify a seed for the random number generator. (Use this option to
ensure that your results are reproducible.)
-paramdims 2 [wid] [hgt]
If observations are images, use this option to parameterize the
predictions, so only the channel values of each pixel are
predicted. (Other values besides 2 dimensions are also supported.)
-state [filename]
Save the estimated state to the specified file. (This only has an
effect if moses is used as the training method.)
-validate [interval] 1 [obs] [action]
Perform validation at [interval]-second intervals with observation
data, [obs], and action data, [action]. (Also supports more than 1
validation sequence if desired.)
-out [filename]
Save the resulting model to the specified file. If not specified,
the default is "model.json".
-noblur
Do not use blurring. The default is to use blurring. Sometimes
blurring improves results. Sometimes not.
-traintime [seconds]
Specify how many seconds to train the model. The default is 3600,
which is 1 hour.
-isomap
Use Isomap instead of Breadth-first Unfolding if moses is used as
the training method.
[method]
moses
Use Temporal-NLDR to estimate state, then build the model using the
state estimate.
bptt [depth] [iters-per-grow-sequence]
Backpropagation Through Time. [depth] specifies the number of
instances of the transition function that will appear in the
unfolded model. A good value might be 3. [iters-per-grow-sequence]
specifies the number of pattern presentations before the sequence
is incremented. A good value might be 50000.
evolutionary
Train with evolutionary optimization.
hillclimber
Train with a hill-climbing algorithm.
annealing [deviation] [decay] [window]
Train with simulated annealing. Good values might be 2.0, 0.5, and 300, respectively.
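For example, a hypothetical invocation using Backpropagation Through
Time (obs.arff and actions.arff are placeholder filenames, and the
numeric arguments are illustrative, not recommendations):
waffles_learn trainrecurrent -traintime 600 -out model.json bptt 3 50000 obs.arff actions.arff 4 neuralnet neuralnet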
regress [data] <data_opts> [equation]
Use a hill climbing algorithm to optimize the parameters of [equation] to
fit to the [data]. If [data] has d feature dimensions, then [equation]
must have more than d parameters. The equation must be named f. The first
d arguments to f are supplied by the data features. The remaining
arguments are optimized by the hill climber. The data must have exactly 1
label dimension, which the equation will attempt to predict. The
sum-squared error and parameter values are printed to stdout.
[data]
The filename of a dataset.
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
[equation]
An equation to regress to fit the data. The equation must be named
'f'. It can call helper-equations, separated by semicolons, if needed.
Example: "f(x1,x2,p1,p2,p3)=s(p1*x1+p2*x2+p3);s(x)=1/(1+e^(-x))"
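For example, a hypothetical invocation that fits a line to data with
one feature dimension and one continuous label (mydata.arff is a
placeholder filename):
waffles_learn regress mydata.arff "f(x,m,b)=m*x+b"
The quotes keep the shell from interpreting the parentheses. The
sum-squared error and the fitted values of m and b are printed to
stdout.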
metadata [data] <data_opts>
Generate a vector of metadata values for the given dataset. This might be
useful for meta-analysis. The resulting vector is printed to stdout in
ARFF format.
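For example, a hypothetical invocation (mydata.arff and meta.arff are
placeholder filenames):
waffles_learn metadata mydata.arff > meta.arff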
usage
Print usage information.
[algorithm]
A supervised learning algorithm, or a transductive algorithm.
agglomerativetransducer
A model-free transduction algorithm based on single-link agglomerative
clustering. Unlabeled patterns take the label of the cluster with which
they are joined. It never joins clusters with different labels.
bag <contents> end
A bagging (bootstrap aggregating) ensemble. This is a way to combine the
power of many learning algorithms through voting. "end" marks the end of
the ensemble contents. Each algorithm instance is trained using a
training set created by drawing (with replacement) from the original data
until the training set has the same number of instances as the original
data.
<contents>
[instance_count] [algorithm]
Specify the number of instances of a learning algorithm to add to
the bagging ensemble.
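For example, a hypothetical bagging ensemble of 10 decision trees and
10 mean margins trees, trained and saved to a placeholder file:
waffles_learn train mydata.arff bag 10 decisiontree 10 meanmarginstree end > ensemble.json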
baseline
This is one of the simplest of all supervised algorithms. It ignores all
features. For nominal labels, it always predicts the most common class in
the training set. For continuous labels, it always predicts the mean
label in the training set. An effective learning algorithm should never
do worse than baseline--hence the name "baseline".
bucket <contents> end
This uses cross-validation with the training set to select the best model
from a bucket of models. When accuracy is measured across multiple
datasets, it will usually do better than the best model in the bucket
could do. "end" marks the end of the contents of the bucket.
<contents>
[algorithm]
Add an algorithm to the bucket
bma <contents> end
A Bayesian model averaging ensemble. This trains each model after the
manner of bagging, but then combines them weighted according to their
probability given the data. Uniform priors are assumed.
<contents>
[instance_count] [algorithm]
Specify the number of instances of a learning algorithm to add to
the BMA ensemble.
bmc <options> <contents> end
A Bayesian model combination ensemble. This algorithm is described in
Monteith, Kristine and Carroll, James and Seppi, Kevin and Martinez,
Tony, Turning Bayesian Model Averaging into Bayesian Model Combination,
Proceedings of the IEEE International Joint Conference on Neural Networks
IJCNN'11, 2657--2663, 2011.
<options>
-samples [n]
Specify the number of samples to draw from the simplex of possible
ensemble combinations. (Larger values result in better accuracy
at the cost of more computation.)
<contents>
[instance_count] [algorithm]
Specify the number of instances of a learning algorithm to add to
the BMC ensemble.
boost <options> [algorithm]
Uses ResamplingAdaBoost to create an ensemble that may be more accurate
than a lone instance of the specified algorithm. (ResamplingAdaBoost is
similar to AdaBoost, except that it uses resampling to approximate
weighted instances in the training set. This difference enables it to
work with algorithms that do not implicitly support weighted samples.)
<options>
-trainratio [value]
When approximating the weighted training set by resampling, use a
sample of size [value]*training_set_size
-size [n]
The number of base learners to use in the ensemble.
cvdt [n]
This is a bucket of two bagging ensembles: one with [n] entropy-reducing
decision trees, and one with [n] meanmarginstrees. (This algorithm is
specified in Gashler, Michael S. and Giraud-Carrier, Christophe and
Martinez, Tony. Decision Tree Ensemble: Small Heterogeneous Is Better
Than Large Homogeneous. In The Seventh International Conference on
Machine Learning and Applications, Pages 900 - 905, ICMLA '08. 2008)
decisiontree <options>
A decision tree.
<options>
-autotune
Automatically determine a good set of parameters for this model
with the current data.
-random [draws]
Use random divisions (instead of divisions that reduce entropy).
Random divisions make the algorithm train faster, and also increase
model variance, so it is better suited for ensembles, but random
divisions also make the decision tree more vulnerable to problems
with irrelevant features. [draws] is typically 1, but if you
specify a larger value, it will pick the best out of the specified
number of random draws.
-binary
Use binary divisions. For nominal attributes with more than 2
categorical values, one specific value will be separated from all
others at each division.
-leafthresh [n]
When building the tree, if the number of samples is <= this value,
it will stop trying to divide the data and will create a leaf node.
The default value is 1. For noisy data, larger values may be
advantageous.
-maxlevels [n]
When building the tree, if the depth (the length of the path from
the root to the node currently being formed, including the root and
the currently forming node) is [n], it will stop trying to divide
the data and will create a leaf node. This means that there will
be at most [n]-1 splits before a decision is made. This crudely
limits overfitting, and so can be helpful on small data sets. It
can also make the resulting trees easier to interpret. If set to
0, then there is no maximum (which is the default).
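For example, a hypothetical cross-validation run using a decision tree
with binary divisions and a leaf threshold of 4 (mydata.arff is a
placeholder filename):
waffles_learn crossvalidate mydata.arff decisiontree -binary -leafthresh 4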
gaussianprocess
A Gaussian process model.
<options>
-noise [var]
The variance of the noise parameter.
-prior [var]
The prior variance for the weights. (This value will be multiplied
by an identity matrix to form the prior covariance for the weights.)
-maxsamples [n]
The maximum number of samples to train with. (If the training data
contains more than [n] rows, then it will automatically randomly
sub-sample the training data in order to limit computational
complexity.)
-kernel [k]
Specify the kernel to use
identity
This simple kernel causes it to learn a linear model. If no
kernel is specified, this is the default.
chisquared
A Chi Squared kernel.
rbf [var]
A Gaussian RBF kernel. [var] specifies the variance term for
this kernel. Larger values result in a smoother model.
polynomial [ofs] [order]
A polynomial kernel.
[ofs]
An offset value.
[order]
The order of the polynomial.
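For example, a hypothetical training run using an RBF kernel with a
variance of 1.0 (the filenames are placeholders, and the variance is
illustrative):
waffles_learn train mydata.arff gaussianprocess -kernel rbf 1.0 > model.json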
graphcuttransducer <options>
This is a model-free transduction algorithm. It uses a min-cut/max-flow
graph-cut algorithm to separate each label from all of the others.
<options>
-autotune
Automatically determine a good set of parameters for this model
with the current data.
-neighbors [k]
Set the number of neighbors to connect with each point in order to
form the graph.
hodgepodge
This is a ready-made ensemble of various unrelated learning algorithms.
knn <options>
The k-Nearest-Neighbor instance-based learning algorithm. It uses
Euclidean distance for continuous features and Hamming distance for
nominal features.
<options>
-autotune
Automatically determine a good set of parameters for this model
with the current data.
-neighbors [k]
Specify the number of neighbors, k, to use.
-nonormalize
Specify not to normalize the scale of continuous features. (The
default is to normalize by dividing by 2 times the deviation in
that attribute.)
-equalweight
Give equal weight to every neighbor. (The default is to use linear
weighting for continuous features, and squared linear weighting for
nominal features.)
-scalefeatures
Use a hill-climbing algorithm on the training set to scale the
feature dimensions in order to give more accurate results. This
increases training time, but also improves accuracy and robustness
to irrelevant features.
-pearson
Use Pearson's correlation coefficient to evaluate the similarity
between sparse vectors. (Only compatible with sparse training.)
-cosine
Use the cosine method to evaluate the similarity between sparse
vectors. (Only compatible with sparse training.)
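For example, a hypothetical cross-validation run with 7 neighbors
(mydata.arff is a placeholder filename):
waffles_learn crossvalidate mydata.arff knn -neighbors 7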
linear
A linear regression model
meanmarginstree
This is a very simple oblique (or linear combination) tree. (This
algorithm is specified in Gashler, Michael S. and Giraud-Carrier,
Christophe and Martinez, Tony. Decision Tree Ensemble: Small
Heterogeneous Is Better Than Large Homogeneous. In The Seventh
International Conference on Machine Learning and Applications, Pages 900
- 905, ICMLA '08. 2008)
naivebayes <options>
The naive Bayes learning algorithm.
<options>
-autotune
Automatically determine a good set of parameters for this model
with the current data.
-ess [value]
Specifies an equivalent sample size to prevent unsampled values
from dominating the joint distribution. Good values typically range
between 0 and 1.5.
naiveinstance <options>
This is an instance learner that assumes each dimension is conditionally
independent of other dimensions. It lacks the accuracy of knn in low
dimensional feature space, but scales much better to high dimensionality.
<options>
-autotune
Automatically determine a good set of parameters for this model
with the current data.
-neighbors [k]
Set the number of neighbors to use in each dimension
neighbortransducer <options>
This is a model-free transduction algorithm. It is an instance learner
that propagates labels where the neighbors are most in agreement. This
algorithm does well when classes sample a manifold (such as with text
recognition).
<options>
-autotune
Automatically determine a good set of parameters for this model
with the current data.
-neighbors [k]
Set the number of neighbors to use with each point
neuralnet <options>
A single or multi-layer feed-forward neural network (a.k.a. multi-layer
perceptron). It can be trained with online backpropagation (Rumelhart,
D.E., Hinton, G.E., and Williams, R.J. Learning representations by
back-propagating errors. Nature, 323:9, 1986.), or several other
optimization methods.
<options>
-autotune
Automatically determine a good set of parameters (including number
of layers, hidden units, learning rate, etc.) for this model with
the current data.
-addlayer [size]
Add a hidden layer with "size" logistic units to the network. You
may use this option multiple times to add multiple layers. The
first layer added is adjacent to the input features. The last layer
added is adjacent to the output labels. If you don't add any hidden
layers, the network is just a single layer of sigmoid units.
-learningrate [value]
Specify a value for the learning rate. The default is 0.1
-momentum [value]
Specifies a value for the momentum. The default is 0.0
-windowepochs [value]
Specifies the number of training epochs that are performed before
the stopping criterion is tested again. Bigger values will result in
a more stable stopping criterion. Smaller values will check the
stopping criterion more frequently.
-minwindowimprovement [value]
Specify the minimum improvement that must occur over the window of
epochs for training to continue. [value] specifies the minimum
decrease in error as a ratio. For example, if value is 0.02, then
training will stop when the mean squared error does not decrease by
two percent over the window of epochs. Smaller values will
typically result in longer training times.
-holdout [portion]
Specify the portion of the data (between 0 and 1) to use as a
hold-out set for validation. That is, this portion of the data will
not be used for training, but will be used to determine when to
stop training. If the holdout portion is set to 0, then no holdout
set will be used, and the entire training set will be used for
validation (which may lead to long training time and overfit).
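For example, a hypothetical training run with two hidden layers of 16
and 8 logistic units (the filenames and layer sizes are placeholders,
not recommendations):
waffles_learn train mydata.arff neuralnet -addlayer 16 -addlayer 8 -learningrate 0.05 > model.json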
randomforest [trees] <options>
A bagging ensemble of decision trees that use random division
boundaries. (This algorithm is described in Breiman, Leo (2001). Random
Forests. Machine Learning 45 (1): 5-32. doi:10.1023/A:1010933404324.)
[trees]
Specify the number of trees in the random forest
<options>
-samples [n]
Specify the number of randomly-drawn attributes to evaluate. The
one that maximizes information gain will be chosen for the decision
boundary. If [n] is 1, then the divisions are completely random.
Larger values will decrease the randomness.
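For example, a hypothetical random forest of 64 trees, evaluated with
splittest (mydata.arff is a placeholder filename):
waffles_learn splittest -trainratio 0.7 mydata.arff randomforest 64 -samples 2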
reservoir <options>
A reservoir network.
<options>
-augments [d]
The number of dimensions to augment the data with. (Smaller values
lead to smoother models.)
-deviation [dev]
The deviation to use to randomly initialize the weights in the
reservoir.
-layers [n]
The number of hidden layers to use in the reservoir.
wag <options>
A multi-layer perceptron (MLP) that is trained by first training several
MLP models, and then averaging their weights together using a process
called wagging. (Before the weights in hidden layers can be averaged,
they are first aligned using bipartite matching.)
<options>
-addlayer [size]
Add a hidden layer with "size" logistic units to the network. You
may use this option multiple times to add multiple layers. The
first layer added is adjacent to the input features. The last layer
added is adjacent to the output labels. If you don't add any hidden
layers, the network is just a single layer of sigmoid units.
-learningrate [value]
Specify a value for the learning rate. The default is 0.1
-models [k]
Specify the number of MLP models to train and then average
together.
-momentum [value]
Specifies a value for the momentum. The default is 0.0
-windowepochs [value]
Specifies the number of training epochs that are performed before
the stopping criterion is tested again. Bigger values will result in
a more stable stopping criterion. Smaller values will check the
stopping criterion more frequently.
-minwindowimprovement [value]
Specify the minimum improvement that must occur over the window of
epochs for training to continue. [value] specifies the minimum
decrease in error as a ratio. For example, if value is 0.02, then
training will stop when the mean squared error does not decrease by
two percent over the window of epochs. Smaller values will
typically result in longer training times.
-noalign
Specify to compute weight averages without first aligning the
corresponding weights. This option will typically make results
significantly worse, but it may be useful for evaluating the value
of aligning the weights before averaging them together.
-holdout [portion]
Specify the portion of the data (between 0 and 1) to use as a
hold-out set for validation. That is, this portion of the data will
not be used for training, but will be used to determine when to
stop training. If the holdout portion is set to 0, then no holdout
set will be used, and the entire training set will be used for
validation (which may lead to long training time and overfit).
-dontsquashoutputs
Don't squash the output values with the logistic function. Just
report the net value at the output layer. This is often used for
regression.
usage
Print usage information.