Back to the table of contents Previous Next waffles_learnA command-line tool that wraps supervised and semi-supervised learning algorithms. Here's the usage information: Full Usage Information [Square brackets] are used to indicate required arguments. <Angled brackets> are used to indicate optional arguments. waffles_learn [command] Supervised learning, transduction, cross-validation, etc. autotune [dataset] <data_opts> [algname] Use cross-validation to automatically determine a good set of parameters for the specified algorithm with the specified data. The selected parameters are printed to stdout. [dataset] The filename of a dataset. <data_opts> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. [algname] The name of the algorithm that you wish to automatically tune. agglomerativetransducer An agglomerative transducer decisiontree A decision tree graphcuttransducer A graph-cut transducer knn A k-nearest-neighbor instance-based learner meanmarginstree A mean margins tree neuralnet A feed-foward neural network (a.k.a. multi-layer perceptron) naivebayes A naive Bayes model naiveinstance A naive instance model train <options> [dataset] <data_opts> [algorithm] Trains a supervised learning algorithm. The trained model-file is printed to stdout. (Typically, you will want to pipe this to a file.) <options> -seed [value] Specify a seed for the random number generator. (Use this option to ensure that your results are reproduceable.) -calibrate Calibrate the model after it is trained, such that predicted distributions will approximate the distributions represented in the training data. This switch is typically used only if you plan to predict distributions (by calling predictdistribution) instead of just class labels or regression values. Calibration will not effect the predictions made by regular calls to 'predict', which is used by most other tools. -embed Escape the output model such that it can easily be embedded in C or C++ code. [dataset] The filename of a dataset. <data_opts> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. predict <options> [model-file] [dataset] <data_opts> Predict labels for all of the patterns in [dataset]. Results are printed in the form of a ".arff" file (including both features and predictions) to stdout. <options> -seed [value] Specify a seed for the random number generator. (Use this option to ensure that your results are reproduceable.) [model-file] The filename of a trained model. (This is the file to which you saved the output when you trained a supervised learning algorithm.) [dataset] The filename of a dataset. (There should already be placeholder labels in this dataset. The placeholder labels will be replaced in the output by the labels that the model predicts.) <data_opts> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. predictdistribution <options> [model-file] <data_opts> [pattern] Predict a distribution for a single feature vector and print it to stdout. (Typically, the '-calibrate' switch should be used when training the model. If the model is not calibrated, then the predicted distribution may not be a very good estimated distribution. Also, some models cannot be used to predict a distribution.) <options> -seed [value] Specify a seed for the random number generator. (Use this option to ensure that your results are reproduceable.) [model-file] The filename of a trained model. (This is the file to which you saved the output when you trained a supervised learning algorithm.) [pattern] A list of feature values separated by spaces. (A "?" may be used for unknown feature values if the model supports using unknown feature values.) <data_opts> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. test <options> [model-file] [dataset] <data_opts> Test a trained model using some test data. Results are printed to stdout for each dimension in the label vector. Predictive accuracy is reported for nominal label dimensions, and mean-squared-error is reported for continuous label dimensions. <options> -seed [value] Specify a seed for the random number generator. (Use this option to ensure that your results are reproduceable.) -confusion Print a confusion matrix for each nominal label attribute. -confusioncsv Print a confusion matrix in comma-separated value format for each nominal label attribute. [model-file] The filename of a trained model. (This is the file to which you saved the output when you trained a supervised learning algorithm.) [dataset] The filename of a test dataset. (This dataset must have the same number of columns as the dataset with which the model was trained.) <data_opts> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. transduce <options> [labeled-set] <data_opts1> [unlabeled-set] <data_opts2> [algorithm] Predict labels for [unlabeled-set] based on the examples in [labeled-set]. For most algorithms, this is the same as training on [labeled-set] and then predicting labels for [unlabeled-set]. Some algorithms, however, have no models. These can transduce, even though they cannot be trained. The predicted labels are printed to stdout as a ".arff" file. <options> -seed [value] Specify a seed for the random number generator. (Use this option to ensure that your results are reproduceable.) [labeled-set] The filename of a dataset. The labels in this dataset are used to infer labels for the unlabeled set. [unlabeled-set] The filename of a dataset. This dataset must have placeholder labels, but these will be ignored when predicting new labels. <data_opts1> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. <data_opts2> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. transacc <options> [training-set] <data_opts1> [test-set] <data_opts2> [algorithm] Measure the transductive accuracy of [algorithm] with respect to the specified training and test sets. Results are printed to stdout for each dimension in the label vector. Predictive accuracy is reported for nominal labels, and mean-squared-error is reported for continuous labels. <options> -seed [value] Specify a seed for the random number generator. (Use this option to ensure that your results are reproduceable.) [training-set] The filename of a dataset. The labels in this dataset are used to infer labels for the unlabeled set. [test-set] The filename of a dataset. This dataset must have placeholder labels. The placeholder labels will be replaced in the output with the new predicted labels. <data_opts1> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. <data_opts2> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. splittest <options> [dataset] <data_opts> [algorithm] This shuffles the data, then splits it into two parts, trains with one part, and tests with the other. (This also works with model-free algorithms.) Results are printed to stdout for each dimension in the label vector. Predictive accuracy is reported for nominal labels, and mean-squared-error is reported for continuous labels. <options> -seed [value] Specify a seed for the random number generator. (Use this option to ensure that your results are reproduceable.) -trainratio [value] Specify the amount of the data (between 0 and 1) to use for training. The rest will be used for testing. -reps [value] Specify the number of repetitions to perform. If not specified, the default is 1. -writelastmodel [filename] Write the model generated on the last repetion to the given filename. Note that this only works when the learner being used has an internal model. [dataset] The filename of a dataset. <data_opts> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. crossvalidate <options> [dataset] <data_opts> [algorithm] Perform cross-validation with the specified dataset and algorithm. Results are printed to stdout. <options> -seed [value] Specify a seed for the random number generator. (Use this option to ensure that your results are reproduceable.) -reps [value] Specify the number of repetitions to perform. If not specified, the default is 5. -folds [value] Specify the number of folds to use. If not specified, the default is 2. -succinct Just report the average mean squared error. Do not report results at each fold. [dataset] The filename of a dataset. <data_opts> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. precisionrecall <options> [dataset] <data_opts> [algorithm] Compute the precision/recall for a dataset and algorithm <options> -seed [value] Specify a seed for the random number generator. (Use this option to ensure that your results are reproduceable.) -labeldims [n] Specify the number of dimensions in the label (output) vector. The default is 1. (Don't confuse this with the number of class labels. It only takes one dimension to specify a class label, even if there are k possible labels.) -reps [n] Specify the number of reps to perform. More reps means it will take longer, but results will be more accurate. The default is 5. -samples [n] Specify the granularity at which to measure recall. If not specified, the default is 100. [dataset] The filename of a dataset. <data_opts> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. sterilize <options> [dataset] <data_opts> [algorithm] Perform cross-validation to generate a new dataset that contains only the correctly-classified instances. The new sterilized data is printed to stdout. <options> -seed [value] Specify a seed for the random number generator. (Use this option to ensure that your results are reproduceable.) -folds [n] Specify the number of cross-validation folds to perform. -diffthresh [d] Specify a threshold of absolute difference for continuous labels. Predictions with an absolute difference less than this threshold are considered to be "correct". [dataset] The filename of a dataset to sterilize. <data_opts> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. trainrecurrent <options> [method] [obs-data] [action-data] [context-dims] [algorithm] [algorithm] Train a recurrent model of a dynamical system with the specified training [method]. The training data is specified by [obs-data], which specifies the sequence of observations, and [action-data], which specifies the sequence of actions. [context-dims] specifies the number of dimensions in the state-space of the system. The two algorithms specify the two functions of a model of a dynamical system. The first [algorithm] models the transition function. The second [algorithm] models the observation function. <options> -seed [value] Specify a seed for the random number generator. (Use this option to ensure that your results are reproduceable.) -paramdims 2 [wid] [hgt] If observations are images, use this option to parameterize the predictions, so only the channel values of each pixel are predicted. (Other values besides 2 dimensions are also supported.) -state [filename] Save the estimated state to the specified file. (Only has effect if moses is used as the training method.) -validate [interval] 1 [obs] [action] Perform validation at [interval]-second intervals with observation data, [obs], and action data, [action]. (Also supports more than 1 validation sequence if desired.) -out [filename] Save the resulting model to the specified file. If not speicified, the default is "model.json". -noblur Do not use blurring. The default is to use blurring. Sometimes blurring improves results. Sometimes not. -traintime [seconds] Specify how many seconds to train the model. The default is 3600, which is 1 hour. -isomap Use Isomap instead of Breadth-first Unfolding if moses is used as the training method. [method] moses Use Temporal-NLDR to estimate state, then build the model using the state estimate. bptt [depth] [iters-per-grow-sequence] Backpropagation Through Time. [depth] specifies the number of instances of the transition function that will appear in the unfolded model. A good value might be 3. [iters-per-grow-sequence] specifies the number of pattern presentations before the sequence is incremented. A good value might be 50000. evolutionary Train with evoluationary optimization hillclimber Train with a hill-climbing algorithm. annealing [deviation] [decay] [window] Train with simulated annealing. Good values might be 2.0 0.5 300 regress [data] <data_opts> [equation] Use a hill climbing algorithm to optimize the parameters of [equation] to fit to the [data]. If [data] has d feature dimensions, then [equation] must have more than d parameters. The equation must be named f. The first d arguments to f are supplied by the data features. The remaining arguments are optimized by the hill climber. The data must have exactly 1 label dimension, which the equation will attempt to predict. The sum-squared error and parameter values are printed to stdout. [dataset] The filename of a dataset. <data_opts> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. [equation] An equation to regress to fit the data. The equation must be named 'f'. It can call helper-equations, separated by semicolons, if needed. Example: "f(x1,x2,p1,p2,p3)=s(p1*x1+p2*x2+p3);s(x)=1/(1+e^(-x))" metadata [data] <data_opts> Generate a vector of metadata values for the given dataset. This might be useful for meta-analysis. The resulting vector is printed to stdout in ARFF format. usage Print usage information. [algorithm] A supervised learning algorithm, or a transductive algorithm. agglomerativetransducer A model-free transduction algorithm based on single-link agglomerative clustering. Unlabeled patterns take the label of the cluster with which they are joined. It never joins clusters with different labels. bag <contents> end A bagging (bootstrap aggregating) ensemble. This is a way to combine the power of many learning algorithms through voting. "end" marks the end of the ensemble contents. Each algorithm instance is trained using a training set created by drawing (with replacement) from the original data until the training set has the same number of instances as the original data. <contents> [instance_count] [algorithm] Specify the number of instances of a learning algorithm to add to the bagging ensemble. baseline This is one of the simplest of all supervised algorithms. It ignores all features. For nominal labels, it always predicts the most common class in the training set. For continuous labels, it always predicts the mean label in the training set. An effective learning algorithm should never do worse than baseline--hence the name "baseline". bucket <contents> end This uses cross-validation with the training set to select the best model from a bucket of models. When accuracy is measured across multiple datasets, it will usually do better than the best model in the bucket could do. "end" marks the end of the contents of the bucket. <contents> [algorithm] Add an algorithm to the bucket bma <contents> end A Bayesian model averaging ensemble. This trains each model after the manner of bagging, but then combines them weighted according to their probability given the data. Uniform priors are assumed. <contents> [instance_count] [algorithm] Specify the number of instances of a learning algorithm to add to the BMA ensemble. bmc <options> <contents> end A Bayesian model combination ensemble. This algorithm is described in Monteith, Kristine and Carroll, James and Seppi, Kevin and Martinez, Tony, Turning Bayesian Model Averaging into Bayesian Model Combination, Proceedings of the IEEE International Joint Conference on Neural Networks IJCNN'11, 2657--2663, 2011. <options> -samples [n] Specify the number of samples to draw from the simplex of possible ensemble combinations. (Larger values result in better accuracy with the cost of more computation.) <contents> [instance_count] [algorithm] Specify the number of instances of a learning algorithm to add to the BMA ensemble. boost <options> [algorithm] Uses ResamplingAdaBoost to create an ensemble that may be more accurate than a lone instance of the specified algorithm. (ResamplingAdaBoost is similar to AdaBoost, except that it uses resampling to approximate weighted instances in the training set. This difference enables it to work with algorithms that do not implicitly support weighted samples.) <options> -trainratio [value] When approximating the weighted training set by resampling, use a sample of size [value]*training_set_size -size [n] The number of base learners to use in the ensemble. cvdt [n] This is a bucket of two bagging ensembles: one with [n] entropy-reducing decision trees, and one with [n] meanmarginstrees. (This algorithm is specified in Gashler, Michael S. and Giraud-Carrier, Christophe and Martinez, Tony. Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous. In The Seventh International Conference on Machine Learning and Applications, Pages 900 - 905, ICMLA '08. 2008) decisiontree <options> A decision tree. <options> -autotune Automatically determine a good set of parameters for this model with the current data. -random [draws] Use random divisions (instead of divisions that reduce entropy). Random divisions make the algorithm train faster, and also increase model variance, so it is better suited for ensembles, but random divisions also make the decision tree more vulnerable to problems with irrelevant features. [draws] is typically 1, but if you specify a larger value, it will pick the best out of the specified number of random draws. -binary Use binary divisions. For nominal attributes with more than 2 categorical values, one specific value will be separated from all others at each division. -leafthresh [n] When building the tree, if the number of samples is <= this value, it will stop trying to divide the data and will create a leaf node. The default value is 1. For noisy data, larger values may be advantageous. -maxlevels [n] When building the tree, if the depth (the length of the path from the root to the node currently being formed, including the root and the currently forming node) is [n], it will stop trying to divide the data and will create a leaf node. This means that there will be at most [n]-1 splits before a decision is made. This crudely limits overfitting, and so can be helpful on small data sets. It can also make the resulting trees easier to interpret. If set to 0, then there is no maximum (which is the default). gaussianprocess A Gaussian process model. <options> -noise [var] The variance of the noise parameter. -prior [var] The prior variance for the weights. (This value will be multiplied by an identity matrix to form the prior covariance for the weights. -maxsamples [n] The maximum number of samples to train with. (If the training data contains more than [n] rows, then it will automatically randomly sub-sample the training data in order to limit computational complexity.) -kernel [k] Specify the kernel to use identity This simple kernel causes it to learn a linear model. If no kernel is specified, this is the default. chisquared A Chi Squared kernel. rbf [var] A Gaussian RBF kernel. [var] specifies the variance term for this kernel. Larger values result in a smoother model. polynomial [ofs] [order] A polynomial kernel. [ofs] An offset value. [order] The order of the polynomial. graphcuttransducer <options> This is a model-free transduction algorithm. It uses a min-cut/max-flow graph-cut algorithm to separate each label from all of the others. <options> -autotune Automatically determine a good set of parameters for this model with the current data. -neighbors [k] Set the number of neighbors to connect with each point in order to form the graph. hodgepodge This is a ready-made ensemble of various unrelated learning algorithms. knn <options> The k-Nearest-Neighbor instance-based learning algorithm. It uses Euclidean distance for continuous features and Hamming distance for nominal features. <options> -autotune Automatically determine a good set of parameters for this model with the current data. -neighbors [k] Specify the number of neighbors, k, to use. -nonormalize Specify not to normalize the scale of continuous features. (The default is to normalize by dividing by 2 times the deviation in that attribute.) -equalweight Give equal weight to every neighbor. (The default is to use linear weighting for continuous features, and sqared linear weighting for nominal features. -scalefeatures Use a hill-climbing algorithm on the training set to scale the feature dimensions in order to give more accurate results. This increases training time, but also improves accuracy and robustness to irrelevant features. -pearson Use Pearson's correlation coefficient to evaluate the similarity between sparse vectors. (Only compatible with sparse training.) -cosine Use the cosine method to evaluate the similarity between sparse vectors. (Only compatible with sparse training.) linear A linear regression model meanmarginstree This is a very simple oblique (or linear combination) tree. (This algorithm is specified in Gashler, Michael S. and Giraud-Carrier, Christophe and Martinez, Tony. Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous. In The Seventh International Conference on Machine Learning and Applications, Pages 900 - 905, ICMLA '08. 2008) naivebayes <options> The naive Bayes learning algorithm. <options> -autotune Automatically determine a good set of parameters for this model with the current data. -ess [value] Specifies an equivalent sample size to prevent unsampled values from dominating the joint distribution. Good values typically range between 0 and 1.5. naiveinstance <options> This is an instance learner that assumes each dimension is conditionally independant from other dimensions. It lacks the accuracy of knn in low dimensional feature space, but scales much better to high dimensionality. <options> -autotune Automatically determine a good set of parameters for this model with the current data. -neighbors [k] Set the number of neighbors to use in each dimension neighbortransducer <options> This is a model-free transduction algorithm. It is an instance learner that propagates labels where the neighbors are most in agreement. This algorithm does well when classes sample a manifold (such as with text recognition). <options> -autotune Automatically determine a good set of parameters for this model with the current data. -neighbors [k] Set the number of neighbors to use with each point neuralnet <options> A single or multi-layer feed-forward neural network (a.k.a. multi-layer perceptron). It can be trained with online backpropagation (Rumelhart, D.E., Hinton, G.E., and Williams, R.J. Learning representations by back-propagating errors. Nature, 323:9, 1986.), or several other optimization methods. <options> -autotune Automatically determine a good set of parameters (including number of layers, hidden units, learning rate, etc.) for this model with the current data. -addlayer [size] Add a hidden layer with "size" logisitic units to the network. You may use this option multiple times to add multiple layers. The first layer added is adjacent to the input features. The last layer added is adjacent to the output labels. If you don't add any hidden layers, the network is just a single layer of sigmoid units. -learningrate [value] Specify a value for the learning rate. The default is 0.1 -momentum [value] Specifies a value for the momentum. The default is 0.0 -windowepochs [value] Specifies the number of training epochs that are performed before the stopping criteria is tested again. Bigger values will result in a more stable stopping criteria. Smaller values will check the stopping criteria more frequently. -minwindowimprovement [value] Specify the minimum improvement that must occur over the window of epochs for training to continue. [value] specifies the minimum decrease in error as a ratio. For example, if value is 0.02, then training will stop when the mean squared error does not decrease by two percent over the window of epochs. Smaller values will typically result in longer training times. -holdout [portion] Specify the portion of the data (between 0 and 1) to use as a hold-out set for validation. That is, this portion of the data will not be used for training, but will be used to determine when to stop training. If the holdout portion is set to 0, then no holdout set will be used, and the entire training set will be used for validation (which may lead to long training time and overfit). randomforest [trees] <options> A baggging ensemble of decision trees that use random division boundaries. (This algorithm is described in Breiman, Leo (2001). Random Forests. Machine Learning 45 (1): 5-32. doi:10.1023/A:1010933404324.) [trees] Specify the number of trees in the random forest <options> -samples [n] Specify the number of randomly-drawn attributes to evaluate. The one that maximizes information gain will be chosen for the decision boundary. If [n] is 1, then the divisions are completely random. Larger values will decrease the randomness. reservoir <options> A reservoir network. <options> -augments [d] The number of dimensions to augment the data with. (Smaller values lead to smoother models.) -deviation [dev] The deviation to use to randomly initialize the weights in the reservoir. -layers [n] The number of hidden layers to use in the reservoir. wag <options> A multi-layer perceptron (MLP) that is trained by first training several MLP models, and then averaging their weights together using a process called wagging. (Before the weights in hidden layers can be averaged, they are first aligned using bipartite matching.) <options> -addlayer [size] Add a hidden layer with "size" logisitic units to the network. You may use this option multiple times to add multiple layers. The first layer added is adjacent to the input features. The last layer added is adjacent to the output labels. If you don't add any hidden layers, the network is just a single layer of sigmoid units. -learningrate [value] Specify a value for the learning rate. The default is 0.1 -models [k] Specify the number of MLP models to train and then average together. -momentum [value] Specifies a value for the momentum. The default is 0.0 -windowepochs [value] Specifies the number of training epochs that are performed before the stopping criteria is tested again. Bigger values will result in a more stable stopping criteria. Smaller values will check the stopping criteria more frequently. -minwindowimprovement [value] Specify the minimum improvement that must occur over the window of epochs for training to continue. [value] specifies the minimum decrease in error as a ratio. For example, if value is 0.02, then training will stop when the mean squared error does not decrease by two percent over the window of epochs. Smaller values will typically result in longer training times. -noalign Specify to compute weight averages without first aligning the corresponding weights. This option will typically make results significantly worse, but it may be useful for evaluating the value of aligning the weights before averaging them together. -holdout [portion] Specify the portion of the data (between 0 and 1) to use as a hold-out set for validation. That is, this portion of the data will not be used for training, but will be used to determine when to stop training. If the holdout portion is set to 0, then no holdout set will be used, and the entire training set will be used for validation (which may lead to long training time and overfit). -dontsquashoutputs Don't squash the outputs values with the logistic function. Just report the net value at the output layer. This is often used for regression. usage Print usage information. Previous Next Back to the table of contents |