waffles_sparse - A command-line tool for learning from and operating on sparse data, such as is typically used to represent text documents. Here is the usage information:

Full Usage Information
[Square brackets] are used to indicate required arguments.
<Angled brackets> are used to indicate optional arguments.

waffles_recommend [command]
   Predict missing values in data, and test collaborative-filtering recommendation systems.
   crossvalidate <options> [3col-data] [collab-filter]
      Measure accuracy using cross-validation. Prints MSE and MAE to stdout.
      <options>
         -seed [value]
            Specify a seed for the random number generator.
         -folds [n]
            Specify the number of folds. If not specified, the default is 2.
      [3col-data]
         The filename of a 3-column (user, item, rating) dataset. Column 0 contains a user ID. Column 1 contains an item ID. Column 2 contains the known rating for that user-item pair.
   fillmissingvalues <options> [data] [collab-filter]
      Fill in the missing values in an ARFF file with predicted values and print the resulting full dataset to stdout. ([data] is in full users*items or patterns*attributes format, not the 3-column format used by the other commands.)
      <options>
         -seed [value]
            Specify a seed for the random number generator.
         -nonormalize
            Do not normalize all of the columns to fall between 0 and 1 before imputing the missing values. (The default is to normalize first.)
      [data]
         The filename of a dataset with missing values to impute.
   precisionrecall <options> [3col-data] [collab-filter]
      Compute precision-recall data.
      <options>
         -seed [value]
            Specify a seed for the random number generator.
         -ideal
            Ignore the model and compute ideal results (as if the model always predicted correct ratings).
      [3col-data]
         The filename of a 3-column (user, item, rating) dataset. Column 0 contains a user ID. Column 1 contains an item ID. Column 2 contains the known rating for that user-item pair.
   roc <options> [3col-data] [collab-filter]
      Compute data for an ROC curve. (The area under the curve will appear in the comments at the top of the data.)
      <options>
         -seed [value]
            Specify a seed for the random number generator.
         -ideal
            Ignore the model and compute ideal results (as if the model always predicted correct ratings).
      [3col-data]
         The filename of a 3-column (user, item, rating) dataset. Column 0 contains a user ID. Column 1 contains an item ID. Column 2 contains the known rating for that user-item pair.
   transacc <options> [train] [test] [collab-filter]
      Train using [train], then test using [test]. Prints MSE and MAE to stdout.
      <options>
         -seed [value]
            Specify a seed for the random number generator.
      [train]
         The filename of a 3-column (user, item, rating) dataset with one row for each rating. Column 0 contains a user ID. Column 1 contains an item ID. Column 2 contains the known rating for that user-item pair.
      [test]
         The filename of a 3-column (user, item, rating) dataset with one row for each rating. Column 0 contains a user ID. Column 1 contains an item ID. Column 2 contains the known rating for that user-item pair.
   usage
      Print usage information.
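For example, assuming a hypothetical 3-column ratings file named ratings.arff and an ARFF file with missing values named incomplete.arff (both filenames are illustrative, not part of the tool), the commands above might be invoked like this:

   waffles_recommend crossvalidate -folds 5 ratings.arff baseline
   waffles_recommend fillmissingvalues incomplete.arff instance 8

Here, "baseline" and "instance 8" name two of the collaborative-filtering algorithms described below.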
   [collab-filter]
      A collaborative-filtering recommendation algorithm.
      bag <contents> end
         A bagging (bootstrap aggregating) ensemble. This is a way to combine the power of collaborative-filtering algorithms through voting. "end" marks the end of the ensemble contents. Each collaborative-filtering algorithm instance is trained on a subset of the original data, where each expressed element is given a probability of 0.5 of occurring in the training set.
         <contents>
            [instance_count] [collab-filter]
               Specify the number of instances of a collaborative-filtering algorithm to add to the bagging ensemble.
      baseline
         A very simple recommendation algorithm. It always predicts the average rating for each item. This algorithm is useful as a baseline for comparison.
      clusterdense [n] <options>
         A collaborative-filtering algorithm that clusters users with k-means based on a dense distance metric, and then makes uniform recommendations within each cluster.
         [n]
            The number of clusters to use.
         <options>
            -norm [l]
               Specify the norm (the value of L) for the L-norm distance metric to use.
            -missingpenalty [d]
               Specify the difference to use in the distance computation when a value is missing from one or both of the vectors.
      clustersparse [n] <options>
         A collaborative-filtering algorithm that clusters users with k-means based on a sparse similarity metric, and then makes uniform recommendations within each cluster.
         [n]
            The number of clusters to use.
         <options>
            -pearson
               Use Pearson correlation to compute the similarity between users. (The default is to use the cosine method.)
      instance [k] <options>
         An instance-based collaborative-filtering algorithm that makes recommendations based on the k-nearest neighbors of a user.
         [k]
            The number of neighbors to use.
         <options>
            -pearson
               Use Pearson correlation to compute the similarity between users. (The default is to use the cosine method.)
            -regularize [value]
               Add [value] to the denominator in order to regularize the results. This ensures that recommendations will not be dominated when only a small number of overlapping items occurs. Typically, [value] will be a small number, like 0.5 or 1.5.
            -sigWeight [value]
               Scale the significance weighting of the items based on how many items two users have both rated. The default value of 0 indicates that no significance weighting will be done. The significance is scaled as numItemsRatedByBothUsers/sigWeight.
      matrix [intrinsic] <options>
         A matrix-factorization collaborative-filtering algorithm. (Implemented according to the specification on page 631 of Takacs, G., Pilaszy, I., Nemeth, B., and Tikk, D., Scalable collaborative filtering approaches for large recommender systems, The Journal of Machine Learning Research, 10:623-656, 2009, ISSN 1532-4435, except with the addition of learning-rate decay and a different stopping criterion.)
         [intrinsic]
            The number of intrinsic (or latent) feature dimensions to use to represent each user's preferences.
         <options>
            -regularize [value]
               Specify a regularization value. Typically, this is a small value. Larger values will put more pressure on the system to use small values in the matrix factors.
            -miniters [value]
               Specify the minimum number of iterations to train the model before checking its validation error. This ensures that the model does at least a certain amount of training before converging.
            -decayrate [value]
               Specify a decay rate in the range (0,1) for the learning-rate parameter. Values closer to 1 cause the learning rate to decay more slowly, while values closer to 0 cause it to decay faster.
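As a rough sketch (reusing the hypothetical ratings.arff from above), a matrix-factorization model or a bagging ensemble of instance-based filters might be evaluated like this:

   waffles_recommend crossvalidate ratings.arff matrix 6 -regularize 0.01
   waffles_recommend crossvalidate ratings.arff bag 10 instance 20 end

The second command bags 10 instances of the instance-based filter, each using 20 neighbors.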
      nlpca [intrinsic] <options>
         A non-linear PCA collaborative-filtering algorithm. This algorithm was published in Scholz, M., Kaplan, F., Guy, C. L., Kopka, J., and Selbig, J., Non-linear PCA: a missing data approach, Bioinformatics, Vol. 21, Number 20, pp. 3887-3895, Oxford University Press, 2005. It uses a generalization of backpropagation to train a multi-layer perceptron to fit the known ratings and to predict unknown values.
         [intrinsic]
            The number of intrinsic (or latent) feature dimensions to use to represent each user's preferences.
         <options>
            -addlayer [size]
               Add a hidden layer with "size" logistic units to the network. You may use this option multiple times to add multiple layers. The first layer added is adjacent to the input features. The last layer added is adjacent to the output labels. If you don't add any hidden layers, the network is just a single layer of sigmoid units.
            -learningrate [value]
               Specify a value for the learning rate. The default is 0.1.
            -momentum [value]
               Specify a value for the momentum. The default is 0.0.
            -windowepochs [value]
               Specify the number of training epochs that are performed before the stopping criterion is tested again. Bigger values will result in a more stable stopping criterion. Smaller values will check the stopping criterion more frequently.
            -minwindowimprovement [value]
               Specify the minimum improvement that must occur over the window of epochs for training to continue. [value] specifies the minimum decrease in error as a ratio. For example, if value is 0.02, then training will stop when the mean squared error does not decrease by two percent over the window of epochs. Smaller values will typically result in longer training times.
            -dontsquashoutputs
               Don't squash the output values with the logistic function. Just report the net value at the output layer. This is often used for regression.
            -noinputbias
               Do not use an input bias.
            -nothreepass
               Use one-pass training instead of three-pass training.
            -regularize [value]
               Specify a regularization value. Typically, this is a small value. Larger values will put more pressure on the system to use small values in the matrix factors. Note that this is only used if three-pass training is being used and there is at least one hidden layer.
            -miniters [value]
               Specify the minimum number of iterations to train the model before checking its validation error. This ensures that the model does at least a certain amount of training before converging.
            -decayrate [value]
               Specify a decay rate in the range (0,1) for the learning-rate parameter. Values closer to 1 cause the learning rate to decay more slowly, while values closer to 0 cause it to decay faster.
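A hedged example of an nlpca invocation (again using the hypothetical incomplete.arff file) might look like:

   waffles_recommend fillmissingvalues incomplete.arff nlpca 4 -addlayer 16 -learningrate 0.05

This uses 4 intrinsic dimensions and a single hidden layer of 16 logistic units.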
For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. <options> -addlayer [size] Add a hidden layer with "size" logisitic units to the network. You may use this option multiple times to add multiple layers. The first layer added is adjacent to the input features. The last layer added is adjacent to the output labels. If you don't add any hidden layers, the network is just a single layer of sigmoid units. -learningrate [value] Specify a value for the learning rate. The default is 0.1 -momentum [value] Specifies a value for the momentum. The default is 0.0 -windowepochs [value] Specifies the number of training epochs that are performed before the stopping criteria is tested again. Bigger values will result in a more stable stopping criteria. Smaller values will check the stopping criteria more frequently. -minwindowimprovement [value] Specify the minimum improvement that must occur over the window of epochs for training to continue. [value] specifies the minimum decrease in error as a ratio. For example, if value is 0.02, then training will stop when the mean squared error does not decrease by two percent over the window of epochs. Smaller values will typically result in longer training times. -dontsquashoutputs Don't squash the outputs values with the logistic function. Just report the net value at the output layer. This is often used for regression. -crossentropy Use cross-entropy instead of squared-error for the error signal. -noinputbias Do not use an input bias. -nothreepass Use one-pass training instead of three-pass training. -regularize [value] Specify a regularization value. Typically, this is a small value. Larger values will put more pressure on the system to use small weight values. Note that is only used if three-pass training is being used and there is at least on hidden layer. -miniters [value] Specify a the minimum number of iterations to train the model before checking its validation error. This ensures that model does at least a certain amount of training before converging. -decayrate [value] Specify a decay rate in the range of (0-1) for the learning rate parameter. Value closer to 1 will cause the rate the decay slower while rate closer to 0 cause the a faster decay. contentbased [item_dataset] <data_opts> [learning_algorithm] <learning_opts> A content-based filter. A content-based recommendation filter is build using the supervised learning algorithms provided in the Waffles toolkit. [items_dataset] <data_opts> The dataset representing the item attributes. It is assumed that the item dataset matrix is in the form of item id followed by the attribute values for each item. It assumes that the item corresponds with the first column in the 3-col data. <data_opts> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. 
For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. [learning_algorithm] <learning_opts> See the usage statement for the desired learning algorithm using "waffles_learn usage". cbcf [item_dataset] <data_opts> [learning_algorithm] <learning_opts> -- [k] <inst_options> A content-boosted collaborative filter. This algorithm was published in P. Melville, R. Mooney, and R. Nagarajan, Content-Boosted Collaborative Filtering for Improved Recommendations, in Proceedings of the 18th National Conference on Artificial Intelligence (AAAI-02), pp. 187-192, 2002. It uses a content-based filter to fill in the sparse matrix before giving it to a collaborative filter. We followed the Author's implementation and used an instance-based collaborative filter. Note that this algorithm often takes a while to run. [items_dataset] <data_opts> The dataset representing the item attributes. It is assumed that the item dataset matrix is in the form of item id followed by the attribute values for each item. It assumes that the item corresponds with the first column in the 3-col data. <data_opts> -labels [attr_list] Specify which attributes to use as labels. (If not specified, the default is to use the last attribute for the label.) [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. -ignore [attr_list] Specify attributes to ignore. [attr_list] is a comma-separated list of zero-indexed columns. A hypen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column. [learning_algorithm] <learning_opts> See the usage statement for the desired learning algorithm using "waffles_learn usage". -- Denotes the ending of the learning algorithm parameters and the parameters for the collaborative filter. [k] The number of neighbors to use. <inst_options> -pearson Use Pearson Correlation to compute the similarity between users. (The default is to use the cosine method.) -regularize [value] Add [value] to the denominator in order to regularize the results. This ensures that recommendations will not be dominated when a small number of overlapping items occurs. Typically, [value] will be a small number, like 0.5 or 1.5. -sigWeight [value] Scale the significane weighting of the items based on how many items two users have rated. The default value of 0 indicates the no significance weightig will be done. The significance is scaled as numItemsRatedByBotheUSers/sigWeight. Previous Next Back to the table of contents |