Wrapped Progressive Sampling Search for Optimizing Learning Algorithm Parameters

Antal van den Bosch
ILK / Computational Linguistics and AI, Tilburg University
P.O. Box 90153, NL-5000 LE Tilburg, The Netherlands

Abstract

We present a heuristic meta-learning search method for finding a set of optimized algorithmic parameters for a range of machine learning algorithms. The method, wrapped progressive sampling, is a combination of classifier wrapping and progressive sampling of training data. A series of experiments on UCI benchmark data sets with nominal features, in which simple wrapping and wrapped progressive sampling are applied to five machine learning algorithms, yields results that show little improvement for the algorithm that offers few parameter variations, but marked improvements for the algorithms offering many testable parameter combinations, up to a 32.2% error reduction with the winnow learning algorithm.

1 Introduction

It is common knowledge that large changes can be observed in the generalization accuracy of a machine learning algorithm on some task when one or more of its algorithmic parameters are given a non-default value. The fundamental problem in algorithmic parameter selection (or model selection) is that it is hard to estimate which parameter setting will lead to optimal generalization performance on new data. One can estimate it on the labeled data available for training purposes, but optimizing parameters on those data easily leads to overfitting. A remedy for overfitting is classifier wrapping [7], which partitions the available labeled training material into internal training and test data, and which performs cross-validation experiments to estimate a training-set-internal generalization accuracy. Using this method, competitions can be held among parameter settings to determine the average best-performing setting, which is then used in the experiment on the real test data.

For many tasks and algorithms it is not feasible to test all possible combinations of parameter settings and feature selections exhaustively. By opening up a vast search space, exhaustive wrapping moves the burden of the optimization process to raw computing power. Search methods, on the other hand, can offer heuristic ways of finding sufficiently good solutions in a large search space. This paper describes a method for optimizing algorithmic parameters based on wrapped progressive sampling, a heuristic search method we describe in detail in the next section. The method is based on classifier wrapping, but it does not venture into searching among all possible combinations of parameter settings exhaustively on all available training data. Rather, it borrows a heuristic from progressive sampling [10].
The goal of progressive sampling is to iteratively seek a data set size at which generalization performance on test material converges. The method starts with a small training set and progressively increases it while monitoring convergence, halting the process at the training set size at which the error on test material has converged. In this study we do not actively monitor convergence, but we do adopt the progressive sampling scheme, in which we test decreasing numbers of setting combinations with increasing amounts of training data.

2 Wrapped progressive sampling

The wrapped progressive sampling (henceforth wps) algorithm takes as input a data set of labeled examples D. An example is represented by a vector of feature values and is labeled with one out of a set of possible output labels. This data set is used as the basis for sampling smaller training and test sets. In the following subsections we detail this process. We start by describing how the sizes of the progressively sampled training and test sets are computed in Subsection 2.1. We then describe the wps process in Subsection 2.2.

2.1 Determining progressive sample sizes

The first action of the wps method is to determine the progressive sizes of the wrapping data set samples that will be needed. Both training sets and test sets are needed; a single cut is made of the original data (after it has been shuffled randomly), assigning 80% of the examples to internal training data and 20% to internal test data. We generate a frequency-clipped pseudo-quadratic series according to the following three-step procedure. First, let n be the number of labeled examples in the 80% part of the data set designated as internal training material. A quadratic sequence of d data set sizes is created using a growth factor f = n^(1/d), the d-th root of n; in all experiments, d = 20. Starting with a seed data set of one example, a sequence of i = {1 ... d} data sets with sizes size_i is created by letting size_1 = 1 and, for every subsequent i, size_i = size_{i-1} * f. We then limit the generated list of 20 sizes to the data sets with more than 500 examples, and we include a 500-example data set itself as the first set. This leaves the clipped pseudo-quadratic series. For each of the training sets, an accompanying test set is created by taking, from the tail of the 20% compartment of the full data set designated as test material, a set that has 20% of the size of its corresponding training set.
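For concreteness, the following minimal Python sketch computes the clipped pseudo-quadratic series of training set sizes and their accompanying test set sizes as described above. It is our own illustration, not the original implementation; the function name and the integer rounding are assumptions.

```python
def progressive_sample_sizes(n_train, d=20, floor=500):
    """Clipped pseudo-quadratic series of (train_size, test_size) pairs.

    n_train: number of examples in the 80% internal-training portion.
    Each test set is 20% of the size of its training set.
    """
    f = n_train ** (1.0 / d)          # growth factor: the d-th root of n
    sizes, size = [], 1.0
    for _ in range(d):                # size_1 = 1, size_i = size_{i-1} * f
        sizes.append(int(round(size)))
        size *= f
    # keep only sizes above the 500-example floor, prepending the floor itself
    clipped = [floor] + [s for s in sizes if s > floor]
    return [(s, int(round(0.2 * s))) for s in clipped]

# Example: 10,000 internal training examples yield training sizes 500, 631, 1000, ...
for train_size, test_size in progressive_sample_sizes(10_000):
    print(train_size, test_size)
```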
2.2 The wrapped progressive sampling procedure

The wps procedure is an iterative procedure over the clipped list of data set sizes. It operates on a pool of settings S, where one setting is a unique combination of algorithmic parameter values of the chosen machine learning algorithm A (next to the data set D, A is given as input to the wps procedure). At the outset, S contains all possible combinations of values of algorithm A's parameters. We refer to them as s_1 ... s_c, c being the total number of possible combinations.

The first step of wps is to perform experiments with all settings s_1 ... s_c. Each of these experiments involves training algorithm A on the first training set (500 examples), testing the learned model on the first test set (100 examples), and measuring A's test accuracy, viz. the percentage of correctly classified test examples. This produces a list of accuracies, acc(s_1) ... acc(s_c).

As the second step, badly performing settings are removed from the current set on the grounds of their low score. Any such selection should be performed with some care, since it is unknown whether a currently badly performing setting would perform relatively better than other settings when trained on more examples. Wps therefore does not simply sort the list of accuracies and cut away the lower-performing half or some other predefined fraction. Rather, at each step it attempts to estimate the subset of accuracies that stands out as the best-performing group, whichever portion of the total set of accuracies that is. To this end, a simple linear histogram is computed over all accuracies, dividing them into ten equal-width bins, b_1 ... b_10 (we write |b_i| for the size of a bin, i.e. the number of accuracies it contains). Without assuming any distribution over the bins, wps enacts the following procedure to determine which settings are selected for the next step. The procedure produces a selection of bins, which in turn represents the set of settings contained in those bins. First, the bin with the highest accuracies is taken as the first selected bin. Subsequently, every preceding bin is also selected as long as it holds at least as many settings as its successor, |b_i| >= |b_{i+1}|; this is determined in a loop that halts as soon as |b_i| < |b_{i+1}|.

Next, all non-selected settings are deleted from S, and the next step is initiated. This involves discarding the current training and test sets and replacing them by their next-step progressively sampled versions. On this bigger training and test set combination, all settings in S are again tested through experiments, a histogram is computed over the outcomes, and so on. The process iterates until one of two stop conditions is met: (1) after the most recent setting selection, only one setting is left; even if more training set sizes are available, these are not used, the search halts, and the one selected setting is returned. Or (2) after the last setting selection, on the basis of experiments with the largest training and test set sizes, several settings are still selected. In that case it is first checked whether A's default setting is among them (we discuss default settings in the next section); if it is, the default setting is returned. If not, a random choice is made among the selected settings, and the randomly chosen setting is returned.
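The bin-based selection step can be rendered in a few lines of Python. The sketch below is our own reading of the description above, not the original implementation; in particular, the function name, the dictionary input, and binning the observed accuracy range into ten equal-width bins are assumptions.

```python
def select_settings(accuracies):
    """One wps selection step: keep the settings in the best-performing
    group of histogram bins.

    accuracies: dict mapping each setting (e.g. a tuple of parameter
    values) to its test accuracy on the current sample.
    Returns the surviving subset of settings.
    """
    values = list(accuracies.values())
    lo, hi = min(values), max(values)
    width = (hi - lo) / 10.0 or 1.0       # guard against all-equal accuracies
    # bins[0] holds the lowest accuracies, bins[9] the highest
    bins = [[] for _ in range(10)]
    for setting, acc in accuracies.items():
        i = min(int((acc - lo) / width), 9)
        bins[i].append(setting)

    # Always select the highest bin; then extend downwards while
    # |b_i| >= |b_{i+1}|, halting as soon as a bin is smaller than its successor.
    selected = list(bins[9])
    for i in range(8, -1, -1):
        if len(bins[i]) >= len(bins[i + 1]):
            selected.extend(bins[i])
        else:
            break
    return set(selected)
```

In the full procedure this selection is applied after every round of experiments, and the surviving settings are re-evaluated on the next, larger training and test sets.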
3 Customizing wrapped progressive sampling to classifiers

The wps procedure can operate on any supervised machine learning algorithm (classifier) that learns from labeled examples and is able to label new data after learning.

Algorithm   Parameters varied (tested values; constraints in parentheses)                                                                                          Combinations
ripper      -F (1, 2, 5, 10, 20, 50); -a (-freq, +freq); -n (-n, -!n); -S (0.5, 1.0, 2.0); -O (0, 1, 2); -L (0.5, 1.0, 2.0)                                                 648
c4.5        -m (1, 2, 5, 10, 20, 50, 100, 200, 500); -c (5, 10, 15, 20, 25, ..., 90, 95, 100); -g (not set, -g)                                                             360
maxent      --gis or --lb-fgs; -g (0.0, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, only with --lb-fgs)                                                                              11
ib1         -k (1, 3, 5, 7, 9, 11, 13, 15, 19, 25, 35); -w (0, 1, 2, 3, 4); -m (O, M, J); -d (Z, IL, ID, ED1, only with k > 1); -L (1, 2, only with -mM or -mJ)             925
winnow      promotion (1.1, 1.2, 1.3, 1.5); demotion (0.5, 0.7, 0.8, 0.9); threshold (1.0, 2.0, 3.0, 4.0, 5.0); -r (2, 3, 5); -S (0.0, 0.1, 0.2, 0.3, 0.4)                 1200

Table 1: Varied parameters (as they appear on the command line) with their tested values and constraints in parentheses. Default values are printed in bold. The right column lists the total number of combinations of these parameter values for each of the five tested algorithms.
The procedure is only effective, naturally, for algorithms that have algorithmic parameters which change some significant bias aspect of the learning algorithm. Provided that the chosen algorithm has one or more parameters that can each take two or more values, the wps method can be applied to all possible settings, each representing one particular combination of parameter values. However, this simple procedure cannot be applied blindly. Some algorithmic parameters may be mutually exclusive; some parameters may only be selected in combination with another parameter or with a particular value of another parameter. This implies that the method has to be customized for each classifier, using background knowledge about the algorithm's mutual parameter constraints (which are usually known and specified by the algorithm's developers) and about the algorithm's parameter value constraints (which may only be known as rules of thumb).

We customized wps to the following five well-known machine learning algorithms. (1) Ripper [4] is a rule-induction algorithm that compresses a data set of labeled examples into an ordered rule list; we employed implementation V1 Release 2.5 patch 1. (2) C4.5 [11] is an algorithm for the top-down induction of decision trees; it compresses a data set of labeled examples into a decision tree that has class labels at its leaf nodes and tests on feature values at its branches; we employed Release 8. (3) Maxent [6] is a probabilistic learner in which the central probability matrix between feature values and class labels is smoothed towards a state of maximum entropy; we used the implementation by [8], version 20040315. (4) Ib1 [1] is an instance-based classifier based on the k-nearest neighbor (k-NN) classification rule; we used TiMBL [5], version 5.0.0 patch 3, an implementation that supports a range of k-NN kernel plugins. (5) Winnow [9] is a linear-threshold classifier that learns weights on the model parameters (features) through an error-based weight update rule, which as a side effect removes weights that end up below a threshold; we employed SNoW [3], a sparse implementation of winnow classifiers, version 3.1.3.

Table 1 lists the parameters and values that were varied, and the total number of parameter-setting combinations tested in the first, pseudo-exhaustive round of wps.
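To make the size of these search spaces concrete, the sketch below enumerates the winnow grid of Table 1, with a hook for the kind of mutual constraints that ib1 and maxent require. It is our own illustration, not the wps implementation; the names WINNOW_GRID, enumerate_settings, and the `valid` hook are ours.

```python
from itertools import product

# Winnow parameter grid as listed in Table 1 (no mutual constraints).
WINNOW_GRID = {
    "promotion": [1.1, 1.2, 1.3, 1.5],
    "demotion":  [0.5, 0.7, 0.8, 0.9],
    "threshold": [1.0, 2.0, 3.0, 4.0, 5.0],
    "-r":        [2, 3, 5],
    "-S":        [0.0, 0.1, 0.2, 0.3, 0.4],
}

def enumerate_settings(grid, valid=lambda setting: True):
    """Yield every parameter combination that passes the constraint check.

    For algorithms such as ib1 or maxent, `valid` would encode the
    constraints of Table 1 (e.g. ib1's -d only together with k > 1).
    """
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        setting = dict(zip(names, values))
        if valid(setting):
            yield setting

# 4 * 4 * 5 * 3 * 5 = 1200 combinations, matching the last row of Table 1.
print(sum(1 for _ in enumerate_settings(WINNOW_GRID)))
```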
Task          Examples   Features   Classes   Class entropy
audiology          228         69        24            3.41
bridges            110          7         8            2.50
soybean            685         35        19            3.84
tic-tac-toe        960          9         2            0.93
votes              437         16         2            0.96
car              1,730          6         4            1.21
connect4        67,559         42         3            1.22
kr-vs-kp         3,197         36         2            1.00
splice           3,192         60         3            1.48
nursery         12,961          8         5            1.72

Table 2: Basic statistics of the ten UCI repository data sets. On the top five tasks normal wrapping is performed rather than wps.
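The class entropy column of Table 2 is the entropy, in bits, of a task's class distribution. A minimal sketch of how such a value is computed is given below; the three-class counts used in the example are hypothetical and not taken from any of the listed data sets.

```python
from math import log2

def class_entropy(counts):
    """Entropy (in bits) of the class distribution, as reported in Table 2.

    counts: number of examples per class.
    """
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * log2(p) for p in probs)

# Hypothetical three-class task with 700, 200, and 100 examples per class:
print(round(class_entropy([700, 200, 100]), 2))   # 1.16
```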
4 Experimental setup

For our experiments we used ten benchmark data sets from the UCI repository [2]. Table 2 lists some statistics for the ten selected data sets, which all have nominal attributes. Note that "soybean" is short for "soybean-large" and "car" is short for "car evaluation". As the table illustrates, the selection displays a wide variance in numbers of examples, features, and classes, and in class entropy.

The top five data sets in Table 2 have fewer than 1,000 examples. We apply straightforward wrapping to these five, and apply wps to the bottom five tasks, to allow for some comparison between the two approaches. The rationale for this split is twofold: (1) it is feasible to run pseudo-exhaustive wrapping in 10-fold cross-validation experiments on the smaller data sets, while this becomes gradually more infeasible with 10 or 100 times as many instances, as in the bottom five tasks; (2) little progressive sampling is possible with fewer than, or just over, 500 instances, 500 being the fixed size of the initial wps training set.

All data sets were concatenated into one full set (where the UCI repository originally provided disjoint training and test sets), shuffled randomly, and subsequently partitioned into ten 10% partitions. On these partitions 10-fold cross-validation (CV) experiments were performed. All experiments had two variants: in one variant, the default setting of the particular implementation was used for the 10-fold CV experiment; in the other variant, wps (or, on the smaller tasks, normal wrapping) was performed. The difference between the two variants was subjected to a one-tailed unpaired t-test.
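As a sketch of how such a comparison could be run (our own illustration; the paper does not publish its evaluation scripts), the per-fold accuracies of the two variants can be compared with a one-tailed unpaired t-test, for example via SciPy. The fold accuracies below are hypothetical, and the `alternative` keyword requires SciPy 1.6 or later.

```python
from scipy import stats

def compare_variants(acc_default, acc_optimized):
    """One-tailed unpaired t-test on 10-fold CV accuracies.

    acc_default, acc_optimized: lists of 10 per-fold accuracies (%).
    Tests whether the optimized settings score higher than the defaults.
    """
    t, p = stats.ttest_ind(acc_optimized, acc_default, alternative="greater")
    return t, p

# Hypothetical per-fold accuracies for one task:
default_folds = [96.2, 96.8, 96.5, 96.9, 96.4, 96.7, 96.3, 96.6, 96.8, 96.5]
wps_folds     = [98.7, 99.0, 98.8, 99.1, 98.9, 98.8, 99.0, 98.9, 98.7, 99.0]
t, p = compare_variants(default_folds, wps_folds)
print(f"t = {t:.3f}, p = {p:.4f}")
```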
5 Results

Table 3 displays the effects of normal wrapping and wps on generalization accuracy on the ten UCI benchmark data sets for ripper, c4.5, maxent, ib1, and winnow. For each algorithm, the table gives the average and standard deviation, measured in 10-fold CV experiments, both for the default setting and for the settings found by normal wrapping (on the top five tasks) or wps (on the bottom five tasks).
                        % Correct test instances
Task           Default          Wrapping / wps    % Error red.   One-tailed t-test

ripper
audiology      75.4 ±  8.1      76.3 ±  9.0             3.6      t =  0.233
bridges        58.2 ± 14.8      55.5 ± 12.5            -6.5      t =  0.445
soybean        91.8 ±  2.5      91.5 ±  2.5            -3.5      t =  0.258
tic-tac-toe    97.5 ±  1.3      99.8 ±  0.3            93.2      t =  5.330, p < 0.001
votes          95.4 ±  2.5      95.2 ±  2.8            -4.8      t =  0.187
car            87.8 ±  3.2      98.0 ±  0.6            84.0      t = 10.122, p < 0.001
connect4       75.4 ±  0.8      77.0 ±  1.3             6.6      t =  3.373, p < 0.01
kr-vs-kp       99.2 ±  0.4      98.9 ±  0.8           -30.9      t =  0.884
splice         93.4 ±  1.6      94.3 ±  1.0            13.3      t =  1.447
nursery        96.6 ±  0.7      98.9 ±  0.5            66.4      t =  7.732, p < 0.001

c4.5
audiology      78.0 ±  8.0      82.4 ±  7.0            19.9      t =  1.302
bridges        49.1 ±  6.0      52.7 ±  7.9             7.1      t =  1.156
soybean        91.2 ±  2.9      92.4 ±  2.5            13.4      t =  0.892
tic-tac-toe    84.7 ±  4.8      85.9 ±  3.1             8.1      t =  0.689
votes          95.9 ±  2.6      95.4 ±  2.5           -11.5      t =  0.410
car            91.9 ±  2.2      93.2 ±  1.9            16.3      t =  1.450
connect4       80.9 ±  0.4      79.4 ±  1.7            -7.8      t =  2.750, p < 0.01
kr-vs-kp       99.5 ±  0.3      99.6 ±  0.2            20.0      t =  0.799
splice         94.0 ±  1.6      93.7 ±  1.7            -5.5      t =  0.442
nursery        97.0 ±  0.4      97.5 ±  0.7            15.3      t =  1.889, p < 0.05

maxent
audiology      82.8 ±  9.3      81.5 ±  8.3            -7.6      t =  0.332
bridges        56.5 ± 16.7      57.3 ± 15.2             2.1      t =  0.126
soybean        92.6 ±  2.3      92.3 ±  2.3            -3.8      t =  0.269
tic-tac-toe    98.1 ±  1.0      98.2 ±  0.9             5.4      t =  0.228
votes          96.1 ±  2.3      96.3 ±  3.0             5.6      t =  0.186
car            93.5 ±  1.7      93.2 ±  1.7            -5.4      t =  0.463
connect4       75.7 ±  1.4      74.7 ±  1.7            -4.1      t =  1.437
kr-vs-kp       97.8 ±  0.9      97.1 ±  1.0           -29.3      t =  1.545
splice         90.7 ±  2.2      94.5 ±  2.8            41.3      t =  3.437, p < 0.01
nursery        92.5 ±  0.5      92.5 ±  0.4            -0.7      t =  0.296

ib1
audiology      79.3 ±  9.7      79.8 ±  8.7             2.4      t =  0.119
bridges        54.6 ±  8.1      50.9 ± 13.6            -8.0      t =  0.726
soybean        91.8 ±  3.1      94.6 ±  2.9            34.0      t =  2.053, p < 0.05
tic-tac-toe    89.5 ±  2.6      99.2 ±  0.6            92.1      t = 11.423, p < 0.001
votes          93.8 ±  3.7      95.9 ±  3.2            33.7      t =  1.346
car            94.0 ±  0.9      96.6 ±  1.2            43.3      t =  5.569, p < 0.001
connect4       71.0 ±  0.3      77.9 ±  2.1            24.7      t = 10.171, p < 0.001
kr-vs-kp       97.7 ±  0.5      96.8 ±  0.7           -42.5      t =  3.649, p < 0.001
splice         91.9 ±  2.0      95.4 ±  1.0            43.7      t =  4.932, p < 0.001
nursery        94.7 ±  0.5      99.3 ±  0.2            86.9      t = 28.407, p < 0.001

winnow
audiology      71.3 ± 14.7      77.0 ± 11.8            19.9      t =  0.959
bridges        58.2 ± 15.9      56.0 ± 13.2            -5.3      t =  0.340
soybean        84.3 ±  5.9      88.2 ±  6.1            24.9      t =  1.457
tic-tac-toe    90.7 ±  3.0      94.3 ±  2.1            38.2      t =  3.057, p < 0.01
votes          95.0 ±  2.7      95.4 ±  2.9             9.2      t =  0.372
car            94.3 ±  2.1      96.4 ±  1.7            36.4      t =  2.396, p < 0.05
connect4       47.2 ±  5.8      69.5 ±  2.5            42.2      t = 11.087, p < 0.001
kr-vs-kp       97.2 ±  0.9      97.0 ±  1.0            -5.3      t =  0.360
splice         92.1 ±  2.0      94.7 ±  1.6            33.2      t =  3.272, p < 0.01
nursery        96.4 ±  0.7      98.4 ±  0.4            54.6      t =  7.653, p < 0.001

Table 3: Parameter optimization effects on learning ten UCI benchmark tasks by the five tested learning algorithms (p levels are listed only when p < 0.05).
For convenience, the table also displays an error reduction, measured as the percentage of error that was saved by wrapping or wps; this quantity is negative when the wrapping method yields worse results. The t value of the one-tailed unpaired t-test is reported, as well as the p level whenever it is smaller than 0.05. Accuracies in bold mark the significantly better accuracy in a pair of outcomes for one data set.

Two algorithms, ib1 and winnow, display marked improvements due to wps on four of the five wps tasks. Ripper gains significant improvements over the default settings on three of the five wps data sets. C4.5 performs significantly better with wps on two tasks, and maxent shows the least effect: only one task is significantly improved by wps. Error reduction levels vary widely; they are negative in 17 of the 50 experiments, but in only two cases is the negative effect of wps significant. On the positive side, of the 33 positive effects measured across the 50 experiments run with all five algorithms, 17 are significant.

To gain more insight into the effects of the two wrapping methods, Table 4 lists the average error reductions of all five algorithms for each of the two wrapping methods (i.e. each average is taken over the error reductions measured in five experimental outcomes).
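The error reduction figures in Tables 3 and 4 are consistent with the standard definition of relative error reduction; a minimal sketch under that assumption (our own, with accuracies expressed as percentages):

```python
def error_reduction(acc_default, acc_optimized):
    """Percentage of the default setting's error removed by wrapping or wps.

    Both arguments are percentages of correctly classified test examples.
    Negative values mean the optimized setting performs worse.
    """
    return 100.0 * (acc_optimized - acc_default) / (100.0 - acc_default)

# e.g. winnow on connect4 (Table 3): default 47.2%, wps 69.5%
print(round(error_reduction(47.2, 69.5), 1))   # 42.2, matching Table 3
```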
                Normal wrapping                       wps
Algorithm    Error reduct.   Reduct./combin.    Error reduct.   Reduct./combin.
ripper            16.4            0.025              27.9            0.043
c4.5               7.4            0.021               7.7            0.021
maxent             5.9            0.536               0.4            0.036
ib1               30.8            0.033              31.2            0.034
winnow            17.4            0.015              32.2            0.027

Table 4: Average error reduction levels yielded through normal wrapping and wps, and the "gain per tested combination" statistic, for each of the five learning algorithms.
Apart from these averages, which range from 0.4% for maxent with wps to 32.2% for winnow with wps, the table also displays the relative contribution of each tested combination of parameter settings. For example, the 32.2% error reduction of winnow is due to a wps search among 1200 combinations; the average contribution of each combination is 32.2/1200 = 0.027%. Interestingly, this average contribution per setting is fairly constant, ranging between about 0.02 and 0.04. The single outlier is the high average relative contribution of each of the 11 maxent setting combinations, which, being an average over so few combinations, should be considered accidental and unreliable.
6 Discussion

In this paper we demonstrated the use of wrapped progressive sampling for algorithmic parameter optimization. We customized wps to five machine learning algorithms and ran experiments on UCI benchmark data sets, comparing the algorithms' default settings, as provided by the employed implementations, to the settings found by wps for each fold of a 10-fold CV experiment. Furthermore, we ran the same five algorithms with normal (pseudo-exhaustive) wrapping on five smaller UCI tasks, an approach that would not be feasible for the larger data sets to which wps is applied.

The wps procedure appears to have the desired function. It tends to reduce average error over the investigated UCI data sets, at rates ranging from a negligible 0.4% with maxent to a considerable 27.9% (ripper), 31.2% (ib1), and 32.2% (winnow). In 13 out of the 25 wps experiments, wps performs significantly better than the default setting. As a computationally feasible alternative to normal wrapping, which is tested on the five smaller data sets, wps offers at least as promising results. There even appears to be a roughly constant gain in error reduction per tested combination of parameter settings, about 0.02-0.04% per combination in our experiments, leading to higher gains when larger parameter spaces can be explored.

The fact that wps sometimes produces a lower generalization accuracy than the default setting of an algorithm can partly be explained by the general fact that wrapped estimates on training data may not carry over to test data; simple wrapping also produces negative results. On the other hand, the wps procedure is susceptible to discarding settings that would perform well on test data but perform badly on small amounts of training data, and that are therefore deleted early in the process. More comparative experiments and deeper analyses are needed to investigate this potential cause of negative end results.

To conclude, wrapped progressive sampling is a meta-learning approach to algorithmic parameter optimization that, unsurprisingly, produces little effect with algorithms that offer little variation in their parameters, such as maxent. It is, however, able to yield larger improvements when more parameter combinations are available to search through.

Acknowledgements

This research is funded by the Netherlands Organisation for Scientific Research (NWO). The author wishes to thank William Cohen, Robert Stockton, Iris Hendrickx, Walter Daelemans, Zhang Le, and Dan Roth for comments, suggestions, and discussions.

References

[1] D. W. Aha, D. Kibler, and M. Albert. Instance-based learning algorithms. Machine Learning, 6:37-66, 1991.
[2] C. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[3] A. J. Carlson, C. M. Cumby, J. L. Rosen, and D. Roth. SNoW user guide. Technical Report UIUCDCS-R-99-2101, Cognitive Computation Group, Computer Science Department, University of Illinois, Urbana, Illinois, 1999.
[4] W. Cohen. Fast effective rule induction. In Proceedings of the 12th International Conference on Machine Learning, pages 115-123. Morgan Kaufmann, 1995.
[5] W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch. TiMBL: Tilburg Memory Based Learner, version 5.0, Reference Guide. ILK Technical Report 03-10, Tilburg University, 2003.
[6] S. Guiasu and A. Shenitzer. The principle of maximum entropy. The Mathematical Intelligencer, 7(1), 1985.
[7] R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence Journal, 97(1-2):273-324, 1997.
[8] Zhang Le. Maximum Entropy Modeling Toolkit for Python and C++. Natural Language Processing Lab, Northeastern University, China, 2004. http://www.nlplab.cn/zhangle/software/maxent/manual/.
[9] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318, 1988.
[10] F. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, pages 23-32, 1999.
[11] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
