A Simple, Fast, and Effective Rule Learner William W. Cohen Yoram Singer AT&TShannon Laboratory 180 Park Avenue Florham Park, NJ 07932-0971 USA {wcohen,singer } @research.att.com

Abstract Wedescribe SLIPPERa~ newrule learner that generates rulesets by repeatedly boosting a simple, greedy, rule-builder. Likethe rulesets built by other rule learners, the ensembleof rules created by SLIPPERis compact and comprehensible. This is madepossible by imposingappropriate constraints on the rule-builder, andby use of a recently-proposedgeneralization of Adaboostcalled confidence-ratedboosting. In spite of its relative simplicity, SLIPPEiRs highly scalable, and an effective learner. Experimentally, SLIPPERscales no worse than O(n log n), where n is the numberof examples, and on a set of 32 benchmarkproblems, SLIPPER achieves lower error rates than RIPPER20 times, and lowererror rates than C4.5rules 22 times. Introduction Boosting (Schapire 1990; Freund 1995; Freund Schapire 1997) is usually used to create ensemble classifters. It is popular because it is simple, easy to implement, well-understood formally, and effective at improving accuracy. One disadvantage of boosting is that improvements in accuracy are often obtained at the expense of comprehensibility. If comprehensibility is important, it is more appropriate to use some learner that produces a compact, understandable hypothesis--for instance, a rule learning system like CN2 (Clark & Niblett 1989), RIPPER(Cohen 1995), or C4.5rules (Quinlan 1994). However, the rule learning systems that perform best experimentally have the disadvantage of being complex, hard to implement, and not well-understood formally. Here, we describe a new rule learning algorithm called SLIPPER1. SLIPPERgenerates rulesets by repeatedly boosting a simple, greedy, rule-builder. SLIPPER'srule-builder is muchlike the inner loops of RIP- PER (Cohen 1995) and IREP (Fiirnkranz & Widmer 1994). However, SLIPPERdoes not employ the "setcovering" process used by conventional rule learners-rather than removing examples covered by a new rule, SLIPPERuses boosting to reduce the weight of these examples. 1For Simple Learner with Iterative Pruning to Produce Error Reduction.

Like the rulesets constructed by RIPPERand other rule learners, SLIPPER'srulesets have the desirable property that the label assigned to an instance depends only on the rules that "fire" for that instance. This property is not shared by earlier applications of boosting to rule learning (see for instance (Freund & Schapire 1996)), in which the behavior of the entire ensemble rules can affect an instance's classification. This property makes classifications madeby the rulesets easier to understand, and is made possible by imposing appropriate constraints on the base learner, and use of a recently-proposed generalization of AdaBoost(Schapire & Singer 1998). SLIPPERis simpler and better-understood formally than other state-of-the-art rule learners. In spite of this, SLIPPERscales well on large datasets, and is an extremely effective learner. Experimentally, SLIPPER's run-time on large real-world datasets scales no worse than O (n log n), where n is the numberof examples. On a set of 32 benchmark problems, SLIPPER achieves lower error rates than RIPPER20 times, and lower error rates than C4.5rules 22 times. The rulesets produced by SLIPPERare also comparable in size to those produced by C4.5rules. The SLIPPER Algorithm SLIPPERuses boosting to create an ensemble of rules. The weak learner that is boosted finds a single rule, using essentially the same process as used in the inner loops of IREP (Fiirnkranz & Widmer 1994) and RIPPER(Cohen 1995). Specifically, the weak learner splits the training data, grows a single rule using one subset of the data, and then prunes the rule using the other subset. In SLIPPER, the ad hoc metrics used to guide the growing and pruning of rules are replaced with metrics based on the formal analysis of boosting algorithms. The specific boosting algorithm used is a generalization of Freund and Schapire's AdaBoost(Freund & Schapire 1997) that employs confidence-rated predictions (Schapire & Singer 1998). This generalization allows the rules generated by the weak learner to "abstain" (vote with confidence zero) on examples not covered by the rule, and vote with an appropriate nonzero confidence on covered examples.

The current implementation of SLIPPERonly handles two-class classification problems. The output of SLIPPERis a weighted ruleset, in which each rule R is associated with a confidence CR. To classify an instance x, one computes the sum of the confidences of all rules that cover x, then predicts according to the sign of this sum: if the sum is greater than zero, one predicts the positive class. In order to makethe ruleset more comprehensible, we further constrain SLIPPERto generate only rules that are associated with a positive confidence rating--that is, all rules predict membership in the positive class. The only rule with a negative confidence rating (i.e., that predicts membership in the negative class) is a single default rule. This representation is a generalization of propositional DNF, and is similar to that used by many other rule learners: for mostrule learners the classifier is a set of rules, often with some associated numerical confidence measure, and often with some sort of voting scheme for resolving possible conflicts in the predictions. Below, we describe the SLIPPERalgorithm in detail.

Boosting Confidence-rated

Rules

The first boosting algorithms (Schapire 1990; Freund 1995) were developed for theoretical reasons--to answer certain fundamental questions about pac-learnability (Kearns & Valiant 1994). While mathematically beau- tiful, these two algorithms were rather impractical. Later, Freund and Schapire (1997) developed the Ada- Boost algorithm, which proved to be a practically useful meta-learning algorithm. AdaBoost works by making repeated calls to a weak learner. On each call the weak learner generates a single weak hypothesis, after which the examples are re-weighted. The weak hypotheses are combinedinto an ensemble called a strong hypothesis. Recently, Schapire and Singer (1998) studied a generalization of AdaBoost, in which a weak-hypothesis can assign a real-valued confidence to each prediction. The weak-hypothesis can assign different confidences to different instances, and in particular, it can "abstain" on some instances by making a prediction with zero confidence. The ability to abstain is important for our purposes. Wenowgive a brief overview of this extended boosting frameworkand describe howit is used for constructing weighted rulesets. Since we have thus far implemented only a two-class version of SLIPPER,we will focus on the two-class case; however, the theory extends nicely to multiple classes. Assume that we are given a set of examples ((Xl, Yl),..., (Xm, Ym)) where each instance xi belongs to a domain X and each label Yi is in {-1, +1}. Assume also that we have access to a weak learning algorithm, which accepts as input the training examples along with a distribution over the instances (initially uniform). the generalized boosting setting, the weak learner computes a weakhypothesis h of the form h : X --~ ~, where the sign of h(x) is interpreted as the predicted label and the magnitude Ih(x)] as the confidence in the prediction: large numbers for ]h(x)] indicate high confidence

Given: (xl,yl),...,(xm,Ym) Initialize D1(i) = 1/m. For t = 1,...,T:

; xi E X, yi {- 1,ч1}

Ї Train weak learner using distribution Dr. Ї Get weakhypothesis h~ : 2d --~ R. Ї Choose at E ~. Ї Update: nt+l(i)= n~(i)exp(-a~yih~(xi))/Zt

Output final hypothesis: H(x) si gn (~T=I at ht(x))

Figure 1: A generalized version of AdaBooswt ith real valued predictions (Schapire 8z Singer 1998).

in the prediction, and numbersclose to zero indicate low confidence. The weak hypothesis can abstain from predicting the label of an instance x by setting h(x) = Pseudo-code describing the generalized boosting algorithm is given in Figure 1; here Zt is a normalization constant that ensures the distribution Dt+l sums to 1, and at depends on the weak-learner. The weak-hypotheses that we use here are rules. In SLIPPER, rules are conjunctions of primitive conditions. As used by the boosting algorithm, however, a rule R can be any hypothesis that partitions the set of instances Xinto two subsets: the set of instances which satisfy (are covered by) the rule, and those which do not satisfied the rule. If x satisfies R, wewill write x E R. In order to makethe strong-hypothesis similar to a conventional ruleset, we will force the weak-hypothesis based on a rule R to abstain on all instances unsatisfied by R, by setting the prediction h(x) for x ~ R to 0. Wewill also force the rules to to predict with the same confidence Ca on every x E R; in other words, for the tth rule Rt generated by the weaklearner, wewill require that Vx ERt, atht(x) = Ca~. Thus, to classify an instance x with the strong-hypothesis, one simply adds up the confidence Ca~for each rule Rt that is satisfied by x, and predicts according to the sign of this sum. As a final constraint, we will require each rule R to be in one of two forms: either R is a "default rule" (i.e., x E 2ў =~ x E R) or else R is such that Ca is positive. Thus each non-default rule R is associated with a single real-valued confidence Ca, and can be interpreted as follows: if R is satisfied then predict class "positive" with confidence Ca, and otherwise abstain. In Figure 1, Zt is a real value used to normalize the distribution: Zt -- ~ D~(i) exp(-aty~ht(xi)). Thus Zt depends on both h~ and a~. Schapire and Singer (1998) showed that to minimize training error, the weak-learning algorithm should pick, on each round of boosting, the weak hypothesis ht and weight a~ which lead to the smallest value of Zt. Assumethat a rule R has been generated by the weak learner. Wewill now show howthe confidence value Ca for rule R can be set to minimize Zt. Omitting the dependency on t, Z can

rewritten in our case as

z = D(i)+ D(i)exp(-y (1)

xi~R

x~6R

where CR = ah(x). Let Wo = ~x,ўRD(i), W+ ~x,en:y,=+l D(i), and W_= ~-]z, eR:u,=-I D(i). We can nowfurther simplify Equ. (1) and rewrite Z

Z = Wo + W+ exp(-CR) + W_ exp(+CR) .

Following Schapire and Singer (1998), to find CR need to solve the equation ~ = 0, which implies that Z is minimized by setting

Since a rule may cover only a few examples, W_can be equal to 0, leading to extreme confidence values: to prevent this, in practice, we "smooth" the confidence by adding 1 to both W+and W_:

(4) + 1/(2n)] The smoothed confidence value of any rule R is therefore bounded from above by Ѕ ln(2n). The analysis of Singer and Schapire also suggests an objective function to be used by the weak-learner which constructs rules. Plugging the value of CRinto Equ. (2) we get that

Z = Wo +2x/W+W-

= 1- (W+ - 2x/W+W- + W_)

_- 1-

(5)

Thus, a rule R minimizes Z iff it maximizes ]V/~+ -

v/W-~-_I. Note that a rule which minimizes Z by maximizing V/-W~-- ~ may be negatively correlated with

the positive class, and hence its confidence value CRis

negative. As described earlier, in SLIPPERwe restrict ourselves to positively correlated rules, hence the objective function we attempt to maximize when searching

for a good rule is

2=

(6)

In summary,this use of boosting corresponds roughly to the outer "set-covering" loop found in many rule learners (Pagallo & Haussler 1990; Quinlan 1990; Brunk & Pazzani 1991; Fiirnkranz & Widmer 1994; Cohen 1995). The major difference is that examples covered by a rule are not immediately removed from the training set. Instead, covered examples are given lower weights; further, the degree to which an example's weight is reduced depends on the accuracy of the new rule. The formal analysis of boosting given by Schapire and Singer also suggests a newquality metric for rules: notice that Z encompassed a natural trade-off between

Given: (xl,Yl),...,(xm,Ym) Initialize D(i) = 1/m. For t = 1,...,T:

; xi E X, Yi E {--1,+1)

1. Train the weak-learner using current distribution D:

(a) Split data into GrowSet and PruneSet.

(b) GrowRule: starting with empty rule, greedily add conditions to maximizeEqu. (6).

(c) PruneRule: starting with the output of GrowRule, delete some final sequence of conditions to minimize Equ. (7), where CR, is computed using

Equ. (4) and GrowSet.

(d) Return as Rt either the output of PruneRule, or the default rule, whichever minimizes Equ. (5).

2. Construct ht : 2~ --~ R: Let CR, be given by Equ. (4) (evaluated on the entire

dataset). Then

( h~(x)

= CR,

ifx6R~ otherwise

3. Update: (a) For each xi 6 R~, set D(i) +- D(i)/exp(yi. CR,) (b) Let Z~= ~--]im=lD(i). (c) For each xi, set D(i) +-- O(i)/Zt.

Output]inal hypothesis: H(x) = sign (~R,.-xeR, 0R,)

Figure 2: The SLIPPER algorithm

accuracy (the proportion of the positive examplessatisfied by a rule to the total numberof examples that the rule satisfies) and coverage (the fraction of examples that satisfy the rule). Below, we will discuss howto construct rules based on the objective function Z as given by Equ. (6). Rule growing and pruning Wewill now describe the weak-learner which generates individual rules. This procedure is similar to the heuristic rule-building procedure used in RIPPER(Cohen 1995) and IREP (Fiirnkranz & Widmer 1994). The rule-builder begins by randomly splitting the dataset into two disjoint subsets, GrowSet and Pruneget. The split is constrained so that the total weight of examples in GrowSet is about 2/3. The rule-builder then invokes the GrowRuleroutine. GrowRulebegins with an empty conjunction of conditions, and considers adding to this conjunction any condition in one of the following forms: An -- v, where An is a nominalattribute and v is a legal value for A,~; or Ac <_ O or A~ > O, where A~ is a continuous variable and 8 is some value for Ac that occurs in the training data. GrowRule then adds the condition that attains the maximal value for Zt on GrowSet. This process is repeated until the rule covers no negative examples from GrowSet, or no further refinement improves Zt.

This rule is often too specific, and "overfits" the training data; thus the resulting rule is immediately pruned using the PrtmeRule routine. PruneRule considers deleting any final sequence of conditions from the rule. Each sequence of deletions defines a new rule whose goodness is evaluated on PruneSet. As before, each candidate rule Rt partitions the PruneSet into two subsets, depending on whether or not R~ is satisfied. Similar to the definition of W+and W_, let V+(respectively V_) be the total weight of the examples in PruneSet that are covered ^ by R~ and labeled +1 (respectively -1). Denote by CR, the (smoothed) prediction confidence obtained by evaluating Equ. (4) on the W+, W- associated with GrowSet. PruneRule minimizes the formula (1 - V+- V_) + V+exp (-Cn,) + V_ exp (+Cn,) This can be interpreted as the loss (as defined by Singer and Schapire) of the rule ~, with a ssociated c onfidence CR', as estimated on the examples in PruneSet. Subject to the limitations of this greedy, incomplete search procedure, this rule will have a low Z score. It is also guaranteed to be positively correlated with the positive class. Wealso allow a default rule (a rule that is satisfied for all examples)to be used in a hypothesis-indeed, without such a rule, it wouldbe impossible for the strong-hypothesis to classify any instances as negative. The rule-builder will thus return to the booster either the output of PruneRule, or the default rule-whichever rule has the lowest Z value, as determined by Equ. (5). (This behavior is different from other rulelearners, which typically add a single default rule after all other rules have been learned.) Note that the value of Equ. (7) and the confidence value CR, which was calculated on GrowSetis used only in the weak-learner search for a good rule~the booster will assign a confidence using Equ. (4) on the entire dataset. Pseudo-code for SLIPPERis given in Figure 2. Other details It is possible for the weak-learner to generate the same rule several times--for instance, the default rule is often generated manytimes during boosting. Therefore, after the last round of boosting, the final strong-hypothesis is "compressed" by removing duplicate rules. Specifically, if the strong-hypothesis contains a set of identical rules R1, ..., Rk, these are replaced by a single rule Rt with confidence CR, = ~i=k 1 CR,. This step reduces the size of the strong-hypothesis, thus reducing classif2ication time and improving comprehensibility. 2Note that this step does not alter the actual predictions of the learned ruleset. Other approaches that perform"lossy" compactionof the strong hypothesisby, for instance, deleting rules associated with lowconfidencevalues, might lead to better generalization error (see for instance (Margineantu&Dietterich 1997)) but are beyondthe scope of this this paper.

As described above, SLIPPER has one free parameter--the number of rounds of boosting T. Although there are theoretical analyses of the numberof rounds needed for boosting (Freund & Schapire 1997; Schapire et al. 1997), these tend not to give practically useful bounds. Therefore, weuse internal five-fold cross-validation (on the training set) to fix T. Five training/holdout divisions of the data are created in the usual way, and the algorithm of Figure 2 is run five times for Tmozrounds on each training sets (where Tmax is an upper bound set by the user). The number of rounds T* which produces the lowest average error on the holdout data is then determined, breaking ties in favor of smaller values of T*, and the algorithm is finally run again for T* rounds on the entire dataset. In the experiments below, we always used a value of Trna~ = 100. Experiments To evaluate SLIPPER, we used two sets of benchmark problems, each containing 16 two-class classification problems. The first set, the development set, was used in debugging SLIPPERand evaluating certain variations of it. The second set, the prospective set, was used as a secondary evaluation of the SLIPPERalgorithm, after development was complete. This two-stage procedure was intended as a guard against the possibility of "overfitting" the benchmarkproblems themselves; however, since the Experimental results are qualitatively similar on both the development and prospective sets, we will focus on results across all 32 benchmarkprob- lems in the discussion below. These results are summarized in Table 2 and Figure 3, and presented in more detail in Table 1. The benchmark problems are summarized in Table 1. The problems from the development set are discussed elsewhere (Cohen 1995). The problems the prospective set are taken without modification from the UC/Irvine repository (Blake, Keogh, & Merz 1989), with these exceptions: the hypothyroid and splice-junction problems were artificially made twoclass problems--in each case, the goal is to separate most frequent class from the remaining classes; for adult, we used a 5000-element subsample of the designated training set; and marketland market2are realworld customer modeling problems provided by AT&T. To measure generalization error, we used a designated test set, whenavailable; a single randompartition of the training set, for the larger problems; and stratified 10-fold cross-validation otherwise, as indicated. We compared SLIPPER's performance to RIPPER (Cohen 1995), with and without its "optimization" step; the C4.5 decision-tree learner (Quinlan 1994), with pruning, and the C4.5rules rule learner (henceforth, C4rules); and the C5.0 rule learner 3 (henceforth, C5rules), a proprietary, unpublished descendent of C4rules. RIPPERwithout optimization is in- 3That is, C5.0 run with the -r option.

Percent Error on Test Data

Problem Name Source #Train #Test #Feat

RIPPER

C4.5

C5.0

-opt +opt Trees Rules Rules SLIPPER

Prospective:

adult

uci 5000 16281 14 17.2 16.0 16.0 15.0 15.1

14.7

blackjack

att 5000 10000

4 29.1 29.1 27.9 28.0 27.8

27.9

market2

att 5000 6000 68 43.1 41.3 45.5 43.1 41.4

42.6

market3

art 5000 15000

4 9.1 8.6 9.5 9.3 8.6

8.9

splice-junction

uci 2190 1000 60 6.5 5.9 4.3 4.8 4.5

5.9

hypothyroid

uci 2514 1258 29 1.0 0.9 0.4 0.4 0.4

0.7

breast-wisc

uci

699 10CV

9 3.7 4.6 6.6 5.2 5.0

4.2

bands

uci

540 10CV

39 28.3 27.0 30.0 30.0 30.2

22.8

crx

uci

690 10CV

15 15.5 15.2 14.2 15.5 14.0

15.7

echocardiogram

uci

74 10CV 12 2.9 ;5.5 5.5 6.8 4.3

4.3

german

uci 1000 10CV 20 28.6 28.7 27.5 27.0 28.3

27.2

hepatitis

uci

155 10CV 19 20.7 23.2 18.8 18.8 20.1

17.4

heart-hungarian

uci

294 10CV

13 19.7 20.1 20.8 20.0 21.8

19.4

ionosphere

uci

351 10CV 34 10.3 10.0 10.3 10.3 12.3

7.7

liver

uci

345 10CV

6 32.7 31.3 37.7 37.5 31.9

32.2

horse-colic

uci

300 10CV 23 17.0 16.3 16.3 16.0 15.3

15.0

Development:

mushroom

uci 3988 4136

22

0.2 0.0 0.2

0.2

0.7

0.2

vote

uci

300 135

16 3.7 3.0 3.0

5.2

3.0

3.0

move

att

1483 1546

10 35.1 29.3 27.8 25.3 26.8

23.9

networkl

att 2500 1077 30 25.0 25.7 28.4 26.6 26.3

25.1

network2

art 2600 1226 35 21.7 21.3 23.3 22.3 22.9

26.6

market1

art 1565 1616 10 23.4 22.3 21.8 20.5 21.2

20.1

weather

art 1000 4597 35 28.5 28.9 31.3 28.7 29.2

28.7

coding

uci 5000 15000 15 34.3 32.8 34.1 32.6 32.4

30.2

ocr

att 1318 1370 576 3.0 3.4 3.6 3.6 4.7

2.0

labor

uci

57 10CV

16 18.0 18.0 14.7 14.7 18.4

12.3

bridges

uci

102 10CV

7 13.7 13.7 15.7 17.5 15.7

13.7

promoters

uci

106 10CV 57 18.1 19.0 22.7 18.1 22.7

18.9

sonar

uci

208 10CV 60 29.8 24.2 30.3 29.8 28.3

25.5

ticketl

att

556 10CV 78 1.8 1.6 1.6 1.6 1.6

2.7

ticket2

att

556 10CV 53 6.3 4.7 4.2 4.9 4.2

4.5

ticket3

att

556 10CV 61 4.5 3.2 2.9 3.4 2.9

4.3

Average: Prospective Set

17.83 17.75 18.20 17.97 17.55

16.65

Average: Development Set

16.70 15.70 16.60 15.93 16.31

15.11

Average: All Problems AvgeernkAllProblems 05133640L035931412153 #Lowest Error Rates: All Problems

17.26 16.72 17.40 16.95 16.93

6

9

6

3

8

15.88 13

Table 1: Summaryof the datasets used, and error rates for SLIPPER,four alternative rule learners (RIPPERwith and without optimization, C4rules, and Chrules), and the C4.5 decision tree learner.

cluded as a relatively simple separate-and-conquer vari- ant; this algorithm has been evaluated elsewhere under the names IREP, (Cohen 1995) and IRIP (Fiirnkranz 1998). The results are shown in detail in Table 1. SLIPPER obtains the average lowest error rate for both sets of benchmarks; also, among the rule learners SLIPPER, RIPPER, C4rules, and C5rules, SLIPPERobtains the lowest error rate 17 times, C5rules 10 times, RIPPER 9 times, and C4rules 5 times. Also amongthese rule learners, the average rank of SLIPPERis 2.0, compared

t4o 2.6 for RIPPERand Chrules, and 2.8 for C4rules. Summaries of the experimental results are given in Figure 3 and Table 2. In the scatterplot of Figure 3, each point compares SLIPPER to some second learning system L on a single dataset: the x-axis position of the point is the error rate of SLIPPER,and the yaxis position is the error rate of L. Thus, points above the lines y ---- x correspond to datasets for whichSLIP- 4Thecorrespondingfigures across all learning algorithms comparedare given in the table.

45 a

4O

/i

m /t

35

o

j-"

3O

~ч Q /

ox

25

";3....

20 ,° .....

15

o :.

Ч/

10

~., /"

5 ~Ч.~o -.~g"

vs. Ripper.opt o vs. Ripper+opt vs. C4.5ruleso vs. CS.0rulesx y=x.......

0 r/a i

,

i

,

f

,

i

,

5

10 15 20

25 30 35 40

45

Errorrate of SLIPPER

Figure 3: Summaryof experimental results. Points above the lines y = x correspond to datasets for which SLIPPER performs better than somesecond learner.

PERperforms better than some second learner. Visual inspection confirms that SLIPPERoften substantially outperforms each of the other rule learners, and that its performance is almost always close to the best of t5he other rule learners. In Table 2, let LRbe the learner corresponding to a row of the table, and let Lc correspond to a column. The upper triangle entries are the average, across all benchmarks, of the quantity error(Lc)/error(LR); for instance, the entries of the fourth columnindicate that SLIPPER's error rate is, on average, about 2% to 4% lower than the other rule learners. The lower triangle entries are the won-loss-tied record of learner LRversus Lc, a "win" indicating LRachieved a lower error rate. A record is underlinedif it is statistically significant at the 90%level, and bold-faced if it is statistically significant at the 95%level. 6 For instance, the first entry of the fourth row indicates that SLIPPERachieves a lower error rate than RIPPER20 times, a higher error rate 9 times, and the same error rate 3 times. SLIPPER's records versus C4rules and C5rules are similar. The last two lines of the table give SLIPPER'ws on-losstied records for the developmentset and prospective set only, indicating that these results are generally comparable across both test sets. (An exception is SLIPPER's performance versus C5rules: it appears to be superior on the development set, but only comparable on the prospective set.) Wealso measured the size of the rulesets produced by the different algorithms/ The most compact rule- ~Thesole exception to this is network2,on whichSLIPPERperforms noticeably worse than the other methods. 6That is, if one can reject the null hypothesis that the probability of a win is 0.50, given there is no tie, with a two-tailed binomialtest. 7In the 10-CVexperiments, welooked at the size of the ruleset generatedby runningon all the data, not the average of the cross-validation runs.

RIPPER C4rules C5rules SLIPPER SLIPPER

RIPPER C4rules C5rules SLIPPER

1.023 0.993

0.961

14-17-1

1.066

0.971

14-15-3 16-14-2

0.977

20-9-3 22-8-2 19-11-2

11-4-1 12-4-0 8-7-1 (Prosp.)

9-5-2 10-4-2 11-4-1 (Devel.)

Table 2: Summaryof experimental results. If LR and Lc are the learners corresponding to a row and column, respectively, the upper triangle entries are the average of error(Lc)/error(La). The lower triangle entries axe the won-loss-tied record of learner LRversus Lc, a "win" indicating LRachieved a lower error rate.

sets are produced by RIPPER:the average size of RIPPER's rulesets is 6.0 rules (or 8.1 without optimization), and RIPPERvirtually always produces the smallest ruleset, s The remaining three learners produce similar sized rulesets, with SLIPPERtending to produce somewhatsmaller rulesets than the other two. The average size rulesets for C4rules, C5rules, and SLIPPER are 22.1 rules, 30.7 rules, and 17.8 rules, respectively, and the respective average ranks amongthese three are 1.8, 2.3, and 1.9. The largest ruleset producedby SLIPPERis 49 rules (for coding). Finally, we evaluated the scalability of the rule learners on several large datasets. Weused adult; blackjack, with the addition of 20 irrelevant noise variables; and market3, for which many examples were available. C4rules was not run, since it is knownto have scalability problems (Cohen 1995). The results are shownin the log-log plots of Figure 4. 9 The fastest rule learner for these datasets is usually C5rules, followed by the RIPPERvariants. SLIPPER(at least in the current implementation) is muchslower than either C5rules or RIPPER;however, it scales very well with increasing amounts of data. In absolute terms, SLIPPER's performance is still quite reasonable: SLIPPERneeds 1-2 hours to process 100,000 examples of the blackjack+ and market3 datasets, and 30 minutes to process the 30,000 training examples from the adult dataset. To summarize, SLIPPER obtains the lowest error rates on average. SLIPPERalso scales well to large datasets, although it is somewhatless efficient than C5rules and RIPPER. SLIPPER's rulesets are comparable in size to those of C4rules and C5rules, although somewhat larger than RIPPER's. Concluding remarks Wehave described SLIPPER, a new rule learning algorithm which uses confidence-rated boosting to learn SHowever,it has been argued that RIPPERover-prunes on the sort of the smaller problemsthat predominatein the UC/Irvine repository (Frank &Witten 1998). °Timingresults are given in CPUseconds on a MIPSIrix 6.3 with 200 MHzR10000processors.

1OOOO IOO0

adultdataset 7 j//~S

IO0

. j,

10

1

............. . ~"Ї "~

.,..."" ~--'~"

0.1

ri.pspleipr-poep.r.t... dpper+o.p..t... c5rulesЇ * n*log(n-)

0.01 100

1000

10000

#examptes

10~00

1OOOO(] I00<30 I000 I00 10 1 0.1 0.01 IO0

blackjack+dataset

1OO0OO 1OOO0

1000 IO0 10

......... n*log(n)

1000

100(30

#examples

100000

1 0,1 0.01 IO0

market3dataset

Ї n'!~!-O--,:...

1000

10000 100000

#examples

Figure 4: Run-time performance of SLIPPER, RIPPER, and C5rules on large datasets.

an ensemble of rules. Although the SLIPPER algorithm is relatively simple, SLIPPERperforms well on a set of 32 benchmark problems: relative to RIPPER, SLIPPERachieves lower error rates 20 times, and the same error rate 3 times; relative to C4.5rules, SLIPPER achieves lower error rates 22 times, and the same rate 2 times; and relative to C5.0rules, SLIPPERachieves lower error rates 19 times, and the same rate 2 times. Using a two-tailed sign test, these differences between RIPPER, C4.5rules, and C5.0rules are significant at 94%, 98%, and 80% levels respectively. SLIPPER also performs best amongthese three systems according to several measures of aggregate performance, such as average rank. SLIPPER'srulesets are of moderate size--comparable to those produced by C4.5rules and C5.0rules--and the algorithm also scales well on large datasets. As noted above, SLIPPERis based on two lines of research. The first line of research is on scalable, noisetolerant separate-and-conquer rule learning algorithms (Pagallo & Haussler 1990; Quinlan 1990), such as reduced error pruning (REP) for rules (Brunk & Pazzani 1991), IREP (Ffirnkranz & Widmer 1994), RIPPER(Cohen 1995). The second line of research is on boosting (Schapire 1990; Freund 1995), in particular the AdaBoost algorithm (Freund & Schapire 1997), and its recent successor developed by Schapire and Singer (1998). SLIPPERis similar to an earlier application of boosting to rule learning (Freund & Schapire 1996), which AdaBoostwas used to boost a rule-builder called FindDecRule. In contrast to SLIPPER, Freund and Schapire used a heuristic based on an information gain criterion that has no formal guarantees. SLIPPERalso places a greater emphasis on generating comprehensible rulesets; in particular, SLIPPERgenerates relatively compact rulesets, and SLIPPER's use of confidencerated boosting allows it to construct rules that "abstain" on instances that are not covered by a rule; thus the label assigned to an instance depends only on the rules that "fire" for that instance. In Freund and Schapire's rule boosting algorithm, in contrast, the label for an instance always depends on all the rules in

the ensemble. The algorithm also always generates a ruleset of fixed size (in their experiments, 100rules). SLIPPER's use of boosting is a departure from the separate-and-conquer approach used by many earlier rule learners. Another alternative is the RISE algorithm (Domingos 1996), which combines rule learning and nearest-neighbour classification using a bottomup "conquering without separating" control structure. However, the ruleset constructed by RISE is somewhat moredifficult to interpret, since the label assigned to an instance depends not on the rules that cover it, but on the rule that is "nearest". More recently, Hsu, Etzioni, and Soderland (1998) described an experimental rule learner called DAIRY which extends the set-covering approach of traditional rule learners by "recycling" examples--that is, by reducing the weight of examples that have been "covered" by previous rules, rather than removingthese examples. DAIRY'srecycling method was shown experimentally to improve performance on a number of text classification problems. SLIPPER's combination of boosting and rule-building is similar to recycling, and could be viewedas a formally justified variant of it. Wenote that there are important practical advantages to using Learning Methods that are formally well understood. For instance, existing formal analysis (Schapire & Singer 1998) generalizes the boosting method used here to multi-class learning problems, and also to a setting in whichmisclassification costs are unequal. In further work, we plan to implement a multiclass version of SLIPPER, and an extension of SLIPPERfor minimizing an arbitrary cost matrix, which mapseach pair of (predicted label,correct label) to associated cost. Wealso plan to evaluate SLIPPER on text classification benchmarks: the current implementation of SLIPPER,which is based on code used in RIPPER, inherits from RIPPERthe ability to handle text efficiently. Acknowledgments Wewould like to thank Rob Schapire for helpful discussions, and HaymHirsh for commentson a draft of this paper.

References Blake, C.; Keogh, E.; and Merz, C.J. 1989. UCI repository of machine learning databases [www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science. Brunk, C., and Pazzani, M. 1991. Noise-tolerant relational concept learning algorithms. In Proceedings o/ the Eighth International Workshopon Machine Learning. Ithaca, NewYork: Morgan Kaufmann. Clark, P., and Niblett, T. 1989. The CN2induction algorithm. Machine Learning 3(1). Cohen, W. W. 1995. Fast effective rule induction. In Machine Learning: Proceedings of the Twelfth InterNational Conference. Lake Tahoe, California: Morgan Kaufmann. Domingos, P. 1996. Unifying instance-based and rulebased induction. Machine Learning 24(2):141-168. Frank, E., and Witten, I. 1998. Generating accurate rule sets without global optimization. In Machine Learning: Proceedings of the Fifteenth international conference. Freund, Y., and Schapire, R. E. 1996. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, 148-156. Freund, Y., and Schapire, R. E. 1997. A decisiontheoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1):119-139. Freund, Y. 1995. Boosting a weak learning algorithm by majority. Information and Computation 121(2):256-285. Fiirnkranz, J., and Widmer, G. 1994. Incremental reduced error pruning. In Machine Learning: Proceedings of the Eleventh annual conference. New Brunswick, New Jersey: Morgan Kaufmann. Ffirnkranz, J. 1998. Integrative windowing. Journal of Artificial Intelligence Research8:129-164. Hsu, D.; Etzioni, O.; and Soderland, S. 1998. A redundant covering algorithm applied to text classification. In AAAI Workshop on Learning for Text Categorization. Kearns, M., and Valiant, L. G. 1994. Cryptographic limitations on learning Booleanformulae and finite automata. Journal of the Association/or Computing Machinery 41(1):67-95. Margineantu, D. D., and Dietterich, T. G. 1997. Pruning adaptive boosting. In Machine Learning: Proceedings of the Fourteenth International Conference, 211218. Pagallo, G., and Haussler, D. 1990. Boolean feature discovery in empirical learning. MachineLearning 5(1).

Quinlan, J. R. 1990. Learning logical definitions from relations. MachineLearning 5(3). Quinlan, J. R. 1994. C~.5: programs for machine learning. Morgan Kaufmann. Schapire, R. E., and Singer, Y. 1998. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational learning theory, 80~91. Schapire, R. E.; Freund, Y.; Bartlett, P.; and Lee, W. S. 1997. Boosting the margin: A new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference. Schapire, R. E. 1990. The strength of weak learnability. Machine Learning 5(2):197-227.

S Gatziu, KR Dittrich

doc.uments.com

About Us :: Privacy Policies :: Terms of Service :: Feedback :: Copyright :: Contact Us :: DMCA Policy

Copyright © 2018 doc.uments.com