logistic regression, regularisation, multinomial logistic regression, cancer classification, training set, crossvalidation, probabilities, P. M. Williams, multinomial logistic regression model, probit regression, log, gene selection, Neural Computation, T. Hastie, probabilistic classifier, prior distribution, Jeffreys prior, Exponential distribution, partial derivatives, Computing Sciences University of East Anglia Norwich, Norfolk, regression models, Mark Girolami Department of Computing Science University of Glasgow Glasgow, model parameter, pt, Artificial Intelligence, relative class frequencies, T. Zhang, E. R. Dougherty, F. J. Oles, Journal of Machine Learning Research, L. Narlikar, IEE Proceedings, Artificial Neural Networks, R. Tibshirani, Neural Information Processing Systems, linear classification, classification, C. M. Bishop, A. J. Hartemink, M. Figueiredo, John Wiley, Y. Grandvalet, Oxford University Press, gene microarrays, J. Zhu, Sprse multinomial logistic regression, B. Krishnapuram
Content:
Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
Gavin C. Cawley School of Computing Sciences
University of East Anglia Norwich, Norfolk, NR4 7TJ, U.K.
[email protected]Nicola L. C. Talbot School of Computing Sciences University of East Anglia Norwich, Norfolk, NR4 7TJ, U.K.
[email protected]Mark Girolami
Department of Computing Science University of Glasgow Glasgow, Scotland, G12 8QQ, U.K.
[email protected]Abstract Multinomial logistic regression provides the standard penalised maximumlikelihood solution to multiclass pattern recognition problems. More recently, the development of sparse multinomial logistic regression models has found application in text processing and microarray classification, where explicit identification of the most informative features is of value. In this paper, we propose a sparse multinomial logistic regression method, in which the sparsity arises from the use of a Laplace prior, but where the usual regularisation parameter is integrated out analytically. Evaluation over a range of benchmark datasets reveals this approach results in similar generalisation performance to that obtained using crossvalidation, but at greatly reduced computational expense. 1 Introduction Multinomial logistic and probit regression are perhaps the classic
statistical methods for multiclass pattern recognition problems (for a detailed introduction, see e.g. [1, 2]). The output of a multinomial logistic regression model can be interpreted as an aposteriori estimate of the probability that a pattern belongs to each of c disjoint classes. The probabilistic nature of the multinomial logistic regression model affords many practical advantages, such as the ability to set rejection thresholds [3], to accommodate unequal relative class frequencies in the training set and in operation [4], or to apply an appropriate loss matrix in making predictions that minimise the expected risk [5]. As a result, these models have been adopted in a diverse range of applications, including cancer classification [6, 7], text categorisation [8], analysis of DNA
binding sites [9] and call routing. More recently, the focus of research has been on methods for inducing sparsity in (multinomial) logistic or probit regression models. In some applications, the identification of salient input features is of itself a valuable activity; for instance in cancer classification from microarray gene expression data, the identification of biomarker genes, the pattern of expression of which is diagnostic of a particular form of cancer, may provide insight into the жtiology of the condition. In other applications, these methods are used to select a small number of
basis functions to form a compact nonparametric classifier, from a set that may contain many thousands of candidate functions. In this case the sparsity is desirable for the purposes of computational expediency, rather than as an aid to understanding the data.
A variety of methods have been explored that aim to introduce sparsity in nonparametric regression models through the incorporation of a penalty or regularisation term within the training criterion. In the context of leastsquares regression using Radial Basis Function (RBF) networks, Orr [10], proposes the use of local regularisation, in which a weightdecay regularisation term is used with distinct regularisation parameters for each weight. The optimisation of the Generalised CrossValidation (GCV) score typically leads to the regularisation parameters for redundant basis functions achieving very high values, allowing them to be identified and pruned from the network (c.f. [11, 12]). The computational efficiency of this approach can be further improved via the use of Recursive Orthogonal Least Squares (ROLS). The relevance vector machine (RVM) [13] implements a form of Bayesian automatic relevance determination (ARD), using a separable Gaussian prior. In this case, the regularisation parameter for each weight is adjusted so as to maximise the marginal likelihood, also known as the Bayesian evidence for the model. An efficient componentwise training algorithm is given in [14]. An alternative approach, known as the LASSO [15], seeks to minimise the negative loglikelihood of the sample, subject to an upper bound on the sum of the absolute value of the weights (see also [16] for a practical training procedure). This strategy is equivalent to the use of a Laplace prior over the model parameters [17], which has been demonstrated to control overfitting and induce sparsity in the weights of multilayer perceptron networks [18]. The equivalence of the Laplace prior and a separable Gaussian prior (with appropriate choice of regularisation parameters) has been established by Grandvalet [11, 12], unifying these strands of research. In this paper, we demonstrate that, in the case of the Laplace prior, the regularisation parameters can be integrated out analytically, obviating the need for a lengthy crossvalidation based model selection stage. The resulting sparse multinomial logistic regression algorithm with Bayesian regularisation (SBMLR) is then fully automated and, having storage requirements that scale only linearly with the number of model parameters, is well suited to relatively largescale applications. The remainder of this paper is set out as follows: The sparse multinomial logistic regression procedure with Bayesian regularisation is presented in Section 2. The proposed algorithm is then evaluated against competing approaches over a range of benchmark learning problems in Section 3. Finally, the work is summarised in Section 5 and conclusion drawn.
2 Method Let D = {(xn, tn)}n=1 represent the training sample, where xn X Rd is the vector of input features for the ith example, and tn T = {t  t {0, 1}c, t 1 = 1 } is the corresponding vector of desired outputs, using the usual 1ofc coding scheme. Multinomial logistic regression constructs a generalised linear model [1] with a softmax inverse link function [19], allowing the outputs to be interpreted as aposteriori estimates of the probabilities of class membership,
p(tni xn) = yin =
exp{ani }
c j=1
exp{anj
}
where
d
ani =
wij xnj
j=1
(1)
Assuming that D represents an i.i.d. sample from a conditional multinomial distribution, then the negative loglikelihood, used as a measure of the datamisfit, can be written as,
c
ED = EDn = 
tni log {yin}
n=1
n=1 i=1
The parameters, w of the multinomial logistic regression model are given by the minimiser of a penalised maximumlikelihood training criterion,
cd
L = ED + EW where EW =
wij 
(2)
i=1 j=1
and is a regularisation parameter [20] controlling the biasvariance tradeoff [21]. At a minima of L, the partial derivatives of L with respect to the model parameters will be uniformly zero, giving
ED wij
=
if wij > 0
and
ED wij
<
if wij = 0.
This implies that if the sensitivity of the negative loglikelihood with respect to a model parameter, wij, falls below , then the value of that parameter will be set exactly to zero and the corresponding input feature can be pruned from the model.
2.1 Eliminating the Regularisation Parameters
Minimisation of (2) has a straightforward Bayesian interpretation; the posterior distribution for w, the parameters of the model given by (1), can be written as
p(wD) P (Dw)P (w).
L is then, up to an additive constant, the negative logarithm of the posterior density. The prior over model parameters, w, is then given by a separable Laplace distribution
P (w) =
2
W
W
exp{EW } =
2
exp
{wi}
,
(3)
i=1
where W is the number of active (nonzero) model parameters. A good value for the regularisation parameter can be estimated, within a Bayesian framework, by maximising the evidence [22] or alternatively it may be integrated out analytically [17, 23]. Here we take the latter approach, where the prior distribution over model parameters is given by marginalising over ,
p(w) = p(w)p()d.
As is a scale parameter, an appropriate ignorance prior is given by the improper Jeffrey's prior,
p() 1/, corresponding to a uniform prior over log . Substituting equation (3) and noting that
is strictly positive,
p(w)
=
1 2W
W 1 exp{EW }d. 0
Using the Gamma integral,
0
x1eµxdx
=
() µ
[24,
equation
3.384],
we
obtain
p(w)
=
1 2W
(W ) EW W
=
 log p(w) W log EW ,
giving a revised optimisation criterion for sparse logistic regression with Bayesian regularisation,
M = ED + W log EW ,
(4)
in which the regularisation parameter has been eliminated, for further details and theoretical justification, see [17]. Note that we integrate out the regularisation parameter and optimise the model parameters, which is unusual in that most Bayesian approaches, such as the relevance vector machine [13] optimise the regularisation parameters and integrate over the weights.
2.1.1 Practical Implementation
The training criterion incorporating a fully Bayesian regularisation term can be minimised via a simple modification of existing cyclic coordinate descent algorithms for sparse regression using a Laplace prior (e.g. [25, 26]). Differentiating the original and modified training criteria, (2) and (4) respectively, we have that
L = ED + EW and M = ED + ~EW
where
1/~
=
1 W
W
wi.
(5)
i=1
From a gradient descent perspective, minimising M effectively becomes equivalent to minimising
L, assuming that the regularisation parameter, , is continuously updated according to (5) following
every change in the vector of model parameters, w [17]. This requires only a very minor modifica
tion of the existing training algorithm, whilst eliminating the only training parameter and hence the
need for a model selection procedure in fitting the model.
2.1.2 Equivalence of Marginalisation and Optimisation under the Evidence Framework Williams [17] notes that, at least in the case of the Laplace prior, integrating out the regularisation parameter analytically is equivalent to its optimisation under the evidence framework of MacKay [22]. The argument provided by Williams can be summarised as follows: The evidence framework sets the value of the regularisation parameter so as to optimise the marginal likelihood, P (D) = P (Dw)P (w)dw,
also known as the evidence for the model. The Bayesian interpretation of the regularised objective
function gives,
P (D)
=
1 ZW
exp {L} dw,
where ZW is a normalising constant for the prior over the model parameters, for the Laplace prior, ZW = (2/)W . In the case of multinomial logistic regression, ED represents the negative logarithm of a normalised distribution, and so the corresponding normalising constant for the data misfit term
is redundant. Unfortunately this integral is analytically intractable, and so we adopt the Laplace
approximation, corresponding to a Gaussian posterior distribution for the model parameters, centred on their most probable value, wMP,
L(w)
=
L(wMP)
+
1 2
(w

wMP)T
A(w

wMP)
where A = L is the Hessian of the regularised objective function. The regulariser corresponding to the Laplace prior is locally a hyperplane, and so does not contribute to the Hessian and so A = ED. The negative logarithm of the evidence can then be written as,

log
P (D)
=
ED
+
EW
+
1 2
log
A
+
log
ZW
+
constant.
Setting the derivative of the evidence with respect to to zero, gives rise to a simple update rule for
the regularisation parameter,
1 ~
=
1 W
W
wj ,
j=1
which is equivalent to the update rule obtained using the integrateout approach. Maximising the evidence for the model also provides a convenient means for model selection. Using the Laplace approximation, evidence for a multinomial logistic regression model under the proposed Bayesian regularisation scheme is given by
 log {D} = ED + W log EW  log
(W ) 2W
+
1 2
log
A
+
constant
where A = L.
2.2 A Simple but Efficient Training Algorithm
In this study, we adopt a simplified version of the efficient componentwise training algorithm of Shevade and Keerthi [25], adapted for multinomial, rather than binomial, logistic regression. The principal advantage of a componentwise optimisation algorithm is that the Hessian matrix is not required, but only the first and second partial derivatives of the regularised training criterion. The first partial derivatives of the data misfit term are given by,
EDn anj
=
c i=1
EDn yin
yin anj
where
EDn yin
=

tni yin
,
yin anj
= yiij
 yiyj
and ij = 1 if i = j and otherwise ij = 0. Substituting, we obtain,
ED ai
= [yin  tni ] n=1
=
ED wij
= [yin  tni ] xnj n=1
=
yinxnj 
tni xnj .
n=1
n=1
Similarly, the second partial derivatives are given by,
2ED wij
=
n=1
xnj
yin wij
= yin (1  yin) n=1
xnj
2.
The Laplace regulariser is locally a hyperplane, with the magnitude of the gradient given by the regularisation parameter, ,
EW wij
= sign {wij}
and
2EW wi2j
= 0.
The partial derivatives of the regularisation term are not defined at the origin, and so we define the effective gradient of the regularised loss function as follows:
L wij
=
ED wij
+
ED wij

ED wij
+
ED wij

0
if wij > 0
if wij < 0
if
wij
=
0 and
ED wij
+
<
0
if
wij
=
0 and
ED wij

>
0
otherwise
Note that the value of a weight may be stable at zero if the derivative of the regularisation term dominates the derivative of the data misfit. The parameters of the model may then be optimised, using Newton's method, i.e.
wij
wij

ED wij
2ED wi2j
1 .
Any step that causes a change of sign in a model parameter is truncated and that parameter set to zero. All that remains is to decide on a heuristic used to select the parameter to be optimised in each step. In this study, we adopt the heuristic chosen by Shevade and Keerthi, in which the parameter having the steepest gradient is selected in each iteration. The optimisation proceeds using two nested loops, in the inner loop, only active parameters are considered. If no further progress can be made by optimising active parameters, the search is extended to parameters that are currently set to zero. An optimisation strategy based on scaled conjugate gradient descent [27] has also be found to be effective.
3 Results The proposed sparse multinomial logistic regression method incorporating Bayesian regularisation using a Laplace prior (SBMLR) was evaluated over a suite of wellknown benchmark datasets, against sparse multinomial logistic regression with fivefold crossvalidation based optimisation of the regularisation parameter using a simple line search (SMLR). Table 1 shows the test error rate and crossentropy statistics for SMLR and SBMLR methods over these datasets. Clearly, there is little reason to prefer either model over the other in terms of generalisation performance, as neither consistently dominates the other, either in terms of error rate or crossentropy. Table 1 also shows that the Bayesian regularisation scheme results in models with a slightly higher degree of sparsity (i.e. the proportion of weights pruned from the model). However, the most striking aspect of the comparison is that the Bayesian regularisation scheme is typically around two orders of magnitude faster than the crossvalidation based approach, with SBMLR being approximately five times faster in the worst case (COVTYPE). 3.1 The Value of Probabilistic Classification Probabilistic classifiers, i.e. those that providing an aposteriori estimate of the probability of class membership, can be used in minimum risk classification, using an appropriate loss matrix to account for the relative costs of different types of error. Probabilistic classifiers allow rejection thresholds to be set in a straightforward manner. This is particularly useful in a medical setting, where it may be prudent to refer a patient for further tests if the diagnosis is uncertain. Finally, the output of
Table 1: Evaluation of linear sparse multinomial logistic regression methods over a set of nine benchmark datasets. The best results for each statistic are shown in bold. The final column shows the logarithm of the ratio of the training times for the SMLR and SBMLR, such that a value of 2 would indicate that SBMLR is 100 times faster than SMLR for a given benchmark dataset.
Benchmark Covtype Crabs Glass Iris Isolet Satimage Viruses Waveform Wine
Error Rate SBMLR SMLR
0.4051 0.0350 0.3318 0.0267 0.0475 0.1610 0.0328 0.1290 0.0225
0.4041 0.0500 0.3224 0.0267 0.0513 0.1600 0.0328 0.1302 0.0281
Cross Entropy SBMLR SMLR
0.9590 0.1075 0.9398 0.0792 0.1858 0.3717 0.1670 0.3124 0.0827
0.9733 0.0891 0.9912 0.0867 0.2641 0.3708 0.1168 0.3131 0.0825
Sparsity SBMLR SMLR
0.4312 0.2708 0.4400 0.4067 0.9311 0.3694 0.8118 0.3712 0.6071
0.3069 0.0635 0.4700 0.4067 0.8598 0.2747 0.7632 0.3939 0.5524
log TSMLR 10 TSBMLR 0.6965 2.7949 1.9445 1.9802 1.3110 1.3083 2.1118 1.8133 2.5541
a probabilistic classifier can be adjusted after training to compensate for a difference between the relative class frequencies in the training set and those observed in operation. Saerens [4] provides a simple expectationmaximisation (EM) based procedure for estimating unknown operational apriori probabilities from the output of a probabilistic classifier (c.f. [28]). Let pt (Ci) represent the apriori probability of class Ci in the training set and pt (Cixn) represent the raw output of the classifier for the nth pattern of the test data (representing operational conditions). The operational apriori probabilities, po (Ci) can then be updated iteratively via
p(os)(ixn) =
p(os) (i ) pt (i )
pt
(i
xn
)
c j=1
p(os)(j ) pt(j )
pt
(j
xn
)
and
p(os+1)(i)
=
1
N p(os)(ixn), n=1
(6)
beginning with p(o0) (Ci) = pt (Ci). Note that the labels of the test examples are not required for this procedure. The adjusted estimates of aposteriori probability are then given by the first part of equation (6). The training and validation sets of the COVTYPE benchmark have been artificially balanced, by random sampling, so that each class is represented by the same number of examples. The test set consists of the unused patterns, and so the test set apriori probabilities are both highly disparate and very different from the training set apriori probabilities. Figure 1 and Table 2 summarise the results obtained using the raw and corrected outputs of a linear SBMLR model on this dataset, clearly demonstrating a key advantage of probabilistic classifiers over purely discriminative methods, for example the support vector machine (note the same procedure could be applied to the SMLR model with similar results).
0.5 training set
test set
0.4
estimated
apriori probability
Table 2: Error rate and average crossentropy score for linear SBMLR models of the COVTYPE benchmark, using the raw and corrected outputs.
Statistic
Raw Corrected
Error Rate 40.51% 28.57% CrossEntropy 0.9590 0.6567
0.3 0.2 0.1 0 1234567 class Figure 1: Training set, test set and estimated apriori probabilities for the COVTYPE benchmark.
4 Relationship to Existing Work
The sparsity inducing Laplace density has been utilized previously in [15, 25, 26, 2931] and
emerges as the marginal of a scalemixtureofGaussians where the corresponding prior is an Expo
nential such that
Nw(0, )E ()d
=
2
exp(w)
where E () is an Exponential distribution over with parameter and = . In [29] this
hierarchical representation of the Laplace prior is utilized to develop an EM style sparse binomial
probit regression algorithm. The hyperparameter is selected via crossvalidation but in an attempt
to circumvent this requirement a Jeffreys prior is placed on and is used to replace the exponential
distribution in the above integral. This yields an improper parameter free prior distribution over
w which removes the explicit requirement to perform any crossvalidation. However, the method developed in [29] is restricted to binary classification and has compute scaling O(d3) which prohibits
its use on moderately highdimensional problems.
Likewise in [13] the RVM employs a similar scalemixture for the prior where now the Exponential distribution is replaced by a Gamma distribution whose marginal yields a Student prior distribution. No attempt is made to estimate the associated hyperparameters and these are typically set to zero producing, as in [29], a sparsity inducing improper prior. As with [29] the original scaling of [13] is, at worst, O(d3), though more efficient methods have been developed in [14]. However the analysis holds only for a binary classifier and it would be nontrivial to extend this to the multiclass domain.
A similar multinomial logistic regression model to the one proposed in this paper is employed in [26] where the algorithm is applied to large scale classification problems and yet they, as with [25], have to resort to crossvalidation in obtaining a value for the hyperparameters of the Laplace prior.
5 Summary In this paper we have demonstrated that the regularisation parameter used in sparse multinomial logistic regression using a Laplace prior can be integrated out analytically, giving similar performance in terms of generalisation as is obtained using extensive crossvalidation based model selection, but at a greatly reduced computational expense. It is interesting to note that the SBMLR implements a strategy that is exactly the opposite of the relevance vector machine (RVM) [13], in that it integrates over the hyperparameter and optimises the weights, rather than marginalising the model parameters and optimising the hyperparameters. It seems reasonable to suggest that this approach is feasible in the case of the Laplace prior as the pruning action of this prior ensures that values of all of the weights are strongly determined by the data misfit term. A similar strategy has already proved effective in cancer classification based on gene expression microarray data in a binomial setting [32], and we plan to extend this work to multiclass cancer classification in the near future.
Acknowledgements The authors thank the anonymous reviewers for their helpful and constructive comments. MG is supported by EPSRC grant EP/C010620/1. References
[1] P. McCullagh and J. A. Nelder. Generalized linear models, volume 37 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, second edition, 1989. [2] D. W. Hosmer and S. Lemeshow. Applied logistic regression. Wiley, second edition, 2000. [3] C. K. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):4146, January 1970. [4] M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14(1):2141, 2001. [5] J. O. Berger. Statistical decision theory and Bayesian analysis. Springer Series in Statistics. Springer, second edition, 1985.
[6] J. Zhu and T. Hastie. Classification of gene microarrays by penalized logistic regression. Biostatistics, 5(3):427443, 2004. [7] X. Zhou,
X. Wang, and E. R. Dougherty. Multiclass cancer classification using multinomial probit regression with Bayesian gene selection. IEE Proceedings  Systems Biology, 153(2):7076, March 2006. [8] T. Zhang and F. J. Oles. Text categorization based on regularised linear
classification methods. Information Retrieval, 4(1):531, April 2001. [9] L. Narlikar and A. J. Hartemink. Sequence features of DNA binding sites reveal structural class of associated transcription factor. Bioinformatics, 22(2):157163, 2006. [10] M. J. L. Orr. Regularisation in the selection of radial basis function centres. Neural Computation, 7(3):606623, 1995. [11] Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalisation. In L. Niklasson, M. Bodeґn, and T. Ziemske, editors, Proceedings of the
International Conference on
Artificial Neural Networks, Perspectives in Neural Computing, pages 201206, SkoЁvde, Sweeden, September 24 1998. Springer. [12] Y. Grandvalet and S. Canu. Outcomes of the quivalence of adaptive ridge with least absolute shrinkage. In Advances in Neural Information Processing Systems, volume 11. MIT Press, 1999. [13] M. E. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. Journal of
Machine Learning Research, 1:211244, 2001. [14] A. C. Faul and M. E. Tipping. Fast marginal likelihood maximisation for sparse Bayesian models. In C. M. Bishop and B. J. Frey, editors, Proceedings of the Ninth International Workshop on
artificial intelligence and Statistics, Key West, FL, USA, 36 January 2003. [15] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the
Royal Statistical Society  Series B, 58:267288, 1996. [16] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407499, 2004. [17] P. M. Williams. Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7(1):117143, 1995. [18] C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995. [19] J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fogelman Soulieґ and J. Heґrault, editors, Neurocomputing: Algorithms, architectures and applications, pages 227236. SpringerVerlag, New York, 1990. [20] A. N. Tikhonov and V. Y. Arsenin. Solutions of illposed problems. John Wiley, New York, 1977. [21] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilema. Neural Computation, 4(1):158, 1992. [22] D. J. C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4(5):720736, 1992. [23] W. L. Buntine and A. S. Weigend. Bayesian backpropagation. Complex Systems, 5:603643, 1991. [24] I. S. Gradshteyn and I. M. Ryzhic. Table of Integrals, Series and Products. Academic Press,
Fifth Edition, 1994. [25] S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):22462253, 2003. [26] D. Madigan, A. Genkin, D. D. Lewis, and D. Fradkin. Bayesian multinomial logistic regression for author identification. In AIP Conference Proceedings, volume 803, pages 509516, 2005. [27] P. M. Williams. A Marquardt algorithm for choosing the step size in backpropagation learning with conjugate gradients.
Technical report CSRP229,
University of Sussex, February 1991. [28] G. J. McLachlan. Discriminant analysis and statistical pattern recognition. Wiley, 1992. [29] M. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):11501159, September 2003. [30] B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink. Sprse multinomial logistic regression: Fast algorithms and generalisation bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957968, June 2005. [31] J. M. BioucasDias, M. A. T. Figueiredo, and J. P. Oliveira. Adaptive total variation image deconvolution: A majorizationminimization approach. In Proceedings of the European Signal Processing Conference (EUSIPCO'2006), Florence, Italy, September 2006. [32] G. C. Cawley and N. L. C. Talbot. Gene selection in cancer classification using sparse logistic regression with Bayesian regularisation. Bioinformatics, 22(19):23482355, October 2006.