Censor, Sketch, and Validate for Learning from Large-Scale Data

Georgios B. Giannakis

Acknowledgments: D. Berberidis, F. Sheikholeslami, P. Traganitis; and Prof. G. Mateos. Supported by NSF grants 1343860, 1442686, 1514056, NSF-AFOSR 1500713, MURI-FA9550-10-1-0567, and NIH 1R01GM104975-01.

ISR, U. of Maryland, March 4, 2016
Learning from "Big Data"

Challenges: big size (large D and/or N), fast streaming, incomplete data, noise and outliers.

Opportunities in key tasks:
- Dimensionality reduction
- Online and robust regression, classification, and clustering
- Denoising and imputation

[Figure: the Internet as a prime source of large-scale data.]
Roadmap
- Context and motivation
- Large-scale linear regressions
  - Random projections for data sketching
  - Adaptive censoring of uninformative data
  - Tracking high-dimensional dynamical data
- Large-scale nonlinear function approximation
- Large-scale data and graph clustering
- Leveraging sparsity and low rank for anomalies and tensors
- Closing comments
Random projections for data sketching

Ordinary least-squares (LS): given $\{(\mathbf{x}_n, y_n)\}_{n=1}^{D}$ with $\mathbf{X} \in \mathbb{R}^{D \times p}$,
$$\hat{\boldsymbol{\theta}}_{\mathrm{LS}} = \arg\min_{\boldsymbol{\theta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|_2^2 .$$
If $D \gg p$, the SVD-based solution incurs complexity $\mathcal{O}(Dp^2)$. Q: What if $D$ is so large that $\mathcal{O}(Dp^2)$ is prohibitive?

LS estimate via a (pre-conditioning) random projection matrix $\mathbf{R} \in \mathbb{R}^{d \times D}$:
$$\hat{\boldsymbol{\theta}}_{\mathrm{RP}} = \arg\min_{\boldsymbol{\theta}} \|\mathbf{R}(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})\|_2^2 .$$
For $d \ll D$, the solve complexity reduces to $\mathcal{O}(dp^2)$, plus the cost of forming $\mathbf{RX}$.
M. W. Mahoney, "Randomized Algorithms for Matrices and Data," Foundations and Trends in Machine Learning, vol. 3, no. 2, pp. 123-224, Nov. 2011.
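A minimal NumPy sketch of the sketch-and-solve recipe above, using a plain Gaussian projection for clarity (Mahoney '11 analyzes faster, Hadamard-based preconditioners); all names and sizes here are illustrative:

```python
import numpy as np

def randomized_ls(X, y, d, seed=None):
    """Sketch-and-solve LS: project (X, y) down to d rows, then solve.

    A Gaussian sketch stands in for the structured projections used in
    practice; scaling by 1/sqrt(d) keeps E[R^T R] = I.
    """
    rng = np.random.default_rng(seed)
    D = X.shape[0]
    R = rng.standard_normal((d, D)) / np.sqrt(d)  # d x D projection
    theta, *_ = np.linalg.lstsq(R @ X, R @ y, rcond=None)
    return theta

# Toy check with D >> p
rng = np.random.default_rng(0)
D, p = 10_000, 50
X = rng.standard_normal((D, p))
theta_true = rng.standard_normal(p)
y = X @ theta_true + 0.1 * rng.standard_normal(D)

theta_rp = randomized_ls(X, y, d=500, seed=1)
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(theta_rp - theta_ls))  # small gap w.h.p.
```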
Performance of randomized LS

Based on the Johnson-Lindenstrauss lemma [JL'84].

Theorem. For any $\epsilon \in (0,1)$, if $d = \mathcal{O}(p \log p / \epsilon)$, then w.h.p.
$$\|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}}_{\mathrm{RP}}\|_2 \le (1+\epsilon)\,\|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}}_{\mathrm{LS}}\|_2 ;$$
the bound on the estimation error also involves the condition number of $\mathbf{X}$.

[Figure: uniform sampling versus Hadamard preconditioning, D = 10,000 and p = 50.] Performance depends on $\mathbf{X}$.
D. P. Woodruff, "Sketching as a Tool for Numerical Linear Algebra," Foundations and Trends in Theoretical Computer Science, vol. 10, pp. 1-157, 2014.
Online censoring for large-scale regressions

Key idea: sequentially test, and update LS estimates only for, informative data.

Adaptive censoring (AC) rule: censor $(\mathbf{x}_n, y_n)$ if the normalized innovation satisfies
$$\frac{|y_n - \mathbf{x}_n^\top \hat{\boldsymbol{\theta}}_{n-1}|}{\sigma} \le \tau .$$
The threshold $\tau$ controls the average data reduction.

D. K. Berberidis, G. Wang, G. B. Giannakis, and V. Kekatos, "Adaptive Estimation from Big Data via Censored Stochastic Approximation," Proc. of Asilomar Conf., Pacific Grove, CA, Nov. 2014.
Censoring algorithms and performance

AC least mean-squares (LMS) and AC recursive least-squares (RLS), the latter at complexity $\mathcal{O}(p^2)$ per update.

Proposition 1 characterizes the convergence and steady-state performance of AC-RLS and AC-LMS; see the reference below for the precise statement.
D. K. Berberidis and G. B. Giannakis, "Online Censoring for Large-Scale Regressions," IEEE Trans. on Signal Processing, July 2015 (submitted); also Proc. of ICASSP, Brisbane, Australia, April 2015.
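As a rough illustration of the AC rule, here is a minimal AC-LMS loop, assuming a known noise scale sigma and a fixed threshold tau; the update is a plain LMS step on the surviving data, not the exact recursion of the paper:

```python
import numpy as np

def ac_lms(stream, p, mu=0.005, tau=1.0, sigma=1.0):
    """Adaptive-censoring LMS: update only on 'informative' data.

    An observation is censored when its normalized innovation
    |y - x'theta| / sigma falls below tau; tau trades data reduction
    against estimation accuracy.
    """
    theta = np.zeros(p)
    n_used = 0
    for x, y in stream:
        e = y - x @ theta            # innovation w.r.t. current estimate
        if abs(e) / sigma <= tau:    # uninformative: censor, no update
            continue
        theta += mu * e * x          # LMS update on surviving data
        n_used += 1
    return theta, n_used

# Toy stream
rng = np.random.default_rng(0)
p, N = 20, 50_000
theta_true = rng.standard_normal(p)
stream = ((x, x @ theta_true + 0.5 * rng.standard_normal())
          for x in rng.standard_normal((N, p)))
theta, n_used = ac_lms(stream, p, tau=0.7, sigma=0.5)
print(n_used / N, np.linalg.norm(theta - theta_true))
```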
Censoring vis-a-vis random projections

RPs for linear regressions [Mahoney '11], [Woodruff '14]:
- Data-agnostic reduction; preconditioning costs

AC for linear regressions:
- Data-driven measurement selection
- Suitable also for streaming data
- Minimal memory requirements

AC interpretations:
- Reveals 'causal' support vectors
- Censors data with low log-likelihood ratios (LLRs)
Performance comparison

Synthetic data: D = 10,000, p = 300 (50 MC runs). Real data: highly non-uniform rows, with the benchmark estimate obtained from the full data set.

AC-RLS outperforms alternatives at comparable complexity, and is robust to uniform (all "important") rows of X. Q: What about time-varying parameters?
Tracking high-dimensional dynamical data

[Diagram: Kalman filtering viewed as regularized regression. Data enter through the observation model of a time-varying state; the transition model supplies predictor-based regularization for the regularized-regression estimate.]

Low-complexity, reduced-dimension KF with large-scale correlated data; the standard prediction and correction steps are
$$\text{Prediction:}\quad \hat{\mathbf{x}}_{t|t-1} = \mathbf{F}_t \hat{\mathbf{x}}_{t-1|t-1}, \qquad \mathbf{P}_{t|t-1} = \mathbf{F}_t \mathbf{P}_{t-1|t-1} \mathbf{F}_t^\top + \mathbf{Q}_t$$
$$\text{Correction:}\quad \hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + \mathbf{K}_t\big(\mathbf{y}_t - \mathbf{H}_t \hat{\mathbf{x}}_{t|t-1}\big), \qquad \mathbf{K}_t = \mathbf{P}_{t|t-1}\mathbf{H}_t^\top \big(\mathbf{H}_t \mathbf{P}_{t|t-1}\mathbf{H}_t^\top + \mathbf{R}_t\big)^{-1}.$$
Sketching for dynamical processes

Weighted LS correction incurs prohibitive complexity for high-dimensional observations. Related works are either costly at the fusion center [Varshney et al '14], or data-agnostic with model-driven ensemble optimality [Krause-Guestrin '11]. Our economical KF: sketch informative data.

[Diagram: data at time t pass through dimensionality reduction before the KF predictor-corrector.]
RP-based KF

Same predictor as the full-data KF; RP-based sketching yields a sketched corrector.

Proposition 2 quantifies, w.h.p., the accuracy of the sketched correction as a function of the sketch dimension $d$; see the reference below for the precise conditions.
D. K. Berberidis and G. B. Giannakis, "Data Sketching for Tracking Large-Scale Dynamical Processes," Proc. of Asilomar Conf., Pacific Grove, CA, Nov. 2015.
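A hedged sketch of one RP-based correction step: a Gaussian matrix Phi compresses the D-dimensional measurement model before the usual KF update. Proposition 2 governs the choice of d, which this toy code does not enforce:

```python
import numpy as np

def rp_kf_correct(x_pred, P_pred, y, H, Rv, d, rng):
    """One sketched KF correction step.

    Phi compresses the D-dimensional model y = H x + v to d dimensions,
    so the gain computation costs O(d^3) instead of O(D^3).
    """
    D = y.shape[0]
    Phi = rng.standard_normal((d, D)) / np.sqrt(d)  # RP sketching matrix
    y_s, H_s = Phi @ y, Phi @ H                     # sketched data/model
    R_s = Phi @ Rv @ Phi.T                          # sketched noise cov.
    S = H_s @ P_pred @ H_s.T + R_s                  # innovation cov. (d x d)
    K = P_pred @ H_s.T @ np.linalg.inv(S)           # sketched Kalman gain
    x_corr = x_pred + K @ (y_s - H_s @ x_pred)
    P_corr = P_pred - K @ H_s @ P_pred
    return x_corr, P_corr
```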
Simulated test

AC-KF outperforms RP-KF and KF with greedy design of experiments (DOE)! AC-KF complexity is also much lower than that of greedy DOE-KF.
A. Krause and C. Guestrin, "Submodularity and its Applications in Optimized Information Gathering: An Introduction," ACM Trans. on Intelligent Systems and Technology, vol. 2, July 2011.
Roadmap
- Context and motivation
- Large-scale linear regressions
- Large-scale nonlinear function approximation
  - Online kernel regression on a budget
  - Online kernel classification on a budget
- Large-scale data and graph clustering
- Leveraging sparsity and low rank for anomalies and tensors
- Closing comments
Linear or nonlinear functions for learning?

Regression or classification: given $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, find $f$ with $y_n \simeq f(\mathbf{x}_n)$.

Lift via a nonlinear map $\boldsymbol{\phi}: \mathbf{x} \mapsto \boldsymbol{\phi}(\mathbf{x})$ to a linear model $f(\mathbf{x}) = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\theta}$.

Pre-select a kernel (inner product) function $K(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$, e.g., the Gaussian kernel $K(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|_2^2 / 2\sigma^2)$.

RKHS basis expansion: $f(\mathbf{x}) = \sum_{n=1}^{N} \alpha_n K(\mathbf{x}, \mathbf{x}_n)$.

Kernel-based nonparametric ridge regression:
$$\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} \|\mathbf{y} - \mathbf{K}\boldsymbol{\alpha}\|_2^2 + \lambda\, \boldsymbol{\alpha}^\top \mathbf{K}\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1}\mathbf{y} .$$
Memory requirement $\mathcal{O}(N^2)$, and complexity $\mathcal{O}(N^3)$.
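To make the bottleneck concrete, a minimal kernel ridge regression in NumPy with an assumed Gaussian kernel; forming $\mathbf{K}$ costs $\mathcal{O}(N^2)$ memory and the solve $\mathcal{O}(N^3)$ time:

```python
import numpy as np

def kernel_ridge(X, y, lam=1e-2, bandwidth=1.0):
    """Plain kernel ridge regression: alpha = (K + lam*I)^{-1} y."""
    def gram(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * bandwidth**2))
    K = gram(X, X)                                    # N x N kernel matrix
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return lambda Xq: gram(Xq, X) @ alpha             # f(x) = sum_n alpha_n K(x, x_n)

# Toy 1-D fit
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
f = kernel_ridge(X, y, lam=0.1, bandwidth=0.5)
print(f(np.array([[1.0]])))  # approximately sin(1)
```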
Low-rank lifting for function approximation

Low-rank (r) subspace learning [Mardani-Mateos-GG '14], here on lifted data, with a group-sparsity regularizer.

BCD solver, with the factors from the previous iteration available at iteration $i$:
S1. Find projection coefficients via regularized least-squares.
S2. Find the subspace factor via (in)exact group-shrinkage solutions.

Nystrom approximation: a special case of this formulation.

Low-rank subspace tracking via stochastic approximation (also with a "budget").

F. Sheikholeslami and G. B. Giannakis, "Scalable Kernel-based Learning via Low-rank Approximation of Lifted Data," Proc. of Intl. Conf. on Machine Learning, New York City, submitted Feb. 2016.
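A minimal sketch of the Nystrom special case mentioned above: sample r landmarks and map each point to rank-r features whose inner products approximate the kernel; uniform landmark sampling and the Gaussian kernel are simplifying assumptions:

```python
import numpy as np

def nystrom_features(X, r, bandwidth=1.0, seed=None):
    """Rank-r Nystrom features: K ~ K_Nr K_rr^{-1} K_Nr^T.

    Maps each x to K_rr^{-1/2} k_r(x), so feature inner products
    reproduce the Nystrom approximation of the full kernel matrix.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=r, replace=False)
    L = X[idx]                                        # landmark points
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * bandwidth**2))
    Krr = k(L, L)
    # inverse square root via eigendecomposition (Krr is PSD)
    w, V = np.linalg.eigh(Krr + 1e-10 * np.eye(r))
    Krr_isqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
    return k(X, L) @ Krr_isqrt                        # N x r feature matrix
```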
Online kernel regression and classification

Kernel matrix approximation. Proposition 4 (informal): if the kernel matrix can be approximated by a rank-r factorization, then, with probability at least a prescribed level, the resulting kernel mismatch is bounded, and kernel regression reduces to linear regression on the extracted features.

High-performance online kernel-based Feature Extraction on a Budget (OK-FEB); bounds extend also to support vector machines for regression and classification.
Kernel approximation via low-rank features

Adult dataset: infer whether annual income exceeds 50K using features such as education, age, gender, ...; 80%-20% split for training-testing.

[Figure: kernel-matrix mismatch versus rank r (5 to 50) for BCD-Ex, BCD-PG, Nyst, Prb-Nyst, Imp-Nyst, and SVD; run time of OK-FEB+LSVM versus K-SVM.]

OK-FEB+LSVM outperforms K-SVM (LibSVM) in both training and testing phases.
C. C. Chang and C. J. Lin, "LIBSVM: A Library for Support Vector Machines," ACM Trans. on Intelligent Systems and Technology, vol. 2, pp. 1-27, April 2011.
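OK-FEB itself is an online method; as a rough batch stand-in for the "low-rank features + linear SVM" recipe, scikit-learn's Nystroem transformer plus LinearSVC can be pitted against a full kernel SVM (synthetic data substitutes for Adult here, and all parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC

# Stand-in for the Adult task
X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

# Low-rank (r = 50) kernel features followed by a linear SVM
feb_like = make_pipeline(
    Nystroem(gamma=0.02, n_components=50, random_state=0),
    LinearSVC(),
)
feb_like.fit(Xtr, ytr)
print("low-rank features + linear SVM acc:", feb_like.score(Xte, yte))

# Full kernel SVM, subsampled because it scales poorly with N
ksvm = SVC(kernel="rbf", gamma=0.02).fit(Xtr[:5000], ytr[:5000])
print("kernel SVM (subsampled) acc:", ksvm.score(Xte, yte))
```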
OK-FEB with linear classification and regression

[Figure: Adult dataset (classification): misclassification probability and CPU time versus iteration (sample number), comparing Perceptron, OGD, RBP, Forgetron, Projectron, BOGD, BPA, and OK-FEB. Slice dataset (regression): average LS-error and CPU time versus iteration, comparing OGD, Perceptron, Norma, RBP, Forgetron, BOGD, and OK-FEB.]
OK-FEB+LSVM outperforms budgeted K-SVM/SVR variants in classification/regression.

F. Sheikholeslami, D. K. Berberidis, and G. B. Giannakis, "Kernel-based Low-rank Feature Extraction on a Budget for Big Data Streams," Proc. of GlobalSIP Conf., Dec. 14-16, 2015; also in arXiv:1601.07947.
Roadmap
- Context and motivation
- Large-scale linear regressions
- Large-scale nonlinear function approximation
- Large-scale data and graph clustering
  - Random sketching and validation (SkeVa)
  - SkeVa-based spectral and subspace clustering
- Leveraging sparsity and low rank for anomalies and tensors
- Closing comments
Big data clustering

Clustering: given $\{\mathbf{x}_n\}_{n=1}^{N} \subset \mathbb{R}^D$, or their pairwise distances, assign them to K clusters.

Hard clustering over centroids $\{\mathbf{m}_k\}$ and assignments $\{b_{nk}\}$,
$$\min_{\{\mathbf{m}_k\},\{b_{nk}\}} \sum_{n=1}^{N}\sum_{k=1}^{K} b_{nk}\,\|\mathbf{x}_n - \mathbf{m}_k\|_2^2, \qquad b_{nk} \in \{0,1\},\ \textstyle\sum_k b_{nk} = 1,$$
is NP-hard; soft clustering relaxes the binary assignments. K-means: locally optimal, but simple; complexity O(NDKI) over I iterations.

Probabilistic clustering amounts to pdf estimation: Gaussian mixtures (EM-based estimation); a regularizer can account for unknown K.

Q: What if N and/or D are huge?

A1. Random projections: use a d x D matrix R to form RX; apply K-means in d-space.
C. Boutsidis, A. Zouzias, P. Drineas, and M. W. Mahoney, "Randomized Dimensionality Reduction for K-means Clustering," IEEE Trans. on Information Theory, vol. 61, pp. 1045-1062, Feb. 2015.
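A1 in a few lines, assuming scikit-learn: project the D-dimensional points with a Gaussian random matrix, then run K-means in the d-dimensional sketch (all sizes below are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection

# Synthetic high-dimensional mixture
rng = np.random.default_rng(0)
N, D, K, d = 5000, 2000, 5, 50
centers = 5 * rng.standard_normal((K, D))
X = centers[rng.integers(K, size=N)] + rng.standard_normal((N, D))

# Sketch to d dimensions, then cluster the sketched points
RX = GaussianRandomProjection(n_components=d, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(RX)
```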
Random sketching and validation (SkeVa)

Idea: randomly select "informative" dimensions.

Algorithm, for each draw:
- Sketch d dimensions and run K-means on the sketched data
- Re-sketch d' dimensions and augment the centroids accordingly
- Validate using a consensus set

Similar approaches are possible for huge N; sequential and kernel variants are available.
P. A. Traganitis, K. Slavakis, and G. B. Giannakis, "Sketch and Validate for Big Data Clustering," IEEE Journal of Selected Topics in Signal Processing, vol. 9, pp. 678-690, June 2015.
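A loose paraphrase of the SkeVa loop in NumPy/scikit-learn: cluster on d sketched dimensions, score each draw on freshly re-sketched validation dimensions, and keep the best draw. The paper's consensus-based validation is replaced here by a simple held-out scatter score:

```python
import numpy as np
from sklearn.cluster import KMeans

def skeva_kmeans(X, K, d, d_val, n_draws=20, seed=0):
    """Sketch-and-validate clustering over random feature subsets."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    best_score, best_labels = np.inf, None
    for _ in range(n_draws):
        dims = rng.choice(D, size=d, replace=False)        # sketch
        labels = KMeans(n_clusters=K, n_init=5).fit_predict(X[:, dims])
        vdims = rng.choice(D, size=d_val, replace=False)   # re-sketch
        # validate: within-cluster scatter on the held-out dimensions
        score = 0.0
        for k in range(K):
            Xk = X[labels == k][:, vdims]
            if len(Xk) > 1:
                score += ((Xk - Xk.mean(0)) ** 2).sum()
        if score < best_score:                             # keep best draw
            best_score, best_labels = score, labels
    return best_labels
```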
Divergence-based SkeVa

Idea: "informative" draws yield reliable estimates of the multimodal data pdf!

Compare pdf estimates via "distances," e.g., the integrated square-error (ISE).

For each draw: sketch points and form a pdf estimate; if the divergence from previous draws is small, re-sketch; otherwise keep the draw, cluster, and associate the remaining points with the resulting clusters.
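For Gaussian kernel density estimates, the ISE between two draws has a closed form; this sketch computes it directly (the bandwidth h is an assumed tuning parameter, and A, B are the two sketched point sets):

```python
import numpy as np

def gauss_kernel_mean(A, B, h):
    """Mean of N(a_i - b_j; 0, 2 h^2 I) over all pairs (KDE cross term)."""
    d = A.shape[1]
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    norm = (4 * np.pi * h**2) ** (-d / 2)
    return norm * np.exp(-sq / (4 * h**2)).mean()

def ise(A, B, h=0.5):
    """Closed-form integrated square error between two Gaussian KDEs.

    ISE = int (p_A - p_B)^2 = T_AA + T_BB - 2 T_AB, where each term is
    a pairwise Gaussian convolution of the kernel centers.
    """
    return (gauss_kernel_mean(A, A, h) + gauss_kernel_mean(B, B, h)
            - 2 * gauss_kernel_mean(A, B, h))
```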
RP versus SkeVa comparisons

KDDb dataset (subset): D = 2,990,384, N = 10,000, K = 2. RP [Boutsidis et al '15] versus SkeVa.
Performance and SkeVa generalizations

Di-SkeVa is fully parallelizable.

Q: How many samples/draws does SkeVa need?
A: For independent draws, the number of draws can be lower bounded.

Proposition 5. For a given probability of a successful Di-SkeVa draw, quantified by a pdf distance, the required number of draws is lower bounded w.h.p.

The bound can be estimated online. The SkeVa module can also be used for spectral clustering and subspace clustering.
Identification of network communities

Kernel K-means is instrumental for partitioning large graphs (spectral clustering); it relies on the graph Laplacian to capture nodal correlations. arXiv collaboration network (general relativity): N = 4,158 nodes, 13,422 edges, K = 36 [Leskovec '11].

Run times: spectral clustering, 3.1 sec; SkeVa (n = 500), 0.5 sec; SkeVa (n = 1,000), 0.85 sec.

For n << N, kernel-based SkeVa reduces complexity accordingly.
P. A. Traganitis, K. Slavakis, and G. B. Giannakis, "Spectral Clustering of Large-scale Communities via Random Sketching and Validation," Proc. Conf. on Info. Science and Systems, Baltimore, Maryland, March 18-20, 2015.
Modeling Internet traffic anomalies

Anomalies: changes in origin-destination (OD) flows [Lakhina et al '04]: failures, congestion, DoS attacks, intrusions, flooding.

Graph G(N, L) with N nodes, L links, and F flows (F >> L); OD flow $z_{f,t}$; packet counts per link $l$ and time slot $t$; binary {0,1} routing of flows to links. Matrix model across T time slots: $\mathbf{Y} = \mathbf{R}(\mathbf{Z} + \mathbf{A}) + \mathbf{V}$.

[Figure: network graph with a link $l$ carrying flows $f_1$ and $f_2$.]
M. Mardani, G. Mateos, and G. B. Giannakis, "Recovery of Low-rank Plus Compressed Sparse Matrices with Application to Unveiling Traffic Anomalies," IEEE Trans. on Information Theory, pp. 5186-5205, Aug. 2013.
Low-rank plus sparse matrices

$\mathbf{Z}$ (and $\mathbf{X} := \mathbf{RZ}$) is low rank, e.g., [Zhang et al '05]; $\mathbf{A}$ is sparse across time and flows. (P1) estimates $(\mathbf{X}, \mathbf{A})$ via an LS fit regularized by the nuclear norm of $\mathbf{X}$ and the $\ell_1$-norm of $\mathbf{A}$.

[Figure: anomaly amplitudes $|a_{f,t}|$ versus time index $t$.]

Data: http://math.bu.edu/people/kolaczyk/datasets.html
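A simplified alternating-proximal solver in the spirit of (P1): singular-value thresholding for the low-rank part and soft-thresholding for the sparse part. The routing matrix is set to identity here, unlike the paper's compressed model with $\mathbf{RA}$:

```python
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

def soft(M, tau):
    """Entrywise soft-thresholding: prox of tau * l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0)

def low_rank_plus_sparse(Y, lam_star=1.0, lam_1=0.05, n_iter=100):
    """Alternate exact minimization of
    0.5 * ||Y - X - A||_F^2 + lam_star * ||X||_* + lam_1 * ||A||_1.
    """
    X = np.zeros_like(Y)
    A = np.zeros_like(Y)
    for _ in range(n_iter):
        X = svt(Y - A, lam_star)   # low-rank block fits the residual
        A = soft(Y - X, lam_1)     # sparse block captures anomalies
    return X, A
```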
Internet2 data: real network data, Dec. 8-28, 2003.

[Figure: ROC curves (detection vs. false-alarm probability) for [Lakhina04] with ranks 1-3, [Zhang05] with ranks 1-3, and the proposed method; and true vs. estimated anomaly volumes across flows and time at Pfa = 0.03, Pd = 0.92.]

Improved performance by leveraging sparsity and low rank; succinct depiction of the network health state across flows and time.
Data: http://www.cs.bu.edu/~crovella/links.html
From low-rank matrices to tensors

Data cube $\underline{\mathbf{Y}} \in \mathbb{R}^{I \times J \times T}$, e.g., sub-sampled MRI frames.

PARAFAC decomposition per slab t [Harshman '70]: with factor matrices $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_R]$, $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_R]$, and $\mathbf{C} = [\mathbf{c}_1, \ldots, \mathbf{c}_R]$,
$$\mathbf{Y}_t \approx \mathbf{A}\,\mathrm{diag}(\mathbf{c}_t)\,\mathbf{B}^\top = \sum_{r=1}^{R} c_{t,r}\,\mathbf{a}_r \mathbf{b}_r^\top ;$$
the tensor subspace comprises R rank-one matrices.

Goal: given streaming $\mathbf{Y}_t$, learn the subspace matrices (A, B) recursively, and impute possible misses of $\mathbf{Y}_t$.
J. A. Bazerque, G. Mateos, and G. B. Giannakis, "Rank regularization and Bayesian inference for tensor completion
and extrapolation," IEEE Trans. on Signal Processing, vol. 61, no. 22, pp. 5689-5703, Nov. 2013.
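A hedged sketch of the per-slab step: given current factors (A, B), fit the slab weights $\mathbf{c}_t$ by masked least squares on the observed entries of $\mathbf{Y}_t$, then impute the misses as $\mathbf{A}\,\mathrm{diag}(\mathbf{c}_t)\,\mathbf{B}^\top$; the paper's recursive factor updates are omitted:

```python
import numpy as np

def slab_coeffs(Yt, mask, A, B):
    """LS fit of c_t in Y_t ~ A diag(c_t) B^T on observed entries only.

    Column r of the design matrix is vec(a_r b_r^T) restricted to the
    observed entries, so the fit is an ordinary least-squares problem.
    """
    R = A.shape[1]
    M = np.stack([np.outer(A[:, r], B[:, r])[mask] for r in range(R)], axis=1)
    c, *_ = np.linalg.lstsq(M, Yt[mask], rcond=None)
    return c

# Toy slab: recover weights from ~30% observed entries of a rank-2 slab
rng = np.random.default_rng(0)
I, J, R = 40, 30, 2
A, B = rng.standard_normal((I, R)), rng.standard_normal((J, R))
c_true = rng.standard_normal(R)
Yt = A @ np.diag(c_true) @ B.T
mask = rng.random((I, J)) > 0.7          # boolean observation mask
c = slab_coeffs(Yt, mask, A, B)
Yt_hat = A @ np.diag(c) @ B.T            # imputes the missing entries
print(np.abs(c - c_true).max())
```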
Online tensor subspace learning

Low tensor rank in the image domain. Stochastic alternating minimization, parallelizable across bases; real-time reconstruction (one FFT per iteration).
M. Mardani, G. Mateos, and G. B. Giannakis, "Subspace learning and imputation for streaming big data
matrices and tensors," IEEE Trans. on Signal Processing, vol. 63, pp. 2663 - 2677, May 2015.
Dynamic cardiac MRI test

In vivo dataset: 256 k-space frames, 200x256 each.

[Figure: ground-truth frame; sampling trajectory; reconstructions with R = 100 at 90% misses and R = 150 at 75% misses.]

Potential for accelerating MRI at high spatio-temporal resolution. A low-rank plus sparse model can also capture motion effects.
M. Mardani and G. B. Giannakis, "Accelerating dynamic MRI via tensor subspace learning,"
Proc. of ISMRM 23rd Annual Meeting and Exhibition, Toronto, Canada, May 30 - June 5, 2015.
Closing comments

Large-scale learning:
- Regression and tracking of dynamic data
- Nonlinear, non-parametric function approximation
- Clustering massive, high-dimensional data and graphs

Other key Big Data tasks: visualization, mining, privacy, and security.

Enabling tools: Big Data acquisition, processing, and storage; fundamental theory and performance analysis; decentralized, robust, and parallel algorithms; scalable computing platforms.

Big Data application domains: sustainable systems; social, health, and bio-systems; life-enriching multimedia; secure cyberspace; business and marketing systems ...

Thank You!
K. Slavakis, G. B. Giannakis, and G. Mateos, "Modeling and Optimization for Big Data Analytics," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 18-31, Sep. 2014.
