Learning from "Big Data"

Challenges: big size and/or fast streaming; incomplete data; noise and outliers

Opportunities in key tasks: dimensionality reduction; online and robust regression, classification, and clustering; denoising and imputation

Roadmap
Context and motivation
Large-scale linear regressions
  Random projections for data sketching
  Adaptive censoring of uninformative data
  Tracking high-dimensional dynamical data
Large-scale nonlinear function approximation
Large-scale data and graph clustering
Leveraging sparsity and low rank for anomalies and tensors
Closing comments

Random projections for data sketching

Ordinary least-squares (LS): given data X of size D x p and responses y, find the estimate minimizing ||y - X theta||^2. If D >> p, the SVD-based solution incurs O(Dp^2) complexity.

Q: What if D is so large that O(Dp^2) is prohibitive?

LS estimate via a (pre-conditioning) random projection matrix R of size d x D: solve the sketched problem min ||R(y - X theta)||^2. For d << D, complexity reduces to that of a d x p regression plus the cost of forming RX.

M. W. Mahoney, "Randomized Algorithms for Matrices and Data," Foundations and Trends in Machine Learning, vol. 3, no. 2, pp. 123-224, Nov. 2011.
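As a concrete illustration, the sketched-LS idea can be written in a few lines of numpy. The preconditioned projection of the slide (e.g., a subsampled randomized Hadamard transform) is replaced here by a plain Gaussian R purely to keep the example short; all names and sizes are illustrative.

```python
import numpy as np

def sketched_ls(X, y, d, rng=None):
    """Solve min ||R(y - X theta)|| with a d x D Gaussian sketch R."""
    rng = np.random.default_rng(rng)
    D = X.shape[0]
    R = rng.standard_normal((d, D)) / np.sqrt(d)  # JL-style scaling
    # LS on the sketched (d x p) problem instead of the full (D x p) one
    theta, *_ = np.linalg.lstsq(R @ X, R @ y, rcond=None)
    return theta

# Toy check: for d >> p the sketched estimate approaches the full LS solution
rng = np.random.default_rng(0)
D, p = 10_000, 5
X = rng.standard_normal((D, p))
theta_true = rng.standard_normal(p)
y = X @ theta_true + 0.01 * rng.standard_normal(D)
theta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
theta_sk = sketched_ls(X, y, d=500, rng=1)
print(np.linalg.norm(theta_sk - theta_full))  # small relative to ||theta||
```

The sketch dimension d trades accuracy for complexity: the d x p problem costs O(dp^2) instead of O(Dp^2), at the price of a (1 + epsilon) inflation of the LS cost.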


Performance of randomized LS

Based on the Johnson-Lindenstrauss lemma [JL'84].

Theorem. For any epsilon > 0, if the sketch dimension d is large enough, then w.h.p. the sketched LS solution attains the minimum LS cost to within a (1 + epsilon) factor; the required d depends on the condition number of X.

Uniform sampling versus Hadamard preconditioning (D = 10,000 and p = 50): performance depends on X.

D. P. Woodruff, "Sketching as a Tool for Numerical Linear Algebra," Foundations and Trends in Theoretical Computer Science, vol. 10, pp. 1-157, 2014.


Online censoring for large-scale regressions

Key idea: sequentially test the data, and update LS estimates only for informative data.

Adaptive censoring (AC) rule: censor a datum if its (normalized) prediction-error criterion falls below a threshold; the threshold controls the average data reduction.

D. K. Berberidis, G. Wang, G. B. Giannakis, and V. Kekatos, "Adaptive Estimation from Big Data via Censored Stochastic Approximation," Proc. of Asilomar Conf., Pacific Grove, CA, Nov. 2014.
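The AC rule lends itself to a compact LMS variant: update the estimate only when the prediction error clears a threshold. A minimal numpy sketch under illustrative parameter choices (the test below uses the raw error magnitude, whereas the paper's rule normalizes the error; sizes and step size are assumptions):

```python
import numpy as np

def ac_lms(X, y, tau=1.0, mu=0.01):
    """Adaptive-censoring LMS: update theta only for 'informative' data,
    i.e., when the prediction-error magnitude exceeds the threshold tau."""
    D, p = X.shape
    theta = np.zeros(p)
    updates = 0
    for t in range(D):
        err = y[t] - X[t] @ theta
        if abs(err) > tau:            # censoring test: skip uninformative data
            theta += mu * err * X[t]  # standard LMS update on surviving data
            updates += 1
    return theta, updates

rng = np.random.default_rng(0)
D, p = 20_000, 10
X = rng.standard_normal((D, p))
theta_true = rng.standard_normal(p)
y = X @ theta_true + 0.1 * rng.standard_normal(D)
theta_hat, updates = ac_lms(X, y, tau=0.5, mu=0.01)
print(updates / D, np.linalg.norm(theta_hat - theta_true))
```

Raising tau censors more aggressively: fewer updates are performed, at the price of a larger steady-state error, which is how the threshold controls the average data reduction.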

Censoring algorithms and performance

AC least mean-squares (LMS); AC recursive least-squares (RLS) at O(p^2) complexity per update.

Proposition 1 characterizes the performance of AC-LMS and AC-RLS.

D. K. Berberidis and G. B. Giannakis, "Online Censoring for Large-Scale Regressions," IEEE Trans. on SP, July 2015 (submitted); also Proc. of ICASSP, Brisbane, Australia, April 2015.


Censoring vis-a-vis random projections

RPs for linear regressions [Mahoney '11], [Woodruff '14]: data-agnostic reduction; preconditioning costs.

AC for linear regressions: data-driven measurement selection; suitable also for streaming data; minimal memory requirements.

AC interpretations: reveals 'causal' support vectors; censors data with low log-likelihood ratios (LLRs).

Performance comparison

Synthetic data: D = 10,000, p = 300 (50 MC runs); real data: highly non-uniform data; threshold estimated from the full set.

AC-RLS outperforms alternatives at comparable complexity, and is robust to uniform (all "important") rows of X. Q: What about time-varying parameters?

Tracking high-dimensional dynamical data

[Block diagram: data and a transition model enter a regularized regression; predictor-based regularization couples the time-varying state to the observations across time.]

Goal: a low-complexity, reduced-dimension KF with large-scale correlated data, comprising the usual prediction and correction steps.

Sketching for dynamical processes

The weighted-LS correction incurs prohibitive complexity for high-dimensional observations. Related works are either costly at the fusion center [Varshney et al '14], or data-agnostic with model-driven ensemble optimality [Krause-Guestrin '11].

Our economical KF: sketch the informative data at each time step (dimensionality reduction), then run the KF predictor and corrector in the reduced dimension.


RP-based KF

Same predictor as the standard KF, with a sketched corrector based on RP sketching of the observations.

Proposition 2. W.h.p., the sketched correction stays close to the full correction, provided the sketch dimension is large enough.

D. K. Berberidis and G. B. Giannakis, "Data sketching for tracking large-scale dynamical processes," Proc. of Asilomar Conf., Pacific Grove, CA, Nov. 2015.
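A minimal numpy sketch of a KF whose correction operates on a random sketch of a high-dimensional observation. The model, dimensions, and Gaussian sketch below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def kf_sketched_step(x, P, y, A, C, Q, Rn, d, rng):
    """One KF recursion where the correction uses a d-dimensional random
    sketch S y of the m-dimensional observation y (m >> d)."""
    m = C.shape[0]
    # Prediction
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Sketch the observation model: y_s = S y, C_s = S C, R_s = S Rn S^T
    S = rng.standard_normal((d, m)) / np.sqrt(d)
    C_s, y_s = S @ C, S @ y
    R_s = S @ Rn @ S.T
    # Correction in the reduced d-dimensional space
    K = P_pred @ C_s.T @ np.linalg.inv(C_s @ P_pred @ C_s.T + R_s)
    x_new = x_pred + K @ (y_s - C_s @ x_pred)
    P_new = (np.eye(len(x)) - K @ C_s) @ P_pred
    return x_new, P_new

rng = np.random.default_rng(0)
p, m, d = 4, 2000, 50
A = 0.95 * np.eye(p)
C = rng.standard_normal((m, p))
Q, Rn = 0.01 * np.eye(p), np.eye(m)
x_true = rng.standard_normal(p)
x_hat, P = np.zeros(p), np.eye(p)
for _ in range(20):
    x_true = A @ x_true + 0.1 * rng.standard_normal(p)
    y = C @ x_true + rng.standard_normal(m)
    x_hat, P = kf_sketched_step(x_hat, P, y, A, C, Q, Rn, d, rng)
print(np.linalg.norm(x_hat - x_true))
```

The matrix inversion in the correction is d x d rather than m x m, which is where the complexity reduction of the sketched corrector comes from.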

Simulated test

AC-KF outperforms RP-KF and the KF with greedy design of experiments (DOE). AC-KF complexity is also much lower than that of the greedy DOE-KF.

A. Krause and C. Guestrin, "Submodularity and its Applications in Optimized Information Gathering: An Introduction," ACM Trans. on Intelligent Systems and Technology, vol. 2, July 2011.


Roadmap
Context and motivation
Large-scale linear regressions
Large-scale nonlinear function approximation
  Online kernel regression on a budget
  Online kernel classification on a budget
Large-scale data and graph clustering
Leveraging sparsity and low rank for anomalies and tensors
Closing comments

Linear or nonlinear functions for learning?

Regression or classification: given training pairs {(x_n, y_n)}, find a predictor f.

Lift via a nonlinear map phi(x) to a linear model in the lifted space.

Pre-select a kernel (inner product) function k(x, x') = <phi(x), phi(x')>, e.g., a Gaussian kernel.

RKHS basis expansion: f(x) = sum_n alpha_n k(x, x_n).

Kernel-based nonparametric ridge regression: memory requirement O(N^2) for the kernel matrix, and complexity O(N^3).
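The O(N^2) / O(N^3) costs come directly from forming and solving with the kernel matrix. A minimal kernel ridge regression sketch (Gaussian kernel; the regularization and bandwidth values are illustrative):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian (RBF) kernel matrix: k(x, x') = exp(-gamma ||x - x'||^2)."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam=1e-2, gamma=1.0):
    """Solve (K + lam N I) alpha = y: O(N^2) memory, O(N^3) training cost."""
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * N * np.eye(N), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    """f(x) = sum_n alpha_n k(x, x_n), the RKHS basis expansion."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.standard_normal(200)
alpha = kernel_ridge_fit(X, y, lam=1e-3, gamma=10.0)
X_test = np.linspace(-1, 1, 50)[:, None]
err = np.max(np.abs(kernel_ridge_predict(X, alpha, X_test, 10.0) - np.sin(3 * X_test[:, 0])))
print(err)
```

The N x N solve is exactly what the low-rank approximations on the following slides aim to avoid.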


Low-rank lifting function approximation

Low-rank (r) subspace learning [Mardani-Mateos-GG'14], here applied to the lifted data, with a group-sparsity regularizer.

BCD solver: at each iteration, with the previous subspace factor and projection coefficients available:
S1. Find projection coefficients via regularized least-squares.
S2. Find the subspace factor via (in)exact group-shrinkage solutions.

Nystrom approximation arises as a special case.

Low-rank subspace tracking via stochastic approximation (also with a "budget").

F. Sheikholeslami and G. B. Giannakis, "Scalable Kernel-based Learning via Low-rank Approximation of Lifted Data," Proc. of Intl. Conf. on Machine Learning, New York City, submitted Feb. 2016.
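For context, the classical Nystrom approximation mentioned above as a special case can be sketched as follows: sample m landmark points and build rank-m features Z such that Z Z^T approximates the full kernel matrix. Landmark count and bandwidth below are illustrative:

```python
import numpy as np

def rbf(X1, X2, gamma=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * sq)

def nystrom_features(X, m, gamma=1.0, rng=None):
    """Rank-m Nystrom features Z with Z Z^T ~ K, from m sampled landmarks."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(X.shape[0], m, replace=False)
    K_mm = rbf(X[idx], X[idx], gamma)   # small m x m landmark block
    K_nm = rbf(X, X[idx], gamma)        # cross-kernel to all N points
    # Z = K_nm K_mm^{-1/2} via eigendecomposition of the m x m block
    w, V = np.linalg.eigh(K_mm)
    w = np.maximum(w, 1e-12)            # guard against tiny negative eigenvalues
    return K_nm @ V @ np.diag(w**-0.5) @ V.T

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
K = rbf(X, X, gamma=0.5)
Z = nystrom_features(X, m=100, gamma=0.5, rng=1)
rel = np.linalg.norm(K - Z @ Z.T) / np.linalg.norm(K)
print(rel)
```

Downstream learning then runs on the N x m features Z instead of the N x N matrix K, replacing the O(N^3) solve by an O(N m^2) one.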


Online kernel regression and classification

Kernel matrix approximation. Proposition 4: if the lifted data admit a low-rank approximation, then w.p. at least a prescribed level the kernel matrix is approximated to within a quantifiable error, and kernel regression reduces to linear regression on the extracted features.

High-performance online kernel-based feature extraction on a budget (OK-FEB); performance bounds carry over to support vector machines for regression and classification.

Kernel approximation via low-rank features

Adult dataset: infer whether annual income exceeds 50K using features such as education, age, and gender; 80%-20% split for training-testing.

[Plots: kernel-matrix mismatch and run time (sec) versus rank r (r = 5, ..., 50) for BCD-Ex, BCD-PG, Nystrom, Prb-Nystrom, Imp-Nystrom, and SVD; run time for OK-FEB+LSVM vs. K-SVM.]

OK-FEB+LSVM outperforms K-SVM (LibSVM) in both training and testing phases.

C. C. Chang and C. J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. on Intelligent Systems and Technology, vol. 2, pp. 1-27, April 2011.


OK-FEB with linear classification and regression

[Plots: misclassification probability and CPU time (sec) versus iteration (sample number) on the Adult dataset (classification), and average LS-error and CPU time versus iteration on the Slice dataset (regression), comparing Perceptron, OGD, Norma, RBP, Forgetron, Projectron, BOGD, BPAs, and OK-FEB.]

OK-FEB+LSVM outperforms budgeted K-SVM/SVR variants in classification/regression.

F. Sheikholeslami, D. K. Berberidis, and G. B. Giannakis, "Kernel-based Low-rank Feature Extraction on a Budget for Big Data Streams," Proc. of GlobalSIP Conf., Dec. 14-16, 2015; also in arXiv:1601.07947.

Roadmap
Context and motivation
Large-scale linear regressions
Large-scale nonlinear function approximation
Large-scale data and graph clustering
  Random sketching and validation (SkeVa)
  SkeVa-based spectral and subspace clustering
Leveraging sparsity and low rank for anomalies and tensors
Closing comments

Big data clustering

Clustering: given data {x_n}, or their pairwise distances, assign them to K clusters via centroids and assignments.

Hard clustering is NP-hard; soft clustering assigns data probabilistically. K-means: locally optimal, but simple; complexity O(NDKI).

Probabilistic clustering amounts to pdf estimation, e.g., Gaussian mixtures with EM-based estimation; a regularizer can account for unknown K.

Q: What if N and/or D are massive?

A1. Random projections: use a d x D matrix R to form RX, and apply K-means in d-space.

C. Boutsidis, A. Zousias, P. Drineas, and M. W. Mahoney, "Randomized dimensionality reduction for K-means clustering," IEEE Trans. on Information Theory, vol. 61, pp. 1045-1062, Feb. 2015.


Random sketching and validation (SkeVa)

Randomly select "informative" dimensions.

Algorithm, per draw:
  Sketch a subset of dimensions and run K-means on the sketched data
  Re-sketch a second subset of dimensions and augment the centroids
  Validate using a consensus set

Similar approaches are possible for reducing N; sequential and kernel variants are available.

P. A. Traganitis, K. Slavakis, and G. B. Giannakis, "Sketch and Validate for Big Data Clustering," IEEE Journal of Selected Topics in Signal Processing, vol. 9, pp. 678-690, June 2015.
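A simplified SkeVa-style loop (clustering over random dimension sketches, validated on an independently re-sketched set of dimensions) might look as follows. The paper's consensus-set validation is replaced by a plain K-means cost on the re-sketched dimensions, and all sizes are illustrative:

```python
import numpy as np

def kmeans(X, K, iters=30, rng=None):
    """Plain Lloyd's K-means; returns centroids and assignments."""
    rng = np.random.default_rng(rng)
    C = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        a = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(a == k):
                C[k] = X[a == k].mean(0)
    return C, a

def skeva(X, K, d, n_draws=5, rng=None):
    """SkeVa-style clustering over dimensions: per draw, sketch d of the D
    dimensions, cluster there, then validate via the clustering cost on a
    freshly re-sketched set of d dimensions; keep the best-validated draw."""
    rng = np.random.default_rng(rng)
    D = X.shape[1]
    best = (np.inf, None)
    for _ in range(n_draws):
        dims = rng.choice(D, d, replace=False)
        _, a = kmeans(X[:, dims], K, rng=rng.integers(1 << 31))
        val_dims = rng.choice(D, d, replace=False)   # validation sketch
        cost = sum(((X[a == k][:, val_dims] - X[a == k][:, val_dims].mean(0)) ** 2).sum()
                   for k in range(K) if np.any(a == k))
        if cost < best[0]:
            best = (cost, a)
    return best[1]

rng = np.random.default_rng(0)
# Two well-separated clusters in D = 100 dimensions
X = np.vstack([rng.standard_normal((100, 100)),
               rng.standard_normal((100, 100)) + 4.0])
a = skeva(X, K=2, d=10, n_draws=5, rng=2)
print(np.mean(a[:100] == a[0]), np.mean(a[100:] != a[0]))
```

Each draw only pays the K-means cost in d << D dimensions, and the validation step is what distinguishes SkeVa from a single blind random sketch.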


Divergence-based SkeVa

Idea: "informative" draws yield reliable estimates of the multimodal data pdf.

Compare pdf estimates via "distances," e.g., the integrated square-error (ISE).

Per draw: sketch a subset of points; if the resulting pdf estimate is not sufficiently multimodal, discard the draw; otherwise re-sketch points to validate, cluster the retained draw, and associate the remaining data to the resulting clusters.

RP versus SkeVa comparisons

KDDb dataset (subset): D = 2,990,384, N = 10,000, K = 2; RP [Boutsidis et al '15] versus SkeVa.

Performance and SkeVa generalizations

Di-SkeVa is fully parallelizable.

Q: How many draws does SkeVa need?
A: For independent draws, the number can be lower bounded.

Proposition 5. For a given probability of a successful Di-SkeVa draw, quantified by a pdf distance, the number of required draws is lower bounded w.h.p.

The bound can be estimated online; the SkeVa module can also be used for spectral and subspace clustering.

Identification of network communities

Kernel K-means is instrumental for partitioning large graphs (spectral clustering); it relies on the graph Laplacian to capture nodal correlations.

arXiv collaboration network (general relativity): N = 4,158 nodes, 13,422 edges, K = 36 [Leskovec '11].

Run times: spectral clustering, 3.1 sec; SkeVa (n = 500), 0.5 sec; SkeVa (n = 1,000), 0.85 sec. For n << N, kernel-based SkeVa reduces complexity accordingly.

P. A. Traganitis, K. Slavakis, and G. B. Giannakis, "Spectral clustering of large-scale communities via random sketching and validation," Proc. Conf. on Info. Science and Systems, Baltimore, Maryland, March 18-20, 2015.

Modeling Internet traffic anomalies

Anomalies are changes in origin-destination (OD) flows [Lakhina et al '04]: failures, congestion, DoS attacks, intrusions, flooding.

Graph G(N, L) with N nodes, L links, and F flows (F >> L); OD flow z_{f,t}; packet counts per link l and time slot t; anomaly indicators in {0,1}. Matrix model across T time slots: link counts equal routed flows plus sparse anomalies (plus noise).

[Figure: example network with flows f1 and f2 sharing link l.]

M. Mardani, G. Mateos, and G. B. Giannakis, "Recovery of low-rank plus compressed sparse matrices with application to unveiling traffic anomalies," IEEE Transactions on Information Theory, pp. 5186-5205, Aug. 2013.


Low-rank plus sparse matrices

Z (and X := RZ) is low rank, e.g., [Zhang et al '05]; A is sparse across time and flows; estimator (P1) leverages both attributes.

[Plot: anomaly amplitudes |a_{f,t}| versus time index t, recovered via (P1).]

Data: http://math.bu.edu/people/kolaczyk/datasets.html
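The (P1)-type estimator regularizes X toward low rank and A toward sparsity. As a simplified nonconvex stand-in (a GoDec-style hard-thresholded alternation, not the paper's algorithm), the same low-rank-plus-sparse separation can be illustrated in a few lines; the rank, threshold, and sizes are assumptions:

```python
import numpy as np

def lowrank_plus_sparse(Y, r, thresh, iters=20):
    """Alternating separation Y ~ X + A: X <- best rank-r approximation of
    Y - A (truncated SVD); A <- entries of Y - X exceeding thresh."""
    A = np.zeros_like(Y)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Y - A, full_matrices=False)
        X = (U[:, :r] * s[:r]) @ Vt[:r]          # low-rank update
        E = Y - X
        A = np.where(np.abs(E) > thresh, E, 0.0)  # sparse-anomaly update
    return X, A

rng = np.random.default_rng(0)
L0 = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 80))  # low rank
S0 = np.zeros((60, 80))
S0[rng.random((60, 80)) < 0.02] = 10.0                            # sparse anomalies
X, A = lowrank_plus_sparse(L0 + S0, r=2, thresh=3.0)
rel = np.linalg.norm(X - L0) / np.linalg.norm(L0)
print(rel)
```

The low-rank part captures the nominal traffic while the thresholded residual isolates the anomalies, mirroring the Z-versus-A split of the slide.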

Internet2 data: real network data, Dec. 8-28, 2003

[Plots: detection probability versus false-alarm probability for the proposed method against [Lakhina04] and [Zhang05] with ranks 1-3; true versus estimated anomaly volume across flows and time at Pfa = 0.03, Pd = 0.92.]

Improved performance by leveraging sparsity and low rank; a succinct depiction of the network health state across flows and time.

Data: http://www.cs.bu.edu/~crovella/links.html


From low-rank matrices to tensors

Data cube, e.g., sub-sampled MRI frames.

PARAFAC decomposition per slab t [Harshman '70]: the tensor subspace comprises R rank-one matrices, built from factor matrices A = [a_1, ..., a_R], B = [b_1, ..., b_R], and C = [c_1, ..., c_R].

Goal: given streaming slabs, learn the subspace matrices (A, B) recursively, and impute possible misses of Y_t.

J. A. Bazerque, G. Mateos, and G. B. Giannakis, "Rank regularization and Bayesian inference for tensor completion and extrapolation," IEEE Trans. on Signal Processing, vol. 61, no. 22, pp. 5689-5703, Nov. 2013.
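A batch PARAFAC fit via alternating least-squares shows the factor structure the slide describes (the streaming/recursive variant of the paper replaces the batch LS steps with stochastic updates); dimensions and rank below are illustrative:

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product (J*T x R) of B (J x R) and C (T x R)."""
    R = B.shape[1]
    return np.stack([np.kron(B[:, r], C[:, r]) for r in range(R)], axis=1)

def parafac_als(Y, R, iters=50, rng=None):
    """Rank-R PARAFAC of a 3-way tensor Y (I x J x T) via alternating LS."""
    rng = np.random.default_rng(rng)
    I, J, T = Y.shape
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((T, R))
    Y1 = Y.reshape(I, J * T)                     # mode-1 unfolding
    Y2 = np.moveaxis(Y, 1, 0).reshape(J, I * T)  # mode-2 unfolding
    Y3 = np.moveaxis(Y, 2, 0).reshape(T, I * J)  # mode-3 unfolding
    for _ in range(iters):
        # Each factor solves a linear LS problem given the other two
        A = Y1 @ np.linalg.pinv(khatri_rao(B, C)).T
        B = Y2 @ np.linalg.pinv(khatri_rao(A, C)).T
        C = Y3 @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C

rng = np.random.default_rng(0)
I, J, T, R = 10, 8, 12, 3
A0, B0, C0 = (rng.standard_normal((n, R)) for n in (I, J, T))
Y = np.einsum('ir,jr,tr->ijt', A0, B0, C0)   # exact rank-3 tensor
A, B, C = parafac_als(Y, R, iters=100, rng=1)
Yhat = np.einsum('ir,jr,tr->ijt', A, B, C)
rel = np.linalg.norm(Yhat - Y) / np.linalg.norm(Y)
print(rel)
```

Each inner update is a plain linear LS problem, which is what makes the decomposition amenable to the recursive, slab-by-slab processing of the slide's goal.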


Online tensor subspace learning

The image domain has low tensor rank. Stochastic alternating minimization, parallelizable across bases; real-time reconstruction (one FFT per iteration).

M. Mardani, G. Mateos, and G. B. Giannakis, "Subspace learning and imputation for streaming big data matrices and tensors," IEEE Trans. on Signal Processing, vol. 63, pp. 2663-2677, May 2015.


Dynamic cardiac MRI test

In vivo dataset: 256 k-space frames of size 200 x 256.

[Images: ground-truth frame; sampling trajectory; reconstructions with R = 100 at 90% misses and R = 150 at 75% misses.]

Potential for accelerating MRI at high spatio-temporal resolution; a low-rank plus sparse model can also capture motion effects.

M. Mardani and G. B. Giannakis, "Accelerating dynamic MRI via tensor subspace learning," Proc. of ISMRM 23rd Annual Meeting and Exhibition, Toronto, Canada, May 30 - June 5, 2015.


Closing comments

Large-scale learning: regression and tracking of dynamic data; nonlinear nonparametric function approximation; clustering massive, high-dimensional data and graphs.

Other key Big Data tasks: visualization, mining, privacy, and security.

Enabling tools for Big Data acquisition, processing, and storage: fundamental theory and performance analysis; decentralized, robust, and parallel algorithms; scalable computing platforms.

Big Data application domains: sustainable systems; social, health, and bio-systems; life-enriching multimedia; secure cyberspace; business and marketing systems; ...

Thank You!

K. Slavakis, G. B. Giannakis, and G. Mateos, "Modeling and optimization for Big Data analytics," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 18-31, Sep. 2014.

