Matrix norms Local norm regularization Constructive regularization Sub-block regularization Appendix
Constructive regularization of the random matrix norm.
Liza Rebrova
University of California Los Angeles
Structural inference in High Dimensional Models workshop, September 2018
Non-asymptotic random matrix theory framework
A = (A_ij)_{n×m}; the entries A_ij are taken from some distribution.
Usually, we have
• no specific distribution assumption
• no symmetry assumption
• high probability results (hold with probability 1− o(1))
• for large enough dimensions (all large matrices with n, m > N₀)
(Pictures: concentration of measure on the sphere; a convex set in Rⁿ. Right picture is taken from "Estimation in high dimensions" by R. Vershynin.)
Operator (spectral) norm
By definition,

‖A‖ := sup_{‖x‖₂=1} ‖Ax‖₂ = sup_{u,v ∈ S^{n−1}} |⟨Au, v⟩| = s₁(A)

Norm of the inverse:

1/‖A⁻¹‖ = inf_{‖x‖₂=1} ‖Ax‖₂ = s_n(A)
Singular values – real spectrum of the matrix
s(A) = √(eig(AᵀA)),   s₁ ≥ s₂ ≥ … ≥ s_n ≥ 0.
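These identities are easy to sanity-check numerically; a minimal numpy sketch (toy random matrix, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))

# Singular values s_1 >= ... >= s_n >= 0, returned in descending order.
s = np.linalg.svd(A, compute_uv=False)

# s(A) = sqrt(eig(A^T A)); operator norm = s_1; 1/||A^{-1}|| = s_n.
eigs = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(s, np.sqrt(eigs))
assert np.isclose(np.linalg.norm(A, 2), s[0])
assert np.isclose(1 / np.linalg.norm(np.linalg.inv(A), 2), s[-1])
```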
Key idea: spectrum stabilizes as the size of the matrices →∞
(Pictures: eigenvalues of a Wigner matrix; real vs. complex components of the eigenvalues of a Gaussian matrix. Left picture is taken from "Estimation in high dimensions" by R. Vershynin.)
What is optimal norm order?
Let A = (Aij)n×n be a square random matrix with i.i.d. entries.
             Gaussian                      Subgaussian
             for any t ≥ 0                 for any t ≥ C₀
s₁(A)        s₁ ≤ 2√n + t                  s₁ ≤ t√n
             with prob. 1 − 2e^{−t²/2}     with prob. 1 − e^{−ct²n}
             from Gordon's theorem         from Bernstein's inequality
Def.: A_ij are subgaussian if P{|A_ij| > t} ≤ C₁e^{−c₂t²} for any t > 0.

(Picture: tail decay curves; blue – gaussian, red – subgaussian, green – heavy-tailed. Picture is taken from D. Mixon's blog "Short, fat matrices".)
Not an optimal order
Light tails ((sub)gaussian, 4 finite moments): with high probability,

‖A‖ = s_max(A) ∼ √n   and   s_min(A) ∼ 1/√n.

Heavy tails (2 finite moments): with high probability,

‖A‖ = s_max(A) ≫ √n   and   s_min(A) ∼ 1/√n.

Example (‖A‖ ∼ n ≫ √n)
• Litvak-Spector: constructive example of ‖A‖ ∼ O(n^{1−β}) for any β ≥ 0, with probability at least 1/2.

• Bai-Silverstein-Yin: 4 finite moments are needed for ‖A‖ ∼ √n.
• All-ones example:

  sup_{‖x‖=1} ‖Ax‖ ≥ ‖ (1)_{n×n} · (n^{−1/2}, …, n^{−1/2})ᵀ ‖₂ = n
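The all-ones obstruction is easy to verify numerically (a toy sketch): a matrix with no cancellation pushes the norm to order n rather than √n.

```python
import numpy as np

n = 100
A = np.ones((n, n))                  # all entries equal to 1
x = np.full(n, n ** -0.5)            # unit vector (n^{-1/2}, ..., n^{-1/2})

assert np.isclose(np.linalg.norm(x), 1.0)
assert np.isclose(np.linalg.norm(A @ x), n)    # ||Ax||_2 = n
assert np.isclose(np.linalg.norm(A, 2), n)     # so ||A|| = n >> sqrt(n)
```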
Local norm regularization
Questions:
1. Can we regularize the norm by correcting just a small fraction of the entries of A?
2. What in the structure of a heavy-tailed matrix causes the norm to blow up from the "ideal" order O(√n)?

Local regularization: A ↦ Ã, such that

• Ã differs from A in a small εn × εn sub-matrix
• ‖Ã‖ ≲ √n
Theorem (with R. Vershynin, informal statement)

Let A be a large enough random square matrix with i.i.d. entries. Local regularization is possible with high probability ⇐⇒ EA_ij = 0 and EA_ij² is bounded.
Local norm regularization
Theorem (Part 1: local obstructions)

Let A = (A_ij)_{n×n} have i.i.d. entries such that EA_ij = 0, EA_ij² = 1. For any ε ∈ (0, 1/6], with probability ≥ 1 − 11e^{−εn/12} there exists an εn × εn sub-matrix A₀ ⊂ A:

‖A \ A₀‖ ≤ C_ε √n,   where C_ε = C · ln(ε⁻¹)/√ε
(Picture: A \ A₀ = A with all entries in the εn × εn block A₀ zeroed out.)
• log-optimal dependence on the block size ε
• any ε < 1 can be taken, at the cost of larger constants
• non-constructive: does not identify A₀
In every n2^{−k} × n2^{−k} submatrix of B there are at most n2^{−k−1} columns with > C₁r non-zeros.

Lemma (2, pn ≤ 4)

In every n2^{−k} × n2^{−k−1} submatrix of B there are at most n2^{−k−1} columns with > C₂r non-zeros.
1. Bernoulli matrices: after decomposition
Recall:

|A_ij| ∼ Σ_k 2^k · 1{|A_ij| ∈ (2^{k−1}, 2^k]} = Σ_k 2^k B^k

• For A_part1 = Σ_{B^k ∈ B₂ ∪ B₃} 2^k B^k use
Lemma (Norm of sparse matrices)

For any matrix Q and vectors u, v ∈ S^{n−1}, we have

‖Q‖ ≤ max_j ‖col_j(Q)‖₂ · √(max_i #(row_i(Q)))

(here the first factor is ≤ √n and the second is ≤ √(const · #(terms)))
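The lemma is easy to sanity-check numerically; a toy numpy sketch with a sparse 0/1 matrix (the two factors of the bound are computed directly):

```python
import numpy as np

rng = np.random.default_rng(2)
# Sparse random 0/1 matrix Q (toy example).
Q = (rng.random((40, 40)) < 0.1).astype(float)

max_col_norm = np.linalg.norm(Q, axis=0).max()      # max_j ||col_j(Q)||_2
max_row_support = (Q != 0).sum(axis=1).max()        # max_i #(row_i(Q))

# Lemma: ||Q|| <= max_j ||col_j(Q)||_2 * sqrt(max_i #(row_i(Q)))
assert np.linalg.norm(Q, 2) <= max_col_norm * np.sqrt(max_row_support) + 1e-9
```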
• For each B^k ∈ B₁ all rows and columns are bounded by O(np_k) ⟹ we can use the results for Bernoulli matrices
2. Heavy and light indices: Bernoulli
Using the definition ‖B‖ = sup_{u,v ∈ S^{n−1}} |Σ_ij B_ij u_i v_j|.

Light indices := {(i, j) : |u_i v_j| ≤ √p/n} for every u, v.

Split the sum:

|Σ_ij (B_ij − EB_ij) u_i v_j| ≤ |Σ_light (B_ij − EB_ij) u_i v_j| + |Σ_heavy EB_ij u_i v_j| + |Σ_heavy B_ij u_i v_j|
• Light part – bounded summands – Bernstein's concentration
• Expectation part – #(heavy indices) ≤ n/p – Cauchy-Schwarz
• Heavy part – Feige-Ofek theorem (the bound follows from a tail estimate for e(S, T) = the number of non-zero entries in an S × T sub-block)
2. Heavy and light indices: general case
Light indices := {(i, j) : |u_i v_j A_ij| ≤ √(4/n)} for every u, v.

Split the sum:

|Σ_ij A_ij u_i v_j| ≤ |Σ_light A_ij u_i v_j| + Σ_heavy |A_ij| u_i v_j
E|A_ij| ≠ 0, but we do not care: split into Bernoulli levels and use the Feige-Ofek theorem at each level!

Σ_heavy |A_ij| u_i v_j ≤ Σ_ij Σ_k 2^k B^k_ij u_i v_j ≤ Σ_k 2^k √(n p_k) ≤ √n · √(Σ_k 2^{2k} p_k) · √(#levels)

(the last step is Cauchy-Schwarz). From the second moment condition, 1 ≥ EA_ij² ≥ 0.25 Σ_k 2^{2k} p_k.

The number of levels is an extra factor – minimize it.
3. Only average levels matter
• Large entries (≳ c_ε √n) are zeroed out (they produce heavy rows)
• Small entries (≲ √(n/ln n)) are bounded separately by the Bandeira-van Handel theorem
The number of levels is at most

log₂(C c_ε n) − log₂(cn/ln n) ≤ log₂( C c_ε n · ln n / (c₁ n) ) ∼ log log n.
Note: symmetry is needed only to keep zero mean in the various truncations.
Q.E.D.
What if we want to zero out an εn × εn block only?

Need to find the most "dense" part of the matrix. It is enough to find an exceptional subset of εn columns (only) and an exceptional subset of εn rows (only), and take their intersection.
(Pictures, three steps: zeroing out the exceptional εn columns leaves a "green" part with ‖green‖ ≤ √n; zeroing out the exceptional εn rows leaves a "brown" part with ‖brown‖ ≤ √n; adding the two decompositions, everything outside the εn × εn intersection block satisfies ‖dashed‖ ≤ 2√n.)
Algorithm idea
Idea: find εn columns to replace with zeros, such that all rows and columns have bounded L₂-norms, then apply the Main Theorem.
Lemma (with K. Tikhomirov)

Let B be an n × n matrix with 0-1 entries, EB_ij = p. Then for any L ≥ 10, with probability 1 − exp(−n exp(−Lpn)): if we define

W_ij := 1, if #(row_j(B)) ≤ Lnp or B_ij = 0;   W_ij := Lnp / #(row_j(B)), otherwise,

and V_j := Π_{i=1}^n W_ij, and J := {j : V_j < 0.1}, then

|J| ≤ n exp(−Lnp)   and   Σ_{j ∈ Jᶜ} B_ij ≤ 10Lnp for any i ∈ [n].
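A minimal numpy transcription may help parse the statement. This is a sketch under my reading of the slide (in particular, I interpret #(row) as the support of the row containing the entry); `exceptional_columns` is a hypothetical helper, not the authors' implementation.

```python
import numpy as np

def exceptional_columns(B, p, L=10):
    """Sketch of the lemma's weight construction (hypothetical helper;
    #(row) is read as the support of the row containing the entry)."""
    n = B.shape[0]
    row_supp = (B != 0).sum(axis=1)                 # #(row_i(B))
    W = np.ones((n, n))
    for i in np.where(row_supp > L * n * p)[0]:
        W[i, B[i] != 0] = L * n * p / row_supp[i]   # damp heavy rows
    V = W.prod(axis=0)                              # V_j = prod_i W_ij
    return np.where(V < 0.1)[0]                     # exceptional columns J

# Two dense rows supported on the first 10 columns flag exactly those columns.
B = np.zeros((20, 20)); B[0, :10] = 1.0; B[1, :10] = 1.0
J = exceptional_columns(B, p=0.005, L=10)           # here Lnp = 1
```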
Damping: Bernoulli example

Idea: we construct a diagonal matrix of weights that regularizes each row.

(Pictures, three steps on a 5 × 5 Bernoulli matrix multiplied on the right by a diagonal matrix of weights: the 1-st row is heavy, so its non-zero entries are damped with a weight 0 < δ₁ < 1; the 2-nd row is all good and needs no damping; the 3-rd row is damped with the weight δ₁ as well, the weights on shared columns accumulating to δ₁².)
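The effect of damping is easy to see numerically: right-multiplying by a diagonal matrix of weights rescales columns, and since all weights are ≤ 1 the operator norm can only decrease. A toy numpy sketch (δ₁ = 0.5 and the damped columns are arbitrary illustrative choices):

```python
import numpy as np

# A 5 x 5 Bernoulli matrix, as in the slide's example.
B = np.array([[0, 1, 0, 0, 1],
              [0, 0, 0, 0, 0],
              [0, 1, 1, 0, 0],
              [1, 1, 0, 0, 1],
              [1, 0, 0, 0, 0]], dtype=float)

delta1 = 0.5                                # damping weight 0 < delta_1 < 1
D = np.diag([1, delta1, 1, 1, delta1])      # damp the columns hit by a heavy row

damped = B @ D                              # right-multiplication rescales columns
# Since ||D|| <= 1, damping never increases the operator norm.
assert np.linalg.norm(damped, 2) <= np.linalg.norm(B, 2)
```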
Lemma (with K. Tikhomirov)

Let B be an n × n matrix with 0-1 entries, EB_ij ≤ p. Then for any L ≥ 10, with probability 1 − exp(−n exp(−Lpn)): if we define W_ij as before, and V_j := Π_{i=1}^n W_ij, and J := {j : V_j < 0.1}, then

|J| ≤ n exp(−Lnp)   and   Σ_{j ∈ Jᶜ} B_ij ≤ 10Lnp for any i ∈ [n].
How do we use the Lemma? Split

A_ij² ≤ Σ_k q_k · 1{A_ij² ∈ (q_{k−1}, q_k]},   I_k := (q_{k−1}, q_k].

Then Σ_k q_{k−1} P{A_ij² ∈ I_k} ≤ EA_ij² = 1. Apply the Lemma to B^k with entries B^k_ij = A_ij² · 1{A_ij ∈ I_k} ∼ q_k · 1{A_ij ∈ I_k} to get

‖row_j(A_{Jᶜ})‖₂² ≲ Σ_k q_k n P{A_ij² ∈ I_k} ≤ 2n.
Quantiles and regularization process
To pass from Bernoulli matrices to the general case, we now need the p_k to be in control: not too small (for the probability estimate) and not too large (for the cardinality estimate).
Definition (2⁻ᵏ quantiles)

Denote by q_k the 2⁻ᵏ-quantiles of |A_ij|, i.e. the points such that

P{|A_ij| > q_k} = 2⁻ᵏ.

Let A^k := A · 1{A_ij ∈ (q_{k−1}, q_k]}.
Note: the quantiles q_k can be approximated by the order statistics of the A_ij (they are a free set of samples from the distribution!). We use Ã^k ∼ A^k. So, the algorithm is distribution-oblivious.
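The order-statistics approximation can be sketched in numpy (toy heavy-tailed distribution, not the talk's exact procedure): the n² observed entries themselves estimate the quantiles q_k, so no knowledge of the distribution is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_cauchy((n, n))      # heavy-tailed entries (toy choice)

# Estimate the 2^-k quantile q_k of |A_ij| from the order statistics of
# the entries: they are a free i.i.d. sample of the unknown distribution.
k = 3
q_k_hat = np.quantile(np.abs(A), 1.0 - 2.0 ** (-k))

# Sanity check: about a 2^-k fraction of the entries exceeds the estimate.
frac = np.mean(np.abs(A) > q_k_hat)
assert abs(frac - 2.0 ** (-k)) < 0.005
```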
Submatrix norm regularization algorithm

1. delete too-large entries
2. small entries are fine without regularization
3. for each average level k construct weights W^k_ij and V^k_j for A^k to find an exceptional subset of columns J_k: |∪_k J_k| ≤ εn/2 with high probability
4. J = J' ∪ (∪_k J_k), where J' is the subset of εn/2 columns with the largest norms
5. repeat the process for Aᵀ to find an exceptional row subset I
6. the intersection of I and J gives an εn × εn exceptional sub-matrix A₀

⟹ ‖Ã‖ = ‖A \ A₀‖ ∼ √n ln ln n by the Main Theorem.
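The steps above can be sketched as follows. This is a toy simplification (hypothetical helper names; picking the εn/2 columns/rows of largest norm stands in for the full weight/level construction of steps 1-4), not the paper's algorithm verbatim:

```python
import numpy as np

def regularize(A, eps):
    """Toy sketch of the sub-block regularization scheme: find exceptional
    columns of A and of A^T, and zero out their intersection block A0."""
    n = A.shape[0]

    def exceptional(M):
        norms = np.linalg.norm(M, axis=0)           # column l2-norms
        k = int(eps * n / 2)
        return set(np.argsort(norms)[-k:]) if k else set()

    J = exceptional(A)         # exceptional columns of A
    I = exceptional(A.T)       # exceptional rows = exceptional columns of A^T
    A_reg = A.copy()
    A_reg[np.ix_(sorted(I), sorted(J))] = 0.0       # zero out the block A0
    return A_reg, I, J

# One huge entry sits in the intersection of a heavy row and a heavy column.
A = np.eye(5); A[0, 0] = 1000.0
A_reg, I, J = regularize(A, eps=0.4)
```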
THANKS FOR YOUR ATTENTION!
Applications to community detection
A – adjacency matrix of an inhomogeneous Erdős-Rényi random graph G(n, (p_ij)): edges are still independent, but can have different probabilities p_ij. This allows us to model networks with structure = communities (clusters).

Example: stochastic block model with two communities G(n, p, s). Edges within each community have probability p; edges across communities have probability s < p. Then

EA = ( p p s s
       p p s s
       s s p p
       s s p p )
The spectral method for community detection is based on the idea:

1. the eigenstructure of EA reveals the communities,
2. eigenstructure[A] ∼ eigenstructure[EA].

So, the eigenstructure of A (observed) reveals the communities too.

Condition 2 is satisfied only for dense graphs. Idea for sparse graphs: regularize the graph locally to make ‖A − EA‖ small.
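A minimal numpy illustration of the spectral idea on a dense two-community SBM (toy parameters; real community detection pipelines use more careful spectral steps):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 200, 0.5, 0.05           # within- / across-community probabilities
labels = np.repeat([0, 1], n // 2)

# Sample a symmetric adjacency matrix with EA given by the block structure.
P = np.where(labels[:, None] == labels[None, :], p, s)
U = np.triu((rng.random((n, n)) < P).astype(float), 1)
A = U + U.T

# The eigenvector of the 2nd largest eigenvalue of A approximates the
# community sign pattern (condition 2 holds here: the graph is dense).
vals, vecs = np.linalg.eigh(A)
guess = (vecs[:, -2] > 0).astype(int)
accuracy = max(np.mean(guess == labels), np.mean(guess != labels))
```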
Obstructions for random graphs

Sparse graphs: maximal expected degree d := max_ij p_ij · n ≲ log n.

[Feige-Ofek] The obstructions to ‖A − EA‖ being small are a few high-degree vertices of the graph.

For the regularization it is enough to

• Feige-Ofek: delete all high-degree vertices (degree > 10d)
• Le-Levina-Vershynin: reweight or delete some of the edges adjacent to high-degree vertices (to make all the degrees bounded by 10d)
• R.: enough to delete a small ne⁻ᵈ × ne⁻ᵈ subgraph; got a description of the "bad" subgraph (we can direct its edges so that every vertex has a finite number of outgoing edges)
Theorem (Part 2: global obstructions)

Let A be an n × n matrix with i.i.d. entries, such that

• EA_ij² ≥ M,
• |A_ij| ≤ √n almost surely.

If M = M(C, ε) is a large enough constant, then every εn × εn sub-matrix A₀ has large norm:

‖A₀‖ ≥ C√n,

with probability at least 1 − exp(−εn).

So, if we were to cut some part for regularization, we would need to cut almost everything! No εn × εn sub-matrix can survive.