Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal

Dec 26, 2015

Transcript
Page 1: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Statistical Inference for Large Directed Graphs with

Communities of Interest

Deepak Agarwal

Page 2: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Outline

• Communities of Interest: overview

• Why a probabilistic model?

• Bayesian Stochastic Blockmodels

• Example

• Ongoing work

Page 3: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Communities of interest

• Goal: understand the calling behavior of every TN on the ATT LD network, a massive graph

• Corinna, Daryl and Chris invented COIs to scale computation using Hancock (Anne Rogers and Kathleen Fisher)

• Definition: the COI of TN X is a subgraph centered around X
  – Top k called by X + other
  – Top k calling X + other
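
The COI definition above can be sketched in a few lines of Python. This is only an illustration: the function name `coi_signature`, the `(caller, callee, weight)` record layout and the choice to rank by summed weight are assumptions, since the slides do not say how "top k" is measured.

```python
from collections import defaultdict


def coi_signature(calls, x, k):
    """Top-k outbound and inbound neighbours of x; the rest is pooled into "other".

    `calls` is an iterable of (caller, callee, weight) records. Ranking by
    summed weight is an assumption made for this sketch.
    """
    outbound = defaultdict(float)
    inbound = defaultdict(float)
    for caller, callee, weight in calls:
        if caller == callee:
            continue  # ignore self-calls
        if caller == x:
            outbound[callee] += weight
        elif callee == x:
            inbound[caller] += weight

    def top_k_plus_other(totals):
        # Sort by descending weight, breaking ties by name for determinism.
        ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
        return dict(ranked[:k]), sum(w for _, w in ranked[k:])

    out_top, out_other = top_k_plus_other(outbound)
    in_top, in_other = top_k_plus_other(inbound)
    return {
        "outbound": out_top,
        "other_outbound": out_other,
        "inbound": in_top,
        "other_inbound": in_other,
    }


# Hypothetical call records, for illustration only.
calls = [("X", "A", 5.0), ("X", "B", 3.0), ("X", "C", 1.0),
         ("D", "X", 2.0), ("E", "X", 4.0)]
signature = coi_signature(calls, "X", k=2)
```

With k=2, the call from X to C falls outside the top two outbound neighbours and is pooled into the outbound "other" bin.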

Page 4: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

COI signature

[Figure: COI signature diagram; labels: X, "Other outbound", "Other inbound"]

Page 5: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

• The entire graph is the union of COIs

• Extend a COI by recursively growing the spider
  – Captures calling behavior more accurately

• Definition for this work:
  – Grow the spider to depth 3. Retain only those depth 3 edges that are between depth 2 nodes.

Page 6: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Extended COI

[Figure: extended COI diagram; node labels: me, other, other, X, x]

Page 7: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Enhancing a COI!!

• Missed calls:
  – Local calls where the TNs are not ATT local
  – Outbound OCC calls
  – Calls to/from the bin “other”

• Big outbound and inbound TNs
  – Dominate the COI and add a lot of clutter.
  – Need to down-weight their calls.

• Other issues
  – Want to quantify, for every TN, things like the tendency to call, the tendency of being called and the tendency of returning calls.

Page 8: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Our approach so far

• COI -> social network

• Want a statistical model that estimates missing edges, adds desired ones and removes (or down-weights) undesired ones.

Page 9: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

COI built from the top-probability edges of a statistical model (figure label: “me”).

The model adds new edges (brown arrows).

It also removes undesired ones.

Page 10: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Getting a sense of data

Some descriptive statistics based on a random sample of 500 residential COIs.

Page 11: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

density = 100*ne/(g(g-1))

ne = number of edges

g = number of nodes
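
As a quick illustration of the density formula, here is a minimal Python sketch. The function name `coi_density` is made up for this example; the inputs are the node and edge counts quoted for the example COI later in the talk (117 nodes, 172 edges).

```python
def coi_density(num_edges, num_nodes):
    """Density as a percentage: 100 * ne / (g * (g - 1)).

    g * (g - 1) is the number of possible directed edges without self-loops.
    """
    if num_nodes < 2:
        raise ValueError("density needs at least two nodes")
    return 100.0 * num_edges / (num_nodes * (num_nodes - 1))


density = coi_density(num_edges=172, num_nodes=117)
print(f"density = {density:.2f}%")
```

For that COI the density comes out at roughly 1.27 percent, which shows how sparse a typical COI is.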

Page 12: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.
Page 13: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Under random: average conditional on out-degrees

Page 14: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Under random: conditional on out-degrees

Page 15: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Under random: conditional on in-degrees

Page 16: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.
Page 17: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Distribution of “Other”

Page 18: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Representing the Data

• Collection of all edges with activity

• Matrix with no diagonal entries

• Collection of several 2x2 contingency tables

Page 19: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

COI: g x g matrix without diagonal entries

Page 20: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

COI: collection of 2x2 tables

• The data matrix is a collection of g(g-1)/2 2x2 tables (called dyads).

                  j->i present   j->i absent   Row total
i->j present      mij            aij           pij
i->j absent       aji            nij           1-pij
Column total      pji            1-pji         1
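
To make the dyad view concrete, the sketch below counts how many of the g(g-1)/2 dyads of an adjacency matrix are mutual, asymmetric or null. The function name `dyad_census` and the small matrix are illustrative assumptions, not part of the original analysis.

```python
import numpy as np


def dyad_census(w):
    """Count mutual, asymmetric and null dyads of a directed graph.

    `w` is a g x g array; any non-zero off-diagonal entry w[i, j] is read
    as an edge i -> j. The diagonal is ignored.
    """
    w = (np.asarray(w) != 0).astype(int)
    g = w.shape[0]
    i, j = np.triu_indices(g, k=1)          # each unordered pair once
    mutual = int(np.sum(w[i, j] * w[j, i]))  # both directions present
    asymmetric = int(np.sum(w[i, j] != w[j, i]))  # exactly one direction
    null = len(i) - mutual - asymmetric      # neither direction
    return mutual, asymmetric, null


# Hypothetical 3-node graph: 0 <-> 1, 1 -> 2, nothing between 0 and 2.
example = [[0, 1, 0],
           [1, 0, 1],
           [0, 0, 0]]
mutual, asymmetric, null = dyad_census(example)
```

The three counts always add up to g(g-1)/2, the number of 2x2 tables mentioned above.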

Page 21: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

More probabilities than edges.

Need to express them in terms of fewer parameters which could be learned from data.

Page 22: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

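$$\text{likelihood} \propto \exp\Big(\rho M + \theta\, w_{++} + \sum_i \alpha_i w_{i+} + \sum_j \beta_j w_{+j} + \lambda_s \sum_i s_i w_{i+} + \lambda_r \sum_j r_j w_{+j} + \gamma \sum_{i \ne j} z_{ij} w_{ij}\Big)$$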

All Greek letters to be estimated from data

Computation: 2 minutes for a typical COI on fry

Likelihood, gradient and Hessian are computed in C; the optimizer runs in R.

The optimizer becomes unstable because so many nodes have zero degree.

Remedy: regularization, known as “shrinkage estimation” in statistics.

This incurs bias for small-degree nodes but reduces variance.
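
A rough Python sketch of a penalized log-likelihood of this kind follows. It is not the C/R implementation mentioned above. It assumes a dyad-independent model in the Holland-Leinhardt style with only θ, ρ, αi and βj (the covariate terms are left out for brevity), and it uses a quadratic penalty on α and β as one possible form of shrinkage; the slides do not say which penalty was used. All names are hypothetical.

```python
import numpy as np


def penalized_log_likelihood(w, theta, rho, alpha, beta, penalty=0.0):
    """Log-likelihood of a dyad-independent model minus a ridge penalty.

    Assumptions of this sketch: covariate effects are omitted, and shrinkage
    is a quadratic penalty on the node parameters alpha and beta.
    """
    w = (np.asarray(w) != 0).astype(float)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    i, j = np.triu_indices(w.shape[0], k=1)   # each dyad once

    eta_ij = theta + alpha[i] + beta[j]       # contribution of edge i -> j
    eta_ji = theta + alpha[j] + beta[i]       # contribution of edge j -> i

    # Unnormalized log-weights of the four dyad states:
    # null, i -> j only, j -> i only, mutual.
    states = np.stack([np.zeros_like(eta_ij), eta_ij, eta_ji,
                       eta_ij + eta_ji + rho])
    log_norm = np.logaddexp.reduce(states, axis=0)

    observed = (w[i, j] * eta_ij + w[j, i] * eta_ji
                + rho * w[i, j] * w[j, i])
    log_lik = np.sum(observed - log_norm)

    ridge = penalty * (np.sum(alpha ** 2) + np.sum(beta ** 2))
    return float(log_lik - ridge)


# Hypothetical 3-node example with all parameters at zero.
w = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]])
value = penalized_log_likelihood(w, 0.0, 0.0, np.zeros(3), np.zeros(3))
```

With every parameter at zero each dyad state has probability 1/4, so the value equals 3·log(1/4) for this three-node graph. A positive `penalty` pulls the αi and βj of low-degree nodes toward zero, which is the bias-for-variance trade described above.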

Page 23: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Meaning of parameters

• Node i:
  – αi: expansiveness (tendency to call)
  – βi: attractiveness (tendency of being called)

• Global parameters:
  – θ: density of COI (decreases with increasing sparseness)
  – ρ: reciprocity of COI (tendency to return calls)
  – λs: “caller”-specific effect
  – λr: “callee”-specific effect
  – γ: “call”-specific effect

Page 24: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Differential reciprocity

• Different reciprocity for each node:
  – Add another parameter ηi to node i
  – Replace ρM by ρM + Σi ηi Mi in the likelihood
  – Called the “differential reciprocity” model
  – Computationally challenging, but implemented.
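
The replacement term ρM + Σi ηi Mi can be computed directly from the adjacency matrix. The sketch below assumes that M is the number of mutual dyads in the COI and Mi the number of mutual dyads involving node i; the slides do not spell out these definitions, and the function name is hypothetical.

```python
import numpy as np


def reciprocity_term(w, rho, eta):
    """Return rho * M + sum_i eta_i * M_i.

    Assumed definitions: M is the total number of mutual dyads and M_i the
    number of mutual dyads that node i takes part in.
    """
    w = (np.asarray(w) != 0).astype(float)
    mutual = w * w.T                 # 1 where both i -> j and j -> i exist
    np.fill_diagonal(mutual, 0.0)    # no self-dyads
    m_total = mutual.sum() / 2.0     # each mutual dyad is counted twice
    m_per_node = mutual.sum(axis=1)
    return float(rho * m_total + np.dot(np.asarray(eta, dtype=float), m_per_node))


# Hypothetical graph: node 0 has mutual ties with nodes 1 and 2.
w = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
term = reciprocity_term(w, rho=2.0, eta=[0.5, 0.0, 1.0])
```

Setting every ηi to zero recovers the single-ρ model.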

Page 25: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Missing edges?

• Can estimate all parameters as long as we have some observed edges in the data matrix
  – for each row (to estimate expansiveness)
  – for each column (to estimate attractiveness)

• Missing local calls -> o.k.

• OCC -> problem, entire row missing.
  – Impute data m times using reasonable assumptions (typically m=3 is o.k.) and combine results. Working on it.
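
One standard way to "combine results" from m imputed data sets is Rubin's rules. The slides do not say which combination rule is intended, so the sketch below is an assumption, and the numbers in the example are invented.

```python
import numpy as np


def combine_imputations(estimates, variances):
    """Pool one parameter across m imputed data sets using Rubin's rules.

    estimates[i] and variances[i] are the point estimate and its estimated
    variance obtained from the i-th imputed data set.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()
    within = variances.mean()                          # average sampling variance
    between = estimates.var(ddof=1) if m > 1 else 0.0  # spread across imputations
    total = within + (1.0 + 1.0 / m) * between
    return float(pooled), float(total)


# Hypothetical estimates of one parameter from m = 3 imputations.
pooled, total_variance = combine_imputations([2.0, 3.0, 4.0], [0.1, 0.1, 0.1])
```

The between-imputation term is what carries the extra uncertainty due to the missing OCC rows.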

Page 26: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Incorporating edge weights

• Edge weights binned into k bins using a random sample of 500 COI’s. Weights in ith bin assigned a score i.

tij is unknown; the w’s are the weights on dyad (i,j). tij is imputed using the hypergeometric distribution.

                                          Row total
                  tij                     wij
                                          k - wij
Column total      wji        k - wji      k
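
The binning and imputation steps on this slide might look roughly like the following. Several things here are assumptions: the bins are taken to be equal-frequency (quantile) bins of the sampled weights, and tij is drawn from a hypergeometric distribution with margins wij, wji and total k, which is how the 2x2 table above is read in this sketch. Function names are hypothetical.

```python
import numpy as np


def cut_points_from_sample(sample_weights, k):
    """k - 1 interior quantiles of the sampled edge weights (assumed binning)."""
    probs = np.linspace(0.0, 1.0, k + 1)[1:-1]
    return np.quantile(np.asarray(sample_weights, dtype=float), probs)


def weight_scores(weights, cut_points):
    """Score i (1..k) for a weight that falls in the i-th bin."""
    return np.searchsorted(cut_points, weights, side="left") + 1


def impute_t(w_ij, w_ji, k, rng):
    """Draw t_ij for a 2x2 table with row total w_ij, column total w_ji, total k."""
    return int(rng.hypergeometric(w_ji, k - w_ji, w_ij))


# Hypothetical sample of edge weights 1..100, binned into k = 4 bins.
cuts = cut_points_from_sample(np.arange(1, 101), k=4)
scores = weight_scores([1, 30, 60, 100], cuts)

rng = np.random.default_rng(0)
t_ij = impute_t(w_ij=2, w_ji=3, k=4, rng=rng)
```

Any draw of tij stays between max(0, wij + wji - k) and min(wij, wji), as the fixed margins require.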

Page 27: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Example

• COI with 117 nodes, 172 edges.

• 14 missing edges: local calls from 14 non-ATT local customers to the seed node (local list provided by Gus).

• One edge attribute: number of common “buddies” between TN i and TN j

• Tried Bizocity and “localness to seed” for caller and callee effects; eventually settled on one caller effect, viz. localness to seed, and no callee effect.

Page 28: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Parameter estimates

• θ = -6.28; ρ = 2.76 (on the higher side)

• λs = .29 (TNs local to seed have a higher tendency to call)

• γ = .41 (common acquaintances between two TNs increase their tendency to call each other)
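
If the parameters act on a log-odds scale, as they would in a log-linear model of the kind sketched earlier in the talk, exponentiating the estimates gives multiplicative effects on the odds. That reading is an assumption; the slides only give the raw estimates.

```python
import math

# Estimates quoted on the slide.
estimates = {"theta": -6.28, "rho": 2.76, "lambda_s": 0.29, "gamma": 0.41}

# exp(estimate) is the multiplicative effect on the odds, assuming a
# log-odds parameterization.
odds_multipliers = {name: math.exp(value) for name, value in estimates.items()}

for name, multiplier in odds_multipliers.items():
    print(f"{name}: exp({estimates[name]}) = {multiplier:.3f}")
```

Under that assumption, each additional common “buddy” multiplies the odds of a call by exp(.41), and a very negative θ corresponds to a very sparse COI.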

Page 29: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.
Page 30: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Pruning the big (red) nodes

• Down-weight expansiveness/attractiveness based on the proportion of volume going to “other”; higher values get down-weighted more, by adding an “offset”
  – Renormalize the new probability matrix to have the same mass as the original one.

• Offset function used:

f(other) = 0, if other ≤ a
f(other) = [expression of the form log(1 + tan(.5 …) − tan(.5 …)) involving other, a and p], if other > a
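
The renormalization step can be illustrated independently of the exact offset function. The sketch below takes an already down-weighted probability matrix and rescales it so that its total mass matches the original. The function name and the example matrices are made up.

```python
import numpy as np


def renormalize_to_original_mass(original, pruned):
    """Rescale `pruned` so that its entries sum to the same total as `original`."""
    original = np.asarray(original, dtype=float)
    pruned = np.asarray(pruned, dtype=float)
    pruned_mass = pruned.sum()
    if pruned_mass == 0:
        raise ValueError("pruned matrix has no mass to rescale")
    return pruned * (original.sum() / pruned_mass)


# Hypothetical 2-node example: the edge out of node 0 was down-weighted by half.
original = np.array([[0.0, 0.4],
                     [0.2, 0.0]])
pruned = np.array([[0.0, 0.2],
                   [0.2, 0.0]])
rescaled = renormalize_to_original_mass(original, pruned)
```

A plain rescaling like this does not by itself keep every entry below 1, so a real implementation would presumably need to check for that.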

Page 31: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.
Page 32: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Matrix obtained by taking the union of the top 50 data edges, the top 50 edges from the original model and the top 50 edges from the pruned model.
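
Taking such a union is straightforward once each source is a g x g matrix of edge weights or probabilities. The helper names below are hypothetical, and the tiny matrices with k=1 merely stand in for the three 50-edge lists.

```python
import numpy as np


def top_edges(matrix, k):
    """Set of the k off-diagonal (i, j) pairs with the largest positive values."""
    matrix = np.asarray(matrix, dtype=float)
    g = matrix.shape[0]
    candidates = [(matrix[i, j], i, j)
                  for i in range(g) for j in range(g)
                  if i != j and matrix[i, j] > 0]
    # Largest value first; ties broken by index for determinism.
    candidates.sort(key=lambda t: (-t[0], t[1], t[2]))
    return {(i, j) for _, i, j in candidates[:k]}


def union_of_top_edges(matrices, k=50):
    """Union of the top-k edge sets of several matrices."""
    edges = set()
    for matrix in matrices:
        edges |= top_edges(matrix, k)
    return edges


# Hypothetical stand-ins for the data, original-model and pruned-model matrices.
data = [[0, 3, 1], [0, 0, 0], [2, 0, 0]]
model = [[0, 0.1, 0.7], [0.2, 0, 0], [0, 0, 0]]
pruned = [[0, 0.1, 0.2], [0.6, 0, 0], [0, 0, 0]]
edges = union_of_top_edges([data, model, pruned], k=1)
```

The union can hold anywhere between k and 3k edges, depending on how much the three sources agree.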

Page 33: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.
Page 34: Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal.

Where to from here?

• Estimate missing OCC calls: multiple imputation.

• Scale the algorithm to get parameter estimates for every TN, maybe on a weekly basis, to enrich the customer signature.

• Can compute the Hellinger distance between two COIs in closed form. Could be useful in supervised learning tasks like tracking repetitive debtors.
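
One way such a closed form can arise: if dyads are independent under the model, the Bhattacharyya coefficient of two COIs factorizes over dyads, and the Hellinger distance follows from it. The sketch below assumes exactly that, and also that both COIs are defined on the same set of dyads. The slides do not give the derivation, so this is an illustration rather than the speaker's formula.

```python
import numpy as np


def hellinger_distance(p, q):
    """Hellinger distance between two dyad-independent graph distributions.

    `p` and `q` have shape (number of dyads, 4); each row holds the
    probabilities of the four dyad states (null, i -> j, j -> i, mutual)
    under one model. Independence across dyads is assumed.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    per_dyad = np.sqrt(p * q).sum(axis=1)   # Bhattacharyya coefficient per dyad
    coefficient = np.prod(per_dyad)         # factorizes under independence
    return float(np.sqrt(max(0.0, 1.0 - coefficient)))


# Hypothetical dyad probabilities for two COIs on the same three dyads.
p = np.full((3, 4), 0.25)
q = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.25, 0.25, 0.25, 0.25],
              [0.4, 0.3, 0.2, 0.1]])
distance = hellinger_distance(p, q)
```

The distance is 0 for identical distributions and 1 when some dyad has non-overlapping support. For large COIs the product should be accumulated in log space to avoid underflow.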