Ph.D. Thesis Proposal
Data Caching in Ad Hoc and Sensor Networks
Bin Tang
Computer Science Department, Stony Brook University
Dec 21, 2015
1
Ph.D. Thesis Proposal
Data Caching in Ad Hoc and Sensor Networks
Bin Tang
Computer Science Department, Stony Brook University
2
Summary of My Work: Data Caching
Update cost constraint: optimal algorithm for trees; approximation algorithm for general graphs
Memory constraint with multiple data items: approximation algorithm for general graphs
Number constraint with read/write/storage costs: optimal algorithm for trees
Localized distributed implementations; comparison with existing work
3
Motivation
Ad hoc and sensor networks are resource-constrained: limited bandwidth, battery energy, and memory.
Caching can save access (communication) cost, and thus bandwidth and energy, subject to update cost, memory, or number constraints.
4
Rooted in…
Facility location problem: set up facilities in a network to minimize total access cost and setting up cost
K-median problem: set up k facilities to minimize total access cost
5
1. Cache Placement in Sensor Networks Under Update Cost Constraint
6
Problem Statement
Sensor network model:
A data item is stored at a server node and updated at a certain frequency.
Other nodes access the data item at certain frequencies.
Problem statement: select nodes to cache the data item.
Goal: minimize the total access cost.
Constraint: total update cost.
7
Why update cost constraint?
Nodes close to the server bear most of the update cost.
8
Problem Formulation
Given:
Network graph G(V,E); a data item stored at a server node; the update frequency; the access frequency of each other node; an update cost constraint Δ.
Goal: select cache nodes to minimize the total access cost, such that the total update cost is less than Δ.
9
Total Access/Update Cost
Total access cost = ∑_{i ∈ V} (hop distance between i and its nearest cache) × (access frequency of i)
Total update cost = cost of the optimal Steiner tree over the server and all cache nodes
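The total access cost above can be computed directly; the sketch below is a minimal illustration in which the node ids, `hop_dist` table, and `access_freq` table are hypothetical, not from the slides.

```python
def total_access_cost(nodes, caches, hop_dist, access_freq):
    """Sum over all nodes i of (hops from i to its nearest cache) x (access frequency of i)."""
    cost = 0
    for i in nodes:
        nearest = min(hop_dist[i][c] for c in caches)  # hop distance to the closest cache
        cost += nearest * access_freq[i]
    return cost
```

For example, on a 3-node path with the cache at node 0, each node contributes its hop distance to node 0 weighted by its access frequency.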
10
Algorithm Design Outline
Tree networks: optimal dynamic programming algorithm.
General networks:
Multiple-unicast update model: approximation algorithm.
Steiner-tree update model: heuristic and distributed algorithms.
11
Tree Networks
12
Subtree Notation
Server: r.
Consider a subtree Tv.
Assume all nodes on the path (v,x) along its leftmost branch are caches.
Let C_v be the optimal access cost in Tv using additional update cost δ.
Next: a recursive equation for C_v.
[Figure: subtree Tv, with cache path (v,x) on its leftmost branch, inside the tree Tr rooted at server r]
13
Dynamic Programming Algorithm for Tv under update cost constraint δ
Let u be the leftmost deepest node in the optimal set of caches in Tv.
All nodes on path(v,u) can be made caches (the update cost does not increase).
For a fixed u, C_v = constant + optimal access cost in Rv,u under constraint (δ − δ_u),
where δ_u is the cost of updating u (using path(v,x)).
Decomposition: Tv = Lv,u + Tu + Rv,u
14
DP Recursive Equation for Tv
C_v = min_{u ∈ Tv} [ (access cost in Lv,u using path(v,x) or path(v,u))
+ (access cost in Tu using u)
+ (optimal cost in Rv,u under constraint δ − δ_u) ]
where δ_u is the cost of updating u (using path(v,x)). Note that Rv,u has a path (v, parent(u)) of caches on its leftmost branch.
15
Time Complexity
Time complexity: O(n^4 + n^3·Δ)
Analysis:
Precomputation takes O(n^4):
Lv,u with cache path (v,x): O(n^4), over all v, u, x
Tu: O(n^2), over all u
Evaluating the recursive equation takes O(n^3·Δ):
n^2·Δ entries: one for each pair (v,x) and each value of δ up to Δ
Each entry takes O(n): n possible choices of u
16
General Graph Networks
Two update cost models:
Multiple-unicast
Optimal Steiner tree
17
Multiple-Unicast Update Model
Update cost: sum of the shortest-path lengths from the server to each cache node.
Benefit of node A: decrease in total access cost due to selecting A as a cache.
Greedy metric: benefit per unit update cost.
18
Greedy Algorithm
Iteratively select the node with the highest benefit per unit update cost, until the update cost budget is exhausted.
Theorem: Greedy solution’s benefit is at least 63% of the optimal benefit.
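A minimal sketch of this greedy loop follows; the `benefit` callback, the `update_cost` table, and the node ids are illustrative assumptions (in the actual algorithm the benefit is recomputed against the cache set chosen so far).

```python
def greedy_caches(candidates, update_cost, budget, benefit):
    """Repeatedly cache the node with the highest benefit per unit update
    cost, while some candidate still fits within the update cost budget."""
    caches, spent = [], 0
    remaining = set(candidates)
    while True:
        affordable = [v for v in remaining if spent + update_cost[v] <= budget]
        if not affordable:
            return caches
        # benefit may depend on the caches selected so far
        best = max(affordable, key=lambda v: benefit(caches, v) / update_cost[v])
        caches.append(best)
        spent += update_cost[best]
        remaining.remove(best)
```

Note the ratio test: a node with a large absolute benefit may lose to a cheaper node whose benefit per unit update cost is higher.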
19
Steiner-Tree Update Cost Model
Steiner-tree update cost: cost of a 2-approximate Steiner tree over the cache nodes.
Incremental Steiner update cost of node A: increase in the Steiner-tree update cost due to A becoming a cache.
Greedy-Steiner Algorithm: iteratively select the node with the highest benefit per unit incremental update cost.
20
Distributed Greedy-Steiner Algorithm
Each non-cache node estimates its benefit per unit update cost.
If its estimate is the maximum among all its non-cache neighbors, the node decides to cache.
Algorithm: in each round, each node decides whether to cache based on the above.
The server gathers the new cache-node information and computes the total update cost.
The remaining update cost budget is broadcast to the network, and a new round begins.
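The per-round decision rule can be sketched as below; the node ids, `neighbors` map, and locally estimated ratios are hypothetical stand-ins for what each node would compute from its own traffic.

```python
def distributed_round(non_caches, neighbors, est_ratio):
    """One round of the distributed Greedy-Steiner rule: a non-cache node
    decides to cache if its estimated benefit per unit update cost is
    maximal among itself and its non-cache neighbors."""
    decided = []
    for v in non_caches:
        rivals = [u for u in neighbors[v] if u in non_caches]
        if all(est_ratio[v] >= est_ratio[u] for u in rivals):
            decided.append(v)
    return decided
```

Because the comparison is only against direct neighbors, several non-adjacent nodes can decide to cache in the same round, which is what makes the scheme localized.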
21
Performance Evaluation
Parameters varied: (i) network-related: number of nodes and transmission radius; (ii) application-related: number of clients.
Random networks of 2,000 to 5,000 nodes in a 30 × 30 region.
22
Compared Caching Schemes Centralized Greedy
Centralized Greedy-Steiner
Distributed Greedy-Steiner
Dynamic Programming on Shortest Path Tree of Clients
Dynamic Programming on Steiner Tree over Clients and Server
23
Varying Network Size: transmission radius = 2, percentage of clients = 50%, update cost = 25% of the Steiner tree cost
24
Varying Transmission Radius: network size = 4000, percentage of clients = 50%, update cost = 25% of the Steiner tree cost
25
Varying Number of Clients: transmission radius = 2, update cost = 50% of the Steiner tree cost, network size = 3000
26
To Recap
Data caching problem under an update cost constraint.
Optimal algorithm for trees; an approximation algorithm for general graphs.
Efficient distributed implementations.
Next: a more general cache placement problem, (a) under memory constraint and (b) with multiple data items.
27
2. Data Caching under Memory Constraint
28
Problem Addressed
In a general ad hoc network with limited memory at each node, where should data items be cached so that the total access (communication) cost is minimized?
29
Problem Formulation
Given:
Network graph G(V,E); multiple data items; access frequencies (for each node and data item); a memory constraint at each node.
Select data items to cache at each node under the memory constraints.
Minimize total access cost = ∑_{nodes} ∑_{data items} (distance from the node to the nearest cache of that data item) × (access frequency)
30
Related Work
Related to the facility location and K-median problems, which have no memory constraint.
Baev and Rajaraman: 20.5-approximation algorithm for uniform-size data items.
For non-uniform sizes, there is no polynomial-time approximation unless P = NP.
We circumvent this intractability by approximating the "benefit" instead of the access cost.
31
Related Work (continued)
Two major empirical works on distributed caching: Hara [Infocom '99]; Yin and Cao [Infocom '04] (we compare our work with theirs).
Our work is the first to present a distributed caching scheme based on an approximation algorithm.
32
Algorithms
Centralized Greedy Algorithm (CGA): delivers a solution whose "benefit" is at least 1/2 of the optimal benefit.
Distributed Greedy Algorithm (DGA): purely localized.
33
Centralized Greedy Algorithm (CGA)
Benefit of caching a data item at a node = the reduction in total access cost,
i.e., (total access cost before caching) − (total access cost after caching)
34
Centralized Greedy Algorithm (CGA)
CGA iteratively selects the most beneficial (data item, node) pair, i.e., at each stage it picks the pair with the maximum benefit.
Theorem: CGA is (1/2)-approximate for uniform-size data items, and (1/4)-approximate for non-uniform-size data items.
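CGA's selection loop for uniform-size items can be sketched as follows; the `benefit` callback and memory table are illustrative assumptions, and in the real algorithm the benefit would be recomputed from the network each iteration.

```python
def cga(nodes, items, memory, benefit):
    """Centralized Greedy Algorithm sketch: repeatedly cache the
    (data item, node) pair with the maximum positive benefit, while the
    node still has free memory (uniform-size items assumed)."""
    placement = {v: set() for v in nodes}
    while True:
        # candidate pairs: item not yet cached at a node with free memory
        pairs = [(d, v) for v in nodes for d in items
                 if d not in placement[v] and len(placement[v]) < memory[v]]
        if not pairs:
            return placement
        d, v = max(pairs, key=lambda p: benefit(placement, p[0], p[1]))
        if benefit(placement, d, v) <= 0:
            return placement
        placement[v].add(d)
```

With one memory slot per node, each node ends up holding the single item that was most beneficial for it at the time it was filled.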
35
CGA Approximation Proof Sketch
G′: a modified version of G in which each node has twice the memory of its counterpart in G, and caches the data items selected by both CGA and the optimal solution.
B(Optimal in G)
≤ B(Greedy + Optimal in G′)
= B(Greedy) + B(Optimal w.r.t. Greedy)
≤ B(Greedy) + B(Greedy)   [by the greedy choice]
= 2 × B(Greedy)
36
Distributed Greedy Algorithm (DGA)
Each node caches the most beneficial data items, where the benefit is computed from "local traffic" only.
"Local traffic" includes: the node's own data requests; requests for the data items it caches; requests it forwards to other nodes.
37
DGA: Nearest-Cache Table
Why is it needed? To forward requests to the nearest cache, and for local benefit calculation.
What is it? Each node keeps the ID of the nearest cache for each data item, in entries of the form (data item, nearest cache), maintained on top of the routing table.
Maintenance: next slide.
38
Maintenance of the Nearest-Cache Table
When node i caches data item Dj:
i broadcasts (i, Dj) to its neighbors and notifies the server, which keeps a list of caches.
On receiving (i, Dj):
if i is nearer than the current nearest cache of Dj, update the entry and forward the message.
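The receive-side rule above can be sketched as a small handler; the node ids and the distance table `dist` are hypothetical, standing in for what the routing layer would provide.

```python
def on_new_cache(table, dist, node, i, item):
    """Handle a broadcast (i, Dj) at `node`: if the new cache i is nearer
    than the recorded nearest cache for item Dj, update the local
    nearest-cache table and return True so the caller forwards the
    message onward; otherwise drop it."""
    current = table.get(item)
    if current is None or dist[node][i] < dist[node][current]:
        table[item] = i
        return True   # nearer cache found: update and re-broadcast
    return False      # no improvement: drop the message
```

Returning a boolean lets the caller decide whether to re-broadcast, which is what bounds the flooding: a message stops propagating once it no longer improves anyone's table.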
39
Maintenance of the Nearest-Cache Table (II)
When node i deletes data item Dj:
i gets the list of caches Cj from the server of Dj, and broadcasts (i, Dj, Cj) to its neighbors.
On receiving (i, Dj, Cj):
if i is the current nearest cache for Dj, update the entry using Cj and forward the message.
40
Maintenance of the Nearest-Cache Table (III)
Further details pertain to:
Mobility
Second-nearest-cache entries (needed for benefit calculation on cache deletions)
Benefit thresholds
41
Performance Evaluation
CGA vs. DGA Comparison
DGA vs. HybridCache Comparison
42
CGA vs. DGA
Summary of simulation results: DGA performs quite close to CGA over a wide range of parameter values.
43
Varying Number of Data Items and Memory Capacity: transmission radius = 5, number of nodes = 500
44
DGA vs. Yin and Cao's Work
Yin and Cao [Infocom '04]:
CacheData: caches passing-by data items.
CachePath: caches the path to the nearest cache.
HybridCache: caches the data item if it is small enough; otherwise caches the path to it.
Theirs is the only prior work on a purely distributed cache placement algorithm with a memory constraint.
45
DGA vs. HybridCache
Simulation setup:
ns-2, with DSDV as the routing protocol.
Random waypoint model: 100 nodes moving at speeds within (0, 20 m/s) in a 2000 m × 500 m area.
Transmission radius = 250 m; bandwidth = 2 Mbps.
Performance metrics: average query delay; query success ratio; total number of messages.
Server model: 1000 data items, divided between two servers; data item sizes in [100, 1500] bytes.
Data access models:
Random: each node accesses 200 data items chosen randomly from the 1000 data items.
Spatial: (details skipped)
Naïve caching baseline: caches any passing-by data and uses LRU for cache replacement.
Varying query generation time under the random access pattern
48
Summary of Simulation Results
Both HybridCache and DGA outperform the naïve approach.
DGA outperforms HybridCache in all metrics, especially for frequent queries and small cache sizes.
Under high mobility, DGA has slightly worse average delay but a much better query success ratio.
49
To Recap
Data caching problem for multiple items under a memory constraint:
Centralized approximation algorithm.
Localized distributed implementation.
No update or storage costs are considered (otherwise, there is no performance guarantee).
Can we also consider, and minimize, the total read/write/storage cost?
50
3. Data Caching Under Number Constraint
51
Problem Formulation
Given:
Network graph G(V,E); a data item to be stored in the network; an access (read) frequency for each node; a write frequency for each node; a caching (storage) cost for each node; the number of allowable cache nodes, P.
Goal: select cache nodes to minimize the total cost, under the number constraint.
52
Total Cost
= total read cost + total write cost + total storage cost
= ∑_{i ∈ V} (hop distance between i and its nearest cache) × (read frequency of i)
+ ∑_{i ∈ V} (cost of the optimal Steiner tree over i and all caches) × (write frequency of i)
+ ∑_{i ∈ cache nodes} (storage cost at i)
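The three-term objective can be sketched directly in code. This is an illustration only: the `steiner_cost` callback is assumed given (e.g. from a 2-approximation), and the node ids, hop table, frequencies, and storage costs are hypothetical.

```python
def total_cost(nodes, caches, hop, read_freq, write_freq, storage, steiner_cost):
    """Total cost = read cost + write cost + storage cost, per the slide."""
    # read: each node pays its hop distance to the nearest cache, weighted
    read = sum(min(hop[i][c] for c in caches) * read_freq[i] for i in nodes)
    # write: each writer pays the Steiner-tree cost over itself and all caches
    write = sum(steiner_cost({i} | set(caches)) * write_freq[i] for i in nodes)
    # storage: each cache node pays its own caching cost
    store = sum(storage[c] for c in caches)
    return read + write + store
```

On a path graph the Steiner tree over a node set is just the segment spanning it, which makes a convenient toy `steiner_cost` for checking the arithmetic.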
53
Related Work
K-median problem (access and storage costs):
Tamir's algorithm attains the best known time complexity on trees.
We generalize it to include write cost, on both trees (O(n^2·P^3)) and general graphs.
Kalpakis et al. solve the same problem with time complexity O(n^6·P^3).
54
Tree Topology
55
Tamir’s DP Algorithm on tree Tr
Transform arbitrary tree into full binary tree
Each non-leaf node v has two children: v1, v2
For each v in binary tree, compute and sort the distance from v to all nodes
“leaves to root” dynamic programming algorithm
56
Our DP Algorithm
Idea: for each node v in Tr,
cost of subtree Tv =
access cost of the nodes in Tv
+ storage cost of the cache nodes in Tv
+ write cost of all writer nodes in Tr incurred on edges in Tv
57
DP Algorithm: Definitions
G(v, q, r): optimal cost for subtree Tv with exactly q caches in Tv, the closest of which is at most r hops from v.
F(v, q, r): optimal cost for Tv with exactly q caches in Tv and some cache nodes outside Tv, the closest of which is r hops from v.
F′(v, r): optimal cost for Tv with no cache in Tv and some cache nodes outside Tv, the closest of which is r hops from v.
58
Recursive DP Equations (p cache nodes allowed)
1. G(v, q, 0), where v is a cache node:
= storage cost at v + the costs of Tv1 and Tv2 + the write costs on edges (v,v1) and (v,v2)
2. G(v, q < p, r > 0), where some cache node may lie outside Tv:
= min{ G(v, q, r−1) (i.e., some cache in Tv is within r−1 hops of v),
cost when the closest cache to v is exactly r hops away }
59
Recursive DP Equations (continued)
3. G(v, q = P, r > 0), where no cache node lies outside Tv:
= min{ G(v, q, r−1),
cost when the closest cache is exactly r hops away }
4. F(v, q, r), where there is a cache node outside Tv:
= min{ G(v, q, r−1),
cost when the closest cache to v is exactly r hops away }
60
Minimum total cost of the original tree Tr = min_{1 ≤ p ≤ P} G(r, p, L), where L is the number of hops from r to the farthest node in Tr.
Time complexity: O(n^2·P^3)
For each p, vary q from 1 to p.
For each (v, q), vary the closest cache node to v (n possibilities) and split q between Tv1 and Tv2 (q possibilities).
61
Conclusion
We design optimal, near-optimal, and heuristic algorithms for data caching under different constraints in ad hoc and sensor networks.
We show that our algorithms can be implemented in a distributed way.
62
Questions?