Ph.D. Thesis Proposal
Data Caching in Ad Hoc and Sensor Networks
Bin Tang
Computer Science Department, Stony Brook University
Dec 21, 2015
1
Ph.D. Thesis Proposal
Data Caching in Ad Hoc and Sensor Networks
Bin Tang
Computer Science Department, Stony Brook University
2
Summary of My Work: Data Caching
Update cost constraint: optimal algorithm for trees; approximation algorithm for general graphs
Memory constraint with multiple data items: approximation algorithm for general graphs
Number constraint with read/write/storage costs: optimal algorithm for trees
Localized distributed implementations; comparison with existing work
3
Motivation
Ad hoc and sensor networks are resource-constrained: limited bandwidth, battery energy, and memory.
Caching can save access (communication) cost, and thus bandwidth and energy, subject to update cost, memory, or number constraints.
4
Rooted in…
Facility location problem: set up facilities in a network to minimize total access cost and setting up cost
K-median problem: set up k facilities to minimize total access cost
5
1. Cache Placement in Sensor Networks Under Update Cost Constraint
6
Problem Statement
Sensor network model:
A data item is stored at a server node and updated at a certain frequency.
Other nodes access the data item at certain frequencies.
Problem statement: select nodes to cache the data item.
Goal: minimize the total access cost.
Constraint: total update cost.
7
Why update cost constraint?
Nodes close to the server bear most of the update cost.
8
Problem Formulation
Given:
Network graph G(V,E); a data item stored at a server node; the update frequency; the access frequency of each other node; an update cost constraint Δ.
Goal: select cache nodes to minimize the total access cost, such that the total update cost is less than Δ.
9
Total Access/Update Cost
Total access cost = ∑_{i ∈ V} (hop distance between i and its nearest cache) × (access frequency of i)
Total update cost = cost of the optimal Steiner tree over the server and all cache nodes
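The total access cost above can be computed directly; the sketch below is a minimal illustration in which the node ids, `hop_dist` table, and `access_freq` table are hypothetical, not from the slides.

```python
def total_access_cost(nodes, caches, hop_dist, access_freq):
    """Sum over all nodes i of (hops from i to its nearest cache) x (access frequency of i)."""
    cost = 0
    for i in nodes:
        nearest = min(hop_dist[i][c] for c in caches)  # hop distance to the closest cache
        cost += nearest * access_freq[i]
    return cost
```

For example, on a 3-node path with the cache at node 0, each node contributes its hop distance to node 0 weighted by its access frequency.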
10
Algorithm Design Outline
Tree networks: optimal dynamic programming algorithm.
General networks:
Multiple-unicast update model: approximation algorithm.
Steiner-tree update model: heuristic and distributed algorithms.
11
Tree Networks
12
Subtree Notation
Server: r.
Consider a subtree Tv.
Assume all nodes on the path (v,x) along its leftmost branch are caches.
Let C_v be the optimal access cost in Tv using additional update cost δ.
Next: a recursive equation for C_v.
[Figure: subtree Tv, with cache path (v,x) on its leftmost branch, inside the tree Tr rooted at server r]
13
Dynamic Programming Algorithm for Tv under update cost constraint δ
Let u be the leftmost deepest node in the optimal set of caches in Tv.
All nodes on path(v,u) can be made caches (the update cost does not increase).
For a fixed u, C_v = constant + optimal access cost in Rv,u under constraint (δ − δ_u),
where δ_u is the cost of updating u (using path(v,x)).
Decomposition: Tv = Lv,u + Tu + Rv,u
14
DP Recursive Equation for Tv
C_v = min_{u ∈ Tv} [ (access cost in Lv,u using path(v,x) or path(v,u))
+ (access cost in Tu using u)
+ (optimal cost in Rv,u under constraint δ − δ_u) ]
where δ_u is the cost of updating u (using path(v,x)). Note that Rv,u has a path (v, parent(u)) of caches on its leftmost branch.
15
Time Complexity
Time complexity: O(n^4 + n^3·Δ)
Analysis:
Precomputation takes O(n^4):
Lv,u with cache path (v,x): O(n^4), over all v, u, x
Tu: O(n^2), over all u
Evaluating the recursive equation takes O(n^3·Δ):
n^2·Δ entries: one for each pair (v,x) and each value of δ up to Δ
Each entry takes O(n): n possible choices of u
16
General Graph Networks
Two update cost models:
Multiple-unicast
Optimal Steiner tree
17
Multiple-Unicast Update Model
Update cost: sum of the shortest-path lengths from the server to each cache node.
Benefit of node A: decrease in total access cost due to selecting A as a cache.
Greedy metric: benefit per unit update cost.
18
Greedy Algorithm
Iteratively select the node with the highest benefit per unit update cost, until the update cost budget is exhausted.
Theorem: Greedy solution’s benefit is at least 63% of the optimal benefit.
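A minimal sketch of this greedy loop follows; the `benefit` callback, the `update_cost` table, and the node ids are illustrative assumptions (in the actual algorithm the benefit is recomputed against the cache set chosen so far).

```python
def greedy_caches(candidates, update_cost, budget, benefit):
    """Repeatedly cache the node with the highest benefit per unit update
    cost, while some candidate still fits within the update cost budget."""
    caches, spent = [], 0
    remaining = set(candidates)
    while True:
        affordable = [v for v in remaining if spent + update_cost[v] <= budget]
        if not affordable:
            return caches
        # benefit may depend on the caches selected so far
        best = max(affordable, key=lambda v: benefit(caches, v) / update_cost[v])
        caches.append(best)
        spent += update_cost[best]
        remaining.remove(best)
```

Note the ratio test: a node with a large absolute benefit may lose to a cheaper node whose benefit per unit update cost is higher.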
19
Steiner-Tree Update Cost Model
Steiner-tree update cost: cost of a 2-approximate Steiner tree over the cache nodes.
Incremental Steiner update cost of node A: increase in the Steiner-tree update cost due to A becoming a cache.
Greedy-Steiner Algorithm: iteratively select the node with the highest benefit per unit incremental update cost.
20
Distributed Greedy-Steiner Algorithm
Each non-cache node estimates its benefit per unit update cost.
If its estimate is the maximum among all its non-cache neighbors, the node decides to cache.
Algorithm: in each round, each node decides whether to cache based on the above.
The server gathers the new cache-node information and computes the total update cost.
The remaining update cost budget is broadcast to the network, and a new round begins.
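The per-round decision rule can be sketched as below; the node ids, `neighbors` map, and locally estimated ratios are hypothetical stand-ins for what each node would compute from its own traffic.

```python
def distributed_round(non_caches, neighbors, est_ratio):
    """One round of the distributed Greedy-Steiner rule: a non-cache node
    decides to cache if its estimated benefit per unit update cost is
    maximal among itself and its non-cache neighbors."""
    decided = []
    for v in non_caches:
        rivals = [u for u in neighbors[v] if u in non_caches]
        if all(est_ratio[v] >= est_ratio[u] for u in rivals):
            decided.append(v)
    return decided
```

Because the comparison is only against direct neighbors, several non-adjacent nodes can decide to cache in the same round, which is what makes the scheme localized.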
21
Performance Evaluation
Parameters varied: (i) network-related: number of nodes and transmission radius; (ii) application-related: number of clients.
Random networks of 2,000 to 5,000 nodes in a 30 × 30 region.
22
Compared Caching Schemes Centralized Greedy
Centralized Greedy-Steiner
Distributed Greedy-Steiner
Dynamic Programming on Shortest Path Tree of Clients
Dynamic Programming on Steiner Tree over Clients and Server
23
Varying Network Size: transmission radius = 2, percentage of clients = 50%, update cost = 25% of the Steiner tree cost
24
Varying Transmission Radius: network size = 4000, percentage of clients = 50%, update cost = 25% of the Steiner tree cost
25
Varying Number of Clients: transmission radius = 2, update cost = 50% of the Steiner tree cost, network size = 3000
26
To Recap
Data caching problem under an update cost constraint.
Optimal algorithm for trees; an approximation algorithm for general graphs.
Efficient distributed implementations.
Next: a more general cache placement problem, (a) under memory constraint and (b) with multiple data items.
27
2. Data Caching under Memory Constraint
28
Problem Addressed
In a general ad hoc network with limited memory at each node, where should data items be cached so that the total access (communication) cost is minimized?
29
Problem Formulation
Given:
Network graph G(V,E); multiple data items; access frequencies (for each node and data item); a memory constraint at each node.
Select data items to cache at each node under the memory constraints.
Minimize total access cost = ∑_{nodes} ∑_{data items} (distance from the node to the nearest cache of that data item) × (access frequency)
30
Related Work
Related to the facility location and K-median problems, which have no memory constraint.
Baev and Rajaraman: 20.5-approximation algorithm for uniform-size data items.
For non-uniform sizes, there is no polynomial-time approximation unless P = NP.
We circumvent this intractability by approximating the "benefit" instead of the access cost.
31
Related Work (continued)
Two major empirical works on distributed caching: Hara [Infocom '99]; Yin and Cao [Infocom '04] (we compare our work with theirs).
Our work is the first to present a distributed caching scheme based on an approximation algorithm.
32
Algorithms
Centralized Greedy Algorithm (CGA): delivers a solution whose "benefit" is at least 1/2 of the optimal benefit.
Distributed Greedy Algorithm (DGA): purely localized.
33
Centralized Greedy Algorithm (CGA)
Benefit of caching a data item at a node = the reduction in total access cost,
i.e., (total access cost before caching) − (total access cost after caching)
34
Centralized Greedy Algorithm (CGA)
CGA iteratively selects the most beneficial (data item, node) pair, i.e., at each stage it picks the pair with the maximum benefit.
Theorem: CGA is (1/2)-approximate for uniform-size data items, and (1/4)-approximate for non-uniform-size data items.
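CGA's selection loop for uniform-size items can be sketched as follows; the `benefit` callback and memory table are illustrative assumptions, and in the real algorithm the benefit would be recomputed from the network each iteration.

```python
def cga(nodes, items, memory, benefit):
    """Centralized Greedy Algorithm sketch: repeatedly cache the
    (data item, node) pair with the maximum positive benefit, while the
    node still has free memory (uniform-size items assumed)."""
    placement = {v: set() for v in nodes}
    while True:
        # candidate pairs: item not yet cached at a node with free memory
        pairs = [(d, v) for v in nodes for d in items
                 if d not in placement[v] and len(placement[v]) < memory[v]]
        if not pairs:
            return placement
        d, v = max(pairs, key=lambda p: benefit(placement, p[0], p[1]))
        if benefit(placement, d, v) <= 0:
            return placement
        placement[v].add(d)
```

With one memory slot per node, each node ends up holding the single item that was most beneficial for it at the time it was filled.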
35
CGA Approximation Proof Sketch
G′: a modified version of G in which each node has twice the memory of its counterpart in G, and caches the data items selected by both CGA and the optimal solution.
B(Optimal in G)
≤ B(Greedy + Optimal in G′)
= B(Greedy) + B(Optimal w.r.t. Greedy)
≤ B(Greedy) + B(Greedy)   [by the greedy choice]
= 2 × B(Greedy)
36
Distributed Greedy Algorithm (DGA)
Each node caches the most beneficial data items, where the benefit is computed from "local traffic" only.
"Local traffic" includes: the node's own data requests; requests for the data items it caches; requests it forwards to other nodes.
37
DGA: Nearest-Cache Table
Why is it needed? To forward requests to the nearest cache, and for local benefit calculation.
What is it? Each node keeps the ID of the nearest cache for each data item, in entries of the form (data item, nearest cache), maintained on top of the routing table.
Maintenance: next slide.
38
Maintenance of the Nearest-Cache Table
When node i caches data item Dj:
i broadcasts (i, Dj) to its neighbors and notifies the server, which keeps a list of caches.
On receiving (i, Dj):
if i is nearer than the current nearest cache of Dj, update the entry and forward the message.
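The receive-side rule above can be sketched as a small handler; the node ids and the distance table `dist` are hypothetical, standing in for what the routing layer would provide.

```python
def on_new_cache(table, dist, node, i, item):
    """Handle a broadcast (i, Dj) at `node`: if the new cache i is nearer
    than the recorded nearest cache for item Dj, update the local
    nearest-cache table and return True so the caller forwards the
    message onward; otherwise drop it."""
    current = table.get(item)
    if current is None or dist[node][i] < dist[node][current]:
        table[item] = i
        return True   # nearer cache found: update and re-broadcast
    return False      # no improvement: drop the message
```

Returning a boolean lets the caller decide whether to re-broadcast, which is what bounds the flooding: a message stops propagating once it no longer improves anyone's table.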
39
Maintenance of the Nearest-Cache Table (II)
When node i deletes data item Dj:
i gets the list of caches Cj from the server of Dj, and broadcasts (i, Dj, Cj) to its neighbors.
On receiving (i, Dj, Cj):
if i is the current nearest cache for Dj, update the entry using Cj and forward the message.
40
Maintenance of the Nearest-Cache Table (III)
Further details pertain to:
Mobility
Second-nearest-cache entries (needed for benefit calculation on cache deletions)
Benefit thresholds
41
Performance Evaluation
CGA vs. DGA Comparison
DGA vs. HybridCache Comparison
42
CGA vs. DGA
Summary of simulation results: DGA performs quite close to CGA over a wide range of parameter values.
43
Varying Number of Data Items and Memory Capacity: transmission radius = 5, number of nodes = 500
44
DGA vs. Yin and Cao's Work
Yin and Cao [Infocom '04]:
CacheData: caches passing-by data items.
CachePath: caches the path to the nearest cache.
HybridCache: caches the data item if it is small enough; otherwise caches the path to it.
Theirs is the only prior work on a purely distributed cache placement algorithm with a memory constraint.
45
DGA vs. HybridCache
Simulation setup:
ns-2, with DSDV as the routing protocol.
Random waypoint model: 100 nodes moving at speeds within (0, 20 m/s) in a 2000 m × 500 m area.
Transmission radius = 250 m; bandwidth = 2 Mbps.
Performance metrics: average query delay; query success ratio; total number of messages.
Server model: 1000 data items, divided between two servers; data item sizes in [100, 1500] bytes.
Data access models:
Random: each node accesses 200 data items chosen randomly from the 1000 data items.
Spatial: (details skipped)
Naïve caching baseline: caches any passing-by data and uses LRU for cache replacement.
Varying query generation time under the random access pattern
48
Summary of Simulation Results
Both HybridCache and DGA outperform the naïve approach.
DGA outperforms HybridCache in all metrics, especially for frequent queries and small cache sizes.
Under high mobility, DGA has slightly worse average delay but a much better query success ratio.
49
To Recap
Data caching problem for multiple items under a memory constraint:
Centralized approximation algorithm.
Localized distributed implementation.
No update or storage costs are considered (otherwise, there is no performance guarantee).
Can we also consider, and minimize, the total read/write/storage cost?
50
3. Data Caching Under Number Constraint
51
Problem Formulation
Given:
Network graph G(V,E); a data item to be stored in the network; an access (read) frequency for each node; a write frequency for each node; a caching (storage) cost for each node; the number of allowable cache nodes, P.
Goal: select cache nodes to minimize the total cost, under the number constraint.
52
Total Cost
= total read cost + total write cost + total storage cost
= ∑_{i ∈ V} (hop distance between i and its nearest cache) × (read frequency of i)
+ ∑_{i ∈ V} (cost of the optimal Steiner tree over i and all caches) × (write frequency of i)
+ ∑_{i ∈ cache nodes} (storage cost at i)
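The three-term objective can be sketched directly in code. This is an illustration only: the `steiner_cost` callback is assumed given (e.g. from a 2-approximation), and the node ids, hop table, frequencies, and storage costs are hypothetical.

```python
def total_cost(nodes, caches, hop, read_freq, write_freq, storage, steiner_cost):
    """Total cost = read cost + write cost + storage cost, per the slide."""
    # read: each node pays its hop distance to the nearest cache, weighted
    read = sum(min(hop[i][c] for c in caches) * read_freq[i] for i in nodes)
    # write: each writer pays the Steiner-tree cost over itself and all caches
    write = sum(steiner_cost({i} | set(caches)) * write_freq[i] for i in nodes)
    # storage: each cache node pays its own caching cost
    store = sum(storage[c] for c in caches)
    return read + write + store
```

On a path graph the Steiner tree over a node set is just the segment spanning it, which makes a convenient toy `steiner_cost` for checking the arithmetic.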
53
Related Work
K-median problem (access and storage costs):
Tamir's algorithm attains the best known time complexity on trees.
We generalize it to include write cost, on both trees (O(n^2·P^3)) and general graphs.
Kalpakis et al. solve the same problem with time complexity O(n^6·P^3).
54
Tree Topology
55
Tamir’s DP Algorithm on tree Tr
Transform arbitrary tree into full binary tree
Each non-leaf node v has two children: v1, v2
For each v in binary tree, compute and sort the distance from v to all nodes
“leaves to root” dynamic programming algorithm
56
Our DP Algorithm
Idea: for each node v in Tr,
cost of subtree Tv =
access cost of the nodes in Tv
+ storage cost of the cache nodes in Tv
+ write cost of all writer nodes in Tr incurred on edges in Tv
57
DP Algorithm: Definitions
G(v, q, r): optimal cost for subtree Tv with exactly q caches in Tv, the closest of which is at most r hops from v.
F(v, q, r): optimal cost for Tv with exactly q caches in Tv and some cache nodes outside Tv, the closest of which is r hops from v.
F′(v, r): optimal cost for Tv with no cache in Tv and some cache nodes outside Tv, the closest of which is r hops from v.
58
Recursive DP Equations (p cache nodes allowed)
1. G(v, q, 0), where v is a cache node:
= storage cost at v + the costs of Tv1 and Tv2 + the write costs on edges (v,v1) and (v,v2)
2. G(v, q < p, r > 0), where some cache node may lie outside Tv:
= min{ G(v, q, r−1) (i.e., some cache in Tv is within r−1 hops of v),
cost when the closest cache to v is exactly r hops away }
59
Recursive DP Equations (continued)
3. G(v, q = P, r > 0), where no cache node lies outside Tv:
= min{ G(v, q, r−1),
cost when the closest cache is exactly r hops away }
4. F(v, q, r), where there is a cache node outside Tv:
= min{ G(v, q, r−1),
cost when the closest cache to v is exactly r hops away }
60
Minimum total cost of the original tree Tr = min_{1 ≤ p ≤ P} G(r, p, L), where L is the number of hops from r to the farthest node in Tr.
Time complexity: O(n^2·P^3)
For each p, vary q from 1 to p.
For each (v, q), vary the closest cache node to v (n possibilities) and split q between Tv1 and Tv2 (q possibilities).
61
Conclusion
We design optimal, near-optimal, and heuristic algorithms for data caching under different constraints in ad hoc and sensor networks.
We show that our algorithms can be implemented in a distributed way.
62
Questions?