1 Semi-structured data Patrick Lambrix Department of Computer and Information Science Linköpings universitet

Jan 12, 2016


Patrick Edwards

Semi-structured data

Patrick Lambrix

Department of Computer and Information Science

Linköpings universitet


Semi-structured data

- Data is not just text, but is not as well-structured as data in databases
- Occurs often in web databanks
- Occurs often in integration of databanks


Semi-structured data - properties

- irregular structure
- implicit structure
- partial structure
- a posteriori 'data guide' versus a priori schema
- large data guides


Semi-structured data - properties

- It should be possible to ignore the data guide upon querying
- Data guide changes fast
- object can change type/class
- difference between data guide and data is blurred


Semi-structured data - model

- network of nodes
- object model (oid)
- query: path search in the network


OEM (Object Exchange Model)

- Graph
- Nodes: objects
  - oid
  - atomic or complex
    - atoms: integer, string, gif, html, …
    - value of a complex object is a set of object references (label, oid)
- Edges have labels
- OEM is used by a number of systems (e.g. Lorel)
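The OEM description above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `OEMObject`, the helper `children`, and the tiny database `db` are hypothetical and do not come from the slides or from any OEM system.

```python
from dataclasses import dataclass
from typing import Union

# Atomic values; the slide lists integer, string, gif, html, ...
Atom = Union[int, str, bytes]


@dataclass
class OEMObject:
    oid: int
    # Atomic object: an Atom.
    # Complex object: a set of (label, oid) object references.
    value: Union[Atom, frozenset]

    def is_atomic(self) -> bool:
        return not isinstance(self.value, frozenset)


def children(db, oid, label):
    """Oids referenced from object `oid` through edges labelled `label`."""
    obj = db[oid]
    if obj.is_atomic():
        return []
    return sorted(target for (lbl, target) in obj.value if lbl == label)


# Hypothetical mini database (oid -> object); not the figure from the slides.
db = {
    1: OEMObject(1, frozenset({("restaurant", 2)})),
    2: OEMObject(2, frozenset({("name", 3), ("category", 4)})),
    3: OEMObject(3, "Chef Chu"),
    4: OEMObject(4, "gourmet"),
}

print(children(db, 1, "restaurant"))  # [2]
print(db[3].is_atomic())              # True
```

Labels live on the references rather than on the objects, which is what lets the same object be reached under different labels.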


OEM example

[Figure: OEM graph "Restaurant Guide", root object Guide. Edge labels: restaurant, restaurant, cafe, nearby, category, name, address, street, city, zipcode, price. Atomic values: gourmet, Chef Chu, El Camino Real, Palo Alto, 92310, Vietnamese, Saigon, Mountain View, Menlo Park, cheap, fast food, Sandra.]


Lorel query language

1. Find all places to eat Vietnamese food

select P
from RestaurantGuide.% P
where P.category grep "ietnamese"

2. Find the names and streets of all restaurants in Palo Alto

select R.name, A.street
from RestaurantGuide.restaurant{R}.address A
where A.city = "Palo Alto"
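To make the path traversal behind query 2 concrete, here is a minimal Python sketch. It is not a Lorel implementation and ignores Lorel features such as type coercion; the dictionary `db`, its oids, and the helper names `follow`, `values` and `names_and_streets` are all assumptions made for illustration.

```python
# oid -> atomic value, or list of (label, oid) references for complex objects.
# The oids and the shape of this graph are invented for illustration.
db = {
    1: [("restaurant", 2), ("restaurant", 3)],
    2: [("name", 4), ("address", 5)],
    3: [("name", 6), ("address", 7)],
    4: "Chef Chu",
    5: [("street", 8), ("city", 9)],
    6: "Saigon",
    7: [("city", 10)],
    8: "El Camino Real",
    9: "Palo Alto",
    10: "Mountain View",
}


def follow(oid, label):
    """Oids reached from `oid` over edges labelled `label`."""
    value = db[oid]
    if not isinstance(value, list):
        return []  # atomic objects have no outgoing edges
    return [target for (lbl, target) in value if lbl == label]


def values(oid, label):
    """Values of the objects reached from `oid` over `label`."""
    return [db[target] for target in follow(oid, label)]


def names_and_streets(root, city):
    """Mimics: select R.name, A.street
               from root.restaurant{R}.address A
               where A.city = city"""
    result = []
    for r in follow(root, "restaurant"):      # binds R
        for a in follow(r, "address"):        # binds A
            if city in values(a, "city"):     # where clause
                for name in values(r, "name"):
                    for street in values(a, "street"):
                        result.append((name, street))
    return result


print(names_and_streets(1, "Palo Alto"))  # [('Chef Chu', 'El Camino Real')]
```

The second restaurant has an address without a street and in another city; it is simply skipped, which reflects how path queries tolerate irregular structure.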


Lorel query language

3. Find all restaurants with zipcode 92310

select RestaurantGuide.restaurant
where RestaurantGuide.restaurant(.address)?.zipcode = 92310

Wildcards and variables

? - 0 or 1 path
+ - 1 or more paths
* - 0 or more paths
# - any path
% - 0 or more chars

- object variables
  select P from Guide.% P
  select A from #.address{A}
- path variables
  select Guide.#@P.name


Data Guides

- A structural summary over a data source that is used as a dynamic schema
- Is used in query formulation and optimization
- Is often created a posteriori
- Properties:
  - concise
  - accurate
  - convenient


Data Guides - definitions

Label path: sequence of labels L1.L2. … .Ln

Data path: alternating sequence of labels and oids L1.o1.L2.o2. … .Ln.on

Data path d is an instance of label path l if the sequences of labels are identical in l and d.
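The instance relation can be checked mechanically. In the sketch below, a data path is assumed to be a flat Python list `[L1, o1, L2, o2, ..., Ln, on]` and a label path a list of labels; this encoding and the function names are illustrative choices, not part of the definitions.

```python
def labels_of(data_path):
    """Labels of a data path given as [L1, o1, L2, o2, ..., Ln, on]."""
    return list(data_path[0::2])


def is_instance(data_path, label_path):
    """True if the data path has exactly the label sequence of the label path."""
    # A well-formed data path alternates label, oid, so its length is even.
    if len(data_path) % 2 != 0:
        return False
    return labels_of(data_path) == list(label_path)


print(is_instance(["restaurant", 2, "name", 4], ["restaurant", "name"]))     # True
print(is_instance(["restaurant", 2, "address", 5], ["restaurant", "name"]))  # False
```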


Data Guides - definitions

A data guide for object s is an object d such that every label path of s has exactly one data path instance in d, and each label path in d is a label path of s.


Data Guides

A data source can have several data guides

Minimal data guides: the smallest data guides


Data Guides - example

[Figure: (a) Data model: objects 1-10, edges labelled A, B, B, C, C, C, D, D, D. (c) Minimal Data Guide: objects 18-21, edges labelled A, B, C, D.]


Minimal Data Guides

- Concise
- May be hard to maintain
  Example: child node for 10 with label E


Strong Data Guides

Intuitively:

”label paths that reach the same set of objects in the data model = label paths that reach the same objects in the data guide”


Strong Data Guides - definitions

An object o can be reached from s via l if there is a data path of s that is an instance of l and that has o as last oid

(L1.o1.L2.o2. … Ln.o)

The target set for label path l in object s is the set of objects that can be reached from s via l. Notation: T(s,l)

L(s,l): set of label paths of s that have the same target set in s as l.
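A target set T(s,l) can be computed by following the label path one label at a time. The sketch below assumes a dictionary encoding of the graph (oid mapped to an atomic value or to a list of (label, oid) references). The example graph is an assumed reading of the data model in the Data Guides example: object 1 with an A edge to 2 and B edges to 3 and 4, then C edges to 5, 6, 7 and D edges to 8, 9, 10. The atomic values are placeholders.

```python
def target_set(db, s, label_path):
    """T(s, l): all objects reachable from s via the label path l."""
    current = {s}
    for label in label_path:
        reached = set()
        for oid in current:
            value = db[oid]
            if isinstance(value, list):  # only complex objects have edges
                reached.update(t for (lbl, t) in value if lbl == label)
        current = reached
    return frozenset(current)


# Assumed edge structure of the example data model; leaf values are placeholders.
db = {
    1: [("A", 2), ("B", 3), ("B", 4)],
    2: [("C", 5)], 3: [("C", 6)], 4: [("C", 7)],
    5: [("D", 8)], 6: [("D", 9)], 7: [("D", 10)],
    8: "x", 9: "y", 10: "z",
}

print(sorted(target_set(db, 1, ["B", "C"])))  # [6, 7]
print(sorted(target_set(db, 1, ["A", "C"])))  # [5]
```

Under this reading, A.C and B.C have different target sets, so a strong data guide must keep them apart, whereas a minimal data guide may merge them.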


Strong Data Guides - definitions

Definition: d is a strong data guide for s if for all label paths l of s it holds that L(s,l) = L(d,l)

There is a 1-1-mapping between target sets in the data model and nodes in a strong data guide.


Data Guides - example

[Figure: (a) Data model: objects 1-10, edges labelled A, B, B, C, C, C, D, D, D. (b) Strong Data Guide: objects 11-17, edges labelled A, B, C, C, D, D. (c) Minimal Data Guide: objects 18-21, edges labelled A, B, C, D.]


Strong Data Guides - algorithm

Implementation:
- Traverse data model depth-first.
- Each time you find a new target set for label path l, create a new object in the data guide.
- If the target set is already represented in the data guide, do not create a new object, but link to the existing object.
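The steps above can be sketched in Python as a depth-first traversal that creates one guide node per distinct target set. This is a minimal sketch, not a reference implementation: the dictionary encoding of the graph, the function name `build_strong_dataguide`, the guide node numbering, and the example graph (an assumed reading of the example data model, with placeholder leaf values) are all assumptions.

```python
from collections import defaultdict


def build_strong_dataguide(db, root):
    """Return (guide, targets).

    guide:   guide node id -> {label: guide node id}
    targets: guide node id -> frozenset of source oids (its target set)
    """
    guide = {}
    targets = {}
    node_of = {}  # target set -> guide node id already created for it

    def visit(tset):
        node = len(node_of)
        node_of[tset] = node  # register before recursing, so cycles terminate
        targets[node] = tset
        guide[node] = {}
        # Group the children of all objects in the target set by edge label.
        by_label = defaultdict(set)
        for oid in tset:
            value = db[oid]
            if isinstance(value, list):
                for label, child in value:
                    by_label[label].add(child)
        for label in sorted(by_label):
            child_set = frozenset(by_label[label])
            if child_set not in node_of:
                visit(child_set)  # new target set: create a new guide object
            guide[node][label] = node_of[child_set]  # else link to existing one

    visit(frozenset([root]))
    return guide, targets


# Assumed edge structure of the example data model; leaf values are placeholders.
db = {
    1: [("A", 2), ("B", 3), ("B", 4)],
    2: [("C", 5)], 3: [("C", 6)], 4: [("C", 7)],
    5: [("D", 8)], 6: [("D", 9)], 7: [("D", 10)],
    8: "x", 9: "y", 10: "z",
}

guide, targets = build_strong_dataguide(db, 1)
print(len(guide))          # 7
print(guide[0])            # {'A': 1, 'B': 4}
print(sorted(targets[4]))  # [3, 4]
```

Under the assumed graph the result has seven nodes, the same number as the strong Data Guide in the example (objects 11-17). Keying the lookup on target sets is what gives the 1-1 mapping between target sets and guide nodes.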


Strong Data Guides - use

- Easier to maintain
- Used as path index for query optimization


Semi-structured data - exercises


Exercise 1

Represent the relations below using the OEM data model.

Restaurants

r_id  name
r1    Hamlet
r2    Normandie
r3    Mc Donald's

Cities

c_id  name
c1    Linkoping
c2    Norkoping

Restaurants & Cities

r_id  c_id  street
r1    c1    Storgatan
r2    c1    St.Larsgatan
r3    c2    Kungsgatan


Exercise 2

Using the data model from the previous question, formulate the following queries using Lorel:

- find all the restaurants that are located in Linkoping
- find the address (city and street) of the "Hamlet" restaurant
- list the restaurants by city (equivalent of GROUP BY)


Exercise 3

Draw the strong Data Guide for the restaurant guide data model below.

[Figure: OEM graph "RestaurantGuide", root object Guide, objects numbered up to 29. Edge labels: restaurant, restaurant, cafe, nearby, category, name, address, street, city, zipcode, contact, manager, reservation, phone. Atomic values: gourmet, Chef Chu, El Camino Real, Palo Alto, 92310, Saigon, Menlo Park, fast food, Sandra, Rydsvagen, Linkoping, 58435, 71-72-73, 11-12-13, 31-32-33, 34-35-36.]