Silke Eckstein, Andreas Kupfer
Institut für Informationssysteme
Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de
XML Databases
10. XML Storage 1 – Overview
10.1 Motivation
10.2 Text-based storage
10.2.1 Index structures
10.3 Model-based storage
10.4 Schema-based storage
10.5 Conclusion
10.6 Overview and References
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 2
10. XML Storage 1
• Applications require different types of XML documents
– Structure vs. content
– Regular vs. irregular
• Thus, XML documents are
– Data-centric
– Document-centric
– or somewhere in between
• Questions
– Storage of XML documents
– Efficient processing of queries on the stored documents or data
• There are several methods for storage
– 1st goal: Learn and understand methods
– 2nd goal: Classify methods
• Principles
• Advantages and disadvantages
• Usage
10.1 Motivation
• Characterisation of XML documents:
– Data-centric documents
• Structured, regular
• E.g. product catalog, order, invoice
– Document-centric documents
• Unstructured, irregular
• E.g. scientific article, book, email, web page
– Semi-structured documents
• Data-centric and document-centric parts
• E.g. publications, Amazon, MS Press (example chapters)
• Requirements for the physical layer:
– Order preserving and lossless storage of XML documents
– Efficient access to XML documents or parts thereof
• Quick response time for
– Queries
– Update operations
• Indexing
• Transaction processing
• Support of XPath and XQuery
• Support of SAX and DOM for applications
• Storage approaches for XML documents
– Text-based
• Storage as character data
– Model-based
• Generic storage of the graph structure
• Storage of the DOM
– Schema-based
• Mapping to (object-)relational databases
– Deriving the database schema from the XML structure
– Using user-defined mapping procedures
• The whole XML document text is stored as character data
– File in the file system
– CLOB (Character Large OBject) in the DBS
• Operations on documents as a whole are very efficient
– Reading and writing the whole document
– But the content is monolithic and opaque to the relational query engine (a query cannot inspect a fragment)
• Granular access requires additional support
– Full text index
– Path index
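The contrast between whole-document access and fragment access can be sketched as follows. This is a minimal sketch, assuming SQLite as the DBS; the table name `xml_docs` and the sample document are assumptions and do not come from the slides.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document; not taken from the slides.
DOC = "<book><title>XML Databases</title><author>Benjamin</author></book>"

conn = sqlite3.connect(":memory:")
# The whole document is stored as one opaque character value (CLOB-like).
conn.execute("CREATE TABLE xml_docs (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO xml_docs (id, content) VALUES (?, ?)", (1, DOC))

# Reading the whole document back is a single lookup.
(stored,) = conn.execute("SELECT content FROM xml_docs WHERE id = 1").fetchone()

# A fragment cannot be addressed by this plain SQL query:
# the text has to be parsed by an XML processor first.
author = ET.fromstring(stored).findtext("author")
```

The parse step on every fragment access is the cost that the index structures in the following slides try to avoid.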
10.2 Text-based storage
• Index structures for XML documents allow efficient access for specific queries
– Different types of indexes are optimized for different types of queries
• Generate redundancy
– Index has to be kept up to date by propagating data changes
• Index structures can be storage structures as well
– They define the storage method
10.2.1 Index structures
• Types of index structures
– Value index
• Indexes atomic values of an XML document, like element content or attribute values
• Index format for structured parts of XML documents
• Already known from databases (B-trees, hash index, …)
– Full text index
• Indexes single words from the full text
• Index format for unstructured parts of XML documents
• Already known from Information Retrieval (inverted lists, tries, suffix trees, …)
– Path index
• Indexes subtrees/paths in an XML document
• Index format for semi-structured parts of XML documents
• Already known from object databases (access support relations, …)
• B-tree as value index for an XML document fragment [Tür08]
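A value index can be approximated in a few lines. This is a minimal sketch that uses a sorted list with binary search as a stand-in for a B-tree; the sample catalog and its element names are assumptions, not the fragment shown on the slide.

```python
import bisect
import xml.etree.ElementTree as ET

# Hypothetical sample data; element names are assumptions.
DOC = """<catalog>
  <item><description>Tea</description><price>4.50</price></item>
  <item><description>Pot</description><price>19.90</price></item>
  <item><description>Cup</description><price>7.25</price></item>
</catalog>"""

root = ET.fromstring(DOC)
items = root.findall("item")

# Index entries: (atomic value, position of the owning <item>), sorted by value.
entries = sorted(
    (float(item.findtext("price")), pos) for pos, item in enumerate(items)
)
keys = [value for value, _ in entries]

def items_cheaper_than(limit):
    """Range lookup: binary search on the sorted keys, then follow the references."""
    end = bisect.bisect_left(keys, limit)
    return [items[pos].findtext("description") for _, pos in entries[:end]]
```

A real B-tree adds paging and cheap updates; the lookup principle (ordered keys that point back into the document) is the same.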
– Stop word removal
– Elimination of uncommon items
• Linguistic methods
– Normalization of words (e.g. capitalisation, hyphenation)
– Word decomposition by rules (English) or dictionaries (German)
– Stemming
• Knowledge-based methods
– Use of ontologies and thesauri to search for synonyms, hypernyms and hyponyms
• Inverted list as full text index for XML [Tür08]
(Figure labels: word occurrence; word position in the text)
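An inverted list maps each word to the positions at which it occurs. The following is a minimal sketch, assuming a tiny hypothetical stop word list and lower-casing as the only normalization step; it does not reproduce the figure from the slide.

```python
import re
from collections import defaultdict

# Hypothetical stop word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "of", "and"}

def build_inverted_list(text):
    """Map each normalized word to the word positions at which it occurs."""
    index = defaultdict(list)
    words = re.findall(r"\w+", text.lower())  # normalization: lower-casing
    for position, word in enumerate(words):
        if word not in STOP_WORDS:            # stop word removal
            index[word].append(position)
    return dict(index)

index = build_inverted_list("The storage of XML and the indexing of XML")
```

Positions are counted over all words, including removed stop words, so that phrase and proximity queries can still rely on them.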
• Path index
– Structure information must be identifiable and reconstructable
• Assigning the markup to the content as well as
• Representing the hierarchical nesting and order of elements/attributes
– Especially suited for keyword search with regard to structure or path expressions
FOR $b IN //book
WHERE CONTAINS($b/author,"Benjamin")
RETURN $b
• Types of path indexes
– Nested path index
• Access to the root node from every node
– Multi-index
• Access to parent nodes
– Join index
• Access to parent and child nodes
– Access Support Relations (ASR)
• Generalization of the indexes above, by listing all paths in a table
[Tür08]
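The idea of listing all paths in a table can be sketched as follows. This is a simplified illustration in the spirit of the last bullet, not the formal ASR definition; the sample document and the function name `build_path_table` are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document.
DOC = """<library>
  <book><author>Benjamin</author><title>A</title></book>
  <book><author>Clara</author><title>B</title></book>
</library>"""

def build_path_table(root):
    """List every root-to-node path together with the nodes reached by it."""
    table = defaultdict(list)

    def visit(node, prefix):
        path = prefix + "/" + node.tag
        table[path].append(node)
        for child in node:  # children are visited in document order
            visit(child, path)

    visit(root, "")
    return dict(table)

paths = build_path_table(ET.fromstring(DOC))
authors = [node.text for node in paths["/library/book/author"]]
```

A path expression such as `/library/book/author` then becomes a single table lookup instead of a tree traversal.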
• Conclusion
– Efficient query processing on XML documents requires different types of index structures
– Value index
• For efficient access to structured parts
• Keyword search, value search
– Full text index
• For efficient access to unstructured parts
– Path index
• Using the document structure
• Navigating queries
• Summary text-based storage
– Schema definition:
• Not required
– Document reconstruction:
• Documents stay in their original format
– Queries:
• Information retrieval queries
• Processing the markup of the queries
• XML queries possible
– Special features:
• Full text functions
– Efficiency:
• Character string must be parsed by an XML processor on every access → expensive
• No concurrency on read or write → no parallel processing
– Usage:
• For document-centric XML applications
• Suitable only to a limited extent for semi-structured applications
• Idea: generic storage of the graph structure
– XML elements, XML attributes, … are nodes of a graph
– Nesting of elements defines edges
– Nodes get an (internal) ID based on graph traversal
• Using relations or object classes to store elements and attributes
• Document structure can be restored completely
• Extension for data-type-adapted storage is possible
10.3 Model-based storage
Elements: ID | Element name | Value | Reference to preceding | Rank
Attributes: ID | Attribute name | Value | Reference to element
• The EDGE approach [FK99]
– Variant BINARY: horizontal partition of EDGE based on label
[Tür08]
• XML queries
– XML queries (XPath, XQuery) are mapped to SQL queries (taking storage structures into account)
– Result of the XML query is generated from the result of the database query
• "Labeling" of the result tuples
• Result is in XML format
[Tür08]
• Example: list bargain buys with prices
SELECT a.content, b.content FROM Edge a, Edge b
WHERE (a.label = 'price') AND (a.content < 10.00)
AND (b.label = 'description')
AND (b.parent = a.parent) AND (a.key = b.key)
[Tür08]
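The shredding step and a query of this kind can be tried out end to end. This is a minimal sketch, assuming SQLite and an assumed column layout (`node`, `parent`, `ordinal`, `label`, `content`); the Edge table on the slides may be laid out differently, and the sample catalog is hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """<catalog>
  <item><description>Tea</description><price>4.50</price></item>
  <item><description>Pot</description><price>19.90</price></item>
</catalog>"""

conn = sqlite3.connect(":memory:")
# Assumed column layout; one row per element node.
conn.execute(
    "CREATE TABLE Edge (node INTEGER PRIMARY KEY, parent INTEGER, "
    "ordinal INTEGER, label TEXT, content TEXT)"
)

def shred(elem, parent=None, ordinal=0):
    """Store one row per element; node IDs follow a depth-first traversal."""
    text = (elem.text or "").strip() or None
    cur = conn.execute(
        "INSERT INTO Edge (parent, ordinal, label, content) VALUES (?, ?, ?, ?)",
        (parent, ordinal, elem.tag, text),
    )
    node = cur.lastrowid
    for i, child in enumerate(elem):
        shred(child, node, i)

shred(ET.fromstring(DOC))

# Self-join on the common parent, analogous to the query above.
bargains = conn.execute(
    "SELECT b.content, a.content FROM Edge a JOIN Edge b ON b.parent = a.parent "
    "WHERE a.label = 'price' AND CAST(a.content AS REAL) < 10.00 "
    "AND b.label = 'description'"
).fetchall()
```

Content is stored as text here, so the price comparison needs an explicit cast; this is one motivation for the data-type-adapted extensions mentioned earlier.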
• DOM-based storage
– Information from the Document Object Model is stored in the database
– Storage alternatives
• (Object-)relational databases
• Object-oriented databases
• Developing a custom data structure
[Tür08]
DOM-based storage – example [Tür08]
(Figure labels: Node type: ELEMENT; Node type: ATTRIBUTE; Node type: TEXT)
• XML queries
– XML queries (DOM method invocations) are mapped to SQL queries (taking storage structures into account)
– Result of the method invocation is generated from the result of the database query
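How a DOM method invocation can be answered by an SQL query is sketched below. This is a minimal sketch, assuming SQLite; the table `dom_node`, its columns and the function `get_child_nodes` are hypothetical names, and only element, attribute and text nodes are handled.

```python
import sqlite3
from xml.dom import Node, minidom

# Hypothetical sample document.
DOC = '<book year="2008"><title>XML</title></book>'

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dom_node (id INTEGER PRIMARY KEY, parent INTEGER, "
    "ordinal INTEGER, node_type TEXT, name TEXT, value TEXT)"
)
INSERT = (
    "INSERT INTO dom_node (parent, ordinal, node_type, name, value) "
    "VALUES (?, ?, ?, ?, ?)"
)
TYPE_NAMES = {Node.ELEMENT_NODE: "ELEMENT", Node.TEXT_NODE: "TEXT"}

def store(node, parent=None, ordinal=0):
    """Store a DOM node, its attributes and its children; return the node's ID."""
    row = (parent, ordinal, TYPE_NAMES[node.nodeType], node.nodeName, node.nodeValue)
    node_id = conn.execute(INSERT, row).lastrowid
    if node.nodeType == Node.ELEMENT_NODE:
        for i in range(node.attributes.length):
            attr = node.attributes.item(i)
            conn.execute(INSERT, (node_id, i, "ATTRIBUTE", attr.name, attr.value))
        for i, child in enumerate(node.childNodes):
            store(child, node_id, i)
    return node_id

root_id = store(minidom.parseString(DOC).documentElement)

def get_child_nodes(node_id):
    """DOM-style childNodes call, answered by an SQL query."""
    return conn.execute(
        "SELECT node_type, name, value FROM dom_node "
        "WHERE parent = ? AND node_type != 'ATTRIBUTE' ORDER BY ordinal",
        (node_id,),
    ).fetchall()
```

Navigation from a known node is a cheap lookup on `parent`, whereas rebuilding the whole document needs one such query per node.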
• Summary model-based storage
– Schema definition:
• Not required for storage
– Document reconstruction:
• Possible, but expensive
– Queries:
• XML queries possible
• Adapted database queries
– Special features:
• Querying many elements/attributes is expensive
– Efficiency:
• Navigation from the given context is efficient
• Restoring the document and evaluating path expressions is inefficient
– Usage:
• For data- and document-centric as well as for semi-structured XML applications
• Motivation
– XML content shall be stored in a conventional database
– Accepting the loss of native access
– DB schema is derived from a DTD or an XML schema
• Problem
– Generate DB schema automatically
– Thereby use as much structure information as possible
• General approach for mapping from a DTD
– Transform DTD into a tree representation
– Nodes: element types, attributes, etc. (type layer!)
– Edges: nesting relationships of element types and their restrictions
– Traverse the tree in order to transform nodes and edges into database tables (according to certain rules)
10.4 Schema-based storage
• Generating the DB schema for a DTD:
– Rules to map element types:
• XML element type → column of a table
• Sequence of element types → columns of a table
• Alternative of element types → column of a table
• Element type with quantifier ? → column with null values
• Element type with quantifier +, * → set/list of columns (SET OF, LIST OF)
• Nested element types → TUPLE OF
– Rules to map attributes:
• XML attribute → column of a table
• IMPLIED → null values allowed
• REQUIRED → null values not allowed
• Default value → DEFAULT constraint
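The attribute rules translate almost directly into column definitions. This is a minimal sketch, assuming SQLite and a hypothetical list of attribute declarations; the names `column_definition` and `create_table_sql` are illustrative, and default values are assumed to be simple literals without quotes.

```python
import sqlite3

# Hypothetical attribute declarations: (name, SQL type, DTD default declaration).
ATTRIBUTES = [
    ("id", "TEXT", "#REQUIRED"),
    ("currency", "TEXT", "EUR"),  # literal default value
    ("note", "TEXT", "#IMPLIED"),
]

def column_definition(name, sql_type, default_decl):
    """Apply the attribute mapping rules to one attribute declaration."""
    if default_decl == "#REQUIRED":
        return f"{name} {sql_type} NOT NULL"   # null values not allowed
    if default_decl == "#IMPLIED":
        return f"{name} {sql_type}"            # null values allowed
    return f"{name} {sql_type} DEFAULT '{default_decl}'"

def create_table_sql(table, attributes):
    columns = ", ".join(column_definition(*attr) for attr in attributes)
    return f"CREATE TABLE {table} ({columns})"

ddl = create_table_sql("price", ATTRIBUTES)

conn = sqlite3.connect(":memory:")
conn.execute(ddl)
conn.execute("INSERT INTO price (id) VALUES ('p1')")
row = conn.execute("SELECT id, currency, note FROM price").fetchone()
```

Inserting only the required attribute shows the other two rules at work: the default fills `currency`, and the implied `note` stays null.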
• Mapping to relational databases
– DTD is usually required
– Queries use SQL functionality
– RDBMS data types are used (e.g. prices are NUMERIC)
– Problem: Mapping of collection types
• Subdivide into additional relations
– Example:
Comment:
Comment_ID | Customer_info | Feedback
44901 | C0001 | F0001
Customer_Info:
ID | Fname | Lname | Email
C0001 | Charles | Sanchez | C.Sanchez@hotmail...
Feedback:
ID | Type | Content
F001 | opinion | Darjeeling Special…
• Mapping with STORED (Semistructured TO RElational Data)
– Basic idea: Use data mining techniques on the XML structure to find a good mapping to tables [DFS99]
– Input
• XML documents (or an average sample of the collection)
• Query workload
• Restrictions of storage space, number of tables, …
• No DTD or XML schema is required!
– Output
• Relational schema
• STORED queries: Mapping instructions for XML documents to DB tables
– Procedure
• Determine the XML subtrees with the largest support in the collection and in the queries
• These subtrees are materialised in tables
• Irregular data is stored in overflow tables according to the EDGE approach
• Mapping with STORED – example [Tür08]
(Figure: XML documents shown as tree structure; labels: subtrees with high support)
• Mapping to object-relational databases
– DTD is usually required
– Queries use SQL functionality
– "Natural" mapping to tuple types, collection types
– In case of irregular document structure, databases contain many null values.
Comment:
Comment_ID | <Customer_info> | <Feedback>
44901 | (Fname: Charles, Lname: Sanchez, Email: C.Sanchez@hotmail...) | (Type: opinion, Content: Darjeeling Specia…)
[Tür08]
• Mapping of recursive data definitions
– DTDs can be recursive
– Infinite recursion is impossible on the instance layer of a database
– Procedure:
• Marking the nodes
• Subdividing into separate tables
• Use primary and foreign keys in RDBMS
• Use reference types in ORDBMS
<!ELEMENT book (front, body, references)>
<!ELEMENT references (book+)>
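For an RDBMS, the recursion between book and references can be broken with keys. This is a minimal sketch, assuming SQLite; the table and column names are assumptions, and the sketch relaxes the DTD in that a book may have no referenced books, which is what ends the recursion on the instance layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE book (id INTEGER PRIMARY KEY, front TEXT, body TEXT);
-- One row per book that appears in another book's references.
CREATE TABLE book_reference (
    citing_book INTEGER NOT NULL REFERENCES book(id),
    cited_book  INTEGER NOT NULL REFERENCES book(id)
);
""")
conn.executemany(
    "INSERT INTO book (id, front, body) VALUES (?, ?, ?)",
    [(1, "Front A", "Body A"), (2, "Front B", "Body B")],
)
# Book 1 references book 2; the recursion is a foreign key, not nested data.
conn.execute("INSERT INTO book_reference (citing_book, cited_book) VALUES (1, 2)")

cited = conn.execute(
    "SELECT b.front FROM book_reference r JOIN book b ON b.id = r.cited_book "
    "WHERE r.citing_book = 1"
).fetchall()
```

In an ORDBMS the `book_reference` rows would typically be replaced by a collection of reference-typed values inside the book tuple.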
• Mapping of element sequences
– Sequence can be important
• Use an additional attribute in these cases
– Example:
<lecture>
<lesson>Introduction</lesson>
<lesson>XML basics</lesson>
…
⇓
Order | Lesson
1 | Introduction
2 | XML basics
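The additional order attribute can be illustrated with a round trip. This is a minimal sketch, assuming SQLite and a closed two-lesson version of the lecture document; the column name `ord` is an assumption, chosen because ORDER is a reserved word in SQL.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical closed version of the lecture example.
DOC = "<lecture><lesson>Introduction</lesson><lesson>XML basics</lesson></lecture>"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lesson (ord INTEGER, title TEXT)")

# Store each lesson together with its position in the sequence.
lessons = ET.fromstring(DOC).findall("lesson")
conn.executemany(
    "INSERT INTO lesson (ord, title) VALUES (?, ?)",
    [(pos, el.text) for pos, el in enumerate(lessons, start=1)],
)

# Without ORDER BY the row order is not guaranteed; the extra column restores it.
rows = conn.execute("SELECT ord, title FROM lesson ORDER BY ord").fetchall()

# Rebuild the document in the original element order.
lecture = ET.Element("lecture")
for _, title in rows:
    ET.SubElement(lecture, "lesson").text = title
rebuilt = ET.tostring(lecture, encoding="unicode")
```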
• Mapping of alternatives
– XML allows alternatives to be specified
– Example:
<!ELEMENT car (compactCar | sedan | van)*>
– Three possible storage variants
• Each alternative is stored as a separate table column
• Subdivide alternatives into separate tables
• Use a table column of type XML
• Variant 1 – all alternatives in one table [Tür08]
– Problem: many null values (wasting storage space)
• Variant 2 – subdivided into multiple tables [Tür08]
– For queries, combination of tables is needed
• Variant 3 – using column type XML [Tür08]
– XML type allows XML queries or DOM methods
• Mapping of mixed content – example [Tür08]
• Mapping of mixed content
– Mapping to plain tables is ill-suited
– Use variant 3 from above or
• Content model ANY is not representable at all
– Arbitrary content, arbitrary element types
– Often the fitting storage structure can only be decided on the instance layer
• Schema-based storage with automatic mapping
– Advantages
• Queries, data types, aggregation functions, views
• Integration in other databases when storing structured data
– Disadvantages
• Large schema, sparsely filled databases (many null values)
• No flexible data types, storage of alternatives has problems
• Less flexible queries
– No information retrieval queries possible without additional extensions
– No full text operations for semi- or unstructured data
– Usually native access is not possible any more
• Mapping solutions with different specializations