Basic Technologies (Unicode, URIs, Namespaces, XML)
Camilo Thorne
Room 00.012
Institut für Maschinelle Sprachverarbeitung
Universität Stuttgart
+49 (0) 711 685-81369
[email protected]
Semantic Web, SS 2017 (based on slides by W. Kessler)
C. Thorne (IMS Stuttgart) Basic Technologies SemWeb, SS 2017 1 / 35
Outline
1 Recap
2 Unicode: One Character Set to Represent Them All
3 URIs: Uniform Resource Identifiers
4 XML: eXtensible Markup Language
5 XML Namespaces
6 XML Schema: Defining XML in XML
7 Summary
8 References
Recap on Modeling Basics
Pinpoint entities, concepts, relations, states of affairs and constraints mentioned in the following text, and build a formal representation:
Frames were proposed by Marvin Minsky in the paper “A Framework for Representing Knowledge.” Frames consist of slots and values. Frames are the primary data structure used in AI frame languages. Frames are similar to class hierarchies in object-oriented languages, but their design goals are different.
Unicode
The first computers only “spoke” English and stored characters in 7 bits, with the first bit of each byte set to 0 → ASCII: A is 01000001
With the first bit set to 1, we can encode “other” characters → e.g., in Latin-1: A is 01000001, ä is 11100100
You have to know the encoding to display a text correctly, but it is often not specified anywhere – this is madness!
Since 1987, there have been attempts to create one character set for every existing writing system
In 1991, the first Unicode standard was published
Unicode maps each character to an abstract code point, written in hexadecimal: A is U+0041, ä is U+00E4
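The difference between a code point and its stored bytes can be checked in a few lines of Python. This is only a sketch: the helper names `code_point` and `bits` are made up for illustration, and the character ä (U+00E4) is used as the non-ASCII example. The last line shows what goes wrong when a Latin-1 byte is read with a different 8-bit encoding (here ISO 8859-7, Greek).

```python
def code_point(ch: str) -> str:
    """Return the Unicode code point of one character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


def bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit binary strings."""
    return " ".join(f"{byte:08b}" for byte in data)


# The code point is an abstract number, independent of any encoding.
cp_a = code_point("A")    # U+0041
cp_ae = code_point("ä")   # U+00E4

# An encoding maps characters to concrete bytes.
ascii_a = bits("A".encode("ascii"))       # first bit is 0
latin1_a = bits("A".encode("latin-1"))    # same byte as in ASCII
latin1_ae = bits("ä".encode("latin-1"))   # first bit is 1

# Reading the same byte with the wrong encoding yields a different character.
misread = "ä".encode("latin-1").decode("iso8859_7")
```

In this sketch `misread` is the Greek letter δ, not ä: the byte is unchanged, only its interpretation differs.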
UTF-8: An Encoding for Unicode
A code point alone does not determine how a character is stored in bits/bytes; that is the job of an encoding
There are many encodings for Unicode; the most widely used is UTF-8
UTF-8 is a variable-length encoding and stores each Unicode code point in one to four bytes (up to 4 × 8 = 32 bits); the original design allowed up to six bytes, but RFC 3629 restricts UTF-8 to four
Code points 0-127 are stored in one byte, so text using only English characters looks the same in ASCII and UTF-8
Examples:
Character   Unicode   UTF-8
A           U+0041    01000001
ä           U+00E4    11000011 10100100
€           U+20AC    11100010 10000010 10101100
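The bit patterns for A (U+0041), ä (U+00E4) and € (U+20AC) can be reproduced with Python's built-in UTF-8 codec. This is a small sketch; `utf8_bits` is a hypothetical helper name, not part of any library. It also checks the length of the highest Unicode code point, U+10FFFF, which takes four bytes in current UTF-8.

```python
def utf8_bits(ch: str) -> str:
    """Encode one character as UTF-8 and show each byte as 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


# One, two and three bytes, depending on the code point.
examples = {ch: utf8_bits(ch) for ch in ("A", "ä", "€")}

# Number of bytes per character: the encoding is variable length.
lengths = {ch: len(ch.encode("utf-8")) for ch in ("A", "ä", "€")}

# The highest code point Unicode defines, U+10FFFF, needs four bytes.
max_length = len(chr(0x10FFFF).encode("utf-8"))
```

Printing `examples` gives the UTF-8 bytes for the three characters, and `lengths` makes the growth from one to three bytes explicit.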
Quiz: Unicode
Which of these statements are true?
A) Unicode is an encoding
B) UTF-8 is an encoding
C) One character uses at most 2 bytes in UTF-8 encoding
D) There are Unicode code points for Egyptian Hieroglyphs
E) Everybody uses UTF-8 encoding by default today
F) Documents you hand in during this course should use UTF-8
Uniform Resource Identifiers (URIs)
“Everything has a URI”
A URI is a unique identifier for a specific resource, i.e., no two resources can have the same URI in the same domain
One resource can have several URIs, e.g., I have a URI that refers to me as a teacher and one that refers to me as a singer
A URI can identify anything; it can be a URL (Uniform Resource Locator, or Web address), but not all URIs are URLs
A URI does not necessarily enable access to a resource
URI Examples
For us, URIs will always look like URLs, e.g., http://www.example.org/#JohnSmith.
URIs have two parts:
Namespace http://www.example.org/#
Local name JohnSmith
We can define prefixes for namespaces and abbreviate URIs with prefix:LocalName.
We will define ex as prefix for the example namespace, so http://www.example.org/#JohnSmith is abbreviated as ex:JohnSmith
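The namespace/local-name split and the prefix abbreviation can be sketched in Python as follows. The function names `split_uri`, `abbreviate` and `expand` and the `PREFIXES` table are assumptions made for illustration. Splitting at the last `#` (or, failing that, the last `/`) is a common convention, not a rule stated on the slides.

```python
from typing import Dict, Tuple

# Hypothetical prefix table: ex stands for the example namespace.
PREFIXES: Dict[str, str] = {"ex": "http://www.example.org/#"}


def split_uri(uri: str) -> Tuple[str, str]:
    """Split a URI into (namespace, local name) at the last '#' or '/'."""
    for sep in ("#", "/"):
        head, found, tail = uri.rpartition(sep)
        if found:
            return head + sep, tail
    return "", uri  # no separator: treat the whole string as local name


def abbreviate(uri: str, prefixes: Dict[str, str]) -> str:
    """Rewrite a full URI as prefix:LocalName if a known namespace matches."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # unknown namespace: leave the URI unchanged


def expand(name: str, prefixes: Dict[str, str]) -> str:
    """Turn prefix:LocalName back into the full URI if the prefix is known."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return name


full = "http://www.example.org/#JohnSmith"
parts = split_uri(full)
short = abbreviate(full, PREFIXES)
restored = expand(short, PREFIXES)
```

Here `short` is `ex:JohnSmith` and `restored` equals the original URI, so the abbreviation loses no information as long as the prefix table is known.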
Quiz: URIs
Which of these statements are true?
A) Two different URIs can never refer to the same object.
B) Two different objects can have the same URI.
C) All URIs are URLs.
D) INwOXOz96UQOU is a valid URI.
E) URIs must be assigned by the W3C to be valid.