Coping with Data Variety in the Big Data Era: The Semantic Computing Approach André Freitas Insight Centre for Data Analytics Rio Big Data Meetup (June 2014)
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach

Aug 27, 2014

André Freitas

Big Data is based on the vision of providing users and applications with a more complete picture of reality, supported and mediated by data. This vision comes with the inherent price of data variety, i.e. data that is semantically heterogeneous, poorly structured, complex and affected by data quality issues. Despite the hype around technologies targeting data volume and velocity, solutions for coping with data variety remain fragmented and have seen limited adoption. In this talk we will focus on emerging data management approaches, supported by semantic technologies, to cope with data variety. We will provide a broad overview of semantic computing approaches and how they can be applied to data management challenges within organizations today. This talk will give the audience a glimpse of the next generation of Big Data-driven information systems.
Transcript
  • Coping with Data Variety in the Big Data Era: The Semantic Computing Approach André Freitas Insight Centre for Data Analytics Rio Big Data Meetup (June 2014)
  • Outline: Shift in the Information Systems Landscape; Semantic Computing; Semantics Technologies that Work Today: Data Creation; Semantics Technologies that Work Today: Data Consumption; Case Study: Treo QA System; Conclusions
  • Shift in the Information Systems Landscape
  • Big Data Vision: More complete data-based picture of the world for systems and users.
  • Big Data Dimensions Volume Velocity Variety
  • Big Data Dimensions Volume Velocity Variety Veracity Value
  • Big Data Definitions: What is Big Data? Data Variety
  • Cost of Making Sense of It A lot of Big Data is a lot of small data put together. Most of Big Data is not a uniform big block. Each data piece is very small and very messy, and a lot of what we are doing there is dealing with that variety.
  • Cost of Making Sense of It It is more about the rate of change, the amount and the resources that you need to deal with it. If the programming effort per amount of high quality data is really high, the data is big in the sense of high cost to produce new information. Big Data seems to be about addressing challenges of scale, in terms of how fast things are coming out at you versus how much it costs to get value out of what you already have.
  • Cost of Making Sense of It You can have Big Data challenges not only because you have PBs of data but because data is incredibly varied and therefore consumes a lot of resources to make sense of it.
  • Cost of Making Sense of It The speed in which data is generated and the speed in which it needs to be processed in order to use it effectively.
  • Schema Growth Heterogeneous, complex and large-scale databases. Very large and dynamic schemas. From 10s-100s of attributes (circa 2000) to 1,000s-1,000,000s of attributes (circa 2014).
  • Semantic Heterogeneity Decentralized content generation. Multiple perspectives (conceptualizations) of the reality. Ambiguity, vagueness, inconsistency.
  • Data variety + Data quality - Data Programs Full data coverage Full automation
  • Structure level Unstructured Data Structured Data Consistent Comparable Processable Easy to generate Easy to analyze Semantic Computing
  • The Futurist Perspective
  • The Futurist Perspective AI vision Full automation Perfect natural language interaction
  • The Realist Perspective What can be achieved with semantic computing today?
  • Google Knowledge Graph
  • FB Graph Search
  • Apple Siri
  • IBM Watson
  • QA: Vision
  • Semantic Computing
  • (Some) Challenges in Semantics Knowledge Representation Model Reasoning Large, inconsistent, heterogeneous Data Expected Result: intelligent behavior Semantic flexibility, predictive power, automation ... Acquisition, Learning There is an economical model behind each element!
  • Meaning Word meaning is usually represented in terms of some formal, symbolic structure, either external or internal to the word External structure - Associations between different concepts Internal structure - Feature (property, attribute) lists The semantic properties of a word are derived from the formal structure of its representation - e.g. Inference algorithm, etc. Semantics = Meaning representation model (data) + inference model
  • Formal Representation of Meaning (Problems) Different meanings - bank (financial institution) bank (river side) Meaning variation in context Meaning evolution Ambiguity, vagueness, inconsistency
  • Formal Representation of Meaning (Problems) Different meanings - bank (financial institution) bank (river side) Meaning variation in context - clever politician, clever tycoon Meaning evolution Ambiguity, vagueness, inconsistency Word meaning acquisition & representation Lack of flexibility Scalability
  • Most semantic models have dealt with particular types of constructions, and have been carried out under very simplifying assumptions, in true lab conditions. If these idealizations are removed it is not clear at all that modern semantics can give a full account of all but the simplest models/statements. Sahlgren, 2013 Formal World Real World Baroni et al. 2013 Semantics for a Complex World
  • Semantics Technologies that Work Today Data Creation
  • Data Creation Human interaction element (Data Curation) Semantic representation Information extraction
  • Data Curation
  • Entity-Centric Content Generation
  • Defining Core Categories
  • Disambiguation/Synonym
  • Defining Attributes & Relationships
  • Data curation elements Data curation platforms - Spreadsheets - Open Refine - Karma Algorithmic curation - Validation & Annotation robots Curation at source - Minimal Information Models (MIRIAM) Data curation roles Crowdsourcing
  • Standardized Data Models Provides a minimum level of data interoperability Examples: - Resource Description Framework (RDF) - Linked Comma Separated Value (CSV) - Javascript Object Notation (JSON)
  • Resource Description Framework (RDF) Graph data model Entity-centric data integration Facilitates decentralized content generation URIs for concept identifiers Associated structured query language (SPARQL)
  • Resource Description Framework (RDF) Example graph (built up over several slides): dbpedia:General_Electric rdf:type dbo:Organization; dbpedia:General_Electric dbp:revenue "US$ 147.3 billion"@en; dbpedia:General_Electric dbp:locationCity dbpedia:Fairfield, Connecticut; dbpedia:General_Electric owl:sameAs sec:General_Electric; sec:General_Electric ifrs:CashFlowsFromUsedInOperationsTotal (value not shown); dbpedia:Fairfield, Connecticut owl:sameAs geo:Fairfield; geo:Fairfield geo:latitude "N 41 13' 29''"
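The entity-centric integration idea behind the RDF example can be sketched in a few lines of plain Python. This is only an illustration, not an RDF library: the triples are read off the slide (edge directions are an interpretation of the flattened figure, the `ifrs:` statement is omitted because its value is not shown, and the underscore in `dbpedia:Fairfield,_Connecticut` is added to make a usable identifier). The helper names `equivalents` and `describe` are hypothetical.

```python
# Toy triple store: each statement is a (subject, predicate, object) tuple.
# Prefixed names follow the slide; edge directions are an assumption.
TRIPLES = [
    ("dbpedia:General_Electric", "rdf:type", "dbo:Organization"),
    ("dbpedia:General_Electric", "dbp:revenue", '"US$ 147.3 billion"@en'),
    ("dbpedia:General_Electric", "dbp:locationCity", "dbpedia:Fairfield,_Connecticut"),
    ("dbpedia:General_Electric", "owl:sameAs", "sec:General_Electric"),
    ("dbpedia:Fairfield,_Connecticut", "owl:sameAs", "geo:Fairfield"),
    ("geo:Fairfield", "geo:latitude", "N 41 13' 29''"),
]


def equivalents(entity, triples):
    """Return the entity plus every identifier reachable through owl:sameAs.

    owl:sameAs is treated as symmetric and transitive.
    """
    seen = {entity}
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p != "owl:sameAs":
                continue
            if s == current and o not in seen:
                seen.add(o)
                frontier.append(o)
            elif o == current and s not in seen:
                seen.add(s)
                frontier.append(s)
    return seen


def describe(entity, triples):
    """Collect all (predicate, object) pairs stated about any equivalent identifier."""
    ids = equivalents(entity, triples)
    return sorted((p, o) for s, p, o in triples if s in ids and p != "owl:sameAs")


if __name__ == "__main__":
    # Facts published by two different sources are merged for one entity.
    for predicate, obj in describe("dbpedia:Fairfield,_Connecticut", TRIPLES):
        print(predicate, obj)
```

Following `owl:sameAs` links is what lets statements published independently (DBpedia, SEC filings, a geographic dataset) be read as one description of the same entity.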
  • Representation (in order of increasing semantic representation): Dictionary: terms and definitions; Relational (RDF): attributes, associations; Taxonomy (RDFS): classes in sub-/super-class hierarchy; Ontology (OWL): logical constraints; Rules (SWRL, RIF)
  • Representation Increasing Semantic Representation
  • Linked Data HTTP request RDF JSON SPARQL R2RML Relational Database
  • http://dbpedia.org/resource/Jupiter
  • Open Data Common-sense Knowledge Base Domain-specific Knowledge Base Entity reference system
  • DBpedia - http://dbpedia.org/ YAGO - http://www.mpi-inf.mpg.de/yago-naga/yago/ Freebase - http://www.freebase.com/ Wikipedia dumps - http://dumps.wikimedia.org/ ConceptNet - http://conceptnet5.media.mit.edu/ Geonames - http://www.geonames.org/ Common Crawl - http://commoncrawl.org/ Open Data
  • Standardized Vocabularies Open conceptual models to be reused across different datasets Provides conceptual model level interoperability Useful to be used for modelling recurrent domains of discourse
  • Standardized Vocabularies FOAF SIOC COGS Data Cube Vocabulary PROV-O DCTERMS WGS84 Geo Positioning SDMX QUDT SSN Schema.org VoID Data Catalog ... http://lov.okfn.org/dataset/lov/
  • Entity Recognition & Linking Align terms in unstructured text to entities in a structured KB Integrates structured to unstructured data
  • Entity Recognition & Linking Align terms in unstructured text to entities in a structured KB Integrates structured to unstructured data Can be used to support semantic search Provides a first level of structure to unstructured data Exploratory browsing
  • Entity Recognition & Linking Example: GE has also been implicated in the creation of toxic waste.
  • Entity Recognition & Linking Example: GE has also been implicated in the creation of toxic waste. yago:ConglomerateCompanies yago:MedicalEquipmentManufacturers yago:CompaniesListedOnTheNewYorkStockExchange
  • DBpedia Spotlight - http://spotlight.dbpedia.org NERD (Named Entity Recognition and Disambiguation) - http://nerd.eurecom.fr/ Stanford Named Entity Recognizer - http://nlp.stanford.edu/software/CRF-NER.shtml Entity Recognition/Linking
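As a rough illustration of entity recognition and linking, the sketch below aligns surface forms in text with knowledge base identifiers using a small lookup table. It is a toy dictionary matcher, not how DBpedia Spotlight or NERD work internally. The mapping of "GE" to `dbpedia:General_Electric` is an assumption made for the example; the YAGO types are the ones shown on the slide. The names `SURFACE_FORMS`, `ENTITY_TYPES` and `link_entities` are hypothetical.

```python
import re

# Hypothetical surface-form dictionary: text mention -> KB identifier.
SURFACE_FORMS = {
    "GE": "dbpedia:General_Electric",
    "General Electric": "dbpedia:General_Electric",
}

# Types attached to the linked entity (as listed on the slide).
ENTITY_TYPES = {
    "dbpedia:General_Electric": [
        "yago:ConglomerateCompanies",
        "yago:MedicalEquipmentManufacturers",
        "yago:CompaniesListedOnTheNewYorkStockExchange",
    ],
}


def link_entities(text):
    """Return (mention, start offset, entity id) for each dictionary match.

    Longer surface forms are tried first so that "General Electric"
    wins over a shorter overlapping form.
    """
    annotations = []
    taken = set()  # character offsets already covered by a match
    for form in sorted(SURFACE_FORMS, key=len, reverse=True):
        for match in re.finditer(r"\b" + re.escape(form) + r"\b", text):
            span = set(range(match.start(), match.end()))
            if span & taken:
                continue
            taken |= span
            annotations.append((match.group(0), match.start(), SURFACE_FORMS[form]))
    return sorted(annotations, key=lambda a: a[1])


if __name__ == "__main__":
    sentence = "GE has also been implicated in the creation of toxic waste."
    for mention, offset, entity in link_entities(sentence):
        print(mention, offset, entity, ENTITY_TYPES[entity])
```

Once a mention is linked, the types and attributes of the entity in the knowledge base become available for semantic search over the text, which is the "first level of structure" mentioned above.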
  • Syntactic Parsers GE/NNP has/VBZ also/RB been/VBN implicated/VBN in/IN the/DT creation/NN of/IN toxic/JJ waste/NN
  • Stanford parser - http://nlp.stanford.edu/software/lex-parser.shtml - Languages: English, German, Chinese, and others MALT - http://www.maltparser.org/ - Languages (pre-trained): English, French, Swedish C&C Parser - http://svn.ask.it.usyd.edu.au/trac/candc Parsers
  • GATE (General Architecture for Text Engineering) - http://gate.ac.uk/ NLTK (Natural Language Toolkit) - http://nltk.org/ Stanford NLP - http://www-nlp.stanford.edu/software/index.shtml LingPipe - http://alias-i.com/lingpipe/index.html Text Processing Tools
  • Database Representation Easy evolution of schemas (schema-less) Graph Databases - OpenLink Virtuoso - Neo4J - Transforming Lucene into a Graph Database NoSQL ...
  • Apache Unstructured Information Management Architecture (UIMA) - Component software architecture for the analysis of unstructured data - http://uima.apache.org/ NLP Interchange Format (NIF) - RDF & OWL-based - http://persistence.uni-leipzig.org/nlp2rdf/ NLP Integration
  • Relation/Graph Extraction Reverb - http://reverb.cs.washington.edu/ Graphia - http://graphia.dcc.ufrj.br/
  • Relation/Graph Extraction Example: In 2002, GE acquired the wind power assets of Enron.
  • Relation/Graph Extraction General Electric Company, or GE , is an American multinational conglomerate corporation incorporated in Schenectady , New York
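To make the relation extraction example concrete, here is a deliberately naive sketch that turns the sentence about GE and Enron into a subject/relation/object structure with an optional time qualifier. It uses a single regular expression and handles only sentences of that exact shape; tools such as ReVerb rely on part-of-speech patterns and are far more general. `PATTERN` and `extract_relation` are hypothetical names.

```python
import re

# One hand-written pattern: [In YEAR,] SUBJECT VERB-ed OBJECT.
PATTERN = re.compile(
    r"^(?:In (?P<year>\d{4}),?\s+)?"  # optional leading time expression
    r"(?P<subject>[A-Z][\w ]*?)\s+"   # capitalised subject phrase
    r"(?P<relation>\w+ed)\s+"         # naive past-tense verb
    r"(?P<object>.+?)\.?$"            # everything up to the final period
)


def extract_relation(sentence):
    """Return a dict with year, subject, relation and object, or None if no match."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    print(extract_relation("In 2002, GE acquired the wind power assets of Enron."))
```

The extracted structure maps naturally onto a graph: the subject and object become nodes, the relation becomes an edge, and the year becomes a qualifier on that edge.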
  • Semantics Technologies that Work Today Data Consumption
  • Vector Space Models Representation useful for approximate search Search over structured and unstructured data Construction of approximate semantic models
  • Vector Space Models http://en.wikipedia.org/wiki/General_Electric General Electric ... General Electric company
  • Lucene & Solr - http://lucene.apache.org/ Terrier - http://terrier.org/ Indexing & Search Engines
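A minimal sketch of approximate search with a vector space model follows: documents and the query become TF-IDF weighted term vectors and are ranked by cosine similarity. The three documents are made up for illustration, and the smoothed IDF formula is one common choice rather than what Lucene or Terrier use. All function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini-collection.
DOCS = [
    "general electric is a conglomerate",
    "jupiter is a planet",
    "electric cars",
]


def tokenize(text):
    """Lower-case and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Return (idf weights, one TF-IDF vector per document)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs):
    """Rank documents by similarity to the query, best first."""
    idf, vectors = build_index(docs)
    # Query terms unseen in the collection are ignored.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    return sorted(((cosine(q, v), d) for v, d in zip(vectors, docs)), reverse=True)


if __name__ == "__main__":
    for score, doc in search("General Electric company", DOCS):
        print(round(score, 3), doc)
```

Because ranking is by degree of overlap rather than exact match, the query still retrieves the General Electric document even though "company" never appears in it.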
  • Distributional Hypothesis Words occurring in similar (linguistic) contexts tend to be semantically similar He filled the wampimuk with the substance, passed it around and we all drunk some We found a little, hairy wampimuk sleeping behind the tree
  • Distributional Semantic Models (DSMs) Computational models that build contextual semantic representations from corpus data Semantic context is represented by a vector Vectors are obtained through the statistical analysis of the linguistic contexts of a word Salience of contexts (cf. context weighting scheme) Semantic similarity/relatedness as the core operation over the model
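The core of a distributional semantic model can be sketched as follows: each word is represented by a vector of co-occurrence counts with the other words in its context, and relatedness is the cosine between two vectors. The five-sentence corpus, the stop-word list, and the choice of the whole sentence as the context window are assumptions for illustration; real DSMs are built from large corpora and usually apply a context weighting scheme. The function names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus.
CORPUS = [
    "the dog runs on a leash",
    "the cat runs after the dog",
    "the cat sleeps",
    "the car drives on the road",
    "the dog barks at the car",
]
STOP_WORDS = {"the", "a", "on", "at", "after"}


def build_vectors(corpus):
    """Count, for every word, how often each other word shares a sentence with it."""
    vectors = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = [t for t in sentence.lower().split() if t not in STOP_WORDS]
        for i, target in enumerate(tokens):
            for j, context in enumerate(tokens):
                if i != j:
                    vectors[target][context] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


VECTORS = build_vectors(CORPUS)


def relatedness(word_a, word_b):
    """Semantic relatedness as the cosine of the two context vectors."""
    return cosine(VECTORS[word_a], VECTORS[word_b])


if __name__ == "__main__":
    print("dog ~ cat:", relatedness("dog", "cat"))
    print("dog ~ car:", relatedness("dog", "car"))
```

In this toy corpus "dog" and "cat" share the context "runs" more strongly than "dog" and "car" share "barks", so the model scores dog/cat as more related without any hand-written knowledge about animals.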
  • DSMs as Commonsense Reasoning Commonsense is here car dog cat bark run leash
  • DSMs as Commonsense Reasoning car dog cat bark run leash ... vs. Semantic best-effort
  • Amtera Esprit (distributional semantic relatedness) - http://www.mashape.com/amtera/esa-semantic-relatedness WS4J (Java API for several semantic relatedness algorithms) - https://code.google.com/p/ws4j/ SecondString (string matching) - http://secondstring.sourceforge.net S-space (distributional semantics framework) - https://github.com/fozziethebeat/S-Space String similarity and semantic relatedness
  • WordNet - http://wordnet.princeton.edu/ Wiktionary - http://www.wiktionary.org/ FrameNet - https://framenet.icsi.berkeley.edu/fndrupal/ VerbNet - http://verbs.colorado.edu/~mpalmer/projects/verbnet.html BabelNet - http://babelnet.org/ Lexical Resources
  • Entity Recognition & Linking Distributional Semantics Relation/Graph Extraction Internal Datasets Reference Corpora Semantic Pipeline Vocabulary Management Semantic Search & QA Crawling & Indexing Open Data Vocabularies, Taxonomies, Lexical Resources Internal Documents Knowledge Graph Management Knowledge Graph Data Curation Platform Crowdsourcing Services Applications User feedback Provenance Management
  • Case Study: Treo QA System
  • Querying your Knowledge Graph Gaelic: direction
  • Solution (Video)
  • More Complex Queries (Video)
  • Vocabulary Problem Query: Who is the daughter of Bill Clinton married to? Possible representations = Commonsense Knowledge Dataset (DBpedia 3.7 + YAGO): 45,767 predicates, 5,556,492 classes and 9,434,677 instances
  • Vocabulary Problem Query: Who is the daughter of Bill Clinton married to? Semantic Gap Semantic approximation Possible representations = Commonsense Knowledge Dataset (DBpedia 3.7 + YAGO): 45,767 predicates, 5,556,492 classes and 9,434,677 instances
  • Core Principles Minimize the impact of Ambiguity, Vagueness, Synonymy. Address the simplest matchings first (heuristics). Semantic Relatedness as a primitive operation. Distributional semantics as commonsense knowledge.
  • Step 1: POS Tagging Who/WP is/VBZ the/DT daughter/NN of/IN Bill/NNP Clinton/NNP married/VBN to/TO ?/. Query Pre-Processing (Question Analysis)
  • Step 2: Core Entity Recognition Rules-based: POS Tag + TF/IDF Who is the daughter of Bill Clinton married to? (PROBABLY AN INSTANCE) Query Pre-Processing (Question Analysis)
  • Step 3: Determine answer type Rules-based. Who is the daughter of Bill Clinton married to? (PERSON) Query Pre-Processing (Question Analysis)
  • Step 4: Dependency parsing dep(married-8, Who-1) auxpass(married-8, is-2) det(daughter-4, the-3) nsubjpass(married-8, daughter-4) prep(daughter-4, of-5) nn(Clinton-7, Bill-6) pobj(of-5, Clinton-7) root(ROOT-0, married-8) xcomp(married-8, to-9) Query Pre-Processing (Question Analysis)
  • Step 5: Determine Partial Ordered Dependency Structure (PODS) Rules based. Remove stop words. Merge words into entities. Reorder structure from core entity position. Query Pre-Processing (Question Analysis) (INSTANCE) ANSWER TYPE QUESTION FOCUS Bill Clinton daughter married to
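The question analysis steps above can be approximated with a short sketch that starts from the POS-tagged query of Step 1 and produces the ordered features "Bill Clinton", "daughter", "married to". This is a simplification and not the Treo implementation: it works on POS tags only (no dependency parse, no TF/IDF), treats a run of proper nouns as the core entity, and reorders simply by moving that entity to the front. The tag sets and the names `parse_tagged` and `build_pods` are assumptions.

```python
# POS tags treated as stop words in this sketch (an assumption).
DROP_TAGS = {"WP", "VBZ", "DT", "IN", "."}


def parse_tagged(tagged):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in tagged.split()]


def build_pods(tagged):
    """Return an ordered list of (text, feature type), core entity first."""
    tokens = parse_tagged(tagged)
    features = []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i]
        if tag == "NNP":
            # Merge consecutive proper nouns into one entity (likely an instance).
            j = i
            while j < len(tokens) and tokens[j][1] == "NNP":
                j += 1
            features.append((" ".join(w for w, _ in tokens[i:j]), "INSTANCE"))
            i = j
        elif (tag.startswith("VB") and tag not in DROP_TAGS
              and i + 1 < len(tokens) and tokens[i + 1][1] == "TO"):
            # Keep a verb together with a following "to": "married to".
            features.append((word + " " + tokens[i + 1][0], "PREDICATE"))
            i += 2
        elif tag in DROP_TAGS:
            i += 1  # remove stop words and punctuation
        else:
            features.append((word, "PREDICATE"))
            i += 1
    # Reorder so that the structure starts from the core entity.
    core = [f for f in features if f[1] == "INSTANCE"]
    rest = [f for f in features if f[1] != "INSTANCE"]
    return core + rest


if __name__ == "__main__":
    query = ("Who/WP is/VBZ the/DT daughter/NN of/IN Bill/NNP Clinton/NNP "
             "married/VBN to/TO ?/.")
    print(build_pods(query))
```

Starting the structure from the core entity matters because the instance is usually the least ambiguous part of the query, so it is the safest place to begin matching against the knowledge graph.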
  • Question Analysis Query Features Bill Clinton daughter married to (INSTANCE) (PREDICATE) (PREDICATE) Query Features PODS
  • Query Plan Map query features into a query plan. A query plan contains a sequence of core operations. (INSTANCE) (PREDICATE) (PREDICATE) Query Features Query Plan (1) INSTANCE SEARCH (Bill Clinton) (2) p1