Ilian Uzunov (Georgi Georgiev)

Scaling Semantic Technology to Increase User Engagement - FT.com

September 16th, 2015

Ontotext, Scaling Semantic Technology #1 Sept, 2015

•  Introducing Ontotext

•  Related Reads – an FT.com use case

•  What we managed to achieve

•  Hands on FT.com live

•  Positive signs across the news and media domain

•  Hands on NOW – News on the Web demo service

Outline

Why? enable better search, analytics and content delivery

What? data and content management technology: graph database engine + text-mining solutions

How? semantic analysis of text, linking text to data; NoSQL database with inference

Best for: dealing with heterogeneous dynamic data

Clients: BBC, FT, Bloomberg, DK, AstraZeneca, Wiley, etc.

Facts: 70 staff; HQ in Sofia; sales in London & New York

USP: the best semantic graph database engine; text-mining platform integrated with graph database

Company Brief

Sample RDF Graph: Data and Schema

[Figure: sample RDF graph. Data nodes: myData:Maria, myData:Ivan. Classes: ptop:Agent, ptop:Person, ptop:Woman. Properties: ptop:childOf, ptop:parentOf, owl:relativeOf. Edge labels: rdf:type, rdfs:range, rdfs:subPropertyOf, owl:inverseOf, owl:SymmetricProperty. Some edges are marked as inferred.]
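The kind of inference this graph illustrates can be sketched in a few lines of plain Python. This is a minimal illustration, not GraphDB's reasoner: the concrete edges (Maria as a child of Ivan, the class hierarchy via rdfs:subClassOf) are assumptions, since only the node and edge labels are known here, and the symmetric property is written as ptop:relativeOf on the assumption that it belongs to the same namespace as the other properties.

```python
# Triples are (subject, predicate, object) tuples; prefixed names are plain strings.
# All edges below are ASSUMED for illustration; only the labels come from the slide.
SCHEMA = {
    ("ptop:childOf", "owl:inverseOf", "ptop:parentOf"),
    ("ptop:childOf", "rdfs:subPropertyOf", "ptop:relativeOf"),
    ("ptop:relativeOf", "rdf:type", "owl:SymmetricProperty"),
    ("ptop:Woman", "rdfs:subClassOf", "ptop:Person"),
    ("ptop:Person", "rdfs:subClassOf", "ptop:Agent"),
}
DATA = {
    ("myData:Maria", "rdf:type", "ptop:Woman"),
    ("myData:Maria", "ptop:childOf", "myData:Ivan"),
}


def infer(triples):
    """Forward-chain four simple rules until no new triple is produced."""
    closure = set(triples)
    while True:
        new = set()
        for s, p, o in closure:
            for s2, p2, o2 in closure:
                if p2 == "owl:inverseOf":
                    # (s p o) and (p inverseOf q) => (o q s), in both directions
                    if p == s2:
                        new.add((o, o2, s))
                    if p == o2:
                        new.add((o, s2, s))
                if p2 == "rdfs:subPropertyOf" and p == s2:
                    # (s p o) and (p subPropertyOf q) => (s q o)
                    new.add((s, o2, o))
                if p2 == "rdf:type" and o2 == "owl:SymmetricProperty" and p == s2:
                    # (s p o) and p is symmetric => (o p s)
                    new.add((o, p, s))
                if p == "rdf:type" and p2 == "rdfs:subClassOf" and o == s2:
                    # (s type C) and (C subClassOf D) => (s type D)
                    new.add((s, "rdf:type", o2))
        if new <= closure:
            return closure
        closure |= new


INFERRED = infer(SCHEMA | DATA)
for triple in sorted(INFERRED - SCHEMA - DATA):
    print(triple)
```

From the single asserted fact that Maria is a child of Ivan, the sketch derives that Ivan is a parent of Maria and that the two are relatives of each other, which is the role of the edges marked as inferred.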


Interlinking Text and Data

Semantic Annotation


[Figure: article pmid:17714090 (Ian A Yang, "Clinical and experimental pharmacology …") linked to UMLS concepts. Concept labels shown: Respiration Disorders, Bronchial Diseases, Chronic Obstructive Airway Diseases, Asthma. Identifiers shown: umls:C0035204, umls:C0006261, umls:C000496.]


Technology Portfolio

Ontotext and Financial Times


Profile

•  Top 3 business media

•  Focused both on B2C publishing and B2B services

Goals

•  Create a horizontal platform for both data and content based on semantics and serve all functionality through it

Challenges

•  Critical part of the entire workflow

•  Multiple development projects in parallel with up to 2 months' time between inception and go-live

•  Horizontal platform with focus on organizations, people, GPEs and relations between them

•  Automatic extraction of all these concepts and relationships

•  Separate stream of work for user-behaviour-based recommendation of relevant content and data across the entire media


Serve relevant articles to increase user engagement and improve usability

FT Primary Objective

Subject: User

Object: Article, Media Asset, Data, …

Action: Read, Preview, Comment, …

Subject, Object, Action

Contextual Recommendation

Contextual Similarity

Behavioural Recommendation

Behavioural Similarity

User Prof

Contextual and Behavioural in Combination

Behavioural and Contextual Similarity

Reads

User Prof

Average News Article Metadata

Article

promoted (popular)

updated

created

summary

comments

FT Article Metadata

Summary

editorial

img:alt

people

regions

organisations

Metadata Used

Summary

editorial

img:alt

people

regions

organisations

concepts

keyphrases

User Actions

Limited to User reads Article

User Actions: Another Perspective

[Diagram labels: perform, comments, preview, contains, leads to read, leads to preview, Article, Search Action, Result, FTS Q., Tag, Cat, Tag set, results, cat taxonomy, Search Log.]

•  Relies on the previous choices of an individual user (a user's profile)

•  Results on the basis of the similarity of items, defined in terms of their content

•  The recommended content is rather homogeneous

“Content”-based Recommendation

Two-fold scoring approach

•  Similarity to recently viewed articles (context)

•  Relevance to a long-term user profile

–  Weights reflecting the relative importance of the individual terms (static component)

–  Transition likelihoods among any pair of terms (dynamic component)

Content-based Ranking Mechanisms
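The two-fold scoring above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the production ranking: articles are reduced to lists of terms, the context score is a cosine similarity, the profile score simply adds static weights and transition likelihoods, and the mixing weight `alpha` and all function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def context_score(candidate_terms, recent_articles):
    """Similarity of a candidate to the recently viewed articles (context)."""
    context = Counter()
    for terms in recent_articles:
        context.update(terms)
    return cosine(Counter(candidate_terms), context)


def profile_score(candidate_terms, static_weights, transitions, last_terms):
    """Relevance to the long-term profile: static plus dynamic component."""
    # Static component: importance of each individual term for this user.
    static = sum(static_weights.get(t, 0.0) for t in candidate_terms)
    # Dynamic component: likelihood of moving from a just-read term to a candidate term.
    dynamic = sum(
        transitions.get((prev, t), 0.0)
        for prev in last_terms
        for t in candidate_terms
    )
    return static + dynamic


def score(candidate_terms, recent_articles, static_weights, transitions, alpha=0.5):
    """Blend the two scores; alpha is an assumed, untuned mixing weight."""
    last_terms = recent_articles[-1] if recent_articles else []
    return (alpha * context_score(candidate_terms, recent_articles)
            + (1 - alpha) * profile_score(candidate_terms, static_weights,
                                          transitions, last_terms))


recent = [["greece", "debt"], ["ecb", "debt"]]
static_weights = {"debt": 0.6, "ecb": 0.3}
transitions = {("debt", "imf"): 0.4}
related = score(["imf", "debt"], recent, static_weights, transitions)
unrelated = score(["football", "transfer"], recent, static_weights, transitions)
```

A candidate that shares terms with the recent reads and with the profile scores above zero, while one with no overlapping terms scores exactly zero, which matches the slide's remark that content-based results tend to be homogeneous.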

•  Rely on statistics that reflect the past choices of all users

•  Results based on user ratings, and the similarity of users or items

•  Content-agnostic

•  Aware of the quality of content

Collaborative Filtering

Collaborative Ranking Mechanisms

User to Content Similarity Score

User to User Sim. Score

Content to Content Sim. Score
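A minimal sketch of how user-to-user and content-to-content scores could be derived from read logs alone. The specific measures are assumptions chosen for brevity (Jaccard overlap for users, cosine-normalised co-visitation counts for articles); the slides do not state which measures were used.

```python
import math
from collections import defaultdict
from itertools import combinations


def co_visitation(reads):
    """Count, for each article pair, how many users read both.

    reads: dict mapping user id -> set of article ids.
    """
    counts = defaultdict(int)
    for items in reads.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts


def item_similarity(a, b, reads, counts):
    """Content-to-content score: co-visits normalised by each article's readership."""
    readers_a = sum(1 for items in reads.values() if a in items)
    readers_b = sum(1 for items in reads.values() if b in items)
    both = counts.get(tuple(sorted((a, b))), 0)
    denom = math.sqrt(readers_a * readers_b)
    return both / denom if denom else 0.0


def user_similarity(u, v, reads):
    """User-to-user score: Jaccard overlap of the two users' read sets."""
    union = reads[u] | reads[v]
    return len(reads[u] & reads[v]) / len(union) if union else 0.0


reads = {"u1": {"a1", "a2"}, "u2": {"a1", "a2", "a3"}, "u3": {"a3"}}
counts = co_visitation(reads)
```

Neither function looks at article text, which is what makes the approach content-agnostic: two articles are similar only because the same people read them.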

•  Combines both approaches to improve the quality of prediction

•  Implemented via statistical models

•  Takes a wide array of features into consideration

Hybrid Approach

Initial Architecture

Final Architecture

[Architecture diagram. Components: FT API; Fetch & Annotation; OWLIM Worker; Recommendation API; Varnish Cache; SOLR 1, SOLR 2, SOLR 3; CS Node 1, CS Node 2, CS Node 3 (Replication Group I); three AWS instances behind an AWS Elastic LB. Content flow: 1. pull content, 2. annotate, 3. index. Request flow: 1. get related, 2. ask, 3. on cache miss, 4. query. Other labels: Article, annotate content, click stream, store user profiles, update user, update popularity.]

1. Pull content – annotate/enrich – index

2. Accumulate/update user profile

3. Recommend

Main Actions

Implementation Overview

Profile Update Request

(User ID, Item ID)

Query Generation

Items Index (Solr)

Profile Storage

(Cassandra)

Recommendation Request (User ID)

Profile Update

User: context; static component; dynamic component

Article: co-visitation matrix; popularity

Boosted sub-queries for all involved ranking schemes: content-based, collaborative, popularity, recency
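One way to picture the boosted sub-queries is as a single query string in standard Lucene/Solr syntax, where `^` attaches a boost to a term or a parenthesised clause. The sketch below only builds that string. The field names (`terms`, `id`, `popular`, `published`), the boost values and the helper names are hypothetical, and terms are assumed to need no escaping.

```python
def boosted_clause(field, weighted_terms):
    """Build '(field:t1^w1 OR field:t2^w2 ...)' from a term -> weight mapping."""
    parts = [
        f"{field}:{term}^{weight:g}"
        for term, weight in sorted(weighted_terms.items())
    ]
    return "(" + " OR ".join(parts) + ")"


def build_query(sub_queries):
    """Join (clause, boost) pairs into one query, one pair per ranking scheme."""
    return " OR ".join(f"{clause}^{boost:g}" for clause, boost in sub_queries if clause)


# Hypothetical inputs read from the stored user profile and article statistics.
profile_terms = {"debt": 0.6, "ecb": 0.3}   # content-based: weighted profile terms
co_visited = {"a17": 1.0, "a42": 0.5}       # collaborative: co-visited article ids

query = build_query([
    (boosted_clause("terms", profile_terms), 2.0),   # content-based
    (boosted_clause("id", co_visited), 1.5),         # collaborative
    ("popular:true", 0.5),                           # popularity
    ("published:[NOW-7DAYS TO NOW]", 0.5),           # recency
])
print(query)
```

Expressing every ranking scheme as a clause of one search request lets the items index do the final scoring in a single pass, with the relative boosts controlling how much each scheme contributes.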

•  8m named entities and metadata about them

•  20m labels of People and Organisations

•  CES cluster which can be scaled horizontally to handle peak loads

•  Live dictionary updates coming from GraphDB through the EUF (Entity Update Feed) plugin

•  Max throughput: 10 docs/sec on a single c3.2xlarge AWS node; multiply by N to get the throughput of an N-node cluster

•  Reliability has been 100%, but the solution hasn't been stressed as much as we've designed it for

Wrap up - Concept Extraction Highlights
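The throughput figure above translates into simple capacity arithmetic. The sketch assumes the perfectly linear scaling stated on the slide (10 docs/sec per node, multiplied by the node count); the function names are illustrative.

```python
import math

DOCS_PER_SEC_PER_NODE = 10  # max throughput of a single c3.2xlarge node, per the slide


def cluster_throughput(nodes):
    """Docs/sec for an N-node cluster, assuming linear scaling."""
    return DOCS_PER_SEC_PER_NODE * nodes


def nodes_needed(target_docs_per_sec):
    """Smallest cluster size that reaches the target rate under the same assumption."""
    return math.ceil(target_docs_per_sec / DOCS_PER_SEC_PER_NODE)


print(cluster_throughput(4))  # 40
print(nodes_needed(25))       # 3
```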

•  100% reliability in production for a full year (Ontotext also manages the deployment)

•  API handling 1.5m requests a day on average, up to 3m requests a day (1/3 recommendations, 1/3 logging user actions, 1/3 checking whether a user has enough history to ask for behavioural recommendations)

•  Roughly 200m recommendations served and 200m user actions tracked to date since go-live

•  450 873 documents indexed

•  No caching, since everything is effectively a personalized search request

Wrap up - Recommendation Highlights
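For a sense of scale, the daily request counts above can be converted into average per-second rates. The sketch assumes requests are spread evenly over the day, which understates real peaks; the names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(requests_per_day):
    """Average requests per second, assuming an even spread over 24 hours."""
    return requests_per_day / SECONDS_PER_DAY


average_rate = per_second(1_500_000)       # all API calls, average day
peak_day_rate = per_second(3_000_000)      # all API calls, busiest day
recommendation_rate = average_rate / 3     # roughly one third are recommendation calls

print(f"{average_rate:.1f} req/s on average, {peak_day_rate:.1f} req/s on a peak day")
```

On these assumptions the API averages about 17 requests per second, of which roughly 6 per second are recommendation requests that each run as an uncached personalized search.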

•  GraphDB had to comply with a set of tests designed by FT and OT: Network lag, Disk Space, Disk Load, Less Memory, CPU Load, etc.

•  Comprehensive support for OWL and SPARQL

•  Efficient inference through the entire life-cycle of the data

•  High-availability cluster architecture – proven and mature for more than 5 years now

–  GraphDB's first HA implementation has been running at the BBC since 2010

–  Unmatched HA tests and transaction load benchmarks

•  FTS and NoSQL Connectors for seamless integration

Wrap up – GraphDB Highlights

•  Washington Post tests new ‘Knowledge Map’ feature: “Our ultimate goal is to mine big data to surface highly personalized and contextual data for both journalistic and native content.”

•  New York Times R&D Lab announced an experimental project, “Editor”: 1) recognize a term that can be categorized, 2) link that entity to existing databases or microservices, 3) make this enriched information accessible to journalists

•  BBC Structured Journalist Manifesto. Structured journalism: 1) on the reporter side, automation helps improve a journalist’s reporting and make it less cumbersome, 2) on the audience side, semtech helps scale things that can improve the reader’s experience

Positive Signs from the News Industry

Selection of Ontotext Customers

Thanks!

We will be delighted to have a word with you after the session or later today or tomorrow!

•  Dr. Georgi Georgiev – Head of Ontotext Text Analysis Unit - georgi.georgiev@ontotext.com

•  Ilian Uzunov – Sales Director CEMEAA - ilian.uzunov@ontotext.com

•  Nikolay Krustev – GraphDB Sales Engineer - nikolay.krustev@ontotext.com
