R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Jan 21, 2018

LucidWorks
Transcript
Page 1: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

R to Forecast Solr Activity

Patrick Beaucamp

Bpm-Conseil, France – [email protected]

Page 2: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Agenda
•  Why is search monitoring important?
•  How and what to monitor?
•  How and what to analyze?
•  What can be forecast?
•  How to forecast?
•  Infrastructure and algorithms to forecast
•  Example with the AklaBox platform
•  Software packages – forecast infrastructure

Page 3: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Why is search monitoring (dashboard, forecast) important?

User behavior analysis for the sales & marketing team and the web design team

Before: the website as a showcase (vitrine)

Which menus & submenus are visited?

Where are the dead branches?

No real « search approach »

Page 4: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Why is search monitoring (dashboard, forecast) important?

User behavior analysis for the sales & marketing team and the web design team

Now: the website as a search interface

What are people looking for?

How are they searching?

Review your SEO

Page 5: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Why is search monitoring (dashboard, forecast) important?

Technology evolution: Solr, Drupal-Solr, WordPress-Solr, etc.

It's easy to add a « search » feature to a website

(Drupal) hosting companies don't want to live through this again!

Page 6: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Why is search monitoring (dashboard, forecast) important?

Website content evolution: providing relevant content

Search engine optimisation & keyword strategy

Followers & alerts

Page 7: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

How and what to monitor – Infrastructure Activity

Infrastructure activity – technical side
•  CPU & memory, bandwidth
Log analysis or JMX process, products like Nagios

Page 8: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

How and what to monitor – Solr Activity

Solr indexing activity
•  Nutch processing or ManifoldCF process
•  Fusion's Anda crawler
•  User activity (loading documents)

Solr search activity – functional side
•  Timestamp (when people are searching)
•  Search criteria – user behavior (combined search)

Solr log analysis (cluster of applications)
SolrCore metrics – Solr 6.4, section <solr><metrics> in solr.xml (250 metrics)
Web server log analysis (Apache log)
Web application analysis (your own search platform)

Page 9: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

How and what to Visualize – Solr Activity

Ready-to-use packages to explore Solr activity / logs
•  LucidWorks Banana
•  ELK
•  Some others, like Thoth / Trulia – Carbon Graphite, etc.

Page 10: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

How and what to monitor – Architecture Impact

Architecture for the French Ministry of Environment: 1 web platform with a clustered Solr infrastructure

Page 12: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

How and what to monitor – Architecture

French region with BI (Vanilla) and AklaBox (document management): 2 web platforms share the same Solr infrastructure

Page 13: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

How and what to monitor – Solr Activity Logging

Remember: you can only explore what you log
Remember: Solr log configuration is easy, but you need to practice
Good and bad news: this session is not a Solr logging tutorial

Page 14: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

How and what to Analyze

Be aware of your infrastructure (shared Solr)

Where do you log? https://findwise.com/blog/using-log4j-tomcat-solr-how-make-customized-file-appender/

Important: you may need to write your own parser & log aggregator (remember the L in the ELK suite of programs: Logstash)

You need to clarify your objective:
•  Solr response time analysis (infrastructure side)
•  Search keyword analysis (user behavior side)
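Both objectives (response time and search keywords) can be served by one small custom parser. Below is a minimal sketch in Python, assuming a request log line in the usual Solr layout (timestamp, `params={...}`, `hits=`, `QTime=`). The sample line, the pattern and the name `parse_solr_line` are illustrative assumptions; your own log4j pattern may differ, and queries containing `}` would need a stricter pattern.

```python
import re
from datetime import datetime
from urllib.parse import parse_qs

# Assumed layout of a Solr request log line; adapt to your own log4j configuration.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ .*?"
    r"path=(?P<path>\S+) params=\{(?P<params>.*?)\}"
    r"(?: hits=(?P<hits>\d+))? status=(?P<status>\d+) QTime=(?P<qtime>\d+)"
)

def parse_solr_line(line):
    """Return timestamp, query, hits and QTime as a dict, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    params = parse_qs(m.group("params"))  # decodes '+' and %xx escapes
    return {
        "timestamp": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "path": m.group("path"),
        "query": params.get("q", [""])[0],                      # user behavior side
        "hits": int(m.group("hits")) if m.group("hits") else None,
        "qtime_ms": int(m.group("qtime")),                      # infrastructure side
    }

# Hypothetical sample line, not taken from a real server.
sample = (
    "2017-09-12 10:15:32.123 INFO  (qtp1234-17) [   x:core1] o.a.s.c.S.Request "
    "[core1]  webapp=/solr path=/select params={q=irish+passport&rows=10} "
    "hits=42 status=0 QTime=7"
)
record = parse_solr_line(sample)
print(record["timestamp"], record["query"], record["qtime_ms"])
```

Keeping the query and the QTime in the same record lets the same parsed data feed both the keyword analysis and the response time analysis.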

Page 15: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Understand Solr Activity … understand your activity

Why are there peaks of activity?
Do they have an impact on the server (response time, stability)?
How can we anticipate peaks of activity?
•  Events: « Accord de Paris », « NBA Final »
•  Related events: Brexit, elections
•  Marketing campaigns
=> Information about external data

Page 16: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

What can be Forecast … is it of interest?

•  Server consumption (CPU, memory)
•  User activity per period
•  Search criteria
•  Much more …

Basically, anything that can be logged and then analyzed is a candidate for forecasting, using either time series or predictive events

Page 17: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

How to Forecast

Collection of data
•  Logs from the server
•  External data

Data preparation to rework some data

Work on the log, for example to extract search keywords

Work on intermediate results (for example search keywords) to classify them, group them, etc. Example: aggregating the search « potus » with « Obama » (between this date and this date) and with « Trump » (from this date)
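The keyword grouping step can be sketched as a small date-aware recoding table. This is a minimal Python sketch; the rule table, the dates in it (US inauguration days, chosen only for illustration) and the function name `recode` are assumptions, not part of the talk.

```python
from collections import Counter
from datetime import date

# Hypothetical recoding rules: (raw keyword, valid from, valid to, canonical label).
# The dates are illustrative; the talk only says "between this date and this date".
RULES = [
    ("potus", date(2009, 1, 20), date(2017, 1, 19), "obama"),
    ("potus", date(2017, 1, 20), date.max, "trump"),
]

def recode(keyword, day):
    """Normalise a keyword and map it to its canonical label for the given day."""
    kw = keyword.strip().lower()
    for raw, start, end, label in RULES:
        if kw == raw and start <= day <= end:
            return label
    return kw

# Made-up (keyword, search date) pairs, as they might come out of a log parser.
searches = [
    ("POTUS", date(2016, 5, 2)),
    ("Obama", date(2016, 5, 2)),
    ("potus", date(2017, 3, 1)),
    ("Trump", date(2017, 3, 1)),
]
counts = Counter(recode(kw, day) for kw, day in searches)
print(counts)
```

Keeping the rules in a table rather than in code makes it easier to review and extend the groupings as new events appear.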

Page 18: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

How to Forecast

You need basic statistics & analytics
•  What is the use of a cluster?
•  What does it mean for 2 data series to be correlated (or not correlated)?
•  What does data dependency mean?
•  Why do I have to deal with outliers?

Question: are outliers wrong data or exceptional data?
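One common way to at least locate outliers before deciding whether they are wrong or exceptional is a robust score based on the median. A minimal Python sketch, assuming the modified z-score rule (median absolute deviation, threshold 3.5); the data and the name `flag_outliers` are made up for illustration.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median absolute deviation based) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: nothing can be flagged with this rule.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]

# Made-up daily search counts with one spike.
daily_searches = [120, 131, 118, 125, 640, 122, 129]
flags = flag_outliers(daily_searches)
print([v for v, f in zip(daily_searches, flags) if f])  # [640]
```

The flag only says the day is unusual. Whether 640 searches is a logging error or the effect of a real event still has to be checked against external data.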

Page 19: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

How to Forecast – it's all about DATA

Language & platform to
•  Explore & visualize (statistical methods) … data understanding
•  Analyze & understand (outliers, trends, correlation, dependency)
•  Build a forecast model
•  Insert external data that impacts behavior (weather, marketing campaigns, business events)

Review your model: compare reality with the forecast data!
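Comparing reality with the forecast usually comes down to an error measure computed over the same periods. A minimal Python sketch using two standard measures (mean absolute error and mean absolute percentage error); the numbers are invented.

```python
def mean_absolute_error(actual, forecast):
    """Average absolute gap between observed and forecast values."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mean_absolute_percentage_error(actual, forecast):
    """Average absolute gap as a percentage of the observed value.

    Periods with zero observed activity are skipped to avoid dividing by zero.
    """
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)

# Made-up searches per day: what happened vs. what the model predicted.
actual = [100, 120, 90, 200]
forecast = [110, 115, 100, 150]
mae = mean_absolute_error(actual, forecast)
mape = mean_absolute_percentage_error(actual, forecast)
print(mae)  # 18.75
print(round(mape, 2))
```

Tracking these values over time shows when a model needs to be rebuilt, for example after an event that changed user behavior.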

Page 20: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Data Workflow: from Log to Dashboard
•  Log « analyzer/manager », like Logstash, but also your own parser
•  Load logs into a DBMS and/or Spark 2 (depending on your analysis strategy)
•  Statistical programs running inside Spark (R, Python, Scala, Julia …)
•  Data preparation interface (exploration, classification, recoding of data)
•  Exploration: any dashboard package that can run R or Python programs, such as Tableau, Vanilla FlexBoard, Zeppelin, Jupyter …

Page 21: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Forecast using R – What is R?

R is a programming language and software environment for statistical computing and graphics.

www.R-project.org

Page 22: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

R Challenge – Package Management

Page 23: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

R Challenge – Development Studios

Web-based

Page 24: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

R Challenge – Visualization & Dashboard
•  Shiny
•  Zeppelin
•  Jupyter
•  Vanilla Air

Page 25: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

R Challenge – Enterprise-Ready Platform

Shiny

Microsoft R Server (Revolution Analytics)

Oracle R Server

Vanilla Air

2015: creation of the R Consortium

Certified packages

Server-side architecture

Page 26: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Forecast using R

Packages to analyze Solr logs
•  Clusters of events: cluster package (algorithms like clara)
•  Time series for some events: ts, timeSeries package
•  Search keywords: qdap package
•  User IP (origin): rgeolocate package
•  Semantic analysis (similar words): tm package
•  Your own package to analyze data

Some basic statistics or data exploration:
•  Evolution of keyword searches
•  Groups of keywords …

Page 27: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Forecast using R

Need to integrate external data, for example:
•  Brexit event -> searches for Irish passports
•  US election & related events
•  Marketing campaign & its impact

Basic database integration:
•  Financial data (Yahoo, Quandl, …)
•  Weather data
•  Social media data (Twitter, Facebook)

Page 28: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Algorithms & Visualisation – Time-Series Analysis

•  Frequency & time representation
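A time-series view starts from counts per time bucket, which can then be smoothed or projected forward. A minimal Python sketch of those three steps, assuming timestamps already extracted from the log; the seasonal naive forecast is a simple baseline swapped in for illustration, not the model used in the talk, and all function names and data are hypothetical.

```python
from collections import Counter
from datetime import datetime

def searches_per_hour(timestamps):
    """Count searches per (date, hour) bucket."""
    return Counter(ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps)

def moving_average(series, window):
    """Trailing moving average; returns len(series) - window + 1 points."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

def seasonal_naive_forecast(series, period, horizon):
    """Forecast each future point as the value observed one period earlier.

    Assumes len(series) >= period.
    """
    start = len(series) - period
    return [series[start + (h % period)] for h in range(horizon)]

# Made-up data: three searches, then two weeks of daily counts with a weekly pattern.
stamps = [
    datetime(2017, 9, 12, 10, 15),
    datetime(2017, 9, 12, 10, 45),
    datetime(2017, 9, 12, 11, 5),
]
hourly = searches_per_hour(stamps)
daily = [10, 12, 11, 13, 30, 8, 7, 11, 13, 12, 14, 32, 9, 8]

print(hourly[datetime(2017, 9, 12, 10)])
print(moving_average(daily, 7))
print(seasonal_naive_forecast(daily, period=7, horizon=3))
```

A baseline like this is mainly useful as a reference: a more elaborate model is only worth keeping if it beats it on held-out data.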

Page 29: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Algorithms & Visualisation – Correlation Analysis

Marketing campaign & keywords
•  Searches on keywords A and B are correlated with Campaign 1
•  Campaign 1 has no impact on searches on keyword C

Knowing that 2 facts are correlated is as important as knowing that they are not correlated.
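The campaign example can be made concrete with a Pearson correlation coefficient between a campaign indicator and daily keyword counts. A minimal Python sketch with invented data; the variable names are illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up daily data: 1 = Campaign 1 running that day.
campaign_1 = [0, 0, 1, 1, 1, 0, 0, 0]
keyword_a = [20, 22, 55, 60, 58, 25, 21, 19]   # rises during the campaign
keyword_c = [30, 29, 31, 30, 29, 31, 30, 30]   # unaffected by the campaign

r_a = pearson(campaign_1, keyword_a)
r_c = pearson(campaign_1, keyword_c)
print(round(r_a, 3), round(r_c, 3))
```

A coefficient close to 1 for keyword A and close to 0 for keyword C matches the two situations on the slide. Correlation alone does not prove the campaign caused the searches.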

Page 30: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Algorithms & Visualisation – Cluster

Building a 2D visualisation with 2 dimensions, and creating groups
•  Group 1: US visitors, group of keywords « Finance »
•  Group 2: West European visitors, group of keywords « History »

Cluster of search criteria: synonym management (Valls -> Prime Minister)
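Grouping points on 2 dimensions can be sketched with plain k-means. This Python sketch is a simplified stand-in, not the clara algorithm from the R cluster package (which works with medoids on samples); the two features and all values are invented, and the initialisation is deliberately naive.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on 2D points, initialised with the first k points."""
    centers = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the nearest center (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical features per visitor segment:
# (share of visits from the US, share of searches on « Finance » keywords).
points = [(0.9, 0.8), (0.1, 0.2), (0.85, 0.9), (0.2, 0.1), (0.95, 0.85), (0.15, 0.15)]
centers, clusters = kmeans(points, k=2)
for center, members in zip(centers, clusters):
    print(center, members)
```

With real data the result depends on the initial centers and on how the dimensions are scaled, so several runs and a scaling step are usually needed.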

Page 31: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Algorithms & Visualisation – Principal Component Analysis

Building a 2D or 3D visualisation from multiple dimensions, and detecting differences between axes (usually good if an axis carries 30% of the information)
•  Axis 1: US visitors, multiple searches, group of keywords « Finance »
•  Axis 2: West European visitors, group of keywords « History »
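The "share of information carried by an axis" is the explained variance ratio of a principal component. A minimal Python sketch for the special case of 2 variables, where the eigenvalues of the covariance matrix have a closed form; real log data would have more dimensions and would need a linear algebra library. The function name and data are illustrative.

```python
from math import sqrt

def explained_variance_2d(xs, ys):
    """Share of total variance carried by each principal axis of two variables.

    Assumes at least two observations and non-zero total variance.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    trace = sxx + syy
    det = sxx * syy - sxy ** 2
    root = sqrt(max(trace ** 2 - 4 * det, 0.0))
    first, second = (trace + root) / 2, (trace - root) / 2
    return first / trace, second / trace

# Made-up data: two strongly related measures, then two unrelated ones.
related = explained_variance_2d([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
unrelated = explained_variance_2d([1, -1, 1, -1], [1, 1, -1, -1])
print(related, unrelated)
```

When the first axis carries almost all the variance, the two measures are largely redundant; when the shares are even, each axis carries its own information.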

Page 32: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Algorithms & Visualisation – Dependency

Building relations between keyword searches (pairs of terms)
Equivalent in retail: detection of products being bought together
Also available with the Solr Semantic Knowledge Graph

People searching on « Sport » are searching « Baseball »
People searching on « History » are searching « France »
… if the first search is not relevant, then the second search may never occur
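The relation between pairs of terms can be measured the way market-basket analysis does it, with a confidence value per ordered pair. A minimal Python sketch, assuming searches have already been grouped into user sessions; the sessions and the name `pair_confidence` are invented.

```python
from collections import Counter
from itertools import permutations

def pair_confidence(sessions):
    """For each ordered pair (a, b), the share of sessions containing a that also contain b."""
    term_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        terms = set(session)                      # count each term once per session
        term_counts.update(terms)
        pair_counts.update(permutations(sorted(terms), 2))
    return {(a, b): n / term_counts[a] for (a, b), n in pair_counts.items()}

# Made-up search sessions (one list of keywords per user session).
sessions = [
    ["sport", "baseball"],
    ["sport", "baseball", "tickets"],
    ["sport", "football"],
    ["history", "france"],
]
confidence = pair_confidence(sessions)
print(confidence[("sport", "baseball")])
print(confidence[("baseball", "sport")])
```

The measure is not symmetric: here every session with « baseball » also contains « sport », but only two out of three « sport » sessions contain « baseball ».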

Page 33: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

R & AklaBox – Solr: Usage

R to analyze documents during document upload
R to create custom search algorithms

Page 34: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

R & AklaBox – Solr: Usage

R to run analysis & predictive models using Solr log activity

Search engine powered by R (create custom search algorithms)

Document digitization & OCR: document recognition (R program to analyze document content and classify documents)

Page 35: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Platforms: Vanilla & Vanilla Air

Vanilla Air as a server-side R infrastructure
Vanilla Hub to integrate external data
Vanilla Portal to display FlexBoard dashboards powered by R
Taking advantage of document management features:
•  CMIS support (dashboard publication & distribution)
•  Solr indexing (dashboard indexing)
•  Search engine (dashboard access)

Page 36: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Platforms: Vanilla & Vanilla Air

Page 37: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Vanilla FlexBoard – Some Solr Data Visualization

Page 38: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Q & A ?

Page 39: R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil

Thank You