GPText: Greenplum Parallel Statistical Text Analysis Framework

Kun Li, Christan Grant, Daisy Zhe Wang, Sunny Khatri, George Chitouras

[email protected]fl.edu

Sunday, June 23, 2013

In-Database Analytics

• NLP support in RDBMSs is limited

• Implementing ML algorithms in RDBMSs is non-trivial

• Text search in RDBMSs is slow

GPText

A framework for large-scale statistical text analytics over a parallel DBMS.

• The DB provides parallelism and scale.

• Integrated text analytics algorithms with MADlib.

• Specialized architecture for text indexing and search using Solr.

Outline

• Introduction

• Greenplum DB

• Greenplum ∪ MADlib

• In-DB Conditional Random Field package

• Greenplum ∪ MADlib ∪ Solr

• Demo Screenshots

• Conclusion

Greenplum DB

• A shared-nothing parallel DBMS.

• Parallel PostgreSQL instances.

• Queries are distributed over segments with a parallel query optimizer.
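As a rough illustration of the shared-nothing idea on this slide, the sketch below hash-distributes rows across a few in-memory "segments", lets each segment compute a partial aggregate over its own rows, and then merges the partial results. The segment count, the helper names `distribute` and `parallel_count`, and the sample rows are assumptions made for this example; it is a toy model, not a description of Greenplum's internals.

```python
from collections import Counter
import zlib


def distribute(rows, key, n_segments):
    """Assign each row to a segment by hashing its distribution key."""
    segments = [[] for _ in range(n_segments)]
    for row in rows:
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        segments[h % n_segments].append(row)
    return segments


def parallel_count(segments, group_col):
    """Each segment counts its own rows; the partial counts are then merged."""
    partials = [Counter(row[group_col] for row in seg) for seg in segments]
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Hypothetical sample data: five documents with a language column.
rows = [
    {"id": 1, "lang": "en"},
    {"id": 2, "lang": "de"},
    {"id": 3, "lang": "en"},
    {"id": 4, "lang": "fr"},
    {"id": 5, "lang": "en"},
]
segments = distribute(rows, key="id", n_segments=3)
counts = parallel_count(segments, group_col="lang")
```

No segment needs to see another segment's rows to produce its partial count, which is the property that lets such a query scale out.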

Greenplum ∪ MADlib

• An open source library for in-database analytics

• A collaborative effort between Berkeley, Wisconsin, and UF

• Maintained by Greenplum

MADlib

Conditional Random Fields

• The linear-chain CRF is used to find the most likely sequence of token labels.
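Finding the most likely label sequence in a linear-chain model is commonly done with Viterbi dynamic programming. The sketch below is a minimal, self-contained version with hand-set scores; the function name `viterbi`, the score tables, and the three-token sentence are assumptions for illustration and are not taken from the MADlib implementation. In a trained CRF the scores would come from learned feature weights.

```python
def viterbi(n_tokens, labels, start_score, trans_score, emit_score):
    """Return the highest-scoring label sequence for a linear-chain model.

    Scores are additive (log domain). emit_score(t, y) scores label y at
    position t; trans_score(y_prev, y) scores moving from y_prev to y.
    """
    if n_tokens == 0:
        return []
    # best[t][y]: best score of any labeling of tokens 0..t that ends in y
    best = [{y: start_score(y) + emit_score(0, y) for y in labels}]
    back = []
    for t in range(1, n_tokens):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans_score(p, y))
            scores[y] = best[-1][prev] + trans_score(prev, y) + emit_score(t, y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical hand-set scores for a three-token sentence.
tokens = ["the", "dog", "barks"]
labels = ["DET", "NOUN", "VERB"]
emit = {("the", "DET"): 2.0, ("dog", "NOUN"): 2.0,
        ("barks", "VERB"): 1.0, ("barks", "NOUN"): 1.2}
trans = {("DET", "NOUN"): 1.0, ("NOUN", "VERB"): 1.0, ("NOUN", "NOUN"): -1.0}

result = viterbi(
    len(tokens),
    labels,
    start_score=lambda y: 0.0,
    trans_score=lambda p, y: trans.get((p, y), 0.0),
    emit_score=lambda t, y: emit.get((tokens[t], y), 0.0),
)
# result == ["DET", "NOUN", "VERB"]
```

Taken alone, the emission scores slightly prefer NOUN for "barks"; the transition scores tip the whole sequence toward VERB, which is why the sequence is decoded jointly instead of token by token.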

CRF Architecture

CRF Inference

CRF Scalability

• Single host

• 64 MBs, 32 cores

• CoNLL 2000 dataset

Greenplum ∪ MADlib ∪ Solr

GPText

GPText queries

select * from gptext.create_index(<schema-name>, <table-name>, <id_col>, <def-search-column>);

select * from gptext.index(table(select * from <schema-name>.<table>),<index-name>);

select * from gptext.commit_index(<index-name>);

create table sigmod_terms as select * from gptext.terms(table(select 1 scatter by 1), <index-name>, <search-column>, 'sigmod*', 'rows=10');
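To make the order of the workflow concrete (create the index, populate it, commit it, then query terms), the sketch below fills the slide's placeholders with example names and returns the four statements as strings. The helper `gptext_workflow`, the example schema, table, column, and index names, and the choice to pass arguments as quoted SQL string literals are all assumptions of this sketch; check the GPText documentation for the exact argument forms.

```python
def gptext_workflow(schema, table, id_col, search_col, index_name, pattern, rows=10):
    """Return the slide's four statements, in order, with names filled in.

    Only builds strings; nothing is sent to a database. Quoting the
    arguments as SQL string literals is an assumption of this sketch.
    """
    return [
        # 1. Create an empty index for the table.
        f"select * from gptext.create_index('{schema}', '{table}', "
        f"'{id_col}', '{search_col}');",
        # 2. Populate the index from the table's rows.
        f"select * from gptext.index(table(select * from {schema}.{table}), "
        f"'{index_name}');",
        # 3. Commit so the indexed rows become searchable.
        f"select * from gptext.commit_index('{index_name}');",
        # 4. Store the terms matching the pattern in a new table.
        f"create table sigmod_terms as select * from gptext.terms("
        f"table(select 1 scatter by 1), '{index_name}', '{search_col}', "
        f"'{pattern}', 'rows={rows}');",
    ]


# Hypothetical names, chosen only for this example.
statements = gptext_workflow(
    schema="public", table="papers", id_col="id", search_col="abstract",
    index_name="papers_idx", pattern="sigmod*",
)
```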

Concept Application

Conclusion

• We discussed our CRF contribution to MADlib.

• GPText is a scalable framework for text analytics in the database.

• We showed a concept application supporting fast text search.

Thank you

http://dsr.cise.ufl.edu

Extra Slides

CRF MADlib Interface

• select crf_train/test_data()

• select crf_train/test_fgen()

• select lincrf/vcrf_label()

CRF Training Algorithm

• Features are extracted with queries.

• The database takes care of the parallelism.

• Each pass of the inner loop updates the state, repeating until convergence.

• We can create a temporary table to store results.

• Use a Python UDF or a WITH statement to control iterations.

• Iterate the algorithm, performing the crf_lbfgs step on each pass.

• Increment the iteration count and check whether training is complete.

• Once training has converged, finalize the result features.
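The control flow described on these training slides (keep a state, apply one optimization step per pass, increment a counter, stop on convergence) can be sketched as a small driver loop. Everything named here is hypothetical: `train_until_converged`, `gradient_step`, and `small_change` are illustrative helpers, and plain gradient descent on a one-dimensional quadratic stands in for the crf_lbfgs step on the real CRF objective. In the database setting the state would live in a temporary table instead of a Python variable.

```python
def train_until_converged(step, init_state, has_converged, max_iter=100):
    """Apply `step` repeatedly; stop on convergence or after max_iter passes."""
    state = init_state              # plays the role of the temporary state table
    iteration = 0
    while iteration < max_iter:
        new_state = step(state)     # stand-in for one crf_lbfgs step
        iteration += 1              # increment, then check for completion
        done = has_converged(state, new_state)
        state = new_state
        if done:
            break
    return state, iteration         # the final state is the "finalized" result


# Toy objective f(w) = (w - 3)^2, minimized at w = 3.
def gradient_step(w, learning_rate=0.25):
    gradient = 2.0 * (w - 3.0)
    return w - learning_rate * gradient


def small_change(old, new, tolerance=1e-6):
    return abs(new - old) < tolerance


weight, iterations = train_until_converged(gradient_step, 0.0, small_change)
```

Keeping the loop separate from the step function mirrors the division of labor on the slides: the driver only controls iteration, while the per-pass work is delegated to a step that the database can run in parallel.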
