KNIME and the Web – Extract, Test, Automate KNIME Spring Summit, Berlin, 25.02.2016 Philipp Katz,


KNIME and the Web – Extract, Test, Automate

KNIME Spring Summit, Berlin, 25.02.2016

Philipp Katz,


Our Background

• Three former PhD students at TU Dresden (me, Klemens Muthmann, David Urbansky)

• Computer Science, Information Extraction

• After PhD, each of us founded a startup

CYFACE (fancy logo under construction)


Palladian Nodes


Palladian?

• Java-based toolkit for information retrieval started in 2009

• Palladian KNIME nodes since 2011

• Used in commercial and academic projects

• Available from KNIME Community Contributions download site


The Palladian Nodes

• Text classification

• Content extraction

• Date extraction

• Named entity recognition

• Geo data extraction

• Web page, image, news search

• HTML, RSS, Atom parsing

• Ranking value retrieval

• Evaluation metrics


Access Web APIs

• Web Searcher

• Ranking Services


Text Classification

• Very simple, one predictor, one learner

• n-gram features and Naïve Bayes scoring

• Optimized for large amounts of training data

• Learner is now streamable, Predictor soon

• Competitive accuracy for many use cases
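
The combination of n-gram features and Naïve Bayes scoring named above can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not Palladian's implementation: the class name NGramNaiveBayes, the use of character trigrams, the add-one smoothing, and the toy training sentences are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NGramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (hypothetical sketch)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # category -> n-gram counts
        self.doc_counts = Counter()               # category -> number of documents
        self.vocabulary = set()                   # all n-grams seen in training

    def learn(self, text, category):
        """Update the counts with one labeled training document."""
        grams = char_ngrams(text, self.n)
        self.ngram_counts[category].update(grams)
        self.doc_counts[category] += 1
        self.vocabulary.update(grams)

    def predict(self, text):
        """Return the category with the highest log-probability score."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, float("-inf")
        for category, counts in self.ngram_counts.items():
            # Log prior plus the sum of add-one smoothed log likelihoods.
            score = math.log(self.doc_counts[category] / total_docs)
            total = sum(counts.values())
            for gram in char_ngrams(text, self.n):
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Toy training data, invented for this example.
classifier = NGramNaiveBayes(n=3)
classifier.learn("the match ended with a late goal", "sports")
classifier.learn("the striker scored twice in the final", "sports")
classifier.learn("the parliament passed the new budget", "politics")
classifier.learn("the minister announced an election", "politics")
prediction = classifier.predict("a goal in the final match")
```

Because learning only increments counters, a learner of this kind can consume training rows one at a time, which fits the streaming behaviour mentioned on the slide.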


Geographic Data

• In the works for a while; added after last year's summit due to popular demand

• New: Nodes for IP and address lookup

• New: Use local gazetteer as source for location extraction node


Geographic Data

• Extract and disambiguate locations from unstructured text, visualize them on the map
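
As a rough illustration of gazetteer-based location extraction with disambiguation, the sketch below looks up each word of a text in a tiny local gazetteer and, when a name is ambiguous, picks the most populous candidate. The GAZETTEER contents (with approximate coordinates and populations), the function name extract_locations, and the population heuristic are assumptions for the example; the slides do not say how the Palladian node disambiguates.

```python
import re

# Hypothetical miniature gazetteer: lowercased name -> candidates as
# (country code, latitude, longitude, population). Values are rough and
# purely illustrative.
GAZETTEER = {
    "berlin": [("DE", 52.52, 13.41, 3500000), ("US", 44.47, -71.18, 10000)],
    "dresden": [("DE", 51.05, 13.74, 550000)],
    "paris": [("FR", 48.86, 2.35, 2100000), ("US", 33.66, -95.56, 25000)],
}


def extract_locations(text):
    """Find gazetteer names in the text and keep the most populous candidate."""
    results = []
    for match in re.finditer(r"[A-Za-z]+", text):
        candidates = GAZETTEER.get(match.group().lower())
        if candidates:
            # Simple baseline disambiguation: prefer the largest population.
            country, lat, lon, _ = max(candidates, key=lambda c: c[3])
            results.append((match.group(), country, lat, lon))
    return results


locations = extract_locations(
    "The summit in Berlin drew visitors from Dresden and Paris."
)
```

The resulting (name, country, latitude, longitude) tuples are the kind of rows that could then be plotted on a map.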



HTTP and HTML

• New: Support for cookies, headers, and further HTTP methods besides GET

• New: Sending arbitrary byte stream content, form-encoding of table data

• New: OAuth signing for HTTP requests
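
To make "form-encoding of table data" together with cookies, headers, and non-GET methods concrete, here is a minimal sketch using only Python's standard library. It builds one POST request per table row without sending anything. The rows, the URL https://example.org/search, the cookie value, and the helper name build_form_request are made up for illustration and say nothing about how the Palladian nodes work internally.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical table data: one dict per row, keys are column names.
rows = [
    {"query": "knime selenium", "lang": "en"},
    {"query": "palladian nodes", "lang": "de"},
]


def build_form_request(url, row, cookies=None, extra_headers=None):
    """Build (but do not send) a POST request with a form-encoded body."""
    body = urlencode(row).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    if cookies:
        # Cookies travel in a single "Cookie" header as name=value pairs.
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    headers.update(extra_headers or {})
    return Request(url, data=body, headers=headers, method="POST")


requests_to_send = [
    build_form_request(
        "https://example.org/search",       # placeholder URL
        row,
        cookies={"session": "abc123"},      # placeholder cookie
        extra_headers={"Accept": "application/json"},
    )
    for row in rows
]
```

Passing a request to urllib.request.urlopen would actually send it; the sketch stops before that step so it runs without network access.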


Selenium Nodes


Selenium?

• “Selenium automates browsers.”

• The Selenium Nodes make it possible to simulate a real web browser with KNIME

• Use a KNIME workflow to describe actions and extract all the data you need


Use Cases

Data extraction

Task automation

Web application testing


Browser Support

• Local installations

• Headless “browsers” (PhantomJS, jBrowserDriver)

• Remotely running


• Connect to Selenium servers or VMs on your local network to simulate a variety of operating systems or browsers

• Use cloud services such as BrowserStack or SauceLabs, which provide ready-to-use Selenium instances (even iOS and Android)


Example Workflow



Node Overview

• Configure, start, and quit web browsers

• Navigate

• Locate Elements (using attributes, XPath, or CSS)

• Interact with Elements (click, input text, select, submit, …)
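
Locating elements by attribute or XPath can be tried out without a browser. The sketch below uses Python's xml.etree.ElementTree, which supports a limited XPath subset, on a small well-formed page. It only illustrates the locator idea; it is not the Selenium Nodes or the Selenium API, and the PAGE markup is invented for the example. CSS selectors are not available in the standard library and are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page; real pages are parsed by the browser.
PAGE = """
<html>
  <body>
    <form id="search">
      <input name="q" type="text" value="" />
      <button class="submit">Go</button>
    </form>
    <ul class="results">
      <li class="hit"><a href="/a">First result</a></li>
      <li class="hit"><a href="/b">Second result</a></li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Locate a single element by attribute value.
search_box = root.find(".//input[@name='q']")

# Locate several elements with an XPath-style path, then read
# their text content and an attribute.
links = root.findall(".//li[@class='hit']/a")
hits = [(a.text, a.get("href")) for a in links]
```

In a browser-driven workflow the same two steps apply: first select elements with a locator, then read text or attributes from the matched elements.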


Node Overview

• Highlight elements

• Take screenshots

• Extract data (page source, text content, attributes, …)

• Execute JavaScript

• Execute Selenium script

• Waiting and synchronization
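
Waiting and synchronization usually comes down to polling a condition until it holds or a timeout expires, because pages load content asynchronously. The following is a generic sketch of that pattern in plain Python; the helper wait_until and the simulated element_present check are hypothetical names for this example and are not taken from the Selenium Nodes.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.1):
    """Poll condition() until it returns a truthy value or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Simulated page whose element only "appears" on the third poll.
polls = []


def element_present():
    polls.append(1)
    return "element" if len(polls) >= 3 else None


found = wait_until(element_present, timeout=2.0, poll_interval=0.01)
```

Raising an error on timeout, rather than returning None, keeps a workflow from silently continuing with an element that never appeared.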


Outlook

• More sample workflows

• Documentation, how-tos, …

• Workflow import and export for Selenium Scripts


Questions? Get in touch!

[email protected] forum