Web Scraping Python, PhantomJS, & Selenium
Web Scraping - Simple Tutorials · PDF fileWeb Scraping Python, PhantomJS, ... phantomjs-yahoo-login-form-not-submitting-python-bindings?rq=1 ...

Mar 06, 2018

buiquynh
Web Scraping
Python, PhantomJS, & Selenium

Why PhantomJS?

Headless WebKit Browser
Runs JavaScript
Inject JavaScript
Interact with the page (forms, etc)
Take screenshots

Selenium automates browsers

http://docs.seleniumhq.org/

Install...

https://simpletutorials.com/c/2191/Installing+Selenium+and+PhantomJS+for+Python+3+on+Ubuntu+14.04

robots.txt

Robots.txt Parser

https://docs.python.org/3/library/urllib.robotparser.html
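
A minimal sketch of how the standard-library parser linked above can be used. To keep it runnable offline, it parses an in-memory robots.txt instead of downloading one. The rules, the bot name "example-bot", and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would instead call
# set_url(...) followed by read() to download the site's actual file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "example-bot" is an assumed user agent name for this sketch.
allowed_public = parser.can_fetch("example-bot", "http://example.com/index.html")
allowed_private = parser.can_fetch("example-bot", "http://example.com/private/data.html")

print(allowed_public, allowed_private)
```

Checking can_fetch() before each request is one simple way to keep a scraper within the rules a site publishes.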

Docs

https://selenium-python.readthedocs.org/
http://selenium-python.readthedocs.org/en/latest/api.html

User Agent...

{
  "headers": {
    "Connection": "close",
    "Host": "httpbin.org",
    "Accept-Encoding": "gzip",
    "Accept-Language": "ru-RU",
    "User-Agent": "Mozilla/5.0 (Unknown; Linux i686) AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.10.0 (development) Safari/534.34",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  }
  ...

http://stackoverflow.com/questions/17858663/custom-headers-in-phantomjs-selenium-webdriver
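
The header dump on this slide shows that PhantomJS announces itself in the User-Agent string by default. As a rough sketch, the snippet below pulls that header out of an httpbin-style JSON body and checks for the PhantomJS marker. The sample body is a shortened copy of the slide's output, and the helper name reported_user_agent is hypothetical.

```python
import json

# Shortened copy of the httpbin-style response shown on the slide.
SAMPLE_BODY = """
{
  "headers": {
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Unknown; Linux i686) AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.10.0 (development) Safari/534.34"
  }
}
"""


def reported_user_agent(body):
    """Return the User-Agent value from an httpbin-style JSON body ('' if absent)."""
    return json.loads(body)["headers"].get("User-Agent", "")


user_agent = reported_user_agent(SAMPLE_BODY)
# A site could use a check like this to spot a default PhantomJS client.
looks_like_phantomjs = "PhantomJS" in user_agent

print(looks_like_phantomjs)
```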

User Agent

http://stackoverflow.com/questions/28532347/selenium-with-phantomjs-yahoo-login-form-not-submitting-python-bindings?rq=1

from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

DesiredCapabilities.PHANTOMJS['phantomjs.page.settings.userAgent'] = \
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:16.0) Gecko/20121026 Firefox/16.0'
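
The assignment above changes the shared DesiredCapabilities.PHANTOMJS dictionary, so every driver created afterwards picks up the new user agent. A possible variation, sketched here under stated assumptions, is to copy the capabilities first and override only the copy. The helper name phantomjs_caps_with_user_agent is hypothetical, and BASE_CAPS is a stand-in for DesiredCapabilities.PHANTOMJS so the sketch runs without Selenium installed.

```python
def phantomjs_caps_with_user_agent(base_caps, user_agent):
    """Return a copy of base_caps with the PhantomJS user agent setting applied."""
    caps = dict(base_caps)  # shallow copy, so the shared defaults stay untouched
    caps["phantomjs.page.settings.userAgent"] = user_agent
    return caps


# Stand-in for DesiredCapabilities.PHANTOMJS (an assumption for this sketch).
BASE_CAPS = {"browserName": "phantomjs"}

FIREFOX_UA = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:16.0) "
              "Gecko/20121026 Firefox/16.0")

caps = phantomjs_caps_with_user_agent(BASE_CAPS, FIREFOX_UA)

# With an older Selenium release that still ships the PhantomJS driver,
# the copy would be passed along the lines of:
#   webdriver.PhantomJS(desired_capabilities=caps)
print(caps["phantomjs.page.settings.userAgent"])
```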