Web scraping and social media scraping – handling JS

Jacek Lewkowicz, Dorota Celińska-Kopczyńska

University of Warsaw

April 9, 2019



What will we be working on today?

Most modern websites use JavaScript (JS)

With JS, the content of a website is generated dynamically

This may make scraping the content impossible or significantly more difficult:

1 A part of the website may not be rendered correctly

2 Access to some areas may be granted only after clicking a button


Convention

In snippets, we will highlight in violet the areas where you may put your own content

In commands, the areas in [] are optional

UNIX-like systems use “/” as the path separator and DOS uses “\”. In this presentation the paths will be written in the UNIX-like convention unless stated otherwise


JavaScript

A high-level, dynamically typed, interpreted language

One of the three core languages of web development, alongside HTML and CSS (the most popular language on GitHub!)

Used to make webpages interactive and to provide online programs, including video games


Problem – getting blog content

Let us assume that we want to collect the titles of a blog's articles

Looks easy! They are stored in a table. We already wrote similar scrapers two weeks ago


Very basic spider

import scrapy


class exItem(scrapy.Item):
    title = scrapy.Field()


class exSpider(scrapy.Spider):
    name = 'ex'
    start_urls = ['http://your-site-here.com']

    def parse(self, response):
        # Yield one item per extracted title (at most the first 13);
        # slicing avoids an IndexError when fewer titles are found
        for title in response.xpath('//your/xpath/here/text()').extract()[:13]:
            item = exItem()
            item['title'] = title
            yield item


Output of scraper

We do not extract anything... at all

Let us debug the code, e.g., with scrapy shell!

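A blank page... but it worked in the browser! Scrapy downloads the raw HTML and does not execute JavaScript, so content that the browser generates dynamically never appears in the response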
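To illustrate why the response looks empty, here is a minimal sketch using only the Python standard library. The page below is a made-up example (not the blog from the slides): it ships an empty table and a script that would fill it in a browser. A parser that does not run JavaScript sees no table cells at all. The name `CellCounter` is hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page: the table body is filled by JavaScript in the browser.
RAW_HTML = """
<html><body>
<table id="posts"><tbody></tbody></table>
<script>
  document.querySelector('#posts tbody').innerHTML =
      '<tr><td>First post</td></tr><tr><td>Second post</td></tr>';
</script>
</body></html>
"""


class CellCounter(HTMLParser):
    """Counts <td> elements that are present in the markup itself."""

    def __init__(self):
        super().__init__()
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        # Text inside <script> is treated as raw data, not as tags,
        # so the cells written by the script are never counted.
        if tag == "td":
            self.cells += 1


counter = CellCounter()
counter.feed(RAW_HTML)
print(counter.cells)
```

The count stays at zero because the cells exist only as a string inside the script, which is the same situation the spider runs into.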


Solution #1 – naive

Open the website in a browser

Save the source code of the website after the page is loaded

Work on a local copy of the source code

Pros: easy, and sometimes a good workaround

Cons: tedious, slow, and limited in what it can do
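One possible way to work on the local copy, sketched under assumptions: Scrapy can download `file://` URLs, so the saved page can be placed in `start_urls`. The file name `saved_page.html` and the helper `local_start_url` are placeholders invented for this example.

```python
from pathlib import Path


def local_start_url(saved_page):
    """Turn a path to a saved HTML file into a file:// URL for start_urls."""
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(saved_page).resolve().as_uri()


# 'saved_page.html' is a placeholder for the file saved from the browser.
start_urls = [local_start_url("saved_page.html")]
print(start_urls[0])
```

The resulting list could replace `start_urls` in the basic spider, leaving the `parse` method unchanged.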


Solution #2 – PhantomJS

A headless browser

It has no graphical interface, which is where the name comes from (the user looks like a ghost)

http://phantomjs.org/

Pros: usually a good workaround

Cons: limited possibilities, sometimes does not render correctly, development suspended


Solution #3 – Selenium

A web driver – you may work with various browsers via your code

http://www.seleniumhq.org/

Pros: a mature project, does not require human activity

Cons: nearly none, but for some reasons not covered during the course (:


Scrapy + Selenium

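# example adapted from https://stackoverflow.com/questions/17975471/selenium-with-scrapy-for-dynamic-page

import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                # Selenium 4 uses find_element(By.XPATH, ...) instead of find_element_by_xpath
                next_link = self.driver.find_element(By.XPATH, '//td[@class="pagn-next"]/a')
                next_link.click()
                # get the data and write it to scrapy items
            except NoSuchElementException:
                # no "next" link left, so the last page has been reached
                break

        self.driver.close()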


Solution #4 – Splash

A "monster child" of the Scrapy guys: a JavaScript rendering service with an HTTP API

https://www.reddit.com/r/Python/comments/2xp5mr/handling_javascript_in_scrapy_with_splash/cp2vgd6/

Pros: relatively easy to use, does not require human activity, great Scrapy integration

Cons: probably a lot


Splash – Installation

pip install scrapy-splash

Typically one works with an instance of Splash running in a Docker container

docker run -p 8050:8050 scrapinghub/splash
(this is usually enough)
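Before running a spider, it may help to check that the Splash container is reachable. The sketch below assumes Splash is published on `http://localhost:8050` as in the command above and uses the `_ping` endpoint of the Splash HTTP API. The function name `splash_is_up` is made up for this example.

```python
import urllib.request


def splash_is_up(base_url="http://localhost:8050", timeout=2.0):
    """Return True if a Splash instance answers its _ping endpoint."""
    try:
        with urllib.request.urlopen(base_url + "/_ping", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, or HTTP error: treat as "not running".
        return False
```

Calling `splash_is_up()` with no arguments checks the default address; a `False` result usually means the container is not running or the port is not published.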


Splash – Configuration in settings.py

1 Add the Splash server address (http://localhost:8050 if Splash runs in Docker on the local machine):
SPLASH_URL = 'http://192.168.59.103:8050'

2 Enable the Splash downloader middlewares:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

3 Enable the spider middleware:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

4 Set a custom dupefilter class:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

5 For more options see https://github.com/scrapy-plugins/scrapy-splash


Adding Splash to Scrapy code

import scrapy


class exItem(scrapy.Item):
    title = scrapy.Field()


class exSpider(scrapy.Spider):
    name = 'ex'
    start_urls = ['http://your-site-here.com']

    # A convenient way is to pass the Splash options in the meta of each request in start_requests;
    # this setup can be reused in any project (it always looks the same)
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={'splash': {'endpoint': 'render.html',
                                                                  'args': {'wait': 0.5}}})

    def parse(self, response):
        # Yield one item per extracted title (at most the first 13)
        for title in response.xpath('//your-xpath-here/text()').extract()[:13]:
            item = exItem()
            item['title'] = title
            yield item
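Since the Splash part of the request always looks the same, it could be factored into a small helper. This is only a sketch: `splash_meta` is a hypothetical name, not part of Scrapy or scrapy-splash, and the defaults mirror the values used in the spider above.

```python
def splash_meta(wait=0.5, endpoint="render.html", **extra_args):
    """Build the request meta that routes a Scrapy request through Splash."""
    args = {"wait": wait}
    # Any further keyword arguments are passed on to Splash unchanged.
    args.update(extra_args)
    return {"splash": {"endpoint": endpoint, "args": args}}


# Inside start_requests one could then write (assuming scrapy is imported):
#     yield scrapy.Request(url, self.parse, meta=splash_meta(wait=1.0))
meta = splash_meta()
print(meta)
```

The scrapy-splash package also ships a `SplashRequest` class that wraps the same idea; the helper above only keeps the plain `scrapy.Request` form used on the slide.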


Output with Scrapy-Splash


Output of Scrapy shell

scrapy shell 'http://localhost:8050/render.html?url=http://your-site-here.com&timeout=1&wait=0.5'
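The URL passed to scrapy shell can also be assembled in code, which takes care of the separators and of escaping the target address. A minimal sketch with the standard library follows; `splash_render_url` is a hypothetical helper, and the defaults assume a local Splash on port 8050.

```python
from urllib.parse import urlencode


def splash_render_url(target, splash="http://localhost:8050", wait=0.5, timeout=1):
    """Build a render.html URL that asks Splash to render the target page."""
    # urlencode joins the parameters with '&' and percent-encodes the target URL.
    query = urlencode({"url": target, "timeout": timeout, "wait": wait})
    return f"{splash}/render.html?{query}"


url = splash_render_url("http://your-site-here.com")
print(url)
```

When pasting the result into a shell command, keep it inside single quotes so that the shell does not interpret the `&` characters.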


Additional links and tutorials

https://dzone.com/articles/perform-actions-using-javascript-in-python-seleniu

https://simpletutorials.com/c/2205/Basic%20Web%20Scraper%20using%20Python%2C%20Selenium%2C%20and%20PhantomJS

https://www.guru99.com/execute-javascript-selenium-webdriver.html

https://www.datacamp.com/community/tutorials/scraping-javascript-generated-data-with-r

https://gist.github.com/hrbrmstr/dc62bb2b35617e9badc5

https://www.rladiesnyc.org/post/scraping-javascript-websites-in-r/