Web scraping and social media scraping { handling JScoin.wne.uw.edu.pl/dcelinska/resources/webscraping/...Web scraping and social media scraping {handling JS Jacek Lewkowicz, Dorota


Web scraping and social media scraping – handling JS

Jacek Lewkowicz, Dorota Celinska-Kopczynska

University of Warsaw

April 9, 2019



What will we be working on today?

Most modern websites use JavaScript (JS)

With JS, the content of a website is generated dynamically

This may make scraping the content impossible or significantly more difficult:

1 A part of the website may not be rendered correctly

2 Access to some areas may be granted only after clicking a button



Convention

In snippets, we will highlight in violet the areas where you may put your own content

In commands, the areas in [] are optional

UNIX-like systems use “/” as the path separator and DOS uses “\”. In this presentation the paths will be written in UNIX-like convention if not stated otherwise



JavaScript

High-level, dynamic, untyped, interpreted run-time language

One of the three core languages of web development, alongside HTML and CSS (the most popular language on GitHub!)

Used to make webpages interactive and to provide online programs, including video games



Problem – getting blog content

Let us assume that we want to collect the titles of a blog’s articles

Looks easy! They are stored in a table. We already wrote similar scrapers two weeks ago



Very basic spider

import scrapy
from scrapy import Request

# container for the scraped fields
class exItem(scrapy.Item):
    title = scrapy.Field()

class exSpider(scrapy.Spider):
    name = 'ex'
    start_urls = ['http://your-site-here.com']

    def parse(self, response):
        # take the first 13 matches of the XPath and yield one item per match
        for i in range(0, 13):
            item = exItem()
            item['title'] = response.xpath('//your/xpath/here/text()').extract()[i]
            yield item



Output of scraper

We do not extract anything... at all

Let us debug the code, e.g., with scrapy shell!

The response is a blank page... but... it worked in the browser...
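A minimal sketch of why the response looks blank, using only the standard library. The page source, the `posts` table id and the `title` class are hypothetical; the point is that the titles are inserted by a script, so they are absent from the HTML that a client without a JS engine receives.

```python
from html.parser import HTMLParser

# Hypothetical page source as a client without JS receives it: the table body
# is empty and is only filled in once the script runs inside a browser.
RAW_HTML = """
<html><body>
  <table id="posts"><tbody></tbody></table>
  <script>
    var row = document.querySelector("#posts tbody").insertRow();
    var cell = row.insertCell();
    cell.className = "title";
    cell.textContent = "First post";
  </script>
</body></html>
"""


class TitleCells(HTMLParser):
    """Collect the text of every <td class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_title = False

    def handle_data(self, data):
        # Script text also arrives here, but never while in_title is set
        if self.in_title and data.strip():
            self.titles.append(data.strip())


parser = TitleCells()
parser.feed(RAW_HTML)
print(parser.titles)  # [] - the script was never executed, so no title cells exist
```

Scrapy sees the same thing as this parser: the script is just text, and nothing executes it.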



Solution #1 – naive

Open the website in a browser

Save the source code of the website after the page is loaded

Work on a local copy of the source code

Pros: easy and sometimes may be a good workaround

Cons: tedious with limited possibilities, also slow
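One possible way to work on the local copy is to point the spider at it with a `file://` URL, which Scrapy can fetch. This is a sketch only: the file name `blog-rendered.html` and its content are hypothetical stand-ins for a page saved from the browser.

```python
import tempfile
from pathlib import Path

# Hypothetical local copy, standing in for a page saved from the browser
# after it finished loading (so the JS-generated table is already present).
saved_copy = Path(tempfile.gettempdir()) / "blog-rendered.html"
saved_copy.write_text(
    "<table><tr><td>First post</td></tr></table>", encoding="utf-8"
)

# as_uri() needs an absolute path, hence resolve()
local_url = saved_copy.resolve().as_uri()

# This list can be used as start_urls in a spider
start_urls = [local_url]
print(start_urls)
```

The rest of the spider stays unchanged. Only `start_urls` differs, which is also why this approach does not scale beyond a handful of pages.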



Solution #2 – PhantomJS

A headless browser

It has no graphical interface, which is where the name comes from (the user looks like a ghost)

http://phantomjs.org/

Pros: usually a good workaround

Cons: limited possibilities, sometimes does not render correctly, suspended development




Solution #3 – Selenium

A web driver – you may work with various browsers via your code

http://www.seleniumhq.org/

Pros: a mature project, does not require human activity

Cons: nearly none, but for some reasons it is not covered during the course (:




Scrapy + Selenium

# example from https://stackoverflow.com/questions/17975471/selenium-with-scrapy-for-dynamic-page
import scrapy
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            try:
                # look up the "next page" link; this raises when there is none
                next_link = self.driver.find_element(By.XPATH, '//td[@class="pagn-next"]/a')
                next_link.click()
                # get the data and write it to scrapy items
            except WebDriverException:
                # no next link (or the click failed): stop paginating
                break
        self.driver.close()
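The "get the data" placeholder could be filled in many ways. Below is one hedged sketch that collects the rendered HTML of every page reached by clicking the "next" link. The helper name `collect_pages` and its `max_pages` limit are hypothetical. The only things assumed about the driver are `page_source` and `find_elements(by, value)`, both of which Selenium drivers provide, and the string `"xpath"` is the value of Selenium's `By.XPATH`.

```python
def collect_pages(driver, next_xpath, max_pages=50):
    """Return the rendered HTML of each page reached by clicking 'next'.

    `driver` is expected to behave like a Selenium WebDriver.
    """
    pages = [driver.page_source]
    while len(pages) < max_pages:
        # find_elements returns an empty list (it does not raise) when
        # nothing matches, so the last page ends the loop cleanly
        links = driver.find_elements("xpath", next_xpath)
        if not links:
            break
        links[0].click()
        pages.append(driver.page_source)
    return pages
```

Inside `parse`, each returned string could be wrapped in `scrapy.Selector(text=html)` and queried with the usual XPath expressions. On a real site an explicit wait (e.g., Selenium's `WebDriverWait`) may be needed after each click, because `page_source` can be read before the next page has finished loading.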



Solution #4 – Splash

A monster child of the Scrapy guys...

https://www.reddit.com/r/Python/comments/2xp5mr/handling_javascript_in_scrapy_with_splash/cp2vgd6/

Pros: relatively easy to use, does not require human activity, great Scrapy integration

Cons: probably a lot




Splash – Installation

pip install scrapy-splash

Typically one works with an instance of Splash running in a Docker container

docker run -p 8050:8050 scrapinghub/splash is usually enough



Splash – Configuration in settings.py

1 Add the Splash server address (replace it with the address of your own Splash instance, e.g., http://localhost:8050 when Docker publishes the port locally):

    SPLASH_URL = 'http://192.168.59.103:8050'

2 Enable the Splash middleware:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }

3 Enable the spider middleware:

    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }

4 Set a custom dupefilter class:

    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

5 For more options see https://github.com/scrapy-plugins/scrapy-splash



Adding Splash to Scrapy code

import scrapy
from scrapy import Request

class exItem(scrapy.Item):
    title = scrapy.Field()

class exSpider(scrapy.Spider):
    name = 'ex'
    start_urls = ['http://your-site-here.com']

    # A convenient way is to pass the Splash settings in the request metadata
    # in start_requests; this setup can be used in any project (it always looks the same)
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                self.parse,
                meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 0.5}}},
            )

    def parse(self, response):
        # take the first 13 matches of the XPath and yield one item per match
        for i in range(0, 13):
            item = exItem()
            item['title'] = response.xpath('//your-xpath-here/text()').extract()[i]
            yield item
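Since the Splash metadata always looks the same, it could be factored into a small helper. This is a sketch only: the function name `splash_meta` and its keyword arguments are hypothetical, and the dict layout is the one used in `start_requests` above.

```python
def splash_meta(wait=0.5, endpoint="render.html", **extra_args):
    """Build the request `meta` dict that asks scrapy-splash to render a page."""
    args = {"wait": wait}
    # any further Splash arguments are passed through unchanged
    args.update(extra_args)
    return {"splash": {"endpoint": endpoint, "args": args}}


print(splash_meta())
# {'splash': {'endpoint': 'render.html', 'args': {'wait': 0.5}}}
```

In a spider it would be used as `yield scrapy.Request(url, self.parse, meta=splash_meta(wait=1))`. The scrapy-splash README also describes a `SplashRequest` class that wraps the same information.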



Output with Scrapy-Splash



Output of Scrapy shell

scrapy shell 'http://localhost:8050/render.html?url=http://your-site-here.com&timeout=1&wait=0.5'
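The render.html address can also be assembled programmatically, which takes care of separating the parameters with `&` and of escaping the target URL. A minimal sketch with the standard library follows; the helper name `render_url` and the default Splash address are assumptions.

```python
from urllib.parse import urlencode


def render_url(target, splash="http://localhost:8050", **args):
    """Return the Splash render.html URL for `target` with the given arguments."""
    # urlencode escapes the target URL and joins the parameters with '&'
    query = urlencode({"url": target, **args})
    return f"{splash}/render.html?{query}"


print(render_url("http://your-site-here.com", timeout=1, wait=0.5))
# http://localhost:8050/render.html?url=http%3A%2F%2Fyour-site-here.com&timeout=1&wait=0.5
```

The printed address can be passed to `scrapy shell` in quotes, or opened in a browser to check what Splash renders.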



Additional links and tutorials

https://dzone.com/articles/perform-actions-using-javascript-in-python-seleniu

https://simpletutorials.com/c/2205/Basic%20Web%20Scraper%20using%20Python%2C%20Selenium%2C%20and%20PhantomJS

https://www.guru99.com/execute-javascript-selenium-webdriver.html

https://www.datacamp.com/community/tutorials/scraping-javascript-generated-data-with-r

https://gist.github.com/hrbrmstr/dc62bb2b35617e9badc5

https://www.rladiesnyc.org/post/scraping-javascript-websites-in-r/

