What’s in the News? Web Scraping Technology as a Cost-Effective Solution for News Alerting
David A. Breiner and Raul Rodriguez-Esteban
Boehringer Ingelheim Pharmaceuticals, Inc., Ridgefield, CT USA
SLA PHT 2013

Mar 29, 2015




Question #1


Who here is involved with News Alerting activities in their jobs?

Topics

• About Us
• Project Background & Critical Issues
• Organizational Drivers
• Technical & Process Overviews
• Demonstration (pre-recorded)
• Continuing Challenges
• Lessons Learned
• User Feedback
• Next Steps
• Q&A / Discussion


About Us

Scientific Knowledge Discovery (SKD) is made up of Computational Biology professionals and Knowledge Management experts who support BI* Pharmaceutical Research and Corporate areas in the US by supplying relevant information and analysis.

We focus our work on:

• Delivering data and information in a short timeframe

• Streamlining information gathering and processing through computational methods including Text Mining

• Turning information into knowledge that drives impact


* BI will refer to “Boehringer Ingelheim” throughout this presentation

Project Background

BI US Library has been involved with news alerting for ~20 years:

• 1990s: Library staff generated daily & weekly electronic newsletters on various therapeutic area & business topics

• Early 2000s: Executives & Competitive Intelligence (CI) requested a more systematic, early morning alerting of significant news; “Code Red Alert” developed & managed by 1 Info Scientist (~1-1.5 hours per day manual curation time)

• Late 2000s: Service evolved to include many sources (fee & free) but not as time-critical; executives alerted by other routes; CI no longer part of workflow; distribution list broadened to include Public Affairs & Communications group; work distributed among 3 Library staff for various weekdays after lead Info Scientist retired (~1-1.5 hours per day manual curation time); renamed to “Daily News Brief” in 2010

A very valuable service…but extremely time-consuming


Vendor Products: Critical Issues

Ongoing search for a tool to assist in newsletter generation for many years; various vendor products* tested & used, but none met all requirements for success:

• Duplication: Similar stories from various sources
• Timeliness: Sometimes 24-hour delay experienced
• Cost: Some aggregators required fees for each recipient in addition to base annual subscription

Other Issues:
• Some subscription sources had limited user access
• Some products lacked focus on particular areas of interest to BI
• Implementation always more challenging than anticipated
• Technical issues usually required much interaction with vendors

No significant time savings realized (~1+ hour per day curation)

* No names will be disclosed


2010: First In-House Tool developed

Strengths: Fast & free to build; simple to maintain (HTML page with links); customizable; comprehensive coverage

Weaknesses: No newsletter-generating tool; much manual scanning of many websites; required much manual curation (i.e. copying/pasting/formatting into email template); duplication among sources

[Figure: source categories shown on the slide: Global News (Fee & Free), Press Releases, Blogs, Local News, BI News]

After 2 years, still no significant time savings (~1+ hour per day curation)


2011: Organizational Drivers

• Major departmental reorganization in Q4 2011

• Limited staff to support news monitoring; needed to significantly reduce time spent on Daily News Brief

• Unsuccessful paid trial with vendor product

• New management prefers automated computational methods over manual processes

• Clients desire human filtering due to their lack of time


“A perfect storm”


Q1 2012: Daily News Brief re-launched

• Provide a daily morning “snapshot” of BI and pharmaceutical industry headlines with a US focus

• Minimize curation time to under 30 minutes per day

• Leverage internal expertise in Web Scraping

• Utilize cost-effective news sources whenever possible:

  • BI Press Releases (US & DE)
  • Google News
  • Yahoo News
  • Elsevier Business Intelligence *global subscription
  • FirstWord
  • FiercePharma / FierceBiotech
  • Reuters
  • Bloomberg
  • Medical Marketing & Media


Question #2


Who here knows what Web Scraping is?


Typical Content

• BI press releases & major news on all BI marketed Pharma products

• BI & subsidiaries (Vetmedica, Roxane Labs, Ben Venue Labs, Bedford Labs) in major & local news sources

• Competitor products: Phase 3 trial announcements, major trial published studies, approvals, launches

• Major Competitor, FDA, & Conference announcements

• Pharma & Healthcare industry trends

GOAL: Select & distribute ~12 relevant news items each business day before 8:00 am ET


Technical Overview

[Flow diagram; labels: RSS Feeds, News Websites, Newsletters; Web crawling agent (cURL); Parse news items & components; Filter (Relevancy & Minimum date); Standardized display for selection; Manual selection / curation; Output presentation (HTML)]

• Real Time
  • Sources gathered “on the fly”

• Multiple input formats
  • Manages RSS feeds, news websites, online newsletters

• Handles password-protected sites
  • Automatic login

• Uses “lightweight” code
  • Adaptable script language (Perl)

• Copyright compliant
  • Only scraping/extracting content that is free or globally licensed by BI
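
The parse and filter stages named on this slide can be illustrated with a minimal sketch. It is written in Python rather than the Perl and cURL the slide mentions, and it is not the authors' code: the `NewsItem` record, the function names, the sample feed, and the keyword list are all hypothetical. The feed is an inline string so the example runs without network access; fetching and automatic login are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET


@dataclass
class NewsItem:  # hypothetical record for one parsed news item
    title: str
    link: str
    published: datetime
    source: str


def parse_rss(xml_text: str, source: str) -> list[NewsItem]:
    """Parse the <item> elements of an RSS 2.0 document into NewsItem records."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter("item"):
        title = (node.findtext("title") or "").strip()
        link = (node.findtext("link") or "").strip()
        pub = node.findtext("pubDate")
        if not title or not pub:
            continue  # skip entries missing a title or a date
        try:
            published = parsedate_to_datetime(pub)
        except (TypeError, ValueError):
            continue  # skip entries whose date cannot be parsed
        items.append(NewsItem(title, link, published, source))
    return items


def filter_items(items, keywords, min_date):
    """Keep items published on or after min_date whose title mentions a keyword."""
    lowered = [k.lower() for k in keywords]
    return [
        it for it in items
        if it.published >= min_date and any(k in it.title.lower() for k in lowered)
    ]


# Hypothetical sample feed standing in for a fetched RSS document.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>FDA approves new COPD therapy</title><link>http://example.com/a</link>
<pubDate>Mon, 04 Mar 2013 09:00:00 +0000</pubDate></item>
<item><title>Local sports roundup</title><link>http://example.com/b</link>
<pubDate>Mon, 04 Mar 2013 10:00:00 +0000</pubDate></item>
<item><title>FDA panel meets on older drug</title><link>http://example.com/c</link>
<pubDate>Fri, 01 Feb 2013 08:00:00 +0000</pubDate></item>
</channel></rss>"""

items = parse_rss(SAMPLE_FEED, source="Example Feed")
recent = filter_items(items, ["FDA", "pharma"],
                      datetime(2013, 3, 1, tzinfo=timezone.utc))
print([it.title for it in recent])  # ['FDA approves new COPD therapy']
```

The second sample item is dropped for relevancy and the third for the minimum date, which mirrors the two filter criteria on the slide. Title-only keyword matching is an assumption; the deck does not say how relevancy was decided.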


Technical Overview: Perl Scripting



Process Overview: Curation

1) Login to DNB interface on internal BI server; Enter # of days to review

2) Select categories for news items to include using drop-down menus

3) Select SUBMIT to publish all selected news items to HTML output file
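
Step 3, publishing the selected items to an HTML output file, might look roughly like the sketch below. This is an illustration in Python, not the DNB tool itself: the function name, the category labels, and the markup layout are assumptions, since the slides do not show the output format.

```python
from html import escape


def render_newsletter(selected):
    """Render (category, title, url) tuples as HTML, grouped by category
    in order of first appearance."""
    grouped = {}
    for category, title, url in selected:
        grouped.setdefault(category, []).append((title, url))

    parts = []
    for category, entries in grouped.items():
        parts.append(f"<h3>{escape(category)}</h3>")
        parts.append("<ul>")
        for title, url in entries:
            # escape() guards against stray &, <, > and quotes in scraped text
            parts.append(f'  <li><a href="{escape(url)}">{escape(title)}</a></li>')
        parts.append("</ul>")
    return "\n".join(parts)


# Hypothetical selections, as if chosen from the drop-down menus.
selected = [
    ("BI News", "BI announces Phase 3 results", "http://example.com/1"),
    ("Industry Trends", "Mergers & acquisitions outlook", "http://example.com/2"),
    ("BI News", "New BI site opens", "http://example.com/3"),
]
html_out = render_newsletter(selected)
print(html_out)
```

Grouping by category keeps the pasted email short and scannable, which fits the stated goal of a brief daily snapshot.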

Process Overview: Publishing & Distribution

4) Copy HTML output

5) Paste into email, edit, & distribute

~15 minutes from start to finish!


Demonstration

BI DAILY NEWS BRIEF (2 minutes)


Continuing Challenges

• Duplication among sources, especially between Google News & Yahoo News

• Some sources don’t always scrape properly, requiring minor edits before distribution

• Technical changes on source websites can affect results

• BI still running IE7; migrating to IE9 in 2013

• Keeping it simple for us & our clients, i.e. “Daily News BRIEF” not “Daily News OVERLOAD”


Stay tuned…
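
One possible way to reduce the duplication challenge listed above is to compare normalized headlines across sources. The sketch below is a simple heuristic in Python, not something the presenters describe using; the function names and the rule that strips a trailing " - Publisher" suffix are assumptions, and stories reworded between sources would not be caught.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase a headline, drop a trailing ' - Publisher' or ' | Publisher'
    suffix (a hypothetical rule), and strip punctuation."""
    title = re.sub(r"\s+[-|]\s+[^-|]+$", "", title)
    title = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(title.split())


def dedupe(items):
    """Keep the first (title, source) pair seen for each normalized title."""
    seen = set()
    unique = []
    for title, source in items:
        key = normalize_title(title)
        if key not in seen:
            seen.add(key)
            unique.append((title, source))
    return unique


# Hypothetical headlines illustrating the same story from two aggregators.
items = [
    ("FDA Approves New Drug - Reuters", "Google News"),
    ("FDA approves new drug", "Yahoo News"),
    ("Pharma M&A picks up", "Google News"),
]
unique = dedupe(items)
print(unique)
```

Because the first occurrence wins, the order in which sources are processed decides which copy of a story the curator sees.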


Lessons Learned

• Have a focused objective (i.e. “snapshot” instead of “all news for everyone”)

• Look within your organization first for expertise before looking externally

• Change is inevitable; accept it as opportunity

• Regularly seek out user feedback (see next slide)


To eat an elephant, you must take one bite at a time


User Feedback

• “The Daily News Brief has become my primary source of competitive & marketplace information. Outstanding!” from an Executive Director in Marketing

• “I read the DNB every morning. I prefer the current format to the previous one; it’s succinct & provides a good overview of top industry stories that I can view on my Blackberry.” from a Director in Public Affairs & Communications

• “I really like the new simplified look of the Daily News Brief, especially the clean lines and simplicity! Nice work!” from an Associate Director in Public Affairs & Communications

• “I really enjoy reading the Daily News Brief. It helps me to prepare for my day.” from an Associate Director in Business Intelligence


Next Steps

• Currently underway:
  • Use underlying code to develop news interfaces for monitoring other domains of interest (e.g. Therapeutic Areas, BI Products)

• Expand distribution list to include more senior-level management in US (currently ~125 recipients)

• Develop RSS feed for internal portals (recently completed)

• Attempt to remove duplication among sources wherever possible

• Explore options for delivery to mobile platforms



Acknowledgements

• Dr. Raul Rodriguez-Esteban

• Dr. Will Loging

• Amy Shortlidge-Cox

• Yirong Wang



Thank You

David A. Breiner, MS
LinkedIn: http://www.linkedin.com/in/davidbreiner

Raul Rodriguez-Esteban, PhD
LinkedIn: http://www.linkedin.com/pub/raul-rodriguez-esteban/0/36b/3bb
Now at Roche in Basel

Boehringer Ingelheim Pharmaceuticals, Inc.
Scientific Knowledge Discovery
Ridgefield, Connecticut USA


Q&A / Discussion


What are your companies doing for News Alerting? Please share!
