Big Data: Wall Street Style
Permission to reprint or distribute any content from this presentation requires the prior written approval of S&P Capital IQ. Not for distribution to the public.
Jeff Sternberg and Jen Zeralli, S&P Capital IQ
February 29, 2012
Boring Financial Chart
Boring Financial Chart: less boring with labels
As of 2/24/2012.
Boring Financial Chart = kind of interesting, actually
More than $2.35 trillion invested in Information Technology over the last 10 years.
Source: S&P Capital IQ Transaction Screening As of 2/24/2012.
How Does That Compare?
Total Investment over the last 10 years:
• Industrials = $3.49 trillion
• Energy = $2.61 trillion
• Healthcare = $2.47 trillion
• Information Technology = $2.35 trillion
• Telecom = $2.13 trillion
Source: S&P Capital IQ Transaction Screening. As of 2/24/2012.
So Is Big Data…
Big Money?
Big Money?
Total Investment over the last three years:
• Information Technology = $774.4 billion
• Big Data = $32.4 billion
So, 4.2%
Hey, at least we’re not just “the 1%”
Source: S&P Capital IQ Transaction Screening. As of 2/24/2012.
But What We Really Wanted To Talk About…
Strata: Making Data Work
February 29, 2012
But What We Really Wanted To Talk About…
• S&P Capital IQ: Data Is Our Product
• About Data Collection
• Standardization
• Linking: The Curious, Special Case of Entities
• Suggesting Data
• Projections
S&P Capital IQ: Data Is Our Product
Data Is Our Product
• Capital IQ started as an investment bank in 1999*
• Data = competitive advantage over other banks
• Built a database of financial investments,
relationships and transactions
*Acquired by Standard and Poor’s in 2004, now part of S&P Capital IQ.
Hey, Let’s Sell That!
For illustrative purposes only. Source: S&P Capital IQ as of 2/2012.
Data Is Our Product: What We Offer
Datasets
• Financials and Valuation
• Qualitative Data
• Global Market Data
• Sell-Side Research
• Earnings Estimates
• News and Events
• Fixed Income
• Alpha and Risk Models
Use Cases
• Research Companies
• Generate Ideas
• Build Models
• Monitor Markets
• Analyze Performance
• Quantitative Research
Tools
• Web Portal
• Real-Time Workstation
• ClariFi
• Mobile
• Data Feeds
• Web Services
• Office Plug-Ins
Data Is Our Product: Who We Help
• Investment Bankers
• Asset Managers
• Private Equity Firms
• Venture Capital Firms
• Credit/Equity Analysts
• Corporations
• Consultants and Advisors
• Academia & Government
Data Is Our Product: Some Stats
Company and Person Profiles
• Companies with full quantitative data: 100,000
• Private company profiles: 2.7 million
• Professionals and board members: 4.2 million
• Quantitative data points per company: 5,000
• Qualitative data points per company: 1,500
Transactions
• M&A Transactions: 425,000
• Private Placements: 190,000
• Public Offerings: 138,000
News and Key Developments
• Daily News articles across 184 countries: 16,000
• Key Developments (curated news): 9.7 million
As of 2/2/2012.
Data Is Our Product
DEMO
About Data Collection
About Data Collection
To Have A Data Product, One Must First Acquire Data.
About Data Collection
Data Collection Goals
• Coverage
• Quality
• Timeliness
• Auditability
About Data Collection
• It starts with documents – 67,000 per day
• Sources
– Company filings (SEC)
– News feeds (press releases)
– Web crawling
• We store these in our document repository
About Data Collection
• Document repository
– SQL for metadata
– “Regular” file storage for docs
– Solr/Lucene indexing for fast search
– 99.3 million documents
– 240.3 million versions (files)
As of 2/24/2012.
About Data Collection
Document Repository = SQL db + Filesystem
SQL db:
• Document_tbl
– documentID int PK
– sourceID smallint FK
• Version_tbl
– versionID int PK
– documentID int FK
– rootID smallint FK
– versionIndex smallint
– filePath varchar(100)
• Element_tbl
– elementID int PK
– [doc/vers/rel]ID int FK
– typeID int FK
– value [strongly typed]
• ObjectRel_tbl
– relID int PK
– documentID int FK
– objectID int FK
Filesystem:
• html, pdf, text, sgml, …
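The table layout on this slide can be sketched as runnable DDL. This is a minimal illustration using Python's built-in sqlite3 module, not the production schema: the column names and types come from the slide, while the `ownerID` column (standing in for `[doc/vers/rel]ID`), the TEXT type for `value`, and the helper functions are assumptions made for the example.

```python
import sqlite3

# Hypothetical SQLite rendering of the four tables named on the slide.
# Column names and types follow the slide; everything else is assumed.
SCHEMA = """
CREATE TABLE Document_tbl (
    documentID INTEGER PRIMARY KEY,
    sourceID   SMALLINT            -- FK target is not shown on the slide
);
CREATE TABLE Version_tbl (
    versionID    INTEGER PRIMARY KEY,
    documentID   INTEGER REFERENCES Document_tbl(documentID),
    rootID       SMALLINT,         -- FK target is not shown on the slide
    versionIndex SMALLINT,
    filePath     VARCHAR(100)      -- location of the html/pdf/text/sgml file
);
CREATE TABLE ObjectRel_tbl (
    relID      INTEGER PRIMARY KEY,
    documentID INTEGER REFERENCES Document_tbl(documentID),
    objectID   INTEGER             -- FK target is not shown on the slide
);
CREATE TABLE Element_tbl (
    elementID INTEGER PRIMARY KEY,
    ownerID   INTEGER,             -- assumed stand-in for [doc/vers/rel]ID
    typeID    INTEGER,
    value     TEXT                 -- slide says "strongly typed"; TEXT is a placeholder
);
"""


def build_repository(conn):
    """Create the metadata tables in an open SQLite connection."""
    conn.executescript(SCHEMA)


def add_version(conn, document_id, version_index, file_path):
    """Register one stored file as a version of a document."""
    cur = conn.execute(
        "INSERT INTO Version_tbl (documentID, versionIndex, filePath) "
        "VALUES (?, ?, ?)",
        (document_id, version_index, file_path),
    )
    return cur.lastrowid


def versions_of(conn, document_id):
    """Return (versionIndex, filePath) pairs for a document, oldest first."""
    rows = conn.execute(
        "SELECT versionIndex, filePath FROM Version_tbl "
        "WHERE documentID = ? ORDER BY versionIndex",
        (document_id,),
    )
    return rows.fetchall()
```

The split mirrors the slide: the database holds only metadata and a path, while the document bytes stay on the filesystem.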
About Data Collection
• Content search
– Which docs have relevant content?
– Search rules drive collection workflow
– 1000+ search rules per doc
– 65,000+ automated searches per day
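As a rough illustration of how search rules could drive the collection workflow, the sketch below matches a document against keyword rules and emits one work item per hit. The real searches run against the Solr/Lucene index mentioned earlier; the plain substring matching, the `SearchRule` class, and the process names here are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchRule:
    name: str        # hypothetical rule label
    process: str     # hypothetical workflow process that should receive the doc
    keywords: tuple  # every keyword must appear in the document text


def matching_rules(text, rules):
    """Return the rules whose keywords all occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [r for r in rules if all(k.lower() in lowered for k in r.keywords)]


def route(doc_id, text, rules):
    """Turn rule hits into (doc_id, process) work items for the workflow."""
    return [(doc_id, r.process) for r in matching_rules(text, rules)]
```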
About Data Collection
• Collection workflow
– Core engine that routes work items
– Organized into Processes, Stages, Statuses
– Prioritization based on usage (and others)
– Simple GetNext(), Commit() API
– 177.8 million Commits in 2011
– Avg. 130K+ Commits per day in Financials
As of 2/24/2012.
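A workflow engine with a GetNext()/Commit() style API can be sketched as a priority queue. This is an illustrative toy, not the Capital IQ engine: the class name, the stage and status labels, and the "lower number = more urgent" convention are assumptions.

```python
import heapq
import itertools


class CollectionWorkflow:
    """Toy work-item router: items wait in a priority queue per stage."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order
        self._in_progress = {}
        self.status = {}                   # item_id -> (stage, status)

    def add(self, item_id, stage, priority):
        # Assumed convention: a lower priority number is served first.
        heapq.heappush(self._heap, (priority, next(self._counter), item_id, stage))
        self.status[item_id] = (stage, "queued")

    def get_next(self):
        """Hand out the most urgent queued item, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, item_id, stage = heapq.heappop(self._heap)
        self._in_progress[item_id] = stage
        self.status[item_id] = (stage, "in progress")
        return item_id

    def commit(self, item_id, next_stage=None, priority=0):
        """Finish an item; optionally re-queue it for a following stage."""
        stage = self._in_progress.pop(item_id)  # KeyError if never handed out
        if next_stage is None:
            self.status[item_id] = (stage, "done")
        else:
            self.add(item_id, next_stage, priority)
```

Keeping the API to two calls means a collector, human or automated, never chooses its own work: prioritization lives entirely in the engine.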
About Data Collection
• Collection process
– Automated extraction
– Manual collection
– 1000s of quality checks (basic integrity, variance from prior period)
– All data stored “as reported” with Doc ID
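The two kinds of quality check named above might look like the following. This is a hedged sketch: the field names, the balance-sheet identity chosen as the integrity check, and the tolerance and threshold values are assumptions, not the actual rules.

```python
def integrity_errors(record):
    """Basic integrity: required fields present and the balance sheet balances."""
    errors = []
    # Hypothetical field names; doc_id mirrors "stored as reported with Doc ID".
    for field in ("doc_id", "total_assets", "total_liabilities", "total_equity"):
        if record.get(field) is None:
            errors.append(f"missing {field}")
    if not errors:
        gap = record["total_assets"] - (
            record["total_liabilities"] + record["total_equity"]
        )
        if abs(gap) > 0.5:  # assumed rounding tolerance
            errors.append("balance sheet does not balance")
    return errors


def variance_flag(current, prior, threshold=0.5):
    """Flag a value that moved more than `threshold` (as a fraction) vs. the prior period."""
    if prior == 0:
        return current != 0
    return abs(current - prior) / abs(prior) > threshold
```

A flagged value is not necessarily wrong; a check like this would typically route the item back for manual review rather than reject it.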
Standardization
Standardization
Compare “apples to apples” (or Facebooks)
For illustrative purposes only. Source: S&P Capital IQ as of 2/24/2012.
Linking: The Curious, Special Case Of Entities
Linking: Managing Entities
• Entities we like to think about
– Companies (public, private, investment firms)
– Government agencies (the Fed)
– Governments (munis, countries, the EU)
– Securities (equity or debt, issued by the above)
– Indices, funds, rates, other aggregations
– People (executives, board members, investors, shareholders)
Linking: Managing Entities
• Goal: Blend entity data from different sources
– Ex: unified view of stock price and ratings
• First: What’s the identifier? Or identifiers?
– Name, ticker, CUSIP®, others…
• Next: Can we auto-link?
– Use historical links to make future links easier
• Quality checks
– Look for outlier cases
• Remember that things change over time
– So entity links create a time series
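Because identifiers change hands over time, a link is only meaningful together with a validity window. The sketch below shows one way to model that; the `Link` fields, the half-open date range, and the made-up ticker in the usage note are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Link:
    id_type: str                 # e.g. "ticker" or "name"
    value: str                   # the identifier as it appears in the source
    entity_id: int               # internal entity the identifier points to
    start: date                  # first day the link is valid
    end: Optional[date] = None   # first day it is no longer valid; None = open


def resolve(links, id_type, value, as_of):
    """Return the entity an identifier pointed to on a given date, or None."""
    for link in links:
        if (
            (link.id_type, link.value) == (id_type, value)
            and link.start <= as_of
            and (link.end is None or as_of < link.end)
        ):
            return link.entity_id
    return None
```

For example, if a hypothetical ticker "ABC" belonged to entity 1 until 2010 and to entity 2 afterwards, `resolve` returns a different entity depending on the `as_of` date, which is what makes the links a time series.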
An Example Of Difficult Entity Linking: Public Ownership
Suggesting Data
Suggesting Data
• Goal: Platform that learns from user behavior
• Suggest company profiles that the user may be interested in viewing
• Use “data exhaust” to build better products
Suggesting Data
• Challenges
– We’re an impartial data platform
– We may not provide investment advice!
– Clients are super-secret about their deals
– Ergo, can’t use collaborative filtering approach
Suggesting Data
• Advantage: We have lots of great data!
• Key developments
– Curated news product
– “Get smart” on a company
– News searches catch interesting press releases
– In-house researchers ensure quality entity linking and event typing (categorization)
Suggesting Data
• Key development event ranking
– Popular & infrequent events = interesting
– Example: Dividend increase is more noteworthy than dividend affirmation
• User selectivity
– Based on clicks
– Sector, region, company type
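One way to turn "popular and infrequent = interesting" into a number is to multiply clicks per occurrence by an IDF-style rarity term. The formula below is an assumption chosen for illustration; the slides do not give the actual ranking function.

```python
import math


def event_type_scores(occurrences, clicks):
    """Score each event type: popular (clicks per event) and rare scores high.

    occurrences: event type -> number of events of that type (must be > 0)
    clicks:      event type -> total user clicks on events of that type
    """
    total = sum(occurrences.values())
    scores = {}
    for event_type, count in occurrences.items():
        popularity = clicks.get(event_type, 0) / count  # clicks per event
        rarity = math.log(total / count)                # IDF-style weight
        scores[event_type] = popularity * rarity
    return scores
```

With made-up counts in which dividend increases are both rarer and clicked more often per event than dividend affirmations, the increase type scores higher, matching the example on the slide.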
Suggesting Data
• Score each suggestion for each user based on signals via Hadoop + Hive
• Remove items that the user has already seen!
• Present in a “widget” on the “dashboard”
• Measure the clickthroughs
• Rinse, wash, repeat
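The scoring loop above can be sketched in a few lines. The talk describes running this on Hadoop + Hive; the single-process Python below only illustrates the logic, and the multiplicative score, the per-sector affinity weights, and the function name are assumptions.

```python
def suggest(events, user_affinity, seen, top_n=3):
    """Pick companies to show one user.

    events:        list of (company, event_score, sector) tuples
    user_affinity: sector -> weight derived from the user's past clicks
    seen:          set of companies the user has already viewed
    """
    best = {}
    for company, event_score, sector in events:
        if company in seen:
            continue  # remove items the user has already seen
        score = event_score * user_affinity.get(sector, 0.0)
        if score > best.get(company, 0.0):
            best[company] = score  # keep each company's strongest signal
    # Highest score first; company name breaks ties deterministically.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [company for company, _ in ranked[:top_n]]
```

Nothing in this scoring uses other clients' activity, which is consistent with the earlier constraint that a collaborative-filtering approach is off the table.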
Companies You May Be Interested In
For illustrative purposes only.
Projections
Projections
As of 2/24/2012.
Projections – Simple Growth Rates
Transaction Valuation     First Year ($ billion)   3-Year Total ($ billion)
Information Technology    209.8                    774.4
Big Data                  5.0                      32.4
• Let S represent the first year
• Let T represent the 3-year total
• Let x represent the yearly growth rate (%)
• Solve for x:
As of 2/24/2012.
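The equation itself is missing from this transcript. One reconstruction, offered as an assumption, treats the 3-year total as three years of compounding from the first-year value; it reproduces the 21.5% and 89.4% rates reported on the following slide.

```latex
\begin{align*}
S + S(1+x) + S(1+x)^2 &= T \\
(1+x)^2 + (1+x) + 1 - \frac{T}{S} &= 0 \\
1 + x &= \frac{-1 + \sqrt{4T/S - 3}}{2} \\
x &= \frac{\sqrt{4T/S - 3} - 3}{2}
\end{align*}
```

For Information Technology, T/S = 774.4 / 209.8 ≈ 3.691, so x ≈ (√11.765 − 3) / 2 ≈ 0.215, or 21.5%.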
Projections – Simple Growth Rates
Transaction Valuation     First Year ($ billion)   3-Year Total ($ billion)   Yearly Growth Rate (%)
Information Technology    209.8                    774.4                      21.5%
Big Data                  5.0                      32.4                       89.4%
• When will Big Data catch up with IT?
• Let y be the number of years this will take
• Solve for y:
As of 2/24/2012.
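The slide's equation for y is also missing from this transcript, so the sketch below rests on stated assumptions: each series keeps compounding at its own rate, the growth rate comes from S + S(1+x) + S(1+x)² = T, and the catch-up point is where the two compounded values are equal. The starting levels passed to `years_to_catch_up` are the caller's choice; the answer depends on whether one starts from the first-year values or from a later year.

```python
import math


def yearly_growth(first_year, three_year_total):
    """Solve S + S(1+x) + S(1+x)^2 = T for x (assumed compounding model)."""
    ratio = three_year_total / first_year
    return (math.sqrt(4 * ratio - 3) - 3) / 2


def years_to_catch_up(small, small_rate, big, big_rate):
    """Solve small * (1+small_rate)^y = big * (1+big_rate)^y for y.

    Only meaningful when the smaller series grows faster than the larger one.
    """
    return math.log(big / small) / math.log((1 + small_rate) / (1 + big_rate))


it_rate = yearly_growth(209.8, 774.4)   # Information Technology
bd_rate = yearly_growth(5.0, 32.4)      # Big Data
print(f"IT growth: {it_rate:.1%}, Big Data growth: {bd_rate:.1%}")

# Assumption: both series start from their first-year values.
y = years_to_catch_up(5.0, bd_rate, 209.8, it_rate)
print(f"Years to catch up under that assumption: {y:.1f}")
```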
So Is Big Data…
Big Money? YES!