
DETECTING ALPHA SIGNALS IN THE NEWS

A COMPREHENSIVE APPROACH TO MEASURING THE WORLD AROUND US

&

FINDING PATTERNS THAT FORESHADOW MARKET MOVEMENTS

August 2019


SUMMARY

We believe there is significant alpha to be generated by quantifying and analyzing streaming news. News reflects what is happening around the world, and when it is machine-read the way a human naturally would read it, NLP-based measurements can be taken that reflect events, trends, and opinions. While many have tried to use alternative data in investment strategies, these attempts have often been ad hoc, shortsighted, or difficult to scale. Indexica's mission is to measure and quantify the world around us in a way that is systematic and digestible. We convert raw information into digestible signals and map these measurements against financial markets to deliver leading indicators and predictive signals.

In this document, we first provide a view on the alternative data space and the opportunity that exists on the textual side of it. We summarize our technology and the infrastructure that is needed to make use of this opportunity. We then detail at length the different factors and metrics we create from raw, unstructured data and show how we add structure to it. Importantly, this structure allows us to provide our clients with data that can be used to create short- and long-term systematic trading strategies, develop discretionary portfolios or rules-based indexes, and identify themes and thematic or ESG-oriented products. We provide topline analysis and examples of these findings throughout this document.

We then highlight our Predictive Indexing product, which brings together all of the factors and metrics we create into an AI-driven machine learning engine that isolates the factors that move markets and securities. The underlying factor constituents fuse together to create indexes, which serve as predictive signals. The approach, technology, and results are discussed at length, with examples of how the news can provide alpha when the right approach is used.


TABLE OF CONTENTS

Part 1: Overview of Current Environment and Progress Towards the Future

Introduction
The Problem
A View of the Promised Land
Current Solutions & Their Limitations
A Roadmap to the Promised Land
A Summary of the Process & Tech That Makes the Roadmap Real
Methodology and Steps of the Indexica Approach
Intelligent Storage and Architecture
Metrics and Predictive Indices

Part 2: Systematically Measuring the World Around Us

Human versus Machine Reading
Metrics that Measure the World Around Us
The Building Blocks of Metrics
Metric Portfolio
Metric Constructs
Putting Indexica's Metrics To Work
Short-Term Systematic Alpha Strategies And Results
Long-Term Systematic Alpha Strategies And Results
Portfolio And Index Constituent Selection & Weighting Regimes
Building A Thematic/Exposure Score
Gender Composite – Indexica's ESG Factor
Conclusions

Part 3: Finding Alpha Signals That Drive Market Movements

Bringing it all Together
Understanding The Basics of Predictive Indexing
Detailed Process & Methodology
Making Sense of Signals
Putting Predictive Indexes to Work
Equity Indices
Individual Stocks
Currencies
Volatility
Other Securities
Chain-Linking Indices
Delivering Predictive Indexes
Conclusions


PART 1: OVERVIEW OF CURRENT ENVIRONMENT AND PROGRESS TOWARDS THE FUTURE

Introduction

There is significant alpha to be generated by quantifying and analyzing streaming news. News reflects what is happening around the world, and when it is machine-read the way a human naturally would read it, NLP-based measurements can be taken that reflect events, trends, and opinions. These metrics can then be intelligently analyzed using advanced techniques to find what drives markets and companies in both the short and long term. There are significant challenges associated with alternative data and these strategies, but with the right premise, technology, and systematized approach, advanced signals with predictive power, grounded in economic rationale, can be found and utilized by investors.

The Problem

Investors need to make decisions while cognizant that the pace and complexity of change related to political, economic, social, technological, and opinion dynamics is increasing exponentially. The world is changing fast and markets are changing faster. Since 2000, 52% of Fortune 500 companies have gone bankrupt, been acquired, or otherwise disappeared from the list. By 2027, an estimated 75% of the S&P 500 will be replaced. Because the world around us and causal change are hard to measure and harder to anticipate, investing is more difficult than ever. Fortunately, rapid change creates incredible investment opportunity for those who can measure and anticipate it accurately. Investment advantage comes from being able to make proactive decisions before all of the facts, trends, and patterns are widely known. Enter the new era of alternative data, AI, NLP, and machine learning to find alpha in news.

A View of the Promised Land

In an ideal world, investment managers would have two sets of data at their fingertips and an engine between the two.

Side A. Side A would include every available piece of market data. This would include constituents from all asset classes from all countries, all with a long history. Additionally, the data would continue to stream in, in real time. This is realistically affordable and accessible to all investment managers.

Side B. Side B would include a well-researched and functioning quantified measurement metric for everything happening in the world, in real time and historically. This data would be millions of times larger than the market data set. It would reflect items such as demographic trends, weather, social change, app downloads, satellite data, political events, economic patterns, company performance, etc. Most importantly, it would also contain the opinions towards everything that is measured, since opinions impact markets directly. This data set would be all-encompassing, and nothing that happens, or that is felt, would go unmeasured. This is a fantasy.

Intelligence Engine. In this fantasy, there would be a massive intelligence engine built between the two data sets. It would be able to intelligently analyze whether anything from Side B is predictive of Side A. It would look for statistical relationships and use systematized human knowledge to ensure that causal relationships, if found, were grounded in economic rationale. It would look at historical data, and if signals were found, it would continue to generate them on top of newly streamed data, such that trades could be made based on world events in real time. Thus, signals would be actionable.

The underlying premise is that what's going on in the world, and the opinions of what is going on in the world, impact tradable securities each in their own way, and sometimes in systematic ways. Side A isn't the problem; it's accessible to everyone. Side B and the intelligence engine between the two sides are what the investment community lacks but is attempting to create.

Current Solutions & Their Limitations

Most attempts to enter this Promised Land via alternative data are exciting but lacking. Alternative data attempts to fill gaps in Side B by measuring what has historically been unmeasurable, hard to measure, too costly to measure, etc. For example, satellite data is measuring the physical movement of people and items, credit card transactions are measuring how/when/where people are spending, sensors are measuring various movements, etc. The alternative data industry could also be called "the industry that is measuring stuff that previously wasn't measurable." Less catchy, but it's key to understand that alternative data will soon be called essential data, not alternative data, because what it is measuring is essential. It is only the methods of measurement that are new or "alternative." The problems we see in the current alt-data environment are as follows:

1. No value or value dilution. The hope is that alternative measurements have relationships to company performance or market behavior. Often, however, the data does not have value on its own, which is unsurprising; not every metric has investment value. Sometimes the data is valuable, or becomes valuable once enhanced. But when it is, it is often sold to too many buyers, reducing its value. Exclusive access is ideal.

2. Niche measurements. Most data is relevant to a small number of markets. Satellite data, for example, isn’t going to help anyone analyze Oracle. Thus, investment universes are narrowed unless alternative data measurements are systematically applied. It’s not ideal that in order to measure systematically, hundreds of vendors need to be engaged.

3. Digestion issues. Often, data needs to go through a data science group before it can become actionable. By the time all of the steps are taken, the digester of the data doesn’t understand it, it’s too late to act on, or participants are not speaking the same language.

4. Short history. Alternative methods of measuring the world are new, and thus the history will not be as long as desired. Early adopters must accept this. Traditional backtesting needs to be rethought given new measurement capacity. It is no different than similar issues currently facing ESG investors.


5. Wrongly sourced intelligence engines. The engines used to analyze the predictive relationships between alternative data and market data or company performance generally sit, in various forms, at investment managers, who each build models that fit their investment strategies. Some argue that this is preferable to the analytics engine coming from the data providers, since the investor may mix and match the data with other data feeds in ways that create enhanced signals that fit their investment theses. But there are often flaws in this approach, including the difficulty of systematizing the answer to whether a predictive signal is grounded in economic rationale, and the failure to allow the intelligence system to find predictive relationships without a human-entered thesis. Providers of alt data often have unique insight into what they measure and how the data should be analyzed. Thus, provider intelligence engines should be built, valued, and reviewed in depth.

A Roadmap to the Promised Land

In order to benefit from rapid change, one must first be able to broadly measure global dynamics, change, and opinions, filling Side B as completely as possible. Second, one must have an engine that can identify causal relationships between Side A and Side B. Fortunately, AI, NLP, and ML are making this more possible than ever before. Indexica enables customers to do this via:

Modern Factors & Tracking Indices. Our 40+ proprietary metrics micro-measure real-time political, economic, social, technological, and opinion dynamics, trends, and collective consciousness. Turning points and collective behavioral patterns are detected with speed, deciphered, and delivered. While single modern factors tell partial stories, fusing modern factors into indexes expands measurement capacity and allows for customized monitoring.

Predictive Indices. Textual news, the first rough draft of history, typically precedes real world market movements. Patterns in mass quantities of news data often telegraph market-moving events. Indexica measures this data in real time, and deciphers it to generate causal predictive indexes, helping clients become proactive in their investment processes.

A Summary of the Process & Tech That Makes the Roadmap a Reality for Indexica’s Clients

Intelligence Gathering. Indexica gathers vast amounts of streaming data from tens of thousands of unstructured textual web sources in real time. This data is the basis for all generated insight.

Algorithmic Modeling & User Interaction. On top of the data, Indexica utilizes algorithmic models where humans apply mind, experience, sense, discernment, and knowledge. Clients interact with data via our semantic search engine, text filters, API, files, or web-based data visualizations, which enable them to define how intelligence should be tailored and delivered.


Signal Construction. Indexica transforms unstructured textual documents into quantified metrics and predictive indexes to help clients measure, monitor, and intelligently anticipate elements that impact markets.

Methodology and Steps of the Indexica Approach

Crawling the News: Why News? And How?

Finding alpha in news data, or textual documents, first requires validating the source type as the basis upon which measurements should be taken. 80% of the world's data sits in text, making it the largest source upon which quantification can occur. Far more importantly, it is where the unlocked value is hiding. The textual news universe has historically been unquantifiable, yet it contains the richest source of information about what is happening in the world in real time, across subjects and geographies. Because streaming textual news narrates what's happening in the world in real time, it provides a method by which political, economic, social, technological, and opinion dynamics and change can be quantified. Logically, Indexica ingests, in real time, every essential byte from the billions of news articles that are continually streamed to the web.

Our media list contains approximately 25,000 sources of English-language news from approximately 180 countries. Source diversity enables a broad perspective on global events across all subjects. Sources range from daily newspapers in China, to leading medical journals in California, to SEC filings. Our system compliantly crawls, ingests, and reads millions of texts per day from regulatory, economic, scientific, political, social, and technological sources across global markets. Our news history goes back to 2010 for most sources and 2016 for all sources. Every source is human-curated based on content quality and is systematically scored for influence value. We crawl every source on our media list every ten minutes.

Our crawlers are built in house so that we can ingest raw text in a format that aligns with our text processing requirements. For example, we need to preserve the semantic structure of source texts, such as paragraphs, titles, and footnotes. Nothing off the shelf would accomplish the same goal. Additionally, our crawlers are able to intelligently extract text from PDFs, charts, and websites from around the world.
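As a rough illustration of structure-preserving extraction (not Indexica's proprietary crawler; the tag choices and the BeautifulSoup dependency are assumptions made for this sketch):

```python
# Minimal sketch: extract text while preserving semantic structure
# (title, paragraph-level units) instead of flattening a page to one string.
from dataclasses import dataclass, field
from bs4 import BeautifulSoup

@dataclass
class Document:
    title: str
    phrases: list = field(default_factory=list)  # (phrase_type, text) pairs

def extract(html: str) -> Document:
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    doc = Document(title=title)
    for node in soup.find_all(["h1", "h2", "p", "li", "blockquote"]):
        text = node.get_text(" ", strip=True)
        if text:                                   # discard blank phrases
            doc.phrases.append((node.name, text))  # keep phrase type for later scoring
    return doc
```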

Bigger is not always better when it comes to data origins, which is why our media list is human-curated and evolves as new quality sources become available. As it relates to the news media, we've found that adding sources beyond the 25,000 range tends to hurt, rather than help, results, since most sources beyond that core tend to be unreliable, noisy, and duplicative. Additionally, we don't believe there is consistent or significant value to be derived from social media (except for HFT), which is why we've focused on news texts.

Indexica is agnostic to text type and often ingests texts desired by clients, such as sell side research, earnings call transcripts, and Fed transcripts. There is often great value in these texts as well, since they narrate very specific stories which move markets.

How Texts are Processed and Data is Stored

Ingesting millions of news articles from around the world is a complex task, but processing the texts is equally important if a quality end result is to be achieved. The process is broken into the steps noted below (a small sketch of the end product follows the list). All steps have been created with the final result in mind, which is to quantify world change, events, and opinions. Upon request, a sample and/or more detail can be provided to show each step being performed on a single text. All NLP technology was built in house. Nothing is off the shelf.

1. Readable text extraction after DOM object interpretation. This step essentially extracts raw text for the next steps.

2. Phrase parsing and phrase classification. During this process, we map phrases, merge broken ones, and discard useless ones. We discard unrelated links, blank phrases, and unrelated content and we keep valuable content for further processing.

3. Initial text analysis to calibrate applied math. At this level, we start with entity-level parsing, identifying which entities form multi-word expressions and leaving the rest as single words. (They are later POS-tagged in step 8.) This stage includes running a pattern recognition algorithm that matches text type against a supervised trained model, and a language detection feature is also applied. Other text factors are stored at this stage, such as whether a text is a news article or an editorial, when the article was crawled, what type of media source it came from, the origin URL, the number of words, the author, weighting regimes for later scoring, phrase counts, etc. Dozens of attributes are grounded.

4. Text level lexical metrics grounded. We run algorithms to calculate lexical metric values inside the contextualizer process. This saves time and CPU power when later retrieving data from the data lake. All metrics are calculated both at the whole text level and the entity level using surrounding text.

5. Geolocation extraction. We start by listing all possibilities and then running a disambiguation process. For disambiguation we use a sub-NLP process using the text and the media source metadata along with a machine learning system.

6. Date and time extractions. We detect words and expressions that can be structured into a timeline interval. This is a dynamic algorithm that takes into account self-awareness such as “when is today?” All findings are taken into account as part of our Futurity metric.

7. Content categorization. We match content against supervised rules for 15 pre-existing subject categories. These are used for scoring and other purposes.

8. Part of speech (POS) tagging, entity level attributes, and keyword based metric values. At this heavy level, we semantically process the surrounding text for each entity and register all possible entity level attributes as well as process the algorithms to calculate lexical metric values. We store approximately 80 attributes for each entity we extract, in hash table format. This is essential and unique to our approach and end goal of quantifying what we read. Attributes include distances between words, counts, scores, part of speech identification, word position, connectivity, metric values, etc. In this stage, we also perform structured value extraction and attach values to their connected entities. We perform entity inference and upscale single nouns to multi-worded ones when we have enough confidence to do so. We find the “focus” of a text and process a score applied to each entity related to that text’s focus. We review linguistic ambiguity using ML, so that complex sentences are analyzed correctly, boosted, or diminished in value, and scored appropriately.

9. Entity classification. The final step at the contextualizer level is to match found entities against our entity database. This enables us to include known aliases (such as full company names, nicknames, tickers, or acronyms) and ensures that the findings will be available later via search, tagged to the correct entities. This layer includes machine learning and strengthens our ability to map entities to public securities. This phase requires the use of an entity database, a knowledge graph, entity disambiguation algorithms, and other essential NLP tools.

10. Discard texts. All texts are discarded after being processed. Nothing remains except for the extracted information. This is essential for compliance purposes.

To summarize, millions of texts are brought in and processed such that all important information, which should be thought of as building blocks, is extracted and stored for the creation of quantified metrics in the future. Designing the steps is essential if measurements are to be taken later. After these steps, the information used as inputs in measurement metrics is available and tagged to the correct entities.
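To make the end product of these steps concrete, the toy sketch below shows entities being extracted from a text and stored as hash tables of attributes, as step 8 describes. The detection heuristic and the handful of attributes are invented for illustration; the real system stores roughly 80 attributes per entity.

```python
# Toy, runnable sketch of the pipeline's output: each entity as a hash table
# of attributes. Heuristics and attribute names are illustrative assumptions.
import re
from collections import defaultdict

def process(text: str, meta: dict) -> dict:
    entities: dict = defaultdict(lambda: {"count": 0, "positions": []})
    words = re.findall(r"[A-Za-z']+", text)
    for pos, word in enumerate(words):
        # Toy entity detection: capitalized tokens that do not start a sentence.
        if word[0].isupper() and pos > 0:
            attrs = entities[word]
            attrs["count"] += 1
            attrs["positions"].append(pos)               # enables word-distance metrics
            attrs["source_type"] = meta["source_type"]   # text-level attribute (step 3)
    return dict(entities)                                # raw text is then discarded (step 10)

print(process("Today Oracle said revenue grew, and Oracle shares rose.",
              {"source_type": "news"}))
```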

Intelligent Storage and Architecture

Intelligent storage is essential. Indexica registers all processed data into a real-time, queryable data lake that grows perpetually and can receive hundreds of thousands of new entities per hour. Using any entity found in the data lake, Indexica provides a scalable framework that allows end users to filter by any set of semantic context attributes and delivers, in real time, metric values from a proprietary metric portfolio. The system includes permanent background processes, running in perpetuity, allowing users to monitor metrics or indexes for any entity set, or to monitor metric combinations using Indexica's self-developed index math to produce a single plotted value out of them.

Indexica extracts millions of entities per day. These entities have to be queryable in real time, without our knowing which ones users will want, what filter set they may apply, or how a metric/entity pair may relate to a market. At the same time, this same data has to be used to detect patterns and produce systematic insights. This is a mix of two opposed tech worlds: on the one hand, we need a relational database that can be queried in real time; on the other, we need a pure data warehouse approach.

To address both needs, we architected a proprietary partitioned data-lake scheme with a multithreaded, top-down query system, which allows us to query billions of records in real time, applying any mix of the available filters, while returning results within a couple of seconds.

A self-developed on-demand SQL generator, applying filters in real time, allows Indexica to act as a search engine or a discovery tool and, at the same time, as a machine learning tool producing predictive indexes on demand. An in-house-built triple-cache memory layer enables the platform to get smarter and faster with every user interaction, rather than slower over time.
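The pattern can be pictured as a fan-out over partitions with results merged in parallel. The sketch below is a generic illustration of that idea only; the actual partitioning scheme, filters, and caching are proprietary.

```python
# Generic fan-out: query each data-lake partition on its own thread and merge.
from concurrent.futures import ThreadPoolExecutor

def query_partition(partition, entity, filters):
    return [row for row in partition
            if row["entity"] == entity
            and all(row.get(k) == v for k, v in filters.items())]

def query_lake(partitions, entity, filters):
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        chunks = pool.map(lambda p: query_partition(p, entity, filters), partitions)
    return [row for chunk in chunks for row in chunk]
```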

Metrics and Predictive Indices

All of these processes lay the groundwork for the two essential tasks of Indexica: 1) measure the world around us, and 2) see whether those measurements have predictive power in financial markets. Section two will discuss metrics in more detail, and section three will discuss the engine used to scour our universe for metrics that have actionable value.


PART 2: SYSTEMATICALLY MEASURING THE WORLD AROUND US

Human versus Machine Reading

When a human reads a text about something that happened in the world, they generate various conclusions. For example: who was mentioned? Were they people, companies, commodities, currencies, or countries? Were events discussed? What kind of events? Did the article have a bias? Was it generally positive or negative? Was it about business? Was it about something that already happened or something that will happen? Was it filled with fear? Was it written in a complex way or a simple way? Was a big opportunity mentioned? Was it written two years ago? Was it written in China? Was something a surprise? This list could go on and on.

Humans are really good at this process. But humans lack the ability to read millions of articles, which means they cannot digest the large quantity of information released each minute that narrates something happening in the world. Thus, no human should feel confident that their conclusions are grounded in fact unless they are working in parallel with machine intelligence. This is a major problem if the goal is to identify trends that impact markets and invest based on that knowledge. Trends are simply missed.

If we can train machines to read like humans do, chances increase drastically that full measurements will be taken and acted on rather than lost.

Sentiment analysis is the first well-known attempt to teach a computer to make sense of the emotions it reads. Depending on the research, there may or may not be value in sentiment data. Sentiment is just one of dozens of factors that a human interprets from what they read, and by itself, sentiment is not all-encompassing. At Indexica, we think the first step in the quantification process must be to select metrics that matter to a human and that are possible for a machine to create. Those are often conflicting worlds, but NLP is far beyond sentiment, and we must push the measurement limits in order to generate insightful signals.

Metrics that Measure the World Around Us

An Indexica metric is a measurement factor that appraises elements of political, economic, social, technological, and opinion dynamics, trends, and collective consciousness derived from textual documents. These are the essential measurements we use to fill Side B as well as we can.


Our metric portfolio is designed to comprehensively measure what matters: everything happening in the world that is narrated in texts. Being able to measure is the first step towards finding alpha or generating insight. Measurements have their own use cases before alpha is even considered, and those are discussed in this section.

The Building Blocks of Metrics

The building blocks for metrics are preserved when we ingest, process, extract, and store raw information from the millions of texts we read each day. Most metrics are not created when a text is processed. Instead, the building block inputs are stored intelligently for later metric construction.

For example, our Futurity metric is meant to assess whether the language around an entity is tilted towards the future or the past. This metric can assess whether an event happened or will happen, or whether tonal conversation around a company is geared towards the future or the past. It's an essential human metric to digest when reading, but until Indexica, it had never been measured or assessed via NLP. To create this metric, we parse every verb we read for its verbal tense and we extract every time and date we read. Since we are also collecting the distance between words, we can connect these values to nearby entities. Thus, the Futurity score for a company over the last three years at daily frequency can be delivered within two seconds, because the building blocks for the metric are intelligently stored away, anticipating the possibility of this request coming at some point.
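As a rough illustration of the idea (the tense lexicons, window size, and scaling below are invented for this example and are not Indexica's method), a futurity-style score can be computed from verb-tense counts near an entity's mentions:

```python
# Toy futurity-style score: share of future-oriented vs. past-oriented verb
# forms within a window around entity mentions, scaled to 0-10.
import re

FUTURE = {"will", "shall", "plans", "expects", "launching"}   # toy lexicon
PAST   = {"was", "were", "did", "reported", "declined"}       # toy lexicon

def futurity(text: str, entity: str, window: int = 10) -> float:
    words = re.findall(r"[A-Za-z']+", text.lower())
    fut = past = 0
    for i, w in enumerate(words):
        if w == entity.lower():
            nearby = words[max(0, i - window): i + window]
            fut += sum(n in FUTURE for n in nearby)
            past += sum(n in PAST for n in nearby)
    total = fut + past
    return 5.0 if total == 0 else 10.0 * fut / total   # 0 past-tilted, 10 future-tilted

print(futurity("Acme will launch new chips; last year Acme reported losses.", "Acme"))
```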


What Indexica cannot anticipate is what a client will want the metric to be applied to. It could be a company, a commodity, a currency, a geolocation, an event type, a whole text, etc. We have no preset universe. Thus, the measurement can be applied to anything and is technically being applied to everything. That's essential when looking for patterns, because we don't know upfront what a metric will tell us. Not having a preset universe of entities is a differentiator in the marketplace.

Building off the data that is extracted and stored when we read a text (see Section 1), various exclusive metrics are available. These metrics are the result of a balance between what is possible with computing and what is natural and desired by a human. Sometimes these are opposing worlds, and finding the right balance is essential.

Quantitative Metric Portfolio

We’ve built our metrics into categories based on the methodology used to create them and the

types of measurements they take. Some metrics are built by parsing verb structure. Some are

created by analyzing word type and variability. Some are created by matching words to linguistic

lists and using machine learning to disambiguate. There are dozens of methods for creating a

measurement metric using NLP. All metrics are purely quantitative and should be interpreted on

a 0-10 scale.

Metrics tell stories. The learning curve to understand a metric and what its values are trying to say is small. One can analyze values over time entity-by-entity, or via a cross-section at a specific snapshot in time. A factor portfolio encompassing a number of new and unique metrics is necessary if the goal is to measure the world around us, change, and opinions as accurately as possible. Because multiple elements typically impact companies and markets, indexing fuses these elements together, which is essential for measuring and monitoring multiple patterns at once. Below is a list of our factors and a brief summary of what they each measure.

CORE FACTORS

Attitude. Attitude quantifies the degree to which opinion or discussion is positive, negative, or neutral. It is based on a proprietary keyword list that measures negative, neutral, and positive bias. Machine learning is employed to handle ambiguity.

Volume. An utterly essential metric within our portfolio, Volume measures how often something is being spoken about and is important for understanding event frequency. Volume is a measure of the "share of voice" of an entity, relative to other entities. On its own, the metric is highly useful as a method to understand what's happening in the world, but when combined with other metrics, the power of Volume increases dramatically.


Count. Count is the raw number of times an entity was identified in a set of documents. It is related to Volume but is absolute rather than relative.

Futurity. Futurity measures whether discourse is forward-looking, present-based, or historically focused. The metric analyzes parsed verb structure and extracted times and dates to formulate a scaled value. It describes whether an entity or company is associated with innovation and forward-looking viewpoints or with backward-looking behavior.

Lexical Metrics

For each lexical metric in our portfolio, a score measures whether an entity is highly related and connected to the underlying concept or not, as follows:

Anger. Anger measures the degree to which an entity is associated with anger and aggressive language.

Desire. Desire characterizes the relationship between an entity and concepts associated with wants, desirability, and benefits.

Fear. Fear measures the level of fear and worry associated with an entity.

Uncertainty. Uncertainty measures the level of uncertainty, ambiguity, and doubt associated with an entity.

Opportunity. Opportunity measures whether an entity is associated with potentially positive outcomes.

Probability. Probability measures whether an entity is associated with highly likely outcomes or less likely outcomes.

Surprise. Surprise measures whether an entity is associated with unexpected language and events.

Sophistication. Sophistication measures the quality of text and how “high-language” texts describing an entity are.

Slangness. Slangness measures how much slang is associated with an entity by measuring slang and non-dictionary terms.

Severity. Severity measures the strength and seriousness of language associated with an entity.

EasyFunnyLight. EasyFunnyLight measures how light and non-serious the language associated with an entity is.

Gender. Gender measures the degree to which an entity is associated with male- or female-oriented language.

Diversity. Diversity measures whether the discussion around an entity is creative and uses many unique terms, or if the commentary around an entity is simplistic, with frequent use of the same words.

Density. Density analyzes the types of phrases used to discuss an entity. A higher score indicates denser descriptions rather than simple or direct descriptions of an entity. The score often explains whether an entity is associated with a convoluted situation or not.

Complexity. Complexity measures how linguistically complicated the language associated with an entity is. To measure Complexity, we use our version of the Gunning Fog Index.
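For reference, the standard Gunning Fog Index (Indexica's adapted version is proprietary) is computed as:

$$\text{Fog} = 0.4\left(\frac{\text{words}}{\text{sentences}} + 100\,\frac{\text{complex words}}{\text{words}}\right)$$

where "complex words" are those of three or more syllables.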

Action Share. Action Share quantifies the degree to which the language associated with an entity is action-oriented (verb-centric). Higher scores indicate that an entity is associated with events.

Description Share. Description Share measures the degree to which the language associated with an entity is explanatory (adjective-centric). Higher scores indicate that an entity is associated with unusual situations that require detailed and varied explanation.

Quotability. Quotability measures to what degree an entity is featured in quoted conversation, which indicates that it is being actively talked about by influential people.

BuzzSentiment. BuzzSentiment is a fusion of Volume and Attitude, combining the two into a more sophisticated measurement of the opinion bias towards an entity together with the Volume with which it is being spoken about.

NLP Focus. NLP Focus measures how central an entity is to a text or a group of texts. It measures to what degree texts are focused on an entity versus other entities in the same text(s).

COMPLEX FACTORS

Entity Connectivity. Connectivity measures how strong the relationship is between two or more entities. The metric is based on how often, and how closely, multiple entities appear within texts together, and whether they are contained within the same textual phrases. Connectivity is both a measurement metric and a tool that contributes to a knowledge graph, among other broad uses. It is also the core metric used for thematic index scores and exposure scores.

Geo Entity Connectivity. Geo Connectivity (city, state, country) measures how strong the relationship is between an entity and a geolocation, or how strongly geolocations are connected to each other.

Time Entity Connectivity. Time Entity Connectivity measures how strong the relationship is between an entity and a time entity (a date or timeframe).

Subject Categories. For each subject category in our portfolio, a unique score measures whether an entity is highly related to a defined subject category theme or not. Subject Categories are genres that encompass common topics, as follows:


Government & Politics & Policy

World & International Affairs

Living & Lifestyle & Society

Business & Economics & Finance

Law & Order

Arts & Entertainment

Consumers & Products & Design

Science & Health

Weather & Nature & Animals

Technology & Computing

Sports & Games & Hobbies

Religion & Spirituality

Connected Value Focus. Indexica extracts value-based elements such as numbers, currencies, percentages, and other values from textual documents. We use these extractions in this metric to quantify the degree to which an entity is connected to extracted values. This metric is helpful for comparing and contrasting whether entities have a quantitative or qualitative focus.

Connectivity Diversity. Connectivity Diversity measures the degree to which an entity is connected to many other entities, and how strong those connections are. Thus, the metric simultaneously analyzes whether an entity is part of a larger network, and how strong its connections are to the other entities in that network.

Trending Value. Trending Value captures whether the discussion of an entity is trending by measuring its growth in Volume relative to a pre-period value. The metric measures whether growth associated with an entity is due to it being well-established or novel. A higher score indicates that an entity is continuing to grow from an already established position, while a lower score indicates that the growth stems from an entity with a low Volume baseline.

Novelty. Novelty measures how new or novel an entity is. In a sense, it is inversely correlated with Trending Value. The metric allows users to understand whether an entity that experienced recent growth is new and novel or old and established. While the underlying components are similar to Trending Value, the scale is differentiated to give a varied interpretation.

Consequence. Consequence measures the extent to which an entity is important, and not just popular or trending. A low score corresponds to relatively low importance and popularity, while a high score suggests that an entity is relatively important with significant Volume.

Reach. Reach allows users to investigate how broadly an entity is discussed across the media landscape. While Volume indicates how much "share of voice" is devoted to an entity, Reach measures how many media sources in Indexica feature the entity in question, in ratio to the total.

METRICS OF METRICS

Metrics of Metrics are unique and proprietary statistical factors we have created or adapted to derive insights from any Indexica metric or index. They are computed over a selected time series of data and can themselves be tracked over time when computed on a regular basis. More information about these metrics is available upon request. Metrics of metrics include the following (a small computational sketch follows the list):

Volatility

Momentum

Deviation

Moving average

Controversiality

Age

Slope

MACD

RSI

Coherence
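As a sketch of how such derived factors can be computed over any metric time series (these are the standard textbook formulations; Indexica's proprietary variants may differ):

```python
# Standard derived statistics over a metric time series (pandas).
import pandas as pd

def moving_average(series: pd.Series, window: int = 30) -> pd.Series:
    return series.rolling(window).mean()

def momentum(series: pd.Series, lag: int = 10) -> pd.Series:
    return series - series.shift(lag)                 # change over `lag` periods

def rsi(series: pd.Series, window: int = 14) -> pd.Series:
    delta = series.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    return 100 - 100 / (1 + gain / loss)              # classic RSI, applied to the metric
```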

Metric Constructs

Lookback. Each metric looks backwards a certain amount of time to create each plotted value. Unlike a stock price, which is a snapshot in time (like a balance sheet), a metric reading is based on texts over a trailing historical period (like an income statement). For example, if the lookback is set to one day, the previous day's texts are used to create one plotted value. If the lookback is set to 30 days, the previous 30 days of texts are used to create one plotted value. The lookback does not impact the frequency of plotted values; it specifies only the trailing period used to develop each plot. From value to value, a shorter lookback will typically generate more volatile readings than a longer lookback.

Frequency. Metric values plot at defined intervals based on their set frequency. Values can plot hourly, daily, weekly, or monthly. Because the news universe operates 24/7, and because we crawl all sources in real time, immediate metric results can be released, or a longer-term frequency can be set to match a client's needs.

Delivering factors and data using various specifications of lookbacks and frequencies can help achieve different goals. For example, a 30-day lookback at a daily frequency simulates a moving-average indicator.
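The interplay of lookback and frequency can be sketched as follows, assuming a hypothetical table of per-text scores with `date` and `value` columns (not Indexica's actual schema):

```python
# Lookback = trailing window of texts per plotted value; frequency = how often
# a value is emitted. The two are independent, as described above.
import pandas as pd

def plot_series(scores: pd.DataFrame, lookback_days: int, freq: str) -> pd.Series:
    daily = scores.set_index("date")["value"].resample("D").mean()
    rolled = daily.rolling(f"{lookback_days}D").mean()  # trailing lookback window
    return rolled.resample(freq).last()                 # e.g. "D", "W", "M"
```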


PUTTING INDEXICA’S METRICS TO WORK

Metrics can be used in a multitude of ways. A quant team can add textual factors into existing models, and metrics can be combined with existing strategies or data to generate new insights. Indexica factors also provide a breadth of new possibilities for executing backtests or building trading strategies on textual factors from the ground up. Indexica's metrics can be used to generate short-term and long-term quantitative trading strategies, and to create indexes or portfolios based on a theme or geared towards ESG. We have noted some simple analyses in brief below, which do not by any means capture the extent of what is possible. We highlight five use case categories:

1. Short-term systematic alpha strategies and results.

2. Long-term systematic alpha strategies and results.

3. Portfolio and index constituent selection.

4. Building a thematic/exposure score.

5. Indexica’s ESG factor.

The following results are presented in short, summary form. Research briefs can be provided upon request. The results below are actual user examples intended to pique interest.

SHORT-TERM SYSTEMATIC ALPHA STRATEGIES AND RESULTS

Short-term strategies are geared towards using factors from Indexica's offering to execute short-term trades. These strategies generally attempt to optimize next-day excess returns while controlling for standard market factors. We share results for three short-term strategies among the dozens currently in use by our clients.

1. Severity Gets Attention

Strategy: Measure the Severity of news systematically connected to a large universe of equities and initiate short-term trades based on the results by buying stocks with high Severity.

Results:

- Firms with higher Severity factor scores experience strong additional returns in the immediate term (next-day excess returns above a market index). A one-standard-deviation increase in Severity is linked to an additional 5 basis points of return on the following day.
- These excess returns account for general market returns as well as traditional factors such as size, value, and momentum.
- This finding likely suggests that the prices of stocks discussed in the news with high Severity revert in the short term following initial negative press.
- Data is backtested at a daily frequency with 1-day lookbacks from 2016 through Q1 2019.


2. Opportunity In Plain Sight

Strategy: Determine where opportunity exists across a large universe of stocks by selecting assets with higher potential. Execute short-term trades based on the results by buying stocks with high Opportunity scores.

Results:

- Stocks with high Opportunity factor scores show higher excess returns the following day. The effect is nearly 4 basis points in the immediate term for a one-standard-deviation increase.
- Stocks with high Opportunity scores are discussed in the news in a way that describes high potential and future positive opportunities.
- The factor shows very little correlation to standard factors and other Indexica factors.

3. Complexity vs. Slang

Strategy: Determine which assets are written about in an overly complex manner versus colloquially, and exploit this heterogeneity for short-term alpha by shorting high-Complexity equities and buying high-Slangness equities.

Results:

- Stocks which are systematically discussed with more complex language (30-day rolling lookback) show lower returns compared to the market and to less complex peers.
- Stocks that are written about more colloquially (measured via Slangness) show positive returns compared to their peers.
- Complex versus simple verbiage about firms may play an important role in how the market, particularly retail investors, perceives their valuations. Likely, more complex text is associated with a slower market response, which can be capitalized on using systematic factors that measure complex and colloquial text with fast NLP.

LONG-TERM SYSTEMATIC ALPHA STRATEGIES AND RESULTS

We find long-term value across almost all of our factors. Using a universe of the top 500 large-cap US equities, we show that strategies which select firms with higher/lower levels of various Indexica textual factors generate higher returns over common benchmarks. These factors are unique and orthogonal to traditional sentiment and market factors.

The graph below plots the HML return (high quintile minus low quintile, relative to a common benchmark, the S&P 500 index), i.e., the additional return above the market, for a handful of textual factors over the last three years. For example, a strategy of buying high-Futurity stocks and selling low-Futurity stocks yielded approximately 35% greater cumulative returns relative to the S&P index (see the first example below). Exploiting the same strategy with Anger would be the inverse: an investor would rather buy low-Anger stocks and sell their high-Anger counterparts. These are not all-encompassing results, but instead are provided to pique interest. A few of these select examples are discussed in further detail below.

This simple strategy tends to outperform the S&P 500 index over the long term across a number of Indexica factors. We detail Futurity and Novelty below, and provide similar evidence for other factors, such as Severity. We also show how these factors can be used with different universes by providing a currency example. A sketch of the quintile construction follows.
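The quintile high-minus-low (HML) construction behind these results can be sketched as follows, assuming a long-format table of per-period factor scores and next-period returns (column names here are hypothetical):

```python
# Per-period high-minus-low spread: top-quintile mean return minus
# bottom-quintile mean return, after bucketing by factor score.
import pandas as pd

def hml_return(df: pd.DataFrame, factor: str) -> pd.Series:
    def one_period(g: pd.DataFrame) -> float:
        q = pd.qcut(g[factor], 5, labels=False)   # quintile bucket 0..4
        return g.loc[q == 4, "fwd_return"].mean() - g.loc[q == 0, "fwd_return"].mean()
    return df.groupby("period").apply(one_period)
```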

1. Futurity on Equities

Strategy: Systematically measure how firms across an equity universe compare in their tilt towards future-oriented events and conversation versus past-tense discourse. Create a portfolio holding high-Futurity stocks for the long run.

Results:

- Over the last three years, companies with higher Futurity scores are linked to above-average cumulative returns.
- An asset in the highest bin of Futurity earned approximately 30 percentage points more than the average, while those in the lowest bin earned almost 20 percentage points less.
- Leveraging this asymmetry across the sample of S&P 500 stocks yields an additional cumulative return of approximately 35 percentage points.


2. Novelty on Equities

Strategy: Firms spoken about in a novel sense – new to the current conversation, rather than long established – are likely to attract investors' attention and see price growth. Beyond Meat is a recent example. If one measures this across a large number of stocks, within an industry, or across a handful of names, and implements a trading strategy that buys high-Novelty equities, one is likely to outperform.

Results:

- Over the long term, companies with higher Novelty scores are linked to above-average cumulative returns.
- An asset in the highest bin of Novelty earned approximately 15 percentage points more than the average. Those in the lowest bin earned almost 20 percentage points less.
- A long-short strategy here could drive even larger gains above the S&P 500 index.


3. Severity on Equities

Strategy: Over the long term, stocks which are discussed more severely are hypothesized to be linked to lower returns. One strategy is therefore to systematically identify stocks where news discourse is severe, and create a high-low portfolio to capitalize on this hypothesis.

Results:

- Over the course of 13 quarters from 2016-2019, firms with higher levels of Severity returned approximately 10% less than the average on a cumulative basis.
- Firms with lower Severity levels show on average 10-20% higher returns.
- The bar graph below plots bins of the average Severity factor score for each S&P 500 constituent over the sample period (Jan 2016 - April 2019) against its cumulative return. The dashed line is the average cumulative return across the sample.
- A long-short strategy here could drive gains above average S&P 500 returns and could be extended to any universe of stocks.

4. Factors on Currencies

Strategy: Similar to the above strategies, use systematic Indexica factors for currency trading. If textual factors predict long-term heterogeneity across currency returns, this can be leveraged to select currencies based on high vs. low factor exposure. Scores below were generated using the average daily score from Jan 2018 to April 2019. Currencies in the sample data include: USD, EUR, CAD, JPY, CHF, GBP, AUD, HKD, and CNY.

Results:

- Novelty. Based on a basket of currencies, those with high average Novelty scores earned greater returns (relative to the USD) over the sample period. Low-Novelty currencies strongly underperform.
- Severity. High-Severity currencies strongly underperform relative to those with less Severity exposure over the long run.

The graphs below plot three bins of Novelty and Severity and their cumulative return versus the USD, relative to the average currency in the sample. Since the analysis uses a smaller universe, we segmented the factors into terciles rather than deciles; however, the data can easily be expanded to a larger universe as desired.

PORTFOLIO AND INDEX CONSTITUENT SELECTION & WEIGHTING REGIMES

Metrics can be used to select constituents for active or passive portfolio management or index creation. Single- and multi-factor constructs can both be utilized. Importantly, these portfolios and indexes differ from the previously described results in that they utilize periodic rebalancing strategies, which can be variable and customizable. In the following examples, we institute a standard quarterly or monthly holding period. Below we note a few of these strategies, among many with positive results.

1. Single Factor Futurity

Strategy: The premise for this strategy is that companies with more future-oriented tonality, relative to other companies, may be more likely to innovate, attract investment and customers, and in turn generate higher returns in subsequent periods. For each quarter, we calculate the average Futurity score for each stock and then sort all stocks by this value, creating quintiles. Each quarter, the long index includes the stocks that were in the top 20% of Futurity scores in the previous quarter.

Results:

- A quarterly rebalancing strategy based on Futurity drove significantly higher returns versus the index. In practice, any rebalancing frequency (weekly, monthly, or annual) could be implemented.
- An equity in the highest bin of Futurity earned approximately 13.4 percentage points more than the index average, while those in the lowest bin earned almost 20 percentage points less. A long-short strategy could also be implemented with this factor to capitalize on a total return 33.9 percentage points greater than the index.

Futurity Score Quintile                Q1 (Low)   Q2      Q3      Q4      Q5 (High)   Long-Short (Q5-Q1)
Cumulative return                      22.5%      36.0%   49.9%   50.5%   56.5%       -
Cumulative return relative to S&P 500  -20.5%     -7.0%   6.8%    7.5%    13.4%       33.9%
S&P 500 cumulative return              43.0%


2. Single Factor Surprise

Strategy: Measure Surprise on a daily basis, and use the monthly average to institute a monthly rebalancing strategy. Surprise captures whether the conversation around firms was unexpected, in a bad way; such surprises are a main cause of price declines in markets. A monthly rebalancing strategy can be utilized to reduce uncertainty and volatility.

Results:

- An equity in the lowest bin of Surprise (the least amount of Surprise) returned approximately 14 percentage points more than the average. Those in the highest bin returned approximately 10 percentage points less than the index. A long-short strategy could drive large gains above the S&P 500 index during the holding period.

3. Multifactor Futurity and Opportunity

Strategy: Using multiple factors for constituent selection often enhances returns. For example, using Futurity alongside high positive Sentiment or high Opportunity creates a powerful combination. To achieve this, select assets which were in the top 20% of both Futurity and Opportunity in the previous quarter. Cumulative returns were calculated from April 1, 2016 to April 1, 2019 and are noted below.

Results:

- Stocks in the Low-Low bin (low Futurity + low Opportunity scores) earned a modest 15.1% during the time period, while the S&P 500 returned over 40%.
- Stocks in the High-High bin (high Futurity + high Opportunity scores) returned over 65% during the same time period.
- Persistently low-low stocks included NLSN, MCK, HBAN, and ALB.
- Persistently high-high stocks included AKAM, SBAC, FISV, and LH.

4. Factors to Determine Weights

The above strategies highlight constituent selection techniques. Equally powerful is to use another technique to select constituents and then use Indexica's factors to weight those constituents. This creates a tilting or multifactor approach that is unique and hard to replicate. Returns are enhanced while the mandate remains based on the original thesis.

BUILDING A THEMATIC/EXPOSURE SCORE

Using our Connectivity metric alongside other metrics, we can create thematic/exposure scores for securities across a vast universe using state-of-the-art NLP. This is similar to approaches currently being undertaken by various ETFs in the market.


Traditional Approach

The traditional approach to building a thematic/exposure score is to utilize teams of human analysts to pore through filings, research, and news to determine the sector score for each company. The NLP innovations which followed have been simplistic to date. Others in the space often use some mixture of the following steps. First, manually build a theme keyword list. Then, manually put together a list of company names, leaving out aliases. Ingest textual sources, storing texts in a repository (usually a database), and score the texts where theme keywords and at least one company both appear. Matching is done using textual search, via available database engine methods or an additional layer such as Elasticsearch. Scoring is then done using basic co-counting, and the final scores are grouped by desired frequency.

Indexica’s Advanced NLP Approach

- Non-manually, use our Connectivity metric combined with two knowledge graphs to automatically build a theme keyword list from a seed word, such as "electric vehicle" (see the sketch after this list).
- Select a universe of companies and benefit from aliases and our entity database. Allow semantic intelligence to perform disambiguation and prevent false hits.
- Ingest texts and process them intelligently before scoring. All ingested texts go through extensive parsing: detection of paragraphs, phrases, multi-word entities, and single words; mapping; POS tagging; phrase classification; automatic entity augmentation; entity classification; and entity contextualization processes (denials, boosts, diminishes, verbal tenses, surrounding values, surrounding in-text dates, etc.).
- Find essential texts using Indexica's own search layer, avoiding false hits and optionally using a set of extensive filters, including geolocations, metrics, or subject categories. Search against extracted and contextualized entities, not a mere sequence of characters.
- Score findings using an exclusive Connectivity algorithm that takes into account distances, denials, and types of phrases, combined with a set of desired filters.
- Use extensive options for different frequencies or lookbacks.
- Use background processes to keep plotting scores or components into perpetuity.
- Interface with our API to ensure smooth ongoing operation.
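As a toy stand-in for that first, taxonomy-building step (the real method uses the proprietary Connectivity metric and two knowledge graphs), seed expansion by simple co-occurrence might look like:

```python
# Expand a seed term into candidate theme keywords via simple co-occurrence.
from collections import Counter

def expand_seed(texts: list, seed: str, top_n: int = 25) -> list:
    co = Counter()
    for text in texts:
        words = set(text.lower().split())
        if seed in words:
            co.update(words - {seed})             # words co-occurring with the seed
    return [w for w, _ in co.most_common(top_n)]  # candidate taxonomy terms
```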

The Downsides Associated With the Traditional Approach to Thematic Scoring Are Vast

- Expensive and labor intensive.
- Slow and manual.
- Fewer sources can be analyzed with speed.
- Open to discretion.


No Cognitive Intelligence No semantic virtual common sense is utilized. For example, a text says “XYZ Inc.

announced it will not consider drone delivery.” Using a common approach, this will result into a highly scored hit.

Indexica’s unique parsing and semantic processes would discard this, by adding a denied flag to “drone” and the verbal connection “will not consider.”

Lack of Flexibility

Textual corpuses tell different stories depending on whether they originate from a company filing or from the news. Indexica permits any combination of these texts to be used to create scores based on index mathematics. Indexica utilizes approximate two dozen available filters.

Options around lookbacks and frequencies are essential for custom index strategies desired by end user clients, but are rarely used.

No Aliases

Ensuring that all potential tags are mapped correctly is essential and most NLP engines do not utilize them. Google, GOOG, and Alphabet all must be mapped to the same company. And apple the fruit is not the same as Apple the company. Smart NLP is key.

No Textual Intelligence

When a human reads an article, it can use that information to understand when a company has exposure to a theme, but a machine must be trained to do so. Most current systems have no intelligent process for comprehension, parsing, or utilizing the right texts and sections. Counting words will not suffice.

Flat Scoring

Common keyword-based scoring uses keyword counts, and sometimes the distances between occurrences, to produce a score. Indexica's parsing process breaks all texts into complete sentences and fits each sentence into categories such as a title, an inner title, a note, a bullet, a discardable phrase, or a content phrase. A match found in a title may enhance the importance of a match, while a theme keyword match that happens in a different section from the entity match should be discarded or scored lower. Other Indexica metrics, such as Volume, also play a strong role in scoring.
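To illustrate, a minimal sketch of category-aware scoring, assuming sentences have already been classified; the category weights are invented for illustration:

```python
# Hypothetical weights per sentence category; discardable phrases contribute nothing.
CATEGORY_WEIGHTS = {
    "title": 2.0, "inner_title": 1.5, "content": 1.0,
    "bullet": 0.8, "note": 0.5, "discardable": 0.0,
}

def weighted_match_score(matches):
    """matches: (sentence_category, entity_in_same_section) pairs for each
    theme/entity co-match found in a text."""
    score = 0.0
    for category, same_section in matches:
        weight = CATEGORY_WEIGHTS.get(category, 1.0)
        if not same_section:
            weight *= 0.25  # theme and entity matched in different sections: discount heavily
        score += weight
    return score

print(weighted_match_score([("title", True), ("content", True), ("note", False)]))  # 3.125
```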

No Sense of Time

A sentence such as "Last year, XYZ Inc. said that drone delivery may be considered" should be discarded from a relevant thematic index, since it describes something that is not relevant today. Indexica has a unique process that combines verbal conjugations with textual and structured date extractions to fit Futurity into scoring.
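A toy sketch of the idea, using regex cues in place of the full conjugation-and-date pipeline:

```python
import re
from datetime import date

def is_current(sentence, as_of=date(2019, 8, 1), max_age_years=1):
    """Crude currency test: reject past-time cues and explicitly old years.
    (Indexica combines verbal conjugations with structured date extraction;
    these regexes only illustrate the idea.)"""
    if re.search(r"\blast (year|quarter|month)\b", sentence, re.IGNORECASE):
        return False
    year = re.search(r"\b(19|20)\d{2}\b", sentence)
    if year and as_of.year - int(year.group()) > max_age_years:
        return False
    return True

print(is_current("Last year, XYZ Inc. said that drone delivery may be considered."))  # False
print(is_current("XYZ Inc. will begin drone delivery next quarter."))                 # True
```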

The Indexica Process

Keyword List

A taxonomy is created using a handful of seed words provided to the Indexica system. Thereafter, our Connectivity metric expands the word list to include all words strongly connected to the seed. Our Volume metric is then used to create tiers. This could be done manually if desired. The below example is based on the Future of Mobility.


Taxonomy Sample for Future of Mobility

Term | Tier | Term | Tier
absolute positioning | 2 | electric bike | 2
adas | 2 | electric car | 1
advanced driver assistance | 1 | electric vehicles | 1
advanced transport | 2 | hands-free | 2
advanced transportation | 2 | hev | 2
advanced-mobility | 1 | hexicopter | 2
aerial vehicle | 2 | hybrid electric | 1
aerial vehicles | 2 | hybrid vehicle | 1
automation | 2 | hydrogen | 2
autonomous | 1 | hydrogen-powered | 2
autonomy | 2 | hyperloop | 2
autopilot | 2 | intelligent transportation | 1
batteries | 2 | lithium-ion | 1
battery | 2 | low emission vehicle | 1
battery-electric | 1 | megacities | 2
BEV | 2 | micromobility | 2
bikesharing | 1 | microtransit | 2
brt | 2 | mobility | 1
bus rapid transit | 2 | multicopter | 2
car rental | 2 | smart vehicles | 1
car share | 1 | vehicle sharing | 1
car sharing | 1 | vehicle technology | 1
driverless | 1 | zero carbon | 2
drone | 1 | zero emission | 1
drones | 1 | zero-emission | 1
e hail | 1 | zev | 2

Connectivity Scoring

Thematic scores are derived from the Connectivity metric registered in our system based on the appropriate query. A simplified scoring sketch follows these points.

• Connectivity measures the degree to which a firm is cross-referenced with another entity. For thematic queries, that other entity is a dynamically determined list of related entities.

• The score quantifies the frequency, the distance, and the strength of the ties a firm has with the thematic taxonomy.

• This value is then adjusted by a measure of entity volume. While not mandatory, this adjustment reduces potential bias from entities that are discussed more generally and frequently across texts. We then scale the resulting value.

• The final score is scaled from 0-10, where 0 is low exposure to the theme and 10 is the highest possible exposure.

• For each period of data (monthly, quarterly, daily, etc.), we create a per-period ranking of firms by their thematic exposure.
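The exclusive algorithm is not reproduced here, but a schematic stand-in shows how frequency, distance decay, volume adjustment, and 0-10 scaling could combine (all parameters illustrative):

```python
import math

def connectivity_raw(comentions, firm_volume, decay=50.0):
    """comentions: (token_distance, tie_strength) pairs between a firm and the
    thematic taxonomy. Nearby, strong ties count most; overall news volume damps
    firms that are simply talked about everywhere."""
    raw = sum(strength * math.exp(-dist / decay) for dist, strength in comentions)
    return raw / math.log1p(firm_volume)

def scale_0_10(raw_scores):
    """Rescale so the most exposed firm scores 10 and the least exposed scores 0."""
    lo, hi = min(raw_scores.values()), max(raw_scores.values())
    return {k: 10 * (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in raw_scores.items()}

raw = {
    "TESLA": connectivity_raw([(3, 1.0)] * 40, firm_volume=5000),
    "GM":    connectivity_raw([(12, 0.8)] * 15, firm_volume=3000),
    "WMT":   connectivity_raw([(30, 0.4)] * 20, firm_volume=9000),  # high-volume, tangential
}
print(scale_0_10(raw))  # TESLA near 10; the tangential, high-volume firm falls to the bottom
```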


Textual Universe, Filters, and Company Universe

o The textual corpus can derive from our 25,000 news media sources or any other desired corpus, such as regulatory filings.

o Texts in the below example are from January 2018 through April 2019, though any range can be utilized.

o Scores are calculated monthly, based on a 90-day lookback window, though this can be adjusted.

o An equity universe can include all publicly listed equities in the US or abroad.

o Note that all of these variables are fully adjustable (i.e., scores can be produced using varying frequencies, lookbacks, textual sources, and equity universes).

Thematic Exposure: The Future of Mobility

Using news text and a non-customized scoring strategy, we derive values that score firms with

higher exposure to the Future of Mobility theme, which include both core and non-core firms.

Many are direct producers, others are part of the supply chain of these firms, and still others are

leaders in transforming their industry alongside changes in mobility and technology.

US Equities Based on High Mobility Score Rankings

Firm Score Date

TESLA 10.00 01-Apr-19

UBER 6.68 01-Apr-19

LYFT 6.09 01-Apr-19

NVIDIA 5.78 01-Apr-19

GENERAL MOTORS 5.44 01-Apr-19

INTEL CORP 5.05 01-Apr-19

FORD MOTOR CO 4.80 01-Apr-19

MICROSOFT CORP 4.49 01-Apr-19

HP 4.43 01-Apr-19

IBM CORP 4.38 01-Apr-19

QUALCOMM 4.33 01-Apr-19

APTIV PLC 4.31 01-Apr-19

ROCKWELL AUTOMATION 4.20 01-Apr-19

FEDEX CORP 4.14 01-Apr-19

XCEL ENERGY 4.12 01-Apr-19

UNITED PARCEL SERVICE 4.12 01-Apr-19

PTC INC. 4.10 01-Apr-19


Securities which rank above a threshold can be included in a thematic index/portfolio using their scaled scores as a weighting tool (a minimal weighting sketch follows the table):

Firm Weight

TESLA 3.11%

UBER 2.07%

LYFT 1.89%

NVIDIA 1.80%

INTEL CORP 1.57%

MICROSOFT CORP 1.39%

HP 1.38%

IBM CORP 1.36%

APTIV PLC 1.34%

ROCKWELL AUTOMATION 1.31%

FEDEX CORP 1.29%

PTC 1.27%

AVIS BUDGET GROUP 1.25%

VEONEER 1.25%

OMNICELL 1.25%

TENNECO 1.25%
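A minimal weighting sketch consistent with the table above; score-proportional weights with a 1.25% floor is one plausible reading, not a disclosed methodology:

```python
def score_weights(scores, threshold=4.0, floor=0.0125):
    """Include firms scoring at or above `threshold`; weight in proportion to
    score, subject to a minimum-weight floor, then renormalize to 100%."""
    eligible = {firm: s for firm, s in scores.items() if s >= threshold}
    total = sum(eligible.values())
    floored = {firm: max(s / total, floor) for firm, s in eligible.items()}
    norm = sum(floored.values())
    return {firm: w / norm for firm, w in floored.items()}

scores = {"TESLA": 10.00, "UBER": 6.68, "LYFT": 6.09, "NVIDIA": 5.78, "TENNECO": 4.01}
for firm, weight in score_weights(scores).items():
    print(f"{firm:8s} {weight:.2%}")
```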


Persistence versus Time-Variation

For some firms, scores are consistent over time, while for others, they vary over time. The

length of the lookback period can smooth or increase the variation.

Nvidia's scores fluctuate alongside important market news: a September 2018 peak and a February 2019 low.


Fine Tuning a Thematic Score

Some firms often receive a lot of attention because of the size of their businesses, such as

Walmart and Amazon. Because these firms have overall high Volume scores, score

adjustments using Volume and Connectivity to unrelated themes can improve thematic scores.

Lookback periods can also be increased to reduce score volatility. Additional NLP layers and

scoring can be used to separate tangential businesses from producers, if so desired, in the final

product.

This iteration can be done easily and should be considered essential, especially when using NLP to create thematic scores. Below are examples of firms that are tangentially involved in a theme.

They can easily be identified and included/excluded as desired.

Tangential Thematic Relationships

Firm Score

WALMART 4.98

AMAZON.COM 4.94

FORD MOTOR CO 4.80

KROGER CO 4.16

FEDEX CORP 4.14

UNITED PARCEL SERVICE 4.12

Creating Themes on the Fly

Indexica can create thematic/exposure scores for any desired industry, theme, or topic. Topics can be chosen based on a market hypothesis, to address a specific client demand, or for any other reason. Themes can be broad or extremely narrow, and can be augmented with other sources of data such as industry classification, volatility, and traditional momentum factors. Themes do not need to be based on revenue. Below are a few additional examples.


Cannabis Theme

Strategy:

• Create thematic scores to identify core and non-core firms operating in the cannabis industry.
• Use a broad universe including Canadian small-caps.

Results:

• Constituents and weights are consistent with cannabis-focused ETFs, which use manual, human-intensive selection methods.

April 2019 Weighting Based on Top Scoring Firms

Firm Score Weight

AURORA CANNABIS 10.00 0.142

CANOPY GROWTH 9.99 0.141

CRONOS GROUP 8.48 0.120

TILRAY 7.21 0.102

CONSTELLATION BRANDS A 6.74 0.095

VILLAGE FARMS INTL 5.44 0.077

APHRIA 5.28 0.075

ALTRIA GROUP 5.22 0.074

GREEN ORGANIC DUTCHMAN 5.18 0.073

Constituent selection and scores are closely linked to news discussion, which narrates real-world behavior:

*Apr 2018: Altria connection to cannabis industry unveiled
**Dec 2018: Announcement of Altria investment in Cronos


Trade War Theme

Strategy:

Create trade risk scores in order to evaluate exposure to this theme. If the thesis is that a trade

war will worsen, sell highly scored assets. If the thesis is that trade conflicts will improve, buy

highly scored assets. We measure exposure to trade risk using our thematic approach.

Results:

• On average, a one-point increase in trade exposure (7-day rolling lookback) decreased next-day expected returns by 2 bps (a one standard deviation increase reduces returns by approximately 6.4 bps).

• The trade factor is uncorrelated with standard market factors.

• Assets with higher exposure to trade risk strongly underperformed less-exposed stocks over a longer-run horizon as well (in terms of cumulative returns).

Political Affiliation Theme

Strategy:

Create Republican-Connected scores in order to evaluate exposure to this theme. If the thesis

is that companies highly connected to Republicans will thrive, buy them. If not, sell them.

Results:

a. On average, a one-point increase in Republican connectivity (7-day rolling lookback) increases next-day expected returns by 1.7 bps.

b. Extended versions of this theme could overweight Republican-connected firms and underweight Democratic-exposed firms. Alternatively, a theme could measure any political exposure compared to firms which are more neutral in their political connections. Such themes may be interesting not only for abnormal returns, but also for volatility purposes.


GENDER COMPOSITE – INDEXICA’S ESG FACTOR

Gender is a linguistically constructed metric built upon the structural roots of sentiment analysis. Rather than classifying words, phrases, and events as positive or negative, Gender classifies them as more aligned with female- or male-dominant linguistic patterns. Linguistically speaking, male-dominant language often includes macho talk, while female-dominant language is often softer, more caring, and conscientious. The metric was built by analyzing millions of quotes from men and women and classifying words as either "male" or "female" leaning based on their volume of usage by gender. Although Gender is not an all-encompassing score, there are advantages to integrating Gender scores alongside other, more traditional ESG data. One major advantage is the real-time nature of the metric.

Indexica's Gender metric correlates with traditional ESG scores from the major providers, especially at both ends of the spectrum, yet does so in a non-subjective and real-time manner. The yardstick for measuring ESG behavior will evolve; thus, using language tonality rather than human subjectivity is likely to result in more accurate scoring, greater consistency, and faster actionable data. We define our ESG metric as a Gender Composite because the final score is adjusted by our Severity metric to account for firm-level controversy. Our measure shows that higher Gender (more female) stocks do not currently outpace lower Gender stocks or the market benchmark. But our metric does correlate well with traditional ESG scores; thus, if one believes that ESG metrics will result in outperformance in the future, Gender can be utilized in the investment process.
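A schematic sketch of that construction: classify words by which gender uses them disproportionately in a quote corpus, then score new text against the resulting lexicon (the threshold and scaling are invented for illustration):

```python
from collections import Counter

def build_gender_lexicon(quotes, min_ratio=1.5):
    """quotes: (speaker_gender, text) pairs. A word leans 'female' or 'male' when
    its usage rate by one gender exceeds `min_ratio` times the other's."""
    counts = {"female": Counter(), "male": Counter()}
    for gender, text in quotes:
        counts[gender].update(text.lower().split())
    f_total = sum(counts["female"].values()) or 1
    m_total = sum(counts["male"].values()) or 1
    lexicon = {}
    for word in set(counts["female"]) | set(counts["male"]):
        f_rate = counts["female"][word] / f_total
        m_rate = counts["male"][word] / m_total
        if f_rate > min_ratio * m_rate:
            lexicon[word] = "female"
        elif m_rate > min_ratio * f_rate:
            lexicon[word] = "male"
    return lexicon

def gender_score(text, lexicon):
    """Map the female share of gendered words in a text onto a 1-10 scale."""
    leans = [lexicon[w] for w in text.lower().split() if w in lexicon]
    if not leans:
        return 5.5  # neutral midpoint when no lexicon words appear
    return 1 + 9 * leans.count("female") / len(leans)
```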

Reuters ESG scores (which align well with MSCI scores) do not necessarily result in

outperformance over the last few years:


Similarly, Gender scores do not necessarily result in outperformance over the last few years. In this example, Gender scores are collected daily, rebalanced quarterly based on their average Gender score, and sorted into quintiles. High and low quintiles are plotted below; we note that they correlate with outperformance starting in Q1 2019.

Gender’s Relationship to Other ESG Scores

Across large textual corpus samples, female language lines up well with the types of events that

lead companies to receive high ESG scores while male dominant talk often correlates with low

ESG scores. Thus, Gender is a simple and fast way to obtain real-time values and to monitor

trends across portfolios.

In the top panel are low Gender stocks (male dominant) and their respective Reuters ESG scores. Gender is based on a 1-10 scale while Reuters is 1-100. In the bottom panel are stocks which rank high in Gender (female dominant) and their corresponding ESG scores from Reuters.

Low Gender - Low ESG

Ticker | Company | Gender Rank | Gender ESG Score | Reuters ESG
NFLX | Netflix | 1 | 3.15 | 17.11
TWTR | Twitter | 1 | 3.20 | 21.37
EFX | Equifax | 1 | 3.32 | 21.86
SEE | Sealed Air | 1 | 3.38 | 28.44
FB | Facebook | 1 | 2.75 | 32.22


High Gender - High ESG

Ticker | Company | Gender Rank | Gender ESG Score | Reuters ESG
IFF | Intl. Flavors & Fragrances | 5 | 6.00 | 74.94
XYL | Xylem Inc. | 5 | 5.99 | 79.51
EMN | Eastman Chemical | 5 | 5.93 | 82.6
CLX | Clorox | 5 | 5.76 | 87.12
PLD | Prologis | 5 | 5.73 | 89.91

The correlation between our Gender score and other ESG scores is strong and statistically

significant:

Correlation between raw scores:

Correlation between scores, normalized by industry:


Time Variation and Industry Normalization

Gender ESG scores show significant variation over time, reflecting logic and real-world events. The Gender ESG scores also map well to external ESG scores. Furthermore, the scores can be normalized by industry to adjust for systematic differences across sectors (a normalization sketch follows the figures below).

Below we show several large cap stocks across three industry segments and their industry-normalized time series ESG scores. The left panel shows how these scores vary over time. In this example they are computed on a daily basis, but could be calculated more or less frequently. In the right panel, we compare these metrics to ESG scores created by Sustainalytics. We note that the ESG ranking of stocks that we find with our Gender ESG methodology more often than not corresponds directly to the ESG ranking derived by Sustainalytics. Furthermore, the daily time series data aligns well with the quarterly variation created by Sustainalytics. In general, the Gender ESG score provides a frequent, flexible, and unique approach to ESG which correlates strongly with alternative and traditional ESG measures.

Gender ESG by Industry, 2016-2019 — Sustainalytics ESG percentiles:

TROW: 83rd percentile
BAC: 82nd percentile
CFG: 3rd percentile
TJX: 76th percentile
KSS: 80th percentile
COST: 32nd percentile


TXN: 88th percentile

WDC: 76th percentile
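The industry normalization mentioned above can be as simple as z-scoring each firm against its industry cross-section; a minimal sketch under that assumption:

```python
from statistics import mean, stdev

def normalize_by_industry(scores, industries):
    """scores: {firm: raw score}; industries: {firm: industry}. Returns each
    firm's score as a z-score within its own industry cross-section."""
    groups = {}
    for firm, score in scores.items():
        groups.setdefault(industries[firm], []).append(score)
    stats = {ind: (mean(vals), stdev(vals) if len(vals) > 1 else 1.0)
             for ind, vals in groups.items()}
    return {firm: (score - stats[industries[firm]][0]) / (stats[industries[firm]][1] or 1.0)
            for firm, score in scores.items()}

# Illustrative raw scores only; not the actual values behind the percentiles above.
scores = {"TROW": 6.1, "BAC": 6.0, "CFG": 4.2, "TJX": 5.8, "KSS": 5.9, "COST": 5.0}
industries = {"TROW": "financials", "BAC": "financials", "CFG": "financials",
              "TJX": "retail", "KSS": "retail", "COST": "retail"}
print(normalize_by_industry(scores, industries))
```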

Conclusions

In Part 2, we have shown how Indexica can read a text and quantify what it narrates into a single score. When these scores are aggregated across textual sources, days, quarters, and years, and across equities, commodities, currencies, etc., they begin to tell a story. The stories relate to how the market values securities. The relationships between these scores and the movements which occur in markets are strong and can be converted into actionable strategies in various ways. This section has shown that across our metrics, there are relationships which can be exploited as part of the following strategies:

• Short- or long-term systematic trading
• Index product development
• Portfolio construction
• Thematic/exposure score based trading
• ESG implementation

Furthermore, these opportunities are highly customizable and can be used in combination with one another and alongside traditional strategies. In the following section, we will show how these metrics, scores, and factors can come together and fuse into a novel predictive signal using an intelligence engine we've built.


PART 3: FINDING ALPHA SIGNALS THAT DRIVE MARKET MOVEMENTS

Bringing it all Together

In Section 1, we explained that connecting the two essential data silos would grant access to

the Promised Land. In Section 2, we explained how our metrics attempt to measure the world

around us, while providing evidence that there is value in our portfolio of factors across various

use cases. In this section, we explain how Predictive Indexing acts as an intelligence engine

built between measurement metrics and market data.

Predictive indexing uses machine learning strategies to decipher patterns in streaming textual

data using our proprietary quantified metrics as the basis upon which to construct predictive

signals. Thus, without our metrics, predictive indexing would not be effective. And without

Predictive Indexing, our metrics only have half their desired value.

Understanding the Basics of Predictive Indexing

Leading Indicators for Predictive Insight. Predictive indexes deliver signals to help

clients intelligently anticipate market moves driven by political, economic, social,

technological, and opinion dynamics and change. Because textual news typically

precedes real world market action, patterns in mass quantities of news data often

telegraph market-moving events. Indexica organizes, deciphers, and visualizes this data

to generate causal predictive indexes, helping clients become proactive in their

investment process.

Identification of Causal Drivers. This process identifies which modern factors, among

a sea of millions of options that reflect political, economic, social, technological, and

opinion dynamics and change, historically drove market moves. Finding these causal

relationships is accomplished by modeling these metrics against markets and spotting

correlations in a way that a human cannot. Importantly, constituents must possess not

only statistical fit, but quantifiable economic rationale, to qualify for inclusion.

Actionable Surveillance Signals. Modern metric constituents each have individual predictive power that, when fused into composite indices, cohesively increases predictive capacity. Constituents should be monitored continually to stay on top of elements that

impact businesses and markets, while predictive indexes directly inform investment

decision-making and products.


Detailed Process & Methodology

There are a number of processes and algorithms that work together as part of the unified

Predictive Indexing process. These processes function with the goal of delivering a single

signal, composed of underlying Indexica metrics measured against entities of all kinds, based

on the news. The goal is to find what is driving asset price movements in order to deliver an

actionable signal. The process is detailed below.

1. The Query

At the initial stage, a user inputs various items to guide the Predictive Index process. This information can be inputted via a web application, directly with an API, or via a customized app. Queries can be automated in order to trigger a new predictive index at regular intervals (e.g., every day at market close, or at the end of each week, quarter, etc.). The following components make up the predictive inputs (a hypothetical query sketch follows the list):

a. Seed entities

Mandatory. Entities that explain what the input is in words must be entered.

These entities provide direction for the algorithmic and cognitive processes

which follow. As much detail as can be provided at this stage goes a long

way towards helping the system determine the path structure it should follow

in order to find entities whose relationship to the seed is grounded in

economic rationale.

b. Time series

Mandatory. A long-term or short-term historical time series (e.g., market

prices for an asset) must be inputted. The length of the series will determine

whether a signal is persistent or momentum-driven.

c. Filters

Optional. Users can select from available text filters to limit the universe of

available texts that will be analyzed for patterns, which may better reflect the


initial user hypothesis. For example, a user may want to concentrate the

analysis on US-centric text sources, or only on those related to emerging

markets.

d. Condition stamps

Optional. If a user wants to include other time series data as potential

constituents in a predictive index, the stamps enter the process as an entity-metric observation. This allows clients to add additional sources of external

data to the potential predictive index. The condition stamp series receives the

same treatment as an entity-metric derived time series.

e. Match strength

Optional. The match strength is a variable that sets a number of critical

values in test-statistics and thresholds throughout the process. Higher match

strengths may result in Indexica not finding patterns and therefore not

creating a predictive index, while lower strengths may be valuable in

identifying initial relationships that can be further backtested. Often, longer

term constituents have lower but more persistent strength while shorter term

constituents have higher but less persistent strength.

f. Precedence

Mandatory. Precedence is the amount of time preceding the client-inputted time series for which predictive indexing searches for predictive indicators. Thus, Indexica can scour for signals that are actionable 24 hours in advance, or three months in advance, among other variations. This option designates the amount of time that the predictive index should lead the client-inputted data.
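To make the inputs concrete, a hypothetical query payload; the field names and values are illustrative only and do not reflect Indexica's published API:

```python
import json

# Hypothetical payload; keys mirror the inputs described above, not a real schema.
query = {
    "seed_entities": ["S&P 500", "US equities"],      # mandatory
    "time_series": {                                  # mandatory market data
        "symbol": "SPY",
        "frequency": "daily",
        "start": "2019-01-01",
        "end": "2019-02-08",
    },
    "filters": {"source_region": "US"},               # optional
    "condition_stamps": [],                           # optional external series
    "match_strength": "medium",                       # optional
    "precedence_days": 1,                             # mandatory lead time of the signal
}
print(json.dumps(query, indent=2))
```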

2. Metric Intelligence

After the query is initiated, the process goes to work to create a sample of data from the vast

universe of available entities and metrics. Thus, a neural pathway is initiated. Metric intelligence

is a hybrid of algorithmic, rules-based decision making, and predefined criteria based on

extensive human testing and configuration. The goal of this step is to create a subsample of

data which has passed the logic test before it enters the subsequent statistical models.

Cognitive knowledge graphs are utilized at this stage.

For each seed entity provided at input, the system cycles through a number of processes to find related entities. The methods and strengths of connections are multifaceted in order to find relationships that may have otherwise gone unnoticed by a human. Each step of each process is quantified and used for inclusion/exclusion criteria.

a. Directly connected entities

i. Connectivity

For each inputted entity, Indexica finds the top connected entities

using Indexica’s Connectivity metric, resulting in a quantifiable

measure of the strength of the connection to each seed entity. This

knowledge base is derived from the texts we read.

ii. Classification layer


We map seed entities to their classification layer in our entity

database, which organizes 14 million entities into a knowledge graph.

We find related entities using a score derived from the shared

category overlap and, since the classification layer is hierarchical, the

highest level of category that is co-achieved. This knowledge base is

derived from Wikipedia.

iii. Lexicon

Simultaneously, we select entities based on direct connections from

proper nouns across our own lexicon. For each seed entity, we find

entities which map closely based on their lexical classification. This

allows us to find new connections based on the intelligence we have

gathered from machine reading millions of texts and entities. This

knowledge base is derived from an in-house lexicon.

iv. Nth level connections

The process iterates over a variable number of levels, i.e. it finds the

entities which are connected to seed entities, then the entities

connected to those entities, and so on until an adequately large

sample is reached.

b. Geographic relationships

i. This cycle collects geo-entities connected to initial seed entities which are

then included in subsequent level processes.

ii. Additional levels of geo-entities are also collected, which are the connected

geo-entities to Nth level entities.

c. Title positioning

i. In this cycle, we derive entities from titles and headlines of texts released

during the time series period. This enables us to connect seed entities to the

most importantly positioned entities found during the research period. The process keeps entities in titles that are proper nouns.

d. Importance

i. Similarly, another cycle derives the most important entities based on Indexica

metrics such as Volume, rather than solely textual position.

ii. Based on the date range and sampling criteria of the request this phase

collects:

1. The top N entities from metric level sorts within the time series and

the search criteria.

2. For each of the entities gathered from above, we find the top N

directly connected entities.

e. Entity collection and metric retrieval

i. Every step generates a score and is quantified. Each procedure faces

optimization rules based on those scores.

ii. Duplications are removed across all levels/sources and are assigned to their

primary source.

iii. Within each bucket, entities are sorted by the interaction of collected scores.

iv. The entities are then culled within each bucket, and critical values are

dynamically determined by the sample. Entities below thresholds are

discarded.


v. At this stage, the process focuses on every entity which has survived and

collects time series information on all eligible Indexica metrics belonging to

the entities. This produces a very large matrix of data with each entity-metric

time series being a possible candidate for a constituent in a predictive index.

vi. This process iterates within itself based on sample sizes and collected entities. We set sample values to ensure sufficient entities are collected across the various strategies.

3. Rules-based Dimensionality Reduction

After entity selection and the retrieval of all associated metrics in the previous cycles, the subsequent process statistically eliminates data unrelated to the user-provided data series. This process consists of a number of univariate tests. We develop unique test-statistics and compute a value for each entity-metric. Rules and critical values are implemented which decide inclusion/exclusion of the time series in subsequent tests and processes. The detailed steps are as follows (a simplified sketch follows the list):

a. Convert. Indexica converts all time-series data into index formulae. This is done in

order to compare changes over time and is the basis for final index plotting.

b. Correlate. Indexica develops a correlation test-statistic for each time series. The

statistic includes whether the values are positively or negatively correlated.

c. Metrics of metrics. A test-statistic based on our own proprietary metrics of metrics

values is calculated. This determines the similarity between time series in our own

framework. In this step, we focus on variation in moving averages, volatility, slope,

and deviation.

d. Linear fit. A test based on the absolute differences between time series over all observations is created for each entity-metric observation.

e. Higher-order variations. A test-statistic is created based on a number of

computations over each time series. These include skewness, kurtosis, min/max

values, differences from endpoints and midpoints, etc. This statistic accounts for

nonlinear variation over time.
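A compressed sketch of the convert and correlate steps (index rebasing plus a univariate correlation screen; the threshold is illustrative):

```python
def to_index(series, base=100.0):
    """Convert a raw time series to index form so changes are comparable (step a)."""
    return [base * v / series[0] for v in series]

def correlation(xs, ys):
    """Plain Pearson correlation; the sign is retained (step b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def screen(candidates, target, min_abs_corr=0.3):
    """Keep entity-metric series whose indexed values co-move with the target."""
    target_idx = to_index(target)
    kept = {}
    for name, series in candidates.items():
        corr = correlation(to_index(series), target_idx)
        if abs(corr) >= min_abs_corr:
            kept[name] = corr
    return kept

candidates = {"Volume for Microsoft": [10, 11, 12, 14], "Anger for Trump": [5, 5, 4, 6]}
print(screen(candidates, target=[100, 102, 104, 107]))
```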

4. Logic

After entity-metric constituents are statistically tested, they are further evaluated for their

contribution to a potential predictive index from a logical perspective. This step helps to make

logical sense of the still vast array of time series data.

a. Backtesting. Resulting entities are backtested to ensure the connections they have to seed entities and to historical data are logical. If the Connectivity appears to be statistically driven by an outlier, it is discarded.

b. Improvement. Additionally, variations on existing filters and metrics are computed and tested. If entity-metric observations under alternative lookback specifications improve logical tests, they are substituted into the sample which is delivered to the next stage of the cycle.


c. Re-Evaluation. Depending on sample size resulting from previous steps, critical

values are potentially re-evaluated and re-run to ensure a stable sample for

subsequent steps.

5. Supervised Learning

The goal of the next step is twofold: first, to use Machine Learning (ML) models to efficiently remove entity-metric observations with little to no predictive power. This maximizes prediction power based on many individual time series observations and the inclusion of multiple entity-metric observations together. Second, to return optimized weights for use in a composite predictive index containing many constituents. A simplified sketch follows the list below.

a. The class of models used is Supervised Learning Models (SLM). The system is flexible such that it allows for multiple model and specification testing, and therefore a number of processes are cycled through during this stage.

b. The SLMs at this step are regression models with L1 and L2 regularization (Ridge, Lasso, Elastic Net). These offer iterative fitting along regularization paths selected manually, by default settings, or dynamically through cross-validation.

i. The cross-validation uses training samples from subsets of the data to

determine model parameters (weights and penalization).

ii. Following cross-validation, the models are iterated across the entire dataset

and evaluated based on test-statistics.

iii. At this stage, we also allow for minimization algorithms with additional

constraints and regularization tools.
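With off-the-shelf tooling, the regularized-regression step can be sketched as follows, with scikit-learn's ElasticNetCV standing in for the proprietary pipeline and synthetic data in place of real entity-metric series:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 25))   # 120 periods x 25 surviving entity-metric series
y = 0.8 * X[:, 0] - 0.5 * X[:, 3] + rng.normal(scale=0.3, size=120)  # synthetic target

# Cross-validated L1/L2 mix: series without predictive power get zero weight.
model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], cv=5).fit(X, y)
kept = np.flatnonzero(model.coef_)
print(f"constituents kept: {kept.tolist()}; in-sample R^2: {model.score(X, y):.2f}")
```

The surviving nonzero coefficients play the role of constituent weights in the composite index.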

6. Delivery

At the end of the above processes, Indexica collects model components and weights and

prepares them for delivery into an index plotter. Metrics and test-statistics evaluate the model fit

and the quality of the index based on the following criteria:

a. A number of standard statistical tests evaluate the quality of the model.

b. The In-Sample Signal Power score evaluates how well the resulting predictive index

correlates and explains the inputted time series in an in-sample environment.

c. Additional metrics capture the efficiency gain of the predictive model relative to

bivariate entity-metric analyses.

To summarize the process, Indexica performs a series of steps to create a predictive index. We

test for statistical fit using approximately 10 econometric models. If there is a fit, each

entity/metric pair must pass through three knowledge graphs in order to ensure that the fit is

grounded in economic rationale and is not spurious. One knowledge graph is based on our entity database, with its embedded connection strengths and the structure of all entities inside it.

The second is our real-time created knowledge graph based on the texts we read. The third is

based on our self-built lexicon. By ensuring fit both statistically and logically, all potential inputs

for a predictive index pass through the tests that a human would use.


Once a number of individual inputs are found (assuming there are inputs to be found), each with their own predictive power, we use machine learning to fuse the components into an index that

increases the overall predictive power of the signal. This is logical, since typically, multiple

factors will impact markets and companies, and Indexica is able to find all of them if they exist.

Our process fuses these components into an index. Economic rationale and statistical methods

determine the constituent weight of each component such that the overall signal is a proper

combination of world events and opinions, impacting the user’s time series.

Making Sense of Signals

Predictive indexing requires client interaction with Indexica; thus there is benefit in understanding what can be done to improve the quality of a signal and how to digest it. Generally, this requires some understanding of the research we've done on the matter.

If a signal is found, new values will plot as news flows into Indexica. These values and the percent changes from one period to another are the signals. Generally, users should pay more attention to the up or down movements from one reading to the next, rather than the actual index values. The percent changes act as oscillating signals that inform whether the corresponding time series will increase or decrease in the future, as far out as the user pre-selected their precedence to be.
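In code, that reading of a signal is simply its period-over-period direction; a minimal sketch:

```python
def directional_signals(index_values):
    """Turn plotted predictive index values into up/down signals from their
    period-over-period percent changes (direction matters more than level)."""
    signals = []
    for prev, curr in zip(index_values, index_values[1:]):
        pct = 100 * (curr - prev) / prev
        signals.append(("UP" if pct > 0 else "DOWN", round(pct, 2)))
    return signals

print(directional_signals([2.571, 2.535, 2.536, 2.528]))
# [('DOWN', -1.4), ('UP', 0.04), ('DOWN', -0.32)]
```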

Once a predictive index is finalized, it’s important for users to know that predictive index signals

tend to become less powerful as predictive indicators as time passes. This is logical, because

as time passes, what drives markets changes. Users should run new predictive indexes

frequently. This allows users to benefit from morphed predictive indexes that take into account

changes happening in the real world as reflected by updated metric constituents. Users should

have a sense upfront of how long a signal will be valuable, so they are not stuck looking at a

tired signal.

To understand this, clients should know that predictive indexes can be used to find persistent

drivers of market performance or short-term, momentum-focused drivers. The difference

depends on the length of the initial input series. If Indexica is looking for what drives a certain

market over many years of data, we’ll find persistent drivers, if there are drivers to be found.

Generally, persistent drivers have lower in-sample predictive power than short-term drivers, but

the signal tends to hold relatively steady in out of sample environments. Thus, persistent drivers

can last many months or even years. Assume that if an input series is two years or longer, a

signal will last at least two to three months. Short term drivers tend to have very high predictive

power but for shorter amounts of time, and must be re-run frequently. Assume that if an input

series is six months or shorter, a signal will last between one and four weeks. There is no exact

science to this. Thus, depending on investment strategy, there are positive and negative

attributes to either approach. We always suggest re-running predictive indices as often as

possible.


Whichever approach is taken, users should review and research the constituents of each

predictive index to understand causality and the dynamics of the intelligence. A predictive index

may have dozens of components, each contributing towards the ultimate prediction. Since each

component has predictive power, it’s possible that a previously unknown correlation can be

spotted that can be used for alternative predictive purposes or deeper fundamental knowledge

drill-down. For example, if a client wants to predict the price of coffee, and Indexica finds a

predictive index component that is the Volume of rain in Brazil, the client may have just learned

more about what moves a certain market or security. This is now a topic she can focus on in

depth either using Indexica or not.

As a general philosophy, each index component correlates and precedes the input, though

usually not with as much accuracy as a combination of components does. However, each

component tells a real-life, unique story that is useful for understanding correlation for

investment purposes and general knowledge discovery. Thus, while the final output is purely

quantitative, the components of the signal are derived from fundamental rationale.


PUTTING PREDICTIVE INDEXES TO WORK

Predictive indexing can be utilized across various types of securities, markets, company data,

and economic data. The only requirement is that the underlying market data is in a time series

format. Often, Predictive Indexes are used for explaining the out of sample price movement of

an individual stock or an equity index. Predictive Indexes can also be used for currencies, bond

rates, commodity prices, company financial data, etc. The following section highlights a few of

these use cases.

The following results are presented in short, summary form and research briefs can be provided

upon request. We highlight a number of results across various assets and time periods. Also

note that some results are presented based on daily data, some are weekly, and some monthly.

These are all variable inputs in the Predictive Indexing process that a client can select based on

their own individual use case. The below results are actual user examples.

Each predictive index is evaluated by a number of key performance indicators:

In-Sample Signal Power indicates whether the signal is powerful enough to be utilized out of sample. It is a statistical measure of the correlation and explanatory power of the leading predictive index to the inputted time series. Generally, if this number is above 40%, the signal can be trusted in an out of sample environment. Indexica tends to find Predictive Indexes with in-sample signal power above 40% approximately half of the time.

Accuracy is measured as the percentage of time periods out of sample when the leading signal accurately predicted the direction of the underlying asset. There is a strong relationship between In-Sample Signal Power and out of sample Accuracy.

As noted in the below graphic, we see a strong relationship between in-sample values and out of sample accuracy scores:

Figure: In-Sample Correlation to Accuracy — bars show In-Sample Power, Accuracy, and Accuracy (Long) for HY Corp Bond ETF (April), Mortgage Rates, T-Note Volatility, UST 10YR (Jan), S&P 500, US HY Spread, CPI, UST 10YR (April), and HY Corp Bond ETF (Jan).


PREDICTIVE INDEXING: LIVE EXAMPLES

On the following pages, we provide examples of real predictive indexes which have been

implemented by our clients. On the left hand side of each panel is a graphical representation of

the predictive index relative to its underlying asset. As an example, a predictive index on the

S&P 500 Index should be displayed relative to the value of the index, and should lead the

underlying by its desired precedence. On the right side of the panel are the index constituents

that our predictive indexing process determined as final constituents of the index. They sum to

100% and each has predictive power. We provide some key statistics alongside each. As noted

above, these are signal power of the in-sample analysis and the out of sample accuracy. We

also note the time period used for the in-sample analyses. Examples are provided for:

1. Equity Indices

2. Individual Stocks

3. Currencies

4. Volatility Measures

5. Other Instruments

6. Chain-Linked Predictive Indexes

Note that the deliverable for a predictive index is a value that oscillates daily, or at any preset

frequency. The up and down movements of the index inform buy/sell decisions.


EQUITY INDEX PREDICTIVE INDEXES

Using equity indices as underlying inputs for Predictive Indexes is a proven strategy for clients

interested in systematic and macroeconomic trends. Predictive Indexes using these assets

generally have a strong in-sample fit consisting of time-relevant macroeconomic entities.

Index Constituents

• Daily price from Jan 1st – Feb 8th, 2019.
• In-sample signal power: 90%.
• Out of sample accuracy: 60% in the following 10 days.

• Daily price from Jan 1st – Feb 8th, 2019.
• In-sample signal power: 89%.
• Out of sample accuracy: 80% in the following 5 days.

Constituents (bar chart, weights in %): Other; Volume for Retail; Severity for Trump; Action Share for ETF; Probability for Trump; Futurity for Unitedhealth Group; Action Share for Economy; Volume for Microsoft; Probability for Wall Street; Buzz Sentiment for Unemployment; Sentiment for Trade; Action Share for China; Gender for Real Estate; Gender for Unemployment; Action Share for Trump; Action Share for Republican; Action Share for Wall Street; Anger for Trump.

Constituents (bar chart, weights in %): Other; Futurity for Germany; Opportunity for United Nations; Action Share for Brexit; Buzz Sentiment for Unemployment; Sentiment for Dow Jones; Action Share for ETF; Anger for Trade; Sentiment for Trade; Buzz Sentiment for Europe; Gender for Europe; Complexity for EU; Action Share for Europe; Futurity for Europe.

SPY Index



• Weekly values from Jan 2018 – Jan 2019.
• In-sample signal power: 60%.
• Out of sample accuracy: 75% in the following 8 weeks.

INDIVIDUAL STOCK PREDICTIVE INDEXES

When Predictive Indexing uses individual stock prices as the underlying inputs, we generally

observe that constituents are based on time-sensitive news related to the firm and its industry

and the market’s reading of the firm’s latest earnings or other reporting.

• Daily price from May 1st – June 1st, 2019.
• In-sample signal power: 48%.
• Out of sample accuracy: 80% in the following 5 days.
• Out of sample accuracy: 64% in the following 1 month.

Constituents (bar chart, weights in %): Other; Buzz Sentiment for S&P500; Futurity for Dow Jones; Futurity for Trump; Sentiment for United States.

Constituents (bar chart, weights in %): Other; Desire for Under Armor; Sentiment for Sportswear; Volume for Competitor; Action Share for American; Buzz Sentiment for Footwear; Action Share for Under Armor; Description Share for Competitor; Quoteability for Competitor; Sentiment for NYSE.


• Daily price from Jan 1st – Feb 8th, 2019.
• In-sample signal power: 83%.
• Out of sample accuracy: 80% in the following 5 days.

• Daily price from April 15th – May 15th, 2019.
• In-sample signal power: 48%.
• Out of sample accuracy: 80% in the following 5 days.

• Daily price from April 15th – May 15th, 2019.
• In-sample signal power: 64%.
• Out of sample accuracy: 60% in the following 3 weeks.

Constituents (bar chart, weights in %): Other; Fear Or Threat Level for Tariff; Futurity for Climate Change; Quoteability for Earnings Report; Severity for Climate Change; Action Share for Chipotle…; Action Share for Tariff; Buzz Sentiment for Q4; Action Share for Reported Earnings; Sentiment for Tariff.

Constituents (bar chart, weights in %): Other; Action Share for Tariff; Buzz Sentiment for Vegetarian; Volume for Tariff; Buzz Sentiment for Trade; Action Share for Obesity; Sentiment for Chipotle Mexican…; Futurity for Chipotle Mexican Grill; Complexity for Avocado; Futurity for Avocado.

Constituents (bar chart, weights in %): Other; Futurity for Activision Blizzard; Quoteability for Earnings; Quoteability for Call of Duty; Futurity for Call of Duty; Sentiment for Activision Blizzard; Volume for Call of Duty; Futurity for Activision Blizzard; Sentiment for Call of Duty.


CURRENCY PREDICTIVE INDEXES

When Predictive Indexing uses currency rates as the underlying inputs, constituents tend to reflect the macroeconomic news of the countries of origin. Often, political and monetary entities drive the movements in these underlyings.

• Weekly values from Jan 2018 – Jan 2019.
• In-sample signal power: 76%.
• Out of sample accuracy: 63% in the following 8 weeks.

• Daily price from Jan 1st, 2019 – Feb 15th, 2019.
• In-sample signal power: 77%.
• Out of sample accuracy: 60% in the following 30 days.

• Daily price from April 1st – May 15th, 2019.
• In-sample signal power: 74%.
• Out of sample accuracy: 60% in the following 10 days.

Constituents (bar chart, weights in %): Other; Buzz Sentiment for Parliament; Futurity for EUR; Sentiment for United States; Action Share for EUR.

Constituents (bar chart, weights in %): Other; Sentiment for Rates; Buzz Sentiment for CHF; Buzz Sentiment for Floor; Futurity for Floor; Action Share for Floor; Gender for CHF; Sentiment for CHF.

Constituents (bar chart, weights in %): Other; Sentiment for Bank Of Canada; Futurity for Canada; Buzz Sentiment for Canada; Buzz Sentiment for Bank of…; Volume for President; Buzz Sentiment for CAD; Volume for Chinese; Action Share for CAD; Buzz Sentiment for Dollar; Futurity for US Dollar.


VOLATILITY PREDICTIVE INDEXES

One popular underlying input used in Predictive Indexes is the VIX. Predictive Indexes on the VIX feature strong in- and out-of-sample scores and often show constituents which reflect turmoil in financial markets. These Predictive Indexes are useful for clients interested in financial market volatility. Applying other volatility-based underlyings to Predictive Indexing provides insight into individual firms, bonds, and other markets.

• Daily values for Q2 2018 (April 1st – July 1st, 2018).
• In-sample signal power: 67%.
• Out of sample accuracy: 60% in the following 5 days.

• Daily values for Q1 2019 (Jan 2nd – April 1st, 2019).
• In-sample signal power: 54%.
• Out of sample accuracy: 70% in the following 10 days.

Constituents (bar chart, weights in %): Other; Buzz Sentiment for Sector; Severity for Goldman Sachs; Quoteability for Trade; Severity for Merrill Lynch; Quoteability for Merrill Lynch; Severity for Investment; Severity for Nasdaq; Fear Or Threat Level for DJIA; Buzz Sentiment for Credit.

Constituents (bar chart, weights in %): Other; Quoteability for Europe; Probability for Gross…; Quoteability for Trade; Buzz Sentiment for New…; Volume for Israel; Uncertainty for D.C.; Uncertainty for Wall Street; Complexity for ETF; Complexity for Tariff; Volume for Dow Jones…; Severity for Real Estate…; Futurity for Fear; Volume for Uncertainty; Opportunity for DJIA; Volume for ETFs; Buzz Sentiment for Fear.


• Monthly values from Jan 1st, 2018 – April 1st, 2019.
• In-sample signal power: 58%.
• Out of sample accuracy: 100% in the following 2 months.

• Daily values from Feb 1st, 2019 – March 1st, 2019.
• In-sample signal power: 40%.
• Out of sample accuracy: 100% in the following 5 days.
• Out of sample accuracy: 60% over the following 2 months.

• Daily values from March 1st, 2019 – June 1st, 2019.
• In-sample signal power: 52%.
• Out of sample accuracy: 60% in the following 5 days.

Constituents (bar chart, weights in %): Other; Gender for Government; Sentiment for Trump; Futurity for Trade; Volume for Dow.

Constituents (bar chart, weights in %): Other; Fear Or Threat Level for Wealth; Buzz Sentiment for VIX; Volume for DJIA; Volume for Short.

Constituents (bar chart, weights in %): Other; Quoteability for United States; Description Share for TYVIX.


OTHER INSTRUMENTS PREDICTIVE INDEXES

Predictive Indexes can be applied to any time series. Predictive Indexes on commodities,

company revenue data, economic data, mortgage rates, etc., are all possible and show

idiosyncratic information behind the underlying asset of choice.

• Monthly values from Jan 1st, 2017 – Jan 1st, 2019.
• In-sample signal power: 72%.
• Out of sample accuracy: 60% in the following 5 months.

• Weekly average mortgage rates from Jan 2018 – March 2019.
• In-sample signal power: 48%.
• Out of sample accuracy: 80% in the following 10 weeks.

Constituents (bar chart, weights in %): Other; Action Share for Free; Volume for Recession; Action Share for Recession; Action Share for PPI; Quoteability for Recession; Buzz Sentiment for Recession; Action Share for Toys; Probability for Policy; Sentiment for Long Island City; Sentiment for Little Rock, Arkansas; Buzz Sentiment for Facebook; Buzz Sentiment for Atlanta, Georgia; Sentiment for Toys; Sentiment for Retail; Futurity for Retail.

Constituents (bar chart, weights in %): Other; Anger for Chartered Financial…; Probability for Mortgage; Action Share for Fannie Mae; Futurity for Mortgage; Buzz Sentiment for Mortgage.


• Daily price from Jan 1st – Feb 8th, 2019.
• In-sample signal power: 80%.
• Out of sample accuracy: 100% in the following 5 days.

• Daily price from Jan 1st – Feb 8th, 2019.
• In-sample signal power: 88%.
• Out of sample accuracy: 80% in the following 5 days.
• Out of sample accuracy: 60% in the following 1 month.

Constituents (bar chart, weights in %): Action Share for Fracking; Volume for Export; Buzz Sentiment for WTI; Action Share for Persian Gulf; Futurity for Russia; Gender for WTI; Sentiment for Exxon; Action Share for Middle East; Sentiment for Crude; Futurity for WTI; Sentiment for Oil; Futurity for Oil.

Constituents (bar chart, weights in %): Other; Uncertainty for Barclays; Volume for Jpmorgan Chase; Gender for Corporate Board; Futurity for ETFs; Action Share for Income; Anger for Bond; Gender for High-Yield Debt; Quoteability for Q1; Action Share for Grade; Sentiment for Bonds; Action Share for Bonds; Futurity for Corporate Bond.


CHAIN-LINKING PREDICTIVE INDEXES

Generally, a Predictive Index is used to create a short- to medium-term out of sample prediction. However, while one index is being used, another should always be running in the background to ensure the most up-to-date information is being utilized in each signal. Once one index loses its power, another will be ready to take its place. We call this process chain-linking. It is similar to an equity index rebalancing: the constituents change, though the index value does not. This allows us to create a continuous out-of-sample prediction product based on the most current and up-to-date drivers of the underlying asset. Below are a few examples, followed by a minimal rolling-refit sketch.

• Monthly values from Oct 1st, 2016 – Dec 1st, 2018.
• The overlapping model uses between 9-12 months in-sample and predicts 5 months out of sample.
• Average in-sample signal power: 41%.
• Average 60% accuracy in out of sample periods.

• Daily values from Jan 2019 – April 2019.
• 6 weeks out of sample, comprised of 6 rolling Predictive Indexes.
• Average in-sample signal power: 51%.
• Average 67% accuracy in out of sample periods.

Chain-linked indexes PI 1, PI 2, PI 3 — constituents (bar chart, weights in %): Other; Action Share for Donald…; Sentiment for Services; Futurity for Donald Trump; Buzz Sentiment for…; Futurity for Liabilities; Quoteability for United…; Quoteability for American; Buzz Sentiment for Retail; Buzz Sentiment for….

Chain-linked indexes PI 6 and PI 1 — constituents (bar chart, weights in %): Other; Buzz Sentiment for Nasdaq; Quoteability for Option; Description Share for Call; Buzz Sentiment for Bearish.
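Operationally, chain-linking can be sketched as a rolling rebase in which each fresh index is scaled to start where its predecessor ended, so constituents change while the plotted level stays continuous (schematic only):

```python
def chain_link(segments):
    """segments: index-value lists, one per successive predictive index. Each new
    segment is rescaled to start where the previous one ended, so constituents
    change (like an index rebalance) while the plotted level stays continuous."""
    chained = list(segments[0])
    for seg in segments[1:]:
        factor = chained[-1] / seg[0]
        chained.extend(v * factor for v in seg[1:])
    return chained

pi_1 = [100.0, 101.2, 100.7]   # an index nearing the end of its useful life
pi_2 = [55.0, 55.8, 54.9]      # its freshly fit replacement, on its own scale
print(chain_link([pi_1, pi_2]))  # one continuous series: [100.0, 101.2, 100.7, 102.16..., 100.51...]
```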


DELIVERING PREDICTIVE INDEXES

Our entire system is built around a detailed and comprehensive API (documentation is available

upon request), which means that delivery of the underlying Predictive Index is simple and

versatile. For each Predictive Index there are many underlying scores, values, percentage

changes, constituent lists, individual scores, and evaluation test metrics. All of these can be

delivered or queried, or only the most important indicators can be transferred. In the below

example, we show a simple data series of Predictive Index values, their percentage changes,

and an indicator of its accuracy, relative to the index’s underlying. These types of indicators and

metrics can be transferred via email, FTP, API, or signaled via an alert at many different

frequencies (e.g., hourly, daily, before market open, after market close, weekly, monthly, etc.).

Simplified Example of Predictive Index Delivery:

In-Sample Signal Power: 72%

Delivery Day | Predictive Index | Underlying Market | Predictive Index Up/Down | Underlying Up/Down | Accuracy Test
Day 1 | 2.571 | 2.504 | - | - | -
Day 2 | 2.535 | 2.502 | Down | Down | Yes
Day 3 | 2.536 | 2.545 | Up | Up | Yes
Day 4 | 2.528 | 2.530 | Down | Down | Yes
Day 5 | 2.525 | 2.530 | Up | Up | Yes
Day 6 | 2.520 | 2.500 | Down | Down | Yes
Day 7 | 2.532 | 2.448 | Up | Down | No
Day 8 | 2.570 | 2.485 | Up | Up | Yes
Day 9 | 2.572 | 2.453 | Up | Down | No
Day 10 | 2.484 | 2.473 | Down | Up | No

Percentage Accurate: 66%
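The Accuracy Test column above can be reproduced in a few lines; a sketch using the table's values:

```python
def directional_accuracy(pred, underlying):
    """Share of periods in which the predictive index and the underlying moved
    in the same direction, mirroring the Accuracy Test column above."""
    hits = total = 0
    for i in range(1, len(pred)):
        total += 1
        hits += (pred[i] > pred[i - 1]) == (underlying[i] > underlying[i - 1])
    return hits / total

pred = [2.571, 2.535, 2.536, 2.528, 2.525, 2.520, 2.532, 2.570, 2.572, 2.484]
under = [2.504, 2.502, 2.545, 2.530, 2.530, 2.500, 2.448, 2.485, 2.453, 2.473]
print(f"{directional_accuracy(pred, under):.0%}")  # ~67% (6 of 9; the table rounds down to 66%)
```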


CONCLUSIONS AND DEPARTING DISCUSSION

Indexica’s mission is to measure and quantify the world around us in a way that is systematic

and digestible. By converting raw information from textual sources into digestible signals, we

create leading indicators for financial instruments and markets.

Entering the alternative data Promised Land has been a challenge thus far. One must be able to

broadly measure global dynamics, change, and opinions while utilizing an engine that can

identify causal relationships between data and financial markets. Indexica makes this a reality.

Our 40+ proprietary metrics micro-measure real-time political, economic, social, technological,

and opinion dynamics, trends, and collective consciousness. Turning points and collective

behavioral patterns are detected with speed, deciphered, and delivered. While single modern

factors tell partial stories, fusing modern factors into indexes expands measurement capacity

and allows for customized monitoring. Importantly, Indexica measures textual data in real time,

and deciphers it to generate causal predictive indexes, helping clients become proactive in their

investment processes.

This document serves as an overview of our technology, our infrastructure, and how we go about measuring the world around us. We document our metrics and factors in detail and show how

these signals map to markets and how they are the backbone for powerful trading strategies.

Whether the strategies are research oriented, short-term, long-term, active or passive, there is

value across our textual factors and intelligence engines. We have also highlighted our

Predictive Indexing product, which brings together all of the factors and metrics we create into

an AI-driven, machine learning engine. This engine isolates the factors which move markets and

securities by fusing constituents together to create indices which serve as predictive signals.