Classification and Clustering of Stocks, using Genetic Algorithms and Fundamental Analysis
David Bugalho de Moura
Thesis to obtain the Master of Science Degree in Electrical and Computer Engineering
Supervisors: Prof. Rui Fuentecilla Maia Ferreira Neves, Prof. Nuno Cavaco Gomes Horta
Examination Committee
Chairperson: Prof. Horácio Cláudio de Campos Neto
Supervisor: Prof. Rui Fuentecilla Maia Ferreira Neves
Members of the Committee: Prof. João Paulo Baptista de Carvalho
November 2016
This section introduces the use of computational techniques on companies' past financial data
as a way to find trading rules and patterns in the stock market, with the goal of achieving higher
returns than the S&P500 index.
1.1 Motivation and Context
Since the recent advance of computational technologies (especially with the ease of access to
information since the early 2000s), several techniques have been used to try to extract trading rules
from past stock market data [1]. The stock market is driven by humans, which gives it the same
unpredictability that humans have and makes it hard to find patterns in it. Nevertheless, with the
increase in available information observed in the last couple of decades, this task seems more and
more feasible. Although there is randomness intrinsic to the stock market (associated with all the
variables that are unknown to us), a large number of researchers (see the next chapter for examples)
have reported results suggesting that it is possible to obtain returns on the stock market using
computational resources.
To predict future stock prices with Artificial Intelligence (AI) and Computational Intelligence (CI)
systems, one needs to analyze data by applying data mining and machine learning algorithms. Although
Machine Learning (ML) started attracting attention in the 1980s, it only flourished in the 1990s,
when the focus shifted from rule-based methods¹ to data-driven methods² (ML kept approaches it had
inherited from AI but moved towards methods based on statistics and probability theory). Data analysis
evolved in almost the same way, especially Data Mining (DM), which overlaps with ML in the methods
employed. The two can be distinguished in one key aspect: ML focuses on prediction using known
properties (learned in the training phase), whereas DM focuses on the discovery of unknown properties
of the data.
Although these algorithms have existed for some decades, they were only applied to finance later,
not because the technology was inadequate, but because data was neither readily available nor
affordable. In the late 1990s and early 2000s, the spread of the Internet made access to information
easier, cheaper and faster than at any other point in history, which changed both the way problems
were approached and the technologies used.
Several techniques are used to predict stock market quotes, the most popular being AI methods that
optimize the parameters of financial indicators. There are two types of financial indicators:
fundamental and technical. Other methods include the evaluation of each stock of a certain type,
where the type is defined by the stock's sector or by any other rule. Since the amount of available
data is growing exponentially, probabilistic methods are gaining a lot of attention.
¹ Defines a sequence of steps to be taken, based on a Knowledge Base (which consists of facts, and an inference engine).
² Describes data to be matched (pattern matching) and the processing in a more abstract way.
1.2 Problem Statement
The problem is to build a system that evaluates the stock market and accurately groups stocks
that show similar behavior. It should also choose the companies that show higher returns in the
market, and study which technique is better suited to do so. The system should perform similarly
when trading in real time. It is built from scratch with the objective of being used by human
traders, and its architecture should be open to improvements. The main objective is to obtain
investment strategies that achieve better results than the S&P500 index.
1.3 Proposed Solution
The implemented solution is a system built from scratch that classifies companies into groups based
on their financial statements, in two ways: a supervised way, with parameters defined by the user,
and an unsupervised way, which applies a genetic algorithm to optimize the cluster positions for a
fixed number of clusters. After both classifications, the application uses a genetic algorithm once
more to optimize the indicators used to buy stocks. The results of the optimization over the whole
pool of stocks are then compared with the results for each classification, to conclude which
classification groups stocks better. The proposed system was implemented in C++, with about 9000
lines of code, and it was designed to be used by third parties who wish to improve its features.
1.4 Document Structure
The presented thesis is structured as follows:
• Chapter 2 presents the theoretical concepts required to develop this project, including the theory
behind financial indicators, machine learning algorithms with special focus on evolutionary computing,
clustering algorithms and portfolio management. An overview of the results obtained by some related
works is also given.
• Chapter 3 documents the proposed solution, with a detailed description of the architecture, meth-
ods and details of the application.
• Chapter 4 presents a validation of the system, showing parameters used and a detailed study of
the solution performance and robustness through case studies.
• Chapter 5 summarizes this work, concluding the achievements, limitations and proposing future
Financial indicators analyze statistics and present that information in the form of ratios, which
support managers in decisions about future stock prices. These indicators give better results when
used alongside the right strategy [1].
There are several types of financial indicators, divided into Fundamental Indicators (FI) or Technical
Indicators (TI), which come from Fundamental Analysis (FA) and Technical Analysis (TA).
2.1.1 Fundamental Indicators
FA is used to assess the intrinsic value of a market, an industry or a company; FA applied to
companies is the most common. FA uses financial statements to determine whether a company is
undervalued or overvalued. FIs are constructed using FA. These indicators do not take market trends
into account, only the intrinsic value of the company, in other words, its real raw value. They
ignore people's feelings about the company as well as stock quotes and trends, and consider only
the state of the company's finances [2]. By analyzing the factors that reflect (or influence) a
company's productivity, profitability or competitive advantage, one can identify whether a stock is
overvalued or undervalued.
FA was made famous by value investors (see section 2.5.1) such as Benjamin Graham and Warren
Buffett (for their value investment strategies [3], [4]).
Fundamental analysis can generate trading rules that determine which stocks show signals of being
a good investment (by being financially stable), and which stocks show signs of not being financially
stable [5].
FA collects data from the financial statements of several years and analyzes the financial evolution
of a company. This helps managers to predict the growth of a company. Other applications of FA
include the evaluation of data that is external to the company, for example Gross Domestic
Product (GDP) or currency value, to evaluate the potential that a market may or may not have. There
are three main types of FA [2]: Macroeconomic Analysis (analysis of macroeconomic factors such as GDP
growth, to study the effect of the macroeconomic environment on the future profit of a company),
Industry Analysis (analysis of the status and prospects of an industry, to estimate the value of a
company inserted in that industry) and Company Analysis (analysis of the operational status of a
company to evaluate its internal value, usually through the company's financial reports).
• Macroeconomic
This type of analysis uses macroeconomic indicators to make assumptions about the type of market
we are investing in [6]. Examples are economic growth and price stability in the economy; price
stability can be measured as the rate of change in inflation [7]. There are several macroeconomic
indicators (some can be found in: http://www.rbcpa.com/economic fundamentals.pdf).
The Consumer Price Index (CPI) is one such indicator. CPI measures changes in consumer
prices and theoretically determines to what extent life is getting more expensive for the average
consumer. Another important indicator that also measures inflation is the Producer Price
Index (PPI), which measures the rate of change in the prices received by domestic producers for
their output. When these prices increase substantially, it is likely that companies will eventually
pass the burden of the price increase on to consumers.
GDP is also an important indicator, and one of the most used, because it represents the total
output of a given economy. The direction in which GDP is evolving (up or down) may indicate an
expansion or a contraction of the economy. When GDP is stable or declining, most companies
will not be able to increase their profits; however, if GDP growth is too high it may mean trouble,
because it usually comes with a rise in inflation and may bring other negative side
effects. Figure 2.1 shows how these indicators evolved in the United States from 2006 until 2016.
Figure 2.1: Macroeconomic indicators in the United States of America: (a) GDP growth rate, (b) inflation rate and (c) Producer Price Index, from 2006 until 2016 (adapted from http://www.tradingeconomics.com/)
In figure 2.1 one can see that the inflation graph looks like the PPI one, and GDP growth follows a
The analysis of the fundamental value of an industry or sector (number of possible clients, volume
of transactions in that industry, etc.) is used as an indicator to check whether the target market
is good for investment [2].
The possible number of customers in an industry can indicate what kind of market we are
targeting. Markets that rely on a small number of clients for a large part of their revenues
are usually not good markets to invest in, since the loss of one of those clients may cause a major
loss of revenue (for example, if a military supplier makes 100% of its sales to the government, a
change in defense policy may cause the company to go bankrupt).
Industry growth, just like macroeconomic growth, is another indicator of whether a market is good
or bad for investment. Before looking for companies that meet certain requirements, one can check the
growth potential of an industry to see whether a target market is promising. If a market has a
stable or declining number of clients, it will be harder for a company to grow in that market, since
it will need to take market share from other companies.
• Company
Using information given in financial statements, one can calculate fundamental ratios used to
compare companies and decide which ones are the best to invest in. Figure 2.2 shows the evolution
of the fundamentals of Agilent Technologies. Some company FIs are described as follows (these
appear in [8], [9], [10], [11], for example):
– Debt Ratio (DR)
DR is a ratio used to measure the level of debt of a company. Companies with a higher debt
ratio will have a larger amount of debt compared to their assets, leaving them more vulnerable
to an adverse economy, a reduction in their profits or an increase in their debt interests. In
most cases, a high DR can mean that a company is in a highly competitive market, with a
constant need for research and development, usually carried by external financing.
\[ \mathrm{DR} = \frac{\text{Total Debt}}{\text{Total Assets}} \tag{2.1} \]
– Return On Equity (ROE)
ROE measures the performance of the company's Net Income (NI) relative to its equity (that is,
the profit obtained per unit of equity). A high ROE is obtained through operational efficiency,
efficient use of assets and financial leverage. This ratio allows one to select companies that
maximize the return on the investment made in them, since the higher this ratio is, the higher the
return on the money invested in the stock.
\[ \mathrm{ROE} = \frac{\mathrm{NI}}{\text{Total Equity}} \tag{2.2} \]
– Profit Margin (PM)
PM is a ratio that measures how much of the revenue remains as profit, or, as the name says,
the margin (profit) the company keeps after paying all the operating, administrative and financial
costs, along with taxes. It is usually a good sign when this indicator is high, although large
variations are unusual and may be a bad signal (an increase may mean decapitalization, and a
decrease may mean that revenue is not turning into profit).
\[ \mathrm{PM} = \frac{\mathrm{NI}}{\text{Revenue}} \tag{2.3} \]
– Price Earnings Ratio (PER)
PER is a ratio that compares the share price of a company with its per-share earnings. It is
the inverse of the earnings yield (per-share earnings as a fraction of the share price). It is
usually used to look for undervalued companies. When the PER is going up, it is usually because
investors are expecting higher growth in the future. However, this indicator has to be compared
with the PER of stocks in the same sector; because of these nuances, it may be misleading when
used without comparison.
\[ \mathrm{PER} = \frac{\text{Share Price}}{\mathrm{EPS}} \tag{2.4} \]
– Revenue Growth (RG)
RG is an indicator that shows the evolution of the business. It increases for two main reasons:
either the company is gaining market share from competitors, or the company is inserted
in a growing market and is growing with it. It only reflects the growth of a company's revenue,
not of its profits.
\[ \mathrm{RG} = \frac{\text{Revenue}_{\text{Current}} - \text{Revenue}_{\text{LastYear}}}{\text{Revenue}_{\text{LastYear}}} \tag{2.5} \]
– Common Stock Outstanding (CSO)
CSO is an indicator of how the ownership of the company is spread among shareholders. When a
company issues shares there is share dilution, and when a company reduces its outstanding shares
there is an increase in the Earnings Per Share (EPS) (since the same earnings are divided among
fewer shares) and a decrease in the PER. This is a good indicator for finding companies
that have repurchased their shares (reduced the outstanding shares).
\[ \Delta\mathrm{CSO} = \frac{\mathrm{CSO}_{\text{Current}} - \mathrm{CSO}_{\text{LastYear}}}{\mathrm{CSO}_{\text{LastYear}}} \tag{2.6} \]
– Net Income Growth (NIG)
NIG is an indicator of the trend of the profits of a company. It is used to check that a good
result obtained in a certain year is not just a consequence of the economic climate or of
financial engineering. This indicator can be used to search for undervalued stocks, if the stock
price does not follow the same behavior as the net income trend.
\[ \Delta\mathrm{NI} = \frac{\mathrm{NI}_{\text{Current}} - \mathrm{NI}_{\text{LastYear}}}{\mathrm{NI}_{\text{LastYear}}} \tag{2.7} \]
– Payout Ratio (PR)
PR indicates the percentage of net income distributed to investors as dividends. A high PR
indicates a stable company that does not need to invest heavily to keep its business
running, but that is also inserted in a stable market, where stock performance will
be lower than that of companies growing at a fast pace, since the part of earnings not paid to
investors is used to invest and create future earnings growth. Investors seeking high income with
limited earnings growth choose a high PR, and investors seeking capital growth choose a lower PR.
\[ \mathrm{PR} = \frac{\mathrm{DPS}}{\mathrm{EPS}} \tag{2.8} \]
where DPS is the Dividends Per Share.
– Capital Expenditures (CE)
CE, when increasing with a greater momentum than NI, is an indicator that the company is
probably inserted in a competitive market. This indicator is compared to NI to avoid compa-
nies that show this type of behavior (to avoid companies in competitive markets).
\[ \Delta\mathrm{CE} = \frac{\mathrm{CE}_{\text{Current}} - \mathrm{CE}_{\text{LastYear}}}{\mathrm{CE}_{\text{LastYear}}} \tag{2.9} \]
– Cash From Operating Activities Growth (CFOAG)
CFOAG measures how well the company generates money through its operations (the ability to turn
the operating income reported on paper in the income statement into cash actually received).
It accounts for the cash that flows into the company, because a company may
have a high operating net income but be inefficient in collecting its profits as cash.
\[ \mathrm{CFOAG} = \Delta\mathrm{CFOA} = \frac{\mathrm{CFOA}_{\text{Current}} - \mathrm{CFOA}_{\text{LastYear}}}{\mathrm{CFOA}_{\text{LastYear}}} \tag{2.10} \]
Figure 2.2: Financials of Agilent Technologies Inc from 2011 until 2015 (adapted from https://www.google.com/finance): (a) Balance Sheet, (b) Cash Flow Statement, (c) Income Statement. The balance sheet holds information about total debt, total assets and the DR. The cash flow statement holds information about cash from operating activities, cash from investing activities and cash from financing activities. The income statement shows the revenue, the net income, the profit margin, the operating income and the operating margin.
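The ratios in equations 2.1 to 2.10 are simple enough to compute directly from statement figures. The following Python sketch is only an illustration: the `Financials` container, its field names and the sample numbers are assumptions made for this example, not part of the implemented system. All growth indicators (2.5 to 2.7, 2.9, 2.10) share the same relative-change form, so a single helper covers them.

```python
from dataclasses import dataclass


@dataclass
class Financials:
    """One fiscal year of figures; the field names are illustrative assumptions."""
    total_debt: float
    total_assets: float
    total_equity: float
    net_income: float
    revenue: float
    share_price: float
    eps: float  # earnings per share
    dps: float  # dividends per share


def growth(current: float, last_year: float) -> float:
    """Relative year-over-year change (form of eqs. 2.5-2.7, 2.9 and 2.10).

    Note: the result is hard to interpret when last_year is zero or negative.
    """
    return (current - last_year) / last_year


def ratios(f: Financials) -> dict:
    """Single-year fundamental ratios."""
    return {
        "DR": f.total_debt / f.total_assets,    # eq. 2.1
        "ROE": f.net_income / f.total_equity,   # eq. 2.2
        "PM": f.net_income / f.revenue,         # eq. 2.3
        "PER": f.share_price / f.eps,           # eq. 2.4
        "PR": f.dps / f.eps,                    # eq. 2.8
    }


# Made-up numbers, for illustration only.
last = Financials(400.0, 1000.0, 500.0, 80.0, 800.0, 18.0, 1.6, 0.4)
curr = Financials(450.0, 1200.0, 600.0, 120.0, 1000.0, 30.0, 2.0, 0.5)

r = ratios(curr)
rg = growth(curr.revenue, last.revenue)          # eq. 2.5
nig = growth(curr.net_income, last.net_income)   # eq. 2.7
```

Keeping the growth computation in one helper makes it easy to apply the same definition to CSO, CE and CFOA once those figures are available.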
2.1.2 Technical Indicators
TA [12] and FA take different approaches to investment: TA uses the movement of stock
prices [13] and the volume of transactions [14] as the main information to predict stock markets.
TIs look for patterns in past data and use those patterns to forecast market tendencies (figure 2.3
shows how some TIs follow the trends of the S&P500 index). TA generates trading rules by analyzing
previous patterns of technical indicators [2]. TIs can be grouped into eight main groups [15], five
of which are described as follows (the others are flow of funds, sentiment and raw
data):
• Trend
Trend analysis is a price-based indicator used to track the price trends of stocks (or other assets).
Strategies that use this indicator assume that political and economic events usually change
market prices through a change in market trends instead of returning to the most rational point. The
most common trend indicators are Moving Averages (MAs) (see for example [16]).
Figure 2.3: S&P500 index and 3 TI - SMA, RSI and MACD (adapted from https://www.google.com/finance)
• Momentum
Momentum analysis is also a price-based indicator but used to evaluate the velocity of price
change, and evaluate if a trend reversal is about to happen.
• Volatility
Volatility analysis investigates fluctuations in the price ranges of stocks. It can be used to
evaluate risk and to identify the levels of support and resistance. Stock prices are usually observed
to fluctuate between the level of support (lower level) and the level of resistance (higher level),
but to continue falling or rising if they break through one of those levels. Volatility indicators
include Average True Range and Bollinger Bands, among others. Volatility can also be used to predict
macroeconomic indicators.
According to [17], volatility is a good predictor of GDP growth, since GDP growth shrinks after
spikes in volatility. Markets also react to volatility, both in and out of a crisis context and
regardless of the market context (bull or bear). An increase in volatility is usually associated
with an increase in inflation and in the unemployment rate, and during recessions, on average,
volatility rises and interest rates drop. When a random shock in volatility occurs, GDP reacts to it
but reverts to its mean quickly afterwards (1 or 2 quarters). However, if volatility is created by
economic policy uncertainty, the reversion to the mean can take a lot longer (especially if the
policy shock is unexpected).
One way of measuring volatility, according to [17], is by the quarterly and monthly variance of the
average daily Morgan Stanley Capital International (MSCI) country stock market index.
In an attempt to proxy monetary policies, one can control short term interest rates, with a given
lag (to proxy implementation and effectiveness). Also, one can check for the overall tax level of
a country checking for the ratio between Tax Revenue and Real GDP. Industry production will
decrease with an increase in tax rates.
Volatility affects growth much more than the other way around. Three possible measures of
Macroeconomic uncertainty are the Leading indicator index from Organisation for Economic Co-
operation and Development (OECD) (contains various macroeconomic indicators, one of which is
industrial production index), the Oil Price Volatility and economic policy volatility [17].
[18] shows that permanent shocks (being shock defined as a volatility measure) explain the bulk
of the variation of stock prices over short periods. The author also says that three big American
indexes (Dow Jones Industrial (DJI), National Association of Securities Dealers Automated Quota-
tions (NASDAQ) and S&P500) share a common trend and a common cycle relationship, therefore
shocks will affect all markets similarly.
• Volume
Volume based indicators reflect the amount of investment from buyers/sellers, which can also
predict stock price movements. Volume indicators include Volume change rate, On Balance Vol-
ume (OBV), among others.
A Volume Adjusted Moving Average (VAMA) is used in [14]. It is based on equivolume charting,
a technique that analyzes stock prices in relation to the volume traded. In this type of
charting the stock price is plotted on the vertical axis and the volume traded on the horizontal axis.
Short, wide boxes tend to occur at turning points (the stock price is having difficulty moving), and
tall, narrow boxes usually occur in stable markets (the stock price is moving easily).
• Cycle
Cycle analysis is a type of indicator that assumes periodic variation in stock prices. Long cycles
can take years and include several smaller cycles. Strategies that use this indicator analyze the
position of the stock price in the cycle.
[19] tries to find a correlation in the amount of business between countries, and the impact of
shocks in business cycles and GDP.
Dow’s theory [20] (one of the origins of the trend analysis) assumes there are three types of trends
in the stock market:
– Primary trend: Long term movement of prices (from a year to three years)
– Secondary trend: Short term deviations of prices from the underlying trend. It can be seen
as a correction from the primary trend (from three weeks to three months).
– Tertiary trend: A corrective movement from the secondary trend (less than three weeks).
A cycle is defined as an up trend, down trend and up trend again [21], taking only one of the Dow’s
theory trends into consideration. Longer cycles are constituted by several smaller ones.
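Two of the indicator families above, trend and volatility, can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the function names and the sample prices are invented for the example, and the volatility proxy is computed as the variance of simple daily returns, which is one common choice and not necessarily the exact measure used in [17].

```python
from statistics import pvariance


def sma(prices, window):
    """Simple moving average: one value for each full window of prices."""
    if window <= 0 or window > len(prices):
        raise ValueError("window must be between 1 and len(prices)")
    return [sum(prices[i - window:i]) / window
            for i in range(window, len(prices) + 1)]


def return_variance(prices):
    """Population variance of simple daily returns (needs at least 2 prices)."""
    returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
    return pvariance(returns)


closes = [10.0, 11.0, 12.0, 13.0, 12.0]  # made-up closing prices
trend = sma(closes, 3)
volatility = return_variance(closes)
```

A longer window smooths the average more but reacts later to a change in trend, which is why the window length is a natural parameter for an optimizer to tune.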
2.2 Computational Intelligence Algorithms
CI combines methods and tools to solve problems that would normally require human intelligence.
There are several known CI algorithms: artificial neural networks, fuzzy logic systems, evolutionary
algorithms, among many others. In all of them, success in solving a problem depends mostly on
how that problem is represented by the algorithm.
When it comes to algorithmic implementation in computational finance, a popular approach is Evolutionary
Computation (EC) to optimize rule discovery, because the population based system used by EC greatly
increases the number of searches in the solution search space (by doing parallel search), thus reducing
computational time.
EC is a subfield of AI and the focus of this work. Evolutionary Algorithms (EAs) are
algorithms that optimize or learn tasks with the ability to evolve. EAs have three main
characteristics. They are population-based (the algorithm maintains a set of solutions to search the
solution space in a parallel way), fitness-oriented (the algorithm has a fitness function which
measures the success of a solution, and this is the main aspect that guarantees convergence) and
variation-driven (solutions undergo several variation operations, to cover more of the search space
and to avoid local maxima) [22].
Most CI problems can be seen as a mapping of a domain space into a solution space, and usually
the number of possible solutions becomes so large that it is impossible to search all of them. EAs
are stochastic methods that use heuristics to find solutions, which means they do not guarantee the
best solution, but they offer a significant reduction in cost and time [23].
The EAs used in this work are Genetic Algorithms (GAs), the most widely used kind of EA. They can
be used either as an optimization algorithm or to study adaptive systems. GAs simulate natural
selection: better solutions are more likely to reproduce than worse solutions, each solution
(individual) has a limited life span, there is variation in the population, and the ability to
survive is positively correlated with the ability to reproduce [24].
Apart from EC techniques, there are other popular approaches, such as Artificial Neural Networks
(ANN) and Fuzzy Systems.
2.2.1 Genetic Algorithms
GAs are the type of EA that receives the focus of this work. GAs are based on the theory of evolution
developed by Darwin, simulating the evolution of a species in a certain environment. A GA starts
with a population of individuals (chromosomes), each of which encodes a solution. As happens in the
evolutionary process of species, these individuals reproduce in order to create offspring solutions
that are better than the parent solutions.
GAs have proven to be a useful optimization and search algorithm. Many problems in AI can be
defined as a search in a solution space (called the search space) which contains every possible
solution. GAs search this space by comparing solutions and looking for the best one.
This heuristic allows the search of several solutions in parallel, converging to better ones. This
convergence is measured by a fitness function. Fitter solutions are privileged when selecting solutions
to ”reproduce”, attracting the whole population of solutions to somewhere near them in the search space
[22].
Usual implementations of GA individuals are arrays or trees of values (as in [25]), where each value
codifies a parameter to be optimized.
A set of genetic operators has to be defined for the GA. The way these genetic operators are
implemented determines the success of the algorithm. In a simple GA, the algorithm takes four steps
on each iteration (generation) [26]:
• Selection: After evaluating the fitness of each individual of the population, the first step is to
select individuals to reproduce. This selection is done randomly, taking into account the relative
fitness of individuals, such that better solutions are more likely to be chosen.
• Reproduction: In this step, offspring are created from the selected individuals. Both
recombination and mutation of values can be used for this.
• Evaluation: The fitness of the new population is evaluated.
• Replacement: In the last step, the newly created individuals replace individuals from the old
population.
The algorithm will repeat until a stopping condition is reached, and this is either a maximum number
of generations, no change in the best fitted individuals of the population for a predetermined number of
generations or when a specified time elapsed.
Pseudocode for a simple GA is given in algorithm 2.1.
Algorithm 2.1: Simple GA
    t ← 0;
    P(t) ← random;
    Evaluation P(t);
    while not EndCondition do
        Pp(t) ← Selection of parents from P(t);
        Pc(t) ← Crossover from Pp(t);
        Pm(t) ← Mutation of Pc(t);
        Evaluation Pm(t);
        P(t+1) ← New Generation Creation from (P(t), Pm(t));
        t ← t + 1;
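The loop of algorithm 2.1 can be sketched in Python as follows. This is a hedged illustration, not the thesis implementation (which was written in C++): the function name `simple_ga`, its parameters and defaults, the bit-string encoding, the pairwise selection rule, the single elite individual and the toy fitness function (the number of ones in the chromosome) are all assumptions chosen to keep the example short.

```python
import random


def simple_ga(fitness, n_genes, pop_size=30, generations=60,
              p_mut=0.02, seed=0):
    """Minimal GA over bit strings; requires n_genes >= 2."""
    rng = random.Random(seed)
    # t <- 0; P(t) <- random
    pop = [[rng.randint(0, 1) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(generations):  # stopping condition: max generations
        # Keep a copy of the best individual unchanged (elitism).
        children = [max(pop, key=fitness)[:]]
        while len(children) < pop_size:
            # Selection: the fitter of two random individuals, done twice.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # Crossover: single cut point.
            cut = rng.randint(1, n_genes - 1)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with probability p_mut.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children  # replacement: new generation takes over
    return max(pop, key=fitness)


# Toy problem: maximize the number of ones in a 20-bit chromosome.
best = simple_ga(fitness=sum, n_genes=20)
```

Because the best individual is copied into each new generation, the best fitness in the population never decreases from one generation to the next.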
The basic operators of a GA are defined as follows:
• Selection
Selection is made by evaluating and ranking individuals, using the fitness function of the GA [27].
Selection has several ways of being implemented. This work will focus on two implementations:
Roulette Wheel Selection and Ranking Selection. These are described as follows:
– Roulette Wheel Selection: The principle here is that of a linear search in a roulette wheel,
where the slots in the wheel are weighted in proportion to each individual's fitness value. To
implement the roulette wheel, first the total expected value of the individuals in the population
(the sum of their fitness values) is obtained (see equation 2.11), and afterwards the
algorithm can run (see algorithm 2.2)
\[ T = \sum_{i=1}^{N} \mathrm{Fitness}_i \tag{2.11} \]
then:
Algorithm 2.2: Roulette Wheel Selection
    i = 0;
    while i != N do
        choose random number r ∈ ]0, T];
        j = 0;
        FitSum = 0;
        while FitSum < r do
            FitSum = FitSum + Fitness_j;
            ++j;
        ++i;
– Ranking Selection: In Ranking Selection, individuals receive their fitness according to their
ranking. This results in slower convergence, but it avoids premature convergence and the risk of
getting trapped in a local maximum. One way to do this is to select two individuals at random;
the one with the best ranking becomes a parent. Then, this process is repeated to find the other
parent.
(a) Single Point Crossover (b) Multi-point Crossover
Figure 2.5: Boolean valued and integer valued mutation. The integer valued mutation is not associated with any probabilistic distribution in this image; it is purely figurative
• Crossover
Crossover simulates the biological crossover, and mixes values of individuals (of the old popula-
tion) to generate an offspring (that will be an individual in the new population). The Crossover can
be made in a single point (2 segments are exchanged between individuals), or in multiple points
(more than 2 segments exchanged) as shown in figure 2.4.
• Mutation
Mutation is an operator used to increase diversity in solutions (being able to cover more of the
search space, and avoiding being stuck at a local maximum). Mutation perturbs a value in the chro-
mosome, adding noise with a certain probability distribution (a popular choice is Gaussian noise) in
real valued chromosomes, randomizing that value, interchanging values or flipping boolean values,
as shown in figure 2.5.
• Other paradigms
In [25] a tree-structured representation of a portfolio is used as the GA individual, where the GA
must satisfy some additional rules: for example, each branch of the tree must be a portfolio by
itself, each node represents the weight of that branch, and the leaves are the stocks. In this
representation the GA operations are handled differently.
There are also other operators or techniques that can be applied to the GA in order to improve its
results, such as elitism, which propagates a percentage of the best individuals in a population into
the next generation.
Adaptive Genetic Algorithms (AGAs) are used for Dynamic Optimization Problems (DOPs) (problems
where variables change over time). These kinds of problems need a solution that tracks the moving
optima over time. To achieve this, one has to enhance the GA so that it adapts
to the new optima over time. An AGA can be a GA whose parameters (such as population size or
mutation or crossover probability) change while the GA is running. According to [28], a DOP is
characterized as:
\[ F = f(\vec{x}, \vec{\psi}, t) \tag{2.12} \]
where $\vec{x}$ are the decision variables, $\vec{\psi}$ the parameters and $t$ is time. The challenge
here is to track the moving solution without having to restart the algorithm. There are 5 main
approaches to this:
– Memory: store useful information
– Diversity: handle convergence
– Multi-Population: co-operate between sub-populations
– Adaptive: adapt generators and parameters
– Prediction: forecast changes and take action
Details about these techniques are given as follows:
– Memory Approaches
Memory approaches are particularly useful for cyclic DOPs. This approach can be divided into
implicit memory approaches and explicit memory approaches. Implicit memory approaches
use redundant information. One way of implementing this in GAs is to use a pair of chromosomes
in each individual (Diploid GA) that encode the genotype of the individual, together with a
dominance scheme that maps the genotype to the phenotype. Explicit memory approaches use
extra memory to store useful information about the population. The best solutions are saved in
memory, such that when a change occurs the stored solutions are used to track the new
optima. If Direct Memory is used, only good solutions are stored in memory; if Associative
Memory is used, good solutions and environmental information (context) are stored. In
this case, when a memory update occurs, a new pair (AD) (with $\vec{D}$ being the environmental
information) replaces another, and solutions are generated by sampling $\vec{D}_M$. An example
is shown in figure 2.6.
Figure 2.6: DOP Memory Approaches
Figure 2.7: Diversity Approaches - (a) Random Immigrants, (b) Memory-based Immigrants
– Diversity Approaches
Diversity approaches use the diversity of individuals to cover more of the search space, in
order to converge faster when a change occurs. A way of achieving this is the
Random Immigrants approach (see figure 2.7). This approach inserts random individuals
in each generation to maintain diversity, such that when a change occurs the random individuals
attract the population to the new optimum. A second approach is Memory-based
Immigrants, where some points in the search space are stored in memory and re-evaluated
each generation. In each generation the best memory point is chosen, the immigrants
are generated by mutating this point with a certain probability, and these immigrants then
replace the worst individuals in the population.
– Multi-Population Approaches
Multi-Population approaches use several co-operating sub-populations to explore the search
space at the same time. One approach to this is the Shifting Balance, where a core population
explores the area of the present optimum while several colonies (sub-populations) explore the
rest of the search space. Whenever a change in the optimum occurs, the fittest individuals
of the colony searching the space of the current optimum migrate to the core population,
attracting the core population to this search space. Another approach is the Self-Organizing
Scouts (SOS), where a core population explores the promising search space and is split into
child populations under certain conditions. Each child population explores limited promising
areas and is also split under certain conditions (see figure 2.8).
Figure 2.8: DOP Multi-Population Approaches - (a) Shifting Balance, (b) Self-organizing Scouts
– Adaptive Approaches
Adaptive approaches change the operators/parameters of the GA, usually after a change, to
push the population towards dramatic changes for a certain period. Hyper-mutation,
Hyper-selection and Hyper-learning are 3 operators used to achieve this (temporarily increasing
the mutation rate, selection pressure and learning rate, respectively).
– Predictive Approaches
Prediction approaches analyze patterns in the DOP to forecast the next optimum, when the
next change will occur and which environment may appear. Kalman Filters and forecasting
are two examples of techniques used.
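Two of the approaches above, Random Immigrants and Hyper-mutation, can be sketched in C++ as follows. This is only a minimal illustration and not the implementation of any cited work: the `Individual` struct, the function names, the replacement ratio and the uniform gene range are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical individual: a real-valued chromosome plus its fitness.
struct Individual {
    std::vector<double> genes;
    double fitness = 0.0;
};

// Random Immigrants: replace the worst `ratio` fraction of the population
// with randomized individuals (genes drawn uniformly from [lo, hi)).
// Immigrant fitness is reset to 0 and must be re-evaluated by the caller.
// Returns the number of individuals replaced.
std::size_t injectRandomImmigrants(std::vector<Individual>& population,
                                   double ratio, double lo, double hi,
                                   std::mt19937& rng) {
    if (population.empty() || ratio <= 0.0) return 0;
    const std::size_t count = std::min(
        population.size(),
        static_cast<std::size_t>(ratio * static_cast<double>(population.size())));
    // Sort best-first so that the worst individuals sit at the tail.
    std::sort(population.begin(), population.end(),
              [](const Individual& a, const Individual& b) {
                  return a.fitness > b.fitness;
              });
    std::uniform_real_distribution<double> gene(lo, hi);
    for (std::size_t i = population.size() - count; i < population.size(); ++i) {
        for (double& g : population[i].genes) g = gene(rng);
        population[i].fitness = 0.0;
    }
    return count;
}

// Hyper-mutation: use a raised mutation rate for `hyperPeriod` generations
// after an environment change, then fall back to the base rate.
double mutationRate(double baseRate, double hyperRate,
                    int generationsSinceChange, int hyperPeriod) {
    return generationsSinceChange < hyperPeriod ? hyperRate : baseRate;
}
```

In a GA loop, `injectRandomImmigrants` would be called once per generation, while `mutationRate` would be queried with a counter that is reset whenever a change in the environment is detected.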
2.2.2 Neural Networks
ANNs have attracted the attention of researchers due to their predictive power and flexibility. An ANN is a
biologically inspired computational model which consists of processing elements (neurons) and
connections between them with coefficients (weights). These connection weights are the "memory" of the
system [23]. This kind of system can be used for either supervised or unsupervised learning.
Usually, neurons are visualized as being arranged in layers, and typically neurons in the same layer
behave in the same manner. The arrangement of neurons into layers and the connection patterns
within and between layers is called the "net architecture". Figure 2.9 shows an example of a feedforward
network: a network in which the signals flow from the input units to the output units, in a forward
direction [29].
ANN applied to computational finance are implemented in several ways (see [30], [31], [21], [32], [33]
or [34] for some examples).
Figure 2.9: Artificial Neural Network representation - $w_{ij}$ is the weight given to the connection between nodes $i$
and nodes $j$, and $w'_{jk}$ is the weight of the connections between nodes $j$ and $k$. These weights are chosen
according to a mathematical function, which decides which neurons (nodes) will be used as the path for
the inputs.
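The forward flow of signals through the network of figure 2.9 can be sketched in C++ as follows. This is a minimal illustration under stated assumptions: a single hidden layer, no bias terms, and a logistic (sigmoid) activation, none of which is fixed by the text. The names `layerForward` and `feedForward` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // w[i][j]: weight from node i to node j

// Logistic activation (an assumption; the text does not fix one).
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// One layer: out[j] = sigmoid(sum_i in[i] * w[i][j]).
// The weight matrix is assumed rectangular.
std::vector<double> layerForward(const std::vector<double>& in, const Matrix& w) {
    const std::size_t outSize = w.empty() ? 0 : w[0].size();
    std::vector<double> out(outSize, 0.0);
    for (std::size_t j = 0; j < outSize; ++j) {
        double z = 0.0;
        for (std::size_t i = 0; i < in.size() && i < w.size(); ++i) {
            z += in[i] * w[i][j];
        }
        out[j] = sigmoid(z);
    }
    return out;
}

// Input layer -> hidden layer (weights w_ij) -> output layer (weights w'_jk).
std::vector<double> feedForward(const std::vector<double>& x,
                                const Matrix& wIJ, const Matrix& wJK) {
    return layerForward(layerForward(x, wIJ), wJK);
}
```

Training (for example by error back-propagation) would adjust `wIJ` and `wJK`; the sketch covers only the forward direction described above.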
2.2.3 Fuzzy Systems
In 1965, Lotfi Zadeh published a paper [35] formally developing multi-valued set theory, which later
came to be known as fuzzy logic. In that paper, the author showed how the indicator function $I_A$ of a non-fuzzy
subset $A$ of $X$, described in equation 2.13, could be extended to the multivalued indicator function $\mu_A$
of a fuzzy subset of $X$, given by the membership function in equation 2.14.
$$I_A(x) = \begin{cases} 1, & x \in A \\ 0, & \text{otherwise} \end{cases} \tag{2.13}$$
In equation 2.13, 0 represents non-membership and 1 represents membership.
$$\mu_A(x) : X \to [0, 1] \tag{2.14}$$
In equation 2.14, $\mu_A(x)$ is interpreted as the degree of membership of element $x$ in fuzzy set $A$ for each
$x \in X$ [36].
If the universe is discrete, a membership function can be defined by a finite set in the following way:
$$A = \sum_{i} \mu_i / u_i \tag{2.15}$$
In equation 2.15, the symbol $/$ separates the membership degrees $\mu(u_i)$ from the elements of the universe
$u_i \in U$ [23].
Fuzzy rules applied to computational finance usually create linguistic rules of the type IF this THEN that
using technical indicators, which can be understood by a human trader. Fuzzy systems are often used
together with ANNs; examples are given in [37] and [32].
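A discrete fuzzy set as in equation 2.15 can be sketched in C++ as a map from elements $u_i$ to membership degrees $\mu_i$. This is only an illustration: the `FuzzySet` alias and function names are hypothetical, elements are assumed to be integers, and the use of the minimum as the AND of two conditions in an IF-THEN rule is an assumption (the text does not specify the operator).

```cpp
#include <algorithm>
#include <map>

// Discrete fuzzy set A = sum_i mu_i / u_i, stored as u_i -> mu_i.
using FuzzySet = std::map<int, double>;

// Degree of membership of element u in set a; elements that are not
// listed have degree 0 (non-membership).
double membership(const FuzzySet& a, int u) {
    const auto it = a.find(u);
    return it == a.end() ? 0.0 : it->second;
}

// Combining two conditions of a rule "IF p AND q THEN ..." with the
// minimum operator (an assumed, commonly used choice).
double fuzzyAnd(double p, double q) { return std::min(p, q); }
```

A crisp set is the special case in which every stored degree is exactly 0 or 1, which recovers the indicator function of equation 2.13.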
2.3 Clustering Algorithms
Data mining (DM) is the process of exploring data from different perspectives to discover previously
unknown patterns, and of developing a model used to understand phenomena from the data and summarize
it into useful information [38].
This analysis makes it possible to obtain correlations and to learn new features about the data set. Although the
term is relatively new, the technology is not, and it is used by large distribution companies (Walmart, for
example) to relate customers' buying patterns, which allows them to increase revenue using this information.
DM is considered a step of Knowledge Discovery from Data (KDD), a process which consists of
the iteration of the following steps [39]:
• Data cleaning: removes noise and inconsistent data
• Data integration: multiple data sources may be combined
• Data selection: relevant data for the analysis task is retrieved from the databases
• Data transformation: data is transformed or consolidated into appropriate forms for mining
• Data mining: intelligent methods are applied to extract patterns from data
• Pattern evaluation: identifies patterns that represent knowledge based on some interesting measures
• Knowledge presentation: visualization and knowledge representation techniques are used to
present the mined knowledge to the user.
There are several DM algorithms; for an explanation of some of them see [40]. This work uses a
clustering algorithm, a type of algorithm that can be used in data mining (although in this work it is
used as a classification algorithm), and so this section focuses on detailing this technique.
Clustering is a data analysis tool that solves classification problems; it is applied when there is
no class to be predicted but instances can be divided into natural groups. Clustering
itself is not an algorithm but a task, with several algorithms that can be used to find a solution. The
best algorithm to apply depends on the data and the desired results [41]. It is an unsupervised technique,
since it does not use preclassified data. Instead, the algorithm discovers similarities (in the requested
attributes) between objects of the set, grouping them in the same cluster. The identified groups may be
exclusive (an instance belongs to only one group), overlapping (an instance belongs to several groups),
or probabilistic (an instance belongs to each group with a certain probability) [42].
The objective of exclusive clustering is to group a set into smaller subsets, such that the degree of
association is strong between members of the same cluster and weak between members of different clusters.
There are many clustering algorithms (see [43]). Some clustering algorithms use metrics to measure
intra-cluster or inter-cluster distance. In [44] a GA is used to optimize clustering, using the
Calinski-Harabasz index as fitness function, obtaining better results than classical clustering methods such as
K-means and Fuzzy C-means (FCM) (FCM is a fuzzy clustering method, in which an object can
belong to more than one cluster; for more information see [45] and [46]).
The algorithm from [44], called ACGA, uses cluster points as chromosomes for the GA, and activation
values for those points. It is described in algorithm 2.3.
Algorithm 2.3: ACGA
  t ← 0;
  P(t) ← random;
  A(t) ← random;
  while not EndCondition do
    Pp(t) and Ap(t) ← Selection of parents from P(t) and A(t);
    Pc(t) and Ac(t) ← Crossover from Pp(t) and Ap(t);
    Pm(t) and Am(t) ← Mutation of Pc(t) and Ac(t);
    Check for bigger A(t) values;
    Choose clusters corresponding to bigger A(t) values;
    Compute Calinski-Harabasz index to attribute fitness values;
    P(t + 1) ← New Generation Creation from (P(t), Pm(t)) and (A(t), Am(t));
    t ← t + 1;
In algorithm 2.3, P(t) is the population (each individual is an array of points in the space that can be
chosen as a cluster center) and A(t) are the activation values (each individual is an array of values
∈ [0, 1]); each individual in A(t) corresponds to an individual in P(t) (chromosomes are of the same
size).
2.4 Portfolio Composition Problem
The original portfolio composition optimization problem described by Markowitz [47] is given
in equations 2.16.
$$\max \; \text{expected return} = \sum_{i=1}^{M} u_i x_i \tag{2.16a}$$
$$\min \; \text{expected risk} = \sum_{i=1}^{M} \sum_{j=1}^{M} o_{ij} x_i x_j \tag{2.16b}$$
$$\text{s.t.} \quad \sum_{i=1}^{M} x_i = 1 \tag{2.16c}$$
In equations 2.16, $u_i$ is the expected return of asset $i$, $x_i$ is the portion of the investment allocated to asset $i$,
and $o_{ij}$ is the covariance between assets $i$ and $j$.
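Equations 2.16 can be evaluated for a candidate portfolio with the short C++ sketch below. The function names and the numerical tolerance are assumptions made for the example; `u`, `o` and `x` follow the symbols of the equations.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Equation 2.16a: expected return = sum_i u_i * x_i.
double expectedReturn(const std::vector<double>& u, const std::vector<double>& x) {
    double r = 0.0;
    for (std::size_t i = 0; i < x.size() && i < u.size(); ++i) r += u[i] * x[i];
    return r;
}

// Equation 2.16b: expected risk = sum_i sum_j o_ij * x_i * x_j,
// where o is the M x M covariance matrix.
double expectedRisk(const std::vector<std::vector<double>>& o,
                    const std::vector<double>& x) {
    double v = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            v += o[i][j] * x[i] * x[j];
    return v;
}

// Equation 2.16c: the portions must sum to 1 (up to a tolerance).
bool isFullyInvested(const std::vector<double>& x, double tol = 1e-9) {
    double sum = 0.0;
    for (double xi : x) sum += xi;
    return std::fabs(sum - 1.0) <= tol;
}
```

For two assets with expected returns 0.10 and 0.20, variances 0.04 and 0.09, covariance 0.01 and equal portions of 0.5, this gives an expected return of 0.15 and an expected risk of 0.0375.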
There are two main approaches to this problem. One is Single-Objective (SO) optimization, and the
other Multi-Objective (MO) optimization.
In the case of SO, a single criterion is optimized, an optimum is either its maximum or its minimum,
and a solution dominates another if it is above (for maximums) or below (for minimums) the other solution.
In MO there are several criteria to be optimized, and the exact mutual influences between objectives
can become complicated and are not always obvious. This approach uses Pareto optimality, which
defines the frontier of solutions that can be reached by trading off conflicting objectives [48]. According
to [49] and [50], a MO algorithm is said to be robust if solutions stay as close as possible to the Pareto
front, the rankings are the same in training and validation, solutions are diverse (uniformly distributed
along the Pareto front), and non-dominated solutions remain non-dominated in training and test. In [50] there
are also techniques used to improve robustness, for example Mating Restriction (restricting mating to
occur only between dominated and non-dominated individuals).
Fitness can be described as proximity to the Pareto front, and the diversity of solutions can be described as
their distribution along the Pareto front. The MO approach has shown great results in optimizing
the portfolio composition problem and in ranking stocks [51]. Examples of works using MO are [52], [11]
and [53].
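For the two objectives of the portfolio problem (maximize return, minimize risk), Pareto dominance and the extraction of the non-dominated set can be sketched in C++ as follows. The `Solution` struct and function names are hypothetical, and the quadratic-time filter is chosen for clarity rather than efficiency.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical candidate portfolio described only by its two objectives.
struct Solution {
    double ret;   // expected return (to maximize)
    double risk;  // expected risk (to minimize)
};

// a dominates b if a is no worse in both objectives and strictly
// better in at least one of them.
bool dominates(const Solution& a, const Solution& b) {
    return a.ret >= b.ret && a.risk <= b.risk &&
           (a.ret > b.ret || a.risk < b.risk);
}

// Returns the non-dominated solutions (the approximation of the Pareto front).
std::vector<Solution> paretoFront(const std::vector<Solution>& all) {
    std::vector<Solution> front;
    for (std::size_t i = 0; i < all.size(); ++i) {
        bool dominated = false;
        for (std::size_t j = 0; j < all.size() && !dominated; ++j) {
            if (j != i && dominates(all[j], all[i])) dominated = true;
        }
        if (!dominated) front.push_back(all[i]);
    }
    return front;
}
```

Two solutions that trade return against risk (one has both higher return and higher risk) do not dominate each other, so both stay on the front.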
2.5 Investment Strategies
Strategies vary a lot, ranging from Value Investing strategies to Growth Investing strategies, or even
a mixture of both. One can choose between several strategies when investing, some of which are
described below.
2.5.1 Value Investing
As referred above, FIs were largely used by investors such as Warren Buffett, who made value investing
famous. He would look for companies with a high intrinsic value, and/or companies that had some sort of
competitive advantage, and buy them, as said in [4]. This generated huge profits for these investors,
because the market (which is self-regulated) eventually realized the value of the stocks, and since the
companies had a competitive advantage in their markets, the stocks never had a big
breakdown in recession times, and continued to grow further. This type of strategy uses FIs above all
other indicators, since they give the best mechanism to evaluate the financial strength of a company, even
if its quote is currently falling. FIs do not take into account any kind of trends, and so all one can assume
(using only FIs) is that a company is good or bad according to its financial statements (comparing the
intrinsic value with the market value), and then wait for the market to eventually realize the company's value.
2.5.2 Growth Investing
TIs are mostly used for strategies that take into account tendencies, such as growth investing (this
investment strategy is the most commonly used in computational finance). This means that instead of
measuring the intrinsic value of an asset at given times, it measures how the market reacts to it, either
by simply analyzing the stock price trend, or by going into a more complex analysis relating the trend of a
stock price to its volatility. Studying past tendencies of certain TIs can help to predict the future tendency,
and that information can be used to buy, sell or hold a certain stock. Often this is not done by analyzing a single
TI, but several of them, and by drawing conclusions on how a market will evolve given that information.
2.5.3 GARP Investing
GARP takes the best of value and growth investing, by looking for companies with a good intrinsic
value and good growth prospects. One of the biggest supporters of this kind of investing is Peter Lynch
(see [54]). He segmented the types of existing markets, and searched for cycles and other types of indicators
(mainly, but not only, macroeconomic indicators) that indicate the type of market in which the investor
is inserted. By doing so, he was able to adapt a better strategy to that kind of market, and use more
relevant indicators for the type of stocks he was looking for. GARP investing avoids companies with huge
growth, since those companies have a higher risk associated with them. It also avoids companies that
have a good intrinsic value but do not grow.
2.5.4 Income Investing
Income investing prefers a fixed income over the risk of investing in stocks that show
volatility. So, income investors prefer bonds to stocks, and in stock picking they pick stocks that have
high DPS values, so they can have a fixed income from their dividends.
2.6 Classification of Stocks
Several investors classify stocks according to the company's FA or TA ([55], [56], for example). This
classification allows them to group stocks with similar features, making the study of the behavior of those
stocks easier. If stocks are well classified and placed in the right cluster, it is easier to evaluate whether
a company is going to grow by evaluating its behavior within the group. With this information,
the investor is able to create investment strategies customized for each group of stocks (a better adapted
investment strategy, since behaviors will be similar). Lynch and Rothschild did this in their book [54].
Using FA they evaluated stocks according to their growth rate, capitalization and economic behavior,
and classified companies into six major types, creating specific investment strategies for each of these
types. The main characteristics of each type are explained as follows, according to the book:
• Slow Grower
These companies are usually large and aging companies, that are expected to grow only slightly
faster than the Gross National Product. Normally slow growers start out as fast growers and
eventually stop growing as fast, either because they have grown as much as they can, or because
the industry they are inserted in slows down its growth. Every Fast growing industry eventually
slows down and becomes a slow growing industry. Usually slow growers pay a generous and
regular dividend (because these companies usually can not use that money to expand business).
• Stalwart
These are usually multi-billion dollar companies that are not exactly agile climbers but grow faster
than slow growers. These companies have around 10 to 12 percent annual growth in
earnings. They can give a sizable profit if bought and sold at the right time, and are also a good
protection against recession times, since they are so big that they will not go bankrupt, and soon enough
after the recession their value will be restored.
• Fast Grower
These are small, aggressive enterprises that grow at a rate of 20 to 25 percent a year. A fast
growing company does not necessarily have to belong to a fast growing industry. All it needs is
the room to expand in a slow growing industry. Usually these upstart enterprises learn to succeed
in one place, and then replicate their winning formula over and over. These kinds of stocks are
usually risky, especially in younger companies that tend to be overzealous and underfinanced, and
underfinanced companies do not end up well during recession times. Also, Wall Street does not
look kindly on fast growers that run out of stamina and turn into slow growers. Once a fast grower
grows too big, it has trouble growing further.
• Cyclicals
These are stocks whose sales and profits rise and fall in a regular if not completely predictable
fashion. In a cyclical industry business expands and contracts, then expands and contracts again.
When coming out of a recession and into a vigorous economy, the cyclicals flourish, and their stock
prices tend to rise much faster than stalwart prices. However, during a recession the cyclicals
suffer, and so do shareholders. Buying a cyclical in the wrong part of the cycle can make one lose
a lot of money, and so, timing is everything when buying a cyclical.
• Turnaround
These are companies that have no growth at all. Sometimes turnarounds are poorly managed
cyclicals that go so far down in a cycle that people think they will never come back up. Neverthe-
less, turnarounds are companies that can make up lost ground quickly, and the best thing about
investing in successful turnarounds is that of all the categories of stocks they are the least related
to the general market. Failed turnarounds are dragged into bankruptcy, making it a very risky type
of stock, that varies between a major success and a major failure.
• Asset Plays
These are companies that own one or more valuable assets that Wall Street has overlooked (and
so has not valued the stocks accordingly). These assets can be as simple as a pile of cash or the
subscribers of a cable TV provider, for example, but usually they are real estate assets. These are
companies whose assets may be worth more now (or may become more valuable) than the value given to
the company itself. When the market realizes this value (or when the assets grow in value), the
stock prices grow accordingly.
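The growth-based part of this classification can be sketched in C++ with a simple threshold rule. This is only an illustration: the cut-off values below are hypothetical, loosely based on the growth figures quoted above (around 10 to 12 percent for stalwarts, 20 to 25 percent for fast growers), and the names are not taken from the book. Cyclicals, turnarounds and asset plays cannot be identified from a growth rate alone, so they are deliberately left out.

```cpp
// The three types that are distinguished mainly by their growth rate.
enum class GrowthType { SlowGrower, Stalwart, FastGrower };

// Hypothetical thresholds on annual earnings growth (0.10 means 10%).
struct GrowthThresholds {
    double stalwartMin = 0.10;
    double fastGrowerMin = 0.20;
};

GrowthType classifyByGrowth(double annualEarningsGrowth,
                            const GrowthThresholds& t = GrowthThresholds{}) {
    if (annualEarningsGrowth >= t.fastGrowerMin) return GrowthType::FastGrower;
    if (annualEarningsGrowth >= t.stalwartMin) return GrowthType::Stalwart;
    return GrowthType::SlowGrower;
}
```

Keeping the thresholds in a separate struct makes it straightforward to load them from a configuration file instead of hard-coding them.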
2.7 Data Set / Markets
Most work in computational finance has been done with well known and stable markets (from
strong European or American economies). This work also focuses on these kinds of markets
(specifically on the S&P500 index), not only because most work has been done with similar
markets, but also because other kinds of markets have different relationships between indicators and,
although easier to predict (because they are less efficient), are usually more unstable. According
to [57], funds located in the US that invest in emerging markets underperformed funds physically located
in the emerging markets (one of the reasons being that some of these markets have less stable financial
policies, and the market is strongly affected by policy changes). The author also shows that geographically
focused funds outperform the ones that invest globally. These factors made me choose a well known and
studied market, with easy access to information.
2.8 Related Work Results
In this section, some previous work results and data used will be analyzed and compared between
them in order to know which are the best solutions.
2.8.1 Evolutionary Computing
In [8] a hybrid approach to portfolio composition is used, combining fundamental and technical
indicators. The paper uses a Multi-Objective Evolutionary Algorithm (MOEA) with two objectives
(return and variance of returns), computes the Pareto front and tries to find solutions near it with the
technical indicators.
In [13] an approach to technical rules optimization using a GA is proposed. In this approach each
individual is an asset classifier equation that takes into account the value of the technical indicators
applied to the available price data.
In [49] a change to the Strength Pareto Evolutionary Algorithm 2 (SPEA2) is proposed
(becoming the Robust Strength Pareto Evolutionary Algorithm 2 (R-SPEA2)) in order to make it more
robust. In [50] the robustness of Multi-Objective Genetic Programming (MOGP) is also studied, and it is
concluded that mating restriction is a promising technique: mating of similar parents
makes solutions converge, and mating of dissimilar parents promotes diversity.
In [25] a tree structure representation of the GA for portfolio optimization is proposed, instead of the
more common array approach. In a tree structure GA each terminal node holds an asset and each
non-terminal node holds the weight of the subtree, and each subtree is also considered a portfolio.
In [16] GAs are used to search for the optimal period lengths, adjustment frequencies
and adjustment volumes of moving averages to predict changes in the price of crude oil, for investment
in the crude oil futures market.
In [58] a Multiple-Criteria Decision-Making (MCDM) method is applied to portfolio optimization, dividing
the criteria of return and risk into several measurements.
[59] uses standard Genetic Programming (GP) optimization, with a function set comprising a common
set of arithmetic operators and a terminal set comprising a collection of technical indicators and constants.
The objective is to consistently outperform the B&H strategy, based on the work of [60].
[61] uses GAs to decide trading strategies, in two stages: elimination of unacceptable
stocks and stock trading construction.
2.8.2 Other methods
In [62] a correlation matrix between the most significant indicators and future prices is applied,
alongside a data discretization using a Cumulative Probability Distribution Approach (CPDA) and a
Minimize the Entropy Principle Approach (MEPA). Afterwards Rough Set Theory (RST) is used to obtain
linguistic rules and a GA is used to refine them.
In [14] a mix of strategies is used, applying ANN, fuzzy logic and GA in an approach that uses VAMA
as its indicator. The system consists of three phases: the ANN system having VAMA as baseline,
the definition of fuzzy rules with the ANN outputs and the refinement of those rules by a GA.
In [32] a Fuzzy Cerebral Model Articulation Controller (FCMAC) approach to foreign exchange (Forex) is
proposed. This approach divides data into sets and uses local learning (focusing on useful local information
from observed data).
In [30] a Modular Neural Network is used, with a sliding window, the error back-propagation method and
supplementary learning as a way of retraining while avoiding overfitting.
In [21] an Adaptive Network Fuzzy Inference System is used, supplemented with Reinforcement
Learning (RL). This RL process uses reward/punishment feedback according to the environment state.
The author uses Momentum and MA indicators to discover cycles in the data, and invest using that
This chapter gives a description of the system architecture. First an overview of
the proposed solution is given, followed by a module-by-module description of the most important modules
and, lastly, an explanation of some other implementation details.
The architecture of the proposed solution was built from scratch in C++. The implementation has
22 C++ classes and about 9000 lines of code.
3.1 Module View of the Global Architecture
The overview of the module architecture of the proposed solution is presented in this section (see
figure 3.1).
There are nine main modules in the system, each one with a specific functionality. Their specific
function is described as follows:
• Download: This module is used to fetch information about companies (the stocks' raw data) and
their financial statements (the financial statements are fetched with the algorithm from [63]) and to store that
information to later be used by the Stock and Fundamental Analysis modules.
• Fundamental Analysis (FA): This module is used as a data analysis module. It processes the
information in the financial statements of the company, computes all the FIs needed and performs the
growth analysis of the variables used.
• Stock: The Stock module holds information about the stock's raw data (the quotes) and information
about the classification of the stock, given by the Classifier and the Clustering modules in each
quarter.
• Classifier: The Classifier module is used by the Stock module. It uses a configuration file to define
thresholds, and uses those thresholds to classify the stock in a certain quarter, according to the
information given by the Fundamental Analysis module.
• Clustering: This module is also used by the Stock module, and its objective is to attribute a cluster
to a stock in a certain quarter, using the Clustering Genetic Algorithm module, according to the
information given by the Fundamental Analysis module. It uses an unsupervised data mining
technique, inspired by the ACGA algorithm from [44], which uses a GA to optimize cluster positioning.
The number of clusters is fixed (given by the user as input) and the algorithm simply
optimizes where each one should be located. The number of clusters used in this work was 5, in order
to have the same number of clusters as the number of types created by the Classifier module.
• Genetic Algorithm (GA): This module contains two submodules: the Clustering Genetic Algorithm
submodule, used by the Clustering module to optimize cluster positions (with a fixed number
of clusters), and the Fundamental Analysis Genetic Algorithm submodule, used by the Investment
Simulator module to optimize the Fundamental Indicators' weights to give buy and sell signals.
• Investment Simulator: This is the module responsible for the kind of portfolio that is created. It is
used by the Investor module, and uses both the Stock and Fundamental Analysis Genetic Algorithm
modules. It creates the Portfolio module, and is responsible for giving the buy and sell signals
to the Portfolio module, bridging the Investor module and the Portfolio module.
• Portfolio: The Portfolio module is used by the Investment Simulator module, and uses the Stock
module to retrieve quote information. It saves the state of the portfolio: the stocks
that are currently in the portfolio, when they were bought, at which price they were bought, and
the current return of the portfolio. It is also responsible for getting the necessary stock information
from the Stock module to simulate buying and selling.
• Investor: This is the main module of the system: it receives user input, coordinates data flow and
calls every other method as necessary. It is responsible for: using the Download module to fetch
information, creating the Fundamental Analysis and Stock modules (one for each company) using
the information from the Download module, receiving user input and distributing it to the
modules that use it, and giving information to the Investment Simulator module for the portfolio
construction.
Figure 3.1: Modules view of the architecture - The UML schematic shows which modules are used by each module. The full arrow in the GA modules represents inheritance, broken line arrows mean usage without instantiation, and full line arrows mean association
3.2 General System Dataflow
The overview of the general dataflow of the proposed solution is presented in this section (see figure
3.2).
Figure 3.2: Modules Data Flow - User input and the system output are represented with a box, every module has its name and functionality, and the arrows represent data flow
A step by step description of the data flow of the algorithm is given as follows:
• The system starts by receiving system parameters as inputs. In this work this is done with a
configuration file for general configuration (GA parameters), a second file with parameters for the
Stock Classifier (with the thresholds used) and a third file with the list of stock tickers used in this
work.
• Second, the system creates the Download module to fetch stock information.
This includes downloading quotes from the Yahoo site (constructing the URL and using GNU's wget
to retrieve the CSV files with the stocks' information), and also downloading, sorting and rewriting the
financial statements (Balance Sheets, Income Statements and Cash Flow Statements) from the
• TICKS: Are the tickers of the Stocks wanted. Several tickers can be put together with the sign +.
• FLAGS: Are the flags that specify the information required by the API (for more information check
http://www.jarloo.com/yahoo finance/).
This will give the necessary information to create the Stock raw data. The information will be down-
loaded in the format described in figure 3.3, and saved in the Stock module.
Figure 3.3: Structure of the stock’s raw data downloaded from the Yahoo API
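The way several tickers are combined with the + sign for the TICKS parameter can be sketched in C++ as follows. Only the joining step is shown; the function name is hypothetical, the ticker symbols in the usage note are illustrative, and the rest of the request URL is not reproduced here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Joins ticker symbols with '+' to build the value of the TICKS parameter.
std::string joinTickers(const std::vector<std::string>& tickers) {
    std::string out;
    for (std::size_t i = 0; i < tickers.size(); ++i) {
        if (i > 0) out += '+';
        out += tickers[i];
    }
    return out;
}
```

For example, joining the symbols `AAPL`, `MSFT` and `GOOG` yields `AAPL+MSFT+GOOG`.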
3.3.2 Stock
The stock module is where company’s data is saved. This includes stock’s raw data (quotes, adjusted
quotes, the stock quote higher and lower point of the day, the volume of transactions on that day and the
date of the quotes) and the Fundamental Analysis of the company.
The module also holds information about the Classification attributed by the Classifier, and the Clus-
ter given by the Clustering module. We can think of this module as the database from which information
to simulate investment is retrieved (see figure 3.4).
3.3.3 Fundamental Analysis
The Fundamental Analysis module is built from the public information about the companies. This
data is obtained from three spreadsheets (Balance Sheet, Income Statement and Cash Flow
Statement), each one with information organized by quarter. For a better understanding of the structure of
the module see figure 3.5.
For further detail on how data is processed see section 3.4.1.
Figure 3.4: Representation of the Stock module
After all the information is retrieved, this module uses this information to create several indicators
and performs evaluations on the growth of the variables obtained by the sheets. This growth analysis is
made annually (compares the same quarter in different years).
Figure 3.5: Representation of the FA module
This is also the module where Fundamental Indicators are computed to later be optimized by the
GA module and used by the Investment Simulator module to give buy and sell signals. Each indicator
is modified if needed for the objective to be a maximization of the indicator. The indicators used are
described as follows:
• Debt
Assuming that no company in the S&P500 index has more total debt than total assets, the DR will
always be $\in [0, 1]$; however, the objective would be to minimize this indicator. To change this minimization
objective into a maximization one (as required so that the objective for every indicator is to maximize
it), instead of calculating the percentage of debt, the indicator calculates the percentage of
the company that is not in debt, i.e. $\frac{\text{Total Assets}}{\text{Total Assets}} - DR$. The Debt indicator used in this work is
then:
$$Debt_{indicator} = \frac{\text{Total Assets}}{\text{Total Assets}} - DR = 1 - \frac{\text{Total Debt}}{\text{Total Assets}} \tag{3.1}$$
• PR
The Payout Ratio, as described in Chapter 2, is the percentage of net income distributed to the
investors as dividends. Since financially healthier companies have a bigger PR, the objective will
be to maximize it, so the PR indicator used in this work will be:
$$PR = \frac{DPS}{EPS} \tag{3.2}$$
• ROE
The Return on Equity is described in Chapter 2, and in this work, this indicator will be used as it is,
and the objective will simply be to maximize it. The indicator used is:
$$ROE = \frac{\text{Net Income}}{\text{Total Equity}} \tag{3.3}$$
• PM
The objective in this work will be to maximize the indicator as it is explained in Chapter 2:
$$PM = \frac{\text{Net Income}}{\text{Revenue}} \tag{3.4}$$
• RG
Although Revenue Growth (explained in Chapter 2) is already used in the Classification and Clustering
of stocks, it is also used as an indicator to choose the stocks that grow the most inside each group. The
indicator used is:
$$RG = \Delta Revenue = \frac{Revenue_{Actual} - Revenue_{LastYear}}{Revenue_{LastYear}} \tag{3.5}$$
• NIG
NI growth will be used as an indicator, as explained in Chapter 2. The indicator used is:
$$NIG = \Delta NI = \frac{NetIncome_{Actual} - NetIncome_{LastYear}}{NetIncome_{LastYear}} \tag{3.6}$$
• ∆ RG
Sometimes the growth of a company does not affect the growth of the stock quote as much as the
prospect of growth does. Since the market is sensitive to changes in predicted returns, if a company
grows more (or less) than it is supposed to, market confidence in that stock may change. This
indicator reflects the growth momentum of revenue, assuming that a bigger momentum creates
more confidence in a company, and so it is represented by:
$$\Delta RG = \Delta\Delta Revenue = \frac{RG_{Actual} - RG_{LastYear}}{RG_{LastYear}} \tag{3.7}$$
Since it compares the momentum of annual growth, this indicator will only be available after the
second year of analysis.
• ∆ NIG
This indicator plays the same role as ∆RG, applied to NI: it measures changes in NI growth
between years, indicating whether the NI growth of a company is slowing down or not. The
indicator used is:
$$\Delta NIG = \Delta\Delta NI = \frac{NIG_{Actual} - NIG_{LastYear}}{NIG_{LastYear}} \tag{3.8}$$
• CFOA
This indicator (as explained in Chapter 2) will be used to choose companies that create cash flow
income from operating activities. The indicator used is:
$$\Delta CFOA = \frac{CFOA_{Actual} - CFOA_{LastYear}}{CFOA_{LastYear}} \tag{3.9}$$
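As an illustration of equations 3.1 to 3.9, the following Python sketch computes the indicators of one quarter against the same quarter of the previous year. The `QuarterReport` container, its field names and the function names are hypothetical (they are not taken from the implementation described in this work), and zero denominators are deliberately not handled.

```python
from dataclasses import dataclass


@dataclass
class QuarterReport:
    """Hypothetical container with the figures needed for one quarter."""
    total_assets: float
    total_debt: float
    total_equity: float
    net_income: float
    revenue: float
    dps: float   # dividends per share
    eps: float   # earnings per share
    cfoa: float  # cash flow from operating activities


def relative_growth(actual: float, last_year: float) -> float:
    """(actual - last_year) / last_year, the form shared by eqs. 3.5 to 3.9."""
    return (actual - last_year) / last_year


def indicators(now: QuarterReport, last_year: QuarterReport) -> dict:
    """Indicators of eqs. 3.1 to 3.6 and 3.9, all meant to be maximized."""
    return {
        "Debt": 1.0 - now.total_debt / now.total_assets,              # eq. 3.1
        "PR": now.dps / now.eps,                                      # eq. 3.2
        "ROE": now.net_income / now.total_equity,                     # eq. 3.3
        "PM": now.net_income / now.revenue,                           # eq. 3.4
        "RG": relative_growth(now.revenue, last_year.revenue),        # eq. 3.5
        "NIG": relative_growth(now.net_income, last_year.net_income), # eq. 3.6
        "dCFOA": relative_growth(now.cfoa, last_year.cfoa),           # eq. 3.9
    }
```

The momentum indicators of equations 3.7 and 3.8 reuse the same helper on two consecutive yearly growths, e.g. `relative_growth(rg_now, rg_last_year)`, which is why they only exist from the second year of analysis onwards.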
3.3.4 Classifier
This module does the classification of stocks for each quarter, based on the approach used by Peter
Lynch in [54], explained in section 2.6.
Although the book is old, and the economy changes at a fast pace, there are underlying theories that
still make sense (and can be applied) in the present, with adaptations. The book was written at the end of the
1980's, when the economic context was completely different, and because of this the reference parameters
used by the author are not suitable. The update of these parameters, made both for economic context and for
ease of implementation, is explained in the following paragraphs.
Although the author identifies six types of stocks, only five types are used in this work (not all types
are directly deduced from the book), mainly because they are the ones that can be identified by directly
analyzing companies' accounts. Cyclicals and Asset Plays are two types of stocks from the book
not used in this work. Cyclicals need a more careful analysis, looking for patterns in the quotes and
in the accounts, so that one can determine the cycle parameters and the part of the cycle a company is in.
Asset Plays are mainly based on the evaluation of the companies' assets, which requires a careful
examination of those assets that cannot be done from spreadsheets.
There is a new type of stock used in this work, which was introduced to cover the whole Size × Growth
space. This type represents the small stocks with normal and good growth (see the following paragraphs
to understand these classifications, and figure 3.6 to better understand the structure) and
is given the name Potential Stocks.
Figure 3.6: Types' Quadrants (Growth axis: Very Bad, Bad, Normal, Good, Very Good; Size axis: Small, Medium, Big; regions: Fast Grower, Stalwart, Slow Grower, Potential, Turn Around)
The Classifier is the module that defines to which of the five types a stock belongs, and
also classifies the company's financial health, all evaluated through the company's FA. The
classification is given quarterly, so a company may be of one type in a quarter and change its classification
in the next one. The information needed for the classification made by this module is given by user
input; however, one could implement a fuzzy system instead of a human input classifier, in order to make
the system more sophisticated, and possibly to get better returns.
The type given takes into account only the size and the growth of the company (see figure
3.6), and not the financial health, which was implemented for possible human analysis.
• Size
The size of a company is classified by giving classifications to assets (using thresholds given as user
input), and then averaging the classification given to each one. Each asset is classified by averaging
its value over the last year (for further detail on how data is processed see section
3.4.1) and comparing that value with two thresholds. These thresholds indicate whether the asset is
small, medium or big. After all the assets are classified, the company's size is classified by
averaging all those classifications and rounding the result (e.g. if 3 assets are taken into account, 2 of
them classified as big and the third one as medium, the company is classified as big).
The classification is given using thresholds in the following way:
– Classification = 1 → $Value < TH_{Low}$
– Classification = 2 → $TH_{Low} \le Value < TH_{High}$
– Classification = 3 → $TH_{High} \le Value$
Afterwards the Classification is averaged:
$$TotalClassification = \frac{\sum_{i=1}^{N} Classification_i}{N} \tag{3.10}$$
In equation 3.10 N is the total number of assets to be classified, and $Classification_i$ is the
classification given to asset i. Finally, the classification given to the size is:
– Small → Total Classification ∈ ]0, 1.5[
– Medium → Total Classification ∈ [1.5, 2.5[
– Big → Total Classification ∈ [2.5, 3]
In this work the only size indicator used is the last year's Total Assets average, and the thresholds
are:
– $TH_{Low}$ = 5B (footnote 2)
– $TH_{High}$ = 10B
• Growth
The Growth of the company is classified in a similar way (by classifying the growth of user input
variables and then averaging the classifications), although there are some differences. The growth
classification has five possible outcomes (very bad, bad, normal, good, very good),
unlike the size classification, which has only three. One out of 10 possible classifications is given to
each variable and, as in the size classification, the growth classification is the average of
the classifications of all the variables given as user input.
Footnote 2: 5B and 10B represent 5 and 10 billion dollars, respectively.
The growth is measured yearly (between the same quarter of different years). For further detail on
how this is done see section 3.4.1.
The procedure is similar to the one used to classify the size:
– Classification = 1→ Indicator < TH1
– Classification = 2→ TH1 ≤ Indicator < TH2
– Classification = 3→ TH2 ≤ Indicator < TH3
– Classification = 4→ TH3 ≤ Indicator < TH4
– Classification = 5→ TH4 ≤ Indicator < TH5
– Classification = 6→ TH5 ≤ Indicator < TH6
– Classification = 7→ TH6 ≤ Indicator < TH7
– Classification = 8→ TH7 ≤ Indicator < TH8
– Classification = 9→ TH8 ≤ Indicator < TH9
– Classification = 10→ TH9 ≤ Indicator
Afterwards the Classification is averaged:
$$TotalClassification = \frac{\sum_{i=1}^{N} Classification_i}{N} \tag{3.11}$$
In equation 3.11 N is the total number of variables to be classified, and $Classification_i$ is the
classification given to variable i. Finally, the classification given to the growth is:
– Very Bad→ Total Classification ∈]0, 2]
– Bad→ Total Classification ∈]2, 4]
– Normal→ Total Classification ∈]4, 6]
– Good→ Total Classification ∈]6, 8]
– Very Good→ Total Classification ∈]8, 10]
In this work the only growth indicator used is the revenue yearly growth, and the thresholds are:
– TH1 = −0.2
– TH2 = −0.1
– TH3 = −0.05
– TH4 = −0.02
– TH5 = 0
– TH6 = 0.02
– TH7 = 0.05
– TH8 = 0.1
– TH9 = 0.2
• Health
The Health evaluation of a company takes into account the amount of debt the company has. It is
classified in a similar way to the size (by taking an annual average of the parameters and comparing
it to two thresholds). The financial health of a company does not interfere with its type (there
may be several companies of the same type with different financial health); it is simply indicative,
for human analysis.
In this work the only financial Health indicator used is the last year's DR indicator average, and the
thresholds are:
– THLow = 0.3
– THHigh = 0.7
Figure 3.7: Classifier Structure: (a) Health, (b) Size, (c) Growth
• Type
The type of the stock is, as said before, obtained from the evaluation of the size and growth of a
company, based on the approach presented by Peter Lynch in [54]. After a stock has its growth
and size evaluated, the type is determined by combinations between them.
The 5 classifications given in this work are (see figure 3.6):
– Slow Grower - These are the companies that are considered Medium in terms of size and
had a Normal or Good growth classification
– Stalwart - These are the companies that are considered Big in terms of size and had a Normal,
Good or Very Good growth classification
– Fast Grower - These are the companies that are considered Small or Medium in terms of size
and had a Very Good growth classification
– Potential - These are the companies that are considered Small in terms of size and had a
Normal or Good growth classification
– Turn Around - These are all the companies that had a Bad or Very Bad growth classification,
independently of their size.
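The Size, Growth and Type rules above can be summarized in a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, a single indicator per axis is used in the example (as in this work), and the thresholds are the ones quoted in the text.

```python
from bisect import bisect_right
from statistics import mean

# Thresholds quoted in the text (size in dollars, growth as a fraction).
SIZE_THRESHOLDS = [5e9, 10e9]
GROWTH_THRESHOLDS = [-0.2, -0.1, -0.05, -0.02, 0.0, 0.02, 0.05, 0.1, 0.2]


def classify(value, thresholds):
    """Return 1 + the number of thresholds that are <= value."""
    return bisect_right(thresholds, value) + 1


def size_label(total):
    """Map the averaged size classification (1 to 3) to a label."""
    if total < 1.5:
        return "Small"
    if total < 2.5:
        return "Medium"
    return "Big"


def growth_label(total):
    """Map the averaged growth classification (1 to 10) to a label."""
    for upper, label in ((2, "Very Bad"), (4, "Bad"), (6, "Normal"), (8, "Good")):
        if total <= upper:
            return label
    return "Very Good"


def stock_type(size, growth):
    """Combine size and growth labels into one of the five types."""
    if growth in ("Very Bad", "Bad"):
        return "Turn Around"
    if growth == "Very Good":
        return "Stalwart" if size == "Big" else "Fast Grower"
    # Remaining cases: Normal or Good growth.
    return {"Small": "Potential", "Medium": "Slow Grower", "Big": "Stalwart"}[size]


def classify_stock(size_values, growth_values):
    """size_values / growth_values: one entry per user-chosen variable."""
    size = size_label(mean([classify(v, SIZE_THRESHOLDS) for v in size_values]))
    growth = growth_label(mean([classify(v, GROWTH_THRESHOLDS) for v in growth_values]))
    return stock_type(size, growth)
```

For instance, `classify_stock([7e9], [0.03])` describes a company with 7 billion dollars of average Total Assets and 3% of yearly revenue growth: the size classification is 2 (Medium) and the growth classification is 7 (Good), so the stock falls in the Slow Grower region of figure 3.6.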
3.3.5 Clustering
The clustering in this work is not treated as a typical clustering problem; instead it is used for
classification, given a fixed number of clusters. This module creates five clusters each quarter, and
associates each stock to the nearest cluster in the Growth × Size space. It uses the GA module to
optimize the clusters' positions, given at least one year of training. The number of clusters (5) was
chosen taking into account the number of types created by the user input classifier (also five), to check if
there was any kind of resemblance between the two classification methods. After the GA module outputs
the clusters' locations in the search space, the clustering module associates each company to a cluster
by minimizing the Euclidean distance between company and clusters (it chooses the cluster with minimum
distance). Since the two axes come in very different units (size comes in billions, and growth comes
as a fraction representing the percentage), values are scaled to help the algorithm converge. This is
done such that 1B$ (footnote 3) in assets represents a distance equivalent to 1% in growth, both representing a unit
distance from the origin. These scaled units are given in equation 3.12.
$$Size = \frac{\text{Total Assets [Million \$]}}{1000} \tag{3.12a}$$
$$Growth = \Delta Revenue \times 100 \tag{3.12b}$$
Footnote 3: B stands for the American billion, 1000 million in European units.
The growth measure is the annual revenue growth (for more details on how data is processed check
section 3.4.1), and the size measure is the average of the Total Assets over the last year.
After trying different scales, this one was the most successful in clustering the same type of
stocks, since companies' data is so sparse (the difference in size between companies is very big compared
to the difference in growth). The most intuitive normalization would be to normalize both values over the
maximum value of that quarter; however, this would make the density of points near the origin
too high for the algorithm to converge properly.
There will be 5 clusters (the same number as the types in the Classifier module), labeled from A
to E, and they will move each quarter (the GA recomputes the best locations for the clusters every
quarter). So, to maintain consistency, the first cluster (cluster A) will always be the one nearest to the
origin of the plane (Origin = Coordinates (0, 0)), B the second nearest, and so on. This way we can
check in a coherent way if a stock changed its cluster.
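The scaling of equation 3.12, the labeling of clusters by distance to the origin and the nearest-cluster association can be sketched as follows. This is a minimal illustration with hypothetical function names and made-up centroid positions; in this work the centroids are produced by the GA module.

```python
import math


def to_plane(total_assets_millions, revenue_growth):
    """Eq. 3.12: size in billions of dollars, growth in percent."""
    return (total_assets_millions / 1000.0, revenue_growth * 100.0)


def label_clusters(centroids):
    """Name the centroids A, B, C, ... by increasing distance to the origin."""
    ordered = sorted(centroids, key=lambda c: math.hypot(c[0], c[1]))
    return {chr(ord("A") + i): c for i, c in enumerate(ordered)}


def nearest_cluster(point, clusters):
    """Return the name of the cluster with minimum Euclidean distance."""
    return min(
        clusters,
        key=lambda name: math.hypot(point[0] - clusters[name][0],
                                    point[1] - clusters[name][1]),
    )


# Illustrative centroids in the scaled (Size, Growth) plane.
clusters = label_clusters([(300.0, 5.0), (10.0, 2.0), (50.0, 20.0)])
small_stock = to_plane(12_000.0, 0.03)   # 12 B$ in assets, 3% growth
mid_stock = to_plane(60_000.0, 0.15)     # 60 B$ in assets, 15% growth
```

Because the labels are reassigned every quarter from the distance to the origin, cluster A of one quarter and cluster A of the next one are comparable even though the centroids themselves move.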
3.3.6 GA
This module has two functionalities (divided into two submodules). One defines the weights given
to the fundamental indicators, used with the Investment Simulator module in order to give buy/sell signals.
The other is used with the Clustering module to optimize the location of the clusters in the plane. The way
each works is described as follows:
• Fundamental Analysis Genetic Algorithm (FA GA)
This is where all the training phases used for the different portfolios occur. Since the Portfolios use
Fundamental Indicators (FI), and FI require a longer term analysis than Technical Indicators, each
time unit is a quarter. A generation is an iteration of the GA over the last 4 quarters.
See figure 3.8 for the structure of a chromosome.
Figure 3.8: Fundamental Indicators’ Chromosome Representation
A pseudocode of the GA used is described:
Algorithm 3.1: Used GA
    g ← 0 ;
    t ← current Quarter − 4 ;
    P(g) ← random ;
    while g != number of generations do
        while t != current Quarter do
            Investment Simulation from P(t) ;
        fitness ← Returns from P(t) Simulation ;
        Pp(g) ← Selection of parents from P(t) ;
        Pc(g) ← Crossover from Pp(t) ;
        Pm(g) ← Mutation of Pc(t) ;
        P(g + 1) ← New Generation Creation from (P(t), Pm(t)) ;
        g ← g + 1 ;
In algorithm 3.1, t are trimesters (quarters) and g are generations.
At the beginning of the algorithm the population is generated randomly. Each FI weight is initialized
as in equation 3.13a, the buy signal value is initialized as in equation 3.13b, and the sell signal value
is initialized as in equation 3.13c.
$$r \in [0, 1] \tag{3.13a}$$
$$b = \sum_{i=1}^{N} r_i a_i \tag{3.13b}$$
$$s = \frac{b}{2} \tag{3.13c}$$
In equations 3.13 N is the number of indicators used, r and a are different random numbers, b
is the buy signal value and s is the sell signal value. Even though b can take values in [0, N],
drawing a single random number in this interval would not simulate the randomness of indicators and weights.
To construct a truly random b one has to create N random numbers r to simulate weights, N
different random numbers a to simulate indicators' values, and apply equation 3.13b. The s
value is calculated as in equation 3.13c to guarantee that it is smaller than the b value, and that it
has a substantial percentage difference from the b value.
A buy signal is given if the sum of the weights times the values of the fundamental indicators is
above b, and a sell signal is given if this sum is below s, as described in equation 3.14. Signal is
the type of signal given to the Portfolio module to simulate a buy or a sell, $v_i$ is the value of indicator
i and $w_i$ is the weight of indicator i. The average of the top 5 individuals of the algorithm
is used at the end to define the values used by the Investment Simulator.
$$Signal = \begin{cases} BUY \rightarrow \sum_{i=1}^{N} v_i \times w_i > b \\ SELL \rightarrow \sum_{i=1}^{N} v_i \times w_i < s \end{cases} \tag{3.14}$$
The fitness function will be the ROI of each individual, when simulating investment.
$$Fitness = ROI = \frac{\text{Return} - \text{Initial Investment}}{\text{Initial Investment}} \tag{3.15}$$
The iterations start by simulating investment with the Fundamental Indicators’ weights and the
buy/sell values of the chromosomes. The returns from the simulations with each individual will be
the fitness of that individual.
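The initialization of equations 3.13, the signal rule of equation 3.14 and the fitness of equation 3.15 can be sketched in Python as below. The `Chromosome` class and the function names are hypothetical. Two points are assumptions of this sketch rather than statements of this work: the numbers r used to build b are drawn separately from the chromosome's own weights, and no signal (`None`) is returned when the weighted sum lies between s and b.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chromosome:
    weights: List[float]  # one weight per fundamental indicator
    buy: float            # buy signal value b
    sell: float           # sell signal value s


def random_chromosome(n_indicators: int, rng: random.Random) -> Chromosome:
    """Random individual following eqs. 3.13a to 3.13c."""
    weights = [rng.random() for _ in range(n_indicators)]  # eq. 3.13a
    r = [rng.random() for _ in range(n_indicators)]        # simulated weights
    a = [rng.random() for _ in range(n_indicators)]        # simulated indicator values
    buy = sum(ri * ai for ri, ai in zip(r, a))             # eq. 3.13b
    return Chromosome(weights, buy, buy / 2.0)             # eq. 3.13c


def signal(chromosome: Chromosome, values: List[float]) -> Optional[str]:
    """Eq. 3.14: compare the weighted sum of indicators with b and s."""
    score = sum(v * w for v, w in zip(values, chromosome.weights))
    if score > chromosome.buy:
        return "BUY"
    if score < chromosome.sell:
        return "SELL"
    return None  # assumption: no action between s and b


def roi(final_value: float, initial_investment: float) -> float:
    """Eq. 3.15: fitness of an individual after the simulation."""
    return (final_value - initial_investment) / initial_investment
```

Setting s to half of b leaves a band between the two values, so an individual does not alternate between buying and selling when the weighted sum oscillates slightly around a single threshold.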
• Clustering GA
This is where the training and validation of the algorithm that optimizes the clusters' positions occur,
inspired by the Automatic Clustering Genetic Algorithm from [44]. The GA module receives as
input the size of the chromosome and the number of desired clusters (a fixed value) and runs the
GA to find the best locations for the clusters' centroids; the output is the centroids' locations in the
Size × Growth plane. It is relevant to note that
it will use the stocks' FA to calculate their positioning in the plane, and use the Calinski-Harabasz
index as fitness function. See figure 3.9 for the structure of a chromosome.
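For reference, the Calinski-Harabasz index in its standard form is the ratio between the between-cluster dispersion (divided by k − 1) and the within-cluster dispersion (divided by n − k), for n points and k clusters. The sketch below implements that standard definition in plain Python, taking the mean of the members of each cluster as its center. How exactly the GA centroids enter the fitness computation in this work is not shown in this section, so this should be read as an assumption-laden illustration with hypothetical names.

```python
def _mean_point(points):
    """Arithmetic mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def _sq_dist(p, q):
    """Squared Euclidean distance between two (x, y) points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def calinski_harabasz(points, labels):
    """Standard Calinski-Harabasz index; higher means better separated clusters."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    n, k = len(points), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more points than clusters")
    overall = _mean_point(points)
    between = 0.0
    within = 0.0
    for members in groups.values():
        center = _mean_point(members)
        between += len(members) * _sq_dist(center, overall)
        within += sum(_sq_dist(p, center) for p in members)
    # A zero within-cluster dispersion (all points on their centers) is not handled.
    return (between / (k - 1)) / (within / (n - k))
```

A partition that groups nearby points together yields a much larger index than one that mixes distant points, which is what makes the index usable as a fitness to maximize.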
The GA, apart from the usual GA parameters (such as population size, number of generations,
etc.), uses two user inputs:
– Number of possible cluster positions
– Number of solutions (or clusters) created
The chromosomes are built from cluster points and have an auxiliary structure called activation
values, in equal numbers, since the activation values determine whether a certain cluster point
is going to be used or not. The number of cluster points and of activation values is the
number of possible cluster positions given by the user. Cluster points and activation values are such that:
– Cluster Point is a tuple (size, growth), where size ∈ R+ and growth ∈ R
– Activation value is a number n ∈ [0, 1]
Figure 3.9: Clustering Chromosome Representation
A stock's position in the Size × Growth plane is normalized before computing centroids and
distances, as described in equation 3.16 (see section 3.3.5 for an explanation of why this
normalization was used).
$$Size = \frac{\text{Total Assets}_{\text{Last Year Average}}}{1000} \tag{3.16a}$$
$$Growth = \Delta Revenue_{\text{Last Year Average}} \times 100 \tag{3.16b}$$
At first, the maximum size of all stocks in the first year (the minimum training period is 1 year) is
obtained and used as a reference ($max_{size}$). The maximum growth is also computed and used
as a reference ($max_{growth}$). Afterwards chromosomes are randomly initialized, by assigning random
values to Size and Growth, in order to create the N different possible points. This is done as in
equation 3.17.
$$Size_{random} = \frac{max_{size}}{2} \times r_1 \tag{3.17a}$$
$$Growth_{random} = \frac{max_{growth}}{2} \times r_2 \tag{3.17b}$$
In equation 3.17 $r_1 \in [0, 1]$ and $r_2 \in [0, 1]$ are distinct random numbers and $max_{size}$, $max_{growth}$
are the maximum size and growth measured in the first year.
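A minimal Python sketch of the random initialization of equation 3.17 is given below. The function name and the example reference values are hypothetical.

```python
import random


def random_cluster_points(n_points, max_size, max_growth, rng):
    """Create n_points candidate cluster points following eq. 3.17."""
    points = []
    for _ in range(n_points):
        size = (max_size / 2.0) * rng.random()      # eq. 3.17a, r1 in [0, 1]
        growth = (max_growth / 2.0) * rng.random()  # eq. 3.17b, r2 in [0, 1]
        points.append((size, growth))
    return points


# Example with illustrative reference values (scaled units of eq. 3.16).
initial_points = random_cluster_points(8, max_size=900.0, max_growth=80.0,
                                       rng=random.Random(1))
```

With r1 and r2 in [0, 1], every initial point lies in the rectangle [0, max_size/2] × [0, max_growth/2] of the scaled plane.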
Activation values, which are an auxiliary structure whose only purpose is to evaluate which clusters
have more members (the GA does not apply to the activation values), are computed after this
initialization, by measuring the percentage of stocks that belong to each cluster.
First, stocks are assigned to clusters. To measure the distance between stock positions and cluster
positions, the distance function used is the Euclidean distance, or L2 norm, as in equation 3.18.
$$\|A - B\| = \sqrt{(x_A - x_B)^2 + (y_A - y_B)^2} \tag{3.18}$$
Applied to this specific problem, it takes the form of equation 3.19.
Figure 4.1: User input classification portfolios’ accumulated returns. The key is the following: SG - Slow Growers;SW - Stalwarts; FG - Fast Growers; POT - Potentials; TA - Turnarounds
Figure 4.2 shows that Stalwarts was the type with the highest rate of trade success, and
Turnarounds the type with the worst rate of trade success, by a large margin. Stalwarts was also the
type with the biggest gain in a single trade and, surprisingly, Fast Growers was the type whose
biggest gain was the smallest, by a large margin; nevertheless, it was also the type whose biggest loss was the smallest. The type
with the biggest loss in a trade was Turnarounds. Both Fast Growers and Stalwarts obtained
more positive quarters than any other type, and Turnarounds got the worst result once more, with only 62,5%
of positive quarters. Fast Growers got the biggest average return per quarter, more than twice the
Turnarounds average.
One of the surprising things here is that Fast Growers got the best Sharpe ratio of all portfolios,
even though it was the one with the biggest return, meaning that it was not only the portfolio with the
best returns, but also the portfolio carrying the least risk. The Fast Growers type did not have a big drawdown
(compared with the other types), although the smallest drawdown was that of the Slow Growers
type. The biggest drawdown was that of the Turnaround type.
Figure 4.2: User input portfolios metrics table. The key is the following: SG - Slow Growers; SW - Stalwarts; FG - Fast Growers; POT - Potentials; TA - Turnarounds
In figure 4.3 one can see that Fast Growers and Potentials were the only types (including the B&H
and the S&P500 index) that got positive results every year. It is also visible that Turnarounds got not
only the biggest return in the time period (in 2013) but also the biggest loss (in 2015), showing the
high volatility of that type.
Figure 4.4 presents the quarterly, accumulated and yearly results of the portfolios, with the worst
quarter of each portfolio marked in red (this does not mean the drawdown occurs in the same quarter, since
negative quarters have a bigger impact on portfolios that have bigger accumulated returns).
4.2.1.A Conclusion
The results of Fast Growers (the best type) and Turnarounds (the worst type) were unsurprising;
however, the same cannot be said about the other types, because without these results there was no way of
knowing what kind of result one could expect.
It needs to be mentioned that these results were obtained with thresholds that were hardwired from
the start. This means that the user who defines these thresholds has access to information about all
the data from the past, and can make a decision about the thresholds using that information. Unless the
user is a financial expert (or at least knows about the subject), the robustness of this classification (to
do it in real-time with the market) is uncertain, since it depends on how well the user understands the
Figure 4.5: Clustering classification portfolios’ accumulated returns. The key is the following: A - Cluster A; B -Cluster B; C - Cluster C; D - Cluster D; E - Cluster E.
biggest loss.
Cluster E also has the greatest drawdown by far, with the returns of this cluster falling from
101,68% in the second quarter of 2015 to 77,34% in the fourth quarter of 2015, as seen in figure 4.5.
This explains the low value of the Sharpe ratio. Cluster B is the least risky portfolio, having the lowest
drawdown and the biggest Sharpe ratio.
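The relation between quarterly and accumulated returns can be checked with a short Python sketch that compounds the quarterly returns of cluster E, as read from figure 4.12. The function names are hypothetical, and the drawdown function uses the usual peak-to-trough definition, which is an assumption here since the definition adopted in this work is not restated in this section. Small differences to the published figures are expected, because the inputs are already rounded to two decimals.

```python
def accumulate(quarterly_pct):
    """Compound quarterly returns (in %) into accumulated returns (in %)."""
    wealth = 1.0
    accumulated = []
    for q in quarterly_pct:
        wealth *= 1.0 + q / 100.0
        accumulated.append((wealth - 1.0) * 100.0)
    return accumulated


def max_drawdown(accumulated_pct):
    """Largest relative fall from a previous peak, in % of the peak value."""
    peak = 1.0  # wealth before the first quarter
    worst = 0.0
    for a in accumulated_pct:
        wealth = 1.0 + a / 100.0
        peak = max(peak, wealth)
        worst = max(worst, (peak - wealth) / peak)
    return worst * 100.0


# Quarterly returns of cluster E, 2012 Q1 to 2015 Q4, in % (figure 4.12).
CLUSTER_E = [28.17, -5.59, 12.58, 9.41, 6.81, 6.86, 4.77, 9.16,
             -1.37, 4.55, -1.67, 3.53, -3.58, 2.40, -11.82, -0.29]

cluster_e_accumulated = accumulate(CLUSTER_E)
cluster_e_drawdown = max_drawdown(cluster_e_accumulated)
```

The compounding also illustrates why a negative quarter weighs more on a portfolio with a high accumulated return: the same percentage loss is applied to a larger accumulated value.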
Figure 4.7 shows a yearly decrease in the returns of cluster E. One can also see that cluster B
was the only portfolio to have positive returns every year.
Figures 4.8 and 4.9 show how companies are divided into clusters. It is remarkable how sparse the data
is along the Size axis. Cluster A, as said before, is the biggest cluster, containing most of the
companies, and cluster E has only 4 companies in the fourth quarter of 2012 and 3 companies in the
fourth quarter of 2013. This is due to a shift to the right of the center of the cluster, which made the
leftmost company in cluster E move into cluster D.
Figures 4.10 and 4.11 show a zoom of the first 3 clusters. The main difference between cluster
A and cluster B is in the Growth axis, which explains why cluster B got its results. The fact that cluster
B got the best Sharpe ratio and the smallest drawdown can now be related to the revenue growth. One
Figure 4.7: Cluster’s portfolios yearly returns. The key is the following: A - Cluster A; B - Cluster B; C - Cluster C;D - Cluster D; E - Cluster E.
Figure 4.8: Clusters' representation in the whole plane in the fourth quarter of 2012 (plot title: Clusters 2012 Q4; axes: Size [1.000.000.000$] and Growth [%]). Cluster centers have the key "Cluster X", where X is the name of the cluster (A, B, C, D or E), and the points belonging to a certain cluster have only the letter of that cluster.
Figure 4.9: Clusters' representation in the plane in the fourth quarter of 2013 (plot title: Clusters 2013 Q4; axes: Size [1.000.000.000$] and Growth [%]). Cluster centers have the key "Cluster X", where X is the name of the cluster (A, B, C, D or E), and the points belonging to a certain cluster have only the letter of that cluster.
Figure 4.10: Clusters' representation in the plane in the fourth quarter of 2012, zoomed in on the first three clusters (plot title: Clusters 2012 Q4; axes: Size [1.000.000.000$] and Growth [%]). Cluster centers have the key "Cluster X", where X is the name of the cluster (A, B, C), and the points belonging to a certain cluster have only the letter of that cluster.
Figure 4.11: Clusters' representation in the plane in the fourth quarter of 2013, zoomed in on the first three clusters (plot title: Clusters 2013 Q4; axes: Size [1.000.000.000$] and Growth [%]). Cluster centers have the key "Cluster X", where X is the name of the cluster (A, B, C), and the points belonging to a certain cluster have only the letter of that cluster.
Quarterly Returns
Quarter    S&P500    B&H       A         B         C         D         E
2012 Q1    12,00%    10,14%    9,13%     11,64%    10,06%    9,46%     28,17%
2012 Q2    -3,29%    -4,45%    -4,74%    -3,88%    -5,10%    5,51%     -5,59%
2012 Q3    5,76%     6,20%     6,13%     4,82%     3,01%     5,78%     12,58%
2012 Q4    -1,01%    3,35%     3,65%     4,73%     1,08%     -2,56%    9,41%
2013 Q1    10,03%    9,17%     9,19%     5,87%     9,64%     5,55%     6,81%
2013 Q2    2,36%     2,69%     2,33%     -1,02%    5,69%     4,08%     6,86%
2013 Q3    4,69%     6,45%     7,10%     4,43%     3,85%     1,20%     4,77%
2013 Q4    9,92%     7,60%     7,10%     7,60%     8,74%     8,65%     9,16%
2014 Q1    1,30%     4,52%     4,43%     6,18%     4,67%     0,55%     -1,37%
2014 Q2    4,69%     2,59%     2,29%     4,60%     3,80%     -2,92%    4,55%
2014 Q3    0,61%     -2,70%    -2,50%    -4,08%    -1,76%    -3,27%    -1,67%
2014 Q4    4,39%     7,02%     7,35%     3,70%     4,69%     5,86%     3,53%
2015 Q1    0,44%     1,99%     1,85%     5,32%     -0,78%    1,70%     -3,58%
2015 Q2    -0,23%    -2,60%    -3,87%    -0,75%    0,80%     1,12%     2,40%
2015 Q3    -6,94%    -7,14%    -5,90%    -2,88%    -9,50%    -10,33%   -11,82%
2015 Q4    6,45%     1,16%     0,58%     0,30%     0,75%     1,37%     -0,29%

Accumulated Returns
Quarter    S&P500    B&H       A         B         C         D         E
2012 Q1    11,99%    10,14%    9,13%     11,64%    10,06%    9,46%     28,17%
2012 Q2    8,31%     5,25%     3,96%     7,31%     4,45%     15,49%    21,00%
2012 Q3    14,54%    11,77%    10,33%    12,49%    7,59%     22,17%    36,22%
2012 Q4    13,39%    15,51%    14,36%    17,81%    8,75%     19,05%    49,04%
2013 Q1    24,76%    26,10%    24,87%    24,72%    19,24%    25,65%    59,20%
2013 Q2    27,70%    29,49%    27,79%    23,45%    26,03%    30,77%    70,12%
2013 Q3    33,69%    37,84%    36,86%    28,92%    30,88%    32,35%    78,24%
2013 Q4    46,96%    48,32%    46,58%    38,72%    42,32%    43,80%    94,57%
2014 Q1    48,87%    55,03%    53,08%    47,30%    48,97%    44,59%    91,90%
2014 Q2    55,85%    59,04%    56,58%    54,08%    54,63%    40,37%    100,64%
2014 Q3    56,80%    54,74%    52,68%    47,79%    51,91%    35,78%    97,29%
2014 Q4    63,68%    65,60%    63,90%    53,25%    59,03%    43,74%    104,26%
2015 Q1    64,40%    68,90%    66,93%    61,41%    57,79%    46,18%    96,95%
2015 Q2    64,03%    64,51%    60,47%    60,19%    59,05%    47,82%    101,68%
2015 Q3    52,64%    52,77%    51,00%    55,58%    43,94%    32,55%    77,85%
2015 Q4    62,49%    54,54%    51,87%    56,06%    45,02%    34,37%    77,34%

Yearly Returns
Year       S&P500    B&H       A         B         C         D         E
2012       13,41%    15,51%    14,36%    17,81%    8,75%     19,05%    49,04%
2013       29,60%    28,40%    28,17%    17,75%    30,87%    20,79%    30,54%
2014       11,39%    11,65%    11,81%    10,47%    11,75%    -0,04%    4,98%
2015       -0,73%    -6,68%    -7,34%    1,83%     -8,81%    -6,52%    -13,18%

Figure 4.12: Investment by Type table. The key is the following: A - Cluster A; B - Cluster B; C - Cluster C; D - Cluster D; E - Cluster E.
4.2.3 Case Study III - Using GAs to optimize FI
This case study presents the results of using the FA GA described in section 3.3.6 to optimize the weights
given to Fundamental Indicators. The results of applying GAs to give buy and sell signals are studied for
each of the groups of case studies 1 and 2, and for the whole dataset.
Since the GA uses the last year of data for training, these portfolios have a training period of 2
years (2010 and 2011). The first year of training is used to define the type/cluster of the stocks, and the
second year is used to train the GA.
The algorithm was run 10 times for each model. In the models that used the clustering algorithm, the
clustering algorithm was run twice, so for each portfolio that uses clustering, 5 runs used the first run of
the clustering algorithm, and the other 5 used the second. Results were
compared between them. Table 4.2 shows the best, worst and median run of each portfolio.
Table 4.2: Best, worst and median run of each portfolio. For the type portfolios: SG - Slow Grower, SW - Stalwart, FG - Fast Growers, POT - Potential, TA - Turnarounds. The keys: A, B, C, D and E refer to the clusters.
Portfolio    Best Run    Worst Run    Median Run
GA           43,41%      36,05%       41,30%
Figure 4.13: GA optimization portfolios. The key "SG" represents the portfolio of type Slow Growers and the key "E" represents the portfolio of cluster E, as constructed in case studies 1 and 2. The suffix "-GA" represents the portfolios using GA to optimize FI weights. The key "GA" represents the GA algorithm running over the whole dataset.
Figure 4.14 also shows that the SG-GA portfolio increased the returns by only 1,94% compared
with the SG portfolio from case study 1. It substantially reduced the number of trades when
compared with portfolio SG (by 78,03%, from 132 trades in portfolio SG to 29 trades in portfolio SG-GA).
The rate of positive quarters increased (from 68,75% to 75%); however, the drawdown also increased
(from 10,07% to 16,45%). Figure 4.13 shows that the SG-GA portfolio rose more than the SG
portfolio in the fourth quarter of 2015, but fell right afterwards until it almost crossed below the
SG portfolio, and figure 4.15 shows how small the difference between the two portfolios is in
terms of yearly returns.
4.2.3.A Conclusions
One can conclude from this case study that Fundamental Indicators optimization has better results
when applied to small sets of companies (as were clusters E and D).
Also, the GA used in this work greatly reduced the number of trades in all portfolios, reducing
potential losses, but also potential gains. In the case of the E-GA portfolio, the major difference from the
cluster E portfolio presented in case study 2 was the reduction of losses in 2015.
Figure 4.14: GA portfolios metrics table. "GA" is the whole dataset using the FA GA algorithm, and SG-GA and E-GA are respectively the Slow Growers type and the cluster E using the FA GA algorithm.
One can also conclude that clustering classification is better than the user input one, since it groups
companies with more similarities than those of the user input classification, which allows the FA GA
Figure 4.15: FA GA portfolios' yearly returns. "GA" is the whole dataset using the FA GA algorithm, and SG-GA and E-GA are respectively the Slow Growers type and the cluster E using the FA GA algorithm.