product/national, HHL product/international. At the market level, firms have to decide on their competitive strategy, i.e., their market-specific strategic positioning as either price-/cost-leader, differentiator, or outpacer [40]. The actual positioning results from the relative value–price relationship in each particular market. The external decisions determine the potential demand of a firm and contribute to that of the market as a whole. The core drivers of firm-specific and total market demand are the firms' marketing efforts [42], their brand values, and macroeconomic factors. Marketing efforts are defined by the "four P's": (1) price, (2) promotion, or expenditures that increase total market demand via advertising, (3) placement, represented by the number of salespeople, which influences firm-specific demand, and (4) product value, which can be increased through R&D investments. The value/price ratio, promotion, and placement efforts, in conjunction with the brand value and macroeconomic factors, determine a firm's potential demand. Brand value is a cumulative measure reflecting consumer satisfaction. It represents first- and second-mover advantages/disadvantages of market entry and can be influenced by corporate identity expenditures, the firm's attractiveness, and its ability to live up to its promises.

This setup ensures that a firm can tap into all sources of revenue advantage such as differentiation (value/price ratio), innovation (product value and brand value), and people (R&D staff and salespeople). Figure 1 summarizes the domain of external decisions that define a firm's business strategy.
C. Aligning the Organization

Internal firm decisions fall into two categories: operations and financing. The operations part consists of three pillars: (1) material sourcing, (2) product outsourcing, and (3) production. For each product, the material must be purchased. Starting with a one-to-one relationship of material to product, a firm can invest in materials development to reduce the amount of material needed for production.
FIGURE 1 The potential demand for firm x is influenced by the relative effects of the four P's of its marketing, brand value, and macroeconomic factors. Dark rounded rectangles denote decisions. (The figure spans the firm level, the product level (1–2 products), and the market level (1–4 markets); the firm's decisions comprise corporate identity expenditures, attractiveness expenditures, the price decision, sales staff, promotion expenditures, and R&D investments, which feed the price, placement, promotion, and product policies relative to the decisions of all other firms.)
The sourcing costs depend on macroeconomic factors. Inventory management is a key factor because keeping stock maintains flexibility but incurs high costs. While material sourcing is necessary to produce products in-house, outsourcing is an alternative. A firm can determine the number of units to be manufactured outside as well as the length of the outsourcing contract. Like materials and products made in-house, outsourced products are immediately available and can be delivered as ready-made products. The outsourcing costs are pegged to an external exchange rate, which varies according to macroeconomic factors. Thus, by selecting the contract length, the firm may hedge against cost fluctuations. Finally, a firm must decide on the volume of production for its product/market portfolio. It can produce more or less than its potential demand in any market. Potential demand is not known ex ante because it is a function of all external decisions made simultaneously by all firms. Therefore, a firm must estimate its potential demand as implied by its business strategy. The sum of product inventory, outsourced products, and manufactured products serves to satisfy the potential demand. For the production decision, two independent capacity factors have to be taken into account: (1) the production capacity of the respective facility and (2) the production staff capacity. Facilities come in different sizes and production capacities. A firm can purchase and retire facilities to adjust its total capacity. It can also invest in infrastructure and flow optimization to increase capacity. One unit of capacity equals one unit of the low- or high-cost product. The second capacity factor is the production staff. Each blue-collar worker has a base productivity rate per round. The productivity rates differ by product and can be increased over time by investing in training and incentives. The productivity rate times the number of workers defines the total production staff capacity. This setup enables a firm to exploit different types of cost advantages, such as economies of scale (e.g., bulk purchase and mass production), scope (e.g., utilizing facilities for both products), and learning (e.g., productivity increases of production staff). The minimum of the facility capacity and the production staff capacity defines the number of units that can be produced (Figure 2).
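As a minimal sketch of the capacity rule just described (the names and numbers are illustrative, not taken from the simulation):

    # Illustrative sketch: producible units are capped by both the facility
    # capacity and the staff capacity (productivity rate x number of workers).

    def producible_units(facility_capacity: int, workers: int,
                         productivity_per_worker: float) -> int:
        staff_capacity = int(workers * productivity_per_worker)
        return min(facility_capacity, staff_capacity)

    # Example: a 120,000-unit facility staffed by 500 workers producing
    # 200 units each per round is staff-constrained at 100,000 units.
    assert producible_units(120_000, 500, 200) == 100_000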
The final element of the internal perspective is corporate
financing and shareholder wealth creation. A firm must decide
how to finance its activities, e.g., out of cash flow, using short-
term vs. long-term loans and managing interest and repayment
trade-offs, and how much to pay out to shareholders in terms
of dividends.
III. Characteristics and Applications of the Benchmark Suite

Games such as chess and Go entail perfect information. Therefore, theoretically, they can be solved by an exhaustive search of the full game tree. In practice, an exhaustive search is infeasible because of the large search space: $35^{80}$ for chess and $250^{150}$ for Go [25]. The search space can be reduced, however, by limiting the state space and the action space and using metaheuristics to map between both [26]. For Atari jump-and-run games, the state space consists of multiple subsequent frames (pixel screens), and the action space comprises commands such as moving the game figure right, left, up, or down.
FIGURE 2 Internal structure of a firm. Dark rounded rectangles denote decisions. The production quantity, material, outsourced quantity, and inventory are product specific and should be aligned with each other and with the potential demand. (Decisions shown include the material quantity, production quantity, outsourcing quantity and contract length, number of facilities, and expenditures for material development, training, incentives, infrastructure, and optimization; facility capacity and production staff capacity jointly bound the total production capacity, with some quantities given per facility type.)
The real-time strategy game StarCraft II poses the next stage of challenging problems, with more than two players interacting, a more extensive state space and action space (approximately $10^8$), imperfect information due to partially observed game states, and many thousands of frames of gameplay [33]. The proposed business simulation testbed pushes these boundaries further along the dimensions of game complexity, game dynamics, and game objectives. First, referring to complexity, there are four to eight players in the testbed, and the action space has become infeasible to address. Whereas actions in chess, Go, or arcade games are limited to one per round or frame, StarCraft II allows multiple actions at once, such as moving a group of units to a specific grid position. However, the number of sequential actions per minute is limited in the StarCraft II Learning Environment to account for a fair comparison with human players. In the business simulation, an agent can make up to 48 decisions per round. A decision can take any number within a specific range, such as for price (1500–2500), number of salespeople (0–999), or production volume (0–500,000). The median of the ranges is 1000, leading to $1000^{48}$ possible decision combinations. While most of the actions in StarCraft II are sequential over a long period of time, the business simulation focuses on simultaneous decisions over a short period of time (a few rounds). Thus, the training data are very sparse compared to existing test problems. Second, with respect to the game dynamics, the external environment changes over time: Demand, costs, exchange rates, and other macroeconomic factors vary according to the industry lifecycle and customized market shocks. Finally, the game objectives differ from established games. Various rewards such as share price, market share, cash position, profit level, and survival of rounds can be used as performance measures, reflecting agents' extrinsic and intrinsic motivators [43]. Whereas StarCraft II is a zero-sum (win/lose) game, the business simulation need not be if agents have different objectives. The benchmark suite focuses on the environment and not on the agent. It is a testbed for algorithms to compare and replicate scenarios without programming-language limitations. Therefore, the benchmark suite provides a representational state transfer (REST) application programming interface (API) to access the current game state, reward, and decision/action vectors using JavaScript Object Notation (JSON) as follows:

[{"SHAREPRICE":257,"TOTALREVENUES":257,"NETINCOME":13, …},… {…}]
The game state is a representation of the environment consisting of (1) public information from other firms, such as balance sheet and cash flow statement data, (2) sales, revenues, and product values from each market and firm, (3) macroeconomic variables such as the GDP, exchange rate, and material costs, and (4) each firm's internal measures such as unit costs and inventory levels. The workflow for using the benchmark suite (Figure 3) consists of five main functions: (1) reset the game, define specific parameters (demand cap, minimum product value standards, production costs, etc.), and induce system shocks using either artificial or real-world data (RESET); (2) obtain the upper and lower limits for each variable (LIMITS); (3) submit the decision vector and proceed to the next round (STEP); (4) obtain the game status (STATUS); and (5) simulate internal decisions (SIM).
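To make the workflow concrete, the sketch below wraps the five REST calls in a small Python client, following the pseudocode in Figure 3. The host placeholder x.x.x.x, the token, and the parameter encodings are placeholders, not a documented API.

    # Minimal client sketch for the five REST functions (see Figure 3).
    # Host, token, and parameter encodings are placeholders.
    import requests

    BASE = "http://x.x.x.x"   # service address (placeholder)
    TOKEN = "my-token"        # session token (placeholder)

    def reset(params):
        """RESET: (re)start a game with custom parameters and shocks."""
        return requests.get(f"{BASE}/reset/{TOKEN}/{params}").json()

    def limits():
        """LIMITS: upper and lower bounds for each decision variable."""
        return requests.get(f"{BASE}/limits/{TOKEN}").json()

    def status():
        """STATUS: current game status (reward-relevant measures)."""
        return requests.get(f"{BASE}/status/{TOKEN}").json()

    def sim(firm, simaction):
        """SIM: simulate internal decisions for one firm without advancing."""
        return requests.get(f"{BASE}/sim/{TOKEN}/{firm}/{simaction}").json()

    def step(action):
        """STEP: submit the decision vector and move to the next round."""
        return requests.get(f"{BASE}/step/{TOKEN}/{action}").json()

A typical session calls reset once, then alternates step and status each round, with sim available for what-if evaluation of internal decisions.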
Business decisions can be decomposed into two distinct types: external decisions (which influence the market and competitors) and internal decisions (which do not interact with the environment). Once the business strategy is set, and an agent estimates the potential sales based on her own decisions and assumptions about other players, she needs to align the internal decisions. This setup reflects Alfred Chandler's paradigm of Structure follows Strategy. The choice of external decisions can be characterized as a pattern recognition problem of mapping states to actions. Aligning the internal decisions, however, is an optimization problem. When the potential sales (the result of predicting the effect of the external decisions) for each market are submitted along with the internal decisions of a firm (simaction vector) and the firm number ($x$), the tool returns all internal measures such as the inventory, unit cost, cash flow, and balance sheet. In summary, the design of the benchmark suite helps to develop and evaluate general machine learning techniques for decision making and calibration and combines the learning tasks of exploration and exploitation (external decisions) with multivariable optimization (internal decisions).
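As a sketch of this split, the snippet below scores candidate internal decisions against an estimated potential-sales figure using the hypothetical sim wrapper from the earlier client sketch; the simaction encoding and the CASHPOSITION field name are assumptions.

    # Sketch: treat internal alignment as an optimization subproblem that
    # queries SIM. The simaction encoding and field names are assumptions.

    def evaluate_internal(firm, potential_sales, production_qty, material_qty):
        """Score one internal-decision candidate by its simulated cash position."""
        simaction = f"{potential_sales},{production_qty},{material_qty}"
        return sim(firm, simaction)["CASHPOSITION"]

    # Grid search over production quantities for firm 1, given estimated sales.
    candidates = range(30_000, 50_001, 5_000)
    best_qty = max(candidates,
                   key=lambda q: evaluate_internal(1, 40_000, q, q))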
IV. The Challenges of Reinforcement Learning in Business Simulations

Pattern recognition algorithms can be categorized broadly as unsupervised, supervised, or reinforcement learning [44]. As data labels in the business simulation are sparse and time-delayed, the problem is one of reinforcement learning. This approach assumes an agent situated in an environment. She observes the environment ($s$; e.g., a JSON object), takes actions ($a$; e.g., a JSON object), and receives a reward ($r$, determined by the agent; e.g., number of rounds survived, profit after round $t$, etc.) and a new state ($s'$; e.g., a JSON object) from the environment: $s, a \rightarrow r, s'$. Reinforcement learning seeks a sequence of actions, called a policy, to maximize the total reward [45]. It solves the credit assignment problem by rewarding preceding actions that contribute to the final outcome. From the agent's perspective, the environment is non-deterministic; the total reward ($R_t$) at time $t$ is the sum of the current reward ($r_t$) and discounted future rewards ($\gamma R_{t+1}$, $\gamma \in [0, 1]$) when a specific action is pursued: $R_t = r_t + \gamma R_{t+1}$. $\gamma$ is called the discount factor, with a value close to 1 to consider future discounted cumulative rewards. Following [46], a function $Q(s_t, a_t)$ can be defined estimating the total reward for a given state $s$ and action $a$. An optimal choice is an action that maximizes the discounted future rewards: $Q(s_t, a_t) = \max R_{t+1}$. Assuming such a Q-function exists, an optimal policy would choose the action with the highest Q-value: $\Pi(s) = \max_a Q(s, a)$. Rewriting this formula gives the Bellman equation, which can be used to iteratively approximate the Q-function. Lin [47] proves that this equation converges if the states are finite: $Q(s, a) = r + \gamma \max_{a'} Q(s', a')$. Following the suggestions in [27], a neural network with multiple hidden layers is used to approximate the Q-function. The input layer is the game state vector; the output layer is the Q-values for any possible action. The network is initialized randomly, and the regression task is optimized using the least-squares error method and stochastic gradient descent.
Unfortunately, the game states are not finite in a business simulation because the domain is of real numbers. Thus, convergence can no longer be guaranteed. Several countermeasures are proposed to stabilize the process, such as experience replay [47], the ε-greedy policy [27], or target networks [48]. In experience replay, state–action transitions are stored in memory from a knowledge base, random generation, or preceding training tasks; the neural network is trained by mini-batches from this memory. The ε-greedy policy prevents the algorithms from always picking the action with the highest Q-value and thus getting stuck in a local optimum: With some probability, a random action is chosen instead of the action with the highest Q-value. A typical implementation follows the policy that over time, the algorithm first explores and later becomes increasingly greedy; this is reflected by the factor $\lambda$, which controls the speed of decay of exploration: $\varepsilon_t = \varepsilon_{\min} + (\varepsilon_{\max} - \varepsilon_{\min}) e^{-\lambda t}$.
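A minimal sketch of this decay schedule and the resulting action choice, with q_values() standing in for the trained network's forward pass and the constants chosen as examples:

    # Sketch of the epsilon-greedy schedule described above; q_values() is a
    # placeholder for the Q-network's forward pass, constants are examples.
    import math
    import random

    EPS_MIN, EPS_MAX, LAMBDA = 0.01, 0.99, 0.001

    def epsilon(t: int) -> float:
        """Exploration probability at step t: decays from EPS_MAX to EPS_MIN."""
        return EPS_MIN + (EPS_MAX - EPS_MIN) * math.exp(-LAMBDA * t)

    def q_values(state) -> list[float]:
        """Placeholder for the predicted Q-value per action."""
        return [0.0] * 8

    def select_action(state, t: int) -> int:
        qs = q_values(state)
        if random.random() < epsilon(t):
            return random.randrange(len(qs))             # explore
        return max(range(len(qs)), key=qs.__getitem__)   # exploit: argmax Q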
Target networks are used to stabilize the training process. For any forward pass of samples regarding a specific action, all Q-values associated with similar actions will also be affected, and the neural network will never stabilize. To avoid this behavior, every $x$ steps the weights of the training network are copied to a target network, and predictions of Q-values (used in the error function for the training network) are simply taken from the target network.
FIGURE 3 The general workflow for using the benchmark suite is depicted on the left: start; set up and customize the testbed (economic model); the agent compiles an action/decision vector; get the decision limits and current game state; optionally simulate the effect of internal decisions; submit the action vector and move to the next round; repeat until the end. The right part of the figure shows the corresponding REST API commands in pseudocode (x.x.x.x, the service's IP address¹):

    RESET:  game_state <- GET x.x.x.x/reset/token/param
    LIMITS: boundaries <- GET x.x.x.x/limits/token
    STATUS: status     <- GET x.x.x.x/status/token
    SIM:    sim_state  <- GET x.x.x.x/sim/token/x/simaction
    STEP:   game_state <- GET x.x.x.x/step/token/action

The dark rounded rectangle highlights the external part of the suite, the algorithm or agent.
¹ Online Resource: Access to the public free service is available at www.stratchal.com/demo
This slows the learning process but stabilizes the Q-values. Various other stabilization and exploration–exploitation techniques are used in [26], [48], and [49] for demonstration purposes; these three are the most prominent ones. A general algorithm is given in Figure 4.
The major challenges in this task are the large state and action space and the distinction between external variables and internal optimization. Therefore, three experiments are presented to highlight the challenges and to demonstrate the use of the benchmark suite. The first experiment showcases a simple Cournot oligopoly, looking only at the external decisions. The second mimics a cobweb model and also incorporates the internal decisions, but without differentiating the learning approach. The third experiment advances the second one by utilizing different learning approaches for external and internal decisions.
In the first experiment, the state space consisted of 21 variables: potential sales, actual sales, cash position, balance sheet total, and share price for each of the four players, plus the domestic market's GDP as a hint for the industry cycle. The action vector was an 8-bit string, resulting in 256 different actions. Two bits signaled a price increase of 50 ($01_2$), a price decrease of 50 ($10_2$), or no price change ($11_2$ or $00_2$) for a single player; a short decoding sketch follows after this paragraph. Production volumes, material procurement, and production workers were automatically adjusted at a fixed rate, ignoring any internal complexity. All other variables were set to default values and left unchanged. The reward, the time-delayed data label, was the number of rounds all firms survived. Although different rewards could have been imposed to represent different objectives for individual firms, all firms shared the same objective in order to compare the results with other empirical findings in [8].
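The 2-bits-per-player encoding can be made concrete with a short decoder. This is a sketch of the scheme as described in the text; the bit-pair ordering is an assumption.

    # Sketch: decode an 8-bit action (two bits per firm) into price deltas,
    # following the stated encoding: 01 -> +50, 10 -> -50, 11/00 -> 0.
    # Which bit pair maps to which firm is an assumption.

    def decode_action(action: int) -> list[int]:
        """Map an action index (0..255) to a price delta for each of 4 firms."""
        deltas = []
        for firm in range(4):
            bits = (action >> (2 * firm)) & 0b11
            if bits == 0b01:
                deltas.append(+50)
            elif bits == 0b10:
                deltas.append(-50)
            else:                      # 0b11 or 0b00: no price change
                deltas.append(0)
        return deltas

    assert decode_action(0b10_01_00_11) == [0, 0, +50, -50]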
Once a firm goes bankrupt, the game restarts. The industry cycle has four stages. Each stage lasts for 8 rounds, and the cycle was repeated 25 times. Further parameters, whose values were chosen to optimize the trade-off between computing time and memory capacity, are (1) size of the replay memory, 1000; (2) mini-batch size, 64; (3) target network updates, 100; and (4) the neural network structure, two hidden dense layers (first, 512 neurons; second, 256 neurons); hidden layer activation functions, rectified linear; output activation function, softmax; input layer, 21 neurons; and output layer, 256 neurons. The ε-greedy policy parameters were standard lower and upper probability limits $\varepsilon_{\min} = 0.01$ and $\varepsilon_{\max} = 0.99$, with a slow convergence rate of $\lambda = 0.001$; the discount factor $\gamma = 0.99$ was chosen according to the literature in [50]. Figure 5 shows the results of 1100 training runs. The
overall firm survival rate increased from 18 to a maximum of 32 rounds (left image). The decisions in the last 200 runs (right image) demonstrate that the algorithm learned to reduce the price in times of industry recession (cycles a, b, c, and g) and to increase the price in times of growth (cycles a, b, and c). This collusive behavior aligns with the findings in [8, p. 3287], wherein the authors applied a Q-learning algorithm to a simple oligopoly model and concluded, "Q-learning firms generally learn to collude with each other in Cournot oligopoly games, although full collusion usually does not emerge, that is, firms usually do not learn to make the highest possible joint profit."
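For concreteness, the stated configuration could be assembled as in the following minimal Keras sketch. Unstated settings (e.g., the SGD learning rate) are left at library defaults, and the softmax output layer follows the text as printed.

    # Sketch of the experiment-1 setup as described: 21 state inputs, two
    # hidden dense layers (512 and 256, ReLU), 256 outputs (softmax, as
    # printed in the text), least-squares (MSE) loss, and SGD.
    import tensorflow as tf

    N_STATE, N_ACTIONS = 21, 256

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(N_STATE,)),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(N_ACTIONS, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="mse")

    REPLAY_SIZE = 1000    # (1) size of the replay memory
    BATCH_SIZE = 64       # (2) mini-batch size
    TARGET_UPDATE = 100   # (3) target network update interval
    GAMMA = 0.99          # discount factor
    EPS_MIN, EPS_MAX, LAMBDA = 0.01, 0.99, 0.001  # epsilon-greedy schedule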
The second experiment incorporated price, sales, and basic internal production decisions. Each decision was represented by a single bit: (1) a price increase (1) or decrease (0) of 50, (2) an increase (1) or decrease (0) of 40 in the number of salespeople, and (3) an increase of 5000 units in the production quantity and of 70 people in the production staff (1) or a corresponding decrease (0). The resulting 12-bit string creates an action space of 4096 combinations.
Initialize replay memory (M) with random transitions
Initialize the training neural network (Q) with random weights
Copy the weights of Q to a target network (Q*)
s <- GET x.x.x.x/reset/token/param
while (continue):
    Select action a (external variables) at state s with the ε-greedy method
    Create action vector action from a and internally aligned variables
    s* <- GET x.x.x.x/step/token/action
    reward r <- GET x.x.x.x/status/token
    Store transition <s, a, r, s*> in M; if M is full, discard the oldest entry
    Sample mini-batches from M: train Q
    Every k steps, copy weights from Q to Q*
    s <- s*

FIGURE 4 Pseudocode of the reinforcement learning algorithm (with experience replay, ε-greedy policy, and target networks) using the benchmark suite; x.x.x.x, the service's IP address.
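A compact Python rendition of the Figure 4 loop is sketched below, reusing the hypothetical REST wrappers and the model and hyperparameters from the earlier sketches; the state flattening, action encoding, and reward field name are assumptions rather than part of the published suite.

    # Sketch of the Figure 4 loop: experience replay, epsilon-greedy policy,
    # and a periodically updated target network. reset/step/status, model,
    # epsilon(), and the hyperparameters come from the earlier sketches;
    # vectorize(), encode(), and "ROUNDSSURVIVED" are assumptions.
    import random
    from collections import deque

    import numpy as np
    import tensorflow as tf

    def vectorize(game_state) -> np.ndarray:
        """Assumed helper: flatten the JSON game state to N_STATE numbers."""
        vals = [v for v in game_state.values() if isinstance(v, (int, float))]
        return np.array((vals + [0.0] * N_STATE)[:N_STATE])

    def encode(action: int) -> str:
        """Assumed helper: render an action index as the suite's URL segment."""
        return str(action)

    memory = deque(maxlen=REPLAY_SIZE)            # replay memory M
    target = tf.keras.models.clone_model(model)   # target network Q*
    target.set_weights(model.get_weights())

    state = vectorize(reset("default"))           # s
    for t in range(10_000):
        if random.random() < epsilon(t):          # epsilon-greedy selection
            action = random.randrange(N_ACTIONS)
        else:
            action = int(model.predict(state[None], verbose=0)[0].argmax())

        next_state = vectorize(step(encode(action)))        # s* via STEP
        reward = status().get("ROUNDSSURVIVED", 0)          # assumed field
        memory.append((state, action, reward, next_state))  # <s, a, r, s*>

        # Train on a mini-batch; regression targets use the frozen Q*.
        batch = random.sample(memory, min(BATCH_SIZE, len(memory)))
        s, a, r, s2 = map(np.array, zip(*batch))
        q = model.predict(s, verbose=0)
        q[np.arange(len(a)), a] = r + GAMMA * target.predict(s2, verbose=0).max(axis=1)
        model.train_on_batch(s, q)

        if t % TARGET_UPDATE == 0:                # every k steps, copy Q -> Q*
            target.set_weights(model.get_weights())
        state = next_state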
The underlying economic model is the cobweb model, in which production quantities are chosen before the market price is observed [51]. The defined reinforcement learning approach is best suited for credit assignment tasks. Therefore, it is also applied to the second experiment without any structural adjustments: internal and external decisions are treated with the same learning approach, despite the fact that the testbed poses a problem of both credit assignment and optimization. The results (Figure 7) show a slower learning rate within 1100 training steps (from 8 to 14) than in the first setup due to the larger action space. However, qualitatively, two different scenarios can be identified (Figure 6).
FIGURE 6 Second experiment. (a) The stacked prices, with firm 1 at the bottom and firm 4 on the top. (b) The stacked salespeople decisions. The Red Queen effect is observed in cycles aR and dR. Lifecycles aL to dL and aR to dR in Figure 6 correspond to lifecycles a to d in Figure 5.
FIGURE 5 First experiment. (a) The average number of rounds survived for all firms by manipulating the price and neglecting quantity decisions. The red line depicts the regression line. (b) Price decisions for the last 200 runs. Prices are stacked to highlight their similarity, with firm 1 at the bottom and firm 4 on the top. Each vertical dotted line signals the beginning of a new industry lifecycle (a to g).
(1) During recessions, expenditures for salespeople are used to mitigate a potential price decrease: In rounds 71–76 and 86–91, the number of salespeople increased as the price dropped. (2) Once firms spend on more salespeople, a rat race begins (as suggested by the Red Queen principle) and firms hire even more, especially in growth phases (rounds 16–21 and 106–111) when prices increase. Due to the option to set internal decisions, the algorithm was able to adjust better to industry lifecycle fluctuations than in the first experiment (compare Figure 5 cycles a, b, and c with Figure 6 cycles aL, bL, and cL), which mimicked the behavior in a cobweb model [51]. However, with this algorithm, the performance (survival rate) was significantly lower, which highlights the influence of the complexity of the action space on learning speed.
The third experiment advances the second one by distinguishing between learning approaches for external and internal variables. The explicit distinction between external and internal variables provides the algorithm with domain knowledge. Instead of changing the production staff at a fixed rate, an evolutionary algorithm, based on the approach in [51], is implemented to determine the optimal number of workers. The inventory, the previous sales volume, and the current change in production volume are used to estimate the potential demand of the upcoming round for the SIM command. The cash position returned by the SIM command is used as the fitness measure in the evolutionary algorithm. Figure 7 shows that this leads to a better performance due to a less complex action space and the implementation of domain knowledge (separation of external and internal decisions).
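The article does not publish the evolutionary algorithm's details, so the following is only a plausible sketch: a small population of candidate worker counts, scored by the cash position that the hypothetical sim wrapper returns, with truncation selection and Gaussian mutation as assumed operators.

    # Sketch of an evolutionary search for the number of production workers.
    # Fitness is the SIM-returned cash position; population size, mutation
    # scale, selection scheme, and encodings are all assumptions.
    import random

    def fitness(workers, firm, potential_sales):
        simaction = f"{potential_sales},{workers}"      # assumed encoding
        return sim(firm, simaction)["CASHPOSITION"]     # assumed field name

    def evolve_workers(firm, potential_sales, pop_size=20, generations=30):
        population = [random.randint(0, 2000) for _ in range(pop_size)]
        for _ in range(generations):
            ranked = sorted(population, reverse=True,
                            key=lambda w: fitness(w, firm, potential_sales))
            parents = ranked[: pop_size // 2]           # truncation selection
            children = [max(0, int(random.gauss(p, 50)))
                        for p in parents]               # Gaussian mutation
            population = parents + children
        return max(population,
                   key=lambda w: fitness(w, firm, potential_sales))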
The first two experiments demonstrate that the testbed can be used to replicate economic models and generate economic behavior patterns in agent-based computational economics [8], [51]. The benchmark suite improves upon the existing approaches by providing decisions across all areas of Porter's value chain in a standard setting that can be customized and, due to the REST implementation, can be accessed by all types of agents (algorithms) regardless of their underlying programming structure and language. The experiments also demonstrate that a trivial implementation of state-of-the-art reinforcement learning algorithms is capable of simulating real market behavior yet explodes in computational time and space [48] for any continuous action space. Further algorithmic improvement techniques such as automatic network structure determination [52] or hyperparameter optimization [53] improve the performance, but the main problem is the combination of simultaneous decision making and an infinite action space in conjunction with the optimization of
internal variables. This calls for other approaches, such as probability distributions of actions [33] and the actor-critic approach with a discrete [54] or continuous action space [48], [55].

FIGURE 7 Second and third experiments: The average number of rounds survived for all four firms in the last 200 training steps. The average number of rounds all firms survived in the third experiment (right picture, 17 rounds) is significantly higher than in the second experiment (left picture, 13 rounds). The red line depicts the regression line over all training steps.

The third
experiment applies the testbed to the joint solution of the credit assignment and optimization problem. Moreover, the model can be used to test algorithms that autonomously (without domain knowledge) differentiate between different learning styles. Machine learning approaches alone may not be sufficient because the business simulation is hard to predict, is not self-contained, and focuses on simultaneous decisions in a short period of time rather than sequential decisions over a long period. Therefore, the authors of [23] suggest that major breakthroughs might utilize hybrid techniques combining deep learning with reasoning, in this case, economic reasoning.
V. Conclusion

This article has presented a benchmark suite for machine learning algorithms for strategic decision making in a business context. This tool extends the current set of training environments in an artificial context, such as rllab, OpenAI Gym, or StarCraft II, by providing a dynamic multi-market, multi-product environment with few producers and many consumers for applications of strategic planning. It extends the current state-of-the-art problems for reinforcement learning by allowing a continuous state and action space in a non-zero-sum game of imperfect information. Three reinforcement learning approaches using deep Q-learning with experience replay, ε-greedy policy, and target networks for stabilization were used to demonstrate price-setting and quantity-setting behaviors of four firms. The results show that the decisions made by the algorithms align with expected outcomes in such oligopolistic markets.
References
[1] J. P. Kotter, Accelerate: Building Strategic Agility for a Faster-Moving World. Boston, MA, USA: Harvard Business Review Press, 2014.
[2] B. Gilad, "'Competitive intelligence' shouldn't just be about your competitors," Harv. Bus. Rev., May 18, 2015. [Online]. Available: https://hbr.org/2015/05/competitive-intelligence-shouldnt-just-be-about-your-competitors. Accessed on: Aug. 27, 2018.
[3] M. Reeves and G. Wittenburg, "Games can make you a better strategist," Harv. Bus.