Page 1: Artificial Intelligence Based Cognitive Routing for ... - arXiv

arXiv:1309.0085v1 [cs.NI] 31 Aug 2013

Artificial Intelligence Based Cognitive Routing for Cognitive Radio Networks

Junaid Qadir

Abstract: Cognitive radio networks (CRNs) are networks of nodes equipped with cognitive radios that can optimize performance by adapting to network conditions. While CRNs are envisioned as intelligent networks, relatively little research has focused on the network-level functionality of CRNs. Although various routing protocols, incorporating varying degrees of adaptiveness, have been proposed for CRNs, it is imperative for the long-term success of CRNs that the design of cognitive routing protocols be pursued by the research community. Cognitive routing protocols are envisioned as routing protocols that fully and seamlessly incorporate AI-based techniques into their design. In this paper, we provide a self-contained tutorial on various AI and machine-learning techniques that have been, or can be, used for developing cognitive routing protocols. We also survey the application of various classes of AI techniques to CRNs in general, and to the problem of routing in particular. We discuss various decision-making techniques and learning techniques from AI and document their current and potential applications to the problem of routing in CRNs. We also highlight the various inference, reasoning, modeling, and learning subtasks that a cognitive routing protocol must solve. Finally, open research issues and future directions of work are identified.

I. INTRODUCTION

In cognitive radio networks (CRNs), nodes are equipped with cognitive radios (CRs) that can sense, learn, and react to changes in network conditions. Mitola envisioned that CRs could be realized through incorporation of substantial computational or artificial intelligence (AI), particularly machine learning, knowledge reasoning and natural language processing [1], into software-defined radio (SDR) hardware. In a modern setting, this is achieved by incorporating a cognitive engine (CE) that uses various AI-based techniques, through which the CR adapts to the network conditions to satisfy some notion of optimality [2]. CRs have also been proposed for a wide range of applications including intelligent transport systems, public safety systems, femtocells, cooperative networks, dynamic spectrum access, and smart grid communications [2] [3]. CR promises to dramatically improve spectrum access, capacity, and link performance while also incorporating the needs and the context of the user [2]. CRs are increasingly being viewed as an essential component of next-generation wireless networks [3] [4].

Although cognitive behavior of CRNs can enable diverse applications, perhaps the most cited application of CRNs is dynamic spectrum access (DSA)¹ [5]. DSA is proposed as a solution to the problem of artificial spectrum scarcity that results from static allocation of available wireless spectrum using the command-and-control licensing approach [5]. Under this approach, licensed applications represented by primary users (PUs) are allocated exclusive access to portions of the available wireless spectrum, prohibiting other users from access even when the spectrum is idle. With most of the radio spectrum already being licensed in this fashion, innovation in wireless technology is constrained. The problem is compounded by the observation, replicated in numerous measurement-based studies the world over, that the licensed spectrum is grossly underutilized [3] [5]. The DSA paradigm proposes to allow secondary users (SUs), also called cognitive users, access to the licensed spectrum subject to the condition that SUs do not interfere with the operations of the primary network of incumbents.

This work has been supported by the Higher Education Commission (HEC), Pakistan under the NRPU programme.

Junaid Qadir ([email protected]) is with the Electrical Engineering Department at the School of Electrical Engineering and Computer Science (SEECS) at the National University of Sciences and Technology (NUST), Pakistan.

While CRs have been defined differently [2], the following tasks are considered integral to them: i) observation or awareness, ii) reconfiguration, and iii) cognition. In this paper, we will be occupied mostly with cognition as we seek to build cognitive, AI-based, routing protocols. Cognition subsumes both reasoning and learning, with reasoning being the process of finding the appropriate action for particular situations to meet some system target, and learning being the process of accumulating knowledge based on the results of previous actions [2] [6]. Generally speaking, cognition for a CR entails understanding and reasoning about the radio environment so that informed decisions may be taken to optimize the performance of the radio and of the overall network.

Both learning and reasoning are essential elements of cognition, and a lot of research attention has rightly focused on incorporating cognition in CRs. However, while incorporating learning and adaptiveness into CRs is highly desirable, the vision of a ‘cognitive network’ will not be realized until the networks, and the network layer functions, seamlessly incorporate intelligence. Cognitive networks are envisioned as intelligent networks that perceive current conditions to plan, decide and act while catering to the network’s overall end-to-end goals [7] [8]. Cognitive networking broadly encompasses the models of cognition and learning that have been defined for CRs, but is distinguished from isolated CRs by its emphasis on a network-wide, end-to-end scope. In previous work on cognitive networks, Mahonen et al. proposed a cognitive resource manager as a framework for network-wide optimization of radio resources, and proposed utilizing machine-learning techniques to manage cross-layer optimization [9] [10]. Some ten years ago, Clark et al. proposed that the Internet must have a knowledge plane, distinct from the data and the control planes, that will allow building up an intelligent network capable of setting itself up given high-level instructions, adapting itself to changing requirements, managing itself to automatically discover anomalies, and automatically fixing problems or explaining why it cannot do so [11]. Clark et al. noted that building such a ‘cognitive network’ would require AI-based cognitive techniques and not just incremental algorithmic techniques.

¹ DSA is such a dominantly cited application of CRNs that DSA and CRN are often incorrectly assumed to be synonymous. The CRN is, in fact, a much broader concept allowing for diverse applications representing intelligent behavior [5].

To help CRNs become cognitive networks, it is imperative that intelligence be integrated into the fabric of CRN architecture and protocols across the stack. Some challenges that confront learning algorithms in CRNs, as identified in [12], are as follows:

1) Learning algorithms have to operate in certain cases in unknown RF environments without any supervision.

2) Learning algorithms have to operate in environments that are only partially observable.

3) Learning algorithms for CRNs must be distributed, due to the decentralized nature of CRNs, and are properly framed as multi-agent learning, which is more challenging than the single-agent learning scenario.

Contributions of this paper: In this paper, we weave together ideas from multiple disciplines (such as optimization theory, game theory, machine learning, artificial intelligence, control theory, and economics) to present a cogent and holistic overview of techniques that can be useful for network-layer decision making in CRNs. This task has been non-trivial due to the multi-disciplinary nature of CRN research, which is compounded by the fact that many of the parent fields use different terminology and notation for similar concerns. Previous survey articles that are similar to this work have focused mainly on the application of machine-learning and AI techniques to problems of spectrum sensing, power control, and adaptive modulation in CRNs [2] [12]. To the best of our knowledge, this is the first survey article that focuses on the application of AI techniques to the problems of modeling, design and analysis of network-layer issues (in particular, the problems of routing and forwarding) in CRNs.

In this paper, the basic concepts of relevant AI techniques are presented and their applications to CRNs, particularly for routing, are highlighted. While this paper attempts to be self-contained, it is not intended as an exhaustive document, given the breadth of topics covered. We have attempted to provide links to more comprehensive resources on specialized topics wherever appropriate.

Organization of this paper: The rest of the paper is organized as follows. Section II presents the necessary machine-learning background before we discuss decision and planning techniques in section III and learning techniques in section IV. A survey of existing routing protocols for CRNs is presented in section V, and it is shown that while these protocols do support certain adaptive features, more work needs to be done to build AI-enabled cognitive routing protocols for CRNs. Some important tasks that an AI-enabled cognitive routing protocol must implement are discussed in section VI. Open research issues and future research directions are identified in section VII. Finally, the paper is concluded in section VIII.

II. BACKGROUND: MACHINE LEARNING

For a radio to be deemed a cognitive radio, it is necessary for it to be equipped with the ability to learn [4]. On receiving certain environmental input, systems (e.g., animals, automata, and in our case, cognitive radios) exhibit some kind of behavior. If the system changes its behavior over time in order to improve its performance at a certain task, it is said to learn from its interaction with its environment. This implies that these systems may respond differently to the same input later on than they did earlier. The field of machine learning focuses on the theory, properties and performance of learning algorithms.

Machine learning is a field of research that formally studies learning systems and algorithms. It is a highly interdisciplinary field building upon ideas from diverse fields such as statistics, artificial intelligence, cognitive science, information theory, optimization theory, optimal control, operations research, and many other disciplines of science, engineering and mathematics [13] [14] [15] [16]. Russell and Norvig [14] describe machine learning to be the ability to “adapt to new circumstances and to detect and extrapolate patterns”. Machine learning techniques have proven themselves to be of great practical utility in diverse domains such as pattern recognition, robotics, natural language processing, and autonomous control systems. They are particularly useful in domains, like CRNs, where the agents must dynamically adapt to changing conditions.

Types of machine learning algorithms: Machine learning concerns itself with a learner using a set of observations to uncover the underlying process [13]. There are principally three variations to this broad definition, and machine learning can be classified into three broad classes with respect to the sort of feedback that the learner can access: i) supervised learning, ii) unsupervised learning, and iii) reinforcement learning. Briefly, supervised learning is one extreme in which the learner is provided with labeled examples by its environment (alternatively, a supervisor or teacher) in a training phase through which the learner attempts to generalize so that it can respond correctly to inputs it has not seen yet. We can think of learning a simple categorization task as supervised learning. Unsupervised learning is the other extreme in which the learner receives no feedback from the environment at all. The learner’s task is to organize the inputs into clusters or categories, or to represent them with a reduced set of dimensions. A third alternative, closer to supervised learning than to unsupervised learning, is reinforcement learning in which, although the learner is not provided feedback about what exactly the correct response should have been, it gets indirect feedback about the appropriateness of the response through a reward (or reinforcement). Reinforcement learning, therefore, depends more on exploration through trial-and-error. We will be covering these three kinds of learning in more detail later in sections II-A, II-B, and II-C, respectively.


Previous work on applying machine learning to CRNs: Bkassiny et al. provide a comprehensive survey of applications of machine-learning techniques in CRNs [12], and divide learning applications for CRNs into two broad categories of feature classification and decision making. Feature classification mainly has applications in spectrum sensing and signal classification. Decision making has diverse applications in CRNs including adaptive modulation, power control, routing and transport-layer applications [12]. Decision-making problems can be further classified into policy-making and decision-rule problems. In a policy-making problem, an agent determines an optimal policy (or an optimal strategy in game theory terminology) to determine what actions it should perform over a certain time duration. In a decision-rule problem, on the other hand, the problem is formulated as a hypothesis testing problem and the aim is to directly learn the optimal values of certain design and operation parameters [12]. Bkassiny et al. also establish the relationship between learning and optimization and show that many learning algorithms converge towards the optimal solution concept in their respective applications (whenever it exists). Applications of machine learning to CRNs are vast [17] [18], and we shall develop a more complete picture gradually as we proceed in this paper. Interested readers are referred to the surveys [2] [12], and the references therein, for a comprehensive complementary treatment of general applications of machine learning to CRNs.

A. SUPERVISED LEARNING

In supervised learning, algorithms are developed to learn and extract knowledge from a set of training data which is composed of inputs and corresponding outputs assumed to be labelled correctly by a ‘teacher’ or a ‘supervisor’. To understand supervised learning, imagine a machine that experiences a series of inputs: $x_1$, $x_2$, $x_3$, and so on. The machine is also given the corresponding desired outputs $y_1$, $y_2$, $y_3$, and so on, and the goal is to learn the general function $f(x)$ through which the correct output can be determined given a new input $x_i$ (not necessarily seen in the training examples provided).
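As a minimal sketch of this idea, the snippet below learns a function of the assumed form f(x) = w·x + b from labelled pairs by ordinary least squares and then predicts the output for an unseen input. The linear form, the training data, and the names `fit_line` and `predict` are illustrative assumptions, not something prescribed by the text.

```python
# Illustrative sketch: learn f(x) = w*x + b from labelled pairs (x_i, y_i).
# The linear model and the toy data below are assumptions for illustration.

def fit_line(xs, ys):
    """Return (w, b) minimizing the squared error over the training pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def predict(model, x):
    """Apply the learned function to a (possibly unseen) input."""
    w, b = model
    return w * x + b


# Labelled training examples supplied by the 'teacher' (here y = 2x + 1).
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

model = fit_line(train_x, train_y)
prediction = predict(model, 10.0)  # input not present in the training set
print(model, prediction)
```

The learner generalizes correctly here only because the assumed linear form matches the process that generated the labels.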

The output can be a continuous value for a regression problem, or a discrete value for a classification problem. The objective of supervised learning is to predict the output given any valid input. In other words, the task in supervised learning is to discover the function through which an input is transformed into an output. This contrasts with ‘unsupervised learning’, in which the examples are available in an unlabelled or unclassified fashion.

Types of supervised learning problems: There are essentially two types of supervised learning problems: classification and regression (or estimation). Classifiers themselves can be further classified into computational classifiers such as support vector machines (SVMs); statistical classifiers such as linear classifiers (e.g., the Naive Bayes classifier or logistic regression), hidden Markov models (HMMs) and Bayesian networks; or connectionist classifiers such as neural networks.

A central result in ‘supervised learning theory’ is the ‘no free lunch theorem’, which informs us that there is no single learning method that will outperform all others regardless of the problem domain and the underlying distributions. For this reason, a variety of domain- and application-specific techniques have emerged to deal with diverse applications with varying degrees of success. The design of practical learning algorithms is therefore a mixture of art and science [19].

Major issues in supervised learning: The major issue with supervised learning is the need to generalize a function from the learned data so that the technique can produce the correct output even for inputs it has not explicitly seen in the training data. This task of generalization cannot be solved exactly without some additional assumptions² being made about the nature of the target function, as it is possible for the yet unseen inputs to have arbitrary output values. Supervised learning also risks producing a model that is underfitted (perhaps due to limited amounts of training data) or overfitted (in which an unnecessarily complex model is built that captures the spurious and uncharacteristic noisy attributes of the data). Depending on the application, huge amounts of training data may be necessary for the supervised learning algorithm to work.

B. UNSUPERVISED LEARNING

In supervised learning, it was assumed that a labeled set of training data consisting of some inputs and their corresponding outputs was provided. In contrast, in unsupervised learning, no such assumption is made. The objective of unsupervised learning is to identify the structure of the input data. To understand unsupervised learning, again imagine the machine that experiences a series of inputs: $x_1$, $x_2$, $x_3$, and so on. The goal of the machine in unsupervised learning is to build a model of $x$ that can be useful for decision making, reasoning, prediction, communication, etc.

The basic method in unsupervised learning is clustering (which can be thought of as the unsupervised counterpart of the supervised learning task of classification). Clustering is used to find groups of inputs that have similar characteristics.
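Clustering can be illustrated with a small sketch. The snippet below uses one-dimensional k-means, which is only one of many clustering methods and is chosen here purely for brevity; the received-power readings (in dBm), the initial centers, and the function name `kmeans_1d` are hypothetical.

```python
# Illustrative sketch: group unlabelled measurements with 1-D k-means.
# k-means is an assumed choice of clustering method; the data are made up.

def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    recomputing each center as the mean of its assigned points."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical received-power readings (dBm) from two unknown signal sources.
readings = [-92.0, -90.0, -88.0, -61.0, -60.0, -59.0]
centers, clusters = kmeans_1d(readings, centers=[-100.0, -50.0])
print(centers)
print(clusters)
```

No labels are supplied at any point: the two groups emerge solely from the similarity of the readings.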

Application of unsupervised learning to CRNs: An application to which unsupervised learning is particularly suited is the extraction of knowledge about primary signals on the basis of measurements [12]. A prominent unsupervised classification technique that has been applied to CRNs, particularly for this problem, is the Dirichlet process mixture model (DPMM). The DPMM is a Bayesian non-parametric model which makes very few assumptions about the distribution from which the data are drawn by using a Dirichlet process prior distribution [20]. The benefit of Dirichlet process based learning is that training data is no longer needed, thus allowing this approach to be used for identification of unknown signals in an unsupervised setting. The Dirichlet process has been proposed in the literature [21] for identifying and classifying spectrum usage by unidentified systems in CRNs.

² These assumptions are subsumed in the phrase inductive bias. See [15] for more details.


C. REINFORCEMENT LEARNING

Reinforcement learning (RL) is inspired by how learning takes place in animals. It is well known that an animal can be taught to respond in a desired way by rewarding and punishing it appropriately; conversely, it can be said that the animal learns how it must act so as to maximize positive reinforcement or reward. A crucial advantage of reinforcement learning over other learning approaches, and a main reason for its practical significance, is that it does not require any information about the environment except for the reinforcement signal.

To understand RL, we again take recourse to the example of the machine which experiences a series of inputs: $x_1$, $x_2$, $x_3$, and so on. In this new setting, the machine can also perform certain actions $a_1$, $a_2$, ... through which it can affect the state of the world and receive rewards (or punishments) $r_1$, $r_2$, and so on.³ The mapping from the actions to rewards is probabilistic in general. The objective of a reinforcement learner is to discover a policy (i.e., a mapping from situations to actions) such that the expected long-term reward is maximized.
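The trial-and-error character of this setting can be sketched with a deliberately simplified, stateless learner: an epsilon-greedy agent that repeatedly picks one of several actions and observes only a probabilistic scalar reward. This is an assumed illustration and not a method taken from the text; the success probabilities, the parameter values, and the name `epsilon_greedy` are hypothetical.

```python
import random

# Illustrative sketch: a stateless epsilon-greedy learner. The learner never
# sees success_prob; it only observes the reward of the action it tried.

def epsilon_greedy(success_prob, steps=5000, epsilon=0.1, seed=0):
    """Estimate the average reward of each action by trial and error."""
    rng = random.Random(seed)
    n_actions = len(success_prob)
    estimates = [0.0] * n_actions  # running average reward per action
    counts = [0] * n_actions       # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])  # exploit
        # Probabilistic mapping from action to reward (1 = success, 0 = failure).
        reward = 1.0 if rng.random() < success_prob[action] else 0.0
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


# Hypothetical example: three channels with different success probabilities.
estimates, counts = epsilon_greedy([0.2, 0.9, 0.5])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action, counts)
```

The sketch omits states entirely, so the learned "policy" is just a single preferred action; the general RL problem additionally requires choosing actions as a function of the current situation.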

III. DECISION AND PLANNING TECHNIQUES

The cognitive cycle, which epitomizes the essence of a cognitive radio, is based on a cognitive radio’s ability to: i) observe its operating environment, decide on how to ii) best adapt to the environment, and then, as the cycle repeats, to iii) reason and iv) learn from past actions and observations [6]. The term planning, for the purpose of our discussion, refers to any computational process that produces (or improves) a decision policy of how to interact with the environment given a model of the environment. Planning is often referred to as a search task, since we are essentially searching through the space of all possible plans [15] [22].

In the remainder of this section, we will discuss two major decision planning frameworks that have been widely applied to CRNs. Specifically, we shall be studying Markov decision processes and game theory.

A. MARKOV DECISION PROCESSES

Markov decision processes (MDPs) provide a mathematical framework for modeling sequential planning or decision making by an agent in real-life stochastic situations where the outcome does not follow deterministically from actions. In such cases, the output (also called the reward) is specified by a probability distribution that depends on the action adopted in a particular state. MDPs approach this multi-stage sequential decision-making process as an ‘optimal’ control problem in which the aim is to select actions that maximize some measure of long-term reward.⁴ MDPs differ from classical deterministic AI planning algorithms in that their action model is stochastic (i.e., the outcome does not follow deterministically from the action chosen).

³ The reinforcement is a scalar value that can be negative to express a punishment or positive to indicate a reward.

⁴ Please see figure 1 and table II to see how MDPs relate to other techniques and AI-related fields.

More formally, an MDP is a discrete-time stochastic optimal control process. At every time step, the process is in some state $s$, and the decision maker has to choose some action $a$ from amongst the $A$ actions available in the current state. After taking the action, the process will move randomly to some new state $s'$, with the decision maker obtaining a corresponding reward $R_a(s, s')$. We note here that the reward is used in a neutral sense: it can imply either a positive reward or a negative reinforcement (i.e., a penalty). The choice of action $a$ in state $s$ influences the probability that the process will move to some new state $s'$. This probability (of going from state $s$ to $s'$ by taking action $a$) is given by the state transition function $P_a(s, s')$.⁵ The next state $s'$, therefore, depends stochastically on the current state $s$ and the action $a$ taken therein by the decision maker. In MDPs, an extra condition holds crucially: given $s$ and $a$, $P_a(s, s')$ is conditionally independent of all previous states and actions. This condition is known as the Markov property, and it is critical for keeping MDP analysis tractable.
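The ingredients of this definition (states, actions, the transition function and the reward) can be written down as a small data structure. The sketch below is an assumed toy example: the two states, the two actions, and all probabilities and rewards are hypothetical, chosen only to show how the pieces fit together and that the next state is sampled from the current state and action alone.

```python
import random

# Illustrative sketch of an MDP as a lookup table (hypothetical numbers).
# transitions[s][a] is a list of (next_state, probability, reward) triples,
# i.e. it encodes both P_a(s, s') and R_a(s, s').
transitions = {
    "idle": {
        "transmit": [("idle", 0.8, 1.0), ("busy", 0.2, 1.0)],
        "wait":     [("idle", 0.9, 0.0), ("busy", 0.1, 0.0)],
    },
    "busy": {
        "transmit": [("busy", 0.7, -1.0), ("idle", 0.3, -1.0)],
        "wait":     [("busy", 0.6, 0.0), ("idle", 0.4, 0.0)],
    },
}


def check_mdp(table):
    """Verify that the probabilities for every (state, action) sum to 1."""
    for state, actions in table.items():
        for action, outcomes in actions.items():
            total = sum(prob for _, prob, _ in outcomes)
            if abs(total - 1.0) > 1e-9:
                raise ValueError(f"P_{action}({state}, .) sums to {total}")
    return True


def step(table, state, action, rng):
    """Sample (next_state, reward) using only the current state and action,
    which is exactly what the Markov property permits."""
    outcomes = table[state][action]
    weights = [prob for _, prob, _ in outcomes]
    next_state, _, reward = rng.choices(outcomes, weights=weights)[0]
    return next_state, reward


valid = check_mdp(transitions)
rng = random.Random(0)
next_state, reward = step(transitions, "idle", "transmit", rng)
print(valid, next_state, reward)
```

Note that `step` takes no history argument: whatever happened before the current state cannot influence the sampled outcome.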

To put MDPs into perspective, we note here that they are a generalization of Markov chains. The difference is that MDPs incorporate actions and rewards in the model while Markov chains do not. Conversely, the special case of MDPs with only one action available for each state and with identical rewards (e.g., zero) is in fact a Markov chain. This, and the relationship of various Markov models and games that we will develop later in this paper, can be seen graphically in figure 1.

The roots of such problems can be traced to the work of Richard Bellman [23], who showed that the computational burden of solving an MDP can be reduced quite dramatically via techniques that are now referred to as dynamic programming (DP). We will discuss these techniques next.

Solving an MDP: The core problem in MDPs is to determine an optimal ‘policy’ for the decision maker, which is defined to be a function $\pi$ that maps a state $s$ to an action $\pi(s)$. Intuitively, the policy $\pi$ specifies what action the agent must perform in each state so that the long-term rewards are maximized. It may be noted that once the MDP is specified with a policy, the action at each state is fixed, and the resulting MDP effectively behaves like a Markov chain.

We can now make the notion of long-term rewards more precise. In a potentially infinite-horizon environment, where decision making goes on forever, reasoning about the different possible policies requires that the cumulative reward be finite. This is usually accomplished through discounting, which quantifies the preference for immediate rewards over delayed rewards. Discounting works by reducing future rewards by a factor of $\gamma$, chosen such that $0 \le \gamma < 1$, in every time step. The discount factor $\gamma$ is used as a parameter to describe the relative importance of future rewards. If $\gamma$ is chosen to be 0, the agent will become short-sighted or ‘myopic’ and will consider current rewards only. As $\gamma$ approaches 1, the agent will become long-sighted and will strive for long-term rewards. To ensure that action values do not diverge, the discount factor should not be equal to, or exceed, 1. Solving an MDP now entails determining the policy $\pi$ that maximizes the cumulative discounted reward function over a potentially infinite horizon:

$$\sum_{t=0}^{\infty} \gamma^{t} R_{a_t}(s_t, s_{t+1})$$

where we choose $a_t = \pi(s_t)$, $\gamma$ is the discount factor, and the subscript $t$ refers to the time step.

⁵ In some literature, the state transition function $P_a(s, s')$ is expressed through the alternative notation $T(s, a, s')$.
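The effect of the discount factor on an agent's preferences can be checked numerically. The sketch below computes the discounted sum of a finite reward sequence and compares two hypothetical reward streams, one paying early and one paying late; the streams and the name `discounted_return` are assumptions made for illustration.

```python
# Illustrative sketch: discounted sum of a finite reward sequence.

def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], for 0 <= gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (value of the rest).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Two hypothetical reward streams: one pays early, the other pays late.
early = [5.0, 0.0, 0.0, 0.0]
late = [0.0, 0.0, 0.0, 10.0]

for gamma in (0.1, 0.9):
    print(gamma, discounted_return(early, gamma), discounted_return(late, gamma))
```

With a small discount factor the early stream is worth more, whereas with a discount factor close to 1 the larger delayed reward dominates, matching the myopic and long-sighted behaviors described above.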

We can also define the value of a state, which follows naturally from the concept of rewards. Intuitively, the value of a state is the sum of discounted rewards that accrue from following the optimal policy onwards from that state. More precisely, V(s), the value of a state s, is the expected sum of discounted rewards to be earned (on average) by following the policy π from state s. A value function is a mapping from states to their values, i.e., to their expected upcoming cumulative reward. For compactness, we refer to $R_{a_t}(s_t, s_{t+1})$ with $a_t = \pi(s_t)$, i.e., the reward received at time t + 1 by following the optimal policy π at time t, simply as $r_{t+1}$. The value function mapping is shown below.

$$V(s_t) = E\left[ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots \right] \tag{1}$$
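The discounted sum inside the expectation can be illustrated with a minimal sketch for one sampled reward sequence, where the reward received k steps ahead is weighted by γ^k. The function name `discounted_return` and the constant reward stream are hypothetical, and the sequence is truncated to a finite length for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite reward sequence."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

rewards = [1.0] * 50  # hypothetical stream of unit rewards
myopic = discounted_return(rewards, 0.0)      # only the first reward counts
farsighted = discounted_return(rewards, 0.9)  # just below 1 / (1 - 0.9) = 10
```

With γ = 0 only the immediate reward survives, while with γ = 0.9 the sum approaches the geometric-series limit 1/(1 − γ), which shows why γ < 1 keeps the return bounded.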

It is worth emphasizing that the value abstraction is a key idea, and all efficient methods for solving sequential decision problems estimate value functions as an intermediate step [24]. Apart from using eq. 1, another efficient but remarkably simple method can be used for calculating the value function on the basis of bootstrapping. We will see this method later in eq. 3, when the Bellman equation is introduced.

We emphasize again that it is due to the Markov property that the optimal policy π is written as a function of only the current state s and not of the past trajectory of the process through various states. We shall see later that analysis becomes intractable and convergence guarantees are lost when this condition is not met.

1) Dynamic programming solutions to MDPs: Assume that we wish to calculate the policy that maximizes the expected discounted reward, given that the state transition function P and the reward function R are known (this assumption is not always met, but we start with this simple case).

The naive approach to the problem of optimal sequential decision making would be to consider the set of all feasible policies, compute the return for each, and then choose the policy providing the maximum return. This brute-force approach will not work except for the most trivial problems and will be hopelessly inadequate for processes involving even a moderate number of stages and actions. If we momentarily re-examine the situation practically, we will see that this price of excessive dimensionality arises from too much information. How much information is actually needed to carry out a multi-stage decision process?
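As a rough illustration of why enumeration fails, consider counting only deterministic stationary policies, each of which picks one action per state. The sizes below (20 states, 4 actions) are hypothetical.

```latex
% Hypothetical sizes: |S| = 20 states, |A| = 4 actions
|\Pi| = |A|^{|S|} = 4^{20} = 2^{40} \approx 1.1 \times 10^{12}
```

Even this small example yields about a trillion candidate policies, each of which would need its own return computation.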

The basic idea of the theory underlying dynamic programming is refreshingly simple. An optimal policy should be viewed as determining the decision required at each time in terms of the current state of the system. Regardless of the initial state and decisions, the remaining decisions must constitute an optimal policy π for the continuation process, treating the current state as the starting input. This is known as the principle of optimality. This strikingly simple insight allows computation of the optimal policy through backward induction starting at the terminal point. The concept of the value function V is related to this: it captures the expected future utility at any node of the decision tree, if we assume that an optimal policy will be followed in the future.

Value Iteration Algorithm:

The standard method of calculating this optimal policy requires calculation of the value function and the policy function. These two functions are stored in two arrays indexed by state: i) value V, containing the real values of states, and ii) policy π, which contains the actions of states. At the end of the algorithm, π will contain the optimal solution (i.e., the action to perform in each state) while V(s) will contain the values of the various states (capturing the expected discounted sum of the rewards to be earned by following the policy π from that state).

The algorithm has the following two steps, which are repeated for all the states until the values converge. These steps are defined recursively as follows. Note that the two equations below are intimately connected. In particular, the calculation of V(s) utilizes current policy information from π(s).

$$\pi(s) = \arg\max_{a} \left\{ \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V(s') \right) \right\} \tag{2}$$

$$V(s) = \sum_{s'} P_{\pi(s)}(s, s') \left( R_{\pi(s)}(s, s') + \gamma V(s') \right) \tag{3}$$

Before discussing eq. 3 in more detail, we contrast it with the method derived earlier for calculating V(s) in eq. 1. The method in eq. 1 was based on an explicit summation over expected future rewards. It turns out that eq. 3, which also happens to be the Bellman equation for this process, is considerably simpler and more useful for practical purposes. The key insight here is to employ bootstrapping to estimate the values of states iteratively and recursively. This is done by relating the value of each state to the values of the states that follow it. The Bellman equation for calculating V(s) can alternatively be expressed more simply as follows:

$$V(s_t) = E\left[ r_{t+1} + \gamma V(s_{t+1}) \right] \tag{4}$$

While both definitions of the value function (the extensive definition in eq. 1 and the bootstrapping definition in eq. 4) have exactly the same solution, they tellingly have different approximate solutions. The bootstrapping form of eq. 4 is considerably more convenient in terms of computation time.
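The equivalence of the two definitions can be seen by factoring γ out of every term after the first in the discounted sum, where the reward k steps ahead carries the weight γ^k. The short derivation below is a sketch; its last step relies on the Markov property, which lets the bracketed tail be recognized as the value of the successor state.

```latex
\begin{aligned}
V(s_t) &= E\left[ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots \right] \\
       &= E\left[ r_{t+1} + \gamma \left( r_{t+2} + \gamma r_{t+3} + \dots \right) \right] \\
       &= E\left[ r_{t+1} + \gamma V(s_{t+1}) \right]
\end{aligned}
```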

In value iteration, proposed by Bellman in 1957 [23], the policy function π is not used directly. The value of π(s) is instead calculated indirectly within V(s) whenever it is needed. This technique is also known by the name backward induction. Substituting the calculation of π(s) into the calculation of V(s) gives us the following Bellman equation for this problem. The value iteration update works by iteratively calculating the values of V(s).


$$V(s) = \max_{a} \left\{ \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V(s') \right) \right\} \tag{5}$$

Note that eq. 5 is just an alternate representation of eq. 3, but it serves to emphasize a potential problem that can arise with value iteration when it comes to solving complex large-scale MDPs. For each action, we calculate a weighted average over possible outcomes to determine the expected reward from that action. We then choose the action with the maximum expected reward. Since the equation above takes a maximum over all possible actions, this calculation does not lend itself naturally to the use of approximate methods. With approximation techniques precluded, this method becomes unwieldy for large-scale, complex problems.
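A minimal sketch of the value iteration update of eq. 5 on a small tabular MDP follows. The two-state example, its rewards, the tolerance, and the names `q_value` and `value_iteration` are hypothetical; the sketch updates values in place during each sweep, which is one common variant rather than the only possible one.

```python
def q_value(s, a, V, states, P, R, gamma):
    """Expected one-step reward plus discounted value of the successor state."""
    return sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2]) for s2 in states)

def value_iteration(states, actions, P, R, gamma, tol=1e-10):
    """Repeat the update of eq. 5 until the largest change in V is below tol."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a, V, states, P, R, gamma) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy extraction from the converged values, as in eq. 2.
    pi = {s: max(actions, key=lambda a: q_value(s, a, V, states, P, R, gamma))
          for s in states}
    return V, pi

# Hypothetical deterministic MDP: "stay" keeps the state, "go" switches it,
# and a reward of 1 is earned whenever the agent lands in state 1.
states = [0, 1]
actions = ["stay", "go"]
P = {"stay": [[1.0, 0.0], [0.0, 1.0]], "go": [[0.0, 1.0], [1.0, 0.0]]}
reward_on_arrival = [[0.0, 1.0], [0.0, 1.0]]
R = {"stay": reward_on_arrival, "go": reward_on_arrival}

V_star, pi_star = value_iteration(states, actions, P, R, gamma=0.5)
```

In this toy problem both states converge to a value of 2, with the agent moving to state 1 and then staying there. Note that every sweep touches every state-action-successor triple, which is the cost that becomes prohibitive for large MDPs.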

Policy Iteration Algorithm:

Policy iteration was devised based on the observation that it is possible to obtain an optimal policy even with an inaccurate value function estimate, or before this function converges. This is especially the case when one action is clearly better than all others; in such a case, it becomes clear what action needs to be taken even with imprecise estimates of the exact value magnitudes [14].

This insight can be exploited to devise a new strategy for calculating optimal policies, called the policy iteration algorithm, that directly explores the policy space. This algorithm begins from some initial policy $\pi_0$ and thereafter alternates between the following two steps:

1) Policy evaluation: Given a policy $\pi_i$, calculate $V_i = V^{\pi_i}$, the value of each state if $\pi_i$ is executed.

2) Policy improvement: Given $V_i$, calculate $\pi_{i+1}$ using a one-step look-ahead based on $V_i$ (as in eq. 2).

The policy iteration algorithm terminates when the policy improvement step yields no change in the utilities.

The choice of which solution method is better depends on various factors. If there are many actions, or if a fair policy already exists, it is better to use policy iteration. On the other hand, if there are few actions and the state transitions are acyclic, then value iteration is a better option.
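The two alternating steps can be sketched as follows for a small tabular MDP. The example MDP, the tolerances, and the function name `policy_iteration` are hypothetical. As a simplifying assumption, the sketch evaluates a policy by repeated sweeps of eq. 3 rather than by solving the linear system exactly, and it stops when the improvement step changes no action, at which point the values no longer change either.

```python
def policy_iteration(states, actions, P, R, gamma, tol=1e-10):
    """Alternate policy evaluation and policy improvement on a tabular MDP."""
    def q(s, a, V):
        # Expected one-step reward plus discounted value of the next state.
        return sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2]) for s2 in states)

    pi = {s: actions[0] for s in states}  # arbitrary initial policy pi_0
    V = {s: 0.0 for s in states}
    while True:
        # 1) Policy evaluation: sweep eq. 3 for the fixed policy until stable.
        while True:
            delta = 0.0
            for s in states:
                v = q(s, pi[s], V)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # 2) Policy improvement: greedy one-step look-ahead on V (eq. 2).
        changed = False
        for s in states:
            best = max(actions, key=lambda a: q(s, a, V))
            if q(s, best, V) > q(s, pi[s], V) + 1e-9:
                pi[s] = best
                changed = True
        if not changed:
            return V, pi

# Hypothetical deterministic MDP: "stay" keeps the state, "go" switches it,
# and a reward of 1 is earned whenever the agent lands in state 1.
states = [0, 1]
actions = ["stay", "go"]
P = {"stay": [[1.0, 0.0], [0.0, 1.0]], "go": [[0.0, 1.0], [1.0, 0.0]]}
reward_on_arrival = [[0.0, 1.0], [0.0, 1.0]]
R = {"stay": reward_on_arrival, "go": reward_on_arrival}

V_pi, pi_opt = policy_iteration(states, actions, P, R, gamma=0.5)
```

Starting from the all-"stay" policy, a single improvement step already yields the final policy here, which illustrates how a good policy can emerge before the values are known precisely.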

Partially observed MDPs: An MDP in which the environment is only partially observable is known as a partially observable MDP (POMDP). In the method discussed above for solving MDPs, it was assumed that the state s is known when the action is to be performed. This assumption does not hold for POMDPs. POMDPs are able to model uncertain aspects of the environment such as the stochastic effects of actions, incomplete information, and noisy observations of the environment. Although POMDPs have been known for decades, their widespread uptake has been impeded for two main reasons: i) it is difficult to satisfactorily model the environment dynamics (such as the probabilities of action outcomes and the accuracy of data), and ii) it is difficult to solve the resulting model.

2) Solutions for complex MDPs: While the classical DP algorithms of value iteration and policy iteration work very well for simple to moderately complex MDPs, they break down for large-scale and complex MDPs, as the requirement of computing, storing, and manipulating the so-called transition probability matrices becomes prohibitive. In complex MDPs, two crippling problems arise: i) the curse of modeling, and ii) the curse of dimensionality. In the former problem, it becomes very difficult to compute the values of the transition probabilities, while in the latter problem, storing or manipulating the elements of the value function needed in DP becomes challenging due to the large dimensionality. Therefore, classical DP techniques are rather ineffective at solving large-scale complex MDPs [25].

Dealing with MDPs with unknown probabilities: If the probabilities of the MDP are unknown, then the problem becomes a reinforcement learning (RL) task. We have earlier seen RL in section II-C, where we noted that the task of RL is to determine for an agent what actions it should take in a stochastic environment. We will see methods of dealing with this when we develop solutions for RL later in section IV-B.

3) Previous work of applying MDPs in CRNs: MDPs have been applied to study a wide range of planning and optimization problems in CRNs. It is noted here that MDPs in their native form require complete knowledge of the system (such as the state transition probabilities and the number of states) and are not directly applicable when CRs are operating in unknown RF environments. However, various techniques exist (such as reinforcement learning) that can work in scenarios where the environment is not completely known. In [26], Choi et al. proposed a partially observable Markov decision process (POMDP) based framework for channel access to opportunistically exploit the frequency channels on which a primary network operates. In another work, Zhao et al. devised a POMDP framework to develop a cognitive MAC protocol [27]. MDPs have also been applied extensively in communication networks. Interested readers are referred to a survey paper [28] which highlights the applications of MDPs to communication networks and also includes a discussion on their use for routing.

B. GAME THEORY

Game theory is a mathematical decision framework composed of various models and tools through which we can study and analyze competitive interaction between multiple self-interested rational agents. Although game-theoretic models exist for both cooperative and non-cooperative settings, the ability to model competition mathematically distinguishes game theory from optimal control-theoretic frameworks such as the MDP [4]. Game theory is also differentiated from optimization theory (which caters to a single decision maker scenario) in its ability to model multi-agent decision making scenarios where the decisions of the agents affect each other.

Every game involves a set of players, actions for each of the players representing how players interact, and preferences for each of the players defined over all the possible outcomes. The preferences, or payoffs, are typically defined through a utility function, or a payoff function, which maps each possible outcome to a number representing that outcome's desirability. An outcome brings more reward, or is more desirable, if it has a higher utility [29]. In order to maximize its payoff, each player acts according to its strategy. More formally, a game can be mathematically represented by the 3-tuple G = (N, S, U), where N represents the set of players, S the set of strategies, and U the set of payoff functions.

[Figure 1 (decision tree, image not reproduced): models are classified by the number of agents (single or multiple), the number of states (single or multiple), whether state transitions are controllable, and whether the environment is completely observable. Node labels: k-armed bandit, Markov chain, hidden Markov models, Markov decision process (MDP), partially observable MDP (POMDP), repeated game, stochastic game, incomplete information game.]

Fig. 1. Relationship between various Markov models, processes, and games.

The terms strategy and action should not be confused: the strategy specifies how the player should act in each possible situation, and can be envisioned as a complete algorithm documenting how the player will play the game. The strategy of a player can be a single action (for a single-shot or static game) or a set of actions during the game (for a sequential or dynamic game) [30]. A player's strategy set defines what strategies are available for it to play: the strategy set may be finite (e.g., when a choice is made from a countable discrete set of values) or infinite (e.g., when some continuous value is chosen). A pure strategy deterministically defines how a player will play a game, while a mixed strategy gives a stochastic definition by assigning a probability to each pure strategy. The strategy profile, or the action profile, documents the strategy of each player and fully specifies all actions in a game. The outcome of the game depends, possibly stochastically, on the players' strategy profile and returns payoffs to the various players.
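To illustrate the difference between pure and mixed strategies, the following minimal sketch computes the expected payoff of a strategy profile in a two-player game. The game used (matching pennies) and the function name `expected_payoff` are illustrative assumptions, not taken from the text above.

```python
def expected_payoff(payoff, x, y):
    """Expected payoff of one player when row mixes with x and column with y."""
    return sum(x[i] * y[j] * payoff[i][j]
               for i in range(len(x)) for j in range(len(y)))

# Matching pennies: payoffs of the row player (the column player receives
# the negative of each entry). Index 0 is heads, index 1 is tails.
U_row = [[1, -1], [-1, 1]]

pure = expected_payoff(U_row, [1.0, 0.0], [1.0, 0.0])   # both play heads
mixed = expected_payoff(U_row, [0.5, 0.5], [0.5, 0.5])  # both mix uniformly
```

A pure strategy is the special case of a mixed strategy that puts probability 1 on a single action, so the same function covers both.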

Game theory is popularly used in CRNs since each CR in a CRN interacts with a dynamic environment composed of other rational agents that sense, act, and learn while aiming to maximize personal utility. For games specific to CRNs, individual CRs typically represent the players, and the actions may include the choice of various system or design parameters such as the modulation scheme, the transmit power level, or the flow control parameter. One of the main goals of game theory is to determine equilibria points for a given game. These are sets of stable strategies in which individuals are unlikely to unilaterally change their behaviour. To gauge their efficiency, these equilibria points are often contrasted with some notion of a socially optimal point, which produces the 'best' outcome when the interests of all the players are taken into account.

In recent years, game theory has provided deep insights into how to design decentralized algorithms for resource sharing in networks, particularly through the theory known as mechanism design, sometimes called reverse game theory. While traditional game theory focuses on analyzing how rational players would play a given game, in mechanism design we are interested in engineering or designing a game that rational players will play into a desired equilibrium point. Intuitively, mechanism design aims to set up the game such that players do what the designers want them to do, but because the players themselves want to do it [31].

1) Representation of games: There are two common ways of representing non-cooperative games. The normal-form representation of a game explicitly lists the payoff for each player for every conceivable outcome. This representation, also known as the standard or strategic form, is appropriate for static games of complete and perfect information. For two-player games, this can be depicted in matrix form, either as a pair of payoff matrices (one each for the row player and the column player) or as a single payoff matrix (with each entry containing the payoffs for both players). On the other hand, an extensive-form game is a representation that allows, unlike the normal form, explicit representation of temporal aspects of dynamic games, such as the sequencing of players' possible moves and their choices at every decision point, along with payoffs for all possible game outcomes. It also allows representation of the (possibly imperfect) information each player has about the other players' moves when making a decision, and of incomplete information (about the nature of the game) in the form of chance events encoded as moves by the player 'nature'. More details about the representation of games can be found in [29].

2) Solution Concepts: In game theory, a solution concept formalizes the notion of 'solving' a game by predicting how rational players would play a specified game. These predictions, called solutions, describe what strategies would be chosen by the players and, therefore, also describe the predicted result of the game. The most commonly used solution concepts are the equilibrium concepts and the optimality concepts.

We shall now discuss three concepts of equilibrium that are relevant to our subject.

The Nash equilibrium (NE) is a solution concept for a non-cooperative game involving two or more players. An NE is a stable equilibrium point of a game representing the situation where no player can benefit by changing its strategy unilaterally (i.e., by changing its strategy while the other players keep theirs unchanged). In other words, an NE implies that each player's strategy is the best response against those of the others. Note that a game may have multiple NE. While the NE is a very useful concept, analysis based solely on NE has many drawbacks, as pointed out in [4] [32]. Also, the significant complexity of computing NEs has prompted the development of alternative solution concepts.
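The best-response characterization translates directly into a brute-force check for pure-strategy NE in a two-player normal-form game. The sketch below is illustrative only: the function name `pure_nash_equilibria` and the payoff numbers for the two example games are hypothetical, and mixed-strategy equilibria are not considered.

```python
def pure_nash_equilibria(A, B):
    """Return all profiles (i, j) where i is a best response to j and vice versa.
    A[i][j] and B[i][j] are the payoffs of the row and the column player."""
    rows, cols = len(A), len(A[0])
    equilibria = []
    for i in range(rows):
        for j in range(cols):
            row_best = all(A[i][j] >= A[k][j] for k in range(rows))
            col_best = all(B[i][j] >= B[i][l] for l in range(cols))
            if row_best and col_best:
                equilibria.append((i, j))
    return equilibria

# Prisoner's dilemma with illustrative payoffs (0 = cooperate, 1 = defect).
pd_A = [[3, 0], [5, 1]]
pd_B = [[3, 5], [0, 1]]
pd_ne = pure_nash_equilibria(pd_A, pd_B)

# A coordination game with illustrative payoffs, which has two pure NE.
co_A = [[2, 0], [0, 1]]
co_B = [[2, 0], [0, 1]]
co_ne = pure_nash_equilibria(co_A, co_B)
```

The prisoner's dilemma has mutual defection as its only pure NE, while the coordination game shows that a game may have more than one. The check enumerates every profile, which hints at why computing equilibria becomes expensive as the strategy sets grow.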

The correlated equilibrium is an intuitive solution concept that generalizes the Nash equilibrium and is much easier to compute.6 The idea is that each player chooses its action after observing a common public signal. The player's strategy assigns an action to every possible observation. If no player has any incentive to deviate from the devised strategy, assuming that the others do not deviate, the game is in correlated equilibrium.

The Wardrop equilibrium is a common solution concept useful for modeling selfish routing in transportation and telecommunication networks with congestion. In the study of such networks, it is assumed that the players (travelers or packets, respectively) choose the shortest perceived routes given the current traffic conditions. For a network in Wardrop equilibrium, all the flow paths in use for a source-destination pair have an equal delay, and no unutilized path has a lower delay.7 A wireless routing analogue of this was explored in [33], where a flow-avoiding routing protocol was proposed.
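The equal-delay condition can be illustrated on the simplest possible network: two parallel links between one source and one destination. The sketch below is a hedged illustration, not a method from the cited works; the function name `wardrop_two_links`, the linear latency functions, and the unit demand are all assumptions, and the latencies are assumed to be nondecreasing in the flow they carry.

```python
def wardrop_two_links(l1, l2, demand, iters=60):
    """Split `demand` over two parallel links so that used links have equal delay.
    l1 and l2 are nondecreasing latency functions of the flow they carry."""
    if l1(demand) <= l2(0.0):   # link 2 is never attractive
        return demand, 0.0
    if l2(demand) <= l1(0.0):   # link 1 is never attractive
        return 0.0, demand
    lo, hi = 0.0, demand        # bisection on the flow x sent over link 1
    for _ in range(iters):
        x = (lo + hi) / 2.0
        if l1(x) < l2(demand - x):
            lo = x              # link 1 is still faster: shift more flow onto it
        else:
            hi = x
    x = (lo + hi) / 2.0
    return x, demand - x

# Hypothetical latencies: l1(x) = 1 + 2x and l2(x) = 2 + x, one unit of flow.
f1, f2 = wardrop_two_links(lambda x: 1 + 2 * x, lambda x: 2 + x, 1.0)

# A case where one link stays unused: l1(x) = x and a constant l2(x) = 1.
g1, g2 = wardrop_two_links(lambda x: x, lambda x: 1.0, 1.0)
```

In the first case the flow splits so that both links see the same delay. In the second, all flow uses the first link and the unused link offers no lower delay, which matches the two parts of the equilibrium condition.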

6 Roger Myerson has pithily remarked: "If there is intelligent life on other planets, in a majority of them, they would have discovered correlated equilibrium before Nash equilibrium."

7 If this property were not met, the system would intuitively not be in equilibrium, for it would be possible for a flow to reduce its latency by switching to an unutilized path.

While optimality has a well-defined, unambiguous meaning in optimal control problems (one-player games), optimality in settings of multi-player decision making is a difficult concept to define precisely. Equilibrium points are not necessarily optimal, since equilibria points may not be 'socially optimum' (e.g., as in the classical Prisoner's dilemma game [34]). A common notion of optimality in game theory is that of Pareto-optimality. A strategy profile is said to be a Pareto-optimal solution if no other joint decision of the players can improve the performance of at least one of them without degrading the performance of another. It must be noted that achieving Pareto optimality implies neither equality nor fairness. Another optimality concept is the Minimax solution concept, useful for non-zero-sum games, in which the aim is to minimize the maximum loss a player will face in the worst-case scenario [35].
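The definition of Pareto-optimality can be checked mechanically for a small game by testing every profile against every other. In the sketch below, the function name `pareto_optimal_profiles` and the Prisoner's dilemma payoff numbers are illustrative assumptions.

```python
def pareto_optimal_profiles(payoffs):
    """payoffs maps each strategy profile to a tuple of per-player payoffs.
    A profile is Pareto-optimal if no other profile makes some player better
    off without making any player worse off."""
    def dominates(u, v):
        return (all(a >= b for a, b in zip(u, v))
                and any(a > b for a, b in zip(u, v)))
    return [p for p, u in payoffs.items()
            if not any(dominates(v, u) for v in payoffs.values())]

# Prisoner's dilemma with illustrative payoffs (C = cooperate, D = defect).
pd = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
optimal = pareto_optimal_profiles(pd)
```

Mutual defection is the only profile that is not Pareto-optimal here, since mutual cooperation improves both payoffs. The lopsided profiles are Pareto-optimal even though they are far from fair, in line with the remark above.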

Game theory predicts the agents' equilibrium behavior, typically without specifying by itself how to reach such a state. Algorithms for computing equilibria and determining the dynamics of games towards them are studied in the fledgling discipline of algorithmic game theory, which lies at the intersection of game theory and algorithms [34]. It has been shown that equilibrium points do not necessarily have to be socially optimal. An interesting question then is to quantify how inefficient the equilibria points (which are reached through self-interested behavior) are with reference to the idealized 'optimal' situation (where the agents collaborate selflessly in a bid to minimize total cost). Since there can be multiple NE with varying overall payoffs, the comparison of the worst NE with the ideal is known as the 'price of anarchy', while the comparison of the best NE with the ideal is known as the 'price of stability' [34].
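Both ratios can be computed by brute force for a small two-player cost game. The sketch below is illustrative: the function name and the cost matrices are hypothetical, only pure-strategy NE are considered, and the game is assumed to have at least one pure NE and a strictly positive optimal social cost.

```python
def price_of_anarchy_and_stability(C1, C2):
    """C1[i][j] and C2[i][j] are the costs paid by players 1 and 2 under
    profile (i, j). Returns (price of anarchy, price of stability)."""
    n, m = len(C1), len(C1[0])
    social = {(i, j): C1[i][j] + C2[i][j] for i in range(n) for j in range(m)}
    # A profile is a pure NE if no player can lower its own cost unilaterally.
    nash = [
        (i, j) for (i, j) in social
        if all(C1[i][j] <= C1[k][j] for k in range(n))
        and all(C2[i][j] <= C2[i][l] for l in range(m))
    ]
    optimum = min(social.values())
    poa = max(social[p] for p in nash) / optimum  # worst NE versus the ideal
    pos = min(social[p] for p in nash) / optimum  # best NE versus the ideal
    return poa, pos

# Hypothetical two-route game: agreeing on route 0 costs 1 each, agreeing on
# route 1 costs 2 each, and disagreeing costs 3 each.
C1 = [[1, 3], [3, 2]]
C2 = [[1, 3], [3, 2]]
poa, pos = price_of_anarchy_and_stability(C1, C2)
```

Here both agreement profiles are NE, so the worst equilibrium costs twice the optimum while the best equilibrium achieves it.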

We have covered only the most basic solution concepts that are relevant to our subject. For a discussion of advanced solution concepts such as rationalizability, ε-Nash equilibrium, and trembling-hand perfect equilibrium, we refer the interested reader to standard game theory texts [36].

3) Categories of games: There are various ways to categorize games. We will discuss games through the following seven contrasting categories:

1) Cooperative vs. non-cooperative: In all game-theoretic models, a basic primitive is the concept of a player. A player may be interpreted either as an individual or as a group of individuals. After defining the set of players in a game, we may distinguish between two kinds of models: i) those in which we are dealing with the possible actions of individual players, and ii) those in which we are dealing with possible joint actions of groups of players. Models of the former kind (individual-based) are sometimes known as 'noncooperative', while those of the latter kind are correspondingly known as 'cooperative'. The difference can be summarized as follows: in a cooperative game, players can make binding commitments, while in a noncooperative game, they cannot. A game in which the players are groups of individuals that can make binding commitments is also known as a coalition game [37].

2) Complete vs. incomplete information: A game with complete information is a game in which each player knows the exact game being played. The game is represented by the 3-tuple G = (N, S, U), with N representing the set of players, S the set of strategies, and U the set of payoff functions. This complete information is not known in games of incomplete information. We typically employ the model of a Bayesian game to model situations in which some of the parties are not certain of the characteristics of some of the other parties. Games with incomplete information should not be confused with games with imperfect information (in which the history of the game is not available to all players). In a Bayesian game, at least one player is unsure of the type (and therefore the payoff function) of another player. In games of imperfect information, on the other hand, the actual moves of agents are not common knowledge, but the game itself is.

3) Sequential vs. simultaneous: In a sequential game, one player chooses its action before the others choose theirs, so that the later players can utilize knowledge about the previous move to decide on their actions. In simultaneous games, on the other hand, players choose their moves without being aware of the other players' moves. A game in which players have sequential interaction is also known as a dynamic game.

4) Static vs. dynamic: In static games, alternatively known as single-stage games or one-shot games, it is assumed that there exists only a single time step, implying that the players have only one move as a strategy. In a dynamic game, however, players interact with each other sequentially. Repeated games, also known as supergames, are a subclass of dynamic games in which a similar stage game is played numerous times. Players in a repeated game, unlike those in simultaneous games, have the benefit of historic information which they can utilize to adapt their strategy. Depending on the number of stages, we can classify dynamic games into finite-horizon games and infinite-horizon games; the strategies for such games can vary hugely. If players in a finite-horizon game are not aware of the duration of the game (which is clearly a common situation in practical interactions, particularly in a networking setting), then infinite-horizon games with discounting can be used as an appropriate model. In order to cater for the potentially abrupt end to the game, discounting entails decreasing the value of future stage payoffs so that payoffs nearer in time are preferred. The study of dynamic games is taken up in a subfield of game theory known as dynamic game theory, which can be envisioned as a child discipline of game theory and optimal control theory [35].

5) Perfect vs. imperfect information: We refer to a game as a perfect-information game if the players have perfect knowledge of all previous moves in the game at any moment they have to make a new move. Since players in simultaneous games (which include practical games like poker and bridge) do not know the actions of other players, simultaneous games are imperfect-information games. Only sequential games, therefore, can be games of perfect information, with chess being an example of a sequential perfect-information game.

6) Symmetric vs. asymmetric: If the game is symmetric, the identities of the players may be changed without changing the payoff to the strategies. In other words, even if the roles of the two players in a two-player symmetric game are reversed, the same payoffs would be observed. This condition does not hold for asymmetric games.

7) Zero-sum vs. non-zero-sum: In a zero-sum game, the sum of payoffs of all the players must be zero; in other words, a player cannot become better off without reducing some other player's utility. A game which is not zero-sum is called a nonzero-sum game or variable-sum game.
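The last two categories are easy to test mechanically for two-player games in normal form. The following sketch is illustrative: the function names `is_zero_sum` and `is_symmetric` and the example payoffs (matching pennies and a Prisoner's dilemma) are assumptions, with A and B denoting the payoff matrices of the row and the column player.

```python
def is_zero_sum(A, B):
    """True if the two players' payoffs sum to zero in every outcome."""
    return all(A[i][j] + B[i][j] == 0
               for i in range(len(A)) for j in range(len(A[0])))

def is_symmetric(A, B):
    """True if swapping the players' roles leaves the payoffs unchanged,
    i.e., the game is square and B is the transpose of A."""
    n = len(A)
    if any(len(row) != n for row in A):
        return False
    return all(B[i][j] == A[j][i] for i in range(n) for j in range(n))

# Matching pennies: zero-sum, but the two roles are not interchangeable.
mp_A = [[1, -1], [-1, 1]]
mp_B = [[-1, 1], [1, -1]]

# Prisoner's dilemma with illustrative payoffs: symmetric and nonzero-sum.
pd_A = [[3, 0], [5, 1]]
pd_B = [[3, 5], [0, 1]]
```

For example, `is_zero_sum(mp_A, mp_B)` holds while `is_symmetric(mp_A, mp_B)` does not, and the reverse is true for the Prisoner's dilemma matrices.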

Uncertainty can come into games in three distinct ways: i) a player may use chance to determine which strategy to use (such a strategy is known as a mixed strategy), ii) the game itself can include random events, and iii) a player may not be exactly sure what game is being played, i.e., it may not know precisely what strategies other players are capable of, or their payoffs. The latter two points refer to the incomplete-information nature of the game. In addition, the game may have imperfect information, where the players do not know the previous history or have asymmetric information. We note here that simultaneous games are always imperfect-information games, since players choose their moves without being aware of the other players' moves.

Stochastic games, introduced by Lloyd Shapley in the 1950s, are games in which (potentially multiple) agents take decisions in a sequence of stages (i.e., in a dynamic game) and each player receives a payoff that depends probabilistically on the current state and the chosen actions [4]. Intuitively speaking, the agents in a stochastic game repeatedly play games from a collection of games, and the particular game played at any given iteration depends probabilistically on the previous game played and on the actions taken by all agents therein [36]. Stochastic games have been applied in wireless networks in areas such as flow control, routing, and scheduling [38].

Stochastic games generalize the concepts of MDPs, Markov chains, and repeated games: MDPs can be viewed as the special case of a single-agent stochastic game, Markov chains as a single-agent stochastic game in which the agent has a single action in each stage, and repeated games as a single-state (or single-stage) stochastic game [39]. We have seen previously that MDPs are appropriate models for reinforcement learning techniques that address the problem of a single agent learning through experience and interaction with an environment (assumed stationary). Stochastic games extend the concept of MDPs to multi-agent environments. In multi-agent environments, the other agents are also learning and adapting, and thus the environment can no longer be assumed stationary. Stochastic games, also called competitive MDPs, allow us to model uncertainty in the players' operating environment by allowing probabilistic state transitions in a dynamic game.

Auctions: With a plethora of heterogeneous technologies, the wireless communication system has become quite complex. The dynamism of the overall wireless ecosystem has led researchers to explore using models from other similarly complex domains so that complementary mechanisms may be exploited. Indeed, there has been a lot of work in applying various economics-based approaches to wireless networking [40]. CRNs, in their distributed nature, complexity, and heterogeneity, have become analogous to real-world markets [41] and are amenable to the incorporation of market mechanisms and incentives. Auction theory is an interdisciplinary field that has shown itself to be particularly useful for CRN applications. Traditional static methods of managing spectrum are grossly inadequate for modern CRNs, and the market mechanism of auctions seems to be a promising approach for distributed allocation of network resources. A detailed survey of various auction approaches for resource allocation in wireless networks is provided in [41].

Incidentally, there are clear connections between MDPs and game-theoretic models, in particular stochastic games. The relationship between Markov chains, MDPs, POMDPs, HMMs, and Markov (or stochastic) games can be seen in figure 1. MDPs are observable stochastic environments in which a single agent takes a decision by choosing an action given knowledge of the current state. Markov games, or stochastic games, generalize the MDP model to allow a pair of agents to control state transitions (either jointly or in alternation). Note that a one-state stochastic game is equivalent to an (infinitely) repeated game, while the special case of a one-agent stochastic game is equivalent to an MDP. A POMDP models partially observable stochastic environments in which a single agent takes a decision while being provided with partial knowledge of the current state. In incomplete-information games, on the other hand, multiple agents control the transitions in the environment while having incomplete knowledge of the environment's state.

4) Game theory for Wireless Networks: There has been a lot of work in applying game-theoretic ideas to the design and analysis of wireless networks [29] [47] [54] [55] and cognitive radio networks [56]. A comprehensive survey of game-theoretic approaches developed for different multiple access schemes in wireless networks is provided in [52].

In [30], Felegyhazi presents a tutorial on the application of game theory in wireless networks. To clarify the concepts, four games are constructed for wireless networks that are analogous to classical games in the game-theory literature. In particular, two of these games, the 'Forwarder's dilemma' and the 'Joint Packet Forwarding' game, relate to network-layer issues of packet forwarding [30]. The 'Forwarder's dilemma' is analogous to the classical game-theoretic problem of the 'Prisoner's dilemma' [34], in which an iterated strict dominance solution exists. It is shown that the Forwarder's dilemma is a symmetric nonzero-sum game, because the players can increase their payoffs by mutually cooperating. In the second problem of 'Joint Packet Forwarding', no iterated strict dominance solution exists, and the analysis is therefore carried out in terms of the Nash equilibrium (NE). Since this game has two NE, the example is exploited to explain the concept of Pareto optimality. The Joint Packet Forwarding problem is also nonzero-sum, but it is asymmetric rather than symmetric.
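Iterated strict dominance can be sketched for a two-player game as repeated deletion of strategies that are strictly worse than some other remaining strategy. The payoffs below are illustrative stand-ins for a forwarding dilemma (a unit benefit when one's own packet is forwarded and a hypothetical cost c = 0.2 for forwarding the other player's packet) and are not necessarily those used in [30]. The function name `iterated_strict_dominance` is likewise an assumption, and only domination by pure strategies is checked.

```python
def iterated_strict_dominance(A, B):
    """Iteratively delete strictly dominated pure strategies.
    A[i][j] and B[i][j] are the payoffs of the row and the column player.
    Returns the surviving row indices and column indices."""
    rows = list(range(len(A)))
    cols = list(range(len(A[0])))
    changed = True
    while changed:
        changed = False
        for r in list(rows):
            # Row r is deleted if another remaining row beats it everywhere.
            if any(all(A[k][c] > A[r][c] for c in cols) for k in rows if k != r):
                rows.remove(r)
                changed = True
        for c in list(cols):
            # Column c is deleted if another remaining column beats it everywhere.
            if any(all(B[r][l] > B[r][c] for r in rows) for l in cols if l != c):
                cols.remove(c)
                changed = True
    return rows, cols

# Illustrative forwarding dilemma (0 = forward, 1 = drop), with cost c = 0.2.
c = 0.2
A = [[1 - c, -c], [1.0, 0.0]]   # row player's payoffs
B = [[1 - c, 1.0], [-c, 0.0]]   # column player's payoffs
surviving_rows, surviving_cols = iterated_strict_dominance(A, B)
```

With these numbers, forwarding is strictly dominated by dropping for both players, so the only surviving profile is mutual dropping, even though mutual forwarding would give both players a higher payoff.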

Challenges and experiences in applying game-theoretic ideas to system design are related in [57]. Various approaches for incentivizing cooperative forwarding behavior were analyzed, including a bartering primitive, a virtual currency primitive, and setting up an equilibrium point at a desired forwarding rate through appropriate game mechanism design.

Application of game theory in CRNs: There is a large body of literature on the applications of game theory to CRNs. Interested readers are referred to the following two survey papers and a book, and the references therein, for more details. Van der Schaar presented a survey of spectrum-access games that are relevant to DSA CRNs in [48], while a more general survey paper on the application of game-theoretic ideas to CRNs was published by Wang et al. [56]. A comprehensive game-theoretic treatment of cognitive radio networking and security is presented in the book authored by Liu et al. [53].

5) Game theory for Routing: The framework of game theory has presented itself as a viable choice for modeling the problem of routing in a network. Applications include identifying and mitigating selfish routing behavior, studying the convergence of routing techniques under changing network conditions, and analyzing the effects of different kinds of node behavior on routing [29]. Example works can be found in the references of [29].

An important aspect of tackling routing problems through game theory is precisely how the game is modeled (i.e., how the players are defined, what the utilities are, etc.). This is true of mathematical modeling in general, where it is understood that models are mere abstractions of the reality being modeled and that the purpose of models is to be useful rather than to be accurate.⁸ Various implications of how to model a routing problem in a network are discussed in [29]. To summarize the discussion in [29], assume a simple source routing setup (where the end-to-end path is specified by the source node), chosen for ease of exposition. The players in the game can be viewed as the source nodes in the network, although it can be more convenient to view a player as a source/destination pair (since such a formulation allows for the existence of multiple flows from a single source). The action set available to each player is possibly the set of all possible paths from the source to the destination. Depending on how the game is formulated, a node may choose a single path from all the possible paths, or it may choose multiple paths and also decide how much of its flow to send on each route. Preferences in a routing game can take several forms, just as many routing metrics exist for routing protocols to determine a route’s quality. A simple way to formulate preferences is to base them on the end-to-end delay for a packet to traverse the chosen route, with a shorter delay being preferable to a longer delay. While such a simple example can be solved through optimization techniques (especially if we consider a single source and destination pair, or if the available routes are completely disjoint), the benefit of using game theory kicks in when we consider the interaction between multiple flows using common paths through the network.

⁸ The statistician George Box famously remarked that “all models are wrong, some are useful”.

TABLE I: SUMMARY OF THE VARIOUS DECISION AND PLANNING TECHNIQUES DISCUSSED IN SECTION III

Decision techniques | Application to CRNs | Application to Routing
Markov Decision Processes | Opportunistic spectrum access: [26]; Medium Access Control (MAC): [27]; Cooperative spectrum selection: [43] | Routing in ad-hoc CRNs: [42]; Routing in communication networks: see references in [28]
Game Theory | Resource allocation: see references in [40] [41]; Spectrum sharing: [47] [48]; Medium Access Control (MAC): [52]; Security: see references in [53] | Routing games: [44] [45] [46]; Mitigating selfish routing: [49] [50] [51]; Modeling routing: see references in [29]

An interesting aspect of game-theoretic models of network problems is that they can explain certain nonintuitive behavior. For example, it has been shown that in certain cases, adding more resources (e.g., adding extra links) to a network in equilibrium can actually lead to a new equilibrium in which all the users are worse off. This phenomenon, known as Braess’ paradox [58], shows how the dynamic interaction between players and resources can lead to counterintuitive results, and why a mathematical theory like game theory can be a useful tool. Network routing problems also arise in domains other than telecommunication networks (e.g., transportation networks), where they have been studied for a long time with a common solution concept known as Wardrop equilibria, which has been discussed earlier.
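Braess' paradox can be reproduced on the standard four-node textbook network. The sketch below assumes one unit of traffic from s to t, two edges whose latency equals their load (s→v and w→t), two edges with constant latency 1 (v→t and s→w), and an optional zero-latency shortcut v→w. The network and all names are illustrative assumptions, not taken from [58].

```python
def path_costs(flows):
    """Latency experienced on each path for the given path flows."""
    f_up = flows.get("up", 0.0)      # path s -> v -> t
    f_down = flows.get("down", 0.0)  # path s -> w -> t
    f_zig = flows.get("zig", 0.0)    # path s -> v -> w -> t (uses the shortcut)
    load_sv = f_up + f_zig           # edge s->v has latency equal to its load
    load_wt = f_down + f_zig         # edge w->t has latency equal to its load
    return {
        "up": load_sv + 1.0,
        "down": 1.0 + load_wt,
        "zig": load_sv + 0.0 + load_wt,
    }

def is_equilibrium(flows, allowed, tol=1e-9):
    """Wardrop condition: every used path is a cheapest path among those allowed."""
    costs = path_costs(flows)
    cheapest = min(costs[p] for p in allowed)
    return all(costs[p] <= cheapest + tol
               for p in allowed if flows.get(p, 0.0) > 0)

def average_cost(flows):
    """Average latency per unit of traffic."""
    costs = path_costs(flows)
    return sum(flow * costs[p] for p, flow in flows.items())

WITHOUT_SHORTCUT = ("up", "down")
WITH_SHORTCUT = ("up", "down", "zig")

before = {"up": 0.5, "down": 0.5}  # equilibrium when the shortcut is absent
after = {"zig": 1.0}               # equilibrium once the shortcut is added

print(is_equilibrium(before, WITHOUT_SHORTCUT), average_cost(before))  # True 1.5
print(is_equilibrium(before, WITH_SHORTCUT))                           # False
print(is_equilibrium(after, WITH_SHORTCUT), average_cost(after))       # True 2.0
```

In this assumed network the old even split stops being an equilibrium once the free shortcut exists, and the new equilibrium raises every user's latency from 1.5 to 2.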

In algorithmic game theory, selfish routing in networks is a well-studied problem, both in a general network setting (e.g., of transportation networks) [44] and for Internet-like networks [59]. In general, centralized calculation of optimal routes is infeasible for a majority of network routing problems, leading to interest in distributed algorithms. Distributed algorithms can be viewed as ‘selfish routing’ since each agent intends to optimize for itself. Researchers have vigorously pursued questions that aim to quantify the performance degradation due to lack of coordination between the various ‘players’ of this routing game. In this regard, the concepts of price of anarchy and price of stability, discussed earlier, have been proposed. It has been shown that while the price of anarchy is unbounded for selfish routing in networks with general latency functions [44], results are much more encouraging for networks with linear latency functions [44] and for actual Internet-like networks [59]. Selfish routing in networks and its equilibria were first formally defined by Wardrop in 1952, and it has been a popular topic for researchers since.
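For intuition on the linear-latency case, the price of anarchy can be computed for the classic two-link example usually attributed to Pigou: one unit of traffic chooses between a link of constant latency 1 and a link whose latency equals the fraction x of traffic using it. This is a minimal sketch; the example network and the grid search are illustrative assumptions, not taken from [44].

```python
def social_cost(x):
    """Total latency when a fraction x uses the load-dependent link (latency x)
    and the remaining 1 - x uses the constant link (latency 1)."""
    return x * x + (1.0 - x) * 1.0

# Selfish equilibrium: the load-dependent link never costs more than 1,
# so no user has a reason to leave it and all traffic ends up there.
equilibrium_cost = social_cost(1.0)

# Social optimum, found here by a simple grid search over the split.
grid = [i / 1000 for i in range(1001)]
optimal_x = min(grid, key=social_cost)
optimal_cost = social_cost(optimal_x)

price_of_anarchy = equilibrium_cost / optimal_cost
print(optimal_x, optimal_cost, price_of_anarchy)
```

The ratio evaluates to 4/3 for this example, which is the known worst case for nonatomic selfish routing with linear (affine) latency functions.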

6) Routing Games: A characteristic of a typical routing game is that each player is interested in finding a minimum cost path from the origin to the destination in a congested network, where the delay of an edge on some path depends on its congestion, which in turn depends on the total number of players using that edge in their paths. Such a dependence on congestion is seen in a class of games known as congestion games, first proposed by Rosenthal in 1973. In a congestion game, the payoff of each player depends not only on the resource it chooses, but also on the number of players choosing the same resource. Congestion games are a special case of potential games. Fortunately, for potential games in general, the equilibrium points are guaranteed to be approximately optimal under best response dynamics [34].

Repeated games and potential games have been shown to be especially relevant to the routing problem. In previous work, repeated games have been used to address the problem of selfish routing with punishment for unsocial behavior [49] [50] [51]. The usage of potential games for routing has been well explored [44]. Potential games encompass many of the well-studied network routing and congestion games. Potential games have many desirable properties, including: i) pure equilibria always exist, ii) the best response dynamics are guaranteed to converge, and iii) the price of stability (or, the ratio of the best NE to the optimal solution) can be bounded using a technique named the potential function method. Potential games are especially attractive from the point of view of analysis, since the incentives of all the players are mapped onto a single function, called the potential function, whose local optima correspond to the set of pure NE. There has been a lot of work in modeling wireless networking problems as potential games (see the references in [38] for more details), with most applications being in the domains of power control, waveform adaptation, and routing and congestion games.
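The convergence of best response dynamics can be illustrated on a small congestion game in which players pick one of several parallel links. The sketch below tracks Rosenthal's potential, which drops by exactly the amount the moving player saves, so the dynamics must stop at a pure NE. The link delays (`SLOPES`) and the number of players are hypothetical values chosen for illustration.

```python
def loads(choices, n_links):
    """Number of players on each link."""
    counts = [0] * n_links
    for link in choices:
        counts[link] += 1
    return counts

def rosenthal_potential(choices, delay, n_links):
    """Sum over links of delay(link, 1) + ... + delay(link, load)."""
    return sum(sum(delay(link, k) for k in range(1, load + 1))
               for link, load in enumerate(loads(choices, n_links)))

def best_response_dynamics(choices, delay, n_links):
    """Let players switch links one at a time until nobody can improve."""
    choices = list(choices)
    history = [rosenthal_potential(choices, delay, n_links)]
    improved = True
    while improved:
        improved = False
        for i in range(len(choices)):
            counts = loads(choices, n_links)
            current = delay(choices[i], counts[choices[i]])
            # Cost player i would see on each link (joining adds one user).
            costs = [current if link == choices[i] else delay(link, counts[link] + 1)
                     for link in range(n_links)]
            best = min(range(n_links), key=costs.__getitem__)
            if costs[best] < current:
                choices[i] = best
                history.append(rosenthal_potential(choices, delay, n_links))
                improved = True
    return choices, history

SLOPES = [1, 2, 3]  # hypothetical per-user delay on each of three parallel links

def linear_delay(link, load):
    return SLOPES[link] * load

final_choices, potential_history = best_response_dynamics(
    [0] * 6, linear_delay, len(SLOPES))
print(final_choices)      # [1, 2, 0, 0, 0, 0]
print(potential_history)  # [21, 17, 15]
```

Starting with all six players on the first link, two players move away and the potential falls monotonically until no unilateral move helps.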

Broadly speaking, there are two popular models of routing games: nonatomic selfish routing, in which there is a very large number of players each controlling a negligible fraction of the overall traffic, and atomic selfish routing, in which each player controls a non-negligible amount of traffic. Nonatomic selfish routing was first studied for transportation networks by Wardrop, and an equilibrium in such games is known as a Wardrop equilibrium. It has been shown that for nonatomic selfish routing, the price of anarchy is the same as the price of stability. Nonatomic selfish routing has been applied to routing in communication networks, where it is relevant both to the ‘source routing’ paradigm, in which the source node specifies a complete route for its traffic, and to distributed settings [44]. The paradigm of distributed shortest-path routing, which is typically used on Internet-like networks, cannot be addressed by selfish routing unless the ‘length’ used to define the shortest paths coincides with the edge cost functions [44]. Atomic selfish routing games were first considered by Rosenthal in 1973, who also introduced the concepts of congestion games and potential games. The price of anarchy is also well understood for atomic selfish routing games [44].

Interested readers are referred to a detailed survey of game-theoretic methodologies for routing models in [45], to details about routing games and the analysis of the efficiency of their equilibria in [44], and to a survey of applications of various networking games in telecommunications in [60].

IV. LEARNING TECHNIQUES

Learning is especially crucial when dealing with unknown or unplanned scenarios and is especially relevant to CRNs [4]. Learning, for the purpose of our discussion, will focus on computational processes employed by CRs that can improve their behavior through diligent study of their own interactions with the environment. Learning can also be envisioned from the perspective of search. In this context, we can envision learning as searching through a space of possible hypotheses to determine which hypothesis best fits the available training examples, prior knowledge, and constraints [15].

In the remainder of this section, we will discuss hidden Markov models, reinforcement learning, learning with game theory, online learning algorithms, neural networks, evolutionary algorithms, support vector machines, and finally methods of Bayesian inference.

A. HIDDEN MARKOV MODELS

Hidden Markov models (HMMs) are stochastic models of great utility, especially in domains where we wish to analyze temporal or dynamic processes, such as speech recognition or PU arrival patterns in CRNs. HMMs are highly relevant to CRNs since many environmental parameters in CRNs are not directly observable.

An HMM-based approach can analytically model Markovian stochastic processes whose actual states are hidden, but which emit observations from each state according to some probability distribution. It is for this reason that an HMM is defined to be a doubly stochastic process: first, the underlying stochastic process that is not observable, and second, the set of stochastic processes, dependent on the embedded underlying stochastic process, that produce the sequence of observed symbols [61].

Intuitively, HMMs can be visualized as a Markov chain observed in noise [62]. In a simple Markov model like a Markov chain, the state is directly visible to the observer, and the model is completely specified by the state transition probabilities. In an HMM, on the other hand, a more elaborate model is needed. The relationship of HMMs with other Markov models is depicted in figure 1.

To represent an HMM, we use the notation $\lambda = (A, B, \pi)$, where $A$, $B$ and $\pi$ are three probability distributions: $A$ is the state transition probability distribution, $B$ is the observation symbol probability distribution for the various states [61], and $\pi$ is the initial state distribution. Specifying an HMM completely requires, in addition to $A$, $B$ and $\pi$, the number of states $N$ and the number of discrete output symbols $M$.

1) Key problems in HMMs: Having defined the notation for HMMs above, we can talk about the three key problems that must be solved for the HMM to be useful in real-world applications [2] [61]. The listing of these three key problems below assumes an observation sequence $O = O_1, O_2, O_3, \ldots, O_T$.

• Evaluation Problem: Given the parameters of the model $\lambda$, this problem deals with how to compute the probability of a particular observation sequence, $\Pr(O \mid \lambda)$. The forward algorithm, the backward algorithm, and the forward-backward algorithm solve this problem.⁹

• Decoding Problem: Given the observation sequence $O$ and the parameters of the model $\lambda$, this problem deals with decoding, or inferring, the sequence of hidden states $I = i_1, i_2, i_3, \ldots, i_T$ that most likely produced the observation sequence. This task aims at uncovering the hidden part of the HMM and is essentially an estimation problem. The Viterbi algorithm solves this problem by providing the most likely sequence and its probability.

• Learning Problem: Given an observation sequence $O$, this problem deals with learning the most appropriate model $\lambda = (A, B, \pi)$ that ‘best’ explains the observed sequence. In other words, we have to learn the most likely set of state transition probabilities $A$ and observation symbol probabilities $B$ from the training data. For many applications, this is the most important task since it allows us to optimally adapt model parameters to the training data. The Baum-Welch expectation-maximization algorithm solves this problem. The learning problem in HMMs is intuitively related to the evaluation problem in the following way. The evaluation problem computes $\Pr(O \mid \lambda)$, the probability of a particular observation sequence given a model. $\Pr(O \mid \lambda)$ is also the likelihood function for $\lambda$ given the observations $O$. The learning problem is to determine the HMM parameters $\lambda$ that maximize this likelihood function. The Baum-Welch algorithm is an iterative algorithm that solves the learning problem by expectation-maximization to produce maximum likelihood, or maximum a posteriori, estimates of the HMM parameters given only observation sequences as training data.
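The evaluation and decoding problems listed above can be sketched in a few lines for a discrete HMM. This is a minimal illustration without the numerical scaling a practical implementation would need; the two-state model (state 0 = channel idle, state 1 = PU present; symbol 0 = low sensed energy, symbol 1 = high sensed energy) and all its probabilities are hypothetical.

```python
def forward(obs, A, B, pi):
    """Evaluation problem: return Pr(O | lambda) using the forward algorithm."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

def viterbi(obs, A, B, pi):
    """Decoding problem: return (most likely state sequence, its joint probability)."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    for o in obs[1:]:
        # For each state j, remember the best predecessor state.
        prev = [max(range(n), key=lambda i, j=j: delta[i] * A[i][j])
                for j in range(n)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][o] for j in range(n)]
        backpointers.append(prev)
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for prev in reversed(backpointers):
        path.append(prev[path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical model parameters lambda = (A, B, pi).
A = [[0.8, 0.2], [0.3, 0.7]]   # state transition probabilities
B = [[0.9, 0.1], [0.2, 0.8]]   # observation symbol probabilities per state
pi = [0.5, 0.5]                # initial state distribution
obs = [0, 0, 1, 1]             # observed symbols

likelihood = forward(obs, A, B, pi)
path, path_prob = viterbi(obs, A, B, pi)
print(likelihood, path, path_prob)
```

The learning problem (Baum-Welch) reuses the same forward recursion together with a backward pass to re-estimate `A`, `B` and `pi`, and is omitted here for brevity.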

We have already noted that the HMM is a strong generic temporal model for dynamic signals and systems. To home in on the important problem of inference in such temporal models, we note that there are four basic inference tasks that may be performed with HMMs [14]. (We use the notation $I_t$ and $O_t$ to indicate, respectively, the hidden state and the observation during time step $t$. It is assumed that observations $O_0, O_1, \ldots, O_{t-1}$ have been observed to date.)

a) Filtering or Monitoring: This is the task of computing the posterior distribution over the current state, given all evidence to date. Mathematically, this is calculating $P(I_{t-1} \mid O_0, \ldots, O_{t-1})$.

b) Prediction: This is the task of computing the posterior distribution over the future state, given all evidence to date. Mathematically, this is calculating $P(I_t \mid O_0, \ldots, O_{t-1})$.

c) Smoothing or Hindsight: This is the task of computing the posterior distribution over past states, given all evidence up to the present. Mathematically, this is calculating $P(I_k \mid O_0, \ldots, O_{t-1})$ for $0 \le k \le t-1$.

d) Most Likely Explanation: This is the task mentioned earlier as the decoding task. The aim is to find the most likely sequence of states that generated the observed sequence. Mathematically, this is $\arg\max_{I_{1:t}} \Pr(I_{1:t} \mid O_{1:t})$.

⁹ While the forward-backward algorithm solves the evaluation problem (i.e., it can estimate the most likely state for any point in time), it cannot solve the decoding problem (of finding the most likely sequence of states), for which the Viterbi algorithm is used.
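Filtering and one-step prediction can be sketched as a normalized forward recursion. The code below is a minimal illustration using a hypothetical two-state model (state 0 = channel idle, state 1 = PU present); the function and parameter names are assumptions introduced for this example.

```python
def filter_and_predict(obs, A, B, pi):
    """Return (filtered, predicted) posteriors for a discrete HMM.

    filtered  = P(current state | all observations so far)
    predicted = P(next state    | all observations so far)
    """
    n = len(pi)
    belief = list(pi)  # prior over the initial state
    for k, o in enumerate(obs):
        if k > 0:
            # Time update: push the belief through the transition model.
            belief = [sum(belief[i] * A[i][j] for i in range(n)) for j in range(n)]
        # Measurement update: weight by the observation likelihood, then normalize.
        unnormalized = [belief[j] * B[j][o] for j in range(n)]
        total = sum(unnormalized)
        belief = [u / total for u in unnormalized]
    predicted = [sum(belief[i] * A[i][j] for i in range(n)) for j in range(n)]
    return belief, predicted

# Hypothetical parameters: rows of A and B are indexed by hidden state.
A = [[0.8, 0.2], [0.3, 0.7]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]

filtered, predicted = filter_and_predict([1], A, B, pi)
print(filtered)   # approximately [0.111, 0.889]
print(predicted)  # approximately [0.356, 0.644]
```

In a CRN setting, `filtered` would correspond to the current belief about channel occupancy and `predicted` to the belief about occupancy in the next slot, which is the kind of quantity spectrum prediction relies on.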

2) Applying HMMs in CRNs: All the inference tasks listed above are potentially very useful for CRNs. HMMs have been extensively used in CRNs for a wide range of problems, including spectrum prediction, PU detection, and signal classification [2]. A potential drawback of HMMs is that a training sequence is needed, and the training process can be computationally complex. Other AI techniques such as GA have been used to improve the efficiency of model training [63]. We will further discuss the usage of HMMs in section VI-B, where we will outline how HMMs have been, or can be, used for solving certain modeling, planning and prediction tasks that relate to cognitive routing in CRNs.

B. REINFORCEMENT LEARNING

In reinforcement learning (RL), an agent aims to determine a sequence of actions, or policy, which maps the states of an unknown stochastic environment to an optimal action plan. We note here that MDPs, on the other hand, address this planning problem for known stochastic environments. Since RL agents work in a stochastic environment, they have to balance two potentially conflicting considerations: on the one hand, an agent needs to explore the feasible actions and their consequences (to ensure that it does not get stuck in a rut), while on the other hand, it needs to exploit the knowledge, attained through past experience, of favorable actions which received the most positive reinforcement.

RL is distinct from supervised learning in that instead of being presented with training examples of how to select the correct output for an input, the system has to learn indirectly from reinforcements (called reward for positive reinforcement and punishment for negative reinforcement) on actions taken. Since reinforcement learning can be used without training data, and because it aims to maximize the long-term online performance, it is particularly suitable for CRNs. Reinforcement learning is also distinct from supervised and unsupervised learning in that it focuses on online performance (learning through taking actions) rather than on planning and offline performance. Since it programs agents by reward and punishment without needing to specify how the task is to be achieved, and due to its broad applicability, the RL framework is of profound interest to many diverse fields.

We note here that reinforcement learning is also known by alternate monikers such as neuro-dynamic programming (NDP) [64] and adaptive (or approximate) dynamic programming (ADP) [65] [25].

1) Relationship with MDPs: An interesting way to conceptualize RL is to think of it as a simulation-based technique for solving large-scale and complex MDPs. We refer to section III-A for an earlier discussion on the relationship between MDPs and RL. We also discussed in section III-A that classical DP techniques are ineffective at solving large-scale complex MDPs [25] [66]. Practical RL algorithms that can deal with large-scale complex MDPs (having large state and action spaces) essentially bank upon two key ideas: firstly, to use samples to compactly represent the dynamics of the control problem, and secondly, to use powerful function approximation methods, including bootstrap methods that build estimates on other estimates, to compactly represent value functions [24] [66]. It has been stated that understanding the interplay between dynamic programming, samples and function approximation is at the heart of the design, analysis and application of modern RL algorithms [66].

Crucially, RL can solve MDPs without explicit specification of the transition probabilities. These values are needed by the classical dynamic programming solutions of value iteration and policy iteration. In RL, instead of being specified explicitly, the transition probabilities can be envisioned as being accessed through a simulator that is typically restarted many times from a uniformly random initial state [67]. In addition, RL can work with a very large number of states when used along with function approximation [67].

2) Categories of RL algorithms: Most RL algorithms can be classified as being either model-free or model-based [22]. A model, intuitively, is an abstraction that an agent can use to predict how the environment will respond to its actions: i.e., given a state and the action performed therein by the agent, a model can predict the (expected) resultant next state and the accompanying reward. We will be mostly interested in stochastic models, which can predict probabilistically the possible next states and rewards given the current state and action.

In the model-based approach, the agent builds a model of the environment through interaction with it, typically in the form of an MDP, analogous to the approach taken in adaptive control [68]. With a model in hand, given a state and action, the resultant next state and next reward can be predicted. This allows planning, through which a future course of action can be contemplated by considering possible future situations before they are actually experienced. Based on the MDP model in the model-based approach, a planning problem is solved to find the optimal policy function with techniques from the related field of dynamic programming [14] [22].¹⁰ Commonly used algorithms for solving MDPs include the celebrated dynamic programming algorithms of value iteration [23] and policy iteration [69].
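Once a model is available, the planning step can be sketched with value iteration. The example below is a minimal illustration on a hypothetical two-state channel model (the states, actions, probabilities and rewards are all assumptions made up for this sketch); each entry of `P[state][action]` is a list of `(probability, next_state, reward)` outcomes.

```python
def q_value(P, V, state, action, gamma):
    """Expected one-step reward plus discounted value of the successor state."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[state][action])

def value_iteration(P, gamma=0.9, tol=1e-10):
    """Return the optimal value function and a greedy policy for a known MDP."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: max(q_value(P, V, s, a, gamma) for a in P[s]) for s in P}
        change = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if change < tol:
            break
    policy = {s: max(P[s], key=lambda a, s=s: q_value(P, V, s, a, gamma))
              for s in P}
    return V, policy

# Hypothetical learned model: "stay" on the current channel or "switch".
P = {
    "idle": {
        "stay":   [(0.9, "idle", 1.0), (0.1, "busy", 1.0)],
        "switch": [(0.5, "idle", 0.5), (0.5, "busy", 0.5)],
    },
    "busy": {
        "stay":   [(0.3, "idle", 0.0), (0.7, "busy", 0.0)],
        "switch": [(0.6, "idle", -0.2), (0.4, "busy", -0.2)],
    },
}

V, policy = value_iteration(P)
print(policy)  # {'idle': 'stay', 'busy': 'switch'}
print(V)
```

For this assumed model, paying the small switching penalty in the busy state is worthwhile because it raises the chance of reaching the rewarding idle state.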

In the model-free approach, on the other hand, the agent aims to directly determine the optimal policy by mapping environmental states to actions without constructing an MDP model of the environment. Early RL systems were explicitly trial-and-error learners and were generally devoid of planning. Popular model-free RL techniques include temporal difference (TD) learning (in which a guess is updated on the basis of another guess) and Q-learning [22]. Modern reinforcement learning spans the whole gamut of approaches from low-level, trial-and-error learning to high-level, deliberative planning [22].

¹⁰ The term dynamic programming was originally used in the 1940s by Richard Bellman to describe the mathematical theory of multi-stage decision processes in which one needs to make the best decision one stage after another. The term ‘dynamic’ in ‘dynamic programming’ refers to the temporal aspect of multi-stage decision making, while ‘programming’ refers to optimization.

RL tasks can also be categorized into two types depending on whether the decision making tasks are sequential or not. In non-sequential tasks, the expected immediate payoff is what matters, and the objective is to learn a mapping from situations to actions that maximizes the expected immediate payoff. Such learning has been studied extensively in the field of learning automata. In sequential tasks, the objective is to maximize the expected long-term payoff. Sequential tasks are considered more difficult since the chosen action may influence the future trajectory of situations and payoffs. Such learning has been the subject of fields such as dynamic programming.

3) Major reinforcement learning techniques: It is noted that RL is best understood as a class of learning problems rather than as a fixed set of algorithms or techniques. Indeed, there is great diversity in the approaches taken by different RL algorithms and techniques.

We can broadly categorize RL techniques into two main categories: value iteration and policy iteration techniques. In value iterating learning techniques, the optimal policy is calculated on the basis of the optimal value function, calculated as described in section III-A1. In policy iterating learning techniques, on the other hand, the learning is directly in the policy space, as described earlier in section III-A1. We will present representative techniques that belong to these two categories next. In particular, we will discuss Q-learning as an example of a value-iterating model-free technique, and will then discuss learning automata as an example of a policy-iterating technique.

Q-LEARNING:

Q-learning, proposed by Watkins in 1992 [70], is a popular value-iteration model-free technique with limited computational requirements that enables agents to learn how to act optimally in controlled Markovian domains. The implication of being model-free is that Q-learning does not explicitly model the reward and transition probabilities of the underlying process. Q-learning proceeds instead by estimating the value of an action from experienced outcomes, using an idea known as temporal-difference (TD) learning.

The TD learning idea has been referred to as the central key idea in the theory of RL. TD learning combines ideas from Monte Carlo (MC) methods and dynamic programming (DP). Like MC methods, the TD method is a simulation-based model-free method that can learn directly from raw experience without a model of the environment’s dynamics. Like dynamic programming, the TD method uses bootstrapping to update estimates based in part on other learned estimates. The concepts of TD, DP and MC are central recurring themes in the RL literature.

Q-learning proceeds by incrementally improving its estimates of the Q-values, which capture the quality of particular actions at particular states. The evaluation of a state-action pair, or the Q-value, is done by learning the Q-function, which gives the expected utility of taking a given action in a given state and following the optimal policy thereafter. The Q-function is defined as follows:

\[
Q(s, a) = \sum_{s'} P_a(s, s') \bigl( R_a(s, s') + \gamma V(s') \bigr) \tag{6}
\]

The array $Q$ is updated directly with experience in the following way. The core of the update rule below is based on value iteration (discussed earlier in section III-A1). Here $R_{t+1}$ is the reward observed after performing action $a_t$ in state $s_t$, and $\alpha_t(s, a)$ ($0 < \alpha \le 1$) is the learning rate (which may be the same for all pairs). The discount factor $\gamma$ ($0 \le \gamma \le 1$) trades off the importance of sooner versus later rewards. The Q-function estimate is refined in every learning step, and a new policy is generated on its basis which drives the next action to execute.

\[
Q_{t+1}(s_t, a_t) =
\underbrace{\bigl(1 - \alpha_t(s_t, a_t)\bigr)}_{\text{inverse learning rate}}
\times \underbrace{Q_t(s_t, a_t)}_{\text{old value}}
+ \underbrace{\alpha_t(s_t, a_t)}_{\text{learning rate}}
\times \underbrace{\bigl(R_{t+1} + \gamma \max_{a} Q_t(s_{t+1}, a)\bigr)}_{\text{learned value}}
\tag{7}
\]
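The update in equation (7) translates almost line by line into a tabular implementation. The sketch below is a minimal illustration on a hypothetical five-state corridor in which only reaching the rightmost state is rewarded; the environment, the constant learning rate, and the purely random behavior policy (`epsilon=1.0`) are assumptions made for this example.

```python
import random

N_STATES = 5        # states 0..4; state 4 is the terminal goal
ACTIONS = (-1, +1)  # move left or move right

def step(state, action):
    """Hypothetical corridor: reward 1 only on reaching the goal state."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=1.0, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behavior policy: explore with probability epsilon, else act greedily.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
            s2, r, done = step(s, a)
            # Equation (7): blend the old value with the learned value.
            # Q of the terminal state is never updated, so it stays 0.
            learned = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * learned
            s = s2
    return Q

Q = q_learning()
greedy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)]
print(greedy)  # [1, 1, 1, 1]
```

Because Q-learning is off-policy, the greedy policy read from the table moves right in every state even though the agent behaved randomly while learning.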

Q-learning in its simplest setting stores data in tables. This quickly becomes impractical for complex systems. In such cases, Q-learning can be combined with function approximation: in particular, (adapted) ANNs have been proposed for function approximation in large-scale RL problems [71].

Q-learning does not systematically handle the tradeoff between exploration and exploitation, relying instead on heuristic exploration. Fortunately, it has been shown that Q-learning does eventually find the optimal value of an action (the proof relies on infinitely many observations for every action and state [70]). The Markovian environment of MDPs is crucial for guaranteed convergence, and the convergence guarantee is lost if this assumption is not valid.

In its basic setting, Q-learning is intended for single-agent environments, although multi-agent Q-learning, also known as Q-learning with games, has also been proposed recently. Multi-agent learning is especially challenging since it operates in non-Markovian environments (as the outcome of an action no longer depends only on the current state and the agent’s own action). As such, the convergence guarantees of MDPs do not extend to multi-agent RL environments due to their non-Markovian nature.

Application of Q-learning to routing and CRNs: Boyan et al. showed in 1994 that routing packets through a communication network is a natural application for RL algorithms [72]. Their ‘Q-routing’ algorithm learns a routing policy that minimizes total delivery time by experimenting with different routing policies. The presented RL-based algorithm has the desirable features that: i) its learning is continual and online, ii) it uses local information only, and iii) it is robust in the face of dynamic network conditions. Q-learning is perhaps the most popular model-free reinforcement learning technique, and it has been applied to CRNs extensively [12]. We refer the interested reader to a survey paper for more details and references [73].
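The core of Q-routing is a local update in which a node revises its delivery-time estimate using the estimate reported by the neighbor it just used. The sketch below follows the commonly described form of this update; the table layout `Q[node][destination][neighbor]`, the function names, and the numbers are hypothetical choices for illustration and are not taken from [72].

```python
def q_routing_update(Q, x, y, dest, queue_time, tx_time, eta=0.5):
    """Node x sent a packet for dest via neighbor y and updates its estimate.

    queue_time: time the packet waited in x's queue
    tx_time:    transmission time from x to y
    """
    # Neighbor y reports its own best estimate of the remaining delivery time.
    remaining = 0.0 if y == dest else min(Q[y][dest].values())
    old = Q[x][dest][y]
    Q[x][dest][y] = old + eta * (queue_time + tx_time + remaining - old)
    return Q[x][dest][y]

def choose_next_hop(Q, x, dest):
    """Greedy routing decision: the neighbor with the lowest estimated delivery time."""
    return min(Q[x][dest], key=Q[x][dest].get)

# Hypothetical topology: A reaches D through B or C.
Q = {
    "A": {"D": {"B": 5.0, "C": 5.0}},
    "B": {"D": {"D": 1.0}},
    "C": {"D": {"D": 4.0}},
}

via_b = q_routing_update(Q, "A", "B", "D", queue_time=0.5, tx_time=1.0)
via_c = q_routing_update(Q, "A", "C", "D", queue_time=0.5, tx_time=1.0)
print(via_b, via_c, choose_next_hop(Q, "A", "D"))  # 3.75 5.25 B
```

Only information exchanged between neighbors is used, which reflects the local-information property noted above.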


LEARNING AUTOMATA:

Learning automata (LA) is an AI technique that subscribes to the policy iteration paradigm of RL [74] [75] [76]. In contrast to other RL techniques, policy iterators operate by directly manipulating the policy $\pi$. Evolutionary algorithms are another example of policy iterators.¹¹

A learning automaton is a finite state machine that interacts with a stochastic environment and attempts to learn the optimal action (the one that has the maximum probability of being rewarded) offered by the environment, so that it can ultimately choose this action more frequently than other actions. Since wireless networks operate in dynamic time-varying environments with possibly unknown characteristics (e.g., variable link qualities, dynamic topologies, changing traffic patterns, etc.), the application of LA techniques for building adaptive protocols in such networks is particularly appealing. In this regard, LA has been used in the design of wireless MAC, routing and transport-layer protocols [74].

We will now present some example LA-based routing protocols. Torkestani et al. have proposed using LA for multicast routing in mobile ad-hoc networks, or MANETs,¹² to find routes with higher expected lifetimes through prediction of node mobility [75]. Another LA-based distributed broadcast solution is presented in [76].
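To make the idea of directly manipulating the policy concrete, the sketch below implements one well-known learning automaton scheme, linear reward-inaction, in which the action probability vector is nudged toward an action whenever that action is rewarded and left unchanged otherwise. The choice of this particular scheme, the reward probabilities, and the step size are assumptions for illustration; the protocols in [75] [76] may use different update rules.

```python
import random

def lri_update(probs, chosen, rewarded, a=0.1):
    """Linear reward-inaction: reinforce the chosen action on reward only."""
    if not rewarded:
        return list(probs)  # inaction on penalty
    return [p + a * (1.0 - p) if i == chosen else (1.0 - a) * p
            for i, p in enumerate(probs)]

def run_automaton(reward_probs, steps=2000, a=0.05, seed=1):
    """Interact with a stationary random environment and return the final policy."""
    rng = random.Random(seed)
    probs = [1.0 / len(reward_probs)] * len(reward_probs)
    for _ in range(steps):
        chosen = rng.choices(range(len(probs)), weights=probs)[0]
        rewarded = rng.random() < reward_probs[chosen]
        probs = lri_update(probs, chosen, rewarded, a)
    return probs

# Hypothetical environment: three candidate next hops with different
# probabilities of successful delivery.
final_probs = run_automaton([0.2, 0.8, 0.5])
print(final_probs)
```

With a small step size the automaton typically concentrates its probability on the action that is rewarded most often, although this scheme can occasionally lock onto a suboptimal action.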

4) Central issues in reinforcement learning: The following have been highlighted as pressing issues in RL research [77]: trading off exploration and exploitation, learning from delayed reinforcement, making use of generalization, dealing with multiple-agent reinforcement learning, constructing empirical models to accelerate learning, and coping with hidden state. Of these, the issue of exploration and exploitation and that of multi-agent reinforcement learning are most relevant to our work, and we discuss them next.

Issue of exploration and exploitation: Exploitation entails favoring immediate payoff, while exploration requires tolerating the momentary regret of not using the best currently known policy in exchange for the opportunity to gain information about better policies. It should be apparent after some reflection that neither exploration nor exploitation can be pursued exclusively without failing at the task of selecting the optimal action. The tension between exploitation and exploration is typified in the so-called multi-armed bandit problems. The $k$-armed bandit problem is the simplest possible RL problem [77] and represents an MDP with a single state (see figure 1) in which $k$ actions are available. The problem is called a $k$-armed bandit in a metaphorical reference to the predicament of a gambler in a casino who must select from $k$ slot machines, each colloquially called a one-armed bandit. Interestingly, the conflict between delayed and immediate gratification is a dilemma that is not unique to RL; it can also be experienced in our own humanness.¹³ Fortunately, a method was devised by Gittins in 1979 for optimally solving the exploration and exploitation tradeoff for the simple case of the $k$-armed bandit problem [78], assuming a discounted expected reward criterion. This method entails assigning a dynamic ‘allocation index’ to each action at each step in $k$-armed bandit problems. Gittins showed that choosing the action with the largest index value is guaranteed to lead to the optimal balance between exploration and exploitation [78].

¹¹ We shall discuss in section IV-F how evolutionary algorithms share certain attributes with RL: e.g., both depend on exploration and exploitation.

¹² MANETs share an important characteristic with CRNs in that both of them have highly dynamic topology. The dynamically changing topology in MANETs is due to node mobility, while in CRNs it is due to PU arrivals.

For the general case of MDPs, finding the optimal balance between exploration and exploitation is known to be an intractable problem [22]. Therefore, a lot of interest has focused on the development of heuristic or approximate methods to handle the tradeoff between exploration and exploitation. To manage the exploration and exploitation dilemma, the $\epsilon$-greedy strategy is to select the greedy action (the one that exploits prior knowledge and provides the best value) all but $\epsilon$ of the time, and to select an action randomly for the remaining $\epsilon$ of the time. The value $\epsilon$ ranges between 0 and 1, and it is possible to change this value over time. Intuitively, it would be prudent for an agent to be more of an explorer initially (by having a higher $\epsilon$) since it has no knowledge to exploit. With passing time, as good states and actions are learnt, the agent can benefit more by being an exploiter and taking the greedy approach (with a smaller $\epsilon$), which chooses good actions more often. It also makes intuitive sense that, during exploration, the choice of actions should not be completely random but should be based on some estimate of their potential value. In this regard, a soft-max action selection technique can be used, which uses the Gibbs or Boltzmann distribution for selecting the action to explore, so that the probability of selecting an action increases with its perceived value (e.g., its Q-value). We note here that the question of exploration vs. exploitation is central not only to reinforcement learning, but also to genetic algorithms, and to evolutionary algorithms in general [79].
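Both selection rules are short enough to sketch directly. The code below is a minimal illustration: the decaying schedule for epsilon, the temperature parameter of the Boltzmann distribution, and the Bernoulli bandit used to exercise the rules are assumptions introduced for this example.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def softmax_probs(q_values, temperature=1.0):
    """Boltzmann distribution: probability proportional to exp(Q / temperature)."""
    top = max(q_values)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_action(q_values, temperature, rng):
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs)[0]

def run_bandit(true_means, steps=1000, seed=0):
    """k-armed Bernoulli bandit played with a decaying epsilon-greedy rule."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for t in range(steps):
        epsilon = 1.0 / math.sqrt(t + 1)  # explore a lot early, less later
        arm = epsilon_greedy(estimates, epsilon, rng)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
    return estimates, counts

estimates, counts = run_bandit([0.2, 0.5, 0.8])
print(estimates, counts)
print(softmax_probs([0.0, 1.0, 2.0], temperature=0.5))
```

A high temperature makes the soft-max choice nearly uniform (more exploration), while a low temperature makes it nearly greedy, which mirrors the role that a large or small epsilon plays in the epsilon-greedy rule.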

Multi-agent Reinforcement Learning (MARL): MARL problems are more challenging than single-agent RL problems, mainly because the Markov property does not hold in such environments: an agent’s reinforcement depends not only on its current state but also on the actions taken by the other agents. Accordingly, convergence guarantees that apply to MDP RL tasks do not extend to such non-Markovian MARL settings. Learning automata based tools have been quite popular in MARL environments. A detailed survey of multi-agent reinforcement learning algorithms is presented in [80].

5) Application of RL to routing and CRNs: The authors of [42] present the benefits and potential drawbacks of using RL in CRNs. The main benefits listed are adaptivity, network awareness, and ease of distributed implementation, while the main drawback is slow convergence (although it is pointed out that convergence is not a main goal in CRNs since the environment is not stationary in any case). This paper [42] also surveys the existing RL schemes in the context of ad-hoc CRNs, and proposes modifications from the viewpoint

13 The mathematician Peter Whittle has said that “bandit problems embody in essential form a conflict evident in all human action: information versus immediate payoff.”

of routing and link-layer spectrum-aware operations. Another paper [73] presents a detailed survey of applications of reinforcement learning to routing in distributed wireless networks. Interested readers are referred to these papers, and the references therein, for a more exhaustive treatment of RL applications for routing in wireless networks in general. Applications of reinforcement learning to CRNs in general are explored in [81], while RL techniques for context awareness and intelligence in wireless networks are reviewed in [82]. Since CRs often have to work in unknown environments, RL seems to be a promising solution to the various learning problems in CRNs, and it looks set to become a popular tool for future CRN designers.

C. LEARNING WITH GAME THEORY

While game theory is essentially concerned with the decisions made by individuals in their interactions with other decision makers and their environment, researchers have long recognized the need to guide future decisions from the history of past experience. There is a lot of work on the important relationship between game theory and learning [83]. A branch of game theory known as ‘learning game theory’ studies the dynamics of individuals who repeatedly play a game, and adjust their behavior over time as a result of their experience (through, e.g., reinforcement, imitation, or belief updating) [84].

It is worth highlighting the work that has been done in identifying the similarities between inference and learning in the fields of machine learning and game theory [85]. In the field of game theory, learning is taken to mean inference of the correct strategy to play against an opponent within a dynamic game (repeated game, stochastic game, or evolutionary game). Some of the models that have been used for learning in game theory include reinforcement learning, learning by imitation, myopic response, fictitious play, and rational learning [84]. As examples, we discuss fictitious play, and Q-learning with game theory.

Fictitious play: The main idea in fictitious play is that each player chooses its best strategy in each period, based on the strategy that each opponent is predicted to choose in that period, so as to maximize its expected payoff.
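As a minimal sketch of this idea, the Python fragment below assumes the standard formulation in which each player predicts the opponent's strategy from the empirical frequency of the opponent's past actions. The two-player setting, the payoff matrices, and all function names are hypothetical and serve only as illustration.

```python
def best_response(payoff, opponent_counts):
    """Best action against the opponent's empirical mix of past actions.

    payoff[a][b] is this player's payoff when it plays a and the opponent plays b.
    """
    total = sum(opponent_counts)
    beliefs = [c / total for c in opponent_counts]
    expected = [sum(p * u for p, u in zip(beliefs, row)) for row in payoff]
    return max(range(len(expected)), key=lambda a: expected[a])

def fictitious_play(payoff_a, payoff_b, rounds=200):
    """Both players repeatedly best-respond to the observed history of the other."""
    counts_of_b = [1] * len(payoff_a[0])  # A's record of B's actions (uniform start)
    counts_of_a = [1] * len(payoff_b[0])  # B's record of A's actions (uniform start)
    history = []
    for _ in range(rounds):
        a = best_response(payoff_a, counts_of_b)
        b = best_response(payoff_b, counts_of_a)
        counts_of_b[b] += 1
        counts_of_a[a] += 1
        history.append((a, b))
    return history

if __name__ == "__main__":
    # Hypothetical two-channel game: a collision (same channel) pays 0 to both users.
    payoff_a = [[0, 3], [1, 0]]  # user A earns more on channel 0
    payoff_b = [[0, 1], [2, 0]]  # user B earns more on channel 1
    print(fictitious_play(payoff_a, payoff_b, rounds=5))
```

In this toy game the two users settle on different channels from the first round. In symmetric games the same procedure can cycle, so convergence should not be assumed in general.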

Q-learning with game theory: Although Q-learning in its basic form is used in a single-agent RL setting, it has been extended to produce the Nash Q-learning algorithm for the multi-agent RL setting based on the concept of stochastic games [86]. In this multi-agent Q-learning algorithm, the Q-value is updated with the future payoff so that each agent can observe and estimate the payoff for using a particular strategy (not only for itself but also for the other players).

A detailed survey of strategic learning in CRNs, and various spectrum access games, is presented in [48].

D. ONLINE LEARNING ALGORITHMS

Online learning algorithms address the task of online sequential decision-making under partial information. An example problem which can be addressed through online learning techniques is determining what route to use to drive to work every day in an uncertain environment where the congestion pattern on the various paths is both stochastic and unknown [87]. The basic setting is that we have a space of N actions, from which the algorithm chooses an action (in our example, selecting the route to take) one time step after the other. The environment then makes its ‘move’ (in our example, by setting the path congestions for that time step). The algorithm then incurs the ‘loss’ for its chosen action (in our example, how long the route took). Online learning algorithms aim to perform well in such tasks of repeated decision making. While this example (which is from [87]) relates to routing in a transportation network, it is analogous and directly extensible to the problem of routing in a CRN.

A key technique for analyzing the performance of online learning algorithms is regret analysis. This captures our sentiment that we want our sophisticated online algorithm (which may be choosing different actions at different times) to be at least as good as some simple fixed alternative policy λ that sticks with just one action for all decisions; this minimizes our regret of not choosing the alternative policy λ. More formally, this regret is defined to be the difference between the loss of our learning algorithm and the loss incurred using the alternative policy λ. This regret is more properly called external regret when the alternative policy is a static policy (i.e., a policy of performing the same action in all time steps). External regret gives us a general methodology for developing online algorithms whose performance is comparable to that of an optimal static online algorithm. Stronger notions of regret include internal or swap regret, which allow comparison with online action sequences in which every occurrence of a given action i is replaced by an alternative action j [87]. There has been a lot of work by the learning theory and the game theory communities in this area, and online learning algorithms have been shown to have strong performance guarantees [87], with decision-making algorithms (such as the weighted majority algorithm [88]) available that approach zero regret even against a fully adaptive adversary.
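To make the notion of external regret concrete, the following sketch runs a randomized, exponentially weighted variant of the weighted majority idea (often called Hedge) and compares its expected loss with that of the best fixed action in hindsight. It assumes, for simplicity, that losses lie in [0, 1] and that the loss of every action is revealed after each step; the function name and the learning rate are our own choices.

```python
import math
import random

def hedge(loss_rounds, eta):
    """Exponentially weighted action selection over a sequence of loss vectors.

    loss_rounds[t][i] in [0, 1] is the loss of action i at step t (assumed to be
    fully observed after the step). Returns the expected cumulative loss of the
    algorithm and the cumulative loss of the best fixed action in hindsight.
    """
    n = len(loss_rounds[0])
    weights = [1.0] * n
    cumulative = [0.0] * n
    algorithm_loss = 0.0
    for losses in loss_rounds:
        total = sum(weights)
        probs = [w / total for w in weights]  # current randomized choice
        algorithm_loss += sum(p * l for p, l in zip(probs, losses))
        for i, l in enumerate(losses):
            cumulative[i] += l
            weights[i] *= math.exp(-eta * l)  # penalize actions that performed badly
    return algorithm_loss, min(cumulative)

if __name__ == "__main__":
    rng = random.Random(0)
    steps, routes = 2000, 3
    # Hypothetical normalized route delays; route 0 is the best on average.
    rounds = [[rng.random() * scale for scale in (0.4, 0.8, 1.0)] for _ in range(steps)]
    eta = math.sqrt(8 * math.log(routes) / steps)
    algorithm_loss, best_fixed_loss = hedge(rounds, eta)
    print("external regret:", algorithm_loss - best_fixed_loss)
```

With the learning rate used in the demo, the external regret of this scheme is known to grow no faster than the square root of the number of steps, so the regret per step tends to zero.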

1) Online learning algorithms in CRNs: Han et al. proposed using the solution concept of correlated equilibrium for opportunistic spectrum access in CRNs, by means of a distributed no-regret learning algorithm. Their work showed that the correlated equilibrium based solution returns fairer results with better performance [89].

2) Online learning algorithms for routing: Awerbuch et al. have formulated the problem of determining a sequence of routing paths in a network with unknown link delays varying unpredictably over time [90] as a generalization of the online multi-armed bandit problem. The sequential decision-making under partial information in this multi-armed bandit problem is handled through the framework of a repeated game with two players (algorithm and adversary) interacting over time. They have proposed two randomized online algorithms as a solution to this problem. Avramopoulos et al. have proposed using online learning algorithms as a framework for adding adaptivity to routing decisions in realistic Internet-like environments [91]. In another work, Bhorkar et al. have presented a no-regret routing algorithm for wireless ad-hoc networks [92].

E. NEURAL NETWORKS

Artificial neural networks (ANN or simply NN) are composed of artificial ‘neurons’ interconnected in a programming structure that aims to mimic the neural processing (organization and learning) of biological neurons and their behavior [93]. More specifically, NNs involve a network of simple elements that can exhibit complex global behavior determined through: i) the way these elements are connected together into a network, and ii) the adaptive element parameters, which are tuned by a learning algorithm. ANNs are mostly used in supervised learning settings but can also be used in reinforcement learning environments (e.g., they can be used along with dynamic programming [64], in what is known as neuro-dynamic programming, to solve RL problems) and in unsupervised learning environments (e.g., a self-organizing map (SOM) is a type of ANN that works under the unsupervised learning paradigm to produce a low-dimensional representation, called a map, of the input space of the training samples).

NNs are essentially “a network of weighted, additive values with nonlinear transfer functions”, although their coined name seems to elicit a grander impression.14

The simplest kind of NN is a single-layer perceptron network, which is a simple kind of feed-forward network (i.e., a network in which connections between the units do not form a directed cycle). In such a network, there exists a single layer of output nodes which is fed the input directly via a series of weights. The sum of the weighted inputs is calculated at each node to give an overall value which is then compared against a threshold (typically 0). If the calculated value is more than the threshold, the neuron fires and takes an activated value (typically 1); otherwise, the neuron takes a deactivated value (typically -1). Despite having a simple and efficient learning algorithm, single-layer networks are of limited utility since they have limited expressive power (i.e., they cannot express complex functions) and can only learn linear decision boundaries in the input. Multi-layer networks, on the other hand, are much more expressive and can represent non-linear functions. In multi-layer NNs, processing elements are arranged in multiple layers (typically interconnected in a feed-forward fashion) with each neuron in a layer having directed connections to the neurons of the subsequent layer. Such networks have the downside that they are hard to train because of the high dimensionality of the weight-space and the abundance of local minima [14].
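The thresholding behavior and the limited expressive power of a single-layer perceptron can be illustrated with a small sketch. The code below uses the classic perceptron learning rule with a threshold of 0 and outputs of +1 and -1, as described above; the bias term, the learning rate, and the toy data sets are assumptions made for the example.

```python
def predict(weights, bias, x):
    """Fire (+1) if the weighted sum exceeds the threshold 0, otherwise output -1."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else -1

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron rule: adjust the weights only when a sample is misclassified."""
    n_inputs = len(samples[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in samples:
            if predict(weights, bias, x) != target:
                weights = [w + learning_rate * target * xi for w, xi in zip(weights, x)]
                bias += learning_rate * target
                errors += 1
        if errors == 0:  # every sample is classified correctly
            break
    return weights, bias

if __name__ == "__main__":
    # AND is linearly separable, so a single layer can learn it.
    and_samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
    w, b = train_perceptron(and_samples)
    print([predict(w, b, x) for x, _ in and_samples])
    # XOR is not linearly separable, so at least one sample stays misclassified.
    xor_samples = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
    w, b = train_perceptron(xor_samples)
    print([predict(w, b, x) for x, _ in xor_samples])
```

The XOR case is the standard example of a function that needs a multi-layer network, since no single linear decision boundary separates its two classes.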

NN is essentially a black-box statistical modeling technique that does not utilize the domain’s subject knowledge but learns features from the data itself. Despite the black-box modeling style of NN, it is a remarkably versatile tool that applies to a wide range of problems and performs fairly well in general. This has led John Denker to famously remark that “neural networks are the second best way of doing just about

14 It has been claimed that the selection of the name “neural network” was one of the great PR successes of the twentieth century since it sounds much more exciting by eliciting a comparison with an actual neural network (i.e., the brain) [94].

anything.” [14]. Notwithstanding this claim, for certain types of tasks (e.g., pattern recognition, speech recognition, etc.), NN is arguably the most effective learning method currently known [15]. The price of the generality of NNs, though, can be the need for large amounts of training data and a longer convergence time.

Application of NNs to CRNs: NNs have been successfully applied to various problems in CRNs such as spectrum sensing, spectrum prediction [97], and dynamic channel selection [98]; these last two applications are especially relevant to our focused topic of routing in CRNs. NNs have been directly employed for the problem of routing in [99] and [100]. For more details about the application of NNs to CRNs, the interested reader is referred to the following survey papers: [2] [12] [93].

F. EVOLUTIONARY ALGORITHMS

Evolutionary algorithms are a set of machine learning techniques that aim to imitate the robust procedures and structures that various biological organisms have used for adaptation and learning in their evolution. Evolutionary algorithms are similar to reinforcement learning algorithms in that they also depend on exploration and exploitation [79].

GENETIC ALGORITHMS: A genetic algorithm (GA) is a particular class of evolutionary algorithm which uses techniques such as inheritance and natural selection that are inspired by evolutionary biology [101]. In particular, GA fundamentally relies on the genetic operators of random mutation and recombination through crossover to improve the current solution. Apart from these operators, the design of GAs also includes other crucial components such as population initialization, genetic representation, fitness function, and a mechanism for selection.

Genetic algorithms are typically implemented in a computer simulation in which evolutionary techniques (mutation, crossover, etc.) are applied on a population of candidate solutions (called individuals). Individuals are encoded in an abstract representation known as a chromosome (which may be problem specific, although representation in strings of 1s and 0s is common). The evolution can start from a population of completely random individuals and can evolve to better solutions through survival of the fittest after application of genetic operators in every generation. In every generation, multiple individuals are stochastically selected from the current population, with fitter individuals more likely to be selected, and are genetically modified (mutated or recombined) to form the next generation of the population. The usage of genetic operators and stochastic selection allows a gradual improvement in the ‘fitness’ of the solution and helps GAs keep away from local optima.
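The generational loop described above can be sketched as follows. The example is deliberately simple and rests on several assumptions of our own: chromosomes are bit strings, the fitness function is a toy one that counts 1 bits, selection is by tournament, and the best individual is carried over unchanged to the next generation (elitism).

```python
import random

def fitness(chromosome):
    """Toy fitness (an assumption): the number of 1 bits in the chromosome."""
    return sum(chromosome)

def tournament_select(population, rng, k=3):
    """Stochastic selection in which fitter individuals are more likely to win."""
    return max(rng.sample(population, k), key=fitness)

def crossover(parent_a, parent_b, rng):
    """Single-point crossover: a prefix of one parent joined to a suffix of the other."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def mutate(chromosome, rng, rate):
    """Flip each bit independently with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]

def run_ga(length=20, population_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        elite = max(population, key=fitness)
        best_per_generation.append(fitness(elite))
        children = [elite]  # elitism (an assumption): keep the best individual
        while len(children) < population_size:
            child = crossover(tournament_select(population, rng),
                              tournament_select(population, rng), rng)
            children.append(mutate(child, rng, mutation_rate))
        population = children
    best = max(population, key=fitness)
    best_per_generation.append(fitness(best))
    return best, best_per_generation

if __name__ == "__main__":
    best, history = run_ga()
    print(history[0], "->", history[-1])
```

Because of elitism, the best fitness in this sketch never decreases from one generation to the next. For a routing problem, the bit string would be replaced by a path-oriented encoding and the fitness by a path cost.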

Like neural networks, genetic algorithms apply very generally. John Denker’s quote about NNs that “neural networks are the second best way of doing just about anything” can be supplemented with the addition “... and genetic algorithms are the third best.” [14]. Neural networks and genetic algorithms can be thought of as the sledgehammers of the algorithms craft due to their broad applicability and can be readily

TABLE II
RELATIONSHIP BETWEEN SOME OF THE FIELDS WHOSE TECHNIQUES ARE PRESENTED IN THIS PAPER

Field | Optimal Control (OC) | Genetic Algorithms | Game Theory
Game Theory (GT) | Multiplayer competitive OC process [4]; Dynamic Games [35] | Field of Evolutionary Game Theory; Evolutionary Algorithms and GT | -
Reinforcement Learning (RL) | Direct Adaptive Optimal Control [95]; Adaptive Dynamic Programming | ‘Exploration and Exploitation’ concept; Evolutionary algorithms for RL [77] | Multiagent RL; Markov games [96]
Markov Decision Process | OC problem with a defined model; Dynamic Programming [64] | ‘Exploration and Exploitation’ concept | Stochastic or Markov games [4] [96]

invoked when more specialized methods fail. Predictably, this generality can sometimes come at a cost in performance and time to convergence. However, these tools are worth an initial try and may perform very well for certain problems. Therefore, Denker’s remark should be construed as corroborating the observation that GAs are widely applicable, but it may undersell some desirable features of GA (and NN) which can make them ideal tools for certain problems.

We have noted earlier that evolutionary algorithms, and by extension GA, are related to reinforcement learning in that both depend on exploration and exploitation [79]. Evolutionary algorithm based learning also illustrates how learning can be viewed as a special case of optimization. These algorithms pursue the ‘optimization problem’ of finding the optimal hypothesis according to a predefined fitness function [15]. With the insight that learning is ultimately related to optimization, we can apply other optimization and heuristic techniques to machine learning problems. For a discussion about heuristic optimization techniques (such as simulated annealing, tabu search, hill climbing) and their application to CRNs, readers are referred to [2].

A recurrent theme in this paper is that most of the machine learning fields, and techniques from kindred disciplines, are more closely related than is immediately apparent. Some representative connections between various fields discussed in this paper (such as game theory, MDPs, RL, GA and optimal control) are tabulated in Table II for easy reference.

Application of GAs to Routing: Ahn et al. proposed tackling the shortest path routing problem through GAs [105]. The paper discussed the issues of path-oriented encoding, and path-based crossover and mutation, which are relevant to the issue of routing [105].

Application of GAs to CRNs: An early application of GA techniques to CRNs is documented in a paper authored by Rondeau et al. [112]. This paper presented the adaptation mechanism of a cognitive engine implemented by the authors which used GAs to evolve a radio’s parameters to a set of parameters that optimize the radio for the user’s current needs. This paper also proposed a GA approach, called the ‘wireless system genetic algorithm’ (WSGA), to realize cross-layer optimization and adaptive waveform control [112].

SWARM INTELLIGENCE: Swarm intelligence refers to a class of machine learning techniques which aim to replicate, in a distributed computational setting, the intelligence shown by social cooperative animals (such as ants and bees). Such animals are well known to form communities that display emergent behavior as the simple, limited individuals collaborate to display complex intelligent behavior. Swarm intelligence techniques in general emphasize distributed implementation and coordination through communication. Solutions based on swarm intelligence techniques have been proposed both for CRNs [106] and for routing problems [107].

ANT COLONY OPTIMIZATION: While typical ‘shortest path’ routing protocols may have significant computational and message complexity, the humble biological ants, in a marvel of nature, are able to find the shortest routes to food sources through the dynamics of the ant colony with extremely modest resources. A lot of research effort has been focused on imitating the performance of biological ants to produce optimized and efficient distributed routing behavior [108] [109].
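A minimal, centralized sketch of the pheromone mechanism behind ant colony optimization is given below. It is not the protocol of [108] or [109]; the graph, the parameter values, and the update rule (evaporation followed by a deposit inversely proportional to path cost) are generic textbook choices that we assume for illustration.

```python
import random

def ant_walk(graph, pheromone, source, target, rng, alpha=1.0, beta=2.0):
    """One ant builds a loop-free path, favouring strong pheromone and cheap links."""
    path, node = [source], source
    while node != target:
        options = [n for n in graph[node] if n not in path]
        if not options:
            return None  # dead end: this ant is discarded
        weights = [pheromone[(node, n)] ** alpha * (1.0 / graph[node][n]) ** beta
                   for n in options]
        node = rng.choices(options, weights=weights, k=1)[0]
        path.append(node)
    return path

def path_cost(graph, path):
    return sum(graph[a][b] for a, b in zip(path, path[1:]))

def aco_shortest_path(graph, source, target, ants=10, iterations=30,
                      evaporation=0.5, seed=0):
    rng = random.Random(seed)
    pheromone = {(a, b): 1.0 for a in graph for b in graph[a]}
    best = None
    for _ in range(iterations):
        walks = (ant_walk(graph, pheromone, source, target, rng) for _ in range(ants))
        paths = [p for p in walks if p]
        for edge in pheromone:  # evaporation, with a small floor to keep weights positive
            pheromone[edge] = max(pheromone[edge] * (1.0 - evaporation), 1e-6)
        for p in paths:
            cost = path_cost(graph, p)
            for a, b in zip(p, p[1:]):
                pheromone[(a, b)] += 1.0 / cost  # cheaper paths are reinforced more
            if best is None or cost < path_cost(graph, best):
                best = p
    return best

if __name__ == "__main__":
    # Hypothetical directed topology with link costs.
    graph = {"A": {"B": 1, "C": 4}, "B": {"C": 1, "D": 5}, "C": {"D": 1}, "D": {}}
    route = aco_shortest_path(graph, "A", "D")
    print(route, path_cost(graph, route))
```

In a distributed routing protocol the pheromone values would be kept in per-node tables and updated by control packets acting as ants, rather than in a single shared dictionary as in this sketch.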

G. SUPPORT VECTOR MACHINE

Support vector machines (SVMs) are a supervised learning technique used mainly for tasks such as pattern recognition and classification. SVMs belong to the general family of learning methods known as kernel machines [14]. SVMs can both i) use an efficient training algorithm, and ii) represent complex, non-linear functions. SVMs typically outperform ANNs for small training sets, but require prior knowledge of the observed process’ distribution and labeled data [12].

Application of SVMs to CRNs: SVMs have been applied to CRNs, but their application has been mostly limited to problems of signal classification [12]. Due to their supervised learning style, it does not seem likely that SVMs will have a direct role to play in the design of AI-based routing protocols for CRNs.

H. BAYESIAN INFERENCE

Bayesian analysis accords significant importance to the prior distribution, which is supposed to represent knowledge about unknown parameters before the data becomes available. While it is a common assumption that the agent has no prior knowledge about what it is trying to learn, this is not an accurate reflection of reality in many cases. Frequently, an agent will have some prior information, and the learning process should ideally exploit this available information.

Bayesian learning can be viewed as a form of uncertain reasoning from observations [14]. Bayesian learning is used to calculate the probability of each hypothesis, given the data, and

TABLE III
SUMMARY OF THE VARIOUS LEARNING TECHNIQUES DISCUSSED IN SECTION IV

Learning techniques | Application to CRNs | Applications to routing
Hidden Markov model | Spectrum occupancy prediction: [102] [103] [104]; Spectrum sensing, primary signal detection (see references in [2]) | Can indirectly utilize spectrum occupancy and channel quality predictions
Reinforcement learning | Many applications: see survey paper [12] | Q-routing algorithm [72]; Learning automata [76] [75]; see further references in survey papers [12] [73] [81].
Learning with games | Spectrum access games [48] | Various routing games: see [45]
Online learning | Opportunistic spectrum access [89] | No regret routing for adhoc networks [92]
Genetic algorithms | Modeling wireless channel: [63] | Shortest path routing [105]
Swarm algorithms | Adaptive optimization [106] | Shortest path routing [107]
Ant colony optimization | Cognitive engine design [108] | Routing with ACO in CRNs [108] and MANETs [109]
Neural networks | Spectrum occupancy prediction [97]; Dynamic channel selection [98]; Radio parameter adaptation (see references in [2]). | Routing with NNs [99] and [100]
Support vector machines | Spectrum sensing, signal classification and pattern recognition (see references in [12]) | -
Bayesian learning | Establishing PU’s activity pattern [37] [46]; Channel estimation [4]; Channel quality prediction [110] | Bayesian routing in delay-tolerant-networks [111]

to make predictions on that basis. It has been shown that the true hypothesis eventually dominates in Bayesian prediction [14].

Bayesian analysis is appealing since it provides a mathematical formulation of how previous knowledge can be incorporated with fresh evidence to create new knowledge. However, choosing the right prior distribution is not trivial and an incorrect assumption can skew the inference. It is for this reason that some statisticians feel uneasy about the use of prior distributions, fearing that it may distort “what the data are trying to say.” [113]. We can fit the prior distribution to prior knowledge or use a ‘noninformative’ prior to model ignorance about prior information.
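The role of the prior can be illustrated with a small sketch of Bayesian updating over a finite set of hypotheses. The scenario is hypothetical: each hypothesis is a candidate probability that a channel is occupied by a PU, and each observation is a single busy (1) or idle (0) sensing result. The function names and the numerical values are our own.

```python
def bayes_update(prior, likelihoods):
    """Posterior P(h | d) is proportional to P(d | h) * P(h)."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]

def posterior_after(observations, hypotheses, prior):
    """Update the belief one observation at a time.

    hypotheses[i] is a candidate probability that the channel is busy;
    each observation is 1 (busy) or 0 (idle).
    """
    belief = list(prior)
    for busy in observations:
        likelihoods = [h if busy else 1.0 - h for h in hypotheses]
        belief = bayes_update(belief, likelihoods)
    return belief

def predict_busy(hypotheses, belief):
    """Bayesian prediction: average the hypotheses, weighted by their posterior."""
    return sum(h * b for h, b in zip(hypotheses, belief))

if __name__ == "__main__":
    hypotheses = [0.2, 0.5, 0.8]
    few = [0, 0, 0, 0, 1]
    many = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0] * 10  # busy 20% of the time
    uniform = [1 / 3, 1 / 3, 1 / 3]   # 'noninformative' prior
    skewed = [0.01, 0.01, 0.98]       # prior that wrongly favours a busy channel
    print(posterior_after(few, hypotheses, uniform))
    print(posterior_after(few, hypotheses, skewed))
    print(posterior_after(many, hypotheses, skewed))
```

With only five observations the skewed prior still keeps most of the belief on the wrong hypothesis, whereas the uniform prior already favours the hypothesis that matches the data. After one hundred observations the data outweigh even the skewed prior.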

Bayesian networks can be used for computing how much a set of mutually exclusive prior events contributes to a posterior condition, which can be a prior to yet another posterior, and so on. Bayesian networks can be used for reasoning and for tracing the chain of conditional causation back from the final condition to the initial causes [14]. Previous work on using Bayesian networks for reasoning in CRNs has been summarized in [114].

While it was noted by Rondeau in 2009 [115] that little research attention had focused on using Bayesian methods of statistical inference in CRNs, a lot of Bayesian inference based work has recently been proposed for CRNs. In particular, Bayesian non-parametric models have been applied to CRNs by various researchers [37] [46] due to their desirable characteristics, such as their ability to flexibly model an unknown environment with model complexity growing as warranted by new data. Parametric models typically assume some finite set of parameters θ and assume that θ captures everything there is to know about the data. Non-parametric models, on the other hand, do not assume that the data distribution can be explained on the basis of a finite set of parameters; instead, an infinite dimensional set of parameters θ, envisioned as a function, is assumed. Bayesian non-parametric models typically exploit in their formulation decades of research on Gaussian processes (which define a distribution on functions) and Dirichlet processes (which define a distribution on distributions). Popular Bayesian nonparametric techniques that use these processes include Gaussian process regression, in which the correlation structure is refined as the sample size increases, and Dirichlet process mixture models for clustering, which adapt the number of clusters to the complexity of the data.

Applications of Bayesian non-parametric methods in CRNs: Saad et al. have proposed a cooperative Bayesian nonparametric framework for primary user activity monitoring in CRNs [37]. In another work, spectrum access in CRNs was modeled as a repeated auction game and a Bayesian nonparametric belief update scheme was constructed based on the Dirichlet process [46].

To conclude this section on learning algorithms, a representative summary of the learning techniques for CRNs is captured in Table III.

V. ROUTING IN CRNS

A. TRADITIONAL WIRELESS ROUTING PROTOCOLS:

While our focus is on surveying techniques useful for cognitive routing protocols in the context of CRNs, it is also prudent to exploit and leverage the huge amount of previous work on routing protocols for wireless networks in general. While wireless networks include both wireless LANs and multi-hop wireless networks, our focus is going to be predominantly on multi-hop wireless networks such as mobile ad-hoc networks, wireless mesh networks and CRNs. We focus on these networks to build upon the insights that we can leverage for the design of effective routing protocols for CRNs.

Previous work on routing in multi-hop wireless networks is notable, for the most part, for its lack of learning from the environment. Most of the classical wireless routing protocols tend to use instantaneous online parameters and do not utilize environment history and learn from it to predict which links and parameters are more likely to result in better quality routes. These protocols also do not learn about parameter history and therefore cannot prioritize higher-quality links over links of poor quality. While primitive protocols such as AODV, DSDV, and DSR have typically relied on basic metrics such as hop count or delay, other metrics were developed

TABLE IV
SUMMARY OF REPRESENTATIVE ROUTING PROTOCOLS THAT HAVE BEEN PROPOSED FOR CRNS

Reference | Type | PU model | Comments
Throughput maximizing
Cacciapuoti et al. [116] | Reactive | Markov on-off process | Reactive routing for mobile ad-hoc CRNs
Ding et al. [117] | | Not described | Cross-layer routing and dynamic spectrum allocation algorithm
SAMER [118] | Reactive | Bernoulli trial every t | Routes with highest spectrum availability (“least-used spectrum first”)
SPEAR [119] | Reactive | Not described | Joint spectrum and route discovery with distributed path reservations to minimize inter- and intra-flow interference.
Delay minimizing
How et al. [120] | Reactive | 2-state Semi Markov Model | Multi-metric (delay and stability) routing providing differentiated service
SEARCH [121] | Reactive | Not described | Designed for mobile CRNs based on geographic forwarding principles
CRP [122] | Reactive | Markov on-off process | Distributed joint route and spectrum selection protocol that explicitly protects PU receivers, and allows multiple classes of routes.
Stability maximizing
Coolest-first [123] | Reactive | Markov on-off process | Proposed new routing metrics to capture the time-varying effects of spectrum availability
Tuggle [124] | Proactive | Not considered | Proposes proactive multi-path routing
Gymkhana [125] | Reactive | Markov on-off process | Path connectivity based distributed protocol that avoids poorly connected zones
Maintenance minimizing
Zhu et al. [126] | Hybrid | Not described | Combines proactive routing and on-demand route discovery
Filippini et al. [127] | | Ergodic random binary process | Optimal centralized, along with distributed, algorithms proposed both for exactly and statistically known PU activity.

for wireless networks over time such as those that targeted: maximizing throughput [128], minimizing interference [129], load balancing [130], and choosing more reliable links [128]. Since metrics designed for traditional wireless networks do not sufficiently capture the time-varying spectrum availability found in CRNs, some recent works have proposed more nuanced spectrum-aware routing metrics [118] [123] [126] [127] [131].

B. ROUTING PROTOCOLS FOR CRNs:

Some challenges for effective routing in CRNs have been highlighted by Akyildiz et al. [3]. It was highlighted that while spectrum sensing techniques and spectrum sharing solutions have received considerable attention from the CRN research community, routing remains an important yet unexplored area in CRNs. Akyildiz et al. go on to highlight that the unique characteristics of the open spectrum phenomenon necessitate development of novel routing algorithms. Some other challenges include i) intermittent connectivity with neighbors in DSA networks causing a highly dynamic topology, ii) heterogeneous channels with diverse channel properties whose availability is time-varying [3], and iii) potential non-availability of a common control channel. The challenges and issues related to the common control channel are covered in depth in [132]. Another potential problem is the fact that CRs would typically have to work in unknown, or incompletely known, environments. With the strong assumption of availability of full spectrum knowledge, optimization based routing solutions have been proposed [133] [134]. Such works are applicable only where this assumption is justified: an example scenario being TV band whitespace networking where the SUs can query databases storing the spectrum map. For the more general case, solutions need to be devised that work with limited local spectrum knowledge.

A wide variety of routing protocols have been proposed for CRNs and a representative summary can be seen in Table IV. These routing protocols have used a diverse set of routing metrics and objectives: e.g., throughput maximizing protocols ([116] [117] [118] [119]), route-stability maximizing protocols ([123] [124]), delay minimizing protocols ([120] [121] [122] [135]), and route-maintenance minimizing protocols ([126] [127]).

The most commonly used approach in the literature is to incorporate these metrics into some variant of a reactive or an on-demand routing protocol15 to avoid the overhead of managing dynamic topologies proactively. With dynamic spectrum access (DSA) being envisioned as a prime application of CRNs, it is important for routing protocols for CRNs to incorporate PU traffic dynamics into their design. Some of the CRN routing protocols have conspicuously not catered to PU dynamics in their design [116] [119] [121] [135] [136], although more recent works [117] [118] [120] [127] have importantly incorporated PU awareness.

Sun et al. [137] have conducted a detailed performance evaluation of three representative CRN routing protocols, SAMER [118], Coolest Path [123], and CRP [122], using both simulations (on the NS2 simulator) and an empirical evaluation (on a 6-node testbed based on the USRP2 platform). The three protocols evaluated all have different design objectives. SAMER aims mainly at finding the highest throughput path while considering both the PU/SU activities and the link quality. Coolest Path is designed to prefer paths that are more stable, since it prefers the path with the highest spectrum availability. CRP is designed to either find a path with minimum end-to-end delay along with satisfactory PU protection, or to offer more complete protection to PU receivers at the cost of some performance degradation to SUs. It has been shown in their simulation and testbed results that SAMER provides the highest throughput under low PU activity (since SAMER aims to calculate

15 See Table IV for the preponderance of reactive routing protocols proposed in the literature.

throughput maximizing paths explicitly) and is also shown to be robust to packet loss; however, its performance under high PU activity deteriorates, particularly in the simulation results. Sun et al. also provide qualitative insights into the design of CRN routing protocols. Their findings suggest that taking link quality and interference between SUs into account can greatly improve routing performance, particularly under low PU activity. For high PU activity, however, path stability and path length become more important. Another important finding is that estimating spectrum availability based only on local observations cannot guarantee path stability, suggesting that improvements can be made through cooperation.

Relatively few studies in the literature have addressed the multicast routing problem in CRNs. Kim et al. have proposed a multicast routing protocol (COCAST) for mobile ad-hoc networks with nodes equipped with CRs [138]. Their work aimed at improving the scalability of the traditional ODMRP multicasting protocol in an environment using CRs. In another work, Almasaeid et al. have addressed the problem of assisted multicast scheduling in cognitive wireless mesh networks [139], and have proposed two approaches for cooperative multicasting: the first depends on the assistance of multicast receivers in delivering multicast data to other receivers, while the second is network-coding based. Some other works have addressed the joint problem of routing and channel assignment for multicast communication in multihop CRNs [140] [141].

Broadcasting is a commonly used networking primitive for both control and data traffic. The problem of broadcast routing in CRNs is challenging, as noted in [3]. In CRNs, channel heterogeneity, intermittent connectivity, and the lack of a common control channel can constrain the ability to perform effective broadcast routing [3]. Recently, a fully distributed broadcast routing scheme for CRNs that does not require a common control channel has been proposed [142]. An adaptive channel assignment scheme that modifies the assignment to suit broadcast routing when the broadcasting traffic volume is significant is presented in [143]. Some other works that have addressed the problem of broadcasting in CRNs include [144] [145].

It is noted here that while most of the proposed routing protocols do include certain adaptive features, relatively little work has been done to integrate AI-based machine learning techniques into the routing solutions for CRNs. This is a promising new subfield ripe for future research exploration. To realize the vision of next-generation cognitive networks, it is imperative that due attention be given to this central piece of the overall cognitive architecture.

Interested readers are referred to the following survey papers on routing in CRNs, and the references therein, for more information about the various routing protocols proposed for CRNs [146] [147] [148].

VI. COGNITIVE ROUTING TASKS IN CRNS

As noted earlier, although CRN routing protocols do mostly incorporate spectrum-awareness into their design, future cognitive networks will require greater architectural support from fully ‘cognitive routing protocols’ that will seamlessly incorporate AI-based techniques such as learning, planning, and reasoning in their design.

Some inference and reasoning tasks and some modeling and prediction tasks that future cognitive routing protocols must incorporate are described next in subsections VI-A and VI-B, respectively.

A. INFERENCE AND REASONING TASKS

Reasoning is an important aspect of CRN behavior and is necessary for cognitive behavior. Knowledge can be represented using an ontology, which provides a shared vocabulary useful for modeling a domain; e.g., it can be used to model the types of objects and concepts existing in a system or domain, and their mutual relationships and properties [6]. A rule-based system can make use of a knowledge base and some means of inference through an inference engine.

It is also possible to reason by analogy. This involves transferring knowledge from a past analogous situation to a similar present situation. Case-based reasoning (CBR) is a well-known kind of analogy making which has been exploited in CRN research [2]. In case-based reasoning, a database of existing cases is maintained and used to draw conclusions about new cases. The CBR method can utilize procedures like pattern matching and various statistical techniques to find which historical case to relate to the current case.
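As a rough illustration of the retrieval step in CBR, the following sketch stores past cases as feature vectors paired with the action that worked, and answers a new query with the action of the closest stored case. The feature names (PU activity, link quality), the action labels, and the choice of Euclidean distance are assumptions made for illustration only and are not taken from [2].

```python
import math

# Hypothetical case base: (pu_activity, link_quality) -> action that worked before.
CASE_BASE = [
    ((0.1, 0.9), "use_shortest_path"),
    ((0.8, 0.9), "use_most_stable_path"),
    ((0.2, 0.3), "use_highest_quality_path"),
]


def retrieve(query, case_base=CASE_BASE):
    """Return the action of the stored case closest to `query` (Euclidean distance)."""
    _, action = min(case_base, key=lambda case: math.dist(case[0], query))
    return action
```

A fuller CBR loop would also revise the retrieved action and store the outcome as a new case; only the retrieval step is sketched here.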

Fuzzy logic is another tool that is useful for reasoning in systems and situations having inherent uncertainty or ambiguity. Since complete environmental knowledge is difficult, or even impossible, to obtain in CRNs, fuzzy logic is a natural fit to the CRN environment, where there is limited or no information about certain environmental factors. Fuzzy logic based reasoning has been commonly used in CRNs [149] [150].
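A minimal sketch of fuzzy reasoning is given below, assuming two normalized inputs in [0, 1] and a single rule, "a channel is suitable if PU activity is low AND link quality is high". The triangular membership functions, their breakpoints, and the use of min as the fuzzy AND are illustrative assumptions and are not drawn from [149] or [150].

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), rising to 1 at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def channel_suitability(pu_activity, link_quality):
    """Degree to which 'PU activity is low AND link quality is high' holds."""
    low_activity = triangular(pu_activity, -0.5, 0.0, 0.5)
    high_quality = triangular(link_quality, 0.5, 1.0, 1.5)
    # min() is one common choice for the fuzzy AND operator.
    return min(low_activity, high_quality)
```

The output is a degree of truth rather than a hard decision, so a node can rank candidate channels even when its knowledge of the environment is imprecise.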

While reasoning is an extremely important part of cognition, a comprehensive treatment of the various reasoning tools and techniques useful for CRNs is outside the scope of this work. We refer interested readers to a recent survey on learning and reasoning in CRNs for a comprehensive account of methods, techniques, issues, and challenges in implementing reasoning [6].

B. MODELING, PREDICTION, AND LEARNING TASKS

Future cognitive routing protocols can benefit from the following tasks: i) channel quality modeling and prediction, ii) PU activity modeling and prediction, and iii) detecting and mitigating selfish behavior. We will discuss these in turn next under their respective headings.

1) Channel quality modeling and prediction: In [63], Rondeau et al. proposed using an HMM to model the wireless channel online, with the HMM being trained using a genetic algorithm. In [151], techniques for modeling wireless network channels using Markov models are presented along with techniques for efficient estimation of Markov model parameters (including the number of states) to aid in reproducing and/or forecasting channel statistics accurately. In another work, Xing et al. have proposed to perform channel quality prediction using Bayesian inference [110]. The channel estimation problem has also been addressed in [4], in which particle filters, rooted in Bayesian estimation, were proposed as a device for tracking statistical variations in a wireless channel.
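To give a flavor of Bayesian inference applied to channel quality, the sketch below treats each packet transmission on a channel as a Bernoulli trial and maintains a Beta posterior over the delivery probability. This conjugate Beta-Bernoulli model, the uniform Beta(1, 1) prior, and the function name are assumptions chosen for simplicity; it is not the method of [110].

```python
def posterior_delivery_probability(successes, failures, prior_a=1.0, prior_b=1.0):
    """Posterior mean of the delivery probability under a Beta(prior_a, prior_b) prior."""
    a = prior_a + successes  # pseudo-count of delivered packets
    b = prior_b + failures   # pseudo-count of lost packets
    return a / (a + b)
```

With no observations the estimate is the prior mean of 0.5, and it moves toward the empirical delivery ratio as more packets are observed, which is the behavior one would want from an online link-quality predictor.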

Researchers have proposed using an ANN-based cognitive engine for learning how the quality status of various channels affects performance, and thereby dynamically selecting a channel that improves performance. The dynamic selection of channels has an obvious implication for network-layer functionality: the routing algorithm for such networks should be able to keep up with the channel changes so that the best performing routes are selected.

2) Spectrum occupancy modeling: A satisfactory model of spectrum occupancy (or of spectrum white spaces) should incorporate: i) the states of the channel along with their transition behavior, and ii) the sojourn time, or the time duration the system resides in each of the states [152].
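Both ingredients can be estimated from a sensed occupancy trace. The sketch below assumes a trace sampled at a fixed interval, with 0 for idle and 1 for busy, and collapses it into runs whose lengths are empirical sojourn times. The function names are hypothetical, and the first and last runs are treated as complete even though they are in fact truncated by the observation window.

```python
from itertools import groupby


def sojourn_runs(trace):
    """Collapse a sampled occupancy trace into (state, run_length) pairs."""
    return [(state, len(list(group))) for state, group in groupby(trace)]


def mean_sojourn(trace):
    """Mean run length, in samples, for each state seen in the trace."""
    totals, counts = {}, {}
    for state, length in sojourn_runs(trace):
        totals[state] = totals.get(state, 0) + length
        counts[state] = counts.get(state, 0) + 1
    return {state: totals[state] / counts[state] for state in totals}
```

In a two-state model the transition behavior is trivial (idle always moves to busy and vice versa), so the run lengths carry essentially all of the statistical information.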

Since many DSA environments (e.g., contention based protocols such as IEEE 802.11) do not have a slotted structure, it is more appropriate to use a continuous-time (CT) model. A CT model that is especially relevant to DSA, and one that is popularly used for modeling spectrum occupancy, is the semi-Markov model (SMM), which generalizes the concept of CT Markov chains (CTMCs). Although both the semi-Markov and CTMC models have the Markovian property and describe the transition behavior in the same way, a SMM allows the occupancy periods, or the sojourn times, to be specified arbitrarily for each state. In particular, the occupancy time does not have to be exponentially distributed, as must be the case for CTMCs by definition [153] [154]. Specification of a SMM therefore requires the statistical specification of both the transition behavior and the sojourn time within each state [153] [154].
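The distinction can be stated compactly in terms of the sojourn time. The symbols below are notation introduced here for illustration and are not taken from [153] [154]: T_i is the sojourn time in state i, q_i is the rate at which a CTMC leaves state i, and F_i is the sojourn-time distribution assigned to state i in the SMM.

```latex
\begin{align}
\text{CTMC:}\quad & \Pr(T_i > t) = e^{-q_i t}, \qquad t \ge 0,\\
\text{SMM:}\quad  & \Pr(T_i > t) = 1 - F_i(t), \qquad t \ge 0,
\end{align}
```

Here F_i may be any distribution on the non-negative reals; choosing F_i(t) = 1 - e^{-q_i t} for every state recovers the CTMC as a special case.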

It has been posited that, for practical purposes of analyzing DSA/CRNs, a simple two-state semi-Markov ON-OFF model is adequate for modeling spectrum usage [155] (see table IV for the popularity of this model). The OFF state represents an idle channel, while the ON state indicates a busy channel not available for opportunistic access, with the lengths of the ON and OFF periods being random variables (RVs) following some specified distribution. Such a model is also known as a stochastic duty cycle model [156]. The usage of this simple semi-Markov ON-OFF model is quite popular [152], although other more elaborate models are also available [157]. Geirhofer et al. showed in [158] that such a model can be used to empirically model the spectrum use in IEEE 802.11b WLAN systems. It was noted that their results should also extend to other systems having multi-access protocols similar to CSMA/CA.
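The two-state ON-OFF model is straightforward to simulate. The sketch below alternates OFF and ON periods whose lengths are drawn from caller-supplied distributions and then measures the duty cycle, i.e., the fraction of time the channel is busy. The function names and the convention of starting in the OFF state are assumptions made for illustration.

```python
import random


def simulate_on_off(draw_on, draw_off, n_cycles, seed=0):
    """Return (state, duration) pairs for an alternating OFF/ON channel.

    `draw_on` and `draw_off` each take a random.Random instance and return
    one period length, so any sojourn-time distribution can be plugged in.
    """
    rng = random.Random(seed)
    periods = []
    for _ in range(n_cycles):
        periods.append(("OFF", draw_off(rng)))
        periods.append(("ON", draw_on(rng)))
    return periods


def duty_cycle(periods):
    """Fraction of the total time spent in the ON (busy) state."""
    on_time = sum(duration for state, duration in periods if state == "ON")
    total_time = sum(duration for _, duration in periods)
    return on_time / total_time
```

For example, with exponential periods of mean 2 (ON) and mean 6 (OFF), `duty_cycle(simulate_on_off(lambda r: r.expovariate(1 / 2.0), lambda r: r.expovariate(1 / 6.0), 20000))` should come out close to 2 / (2 + 6) = 0.25, the ratio of the mean ON period to the mean cycle length.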

An important aspect of using such semi-Markov models is specifying the state sojourn or stay times, and studying whether successive period lengths are correlated. The simplest approach is to assume that the state sojourn time is exponentially distributed and that successive stay times are not correlated. Such an approach is interesting due to its simplicity and tractability. Unfortunately, studies have shown that this simple approximation does not tally well with empirical studies on actual systems [152]. Nonetheless, the exponential distribution is still used heuristically [159], although such an approach is not entirely justified statistically, since this assumption makes the model easier to apply in practice. Empirical studies have shown that state sojourn times typically have larger variability than suggested by the exponential distribution. In fact, the distributions of the ON and in particular the OFF periods were often found to be heavy-tailed [152]. These results motivated the need for simple models featuring correlated ON and OFF periods with heavy-tailed marginal distributions. In this regard, Pareto distributions have been used in the literature [160]. Since heavy-tailed distributions have the disadvantage of being difficult to treat analytically [161], other approximate distributions have also been explored. In particular, the flexible yet tractable phase-type distributions16 and Beta distributions [156] have been used to capture the data statistically.
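To see why the exponential assumption understates variability, one can compare the probability of a very long period under an exponential and under a Pareto distribution with the same mean. The sketch below does this in closed form; the Pareto shape of 1.5 used in the note that follows is an arbitrary illustrative value, not one fitted in [152] or [160].

```python
import math


def exponential_tail(t, mean):
    """P(T > t) for an exponential sojourn time with the given mean."""
    return math.exp(-t / mean)


def pareto_tail(t, mean, shape):
    """P(T > t) for a Pareto sojourn time with the given mean and shape."""
    if shape <= 1.0:
        raise ValueError("the mean is finite only for shape > 1")
    # Scale (minimum value) chosen so that the Pareto has the requested mean.
    scale = mean * (shape - 1.0) / shape
    return 1.0 if t < scale else (scale / t) ** shape
```

With a mean of 1 and a shape of 1.5, the probability of a period exceeding ten times the mean is roughly 6e-3 under the Pareto model against roughly 5e-5 under the exponential model, so long idle periods are far more likely than the exponential model predicts.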

Measurements have shown that lengths of vacant periods in a given frequency band can be correlated in addition to having a heavy-tailed distribution [152]. As simple semi-Markov ON-OFF models cannot reproduce this effect, Wellens et al. proposed producing correlated sequences using an aggregation of multiple semi-Markov ON-OFF processes [160], similar to how certain self-similar traffic models work [163].

The spectrum occupancy model should incorporate not only the temporal aspect of PU activity but also its spatial aspect. The impact of the PU activity pattern on spatial spectrum reuse opportunities has been studied in [164].

Many studies have focused on empirical modeling of spectrum usage and have proposed various models for the PU traffic pattern [156] [160] [165] [166] [167] [168]. For further details, interested readers are referred to the survey papers [155] [169], in which various statistical models proposed in the literature for modeling temporal and spatial spectrum occupancy are reviewed in detail.

3) PU activity modeling and prediction: In DSA-based CRNs, a primary user (PU), being the licensed incumbent, has prioritized access to the wireless spectrum. Therefore, on the arrival of a PU, a SU must either vacate the relevant channel, by switching to another channel or by terminating its connection, or reduce its transmission power to ensure that the PU does not face any interference. Since the arrivals of PUs are non-deterministic, and random from the point of view of a SU, frequent PU arrivals can lead to frequent temporal connection losses for SUs, thereby seriously impacting their performance. However, a SU can probabilistically model the arrival process and traffic pattern of PUs and avoid the channels that will be claimed by a PU with high probability. This can help reduce the temporal connection loss faced by SUs and the potential interference faced by PUs due to any delays in the vacation of a channel by SUs.

16Phase-type distributions result from a network of one or more inter-related Poisson processes. The distribution can be represented as a random variable modeling the time to absorption in a Markov process with one absorbing state. Due to its great flexibility, it can be used to model any positive-valued distribution. Furthermore, efficient algorithms exist for estimating such a model’s parameters [162].


A cognitive radio that manages to learn the behavioural patterns of a primary user by modeling it can optimize its performance by exploiting the learned model. For example, a SU can exploit information, potentially gleaned from spectrum sensing data, and select white spaces (that emerge due to the absence of PUs) that tend to be longer lived at certain times of day and at certain locations. Knowing something about PU patterns can also be helpful for advance planning when a SU has to decide which channel to switch to on the arrival of a PU [170].
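One simple way a SU could exploit a learned PU model is sketched below. It assumes that PU arrivals on each channel follow a Poisson process with a rate the SU has already estimated, so that the probability of an idle channel staying idle for a transmission of a given duration is exp(-rate * duration). The Poisson assumption, the channel identifiers, and the function names are illustrative only; as discussed earlier, measured idle periods are often heavier-tailed than this model implies.

```python
import math


def survival_probability(pu_arrival_rate, duration):
    """P(no PU arrival within `duration`) under Poisson arrivals at the given rate."""
    return math.exp(-pu_arrival_rate * duration)


def pick_channel(arrival_rates, duration):
    """Return the channel id whose white space is most likely to outlast `duration`.

    `arrival_rates` maps a channel id to its estimated PU arrival rate.
    """
    return max(
        arrival_rates,
        key=lambda ch: survival_probability(arrival_rates[ch], duration),
    )
```

A routing protocol could use the same survival probability as a per-link stability weight when comparing candidate paths.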

A number of techniques have been proposed for spectrum prediction, including techniques that are: a) HMM based, b) NN based, c) Bayesian inference based, d) moving-average based, e) autoregressive-model based, and f) static-neighbor-graph based (which is able to incorporate the PU mobility pattern) [110].

HMMs have been popularly used for spectrum occupancy prediction [102] [103]. Akbar et al. utilized HMMs for predicting spectrum occupancy of the licensed radio bands for CRNs in their proposal of an HMM-based DSA algorithm [102]. Choi et al. proposed a channel learning scheme based on HMM and also proposed a partially observable Markov decision process (POMDP) based framework for channel access to opportunistically exploit the frequency channels a primary network operates on [26]. Choi et al. have a follow-up work on using HMM to model the traffic pattern of PUs [104].
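The core of HMM-based occupancy prediction can be sketched as a forward (filtering) recursion followed by a one-step prediction. The code below assumes a discrete-time HMM whose hidden states are the true channel states (0 for idle, 1 for busy) and whose observations are possibly erroneous sensing results; the parameter names and the assumption that the model parameters are already known are illustrative, and this is not the specific algorithm of [102] or [26].

```python
def predict_next_state(observations, trans, emit, prior):
    """Return the predictive distribution over the next hidden state.

    trans[s][t] is P(next state = t | current state = s),
    emit[s][o] is P(observation = o | state = s), and
    prior[s] is P(state = s) at the time of the first observation.
    Assumes every observation has non-zero probability under the model.
    """
    n = len(prior)
    belief = list(prior)
    for obs in observations:
        # Update: weight each state by the likelihood of the observation.
        belief = [belief[s] * emit[s][obs] for s in range(n)]
        total = sum(belief)
        belief = [b / total for b in belief]
        # Predict: push the belief through the transition matrix.
        belief = [sum(belief[s] * trans[s][t] for s in range(n)) for t in range(n)]
    return belief
```

With perfect sensing the result reduces to the transition-matrix row of the last observed state; with imperfect sensing the recursion blends the sensing history with the learned PU dynamics.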

Various models for PU traffic pattern prediction are presented in [171]. Wang et al. [172] have proposed modeling the interaction between the PUs and SUs through continuous-time Markov chains (CTMCs). Saad et al. have proposed using cooperation between CR devices that are observing the availability pattern of PUs, along with Bayesian nonparametric techniques, to estimate the distributions of the PU activity pattern. Spectrum prediction has also been tackled through neural networks in [97]. For more details about spectrum prediction techniques, interested readers are referred to a detailed survey on this topic [110] and the references therein.

4) Detection of Selfish Behavior: Network-layer behavior entails both the problems of routing and forwarding. In wireless networks, selfish behavior can manifest itself when nodes engage in unsocial behavior, i.e., they utilize the network resources but do not return the favor by providing necessary services to the other network nodes. For correct network behavior, it is important that such behavior be curbed. The problems of identifying and mitigating selfish network behavior have been addressed in [50] [51]. This problem has also been studied through the tools provided by game theory in [44] [173].
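As a deliberately simple illustration of detecting such behavior, the sketch below flags nodes that forward only a small fraction of the packets they were asked to relay. The per-node counters, the threshold values, and the function name are hypothetical; this heuristic is not the mechanism of [50] or [51], which rely on incentive-compatible protocol design rather than thresholding.

```python
def flag_selfish(counters, min_ratio=0.5, min_requests=10):
    """Return the ids of nodes that look selfish, in sorted order.

    `counters` maps a node id to (packets_requested, packets_forwarded).
    Nodes with fewer than `min_requests` relay requests are not judged.
    """
    flagged = []
    for node, (requested, forwarded) in sorted(counters.items()):
        if requested >= min_requests and forwarded / requested < min_ratio:
            flagged.append(node)
    return flagged
```

In a wireless setting a low forwarding ratio can also be caused by poor links or PU activity, so a practical detector would need to separate unwillingness from inability before penalizing a node.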

VII. OPEN RESEARCH ISSUES AND FUTURE WORK

In this section, we outline some of the major open research issues in building cognitive networks and in developing AI-enabled cognitive routing protocols. We also discuss potential future work.

A. Incorporation of user preferences and context-awareness

While most of the CRN research focus has been on solving the engineering challenges in building the artifact of a cognitive radio network, the role of users, their preferences, and context-awareness seem to have unfortunately taken a back seat. If a ‘wireless society’ [174] vision is to materialize in the not so distant future, it is imperative that researchers focus on seamless integration of user preferences and awareness of (identity, location, time, activity-based) context into cognitive networking design. There has been some initial work done in this area [175], but a lot more work needs to be done.

B. Application of novel machine-learning techniques

Game theory, reinforcement learning, neural networks, and genetic algorithms, due to their natural fit to the kind of problems faced in CRNs, are understandably the most used AI techniques in CRNs. However, as listed in this survey paper, there are various other machine-learning techniques that can plausibly be applied to tasks much more diverse than their current applications. In particular, it is anticipated that Bayesian techniques will find increasing usage in CRNs. It is an open research question which machine-learning techniques, apart from the currently popular approaches, would prove to be most successful in solving problems in CRNs.

C. Interworking with other modern technologies

An initial promise of software defined radio (SDR) was seamless interworking with a plethora of technologies through software defined adaptations. The vision of CRNs has evolved from the foundations of SDRs and aims to provide users with a seamless holistic experience that integrates potentially heterogeneous technologies. The interplay of cognitive radios with the software defined networking (SDN) architecture, which allows a standards-based interface [176] between a centralized ‘network controller’17 and networking devices, should be explored. It is possible that interesting use cases will emerge that will synergize the mainly centralized operational paradigm of SDNs with the mainly distributed operational paradigm of CRs. While the emphasis of SDN architecture has been on the separation of control and data planes, it is worth exploring if a combined SDN and CR architecture can help realize the vision of having a ‘knowledge plane’ for networks as envisioned by Clark et al. [11]. Also, it is worth exploring how cognitive networks may seamlessly integrate modern technologies like the internet-of-things and pervasive and ubiquitous technology.

D. Cross layer optimization for cognitive networks

The overarching focus of the CRN research community to date has been on problems such as spectrum sensing, signal classification, and other issues that relate to the PHY and MAC layers. Relatively less attention has been paid to problems on the networking and higher layers. To realize the vision of cognitive networks, it will be important to focus more holistically across the networking stack. Future researchers need to focus more on cross-layer optimization and to study the implications of the subtle interplay between various layers.

17The centralized SDN network controller can itself be built as a distributed system to be scalable and avoid a single point of failure.


E. Challenges in modeling CRNs

While it is quite common to use simplistic assumptions (such as the Markovian assumption or the perfect knowledge assumption) to keep our models tractable, real systems are in fact quite complex and CRs often have to work in unknown RF environments. As Benoit Mandelbrot famously lamented, ‘the world unfortunately has not been designed for the convenience of mathematicians’. There is therefore a lot of scope for new research in the areas of decision making and learning in non-Markovian, partially observed, or unknown environments. In such unknown environments, the usage of model-free methods will become increasingly important.

The cognitive radio networking environment is naturally amenable to distributed multi-agent decision making rather than centrally controlled optimization. We have seen earlier how multi-agent environments are much more challenging to design than their single-agent counterparts. Ideas from game theory and economic market design will become increasingly important as multi-agent learning becomes commonplace in CRN design. With AI-based cognitive networks becoming mainstream, it will be important to understand the behavior of the overall CRN system in terms of equilibria and dynamics for large distributed networks with multiple learning CRs, each taking self-serving decisions with access to limited information.

F. Eliciting and modeling cooperative behavior

To effectively perform distributed learning and decision making in wireless networks, cooperative behavior is very important, even for agents interested in personal utility maximization [177]. Already, there has been a lot of work on cooperative spectrum sensing and decision making through coalitions [37]. In a recently proposed cooperative paradigm, named ‘docitive networks’, it is proposed that agents will learn more efficiently through enhanced cooperative knowledge transfer [178]. Docitive networks draw their etymology from the Latin root word docere, meaning ‘to teach’. In [178], three distinct docitive approaches were proposed for CRNs: i) startup docition, ii) adaptive docition, and iii) iterative docition. In future work, eliciting and encouraging cooperative behavior through incentives and mechanism design will become important and is a promising area for further research.

G. Understanding the dynamics of CRNs

Emergent behavior of CRNs, composed of multiple self-interested CRs servicing users with distinct contexts, can be complex. This can manifest itself when slight changes in one or more of the system parameters result in dramatic changes in system behavior [4]. Researchers can exploit advances in the study of complexity to understand the dynamics of such CRNs [179]. The interplay between cooperation, competition, and exploitation has also been explored in [4], in which etiquettes and protocols to manage the tradeoff were emphasized.

VIII. CONCLUSION

Learning is at the core of the vision of cognitive radio and cognitive radio networks. While a lot of previous research attention has focused on general AI techniques for optimizing PHY and MAC layer parameters for CRs, scant attention has been given to utilizing learning techniques at the network layer, particularly for the problem of routing. We have argued in this paper that incorporating learning from past and present conditions can be very productive and can lead to improved CRN performance. In this paper, we have surveyed the set of techniques that can be used to embed learning in the routing framework of CRNs, and provided a tutorial on the various relevant techniques from a wide variety of fields. Open research issues and potential directions for future work have also been identified.

REFERENCES

[1] J. Mitola III, Cognitive Radio Architecture: The Engineering Foundations of Radio XML. John Wiley & Sons, 2006.

[2] A. He, K. K. Bae, T. R. Newman, J. Gaeddert, K. Kim, R. Menon, L. Morales-Tirado, J. J. Neel, Y. Zhao, J. H. Reed, et al., “A survey of artificial intelligence for cognitive radios,” Vehicular Technology, IEEE Transactions on, vol. 59, no. 4, pp. 1578–1592, 2010.

[3] I. F. Akyildiz, W.-Y. Lee, M. C. Vuran, and S. Mohanty, “Next generation/dynamic spectrum access/cognitive radio wireless networks: a survey,” Computer Networks, vol. 50, no. 13, pp. 2127–2159, 2006.

[4] S. Haykin, “Cognitive radio: brain-empowered wireless communications,” Selected Areas in Communications, IEEE Journal on, vol. 23, no. 2, pp. 201–220, 2005.

[5] B. A. Fette, Cognitive radio technology. Access Online via Elsevier, 2009.

[6] L. Gavrilovska, V. Atanasovski, I. Macaluso, and L. DaSilva, “Learning and reasoning in cognitive radio networks,” IEEE Communications Surveys and Tutorials, 2013.

[7] R. W. Thomas, D. H. Friend, L. A. DaSilva, and A. B. MacKenzie, Cognitive networks. Springer, 2007.

[8] C. Fortuna and M. Mohorcic, “Trends in the development of communication networks: Cognitive networks,” Computer Networks, vol. 53, no. 9, pp. 1354–1376, 2009.

[9] P. Mähönen, M. Petrova, J. Riihijärvi, and M. Wellens, “Cognitive wireless networks: your network just became a teenager,” in Proceedings of IEEE INFOCOM, vol. 2006, Citeseer, 2006.

[10] P. Mahonen, “Cognitive trends in making: future of networks,” in Personal, Indoor and Mobile Radio Communications, 2004. PIMRC 2004. 15th IEEE International Symposium on, vol. 2, pp. 1449–1454, IEEE, 2004.

[11] D. D. Clark, C. Partridge, J. C. Ramming, and J. T. Wroclawski, “A knowledge plane for the internet,” in Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, pp. 3–10, ACM, 2003.

[12] M. Bkassiny, Y. Li, and S. Jayaweera, “A survey on machine-learning techniques in cognitive radios,” Communications Surveys and Tutorials, IEEE, vol. 15, 2013.

[13] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning from data. AMLBook, 2012.

[14] S. J. Russell, P. Norvig, J. F. Canny, J. M. Malik, and D. D. Edwards, Artificial intelligence: a modern approach, vol. 74. Prentice hall Englewood Cliffs, 1995.

[15] T. M. Mitchell, “Machine learning. wcb,” 1997.

[16] T. M. Mitchell, The discipline of machine learning. Carnegie Mellon University, School of Computer Science, Machine Learning Department, 2006.

[17] C. Clancy, J. Hecker, E. Stuntebeck, and T. O’Shea, “Applications of machine learning to cognitive radio networks,” Wireless Communications, IEEE, vol. 14, no. 4, pp. 47–52, 2007.

[18] T. W. Rondeau, Application of artificial intelligence to wireless communications. PhD thesis, Virginia Polytechnic Institute and State University, 2007.

[19] S. Kulkarni and G. Harman, An Elementary Introduction to Statistical Learning Theory, vol. 853. Wiley, 2011.


[20] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical dirichlet processes,” Journal of the american statistical association, vol. 101, no. 476, 2006.

[21] N. Shetty, S. Pollin, and P. Pawelczak, “Identifying spectrum usage by unknown systems using experiments in machine learning,” in Wireless Communications and Networking Conference, 2009. WCNC 2009. IEEE, pp. 1–6, IEEE, 2009.

[22] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. Cambridge Univ Press, 1998.

[23] R. Bellman, Dynamic Programming. 1957.

[24] “Talk on ‘Deconstructing Reinforcement Learning’ by Richard Sutton at ICML 2009.” http://videolectures.net/icml09_sutton_itdrl/ . Accessed: 2013-08-17.

[25] A. Gosavi, “Reinforcement learning: A tutorial survey and recent advances,” INFORMS Journal on Computing, vol. 21, no. 2, pp. 178–192, 2009.

[26] K. W. Choi and E. Hossain, “Opportunistic access to spectrum holes between packet bursts: a learning-based approach,” Wireless Communications, IEEE Transactions on, vol. 10, no. 8, pp. 2497–2509, 2011.

[27] Q. Zhao, L. Tong, A. Swami, and Y. Chen, “Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework,” Selected Areas in Communications, IEEE Journal on, vol. 25, no. 3, pp. 589–600, 2007.

[28] E. Altman, “Applications of markov decision processes in communication networks,” in Handbook of Markov decision processes, pp. 489–536, Springer, 2002.

[29] A. B. MacKenzie and L. A. DaSilva, “Game theory for wireless engineers,” Synthesis Lectures on Communications, vol. 1, no. 1, pp. 1–86, 2006.

[30] M. Felegyhazi and J.-P. Hubaux, “Game theory in wireless networks: A tutorial,” tech. rep., Technical Report LCA-REPORT-2006-002, EPFL, 2006.

[31] S. Keshav, Mathematical Foundations of Computer Networking. Addison-Wesley, 2012.

[32] J. Y. Halpern, “Beyond nash equilibrium: Solution concepts for the 21st century,” in Proceedings of the twenty-seventh ACM symposium on Principles of distributed computing, pp. 1–10, ACM, 2008.

[33] V. Raghunathan and P. Kumar, “Wardrop routing in wireless networks,” Mobile Computing, IEEE Transactions on, vol. 8, no. 5, pp. 636–652, 2009.

[34] N. Nisan, Algorithmic game theory. Cambridge University Press [Available Online], 2007.

[35] T. Basar and G. J. Olsder, Dynamic noncooperative game theory, vol. 200. SIAM, 1995.

[36] K. Leyton-Brown and Y. Shoham, “Essentials of game theory: A concise multidisciplinary introduction,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 2, no. 1, pp. 1–88, 2008.

[37] W. Saad, Z. Han, H. V. Poor, T. Basar, and J. B. Song, “A cooperative bayesian nonparametric framework for primary user activity monitoring in cognitive radio networks,” Selected Areas in Communications, IEEE Journal on, vol. 30, no. 9, pp. 1815–1822, 2012.

[38] E. Hossain, D. Niyato, and Z. Han, Dynamic spectrum access and management in cognitive radio networks. Cambridge University Press, 2009.

[39] A. Neyman and S. Sorin, Stochastic games and applications, vol. 570. Springer, 2003.

[40] S. Maharjan, Y. Zhang, and S. Gjessing, “Economic approaches for cognitive radio networks: A survey,” Wireless Personal Communications, vol. 57, no. 1, pp. 33–51, 2011.

[41] Y. Zhang, C. Lee, D. Niyato, and P. Wang, “Auction approaches for resource allocation in wireless systems: A survey,” Communications Surveys and Tutorials, 2013.

[42] M. Di Felice, K. R. Chowdhury, C. Wu, L. Bononi, and W. Meleis, “Learning-based spectrum selection in cognitive radio ad hoc networks,” in Wired/Wireless Internet Communications, pp. 133–145, Springer, 2010.

[43] M. Di Felice, K. R. Chowdhury, and L. Bononi, “Learning with the bandit: A cooperative spectrum selection scheme for cognitive radio networks,” in Global Telecommunications Conference (GLOBECOM 2011), 2011 IEEE, pp. 1–6, IEEE, 2011.

[44] T. Roughgarden, “Routing games,” Algorithmic Game Theory, vol. 18, 2007.

[45] F.-N. Pavlidou and G. Koltsidas, “Game theory for routing modeling in communication networks-a survey,” Communications and Networks, Journal of, vol. 10, no. 3, pp. 268–286, 2008.

[46] Z. Han, R. Zheng, and H. V. Poor, “Repeated auctions with bayesian nonparametric learning for spectrum access in cognitive radio networks,” Wireless Communications, IEEE Transactions on, vol. 10, no. 3, pp. 890–900, 2011.

[47] Z. Han, D. Niyato, W. Saad, T. Basar, and A. Hjorungnes, Game theory in wireless and communication networks. Cambridge University Press, 2012.

[48] M. Van der Schaar and F. Fu, “Spectrum access games and strategic learning in cognitive radio networks for delay-critical applications,” Proceedings of the IEEE, vol. 97, no. 4, pp. 720–740, 2009.

[49] M. Felegyhazi, J.-P. Hubaux, and L. Buttyan, “Nash equilibria of packet forwarding strategies in wireless ad hoc networks,” Mobile Computing, IEEE Transactions on, vol. 5, no. 5, pp. 463–476, 2006.

[50] S. Eidenbenz, G. Resta, and P. Santi, “Commit: A sender-centric truthful and energy-efficient routing protocol for ad hoc networks with selfish nodes,” in Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, pp. 10–pp, IEEE, 2005.

[51] W. Wang, X.-Y. Li, and Y. Wang, “Truthful multicast routing in selfish wireless networks,” in Proceedings of the 10th annual international conference on Mobile computing and networking, pp. 245–259, ACM, 2004.

[52] K. Akkarajitsakul, E. Hossain, D. Niyato, and D. I. Kim, “Game theoretic approaches for multiple access in wireless networks: A survey,” Communications Surveys & Tutorials, IEEE, vol. 13, no. 3, pp. 372–395, 2011.

[53] K. R. Liu and B. Wang, Cognitive radio networking and security: A game-theoretic view. Cambridge University Press, 2010.

[54] V. Srivastava, J. O. Neel, A. B. MacKenzie, R. Menon, L. A. DaSilva, J. E. Hicks, J. H. Reed, and R. P. Gilles, “Using game theory to analyze wireless ad hoc networks,” IEEE Communications Surveys and Tutorials, vol. 7, no. 1-4, pp. 46–56, 2005.

[55] M. Naserian and K. Tepe, “Game theoretic approach in routing protocol for wireless ad hoc networks,” Ad Hoc Networks, vol. 7, no. 3, pp. 569–578, 2009.

[56] B. Wang, Y. Wu, and K. Liu, “Game theory for cognitive radio networks: An overview,” Computer Networks, vol. 54, no. 14, pp. 2537–2561, 2010.

[57] R. Mahajan, M. Rodrig, D. Wetherall, and J. Zahorjan, “Experiences applying game theory to system design,” in Proceedings of the ACM SIGCOMM workshop on Practice and theory of incentives in networked systems, pp. 183–190, ACM, 2004.

[58] D. Braess, A. Nagurney, and T. Wakolbinger, “On a paradox of traffic planning,” Transportation science, vol. 39, no. 4, pp. 446–450, 2005.

[59] L. Qiu, Y. R. Yang, Y. Zhang, and S. Shenker, “On selfish routing ininternet-like environments,”IEEE/ACM Transactions on Networking(TON), vol. 14, no. 4, pp. 725–738, 2006.

[60] E. Altman, T. Boulogne, R. El-Azouzi, T. Jiménez, and L.Wynter,“A survey on networking games in telecommunications,”Computers& Operations Research, vol. 33, no. 2, pp. 286–311, 2006.

[61] L. Rabiner and B. Juang, “An introduction to hidden Markov models,”ASSP Magazine, IEEE, vol. 3, no. 1, pp. 4–16, 1986.

[62] O. Cappé, E. Moulines, and T. Rydén,Inference in hidden Markovmodels. Springer, 2005.

[63] T. W. Rondeau, C. J. Rieser, T. M. Gallagher, and C. W. Bostian,“Online modeling of wireless channels with hidden Markov modelsand channel impulse responses for cognitive radios,” inMicrowaveSymposium Digest, 2004 IEEE MTT-S International, vol. 2, pp. 739–742, IEEE, 2004.

[64] D. P. Bertsekas and J. N. Tsitsiklis, “Neuro-dynamic programming: Anoverview,” in Decision and Control, 1995., Proceedings of the 34thIEEE Conference on, vol. 1, pp. 560–564, IEEE, 1995.

[65] D. P. Bertsekas, “Approximate dynamic programming,” 2011.

[66] C. Szepesvári, “Algorithms for reinforcement learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 4, no. 1, pp. 1–103, 2010.

[67] L. Busoniu, D. Ernst, B. De Schutter, and R. Babuska, “Approximate reinforcement learning: An overview,” in Adaptive Dynamic Programming And Reinforcement Learning (ADPRL), 2011 IEEE Symposium on, pp. 1–8, IEEE, 2011.

[68] P. Kumar, “A survey of some results in stochastic adaptive control,” SIAM Journal on Control and Optimization, vol. 23, no. 3, pp. 329–380, 1985.

[69] R. Howard, “Dynamic Programming and Markov Processes,” Cambridge, Mass.: MIT Press, 1960.

[70] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[71] G. Tesauro, “Programming backgammon using self-teaching neural nets,” Artificial Intelligence, vol. 134, no. 1, pp. 181–199, 2002.

[72] J. A. Boyan and M. L. Littman, “Packet routing in dynamically changing networks: A reinforcement learning approach,” Advances in neural information processing systems, pp. 671–671, 1994.

[73] H. A. Al-Rawi, M. A. Ng, and K.-L. A. Yau, “Application of reinforcement learning to routing in distributed wireless networks: a review,” Artificial Intelligence Review, pp. 1–36, 2013.

[74] P. Nicopolitidis, G. I. Papadimitriou, A. S. Pomportsis, P. Sarigiannidis, and M. S. Obaidat, “Adaptive wireless networks using learning automata,” Wireless Communications, IEEE, vol. 18, no. 2, pp. 75–81, 2011.

[75] J. Akbari Torkestani and M. R. Meybodi, “Mobility-based multicast routing algorithm for wireless mobile ad-hoc networks: A learning automata approach,” Computer Communications, vol. 33, no. 6, pp. 721–735, 2010.

[76] J. Akbari Torkestani and M. R. Meybodi, “An intelligent backbone formation algorithm for wireless ad hoc networks based on distributed learning automata,” Computer Networks, vol. 54, no. 5, pp. 826–843, 2010.

[77] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” arXiv preprint cs/9605103, 1996.

[78] J. Gittins, Multi-armed bandit allocation indices. John Wiley and Sons, 1989.

[79] M. Crepinšek, S.-H. Liu, and M. Mernik, “Exploration and exploitation in evolutionary algorithms: a survey,” ACM Computing Surveys (CSUR), vol. 45, no. 3, p. 35, 2013.

[80] L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,” Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 38, no. 2, pp. 156–172, 2008.

[81] K.-L. Yau, P. Komisarczuk, and P. D. Teal, “Applications of reinforcement learning to cognitive radio networks,” in Communications Workshops (ICC), 2010 IEEE International Conference on, pp. 1–6, IEEE, 2010.

[82] K.-L. A. Yau, P. Komisarczuk, and P. D. Teal, “Reinforcement learning for context awareness and intelligence in wireless networks: Review, new features and open issues,” Journal of Network and Computer Applications, vol. 35, no. 1, pp. 253–267, 2012.

[83] D. Fudenberg, The theory of learning in games, vol. 2. MIT press, 1998.

[84] L. R. Izquierdo, S. S. Izquierdo, and F. Vega-Redondo, “Learning and evolutionary game theory,” in Encyclopedia of the Sciences of Learning, pp. 1782–1788, Springer, 2012.

[85] I. Rezek, D. S. Leslie, S. Reece, S. J. Roberts, A. Rogers, R. K. Dash, and N. R. Jennings, “On similarities between inference in game theory and machine learning,” J. Artif. Intell. Res. (JAIR), vol. 33, pp. 259–283, 2008.

[86] J. Hu and M. P. Wellman, “Nash q-learning for general-sum stochastic games,” The Journal of Machine Learning Research, vol. 4, pp. 1039–1069, 2003.

[87] A. Blum and Y. Mansour, Learning, regret minimization, and equilibria. Cambridge University Press, 2007.

[88] N. Littlestone and M. K. Warmuth, “The weighted majority algorithm,” in Foundations of Computer Science, 1989., 30th Annual Symposium on, pp. 256–261, IEEE, 1989.

[89] Z. Han, C. Pandana, and K. R. Liu, “Distributive opportunistic spectrum access for cognitive radio using correlated equilibrium and no-regret learning,” in Wireless Communications and Networking Conference, 2007. WCNC 2007. IEEE, pp. 11–15, IEEE, 2007.

[90] B. Awerbuch and R. Kleinberg, “Online linear optimization and adaptive routing,” Journal of Computer and System Sciences, vol. 74, no. 1, pp. 97–114, 2008.

[91] I. C. Avramopoulos, J. Rexford, and R. E. Schapire, “From optimization to regret minimization and back again,” in SysML, 2008.

[92] A. Bhorkar and T. Javidi, “No regret routing for ad-hoc wireless networks,” in Signals, Systems and Computers (ASILOMAR), 2010 Conference Record of the Forty Fourth Asilomar Conference on, pp. 676–680, IEEE, 2010.

[93] K. Tsagkaris, A. Katidiotis, and P. Demestichas, “Neural network-based learning schemes for cognitive radio systems,” Computer Communications, vol. 31, no. 14, pp. 3394–3404, 2008.

[94] “A brief history of neural networks [online].” http://www.dtreg.com/mlfn.htm. Accessed: 2013-08-17.

[95] R. S. Sutton, A. G. Barto, and R. J. Williams, “Reinforcement learning is direct adaptive optimal control,” Control Systems, IEEE, vol. 12, no. 2, pp. 19–22, 1992.

[96] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in ICML, vol. 94, pp. 157–163, 1994.

[97] V. K. Tumuluru, P. Wang, and D. Niyato, “A neural network based spectrum prediction scheme for cognitive radio,” in Communications (ICC), 2010 IEEE International Conference on, pp. 1–5, IEEE, 2010.

[98] N. Baldo, B. R. Tamma, B. Manojt, R. Rao, and M. Zorzi, “A neural network based cognitive controller for dynamic channel selection,” in Communications, 2009. ICC’09. IEEE International Conference on, pp. 1–5, IEEE, 2009.

[99] S. Ju and J. B. Evans, “Scalable cognitive routing protocol for mobile ad-hoc networks,” in Global Telecommunications Conference (GLOBECOM 2010), 2010 IEEE, pp. 1–6, IEEE, 2010.

[100] J. Barbancho, C. León, J. Molina, and A. Barbancho, “SIR: A new wireless sensor network routing protocol based on artificial intelligence,” in Advanced Web and Network Technologies, and Applications, pp. 271–275, Springer, 2006.

[101] D. E. Goldberg and J. H. Holland, “Genetic algorithms and machine learning,” Machine learning, vol. 3, no. 2, pp. 95–99, 1988.

[102] I. A. Akbar and W. H. Tranter, “Dynamic spectrum allocation in cognitive radio using hidden Markov models: Poisson distributed case,” in SoutheastCon, 2007. Proceedings. IEEE, pp. 196–201, IEEE, 2007.

[103] C.-H. Park, S.-W. Kim, S.-M. Lim, and M.-S. Song, “HMM based channel status predictor for cognitive radio,” in Microwave Conference, 2007. APMC 2007. Asia-Pacific, pp. 1–4, IEEE, 2007.

[104] K. Choi and E. Hossain, “Estimation of Primary User Parameters in Cognitive Radio Systems via Hidden Markov Model,” IEEE Transactions on Signal Processing, 2013.

[105] C. W. Ahn and R. S. Ramakrishna, “A genetic algorithm for shortest path routing problem and the sizing of populations,” Evolutionary Computation, IEEE Transactions on, vol. 6, no. 6, pp. 566–579, 2002.

[106] A. H. Mahdi, J. Mohanan, M. A. Kalil, and A. Mitschele-Thiel, “Adaptive discrete particle swarm optimization for cognitive radios,” in Communications (ICC), 2012 IEEE International Conference on, pp. 6550–6554, IEEE, 2012.

[107] A. W. Mohemmed, N. C. Sahoo, and T. K. Geok, “Solving shortest path problem using particle swarm optimization,” Applied Soft Computing, vol. 8, no. 4, pp. 1643–1653, 2008.

[108] N. Zhao, S. Li, and Z. Wu, “Cognitive radio engine design based on ant colony optimization,” Wireless Personal Communications, vol. 65, no. 1, pp. 15–24, 2012.

[109] G. Di Caro, F. Ducatelle, and L. M. Gambardella, “Anthocnet: an adaptive nature-inspired algorithm for routing in mobile ad hoc networks,” European Transactions on Telecommunications, vol. 16, no. 5, pp. 443–455, 2005.

[110] X. Xing, T. Jing, Y. Huo, H. Li, and X. Cheng, “Channel quality prediction based on bayesian inference in cognitive radio networks,” in IEEE INFOCOM, 2013.

[111] S. Ahmed and S. S. Kanhere, “A Bayesian routing framework for delay tolerant networks,” in Wireless Communications and Networking Conference (WCNC), 2010 IEEE, pp. 1–6, IEEE, 2010.

[112] T. W. Rondeau, B. Le, C. J. Rieser, and C. W. Bostian, “Cognitive radios with genetic algorithms: Intelligent control of software defined radios,” in Software defined radio forum technical conference, pp. C3–C8, Citeseer, 2004.

[113] G. E. Box and G. C. Tiao, Bayesian inference in statistical analysis, vol. 40. John Wiley & Sons, 2011.

[114] E. Adamopoulou, K. Demestichas, P. Demestichas, and M. Theologou, “Enhancing cognitive radio systems with robust reasoning,” International Journal of Communication Systems, vol. 21, no. 3, pp. 311–330, 2008.

[115] T. W. Rondeau and C. W. Bostian, Artificial intelligence in wireless communications. Artech House, 2009.

[116] A. S. Cacciapuoti, M. Caleffi, and L. Paura, “Reactive routing for mobile cognitive radio ad hoc networks,” Ad Hoc Networks, vol. 10, no. 5, pp. 803–815, 2012.

[117] L. Ding, T. Melodia, S. Batalama, and M. J. Medley, “Rosa: distributed joint routing and dynamic spectrum allocation in cognitive radio ad hoc networks,” in Proceedings of the 12th ACM international conference on Modeling, analysis and simulation of wireless and mobile systems, pp. 13–20, ACM, 2009.

[118] I. Pefkianakis, S. H. Wong, and S. Lu, “Samer: spectrum aware mesh routing in cognitive radio networks,” in New Frontiers in Dynamic Spectrum Access Networks, 2008. DySPAN 2008. 3rd IEEE Symposium on, pp. 1–5, IEEE, 2008.

[119] A. Sampath, L. Yang, L. Cao, H. Zheng, and B. Y. Zhao, “High throughput spectrum-aware routing for cognitive radio networks,” Proc. of IEEE Crowncom, 2008.

[120] K. C. How, M. Ma, and Y. Qin, “Routing and QoS provisioning in cognitive radio networks,” Computer Networks, vol. 55, no. 1, pp. 330–342, 2011.

[121] K. R. Chowdhury and M. D. Felice, “Search: A routing protocol for mobile cognitive radio ad-hoc networks,” Computer Communications, vol. 32, no. 18, pp. 1983–1997, 2009.

[122] K. R. Chowdhury and I. F. Akyildiz, “Crp: A routing protocol for cognitive radio ad hoc networks,” Selected Areas in Communications, IEEE Journal on, vol. 29, no. 4, pp. 794–804, 2011.

[123] X. Huang, D. Lu, P. Li, and Y. Fang, “Coolest path: Spectrum mobility aware routing metrics in cognitive ad hoc networks,” in Distributed Computing Systems (ICDCS), 2011 31st International Conference on, pp. 182–191, IEEE, 2011.

[124] R. E. Tuggle, “Cognitive multipath routing for mission critical multi-hop wireless networks,” in System Theory (SSST), 2010 42nd Southeastern Symposium on, pp. 66–71, IEEE, 2010.

[125] A. Abbagnale and F. Cuomo, “Gymkhana: a connectivity-based routing scheme for cognitive radio ad hoc networks,” in INFOCOM IEEE Conference on Computer Communications Workshops, 2010, pp. 1–5, IEEE, 2010.

[126] G.-M. Zhu, I. F. Akyildiz, and G.-S. Kuo, “Stod-rp: A spectrum-tree based on-demand routing protocol for multi-hop cognitive radio networks,” in Global Telecommunications Conference, 2008. IEEE GLOBECOM 2008. IEEE, pp. 1–5, IEEE, 2008.

[127] I. Filippini, E. Ekici, and M. Cesana, “Minimum maintenance cost routing in cognitive radio networks,” in Mobile Adhoc and Sensor Systems, 2009. MASS’09. IEEE 6th International Conference on, pp. 284–293, IEEE, 2009.

[128] D. S. De Couto, D. Aguayo, J. Bicket, and R. Morris, “A high-throughput path metric for multi-hop wireless routing,” Wireless Networks, vol. 11, no. 4, pp. 419–434, 2005.

[129] A. P. Subramanian, M. M. Buddhikot, and S. Miller, “Interference aware routing in multi-radio wireless mesh networks,” in Wireless Mesh Networks, 2006. WiMesh 2006. 2nd IEEE Workshop on, pp. 55–63, IEEE, 2006.

[130] A. Raniwala and T.-c. Chiueh, “Architecture and algorithms for an IEEE 802.11-based multi-channel wireless mesh network,” in INFOCOM 2005. 24th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings IEEE, vol. 3, pp. 2223–2234, IEEE, 2005.

[131] M. Caleffi, I. F. Akyildiz, and L. Paura, “Opera: optimal routing metric for cognitive radio ad hoc networks,” Wireless Communications, IEEE Transactions on, vol. 11, no. 8, pp. 2884–2894, 2012.

[132] B. F. Lo, “A survey of common control channel design in cognitive radio networks,” Physical Communication, vol. 4, no. 1, pp. 26–39, 2011.

[133] M. Ma and D. H. Tsang, “Joint spectrum sharing and fair routing in cognitive radio networks,” in Consumer Communications and Networking Conference, 2008. CCNC 2008. 5th IEEE, pp. 978–982, IEEE, 2008.

[134] Y. Shi and Y. T. Hou, “A distributed optimization algorithm for multi-hop cognitive radio networks,” in INFOCOM 2008. The 27th Conference on Computer Communications. IEEE, pp. 1292–1300, IEEE, 2008.

[135] Z. Yang, G. Cheng, W. Liu, W. Yuan, and W. Cheng, “Local coordination based routing and spectrum assignment in multi-hop cognitive radio networks,” Mobile Networks and Applications, vol. 13, no. 1-2, pp. 67–81, 2008.

[136] S. Deng, J. Chen, H. He, and W. Tang, “Collaborative strategy for route and spectrum selection in cognitive radio networks,” in Future Generation Communication and Networking (FGCN 2007), vol. 2, pp. 168–172, IEEE, 2007.

[137] L. Sun, W. Zheng, N. Rawat, V. Sawant, and D. Koutsonikolas, “Performance comparison of routing protocols for cognitive radio networks,” in MASCOTS 2013. IEEE, IEEE, 2013.

[138] W. Kim, S. Y. Oh, M. Gerla, and J.-S. Park, “Cocast: multicast mobile ad hoc networks using cognitive radio,” in Military Communications Conference, 2009. MILCOM 2009. IEEE, pp. 1–7, IEEE, 2009.

[139] H. M. Almasaeid and A. E. Kamal, “Exploiting multichannel diversity for cooperative multicast in cognitive radio mesh networks,” IEEE/ACM Transactions on Networking, 2013.

[140] H. M. Almasaeid, T. H. Jawadwala, and A. E. Kamal, “On-demand multicast routing in cognitive radio mesh networks,” in Global Telecommunications Conference (GLOBECOM 2010), 2010 IEEE, pp. 1–5, IEEE, 2010.

[141] W. Ren, X. Xiao, and Q. Zhao, “Minimum-energy multicast tree in cognitive radio networks,” in Signals, Systems and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on, pp. 312–316, IEEE, 2009.

[142] Y. Song and J. Xie, “A distributed broadcast protocol in multi-hop cognitive radio ad hoc networks without a common control channel,” in INFOCOM, 2012 Proceedings IEEE, pp. 2273–2281, IEEE, 2012.

[143] A. K. Mir, A. Akram, E. Ahmed, J. Qadir, and A. Baig, “Unified channel assignment for unicast and broadcast traffic in cognitive radio networks,” in Local Computer Networks Workshops (LCN Workshops), 2012 IEEE 37th Conference on, pp. 799–806, IEEE, 2012.

[144] Z. Htike and C. S. Hong, “Broadcasting in multichannel cognitive radio ad hoc networks,” in Wireless Communications and Networking Conference (WCNC), 2013 IEEE, pp. 733–737, IEEE, 2013.

[145] M. Fahad, J. Qadir, and A. Baig, “Broadcasting in cognitive wireless mesh networks with dynamic channel conditions,” in Emerging Technologies (ICET), 2010 6th International Conference on, pp. 400–404, IEEE, 2010.

[146] M. Cesana, F. Cuomo, and E. Ekici, “Routing in cognitive radio networks: Challenges and solutions,” Ad Hoc Networks, vol. 9, no. 3, pp. 228–248, 2011.

[147] H. A. Al-Rawi and K.-L. A. Yau, “Routing in distributed cognitive radio networks: A survey,” Wireless Personal Communications, pp. 1–38, 2013.

[148] S. Abdelaziza and M. ElNainay, “Survey of routing protocols in cognitive radio networks,” 2012.

[149] M. Erman, A. Mohammed, and E. Rakus-Andersson, “Fuzzy logic applications in wireless communications,” in IFSA/EUSFLAT Conf., pp. 763–767, 2009.

[150] H. S. Shatila, Adaptive Radio Resource Management in Cognitive Radio Communications using Fuzzy Reasoning. PhD thesis, Virginia Polytechnic Institute and State University, 2012.

[151] I. A. Akbar, Statistical analysis of wireless systems using Markov models. PhD thesis, Virginia Polytechnic Institute and State University, 2007.

[152] S. Geirhofer, L. Tong, and B. M. Sadler, “Cognitive radios for dynamic spectrum access-dynamic spectrum access in the time domain: Modeling and exploiting white space,” Communications Magazine, IEEE, vol. 45, no. 5, pp. 66–72, 2007.

[153] L. Kleinrock, “Queueing systems. volume 1: Theory,” 1975.

[154] S. M. Ross, Applied probability models with optimization applications. Courier Dover Publications, 1970.

[155] M. López-Benítez and F. Casadevall, “An overview of spectrum occupancy models for cognitive radio networks,” in NETWORKING 2011 Workshops, pp. 32–41, Springer, 2011.

[156] M. Wellens and P. Mähönen, “Lessons learned from an extensive spectrum occupancy measurement campaign and a stochastic duty cycle model,” Mobile networks and applications, vol. 15, no. 3, pp. 461–474, 2010.

[157] C. Ghosh, S. Pagadarai, D. Agrawal, and A. Wyglinski, “A framework for statistical wireless spectrum occupancy modeling,” Wireless Communications, IEEE Transactions on, vol. 9, no. 1, pp. 38–44, 2010.

[158] S. Geirhofer, L. Tong, and B. M. Sadler, “Dynamic spectrum access in wlan channels: empirical model and its stochastic analysis,” in Proceedings of the first international workshop on Technology and policy for accessing spectrum, p. 14, ACM, 2006.

[159] W.-Y. Lee and I. F. Akyildiz, “Optimal spectrum sensing framework for cognitive radio networks,” Wireless Communications, IEEE Transactions on, vol. 7, no. 10, pp. 3845–3857, 2008.

[160] M. Wellens, J. Riihijarvi, and P. Mahonen, “Modelling primary system activity in dynamic spectrum access networks by aggregated on/off-processes,” in Sensor, Mesh and Ad Hoc Communications and Networks Workshops, 2009. SECON Workshops’09. 6th Annual IEEE Communications Society Conference on, pp. 1–6, IEEE, 2009.

[161] A. Feldmann and W. Whitt, “Fitting mixtures of exponentials to long-tail distributions to analyze network performance models,” Performance evaluation, vol. 31, no. 3, pp. 245–279, 1998.

[162] A. Thümmler, P. Buchholz, and M. Telek, “A novel approach for fitting probability distributions to real trace data with the em algorithm,” in DSN, vol. 5, pp. 712–721, 2005.

[163] K. Park and W. Willinger, Self-similar network traffic and performance evaluation. Wiley Online Library, 2000.

[164] J. Riihijarvi, J. Nasreddine, and P. Mahonen, “Impact of primary user activity patterns on spatial spectrum reuse opportunities,” in Wireless Conference (EW), 2010 European, pp. 962–968, IEEE, 2010.

[165] M. Wellens, J. Riihijärvi, and P. Mähönen, “Empirical time and frequency domain models of spectrum use,” Physical Communication, vol. 2, no. 1, pp. 10–32, 2009.

[166] T. Harrold, R. Cepeda, and M. Beach, “Long-term measurements of spectrum occupancy characteristics,” in New Frontiers in Dynamic Spectrum Access Networks (DySPAN), 2011 IEEE Symposium on, pp. 83–89, IEEE, 2011.

[167] C. Ghosh, S. Roy, M. B. Rao, and D. P. Agrawal, “Spectrum occupancy validation and modeling using real-time measurements,” in Proceedings of the 2010 ACM workshop on Cognitive radio networks, pp. 25–30, ACM, 2010.

[168] M. Hoyhtya, S. Pollin, and A. Mammela, “Classification-based predictive channel selection for cognitive radios,” in Communications (ICC), 2010 IEEE International Conference on, pp. 1–6, IEEE, 2010.

[169] M. Masonta, M. Mzyece, and N. Ntlatlapa, “Spectrum decision in cognitive radio networks: A survey,” Communications Surveys & Tutorials, IEEE.

[170] L. Doyle, Essentials of cognitive radio. Cambridge University Press, 2009.

[171] X. Li and S. A. Zekavat, “Traffic pattern prediction and performance investigation for cognitive radio systems,” in Wireless Communications and Networking Conference, 2008. WCNC 2008. IEEE, pp. 894–899, IEEE, 2008.

[172] B. Wang, Z. Ji, K. R. Liu, and T. C. Clancy, “Primary-prioritized Markov approach for dynamic spectrum allocation,” Wireless Communications, IEEE Transactions on, vol. 8, no. 4, pp. 1854–1865, 2009.

[173] I. Zakiuddin, T. Hawkins, and N. Moffat, “Towards a game theoretic understanding of ad-hoc routing,” Electronic Notes in Theoretical Computer Science, vol. 119, no. 1, pp. 67–92, 2005.

[174] M. Schwartz, “Some Thoughts on the Communications Field—The Past and the Present,” Proceedings of the IEEE, vol. 100, no. 12, pp. 3150–3151, 2012.

[175] A. Bantouna, V. Stavroulaki, Y. Kritikou, K. Tsagkaris, P. Demestichas, and K. Moessner, “An overview of learning mechanisms for cognitive systems,” EURASIP Journal on Wireless Communications and Networking, vol. 2012, no. 1, pp. 1–6, 2012.

[176] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, “Openflow: enabling innovation in campus networks,” ACM SIGCOMM Computer Communication Review, vol. 38, no. 2, pp. 69–74, 2008.

[177] F. H. Fitzek and M. D. Katz, Cooperation in wireless networks: principles and applications: real egoistic behavior is to cooperate! Springer, 2006.

[178] L. Giupponi, A. Galindo-Serrano, P. Blasco, and M. Dohler, “Docitive networks: an emerging paradigm for dynamic spectrum management [dynamic spectrum management],” Wireless Communications, IEEE, vol. 17, no. 4, pp. 47–54, 2010.

[179] M. M. Waldrop, Complexity: The emerging science at the edge of order and chaos, vol. 12. Simon & Schuster New York, 1992.

Junaid Qadir is an Assistant Professor in the Electrical Engineering department of the School of Electrical Engineering and Computer Sciences (SEECS), National University of Sciences and Technology (NUST), Pakistan. He is also the Lab Director of the Cognet Lab at SEECS. He completed his BS in Electrical Engineering from UET, Lahore, Pakistan and his PhD from University of New South Wales, Australia. His research interests include networking/algorithmic issues in cognitive radio networks, wireless networks, and software-defined networks.