Traffic Engineering in Software-defined Networks using Reinforcement Learning: A Review

With the exponential increase in connected devices and the accompanying complexity of network management, dynamic Traffic Engineering (TE) solutions for Software-Defined Networking (SDN) based on Reinforcement Learning (RL) techniques have emerged in recent times. The SDN architecture empowers network operators to monitor network traffic with agility, flexibility, robustness and centralized control. The separation of the control and forwarding planes in SDN has enabled the integration of RL agents into the networking architecture to enforce changes in traffic patterns during network congestion. This paper surveys major RL techniques adopted for efficient TE in SDN. We review the use of RL agents in modelling TE policies for SDNs, with agents' actions on the environment guided by future rewards and a new state. We further examine the single-agent (SARL) and multi-agent (MARL) algorithms the RL agents deploy in forming policies for the environment. Finally, the paper discusses agent design architectures in SDN and possible research gaps.

Keywords—Software defined networking; reinforcement learning; machine learning; traffic engineering


I. INTRODUCTION
The emergence of fifth generation (5G) networks has propelled the growth of the Internet of Things (IoT) in recent times. IoT is a rapidly evolving technology that connects billions of devices to the internet [1]. With 5G, the rapid deployment of new and smart IoT applications is expected to reach 22.3 billion devices by 2024 and generate about 163 zettabytes (ZB) of data by 2025 [2] [3]. These new and dynamic applications are expected to benefit from the services 5G networks will provide: ultra-reliable and low latency communication (URLLC) [4], enhanced mobile broadband (eMBB) [5], massive machine type communication and massive MIMO [6].
As shown in Fig. 1 and Fig. 2, the dynamic nature and requirements of IoT devices have necessitated a shift from the traditional networking architecture, which is difficult to configure and manage, to a more flexible, programmable domain [7]. Software-Defined Networking (SDN) is a new networking paradigm that separates the data plane from the control plane [8] [9]. This separation makes the network more agile, with centralized responsibility given to the controller [10]. The controller communicates with the application plane via the northbound APIs and with the forwarding devices via the southbound APIs (OpenFlow). The automation and programmability of the SDN architecture help to configure, secure, and optimize network resources [11] quickly while maintaining good Quality of Service (QoS) [12] and Quality of Experience (QoE) [13].
Traffic Engineering (TE) in SDN involves the analysis of the network's state by the SDN controller, which acts on flow data through rapid changes to the flow table information of forwarding devices [14]. Rerouting flows periodically to balance the load on the network minimizes congestion and improves overall network performance. A network experiences two kinds of traffic flows: elephant flows and mice flows [15]. Elephant flows are heavy traffic flows that require more network resources, while the rapid aggregation of mice flows can equally degrade the network. These traffic flows continuously need dynamic resource allocation, through TE, for the efficient utilization of scarce network resources.
With the advent of machine learning, port-based [16] and payload-based [17] flow classification techniques have become ineffective due to the dynamic port usage of IoT devices. The negative impact of out-of-order packets and packet loss in traditional TE techniques worsens the case for the network operator. Currently, machine learning algorithms are adopted for TE to perform intelligent flow re-routing with an efficient feature selection criterion [18] [19] in network flow analysis. Deploying these machine learning algorithms in the SDN controller can efficiently allocate network resources and formulate policies for optimal network performance with low overheads.
In this survey, as outlined in Fig. 3, we review popular Reinforcement Learning (RL) techniques used in the SDN architecture for Traffic Engineering, the limitations of the parameters chosen and approaches for future research. The rest of the paper is organized as follows: Section II discusses the justification of RL for TE; Section III analyses the integration of the TE architecture in SDN based on policies and performance. Finally, Section IV looks at the research gaps identified from the survey.

II. MACHINE LEARNING WITH REINFORCEMENT ALGORITHMS

With the advent of Machine Learning [101], where automated modelling using data remains relevant, traditional algorithms [102][103][104] used in solving SDN-IoT related tasks are infeasible. In supervised learning [103], agents are trained with a labeled dataset and later tasked to make predictions from the learned data. Increased complexity in a dynamic environment with new IoT devices and variance in data will negatively affect the accuracy of supervised learning algorithms and their predictions. Even worse is the time spent retraining and relabeling new data variants in an attempt to still adopt classification algorithms. With unsupervised learning [104], which uses unlabeled datasets, there is no guidance regarding the accuracy of the clustered data. Clustering algorithms alone are inefficient in an SDN-IoT environment that requires efficiency across diverse IoT applications. Reinforcement Learning (RL) defines the true automation of agents in an environment [20] [105], with rewards as guidance on how well the agent is performing. Though complex, RL agents adapt to changing conditions in the environment by learning to solve tasks through a trial-and-error approach. As the episodes progress, agents adapt to successful actions through exploration and exploitation [106][107] of the stochastic environment. With the recent success of DeepMind's AlphaGo RL agent [108], which defeated the Go champion in 2016, the application dimensions of RL have become enormous. The only way packets can be routed intelligently in a network with varying and emerging IoT devices is to deploy RL agents that learn varying network state patterns with no explicit data labels but with policies and actions.
III. TE USING REINFORCEMENT LEARNING IN SDN

Reinforcement Learning (RL) is an area of machine learning where an agent is modeled to take a sequence of actions informed by policies [20]. As shown in Fig. 4, the agent learns in an interactive environment and receives a reward through its actions [21,22]. Each set of actions presents a new state with a corresponding reward to the agent. Unlike supervised learning [23], where a set of correct actions is provided as feedback to the agent, RL uses rewards and punishments as signals for positive and negative decisions. The goal is to use trial-and-error methods to obtain positive rewards, or to build a suitable model that maximizes the cumulative reward for the RL agent. SDN provides centralized control with a unique advantage for implementing an intelligent TE framework using RL. Network policies can easily be generated from the centralized control, with corresponding TE rules pushed to forwarding devices. With RL, the modelling of the agent's action on the environment with rewards fits into the network architecture of SDN, and this expedites network control and management.

A. RL Agents Design
This section details the mathematical modelling of the state space with respect to actions and rewards. Agent design requires the environment to be monitored. Based on the monitored metrics, an agent takes actions informed by policy decisions, resulting in a new state and a corresponding reward that guide the next policy.
1) Action-State-Reward: RL agents are implemented in RL frameworks and modelled in SDN to learn critical network packet flow policies and provide routing solutions to forwarding devices. The agent takes an action on the environment and evaluates the action based on rewards. Using its policy π, the agent performs an action a, which alters the environment state s to s' [24]. Based on the reward r, the agent's policy is updated. In arriving at an optimal policy, RL agents use a Markov Decision Process (MDP) [25] to model actions on the environment with corresponding rewards. The MDP is an intuitive and fundamental formalism for decision-theoretic planning (DTP) [26] and RL in stochastic domains. MDPs have become the de facto standard formalism for learning sequential decision control problems [27].

Algorithm 1: Markov Decision Process (MDP)

An MDP is a 5-tuple $(S, A, P, R, \gamma)$, where:
$S$ is a set of states;
$A$ is a set of actions;
$P_a(s, s')$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$;
$R_a(s, s')$ is the immediate reward received after a transition from state $s$ to $s'$ due to action $a$;
$\gamma$ is the discount factor, which is used to generate a discounted reward.

For TE in SDN, RL agents are implemented differently based on the agent's policy and the metrics for measuring TE success. The actions of the agent on the environment are rated by the rewards associated with them as the episode progresses.
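To make the MDP formalism concrete, the sketch below encodes the 5-tuple directly in Python and samples a single transition. It is purely illustrative: the two states, two actions and the reward values are hypothetical placeholders, not values drawn from any of the surveyed frameworks.

```python
import random

# Minimal MDP encoded as the 5-tuple (S, A, P, R, gamma).
# States and actions are toy placeholders standing in for, e.g., link states and routing choices.
S = ["s0", "s1"]                      # set of states
A = ["a0", "a1"]                      # set of actions
P = {                                 # P[(s, a)] -> {s': probability}
    ("s0", "a0"): {"s0": 0.7, "s1": 0.3},
    ("s0", "a1"): {"s0": 0.2, "s1": 0.8},
    ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a1"): {"s0": 0.9, "s1": 0.1},
}
R = {                                 # R[(s, a, s')] -> immediate reward
    ("s0", "a0", "s0"): 0.0, ("s0", "a0", "s1"): 1.0,
    ("s0", "a1", "s0"): 0.0, ("s0", "a1", "s1"): 1.0,
    ("s1", "a0", "s0"): -1.0, ("s1", "a0", "s1"): 0.0,
    ("s1", "a1", "s0"): -1.0, ("s1", "a1", "s1"): 0.0,
}
gamma = 0.9                           # discount factor

def step(s, a):
    """Sample s' ~ P(.|s, a) and return (s', immediate reward)."""
    dist = P[(s, a)]
    s_next = random.choices(list(dist), weights=list(dist.values()))[0]
    return s_next, R[(s, a, s_next)]

s_next, r = step("s0", "a1")
print(s_next, r)
```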
CFR-RL agent [28]: The CFR-RL agent resides in the controller of the SDN architecture. The RL agent uses a traffic matrix that contains the traffic demand of each flow as the state. The objective is to avoid link congestion. As shown in equations 1 - 3, the CFR-RL agent samples K critical flows for a given state among N nodes. The CFR-RL agent then reroutes these critical flows and obtains a reward derived from the maximum link utilization U.
Q-DATA RL agent [29]: The Q-DATA agent resides in the application plane of SDN. As shown in equations 4 - 6, the Q-DATA RL agent defines the state of an SDN switch i as a tuple containing the current total number of flow entries in switch i and the number of flow entry changes between two consecutive observations, bounded by the maximum number of flow entries in switch i. For its action space, an action represents a change of traffic flow matching scheme related to a destination host, drawn from a list of all feasible match field combinations. The reward function considers the current total number of flow entries in switch i and an integer representing the number of enabled match fields in flow entry x. An action receives no reward if it leads to the total number of current flow entries in the SDN switch i reaching the limit.

In [30], the RL agent resides in the controller of the SDN. As shown in equations 7-9, the state space is defined by the frequency of matched flows and an indication of flow duration in the memory of the switch. The action space denotes an increase action on the flow frequency parameters. In the reward function, the current best network control overhead obtained is tracked: a configuration with less overhead returns a positive reward of 1 to the RL agent, otherwise a negative value of -1 is returned. If the current and best overheads are equal, a reward value of 0 is given.

Huang et al. [31] design the RL agent with the objective of maximizing the cumulative QoE of customers by dynamically allocating traffic in a multimedia environment. The RL agent resides in the controller of the SDN architecture. As shown in equations 10 - 12, the state of the environment refers to the state of flows and covers the following metrics: the allocated bandwidth, the delay, the jitter and the packet loss rate of flows. The action includes the path chosen (routing path) and the bandwidth adaptation of flows. The mean opinion score (MOS) [32] used to evaluate the QoE represents the reward function. A multi-layer deep neural network (DNN) is used to map the network and application metrics to the MOS.
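The ternary reward described for [30] reduces to a simple comparison against the best overhead observed so far. The following sketch paraphrases that logic; the function and variable names are ours, not taken from [30].

```python
def overhead_reward(current_overhead: float, best_overhead: float) -> int:
    """Reward used to steer flow-entry management toward lower control overhead.

    +1 if the new configuration has less overhead than the best seen so far,
    -1 if it has more, and 0 if the two are equal (as described for [30]).
    """
    if current_overhead < best_overhead:
        return 1
    if current_overhead > best_overhead:
        return -1
    return 0
```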

In [59], an RL framework is modelled to minimize the number of overflow occurrences. As shown in equations 13 - 15, the state space represents the size of the sampling period, with a unit size of 500 ms. This ranges up to 5,000 ms, for a total of 10 states. The action space has three options: (i) increase the sampling period by the unit size; (ii) decrease the sampling period by the unit size; (iii) maintain the sampling period. Based on the percentage of table hits, three rewards are given: a reward of 1 when the measured hit rate is higher than the hit rate before the action, a reward of -1 if it is lower, and a reward of 0 if there is no change in the hit rate.
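The state, action and reward described for [59] can be sketched as follows. This is an illustrative rendering under our own names (UNIT_MS, apply_action, hit_rate_reward); the actual hit-rate measurement in [59] comes from the switch flow tables and is not modelled here.

```python
UNIT_MS = 500                 # unit size of the sampling period
MIN_MS, MAX_MS = 500, 5000    # 10 possible states

def apply_action(period_ms: int, action: str) -> int:
    """Actions from [59]: increase, decrease or maintain the sampling period."""
    if action == "increase":
        return min(period_ms + UNIT_MS, MAX_MS)
    if action == "decrease":
        return max(period_ms - UNIT_MS, MIN_MS)
    return period_ms          # "maintain"

def hit_rate_reward(hit_rate_after: float, hit_rate_before: float) -> int:
    """+1 if the table hit rate improved, -1 if it dropped, 0 if unchanged."""
    if hit_rate_after > hit_rate_before:
        return 1
    if hit_rate_after < hit_rate_before:
        return -1
    return 0
```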
In [71], the flow table state and port state are responsible for collecting network statistics. The channels of the network represent the flow table utilization and the respective port rates of switches at the current and previous states. For the state space modelling, n, m and z respectively identify the number of switches, the number of observation moments and the number of ports of a single switch. The flow table utilization rate of switch i at a given moment ranges from 0 to 1, and the port rate of port k in switch i at that moment is also recorded. The action space comprises an assignment over all paths in the network, with a binary indicator per path: if the indicator for path k is 1, the current flow is assigned to path k, otherwise it is 0. For the reward function of elephant flows, the average packet loss rate of elephant flows in the network and the average throughput TP of elephant flows after processing are combined with corresponding weights. For the mice flows, the average packet loss rate of mice flows and the normalized average delay DL are likewise combined with their respective weights.
Zhang [86]: The state comprises four components: the name of the requested content, the source, the destination and the available link bandwidth. The action denotes the split ratio of the i-th destination node and relates to the content request sent to that destination node using the selected transmission links. The reward is meant to improve load balance and throughput. One term reveals the throughput impact in relation to the available normalized bandwidth, while an arctan-based term of the normalized variance of available bandwidth indicates the load balance. A value close to 1 signals a preferred action, whereas a value close to -1 is a penalty. A factor β = 1 is used to balance the throughput and the load balance terms.

B. RL Algorithms
In this section, we review the algorithms the RL agents use to formulate the policies that inform the actions taken by the agent on the environment as the episode progresses. For effective TE and policy enforcement on the environment, RL agents learn to take the best actions for traffic optimization with respect to cumulative future rewards. RL algorithms are divided into two main classes: model-free (direct) and model-based (indirect) methods [33,34,35].

1) Model-based RL methods:
Model-based RL algorithms utilize a model of the environment when the RL agent interacts with it. The model keeps track of the transition dynamics of the network to derive optimal actions and rewards [35]. By referencing the model, the RL agent can make predictions about the next state and reward before an action is taken. Model-based RL methods are data efficient but struggle to achieve asymptotic performance in real-world applications [36]. For model-based RL methods, the interaction between the RL agent and the environment is modeled as a discrete-time Markov Decision Process (MDP) $\mathcal{M}$ defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma, \rho_0, H)$ [36], where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the action space, $p(s_{t+1} \mid s_t, a_t)$ the transition distribution, $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ the reward function, $\rho_0: \mathcal{S} \rightarrow \mathbb{R}_+$ the initial state distribution, $\gamma$ the discount factor, and $H$ the horizon of the process. The return is defined as the sum of rewards $r(s_t, a_t)$ along a trajectory $\tau = (s_0, a_0, \ldots, s_{H-1}, a_{H-1}, s_H)$. The goal of reinforcement learning is to find a policy $\pi: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}_+$ that maximizes the expected return. The model-based method learns the transition distribution from the observed transitions using a parametric approximator $\hat{p}_\phi(s' \mid s, a)$; the parameters $\phi$ of the dynamics model are optimized to maximize the log-likelihood of the state transition distribution. Though model-based RL methods are data efficient, they have high computational complexity, and the potential error in maximizing a reward is compounded by model error.
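Written out, the two optimization problems implied above are, in a standard formulation consistent with the description in [36]:

$$\phi^{*} = \arg\max_{\phi} \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \log \hat{p}_{\phi}(s_{t+1} \mid s_t, a_t), \qquad \pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\tau}\Big[\sum_{t=0}^{H-1} r(s_t, a_t)\Big]$$

where $\mathcal{D}$ is the set of observed transitions: the dynamics model is fitted by maximum likelihood, and the policy is then optimized for expected return with the help of the learned model.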

2) Model-free RL methods:
Model-free RL algorithms do not utilize a model; the rewards and the optimal actions are instead derived through a trial-and-error interaction with the environment [37]. This class of algorithms operates over an unordered set of actions with positive or negative reward values. RL agents that utilize model-free algorithms increase the value associated with a positive action, which helps the agent learn from direct experience. Model-free agents are represented with policy optimization and Q-learning approaches [38]. With policy optimization, the agent directly learns the policy function that maps states to actions, without a value function. The Q-learning approach learns the action-value function Q(s, a), i.e., how good it is to take an action at a particular state: a scalar value is assigned to an action a given the state s [39]. Model-free RL methods have low computational complexity but are more data dependent. For TE, model-free RL methods are frequently used for RL agent sequencing and to implement policies on the environment.
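The action-value function referred to here has the standard textbook form

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \;\Big|\; s_0 = s,\, a_0 = a\Big],$$

i.e., the expected cumulative discounted reward obtained by taking action $a$ in state $s$ and following policy $\pi$ thereafter; this definition is generic and not specific to any surveyed paper.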

3) Single Agent Reinforcement Learning (SARL):
In SARL, a single agent interacts with the environment to maximize rewards. The SARL implementation is suitable for simple network management but has slower convergence and learning. SARL implementations use algorithms that are either value-based, policy-based or both [48]. As shown in Fig. 5, the SARL agent, through the SDN controller, collects information about the environment from the forwarding devices.
The agent, upon receiving the state information, performs a set of actions on the environment through the SDN controller. These actions are guided by policy algorithms. The episode results in a new state and rewards.

a) Q-learning Algorithm: Q-learning [40] is an off-policy, value-based algorithm that takes random actions based on the ε-greedy policy, where the probability of a random decision is determined by the value of epsilon ε. During the learning phase, the Q-learning agent initializes the Q-table for all state-action pairs and updates it using:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]$$

The Q-learning agent generates the optimal policy $\pi^*(s)$ for a state s, representing the action a that needs to be taken to maximize the value of the $Q^*(s, a)$ function: $\pi^*(s) = \arg\max_{a} Q^*(s, a)$.
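As an illustration of the mechanics just described, the sketch below implements tabular Q-learning with an ε-greedy behaviour policy over generic state and action identifiers. It is a minimal, framework-agnostic example; the state names and candidate paths are hypothetical and do not come from any of the surveyed works.

```python
import random
from collections import defaultdict

class QLearningAgent:
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)     # Q-table, keyed by (state, action)
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        # Explore with probability epsilon, otherwise exploit the best known action.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Off-policy update: bootstrap from the greedy value of the next state.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])

# Example usage with placeholder states/actions (e.g. candidate forwarding paths).
agent = QLearningAgent(actions=["path_1", "path_2", "path_3"])
a = agent.act("congested")
agent.update("congested", a, reward=1.0, next_state="balanced")
```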
Phan et al. [29] proposed the Q-learning algorithm to maximize traffic flow monitoring in SDN switches. The framework embeds a Support Vector Machine (SVM) [49] algorithm in the application plane of the SDN architecture to predict the performance degradation of the switches as the episode progresses. To reduce the long-term control plane overhead caused by the capacity limitation of Ternary Content Addressable Memory (TCAM) in OpenFlow switches, [30] proposed a Q-learning algorithm for SDN flow entry management. The framework determines the forwarding rules that remain in the flow table of the SDN switches and those processed by the controller in case of a table-miss on the switches. In [50], a Q-learning algorithm is proposed to reduce latencies and improve bandwidth utilization in the UbuntuNet Alliance National Research and Education Network (NRENs) SDN switches. The proposed framework adapts forwarding devices by learning from experience using multipath propagation. To deal with the bandwidth overhead caused by Dijkstra's shortest path first module [51] in an OpenDayLight (ODL) architecture meant for efficient packet delivery, [52] proposed a congestion prevention mechanism using Q-learning in SDN. In [52], threshold values are defined in the SDN controllers to enable threshold bandwidth detection. The optimal path chosen is delivered to the Open vSwitches (OVS) after Q-routing by the controller during network congestion. To balance the network load in SDN, [53] proposed a Q-learning approach to reduce the number of unsatisfied users in a 5G network architecture. The researchers used a flow admission control technique with a fairness function to enhance the per-flow resource allocation in the network. In [54], a load balancing architecture is proposed for SDN networks that uses a supervised Bayesian Network (BN) to solve the problem of Q-value local maxima [55] in a Q-learning RL algorithm. The combination of the BN with Q-learning helps the controller select the most optimal strategy for network load balancing during congestion. For TE load balancing optimization in master controllers, [56] proposed a dynamic switch migration algorithm to slave controllers using Q-learning in SDN. The switch migration problem (SMP) is modeled and used to redefine the Q-learning parameters. Q-learning is then used to learn the current status of the SDN and select the best switches for load migration. For an efficient path selection technique in load balancing, [57] proposed a Q-learning algorithm for path selection and flow forecasting [58]. It has an integrated centre that uses Deep Neural Networks (DNNs) to process uncertain network traffic and uses Q-learning to resolve the optimal path based on the results of the DNN. The inputs to the DNN are the bandwidth utilization ratio, the packet loss rate and the transmission latency; its output, derived from the corresponding link score, is fed into the Q-learning module. For timely eviction of inactive flow entries and to avoid overflows in the memory of SDN switches, [59] proposed a Q-learning-based User Datagram Protocol (UDP) [60] flow eviction strategy for UDP flows. Q-learning is used to dynamically resize the sampling period, the most critical parameter in the RL architecture. This maximizes the table hit rates of the UDP flows in the SDN.
b) State-Action-Reward-State-Action (SARSA): SARSA [61] is an on-policy algorithm that uses the action performed by the current policy to learn the Q-value. As shown in Eq. 23 [61] and Eq. 24 [40], the update rule for SARSA differs from that of the Q-learning algorithm in the execution of actions: SARSA updates its estimate with the action actually taken in the next state,

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big],$$

while Q-learning updates its estimate with the possible action that maximizes the post-state Q function, $\max_{a'} Q(s_{t+1}, a')$.
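In code, the difference between the two update rules is a single line. The snippet below (our own illustrative helper functions, operating on the same kind of Q-table as the earlier sketch) makes the contrast explicit:

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a_next actually chosen by the policy."""
    td_target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (td_target - q[(s, a)])

def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the greedy (maximizing) action in s_next."""
    td_target = r + gamma * max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] += alpha * (td_target - q[(s, a)])
```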
For dynamic load balancing across multiple controllers in the presence of switch migration conflicts, [62] proposed a SARSA-Bayesian RL algorithm for a multi-controller cluster design in SDN. With knowledge of the real-time load and the controllers' communication consumption, a request-response model using the Bayesian [63] algorithm is combined with the SARSA RL mini-framework in a switch migration technique towards the lighter controller. For a multi-layer hierarchical SDN to be effective in handling traffic, [64] proposed the SARSA algorithm for QoS provisioning. With each per-flow request, the switch contacts the SDN controller. The controller uses the SARSA algorithm to implicitly detect the QoS requirement of each flow and computes the corresponding optimum traffic path based on the needed QoS requirement. The next hop in the switch forms the basis for the next action from the source to the destination switch. To convey massive IoT data efficiently through a limited bandwidth, [65] proposed a SARSA algorithm for resource allocation through cognitive communications in an SDN-enabled environment. The SARSA agent communication is modelled with a buffer metric that manages the aggregator's output queue transmissions and dynamically reflects the IoT data demands. This modification, targeted at Publish/Subscribe (Pub/Sub) paradigms, preserves the Pub/Sub bandwidth with less computational resources. In order to adapt VS-routing [67] optimization to SDN networks, [66] proposed a network hop count technique to improve the reward function of the SARSA algorithm. The VS-routing approach introduces a hop-count-based function that is calculated to select the optimal route and avoid the long packet queues of network links in the SDN architecture.

c) Deep Q-Network Algorithm (DQN):
With the advent of Artificial Neural Networks (ANNs), a class of RL agents has emerged that combines Q-learning with Deep Neural Networks (DNNs) [41] in discrete domains for TE. DQN uses feedforward neural networks with three components: (i) neurons that are interconnected using direct links to form a network, (ii) weights associated with each connection, and (iii) layers consisting of a number of neurons, including multiple hidden layers. The training loop proceeds as follows: for each learning iteration, an action and its parameter are selected from the Q-network using the ε-greedy policy, with ε decayed linearly over the learning iterations; the action is taken on the environment, and the reward r and control overhead are observed together with the new state; the experienced transition (s, a, r, s') is stored in the experience memory; a random batch of transitions is then sampled from the memory to update the target values and train the Q-network with a squared temporal-difference loss; this repeats until the measured improvement exceeds a set threshold. Unlike Q-learning, the DQN has an experience memory for storing experienced transitions (s, a, r, s'). The discount factor and the state of the Q-network in the i-th iteration are used to update the experienced transitions, with training driven by a loss function. After training, the ε-greedy policy helps select the action with the highest associated Q-value. To let the RL agent choose random actions, the value of ε is set to 1 at the start of the learning process and decreased over time until a fixed exploration rate is reached. The DQN keeps track of the chosen parameter corresponding to the Q-value of each action up to a terminal condition.
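The sketch below ties these pieces together: an experience replay memory, ε decay and a periodically synchronized target network. To keep it short and self-contained, a linear approximator stands in for the deep Q-network; all names and hyperparameters are illustrative assumptions, not those of any surveyed framework.

```python
import random
from collections import deque
import numpy as np

class DQNAgent:
    """Minimal DQN-style agent: replay memory, epsilon decay and a target network.

    A linear approximator Q(s, a) = W[a] . s stands in for the deep network to keep
    the sketch short; a real implementation would use a multi-layer network.
    """

    def __init__(self, state_dim, n_actions, gamma=0.9, lr=0.01,
                 eps_start=1.0, eps_end=0.05, memory_size=10000):
        self.w = np.zeros((n_actions, state_dim))          # online Q weights
        self.w_target = self.w.copy()                       # target network weights
        self.memory = deque(maxlen=memory_size)             # experience replay buffer
        self.gamma, self.lr = gamma, lr
        self.eps, self.eps_end = eps_start, eps_end
        self.n_actions = n_actions

    def act(self, state):
        # Epsilon-greedy action selection; epsilon is decayed in train_step().
        if random.random() < self.eps:
            return random.randrange(self.n_actions)
        return int(np.argmax(self.w @ state))

    def remember(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def train_step(self, batch_size=32, eps_decay=0.995):
        if len(self.memory) < batch_size:
            return
        for s, a, r, s_next, done in random.sample(self.memory, batch_size):
            target = r if done else r + self.gamma * np.max(self.w_target @ s_next)
            td_error = target - self.w[a] @ s
            self.w[a] += self.lr * td_error * s              # gradient step on squared TD loss
        self.eps = max(self.eps_end, self.eps * eps_decay)   # anneal exploration

    def sync_target(self):
        self.w_target = self.w.copy()                        # periodic target update
```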
In [28], the DQN is used to learn a policy for selecting critical flows based on a given traffic matrix. The Critical Flow Rerouting-Reinforcement Learning (CFR-RL) agent then reroutes the selected flows for balanced link utilization using Linear Programming (LP). For efficient SDN flow entry management with TCAM-enabled OpenFlow switches, [30] proposed a DQN algorithm to manage the flow entries and reduce the long-term control plane overhead between the SDN switch and the controller. The DQN agent automatically finds the values of the decision parameters that effectively select the candidate rules in the switch's flow table for a higher table-hit rate. For flexible network management through TE, [68] proposed a DQN-based dynamic controller placement driven by flow fluctuations in SDN. The D4CPP agent in [68] integrates historical network data into the controller deployment. Real-time switch-controller mapping decisions are then triggered, with inherent adaptation to the dynamic flow fluctuations in the network. For effective TE among distributed controllers in SDN, [69] proposed a DQN-based switch and controller selection scheme for switch migration and switch-aware reinforcement-learning-based load balancing (SAR-LB). SAR-LB adopts the utilization ratio of diverse resource types in both controllers and switches as inputs to the neural network for a dynamic load distribution among the controllers in the network. Yao et al. [70] proposed a DQN-based energy-efficient routing solution for fully loaded software-defined data centers. The DQN is optimized to find energy-efficient routing paths and to load-balance between controllers, reducing energy consumption in the network. The enhanced DQN-based energy-efficient routing (DQN-EER) algorithm learns directly from experience. At the same coordinated time, it selects the arriving flows and the energy-saving control path in the in-band control mode while detecting the energy-saving routes for the data center. Fu et al. [71] proposed the detection of mice and elephant flows in an SDN-enabled data center using two DQNs. The DQNs are built and trained to generate efficient routing strategies using convolutional neural networks (CNNs) [72][73] to avoid possible network congestion. For efficient latency management in SDN, [74] proposed a DQN agent that inherently predicts optimal traffic paths and future traffic demands through the SDN switches. While formulating the flow rules placement policy as an Integer Linear Program (ILP), [74] used a traffic prediction module with a long short-term memory (LSTM) [75][76] neural network algorithm. To further minimize network delay, the proposed DQN-TP (traffic prediction)-based heuristic defect-tolerant routing (DTR) [77] algorithm interacts dynamically with the DQN agent module in the controller of the SDN architecture.

d) Deep Deterministic Policy Gradient (DDPG):
In combining policy gradients and Q-learning, Deep Deterministic Policy Gradient (DDPG) [42][79] is an off-policy, actor-critic technique consisting of two models, the actor and the critic, as shown in Fig. 6. The actor is the policy network, and the critic provides the Q-value used for training the actor network. In each step of the training loop, the agent observes the state $s$ and selects the action $a = \mathrm{clip}(\mu(s) + \epsilon, a_{low}, a_{high})$ with exploration noise $\epsilon \sim \mathcal{N}$, executes $a$ in the environment, observes the next state $s'$, the reward $r$ and a done signal $d$ indicating whether $s'$ is terminal, and stores $(s, a, r, s', d)$ in a replay buffer $\mathcal{D}$ (resetting the environment state when $s'$ is terminal). When it is time to update, random minibatches are sampled from $\mathcal{D}$, the critic is updated, and the policy is updated by one step of gradient ascent.
DDPG uses DQN's replay buffer to gather offline, uncorrelated experiences obtained by the agent while performing actions on the environment. At each time step, the actor and the critic are updated by uniformly sampling a minibatch from the replay buffer. DDPG uses soft target updates rather than directly copying the weights to the target network. DDPG further utilizes batch normalization, which normalizes each dimension across the samples in a mini-batch to have unit mean and variance. The DDPG algorithm is suitable for continuous action spaces and state representations. In [31], the DDPG algorithm is used for multimedia traffic control with the objective of maximizing the cumulative Quality of Experience (QoE) for network users. The DDPG agent enforces bandwidth adaptation and path selection actions for all multimedia flows in the SDN-enabled environment. To maximize the QoE for users, a multi-layer deep neural network is used to map the network and application metrics to the mean opinion score (MOS) [78] obtained from users. Stampa et al. [80] proposed a DDPG agent for dynamic routing in SDN. The architecture embeds an integrated fully-connected feed-forward neural network (FFNN) [81] in the framework to redefine the feature extraction of the actor-critic network. To improve the learning rate of DDPG for effective routing optimization, [82] proposed dynamic planning of the experience pool capacity with respect to the current iteration number. This accelerates the growth rate of the previous pool by reducing its capacity, which affects subsequent learning rates. In [83], a deep-reinforcement-learning-based quality-of-service (QoS)-aware secure routing protocol (DQSP) is proposed using the DDPG algorithm. DQSP adds an intelligent layer above the control layer which generates the routing policy and evaluates the network performance through the rewards obtained by the DDPG policy. The DQSP protocol guards against gray hole attacks [84] and DDoS [85] while ensuring efficient route planning through the environment-aware module of the control layer. Zhang et al. [86] proposed a DDPG-based intelligent content-aware TE (iTE) which leverages information-centric networking (ICN) [87] to optimize traffic distribution in SDN. The DDPG agent, together with other TE algorithms, is embedded in a parallel decision-making (PDM) module in the controller. This module receives the cache information and the link bandwidth from the switches to activate and update its neural networks with reward feedback. In [88], a DDPG-based network scheduler for deadline-specific SDN heterogeneous networks is proposed. The DDPG agent receives deadline-aware data transfers from the SDN switches and schedules the flows by initializing a pacing rate at the source of the deadline flows. The actor-critic model in the DDPG agent handles a larger and more generalized scheduling problem that maximizes and assigns the aggregated utility value to each flow if the deadline is met. For intelligent routing in software-defined data centers (SD-DCN), [89] proposed a deep reinforcement learning based routing (DRL-R) scheme consisting of a DDPG-DQN agent to perform reasonable routing adapted to the network state. The DRL-R agent efficiently allocates cache and bandwidth in the network to improve routing performance by reducing delay.
This is done through the quantification of the overall contribution score in the network and a change in the routing metric from a single link state to the resource-combined state.
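Two of the mechanisms highlighted above, clipped exploration noise and soft target updates, reduce to a few lines each. The sketch below is a generic illustration with placeholder names (select_action, soft_update, a toy linear actor), not code from any of the surveyed DDPG systems.

```python
import numpy as np

def select_action(actor, state, noise_std, a_low, a_high):
    """DDPG-style exploration: deterministic policy output plus clipped Gaussian noise."""
    noise = np.random.normal(0.0, noise_std, size=np.shape(a_low))
    return np.clip(actor(state) + noise, a_low, a_high)

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging of target parameters: theta_targ <- tau*theta + (1-tau)*theta_targ."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online_params, target_params)]

# Example usage with a toy linear "actor" over a 3-dimensional state.
actor = lambda s: np.tanh(np.ones((2, 3)) @ s)          # stand-in for the policy network
a = select_action(actor, np.array([0.2, 0.5, 0.1]), noise_std=0.1,
                  a_low=np.array([-1.0, -1.0]), a_high=np.array([1.0, 1.0]))
```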

4) Multi-Agent Reinforcement Learning (MARL):
In MARL systems, multiple agents collectively learn and collaborate in a deterministic or a stochastic environment [90,91,92]. Multi-agent systems are seen in domain applications including network resource management, computer games, distributed networking, cloud computing and intrusion detection systems. Experience sharing and faster convergence have necessitated a shift in research direction from SARL to MARL in recent times. With a coordinated policy, multiple agents learn and optimize towards an accumulated global reward [93,94] in the network framework. As a result, the dynamics of state transitions in MARL depend on the joint action of all active agents, as shown in Fig. 7. Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [95,96,97] is an actor-critic multi-agent extension of DDPG in which the critic network is augmented with information from other agents during training, while execution remains decentralized.
In the MADDPG actor-critic architecture, each agent has its own actor and critic network. The critic network of each agent has full visibility of the actions and observations of the other agents.
The actor network, on the other hand, only executes the action for its local agent given the state. In Fig. 8, the actor takes an observation as state to produce an action, while the critic network takes the observation and the action of the actor to train the actor. The critic's view incorporates information from the other agents while it trains the actor network. The MADDPG training loop can be summarized as follows: for each episode, a random process is initialized for action exploration and the initial state is received; at each step of the episode, each agent selects an action using its current policy and exploration noise; the joint actions are executed, the reward and the new state are observed, and the transition is stored in a replay buffer; then, for each agent, a random minibatch of samples is drawn from the buffer to update its critic and actor. In [98], a MADDPG-based traffic control and multi-channel reassignment (TCCA-MADDPG) algorithm is proposed for the core backbone network in SDN-IoT. The TCCA-MADDPG algorithm reduces the channel interference between links by considering the policies of other neighbouring agents using a cooperative multi-agent strategy. To maximize network throughput and minimize the packet loss rate and time delay, TCCA-MADDPG uses a joint traffic control mechanism modelled with a partially observable Markov decision process (POMDP) to optimize traffic performance. Yuan et al. [99] proposed dynamic controller assignment using MADDPG for effective TE in Software-Defined Internet of Vehicles (SD-IoV) [100]. For controllers to make local decisions in coordination with neighboring controllers, a real-time distributed cooperative assignment approach is used via the actor-critic model of MADDPG. To obtain faster MARL global convergence while minimizing delay, a centralized training approach using global information to attain optimal local assignments is adopted in the model development.
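Structurally, MADDPG's centralized-critic/decentralized-actor split can be seen from the input shapes alone. The sketch below uses linear maps as stand-ins for the actor and critic networks and hypothetical dimensions; it illustrates the interface, not any surveyed implementation.

```python
import numpy as np

class MADDPGAgent:
    """Structural sketch of one MADDPG agent: a local actor and a centralized critic.

    Linear maps stand in for the actor/critic networks; the point is the input shapes:
    the actor sees only its own observation, the critic sees all observations and actions.
    """

    def __init__(self, obs_dim, act_dim, n_agents):
        self.actor_w = np.zeros((act_dim, obs_dim))               # pi_i(o_i) -> a_i
        self.critic_w = np.zeros(n_agents * (obs_dim + act_dim))  # Q_i(o_1..o_N, a_1..a_N)

    def act(self, local_obs):
        # Decentralized execution: the action depends only on this agent's observation.
        return np.tanh(self.actor_w @ local_obs)

    def critic_value(self, all_obs, all_actions):
        # Centralized training: the critic is conditioned on every agent's obs and action.
        joint = np.concatenate(list(all_obs) + list(all_actions))
        return float(self.critic_w @ joint)

# Example with 3 agents, 4-dimensional observations and 2-dimensional actions.
agents = [MADDPGAgent(obs_dim=4, act_dim=2, n_agents=3) for _ in range(3)]
observations = [np.random.rand(4) for _ in range(3)]
actions = [agent.act(obs) for agent, obs in zip(agents, observations)]
q0 = agents[0].critic_value(observations, actions)
```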

C. TE Architecture in SDN
In this section, we look at the design placement of the RL agents in the SDN architecture and the communication principles adopted with the controller. The architecture of RL systems varies based on the RL agent's policy algorithms, the actions selected and the environment. The agent frameworks are designed to enhance positive rewards and proactively prevent network performance degradation through the forwarding devices. The different components of the RL agent designs in SDN are situated in the application plane, the control plane and the data plane.

1) RL agent in control plane: Several TE frameworks, including [89], situate the RL agent in the control plane of the SDN architecture. In [28], the CFR-RL agent resides in the controller and uses a neural network trained with a reinforcement algorithm [43] to map a traffic matrix to a combination of critical flows. After training, CFR-RL applies the critical flow selection policy to each real-time traffic matrix provided by the controller. The SDN controller then reroutes the selected critical flows by installing and updating flow entries of the switches, while the remaining flows continue on their normal routes using the Equal-Cost Multi-Path (ECMP) [44] TE technique by default. In [30], the RL agent is deployed in the controller and utilizes the flow match frequency and the flow duration to determine the flow entries that should be kept on the switch. To maximize the long-term reward, the RL agent lowers the configuration overhead and the number of table-miss events. To achieve the expected reward, the RL agent splits the pool of flow entries into two parts: the local switch entries and the remote controller entries. This reduces the control plane overhead given the Ternary Content-Addressable Memory (TCAM) [45] size of the SDN switches. In [31], the RL agent is the controller itself and serves as the centralized control to collect statistics, make decisions and take actions. The state reflects the situation in the environment and covers the following metrics: the allocated bandwidth, delay, jitter and packet loss rate of flows. The action involves the path chosen and the bandwidth adaptation for multimedia flows. The reward is the QoE received from the environment; to evaluate the QoE, a multi-layer deep neural network is used to map the network and application metrics to the MOS [46]. In [52], the controller is also the RL agent and is programmed with the Q-learning algorithm to detect network congestion and find the optimal path to be delivered to the Open vSwitch (OVS). In [57], the control layer has an intelligent center connected to the SDN controller. For efficient load balancing, the intelligent center uses the Q-learning algorithm to find optimal paths and returns aggregated path routing decisions to the controller. The DQN-EER architecture [70] has the RL agent programmed in the SDN controller using the DQN algorithm. The DQN is modified with deep convolutional neural networks (CNNs), empirical replay to train the agent and independent target networks to train the primary critic network. In [86], the intelligent content-aware traffic engineering (iTE) RL agent is deployed in the controller of the SDN architecture. It receives cache information from the ICN-enabled switches and uses a parallel execution module embedded with multiple DRL-based TE algorithms to determine the best routing paths for the flows in the network.

2) RL agent in Application Plane: For easier system failure checks in SDN, the TE frameworks in [29] [71] [83] situate the RL agent in the application plane. The Q-DATA [29] framework architecture has a built-in forwarding application located in the control plane and a Q-DATA application residing in the SDN application plane. Initially, the built-in forwarding application module is instructed by the Q-DATA application, through a REST API, to apply the Full Matching Scheme (FMS) strategy at the switches. The Q-DATA application has a statistics collector module which periodically collects raw information about traffic flows at the SDN switches from the SDN controller. The statistics are then forwarded to a statistics extractor and distributor module for extraction and distribution to other modules. The SVM-based performance degradation prediction module anticipates the performance degradation of the SDN switches before it occurs and provides the prediction results to the Q-learning-based traffic flow matching policy creation module and the MAC matching only scheme control module. The MAC matching only scheme control module monitors and checks the conditions for a traffic flow matching scheme change to FMS in the SDN switches. In [71], the AI plane is used as the application plane in the SDN architecture. The RL agent is embedded in the AI plane and uses the DQN to learn the best optimal routing paths for the mice and elephant flows by obtaining the flow type, network state information and network performance evaluation from the control plane of the SDN architecture. In [83], the DQSP architecture has an agent layer that is embedded in the application layer of the SDN architecture. The DQSP agent, through the controller, is aware of the underlying network environment and generates routing policies for the controller to execute. It receives the reward evaluation and adjusts the policy parameters until an optimal routing strategy is achieved. The exploration policy, however, should have given more weight to exploration to balance the dynamics of the actions taken.
TABLE I. REVIEW SUMMARY

[50], Q-learning: The authors improved bandwidth utilization and reduced flow latencies (NRENs case study network). MDP used: No. Remark: Since MDP was not used to mathematically define the network parameters, the metrics for measuring success are not well defined. Agent placement: Not stated.

[52], Q-learning: The authors addressed network congestion in SDN by reselecting flow paths and changing the flow table using a predefined threshold. MDP used: No. Remark: Since MDP was not used to mathematically define the network parameters, the metrics for measuring success are not well defined. Agent placement: Controller.

[80], DDPG: The authors adopted the DDPG agent for dynamic routing, with feature extraction performed by an FFNN in the actor-critic network of the agent. MDP used: No. Remark: Since MDP was not used to mathematically define the network parameters, the metrics for measuring success are not well defined. Agent placement: Not stated.

[82], DDPG: The authors proposed the DDPG-EREP algorithm with dynamic planning of the experience pool capacity using the current iteration number of the sampling size. MDP used: No. Remark: Since MDP was not used to mathematically define the network parameters, the metrics for measuring success are not well defined. Agent placement: Not stated.

[83], DDPG: The authors proposed DQSP using the DDPG algorithm, with an intelligent layer added above the control layer for routing policy optimization in SDN. MDP used: No. Remark: Since MDP was not used to mathematically define the network parameters, the metrics for measuring success are not well defined. Agent placement: Application.

[86], DDPG: The authors proposed iTE, which leverages ICN to optimize traffic distribution in SDN through the PDM module in the controller. MDP used: Yes. Remark: The action space definition should have included the flow path selection procedure in addition to the split ratio for the i-th destination node. Agent placement: Controller.

[88], DDPG: The authors used a DDPG agent that receives deadline-aware data transfers from SDN switches and schedules subsequent flows by initiating a pacing rate at the source of the flows. MDP used: Yes. Remark: This research can be extended to multipath routing using the AOMDV protocol. Agent placement: Not stated.

[89], DQN and DDPG: The authors proposed DRL-R based on a DDPG-DQN agent to allocate cache and bandwidth in the SDN to improve routing performance. MDP used: No. Remark: Since MDP was not used to mathematically define the network parameters, the metrics for measuring success are not well defined. Agent placement: Controller.

[98], MADDPG: The authors proposed the TCCA-MADDPG algorithm to reduce the channel interference between links by considering the policies of neighbouring agents using a multi-agent strategy. MDP used: Yes. Remark: TCCA-MADDPG should have been compared with DDPG, and not DQN, since both TCCA-MADDPG and DDPG work in continuous environments. Agent placement: Not stated.

[99], MADDPG: The authors proposed MADDPG for effective traffic load engineering in SD-IoV using a real-time distributed cooperative assignment approach via the actor-critic network. MDP used: No. Remark: Since MDP was not used to mathematically define the network parameters, the metrics for measuring success are not well defined.

IV. OPEN RESEARCH ISSUES
In this section, we look at the research gaps identified from the review. From the review summary shown in Table I, it is evident that SDN-based TE solutions using RL agents have the potential to completely eliminate network degradation and provide a network recommender system for end users. From this review, several future research issues emerge.

A. RL Agent Implementation
From the review, RL agents are designed and situated at the control or application plane of the SDN architecture. For more efficient and proactive TE solutions, new SDN design architectures could situate the RL agent as mini embedded applications adapted to dedicated forwarding devices, with oversight from the SDN controller. With performance comparison based on end-to-end delay and response time [47], data-plane-based RL agents would enable faster network congestion detection and prevention, since the agents are closer to the forwarding devices.

B. RL Agent Algorithm
For TE, most RL agents use model-free algorithms for policy enforcement and rewards. Though model-based algorithms have high computational complexity, a hybrid architecture that enables the RL agent to select either class of algorithm based on reward has research value. Using trial-and-error while also referencing a model would give more intelligence to the RL agent. The agent would have the capacity to decide which algorithm to activate based on network complexity and the priority of applications.

C. Multi-Agent Reinforcement Learning
For faster convergence and collaborative learning, MARL solutions for TE, though complex, are the future for solving network routing and load-balancing problems in the SDN architecture. The number of connected devices will only increase with time. From the review, MARL-based TE solutions in SDN have seen limited research [98][99]. When designed efficiently, MARL can segment the network into smaller units with multi-agent capabilities.

V. CONCLUSION
Software-Defined Networking (SDN) has emerged to give more control in network management by separating the control layer from the forwarding devices. This separation has given a centralized, programmable supervisory role to the controller and enables flexible management of network flows in forwarding devices. In regulating the behaviour of data transmitted over the network, we discussed the relevance of Reinforcement Learning in SDN for Traffic Engineering. This paper reviewed major works using RL techniques for network traffic management and the actions of agents on the environment in exchange for rewards and new states. The review further detailed the mathematical modelling of agents and environments using the Markov Decision Process (MDP). We illustrated SARL and MARL agents with diagrams and detailed their importance with regard to TE.
With Reinforcement Learning, agents are modelled in a control loop to take a sequence of actions on the environment and receive future rewards and a new state. The agent must exploit and explore the stochastic environment through determined actions that lead to faster convergence. From the review, the paper offers future research options for optimal Traffic Engineering solutions in SDN.