Sparse Distributed Memory Approach for Reinforcement Learning Driven Efficient Routing in Mobile Wireless Network System

In recent years, researchers have explored the applicability of Q-learning, a model-free reinforcement learning technique, toward designing QoS-aware, resource-efficient, and reliable routing techniques in a dynamically changing network environment. However, Q-learning relies on a tabular representation to characterize learned policies, which frequently encounters the dimension-disaster problem when introduced to an uncertain and dynamically changing network environment. In addition, the time required for agent learning in the training phase is too long, which makes it difficult for the agent to generalize over the observation state efficiently. To this end, this paper attempts to overcome the memory-overhead problems encountered in Q-learning-based routing techniques. The study presents a novel memory-efficient intelligent routing mechanism based on adaptive Kanerva coding, which minimizes the storage cost required for storing large action- and state-value tables. Unlike existing schemes, the proposed method optimizes memory requirements and enables better generalization by storing the learnable parameters of the agent's function approximator in a Kanerva-coding data structure. Kanerva coding is a sparse memory with a distributed reading and writing mechanism that enables optimal compression and state abstraction, allowing learning with fewer parameterized components and making it highly memory efficient. The design and implementation of the proposed technique are carried out on the Anaconda platform. Simulation results demonstrate that the proposed technique can adaptively adjust the routing policy according to the varying network environment to meet the transmission requirements of different services with low memory requirements.

Keywords—Mobile wireless network; reinforcement learning; Q-learning; Kanerva coding; routing; memory optimization


A. Background
A mobile wireless network can be regarded as a transient system that is inherently dynamic and decentralized, formed by randomly deploying several wireless, mobile communicating sensor nodes to distribute sensory information to the end node [1]. The ad-hoc feature of this transient system ensures fast and cost-effective network deployment. The sensor nodes operate as routers by receiving and forwarding the traffic of their nearby sensor nodes [2]. The salient features of mobile wireless networks are multi-hop communication, dynamic topology, and bandwidth and resource constraints. Interruption due to uncertain and dynamic topology changes affects the efficiency of node resources and compromises the transmission of data packets from the source to the end node. In this regard, efficient routing in wireless networks has been extensively studied in the literature [3][4][5]. Various routing mechanisms have been introduced, mainly divided into reactive, proactive, and location-based routing protocols. A proactive routing scheme is a table-driven approach where information regarding the entire network topology is maintained at each sensor node; however, updating the table introduces a huge overhead due to the large control traffic in a dynamic network. In a reactive routing mechanism, route discovery executes on demand [6]; however, it requires collecting neighborhood information, which is a costly procedure, and in many instances it may not be able to determine an end-to-end path. In location-based routing, the selection of next-hop nodes is carried out based on predefined parameters, which is not suitable for dynamic networks because of its restricted adaptability. Although routing protocols of these types are advantageous in many specific situations, they have several limitations when introduced to dynamic networking scenarios [7][8].
Recently, machine learning (ML) has been widely employed to solve network problems. Incorporating the potential of machine learning technology into routing mechanisms helps to optimize network resources. In general, there are three particular types of ML techniques, viz. supervised, unsupervised, and reinforcement learning. In supervised learning (SL), both input and output variables are required to train the models [9]. In unsupervised learning (UL), the model learns explicit features and generalizes the data categories with only input variables. Reinforcement learning (RL) is an agent-environment interaction mechanism that enables a system to automatically explore and learn through a trial-and-error process. RL is more suitable, and dominant in the literature, when focusing on routing problems because it does not require a dataset like other ML models such as SL and UL [10].

B. Reinforcement Learning
The Reinforcement Learning (RL) technique is a specific type of ML method that comprises an agent and its interaction with the environment. In typical RL, the agent's interaction with the environment adopts the mathematical framework of a discrete-time stochastic control process, represented as a tuple $\{S, A, R, P, \gamma\}$, where $S$ refers to the state space, $A$ denotes the action space, $R$ refers to the feedback, i.e., the reward (or penalty) provided by the environment, $P$ refers to the state-transition probability such that $P(s_{t+1} \mid s_t, a_t) \in [0, 1]$, and $\gamma \in [0, 1)$ is the discount factor applied to rewards. The modeling of RL is concerned with episodes, i.e., sets of timesteps during which the agent performs actions and interacts with the environment by learning a policy $\pi(s)$ determined based on the current state $s_t$. The agent then receives an immediate reward $r_t$ from the environment based on its action $a_t$ and transfers to the next state $s_{t+1}$; hence a transition can be represented as $(s_t, a_t, r_t, s_{t+1})$. The ultimate purpose of the agent is to determine the most suitable policy $\pi^*$ that maximizes the discounted sum of rewards received from any given state, i.e., $\pi^* = \arg\max_{\pi} V^{\pi}(s)$, where the value function $V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t} \gamma^{t} r_t \mid s_0 = s\right]$ indicates the discounted cumulative reward achieved from state $s$ under policy $\pi$. The value of performing action $a$ in any given state $s$ is determined using the Q-function, numerically expressed as $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t} \gamma^{t} r_t \mid s_0 = s, a_0 = a\right]$, where $a \in A$ and $s \in S$, with $A$ denoting the action space and $S$ the state space.
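For reference, the tabular Q-learning referred to throughout this paper maintains one entry per state-action pair and applies the standard update with learning rate $\alpha$:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a' \in A} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

Because every $(s, a)$ pair owns a table entry, the memory footprint grows with $|S| \times |A|$, which is the bottleneck the Kanerva-coding agent proposed in this paper targets.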

C. Motivation
Many researchers have explored the effectiveness of RL for network problems. The literature has shown that RL-driven schemes perform well in specific contexts; however, they suffer from huge overhead and performance issues in dynamic networking scenarios where the topology changes uncertainly and dynamically due to the mobility of sensor nodes. Q-learning and its customized variants have been widely employed for designing routing schemes to improve data-packet transmission performance and resource efficiency. However, the issue with Q-learning is that it cannot determine the optimal path selection in an appropriate time in complex, dynamic, large-scale networking scenarios. Basically, in a large network, the state and action spaces are large, and since Q-learning utilizes a table-lookup mechanism, it is usually subject to the dimension-disaster problem. In a real-time scenario, the network topology changes dynamically, and accordingly, the size of the network also changes; it is an effectively infinite process in the sense that mobile sensor nodes often leave and join the network dynamically. In this context, when there are more sensor nodes in the network, there will be a larger state and action space, and the Q-table occupies a lot of memory. The dimension of the state space increases exponentially with the number of state variables, resulting in a proportional upsurge in the dimension of the Q-table required to store the value of taking an action in a state under the current policy. Also, the agent requires a long time to explore the environment to learn the policy. Therefore, in a dynamic networking scenario like a MANET, computing all possible states is challenging and impractical. A few researchers have suggested integrating deep learning, with its strong adaptability and generalization ability, to solve many practical problems. However, some challenges still remain, which motivates us to introduce an effective solution for memory optimization without much affecting routing performance and network resources.
Therefore, this paper introduces a unique modelling of a reinforcement-learning-driven routing technique that utilizes a Kanerva coding mechanism in the agent modelling, which enables abstraction in action-policy learning toward the exploration of optimal routing in dynamic networks. Another significant aspect of the proposed work is the usage of a customized environment designed using the OpenAI Gym functions. Employing a sparse distributed scheme in the routing design efficiently optimizes the memory requirement and offers a better routing policy according to the varying network environment to meet the transmission requirements of different services.
The remaining sections of this paper are organized as follows: Section II presents the related work and highlights the problem statement for the proposed work. Section III discusses the proposed system, followed by its design and methodology. Section IV presents the environment modelling for the agent interaction to explore optimal routes; Section V presents agent modelling for routing using Kanerva coding; Section VI presents the experimental evaluation and performance analysis of the proposed algorithm. Finally, Section VII concludes the overall contribution of this paper.

II. RELATED WORK
In the literature, RL techniques have been widely employed to address the limitations of traditional approaches in the networking domain. The existing routing protocols based on RL can be classified into three different categories, viz. i) context-specific criteria, ii) design-specific criteria, and iii) performance-specific criteria. In the context-specific criteria, RL addresses networking problems such as routing, channel selection, QoS, and resource optimization. The work carried out by Saleem et al. [11] implemented RL to address the problems associated with channel selection and routing in cognitive radio networks (CRN); in this study, the authors designed a model-based intelligent system to optimize routing and QoS in the context of clustering and packet delivery ratio (PDR), respectively. In Debowski et al. [12], Q-learning-based path selection mechanisms are developed to optimize node resources. The work of Jung et al. [13] presented a data-packet-driven efficient routing scheme in the unmanned robotic network, a kind of mobile ad-hoc network (MANET); in this study, a modified Q-learning is adopted to formulate a location-based routing technique considering the mobility factor of the sensor nodes. The work of Zeng et al. [14] presented a hybrid scheme formulated based on Q-learning and a fuzzy logic system to minimize MAC collisions and achieve efficient load balancing in the flying ad-hoc network. Here, fuzzy logic is employed to choose leader nodes considering the mobility pattern, and Q-learning is used to stimulate member-node rewarding to learn and evaluate multi-hop routes. The research work by Varshini et al. [15][16] presented significant contributions to networking, where the authors in [15] suggested a customized environment, namely NetAI-Gym, to evaluate RL agents for routing; in [16], the authors presented a routing protocol based on Q-learning to select optimal routes, and the performance of the presented routing scheme is evaluated against a rule-based agent algorithm. Hence, there are many RL-based approaches, but the existing studies lack modeling of a suitable environment to evaluate agent performance. There are, however, a few significant research works on agent design and modeling. In the design-specific criteria, researchers attempt to customize and enhance the design of agent mechanisms, such as model-free and model-based approaches, to achieve efficiency and accuracy in model performance. The work carried out by Shen et al. [17] modeled a load-balancing protocol based on model-free approaches to minimize congestion in peer-to-peer networking systems; the concept of RL is mechanized to observe environment states, such as the processing capacity, queries, and resources associated with each peer, and the algorithm then determines suitable peers to relay queries based on the state observation. The study of Hendriks et al. [18] designed a Q-routing mechanism to perform optimal path selection and overhead reduction in ad-hoc wireless networks; this study utilizes the AODV protocol for the route discovery process, and Q-learning is used to optimize the path discovery concerning QoS requirements. Johnston et al. [19] introduced an intelligent routing scheme for battlefield networks to meet real-time requirements.
In this scheme, Q-learning is utilized to generalize and learn the next hops to perform successful transmission of unicast packets to the end nodes. The study considers duplication of packets over unstable paths, and the packets are forwarded securely through multi-hop routes. The study uses a cost metric for the duplication decision: if the cost factor is closer to zero, the path is more likely broken; if it is closer to 1, the path is stable. In the performance-specific criteria, researchers have presented techniques emphasizing state overhead, action overhead, control-packet overhead, and performance optimization. In the study of Wang et al. [20], RL is utilized in software-defined networking (SDN)-enabled Internet of Things to improve routing performance; the SDN controller has a global view of the nodes and adapts routing based on mobility and traffic conditions, and an optimal route is determined based on the Q-learning approach. In Lin et al. [21], an adaptive routing scheme is suggested based on Q-learning concerning QoS optimization, including delay, loss, and bandwidth factors. The study of Tang et al. [22] adopted RL to develop opportunistic routing to support video streaming in multi-hop wireless network applications. Researchers have also adopted the deep-RL concept to achieve efficiency in routing protocols [23]; the deep-RL technique is used in Lan et al. [24] to perform efficient routing in SDN. Table I summarizes the above-discussed literature to provide a quick insight for the readers. The key observations are as follows:

• The majority of the studies lack modeling of a suitable environment that supports the OpenAI Gym functions to assess RL agent algorithms.

• OpenAI Gym is a toolkit for benchmarking agent algorithms; however, it is not considered in the existing approaches in the context of network problems.

• Due to the ever-increasing requirements for accuracy and efficiency in the decision-making process for network routing, various approaches have been suggested over the years that can only approximate the complexity of the routing problem.

• The applicability of the existing methods is limited to specific contexts and is not very effective in dealing with dynamic scenarios, where the network topology and size change dynamically.

• Very few research studies concerning Q-routing are found to emphasize the memory-overhead problem.
The problem statement for the proposed study can be stated as follows: "it is a very challenging task to integrate a reinforcement learning function into a memory-efficient routing mechanism in an uncertain and dynamically changing network environment."

III. PROPOSED SYSTEM
The proposed study suggests a memory-efficient RL-driven routing mechanism. The proposed routing is based on an RL agent developed using a function approximator that uses the Kanerva (K) coding scheme to store the learnable parameters (weights and biases) representing the learned policy for the actions performed by the agent toward the exploration of better route establishment. The proposed algorithm searches for a near-optimal prototype set that provides a significant level of abstraction in memory consumption. The proposed algorithm is introduced into a dynamic network environment to perform path establishment for reliable data transmission. The schematic architecture of the proposed system is shown in Fig. 2, where environment modelling is carried out considering an ad-hoc network with mobile nodes using the OpenAI Gym functions. On the other hand, agent modelling is carried out to perform the routing operation based on the Kanerva coding technique; the proposed study also implements two other algorithms, Q-learning and a radial basis function (RBF) network, for the comparative analysis.

IV. ENVIRONMENT MODELLING
The proposed study performs environment modeling that imitates the scenario of a mobile wireless networking system. The design and development of the environment are inspired by the work carried out in [15], in which a customized environment, namely Net-AI-Gym, is developed. The network is composed of mobile sensor nodes with ad-hoc features. The network, acting as the environment, is modeled as a graph $G(V, E)$, where $V$ indicates the vertices, i.e., sensor nodes, and $E$ is the set of links connecting the sensor nodes in the network. In this regard, the environment is represented as a collection of $n$ node vectors in the set $V = \{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n\}$, where each $\vec{v}_i$ denotes a sensor node characterized by $\{id_i, L_i\}$, with $id_i$ the node id and $L_i$ the set of links incident on $\vec{v}_i$. The study considers that a mobile sensor node $\vec{v}_i$ can be linked with many of the other sensor nodes within its proximity, such that $L_i = \{l_{ij}\}$; therefore, $E$ is represented as $E = \{l_{ij} \mid \vec{v}_i, \vec{v}_j \in V\}$, where each link $l_{ij}$ carries a weight $w_{ij}$. Fig. 3 shows a flow diagram of the environment with the OpenAI Gym functions [15]. OpenAI Gym enables the environment to satisfy the Markov process. The environment implemented in this study is scalable and flexible to N sensor nodes comprising ad-hoc and mobility features. Thus, the proposed RL environment is a dynamic and uncertain imitation of a mobile ad-hoc network scenario. However, ensuring efficient routing is quite challenging due to the mobile and ad-hoc nature of the network. Therefore, the proposed study presents an efficient, sparse-distributed-memory-based agent model to perform routing operations in dynamic networks, discussed in the next section.
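To make the environment interface concrete, the following is a minimal sketch of a Gym-style routing environment over a random graph. It is not the Net-AI-Gym implementation from [15]; the class name, graph construction, and reward values are illustrative assumptions following the classic Gym step/reset interface.

```python
import gym
from gym import spaces
import networkx as nx

class NetworkRoutingEnv(gym.Env):
    """Hypothetical sketch: the agent moves a packet hop by hop."""

    def __init__(self, n_nodes=10):
        super().__init__()
        self.n_nodes = n_nodes
        # A random geometric graph stands in for an ad-hoc topology.
        self.graph = nx.random_geometric_graph(n_nodes, radius=0.4)
        self.action_space = spaces.Discrete(n_nodes)       # chosen next hop
        self.observation_space = spaces.Discrete(n_nodes)  # current node
        self.current, self.dest = 0, n_nodes - 1

    def reset(self):
        self.current, self.dest = 0, self.n_nodes - 1
        return self.current

    def step(self, action):
        if action not in self.graph[self.current]:
            return self.current, -10.0, True, {}  # invalid hop: packet dropped
        self.current = action
        done = self.current == self.dest
        reward = 10.0 if done else -1.0           # per-hop cost until delivery
        return self.current, reward, done, {}
```

Node mobility can be imitated by re-drawing the edge weights (or the graph itself) between episodes, which is consistent with the dynamic-topology assumption used later in the simulation.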

V. AGENT MODELLING
The prime objective of the study is to build an agent for solving the routing problem while optimizing memory with Kanerva coding. In the present study, the proposed agent mechanism uses a function approximator as the Q-function; the weights and biases of this function approximator are stored using Kanerva coding, as a result of which the storage space remains constant over time. However, before discussing the proposed routing algorithm, it is better to understand the routing problem, its formulation, and the role of the function approximator in RL agent modeling.

A. Routing Problem
Context-adaptive and efficient routing mechanisms that determine the finest path from the source node to the destination node increase the probability of data packets reaching the destination. Since the environment considered in the proposed study presents a completely random and dynamic networking scenario, selecting the optimal set of intermediate sensor nodes for transmitting data packets is challenging. Thus, the routing process in a dynamic networking environment can be formulated as a Markov decision problem. Let us consider a sensor node $n_i$ characterized by the MDP tuple $\{S, A, R, P\}$. The first element of this tuple refers to the set of states. Let $N_i$ be the set of sensor nodes within the proximity or transmission range $R_{tx}$ of $n_i$. In this regard, a state in $S$ comprises $(\vec{d}, \vec{e})$, where the vector $\vec{d}$ holds the distance values of all sensor nodes in $N_i$ and $\vec{e}$ holds the residual energy values of all nodes in the range of $n_i$. The entries of $\vec{d}$ are obtained by computing the Euclidean distance between $n_i$ and $n_j$ as follows:

$$\hat{d}_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \qquad (2)$$

where $(x_i, y_i)$ and $(x_j, y_j)$ are the positioning coordinates of the sensor nodes $n_i$ and $n_j$. Moreover, the transmission or proximity range of $n_i$ can be partitioned into $I$ intervals of length $\delta$, so that the discretized distance is expressed as follows:

$$d_{ij}(t) = \left\lceil \hat{d}_{ij}(t) / \delta \right\rceil \qquad (3)$$

According to Eq. (3), the distance between $n_i$ and $n_j$ is a positive integer in $\{1, \ldots, I\}$ computed from $\delta$ and the real distance value $\hat{d}_{ij}$ at time $t$. Here, the time $t$ is considered because of the random nature of the network, where sensor nodes leave or join the network dynamically. Also, it is to be noted that $\delta$ represents one unit of $d$, and the state interval is bounded by $I$. Furthermore, the remaining energy of $n_j$ at time $t$ can be computed as follows:

$$e_j(t) = e_j(t-1) - \varepsilon_j(t) \qquad (4)$$

where $\varepsilon_j(t)$ indicates the amount of energy utilized by $n_j$ at time $t$; the illustration is shown in Fig. 5. Considering all the above notions, the entire state can be expressed as follows:

$$s_t = \left(\vec{d}(t), \vec{e}(t)\right) \qquad (5)$$

The second element of the tuple, $A$, represents the set of actions that $n_i$ executes. The tuple element $R$ refers to the return function defined over the Cartesian product of the state-action space. The symbol $P$ denotes the state-transition probabilities for performing an action, such that $P(s_{t+1} \mid s_t, a_t) \in [0, 1]$; this represents the transition from the state $s_t$ of Eq. (5) at time $t$ to the next state $s_{t+1}$ at time $t+1$. However, computing the precise value of $P$ is usually impractical due to the absence of prior information about the network model, its random parameters, and its dynamic nature. In the proposed study, the development of the agent is carried out based on a model-based approach: each mobile sensor node approximates its transition probabilities using a maximum-likelihood approach, and the path-establishment process is completely under the control of the agent [15].
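The following is a small sketch of how a node's state in Eq. (5) could be assembled from the definitions above. The function name and the energy bookkeeping are illustrative assumptions; only the Euclidean distance, the interval discretization of Eq. (3), and the concatenated (distance, energy) state come from the formulation.

```python
import numpy as np

def build_state(node_xy, neighbor_xy, neighbor_energy, delta, I):
    """Assemble s_t = (d(t), e(t)) for one node and its in-range neighbors."""
    node_xy = np.asarray(node_xy, dtype=float)
    neighbor_xy = np.asarray(neighbor_xy, dtype=float)
    real_dist = np.linalg.norm(neighbor_xy - node_xy, axis=1)  # Eq. (2)
    # Discretize into I intervals of length delta, Eq. (3).
    d = np.clip(np.ceil(real_dist / delta), 1, I).astype(int)
    e = np.asarray(neighbor_energy, dtype=float)  # residual energy, Eq. (4)
    return np.concatenate([d, e])                 # state vector, Eq. (5)
```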

B. Agent Modeling based on Function Approximation
As discussed in the previous section, the RL algorithm encounters communication- and memory-overhead problems when the action space is very large in dynamic state spaces. In order to address this problem, researchers have suggested implementing a function approximator, which is basically a neural-network approach that RL agents utilize to improve their learning performance when introduced to dynamic and complex state spaces. Fig. 6 exhibits the modeling of the agent using a function approximator for the dynamic network environment. In this work, the study utilizes a simple artificial neural network (ANN) architecture with a parametrized function over the learnable parameters, namely weights and biases, to represent the approximated Q-value $\hat{Q}(s, a; \vec{\theta})$. The values for all $(s, a)$ pairs extracted from the large space are mapped into abstract components in $\vec{\theta}$, where $\vec{\theta}$ denotes a vector consisting of $N$ prototype components constructed by the state representation. In the proposed study, Kanerva coding is used as the state-representation technique. The idea behind using Kanerva coding [25] in the proposed agent modeling is that Kanerva coding treats a small set of states as prototypes to store the value functions. Kanerva coding maintains a set of prototypes as the parameterized elements of the approximation, and a value $\theta(p_k)$ is stored and updated for each prototype $p_k$. The approximation for a state-action pair $(s, a)$ is computed by a linear combination of the $\theta$ values of all prototypes adjacent to $s$, expressed as follows:

$$\hat{Q}(s, a) = \sum_{p_k \in \mathcal{A}(s)} \theta(p_k, a)$$

where $\mathcal{A}(s)$ denotes the set of prototypes adjacent to $s$. The mechanism of Kanerva coding in the proposed agent is designed based on the algorithm given in the following subsection.

C. Kanerva Coding
Kanerva coding (K-coding) deals with an architecture of sparse distributed memory [25] that utilizes prototype states to characterize the input sample states. The implementation of K-coding in the proposed agent for performing the routing operation has a key advantage: an increase in the network dimension (state space) due to a growing number of sensor nodes does not exponentially increase the number of prototypes required to learn the policy, facilitating efficient storage utilization and better scalability, i.e., near-constant memory even as the network size increases. The prime objective of implementing K-coding is to optimize the prototype set required to characterize a state space in an uncertain, dynamic, and large network system with minimal memory cost. The block diagram of the proposed agent with K-coding is shown in Fig. 7. The proposed method uses K-coding as storage for the values of the weights and biases of the function approximator. The model works with the help of an ANN, which employs regular weights and biases to find the best solution in the given state. In the networking case, both the state and the action represent the current node at which the packet is present. The route is always decided with the help of ICMP packets. The agent is present in all nodes, and the same is updated everywhere as well. In the proposed method, Kanerva coding is used purely as storage for the ANN, where it stores the weights and biases.
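To make the storage idea concrete, the following is a minimal sketch of a sparse distributed memory in the spirit of Kanerva [25], simplified to real-valued vectors so it can hold the approximator's flattened weights and biases. This is not the authors' implementation: the class name, the Euclidean activation radius, and the additive write rule are illustrative assumptions.

```python
import numpy as np

class SparseDistributedMemory:
    """Simplified real-valued SDM: prototypes act as hard locations;
    reads and writes touch every location within a fixed radius."""

    def __init__(self, n_locations, address_dim, word_dim, radius, seed=0):
        rng = np.random.default_rng(seed)
        self.addresses = rng.uniform(-1.0, 1.0, (n_locations, address_dim))
        self.contents = np.zeros((n_locations, word_dim))  # stored words
        self.radius = radius

    def _active(self, address):
        # Activate all hard locations within the radius of the address.
        dists = np.linalg.norm(self.addresses - address, axis=1)
        return dists <= self.radius

    def write(self, address, word):
        mask = self._active(address)
        self.contents[mask] += word               # distributed, additive write

    def read(self, address):
        mask = self._active(address)
        if not mask.any():
            return np.zeros(self.contents.shape[1])
        return self.contents[mask].mean(axis=0)   # distributed read
```

Under this reading, the flattened ANN parameter vector is the stored word and the (abstracted) state is the address; the number of hard locations, rather than the state-space size, fixes the memory footprint, which is why the memory curve stays flat as nodes are added.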

The complete routing algorithm based on the K-coding function-approximator agent takes as input a prototype set of size $K$, a ratio parameter $n$, and the number of closer prototypes $C$, where a distance function is utilized to determine the $C$ prototypes closest to the current state; the proposed study takes the distance function to be the Euclidean distance. In the first step, the algorithm randomly initializes the prototypes $p$ and takes a state $s$ as input. For each of the $K$ prototypes, the algorithm computes the Euclidean distance between the prototype $p_k$ and the state $s$, and stores the identities of the computed distances in vector format. In the next step, the algorithm constructs an index matrix $Ind_m$ for the first three features and then performs offsetting by $(m-1) \times K$, so that the indices of each feature map into one flat parameter vector. Basically, the K-coding mechanism computes the interval between a state variable and its actual distance; the obtained data is then merged with a better-quality state similarity to achieve higher accuracy in its computations. K-coding diminishes the requirement for reallocation and resizing of prototypes, which significantly shortens the heavy dependency on storing large action-space values seen in Q-learning. Due to its strong learning ability and reduced computational complexity, the K-coding mechanism also improves the entire learning experience.
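The distance-and-index computation described above might look as follows. This is a hedged reconstruction: the per-feature prototype sets, the $C$-closest selection, and the $(m-1) \times K$ offsetting come from the description, while the function names and the summed Q-value read are assumptions.

```python
import numpy as np

def feature_indices(state, prototypes_per_feature, C):
    """For each state feature m, find the C closest scalar prototypes and
    offset their indices by (m - 1) * K so that all features index into
    one flat parameter vector (K prototypes per feature assumed)."""
    idx = []
    for m, protos in enumerate(prototypes_per_feature, start=1):
        K = len(protos)
        dists = np.abs(np.asarray(protos) - state[m - 1])  # Euclidean in 1-D
        local = np.argsort(dists)[:C]                      # C closest prototypes
        idx.append(local + (m - 1) * K)                    # offset into flat vector
    return np.concatenate(idx)

def q_value(theta, active_idx, action):
    """Q(s, a) as the linear combination (sum) of the parameters of the
    prototypes adjacent to the state, per the approximation in Sec. V."""
    return theta[active_idx, action].sum()
```

Because only the $C$ activated indices per feature are read or updated, the cost of a value lookup is independent of the number of states the network can exhibit.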

VI. EXPERIMENTAL EVALUATION
The proposed work's design and implementation are carried out using the Python programming language in the Anaconda development environment. The experimental analysis is a comparative one, where the performance of the proposed agent mechanism is compared with other algorithms, namely Q-learning and RBF. Both Q-learning and RBF agents are implemented in the study and evaluated on the same environment designed using the Net-AI-Gym environment proposed in our previous paper [15]. The following assumptions are considered in the simulation setting and the experimental analysis (a training-loop sketch follows the list):

• The weights in the network represent the difficulty of packets being transferred.

• The weight is a composition of signal interruptions, battery, and distance.

• The weights keep varying to simulate dynamic or mobile networks.

• The number of nodes considered ranges from 6 to 100.

• The various parameters shown here are recorded for networks with varying numbers of nodes.

• In this study, each network is trained for 4000 episodes.

• An episode is the simulation of a single packet from source to destination.

• The episode ends when the packet either drops or reaches the destination.
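A minimal training loop matching these assumptions (4000 episodes, one packet per episode, termination on drop or delivery) might look as follows; the env and agent interfaces are assumptions in line with the Gym-style environment sketched earlier, not the authors' code.

```python
def train(env, agent, episodes=4000):
    """One packet per episode; the episode ends on delivery or drop."""
    for _ in range(episodes):
        state = env.reset()        # new packet at the source node
        done = False
        while not done:
            action = agent.act(state)                  # choose next hop
            next_state, reward, done, _ = env.step(action)
            agent.update(state, action, reward, next_state, done)
            state = next_state
```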
For the comparative study, the proposed work considered multiple performance metrics: memory utilization, throughput, average throughput, processing time for route establishment, and path length. Fig. 8 presents the performance analysis regarding memory utilization.
The graph trend of Fig. 8 exhibits that the memory in Q-learning increases exponentially, increases linearly in the case of RBF, and stays constant in the case of Kanerva coding.
The graph trend in Fig. 9 indicates that Q-learning has low throughput, whereas RBF and K-coding have achieved higher throughput. Since K-coding uses a function approximator to store the values, it underperforms slightly compared to RBF; however, this difference is insignificant compared to Q-learning, and even though K-coding underperforms slightly, it saves a lot of memory. In the processing-time analysis, Kanerva coding offers the least routing time, which is desirable; this is because K-coding consumes less memory and is faster to train. Fig. 12 represents the change of path length with the number of nodes. It can be observed that K-coding always finds the shortest path, outperforming Q-learning and RBF throughout.

A. Result Implication
• It is seen that Q-learning does not show good throughput. This is because Q-learning has fewer trainable parameters, and the table alone decides the reward. Even though the rewards are stored aptly, the mechanism to calculate the future reward is not as robust as in the other two methods.

• Q-learning fails to perform in the case of memory management as well, since the number of actions and states increases with an increase in nodes. To be specific, the memory consumption increases exponentially because the rewards are stored in the form of a table.

• The purpose of the Kanerva coding is to maintain a constant memory footprint throughout.

• As observed from the above results, Kanerva coding underperforms in only one aspect: throughput. However, this does not pose a significant disadvantage compared to RBF.
The proposed routing is designed based on the RL agent that utilizes K-coding to achieve abstraction in the state space. Therefore, the proposed agent mechanism dynamically establishes the best node paths with a low computational burden under uncertain and dynamic network traffic conditions.

VII. CONCLUSION
The proposed work is an extension of our previous research, where in the first work a suitable environment is designed to solve routing problems in a network using an RL agent, and in the second work the effectiveness of the proposed environment is evaluated by implementing routing algorithms based on Q-learning and rule-based methods. In this paper, the proposed study improves Q-routing performance for better time and memory efficiency. Q-learning consumes much memory and is not time-efficient; RBF gives higher accuracy and time efficiency, but the memory required for the algorithm still increases with an increasing number of nodes. Hence, in this work Kanerva coding is implemented to store the weights and biases of the function approximator used to build an agent that solves the routing problem while optimizing memory. The benchmarking of the proposed system is carried out through a comparative analysis over multiple network performance metrics. The study outcome proves the effectiveness of the proposed agent mechanism for routing operation under any given traffic condition in the network. In future work, the proposed approach can be extended toward multi-agent modeling of energy- and security-aware routing protocols in dynamic networking environments.