Smart Jamming Attacks in Wireless Networks During a Transmission Cycle: Stackelberg Game with Hierarchical Learning Solution

Due to the broadcast nature of the shared medium, wireless communications are vulnerable to malicious attacks. In this paper, we tackle the problem of jamming in a wireless network when the transmissions of both the jammer and the transmitter incur a non-zero cost. We focus on a jammer that keeps track of the re-transmission attempts of a packet until it is dropped. First, we consider a power control problem modeled as a Nash Game, in which all players act simultaneously. Second, we consider a Stackelberg Game, in which the transmitter is the leader and the jammer is the follower. Since the jammer can sense the transmission power, the transmitter adjusts its transmission power while anticipating the jammer's reaction. We provide closed-form expressions of the equilibrium strategies when both the transmitter and the jammer have complete information. Thereafter, we consider a worst-case scenario in which the transmitter has incomplete information while the jammer has complete information. We introduce a Reinforcement Learning method so that the transmitter can act autonomously in a dynamic environment without knowing the above Game model. It turns out that, despite the jammer's ability to sense the active channel, the transmitter can enhance its efficiency by predicting the jammer's reaction to its own strategy.
Keywords: Wireless networks; jamming attacks; game theory; reinforcement learning


I. INTRODUCTION
Technology and system requirements in the telecommunications domain are changing very rapidly. Over the previous years, with the transition from analogue to digital communications and from wired to wireless networks, different standards and solutions have been adopted, implemented and modified, often to deal with new and different business requirements. However, in the development of wireless Next Generation Networks (NGNs), in which a layered architecture is adopted, the common challenge of how to further improve resource utilization efficiency and provide better quality of service (QoS) is conditioned by the capacity of systems to accommodate changes quickly and with minimum impact on the services already implemented. Furthermore, the flexible topology and the low cost of use and setup have motivated the exploration of wireless NGNs with increasingly higher data rates to meet the rapidly growing demand for wireless access.
Distributed protocols are required to improve radio resource utilization and provide high performance for wireless NGNs. In particular, an integrated design of Medium Access Control (MAC) based on a Wireless Random Access (WRA) mechanism may lead to an efficient solution. This is why it is important to design distributed algorithms that mobiles can use to compute the equilibrium strategy and simultaneously achieve the optimal operating points. On the other hand, the basic underlying assumption in legacy WRA protocols is that any concurrent transmission of two or more users causes all transmitted packets to be lost [2]. However, this model does not reflect the actual situation in many practical wireless networks, where some information can be received correctly from a simultaneous transmission of several packets. This capture effect arises because the packet arriving with the highest power has a good chance of being detected accurately, even when other packets are present. The effect of capture on Aloha [9], [10], [11], [18] and on the IEEE 802.11 protocol (Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA)) [19], [20], [21] has been studied extensively in the literature, and new MAC protocols for channels with capture have been proposed. Furthermore, full system utilization requires coordination among users, which may be impractical given the distributed nature and arbitrary topology changes of wireless collision channels. However, while seeking ways to increase the performance of wireless networks, an increasing number of critical security issues need to be addressed in order to make these wireless NGNs safer [7], [24], [25] (e.g., for time-critical services, military operations, etc.). Note that wireless networks are vulnerable to security threats such as distributed denial of service (DoS) attacks, spoofing attacks, Sybil attacks, faked sensing attacks and smart jamming attacks [7]. Thus, the study of the jamming problem in the context of wireless networks is an important challenge, since communications are easy to disrupt when the jammer can mount dynamic and intelligent jamming attacks [23], [5].
Game theory provides a convenient framework for approaching power control in wireless distributed MAC protocols. In fact, given the broadcast nature of the wireless MAC, the users are considered as selfish transmitters [2]: each transmitter seeks to maximize its payoff, while a malicious user tries to degrade the performance of the whole system. In this paper, we consider the IEEE 802.11 MAC CSMA/CA mechanism, which is used by a large number of wireless systems and in which jamming can occur during transmission. In addition, the adversary, or jammer, has to expend a significant amount of energy to jam the selected frequency bands, and the continuous presence of unusually high interference levels makes these attacks easy to detect. Thus, the main challenge in this paper is to derive the optimal defense strategy against DoS attacks [16], [3], [4], [8], knowing that a malicious user may jam the network by sending abnormal packets to another user in order to block the channel from doing anything useful (Fig. 1).
It is well known that Game theory is an appropriate tool for dealing with competitive situations. Compared to the approaches used in previous works [12], [13], [14], [15], [17], we are interested here in the impact of a smart jammer on the transmitter power levels during the period that starts at the first transmission attempt of a packet and ends at the first transmission attempt of the next packet. The motivation is that, when re-transmissions are used, jammers cause the effective network activity factor (and hence the interference among the Receiver Sides (RSs)) to be doubled [24]. In particular, we consider a scenario where a single transmitter (player 1) and a single jammer (player 2) coexist. The case of several transmitters or jammers is a subject of future research. The strategies of both the jammer and the transmitter are their transmission power levels during the packet transmission cycle. Since each packet transmission attempt incurs a cost in terms of power, the utilities of both players are functions of the Signal to Interference plus Noise Ratio (SINR) and the transmission cost. For this anti-jamming Game based on power control, we propose two formulations: a Nash Game, where all players act simultaneously, and a Stackelberg Game, where the transmitter is the leader (i.e., the first to determine its transmit power) and the jammer is the follower. We first derive the Nash Equilibrium (NE) expression; thereafter, we prove the existence of the Stackelberg Equilibrium (SE) and compute it using a Simulated Annealing algorithm. From the comparison of the two schemes, we deduce that the transmitter can efficiently enhance the system performance. The main limitation of the proposed power control-based anti-jamming approach is that information may be lacking for unknown jamming patterns. Thus, we consider a worst-case scenario where the transmitter has incomplete information while the jammer has complete information. We introduce a Reinforcement Learning method so that the transmitter can act autonomously in a dynamic environment, knowing neither the jamming patterns and parameters nor the above Game model.
The rest of this paper is organized as follows. We briefly describe the related work in Section II. Then, we introduce the system model and the Game formulation in Section III. In Sections IV and V, we analyze the system in the presence of a regular and a smart jammer, respectively. In Section VI, we propose a hierarchical learning solution. Simulation results are provided in Section VII. Finally, Section VIII concludes the paper and gives some perspectives for future research.

II. RELATED WORK
Mechanisms to detect and avoid wireless network jamming have been widely studied. In [26], the authors investigate the anti-jamming problem with discrete power strategies: they formulate a Stackelberg Game to model the competitive interactions between the user and the jammer, and then analyze the asymptotic convergence of a proposed hierarchical power control algorithm (HPCA). In [27], a smart jammer can quickly learn the transmission strategies of the legitimate transmitters and then adjust its strategy to damage the legitimate transmission. Meanwhile, the transmitters are aware of the existence of the smart jammer. The difference from [28] is that relay nodes are considered, which help the source counteract a smart jammer. Furthermore, in [29], reinforcement learning is applied to determine transmission powers against a jammer in a dynamic environment without knowing the underlying Game model. In [1], the authors propose an anti-jamming Bayesian Stackelberg Game with incomplete information. All of these previous works on anti-jamming consider the transmitter-jammer problem during only one transmission attempt.
In this paper, we study the power control problem during a packet transmission cycle in the presence of a smart jammer, which is energy-efficient and keeps track of the re-transmission attempts of the packet until it is dropped. We suppose that the set of power levels is continuous, and we consider a non-zero-sum Game by introducing a transmission power cost.

III. SYSTEM MODEL
Let a mobile use the IEEE 802.11 CSMA/CA standard, which is the most widely known standard in wireless networks. We assume that a transmission fails with a probability that depends on the SINR. If a transmission fails, it is attempted again after some back-off time. After a certain number of attempts $K$, the packet is dropped. We assume that the power is controlled. Hence, the power used by the mobile user at the $i$th transmission attempt is denoted by $T_i \in [0, T_{\max}]$. Assume that:

$$T_i = p_0\, x^{\,i-1}, \qquad i = 1, \dots, K, \tag{1}$$

where $p_0 \ge 0$ is the initial transmission power and $x \ge 1$ is the power multiplier factor applied at each re-transmission attempt. In this paper, we examine a scenario with one transmitter, which has its own traffic to send, and one jammer, which does not have its own traffic and simply wants to jam the transmitter's attempts. As the mobile user spreads its signal over a common frequency band and treats interference as noise, the signal to interference plus noise ratio at the $i$th transmission attempt, $SINR_i$, at the receiver side is given by

$$SINR_i = \frac{\alpha T_i}{N + \beta J_i}, \tag{2}$$

where $N$ is the background noise level on the channel, $J_i \in [0, J_{\max}]$ is the jammer power at the $i$th transmission attempt, and $\alpha > 0$ and $\beta > 0$ denote the fading channel gains of the mobile user and the jammer, respectively. Since a jammer chooses which transmission or re-transmission to jam, we assume that it jams all packets that are in a back-off stage $k \ge K_2$, where $K_2$ is an integer; that is, the competition starts from back-off stage $K_2$. Quick detection of the start of a packet is becoming much harder for the jammer because of large bandwidths and widely spread signals. Nevertheless, we assume the worst situation, in which the jammer can jam the communication from the first transmission attempt, even though the packet arrives at a completely unpredictable time and frequency.
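The power schedule and the SINR described above can be illustrated with a short Python sketch. It assumes the geometric schedule $T_i = p_0 x^{i-1}$ and the ratio $\alpha T_i / (N + \beta J_i)$; the function names and the numerical values are hypothetical and chosen only for illustration.

```python
def transmit_powers(p0, x, K):
    """Powers used at attempts 1..K, assuming T_i = p0 * x**(i-1)."""
    return [p0 * x ** i for i in range(K)]


def sinr(T_i, J_i, alpha, beta, N):
    """SINR at the receiver, assuming interference is treated as noise."""
    return alpha * T_i / (N + beta * J_i)


# Hypothetical values: three attempts, constant jammer power of 1.0.
powers = transmit_powers(p0=2.0, x=1.1, K=3)
sinrs = [sinr(T_i, 1.0, alpha=0.5, beta=0.5, N=0.2) for T_i in powers]
```

Because $x \ge 1$, the SINR is non-decreasing over the attempts when the jammer power stays constant.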
On the other hand, let us define a cycle as the period that starts at the first transmission attempt of a packet and ends at the first transmission attempt of the next packet. During a cycle, we consider a Game in which the two mobiles (the transmitter and the jammer) are the players. Moreover, each transmission incurs a certain cost: let $C > 0$ and $D > 0$ be the transmission costs per unit power of the mobile user and the jammer, respectively. We assume that the players have perfect knowledge of the environment state and the cost constraints at the beginning of each cycle.
Let $S_t = \{(p_0, x) \mid 0 \le p_0 \le p_0^{\max},\ 1 \le x \le x^{\max}\}$ be the feasible set of the power multiplier and the initial transmission power of the mobile user, and $S_j = \{(J_1, J_2, \dots, J_K) \mid J_i \ge 0;\ J_i \le J_{\max}\}$ the feasible set of the jammer power vector. We consider the following power control problem, where $(T, J)$ is to be determined, with $J = (J_1, J_2, \dots, J_K)$ and $T = (p_0, x)$.
The objective of the mobile user is to achieve the maximum $\sum_{i=1}^{K} SINR_i$ with the minimum cost. From (1) and (2), the utility function of the mobile user during a cycle, denoted $U(T, J)$, is given by:

$$U(T, J) = \sum_{i=1}^{K} \left( \frac{\alpha T_i}{N + \beta J_i} - C\, T_i \right). \tag{3}$$

The objective of the jammer is to achieve the minimum $\sum_{i=1}^{K} SINR_i$ with the minimum cost. From (1) and (2), the utility function of the jammer during a cycle, denoted $V(T, J)$, is given by:

$$V(T, J) = \sum_{i=1}^{K} \left( -\frac{\alpha T_i}{N + \beta J_i} - D\, J_i \right). \tag{4}$$
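As an illustration, the two cycle utilities can be evaluated numerically. The sketch below assumes that each utility is the (signed) sum of the per-attempt SINR minus a cost linear in power, with $T_i = p_0 x^{i-1}$; the function name and the sample values are hypothetical.

```python
def utilities(p0, x, J, alpha, beta, N, C, D):
    """Return (U, V) for one cycle.

    Assumed forms: U = sum_i (SINR_i - C*T_i), V = sum_i (-SINR_i - D*J_i),
    with T_i = p0 * x**(i-1) and SINR_i = alpha*T_i / (N + beta*J_i).
    """
    U = 0.0
    V = 0.0
    for i, J_i in enumerate(J):
        T_i = p0 * x ** i
        s = alpha * T_i / (N + beta * J_i)
        U += s - C * T_i      # transmitter: reward SINR, pay for its power
        V += -s - D * J_i     # jammer: penalize SINR, pay for its power
    return U, V


# Hypothetical two-attempt cycle with an idle jammer.
U, V = utilities(p0=1.0, x=1.0, J=[0.0, 0.0],
                 alpha=1.0, beta=1.0, N=1.0, C=0.2, D=0.2)
```

Note that $U + V = -\sum_i (C\,T_i + D\,J_i)$, so the Game is not zero-sum as soon as a player transmits.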

IV. NASH GAME
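In this section, we assume the presence of a regular jammer, and we consider the Game $G_n = (\{\text{Transmitter, Regular jammer}\}, \{T, J\}, \{U, V\})$. Since the regular jammer cannot sense the ongoing transmission power, both players act simultaneously. We seek a Nash Equilibrium, in which neither the transmitter nor the jammer can increase its utility by unilaterally changing its strategy. Formally, $(T^{NE}, J^{NE})$ is a Nash Equilibrium if

$$U(T^{NE}, J^{NE}) \ge U(T, J^{NE}) \ \ \forall T \in S_t, \qquad V(T^{NE}, J^{NE}) \ge V(T^{NE}, J) \ \ \forall J \in S_j.$$

Theorem 1: Consider a jammer without the ability to learn the transmitter strategy. There exists an NE $(T^{NE}, J^{NE})$ of the Game; in addition, $T^{NE} = (0, 1)$ when $C > \alpha/N$.

Proof: By (3) we have:

$$\frac{\partial U}{\partial T_i} = \frac{\alpha}{N + \beta J_i} - C.$$

The first-order partial derivative of $V(T, J)$ with respect to $J_i$ is

$$\frac{\partial V(T, J)}{\partial J_i} = \frac{\alpha \beta T_i}{(N + \beta J_i)^2} - D.$$

The second-order partial derivatives of the jammer objective function are:

$$\frac{\partial^2 V(T, J)}{\partial J_i^2} = -\frac{2 \alpha \beta^2 T_i}{(N + \beta J_i)^3}, \qquad \frac{\partial^2 V(T, J)}{\partial J_i\, \partial J_j} = 0 \quad (i \ne j).$$

Therefore, the Hessian matrix of $V(T, J)$ with respect to the vector $J$ is negative definite for $T_i > 0$, and $V(T, J)$ is strictly concave in $J$. We consider the following cases:
• $C > \alpha/N$: As $\partial U / \partial T_i < 0$ $\forall T \in S_t$, we have $x^{NE} = 1$ and $p_0^{NE} = 0$, yielding $T^{NE} = (0, 1)$. Since $T_i^{NE} = 0$, (4) reduces to $V = -D \sum_i J_i$, which is maximized at $J^{NE} = (0, \dots, 0)$.
• $C < \alpha/(N + \beta J_{\max})$: As $\partial U / \partial T_i > 0$ $\forall T \in S_t$, we have $p_0^{NE} = p_0^{\max}$ and $x^{NE} = x^{\max}$. By using the concavity of $V$ in $J$ and setting $\partial V(T, J)/\partial J_i$ to zero, we have $J_i^0 = \frac{1}{\beta}\left(\sqrt{\alpha \beta T_i^{NE} / D} - N\right)$. Let $J'_i = \min(\max(J_i^0, 0), J_{\max})$. According to Fig. 2, we have $\forall J \in S_j$: $V(T^{NE}, J) \le V(T^{NE}, J')$. Thus $J^{NE} = J'$.
• $\alpha/(N + \beta J_{\max}) \le C \le \alpha/N$: Let $J'_i = \frac{1}{\beta}\left(\frac{\alpha}{C} - N\right) \in [0, J_{\max}]$; then $\partial U/\partial T_i = 0$ and $U(T, J') = 0$ $\forall T \in S_t$. By using the concavity of $V$ in $J$ and setting $\partial V(T, J)/\partial J_i$ to zero, we have $J_i^0 = \frac{1}{\beta}\left(\sqrt{\alpha \beta T_i / D} - N\right)$. In order to have $J_i^0 = J'_i$ for $i \in [1, K]$, we must have $x = 1$ and $p_0 = \frac{\alpha D}{\beta C^2}$; without loss of generality we assume that $\frac{\alpha D}{\beta C^2} \le p_0^{\max}$. As a result, we get $T^{NE} = \left(\frac{\alpha D}{\beta C^2}, 1\right)$ and $J^{NE} = J'$.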
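The case analysis of the proof can be turned into a small Python sketch. It assumes the utility forms $U = \sum_i (SINR_i - C T_i)$ and $V = \sum_i (-SINR_i - D J_i)$, the schedule $T_i = p_0 x^{i-1}$, and, as in the proof, $\alpha D/(\beta C^2) \le p_0^{\max}$ in the intermediate case. The function name, the jammer strategy returned in the first case (derived from $T = 0$), and all sample values are assumptions made for illustration.

```python
import math


def nash_equilibrium(alpha, beta, N, C, D, J_max, p0_max, x_max, K):
    """Return ((p0, x), [J_1..J_K]) following the three cases of the proof."""
    if C > alpha / N:
        # Transmitting never pays off; with T = 0 the jammer stays silent.
        return (0.0, 1.0), [0.0] * K
    if C < alpha / (N + beta * J_max):
        # Transmitting always pays off; jammer uses its clipped stationary point.
        J = []
        for i in range(K):
            T_i = p0_max * x_max ** i
            J0 = (math.sqrt(alpha * beta * T_i / D) - N) / beta
            J.append(min(max(J0, 0.0), J_max))
        return (p0_max, x_max), J
    # Intermediate case: the transmitter is made indifferent by J' = (alpha/C - N)/beta.
    # Assumes alpha*D/(beta*C**2) <= p0_max, as in the proof.
    p0 = alpha * D / (beta * C ** 2)
    return (p0, 1.0), [(alpha / C - N) / beta] * K


# Hypothetical parameters falling in the intermediate case.
T_ne, J_ne = nash_equilibrium(alpha=0.5, beta=0.5, N=1.0, C=0.2, D=0.2,
                              J_max=10.0, p0_max=10.0, x_max=1.5, K=10)
```

With these sample values the intermediate case applies, since $\alpha/(N + \beta J_{\max}) \le C \le \alpha/N$.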

V. STACKELBERG GAME
We now assume the presence of a smart jammer. Since this kind of jammer can sense the ongoing transmission power, we model the problem as a Stackelberg Game denoted as $G_s = (\{\text{Transmitter, Smart jammer}\}, \{T, J\}, \{U, V\})$, where the leader is the transmitter and the follower is the jammer. The leader fixes its optimal strategy by anticipating the reaction of the follower; the follower then optimizes its own utility given the leader's strategy. Formally, $(T^{SE}, J^{SE})$ is a Stackelberg Equilibrium if

$$T^{SE} = \arg\max_{T \in S_t} U(T, J^*(T)), \qquad J^{SE} = J^*(T^{SE}), \qquad \text{where } J^*(T) = \arg\max_{J \in S_j} V(T, J).$$

A. Jammer's Optimal Reaction
Assume that the two players have complete information about the environment.
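Lemma 1: Let $T$ be a given strategy of the transmitter. There exists a unique $J^*(T)$ such that $J^*(T) = \arg\max_{J \in S_j} V(T, J)$. In addition, the optimal jammer reaction is given, for $i = 1, \dots, K$, by:

$$J_i^*(T) = \begin{cases} 0 & \text{if } J_i^0 \le 0, \\ J_i^0 & \text{if } 0 < J_i^0 < J_{\max}, \\ J_{\max} & \text{if } J_i^0 \ge J_{\max}, \end{cases} \qquad J_i^0 = \frac{1}{\beta}\left(\sqrt{\frac{\alpha \beta T_i}{D}} - N\right).$$

The conditions are given by: $J_i^0 \le 0 \iff T_i \le \frac{D N^2}{\alpha \beta}$ and $J_i^0 \ge J_{\max} \iff T_i \ge \frac{D (N + \beta J_{\max})^2}{\alpha \beta}$.

Proof: According to (4), $V(T, \cdot)$ is a continuous function on the compact set $S_j$, so it achieves its maximum value at some point $J \in S_j$. The first-order partial derivative of the jammer objective function with respect to $J_i$ is

$$\frac{\partial V(T, J)}{\partial J_i} = \frac{\alpha \beta T_i}{(N + \beta J_i)^2} - D,$$

and the second-order partial derivatives of the jammer objective function are:

$$\frac{\partial^2 V(T, J)}{\partial J_i^2} = -\frac{2 \alpha \beta^2 T_i}{(N + \beta J_i)^3}, \qquad \frac{\partial^2 V(T, J)}{\partial J_i\, \partial J_j} = 0 \quad (i \ne j).$$

Therefore, the Hessian matrix of $V(T, J)$ with respect to the vector $J$ is negative definite for $T_i > 0$ (for $T_i = 0$, $V$ is strictly decreasing in $J_i$), and $V(T, J)$ is strictly concave in $J$ [30]. Thus there exists a unique solution $J^*(T)$ such that $J^*(T) = \arg\max_{J \in S_j} V(T, J)$.
On the other hand, solving the equation $\partial V(T, J)/\partial J_i = 0$ gives $J_i = J_i^0$. Since $V$ is concave and separable in the $J_i$, its maximizer over $[0, J_{\max}]$ is $J_i^0$ clipped to that interval. Thus, we deduce the optimal jammer strategy, given the strategy of the transmitter, stated in Lemma 1.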
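The jammer's reaction can be sketched in a few lines of Python. The sketch assumes the stationary point $J_i^0 = (\sqrt{\alpha\beta T_i / D} - N)/\beta$, obtained from $V = \sum_i (-\alpha T_i/(N+\beta J_i) - D J_i)$, and clips it to $[0, J_{\max}]$; the function name and sample values are hypothetical.

```python
import math


def jammer_best_response(p0, x, K, alpha, beta, N, D, J_max):
    """Per-attempt jammer powers maximizing the assumed V for T = (p0, x)."""
    reaction = []
    for i in range(K):
        T_i = p0 * x ** i                                    # T_i = p0 * x**(i-1), i = 1..K
        J0 = (math.sqrt(alpha * beta * T_i / D) - N) / beta  # root of dV/dJ_i = 0
        reaction.append(min(max(J0, 0.0), J_max))            # projection onto [0, J_max]
    return reaction


# Hypothetical parameters.
J_star = jammer_best_response(p0=2.0, x=1.0, K=2, alpha=0.5, beta=0.5,
                              N=0.2, D=0.2, J_max=10.0)
```

When the transmit power is small enough that $T_i \le D N^2/(\alpha\beta)$, the sketch returns zero for that attempt: jamming would cost more than it gains.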

B. Stackelberg Equilibrium
Let's now focus on analyzing the transmitter objective function given the reaction of the jammer.
Theorem 2: There exists $T^{SE} \in S_t$ such that $(T^{SE}, J^*(T^{SE}))$ is a Stackelberg Equilibrium of the Game.
Proof: We begin by proving the continuity of $J^*$ on $S_t$. Clearly, $J^*$ is continuous on each region of $S_t$ where a single case of Lemma 1 applies, and the expressions of adjacent cases coincide on the boundaries between these regions. The continuity of $J^*$ on all of $S_t$ is thus proved. Since $U(T, J)$ is a continuous function on $S_t \times S_j$, $U(T, J^*(T))$ is continuous in $T$.
Since the set $S_t$ is compact, $U(T, J^*(T))$ achieves its maximum at some point $T^{SE} \in S_t$. This proves the existence of $T^{SE} \in S_t$ such that $(T^{SE}, J^*(T^{SE}))$ is a Stackelberg Equilibrium of the Game.
$U(T, J^*(T))$ is not a concave function: Although we proved the existence of an SE $(T^{SE}, J^*(T^{SE}))$, computing the SE is challenging because of the non-concavity of the function $U(T, J^*(T))$. We use an example to show that there exist $T_1 = (p_{01}, x_1)$ and $T_2 = (p_{02}, x_2)$ in $S_t$ and $t \in [0, 1]$ such that

$$U\big(tT_1 + (1-t)T_2,\ J^*(tT_1 + (1-t)T_2)\big) < t\, U(T_1, J^*(T_1)) + (1-t)\, U(T_2, J^*(T_2)).$$

Let $N = 0.2$; $E = 1$; $C = 0.1$; $p_{01} = p_{02} = 2$; $k = 10$; $\alpha = \beta = 0.5$; $x_1 = 1.05$, $x_2 = 1.1$, $t = 0.63$.
In this example the inequality above holds, so $U(T, J^*(T))$ is not a concave function on the set $S_t$. This result illustrates the difficulty of finding a closed form of the global optimum, which is why we propose a simulated annealing technique, as shown in Algorithm 1, to approximate the global optimum of the function $U(T, J^*(T))$.
Algorithm 1 Calculate $T^{SE} = \arg\max_{T \in S_t} U(T, J^*(T))$
Require: $T \in S_t$
  Initialize the system parameters.
  Initialize G with a large value.
  T0 = [0, 1];
  while (G ≠ 0) do
    while (accepted states number is below a threshold level) do
      Pick a random neighbor,
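Since the listing of Algorithm 1 is incomplete, the following Python sketch shows one common way to run simulated annealing on the leader's problem; it is not the authors' exact procedure. It assumes the utility $U = \sum_i (\alpha T_i/(N+\beta J_i) - C T_i)$ with the jammer reacting by its clipped stationary point, a geometric cooling schedule, a fixed number of moves per temperature level (instead of a threshold on accepted states), and hypothetical bounds `J_MAX`, `P0_MAX`, `X_MAX`.

```python
import math
import random

ALPHA, BETA, NOISE, C, D, K = 0.5, 0.5, 1.0, 0.2, 0.2, 10
J_MAX, P0_MAX, X_MAX = 10.0, 5.0, 1.5   # hypothetical feasible-set bounds


def leader_utility(T):
    """U(T, J*(T)) under the assumed utility and jammer reaction."""
    p0, x = T
    total = 0.0
    for i in range(K):
        T_i = p0 * x ** i
        J0 = (math.sqrt(ALPHA * BETA * T_i / D) - NOISE) / BETA
        J_i = min(max(J0, 0.0), J_MAX)
        total += ALPHA * T_i / (NOISE + BETA * J_i) - C * T_i
    return total


def neighbor(T, rng):
    """Random nearby strategy, clipped to the feasible set."""
    p0 = min(max(T[0] + rng.uniform(-0.5, 0.5), 0.0), P0_MAX)
    x = min(max(T[1] + rng.uniform(-0.05, 0.05), 1.0), X_MAX)
    return (p0, x)


def simulated_annealing(T0=(0.0, 1.0), G=1.0, cooling=0.95, G_min=1e-3,
                        moves_per_level=50, seed=0):
    rng = random.Random(seed)
    current, u_cur = T0, leader_utility(T0)
    best, u_best = current, u_cur
    while G > G_min:                       # G plays the role of the temperature
        for _ in range(moves_per_level):
            cand = neighbor(current, rng)
            u_cand = leader_utility(cand)
            # Always accept improvements; accept worse moves with Boltzmann probability.
            if u_cand >= u_cur or rng.random() < math.exp((u_cand - u_cur) / G):
                current, u_cur = cand, u_cand
                if u_cur > u_best:
                    best, u_best = current, u_cur
        G *= cooling
    return best, u_best


best_T, best_U = simulated_annealing()
```

Keeping track of the best visited strategy, rather than returning the final state, guarantees that the result is never worse than the starting point T0 = (0, 1).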

VI. ANTI-JAMMING WITH REINFORCEMENT LEARNING
Reinforcement Learning (RL) is a method in which the player takes an action in the current time step and receives the corresponding reward in the next time step, which it uses to evaluate its previous action [6]. RL is capable of solving complex problems, especially since the player does not require knowledge of the environment's reaction or of the reward function. Instead, the player learns only from previous experience by interacting with the environment.
Through the above Game model, where both the transmitter and the jammer have complete information about each other (i.e., channel gain and transmission cost), the SE strategies are derived. However, neither the jammer's physical location nor its transmission cost is known by the transmitter: first, the jammer can change its physical location at a completely unforeseen time; second, the value of the jammer's transmission cost is not shared over the channel. Consequently, we introduce a reinforcement learning technique, namely the Q-learning method, so that the transmitter can act autonomously in a dynamic environment without knowing the above Game model.
We assume that the transmitter can choose its power multiplier and its initial transmission power from $M$ and $N$ levels, respectively. Let $P$ and $A_{MN}$ denote the power action taken by the transmitter and the set of power actions, respectively. The state observed by the transmitter is denoted by $s^t_n$. In each transmission cycle, the transmitter and the jammer take actions sequentially; we denote by $J$ the jammer power action. At the beginning of the $n$-th transmission cycle, the transmitter acts first, and its power action $P_n$ is chosen based on the transmission state of the previous transmission cycle, i.e., $s^t_n = (J_{n-1})$. Subsequently, based on the observed state $s^j_n = (P_n)$, the jammer chooses its optimal power $J_n$ given by (14). The utility received by the transmitter is denoted by $u_n$.
We now describe the anti-jamming power control strategy based on Q-learning. Let $\alpha_t$ and $\beta_t$ denote the learning rate and the discount factor of the transmitter. The Q-function with the power action $P$ in the state $s^t$ is denoted by $Q(s^t, P)$. The maximum Q value in the state $s^t$ is denoted by $V(s^t)$. We define the update rule of the Q-function in the $n$-th transmission cycle as follows:

$$Q(s^t_n, P_n) \leftarrow (1 - \alpha_t)\, Q(s^t_n, P_n) + \alpha_t \big( u_n + \beta_t\, V(s^t_{n+1}) \big), \qquad V(s^t_n) = \max_{P \in A_{MN}} Q(s^t_n, P).$$

As a well-known reinforcement learning method, Q-learning should balance exploration and exploitation according to the $\epsilon$-greedy policy, where the transmitter chooses with a high probability $1 - \epsilon$ the power action that maximizes the Q value in the state $s^t$, while the other power actions are taken with an equal low probability $\frac{\epsilon}{MN - 1}$. Thus, the probability that the transmitter takes the power action $\hat{P}$ is given by:

$$\Pr(P = \hat{P}) = \begin{cases} 1 - \epsilon & \text{if } \hat{P} = \arg\max_{P \in A_{MN}} Q(s^t, P), \\ \dfrac{\epsilon}{MN - 1} & \text{otherwise.} \end{cases}$$

The anti-jamming strategy of the transmitter with Q-learning is shown in detail in Algorithm 2.
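The learning loop described above can be sketched as tabular Q-learning in Python. This is a minimal illustration, not the authors' implementation: the state is taken to be the quantized mean jammer power of the previous cycle (an assumption needed to keep the table finite), the jammer reacts with its clipped stationary point under the assumed utilities, $\epsilon$ is kept constant, and the action grids `P0_LEVELS` and `X_LEVELS` are hypothetical and much smaller than the 20 levels used in the simulations.

```python
import math
import random

ALPHA, BETA, NOISE, C, D, K, J_MAX = 0.5, 0.5, 1.0, 0.2, 0.2, 10, 10.0
P0_LEVELS = [0.5 * k for k in range(1, 6)]       # hypothetical initial-power levels
X_LEVELS = [1.0 + 0.05 * k for k in range(4)]    # hypothetical multiplier levels
ACTIONS = [(p0, x) for p0 in P0_LEVELS for x in X_LEVELS]
N_STATES = 5                                     # quantization of the jammer power


def play_cycle(action):
    """Return (transmitter utility, next state) for one transmission cycle."""
    p0, x = action
    utility, jam_total = 0.0, 0.0
    for i in range(K):
        T_i = p0 * x ** i
        J0 = (math.sqrt(ALPHA * BETA * T_i / D) - NOISE) / BETA
        J_i = min(max(J0, 0.0), J_MAX)           # jammer senses T_i and reacts
        utility += ALPHA * T_i / (NOISE + BETA * J_i) - C * T_i
        jam_total += J_i
    state = min(int(jam_total / K / J_MAX * N_STATES), N_STATES - 1)
    return utility, state


def epsilon_greedy(q_row, eps, rng):
    """Greedy action with probability 1 - eps, else uniform over the others."""
    best = max(range(len(q_row)), key=q_row.__getitem__)
    if rng.random() >= eps:
        return best
    return rng.choice([a for a in range(len(q_row)) if a != best])


def train(episodes=120, lr=0.8, discount=0.8, eps=0.5, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
    state = 0
    for _ in range(episodes):
        a = epsilon_greedy(Q[state], eps, rng)
        u, nxt = play_cycle(ACTIONS[a])
        # Q-learning update with V(s') = max_P Q(s', P).
        Q[state][a] = (1 - lr) * Q[state][a] + lr * (u + discount * max(Q[nxt]))
        state = nxt
    return Q


Q_table = train()
```

The transmitter never evaluates the jammer's utility or channel gain directly; it only observes the resulting utility and the quantized jamming level, which is the point of the model-free approach.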

VII. SIMULATION RESULTS
In this section, numerical results are presented to evaluate the performance of the proposed power control scheme during a cycle in two scenarios: 1) Transmitter against smart jammer (in which the jammer has the intelligence to quickly learn the transmission power of the transmitter and adjust its own transmission power). 2) Transmitter against regular jammer (in which both players play the Game simultaneously in a non-cooperative manner). Among all the system variables, only the fading channel gains of the transmitter and the jammer may vary significantly, because the players can change their physical locations. Thus, we investigate how the equilibrium utilities of the players depend on α and β. Let N = 1, D = 0.2, C = 0.2 and K = 10. Fig. 3 shows the utility functions at the NE and the SE with respect to α. We observe that, as α increases, the transmitter's SE utility increases while the jammer's SE utility decreases; this is because the larger α becomes, the closer the transmitter is to the receiver. In addition, Fig. 4 depicts the utility functions of both players at the NE and the SE with respect to β. The transmitter's utility at the SE decreases with β, while the jammer's utility increases with it; this is because the larger β becomes, the closer the jammer is to the receiver. Moreover, in both Fig. 3 and Fig. 4, the transmitter has a lower utility at the NE than at the SE, because at the latter the transmitter knows of the existence of a jammer and uses its transmit power more efficiently. Similarly, the jammer obtains a higher utility at the SE than at the NE, owing to its ability to learn and adjust its own power according to the ongoing transmission power. These results show that, despite the jammer's ability to sense the active channel, the transmitter can enhance its efficiency by predicting the jammer's reaction to its own strategy.
We now consider a scenario with the power control strategy based on Q-learning. In this simulation, we set M = N = 20 and we set the maximum number of learning episodes to 120 to ensure that the transmitter can learn an optimal action. The learning rate is α_t = 0.8; it indicates how far the current estimate of Q is adjusted toward the update target. The discount factor of the source is β_t = 0.8; it reflects the increasing uncertainty about rewards that will be received in the future. We assume a transmitter that does not have complete knowledge of the dynamic environment, while the jammer has this knowledge. The value of ε for the ε-greedy algorithm is initialized at 0.5 to ensure that the transmitter can try all actions in all states repeatedly. The utility of the transmitter as a function of the learning episodes is shown in Fig. 5. We observe that the utility of the transmitter converges towards the solution obtained with the above model. This result validates the proposed power control model. Note that the transmitter gradually becomes aware of the dynamic environment as the number of learning episodes increases, which indicates good anti-jamming performance. This is because the transmitter chooses a more suitable power action once it has good knowledge of the environment.

VIII. CONCLUSION
In this paper, we studied the denial of service vulnerability of wireless networks in the presence of jamming attacks. We chose a Game theoretical approach, an abstract framework that describes how the final outcome of a competitive situation is dictated by the interactions among the players. We considered jamming during a transmission cycle. We studied the case where all players act simultaneously and the case where the transmitter is the leader and the jammer is the follower. We proposed a Nash Game for the simultaneous case and a Stackelberg Game for the hierarchical case. A closed form of the Nash Equilibrium was derived, and we then proved the existence of a Stackelberg Equilibrium. We solved the Stackelberg problem numerically using a simulated annealing algorithm. Moreover, we studied how the utilities of the players compare at the Nash and Stackelberg equilibria. In order to validate our Stackelberg model, we considered a Q-learning method used by the transmitter to determine its transmission power actions in the presence of a smart jammer in a dynamic environment, without knowing the underlying Game model. Simulation results have verified that, despite the jammer's ability to sense the active channel, the transmitter can enhance its efficiency by predicting the jammer's reaction to its own strategy. Finally, this work can be extended to the case of several jammers that operate on a single sub-carrier during a single time slot, in order to investigate the interaction among jammers that have an interest in damaging the source node's transmission.
Fig. 2. The assumption of $V_i(T^{NE}, J_i)$ with respect to $J_i$.

Fig. 3. The impact of α on the utility function of Jammer/Transmitter at NE and SE. β = 0.5.

Fig. 4. The impact of β on the utility function of Jammer/Transmitter at NE and SE. α = 0.5.

Fig. 5. Utility function of the transmitter, where the transmitter action is chosen based on Q-learning.
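Algorithm 2 Anti-Jamming Strategy of the transmitter with Q-learning
Require: $P \in S_t$
  Set the system parameters: $\beta_t$, $\alpha_t$, $\epsilon$, episode
  Set $s^t$, $A_{MN}$
  Initialize $Q(s^t, P)$, $V(s^t)$ as zero $\forall s^t$, $P \in A_{MN}$
  while (episode ≠ 0) do
    Set the starting state $s_1$
    ... (18).
    Break if convergence: small deviation on Q
    episode ← episode − 1
  end while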