A Randomized Hyperparameter Tuning of Adaptive Moment Estimation Optimizer of Binary Tree-Structured LSTM

Adam (Adaptive Moment Estimation) is one of the promising techniques for parameter optimization in deep learning, because it is an adaptive learning rate method and is easier to use than Gradient Descent. In this paper, we propose a novel randomized search method for Adam that randomizes the parameters beta1 and beta2. Random noise generated from a normal distribution is added to beta1 and beta2 every time the update function is called. In the experiment, we implemented a binary tree-structured LSTM and an Adam optimizer function. In the best case, randomized hyperparameter tuning with beta1 ranging from 0.88 to 0.92 and beta2 ranging from 0.9980 to 0.9999 is 3.81 times faster than the fixed parameters beta1 = 0.9 and beta2 = 0.999. Our method is independent of the optimization algorithm and can therefore be applied to other algorithms such as NAG, AdaGrad, and RMSProp.

Keywords: Adaptive moment estimation; gradient descent; tree-structured LSTM; hyperparameter tuning


I. INTRODUCTION
Optimization is involved in many deep learning algorithms, and analytical optimization underlies the design of many algorithms. Of the many optimization problems in deep learning, neural network training is the most challenging.
Among the hyperparameters to tune, the learning rate is one of the most difficult because it drastically affects model performance. The cost is often highly sensitive to some directions in parameter space, and the momentum algorithm can mitigate this problem to some extent. It also makes sense to adopt a separate learning rate for each parameter and to adapt these learning rates to the directions of sensitivity.
Adaptive optimization refers to incremental or mini-batch-based methods that adapt the learning rates of model parameters. Naturally, the choice of which algorithm to use is an unavoidable question. However, the selection depends largely on the user's familiarity with the algorithm. In this paper, we adopt the following hypotheses.
Hypothesis 1. There is no theorem or algorithm to determine which adaptive optimization algorithm is the best.
Then, as long as the optimal algorithm cannot be determined, we need a method that can be commonly used to refine any algorithm. Parameter randomization is a promising approach for developing a speed-up procedure that is independent of the adaptive optimization algorithm.
Hypothesis 2. According to Hypothesis 1, no optimization algorithm can be proven superior to another algorithm.
To address Hypothesis 2, we propose a novel method for randomized hyperparameter tuning of the adaptive moment estimation optimizer. The thrust of this paper is that the proposed method is independent of the adaptive optimization algorithm.
This paper is organized as follows. Section II introduces three basic and popular algorithms: AdaGrad, RMSProp, and Adam. Section III reviews related work on recurrent neural networks, recursive neural networks, LSTM, and adaptive optimization algorithms. Section IV discusses the proposed method, which is based on the binary tree of LSTM, the linear activation unit, and constrained breadth-first search. Section V discusses the research methodology for improving backpropagation, recurrent neural networks, and adaptive optimization algorithms. Section VI presents the experimental results, in which our randomized hyperparameter tuning method is applied to Adam. Section VII provides insights into the current state of research on hyperparameter optimization. Section VIII concludes the paper.

A. AdaGrad
The AdaGrad [18] algorithm adapts the learning rate of each parameter, scaling it inversely proportional to the square root of the sum of all the historical squared values of the gradient.
In equation (1), the squares of the gradients are accumulated into the vector s. Each s_i accumulates the squares of the partial derivative of the cost function with respect to the parameter θ_i. If the cost function is steep along the i-th dimension, s_i will get larger and larger at each iteration.
Equation (2) is almost identical to the update of Gradient Descent, with one big difference: the gradient vector is scaled down by a factor of √(s + ϵ).
AdaGrad is called an adaptive learning rate method because it decays the learning rate faster for steep dimensions than for dimensions with gentler slopes. The parameters with the largest partial derivatives of the loss have a correspondingly rapid decrease in their learning rate, whereas the parameters with small partial derivatives have a relatively small decrease in their learning rate. The net effect is greater progress in the more gently sloped directions of parameter space.
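The accumulation and scaling described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the function name `adagrad_step`, the toy cost, and the values of `lr` and `eps` are assumptions chosen only for the example.

```python
import math

def adagrad_step(theta, grad, s, lr=0.1, eps=1e-10):
    """One AdaGrad update on plain Python lists; returns the new (theta, s)."""
    # Accumulate the squared gradient of each parameter into the vector s.
    s = [s_i + g_i * g_i for s_i, g_i in zip(s, grad)]
    # Scale each gradient component down by sqrt(s_i + eps) before the step.
    theta = [t_i - lr * g_i / math.sqrt(s_i + eps)
             for t_i, g_i, s_i in zip(theta, grad, s)]
    return theta, s

def toy_grad(theta):
    # Hypothetical cost J = 10 * theta_0^2 + 0.1 * theta_1^2, steep along dimension 0.
    return [20.0 * theta[0], 0.2 * theta[1]]

theta, s = [1.0, 1.0], [0.0, 0.0]
for _ in range(50):
    theta, s = adagrad_step(theta, toy_grad(theta), s)
```

In this toy run the accumulator of the steep dimension, `s[0]`, grows much faster than `s[1]`, so the effective learning rate of the steep dimension decays faster.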

B. RMSProp
The RMSProp [19] algorithm modifies AdaGrad to perform better in the nonconvex setting by changing the gradient accumulation into an exponentially weighted moving average. AdaGrad is designed to converge rapidly when applied to a convex function. Training a neural network, however, generally involves a nonconvex function: the learning trajectory may pass through many different structures before it eventually reaches a region that is a locally convex bowl. RMSProp has an advantage over AdaGrad here, because AdaGrad slows down too rapidly and may end up never converging to the global optimum.
To avoid this, RMSProp accumulates only the gradients from the most recent iterations.
In equation (3), exponential decay is used. The decay rate β is typically set to 0.9. Except for very simple problems, this optimizer almost always performs much better than AdaGrad. In fact, it was the preferred optimization algorithm of many researchers until the Adam algorithm came around.
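The difference from AdaGrad can be seen in a short Python sketch. It is illustrative only: the function name `rmsprop_step`, the constant toy gradient, and the values of `lr` and `eps` are assumptions, while the decay rate 0.9 follows the text.

```python
import math

def rmsprop_step(theta, grad, s, lr=0.01, beta=0.9, eps=1e-10):
    """One RMSProp update on plain Python lists; returns the new (theta, s)."""
    # Exponentially weighted moving average of squared gradients:
    # old contributions decay by beta, so only recent iterations matter.
    s = [beta * s_i + (1.0 - beta) * g_i * g_i for s_i, g_i in zip(s, grad)]
    theta = [t_i - lr * g_i / math.sqrt(s_i + eps)
             for t_i, g_i, s_i in zip(theta, grad, s)]
    return theta, s

theta, s = [0.0], [0.0]
for _ in range(200):
    # Hypothetical constant gradient of 2.0, used only to show that s stays bounded.
    theta, s = rmsprop_step(theta, [2.0], s)
```

With a constant gradient g, the accumulator approaches g squared instead of growing without bound as in AdaGrad, so the step size does not shrink toward zero.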

C. Adam
Adam [20] is an adaptive learning rate optimization algorithm. The name Adam derives from the phrase "adaptive moments." Adam can be described as a variant on the combination of momentum and RMSProp, with some important distinctions. First, Adam incorporates momentum directly as an estimate of the first-order moment of the gradient, computed with exponential weighting. Second, Adam applies bias corrections to the estimates of both the first-order moments and the second-order moments to account for their initialization at the origin.

III. RELATED WORK
Recurrent neural networks [1], or RNNs, extend feedforward neural networks to process sequential data by incorporating edges that span adjacent time steps. In general, RNNs are difficult to train with gradient-based optimization procedures. Local numerical optimization, such as stochastic gradient descent or second-order methods, suffers from the exploding and vanishing gradient problems [13][14][15]. Werbos et al. [11] propose backpropagation through time (BPTT), which is a training algorithm for RNNs. BPTT is derived from the popular backpropagation training algorithm used in MLPN training [12]. Derivatives of errors are computed with backpropagation over structures [6].
Recursive neural networks are another generalization of recurrent networks that uses a different form of computation graph. The computation graph adopted in recursive neural networks is a deep tree instead of the chain-like structure of RNNs. Pollack [2] proposes recursive neural networks. Bottou [3] discusses the potential use of recursive neural networks in learning to reason. Recursive neural networks have been shown to be effective on different problems such as semantic analysis in natural language processing and image segmentation [4][5].
There is a long line of research on extending the standard LSTM [7] to adopt more sophisticated structures. Tai et al. [8] and Zhu et al. [9] propose tree-structured LSTMs, which extend chain-structured LSTMs by adopting branching factors. They demonstrate that such extensions outperform competitive LSTM baselines on several tasks such as semantic relatedness prediction and sentiment classification. Furthermore, Li et al. [10] show the effectiveness of the tree-structured LSTM on various tasks and discuss the situations in which a tree-like structure is effective.
Boris Polyak proposed Momentum optimization with terminal velocity [21]. In 1983, Yurii Nesterov proposed Nesterov Accelerated Gradient (NAG) [22]. NAG measures the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum. RMSProp [19] is an improved version of AdaGrad that accumulates only the gradients from the most recent iterations. Adam [20] is based on the ideas of both Momentum optimization and RMSProp: it keeps track of an exponentially decaying average of past gradients and an exponentially decaying average of past squared gradients. Santa (Stochastic Annealing Thermostats with Adaptive Momentum) [23] adapts Adam and RMSProp by leveraging MCMC (Markov Chain Monte Carlo) methods. GD by GD (Gradient Descent by Gradient Descent) [24] is based on the idea that the design of an optimization algorithm is itself a learning problem, so that the optimization structure is determined by learning. Its authors also propose an LSTM optimizer.
Adam and RMSProp adopt exponential decay through an exponential moving average. However, it has been reported that when the gradient information of a mini-batch disappears quickly because of the exponential decay, Adam and RMSProp do not converge to the optimal solution. AMSGrad [25] is an improved version of Adam that prevents important gradient information from disappearing immediately.
In Adam, the adaptive learning rate enables fast learning, but even after learning has progressed, the validation error does not converge well because of the high volatility of the learning rate. SGD, which uses a fixed learning rate, can reduce the final validation error, but it takes too much time to get to that point. To address these drawbacks, AdaBound and AMSBound were proposed as optimization methods that behave like Adam at the beginning and like SGD at the end [26].
As this section has shown, adaptive optimization algorithms are constantly evolving, but there is still no theorem or algorithm to judge which algorithm is the best. Even one of the latest algorithms [26] has not been proven superior to all others.

A. Binary Tree of LSTM
The Tree-LSTM is a generalization of long short-term memory (LSTM) networks to tree-structured network topologies, introduced in [9]. The core design concept is to introduce syntactic information for language tasks by extending the chain-structured LSTM to a tree-structured LSTM. Fig. 1 shows a chain-structured LSTM network. The lower side of Fig. 1 depicts a tree-structured LSTM network with an arbitrary branching factor. The tree-structured LSTM has been shown to perform well when the network must cope with combinations of words and phrases in natural language processing [8].
Recursion is a fundamental process in many different modalities and is associated with many phases. Recursive procedures and hierarchical structures commonly form in different modalities. Recursion is also a core technique for traversing a binary tree. Fig. 2 depicts the representation of the binary-tree LSTM with the unary operator. A binary tree is a tree whose elements have at most two children.
Because each element in a binary tree has at most two children, they are typically called the left and right child. The forward computation of an S-LSTM memory block is represented by the following equations.
Here, σ is the element-wise logistic function, which is adopted to restrict the gating signals to the range [0, 1]. f_L and f_R denote the left (L) and right (R) forget gates. b denotes the biases, and W denotes the weight matrices of the network. Finally, the sign * denotes the Hadamard product, which is also called the element-wise product.
More importantly, equations (14)-(20) consist of binary operators, so they can be represented as a binary tree. A binary tree is a fundamental data structure in which each element has at most two children, typically named the left and right children. In computation, a binary tree consists of nodes, where each node contains an L ("left") reference, an R ("right") reference, and a data element. The topmost (or bottommost) node in the tree is called the root node.

In artificial neural networks, a node's activation function defines the output of that node given an input or set of inputs. In the input-output model, ψ is an activation function such as Tanh or ReLU. Class FunctionLinear implements the function $\sum_{i=0}^{n} w_i x_i + b$. The notation *creator denotes the pointer to the function that generated a variable. For example, FunctionLinear outputs r, which is equal to $\sum_{i=0}^{n} w_i x_i + b$ and is passed to FunctionTanh; the creator of the variable r is FunctionLinear.

Fig. 3 also illustrates the detailed implementation of the inheritance of functions and variables of the tree-structured LSTM. Inheritance, shown in the middle of Fig. 3, enables us to define classes that model relationships among types by sharing what is common and specializing only what is inherently different. Derived classes inherit the members defined by the base class. A derived class can use, without change, those operations that do not depend on the specifics of the derived type, and it can redefine those member functions that do depend on its type, specializing the function to take into account the peculiarities of the derived type. Finally, a derived class may define additional members beyond those it inherits from its base class.
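The relationship between functions, variables, and the creator pointer can be sketched as follows. Only the names FunctionLinear, FunctionTanh, and creator come from the text; the base classes `Variable` and `Function`, the method names, and the numeric values are hypothetical and do not reproduce the implementation shown in Fig. 3.

```python
import math

class Variable:
    """Holds a value and a reference to the function that created it."""
    def __init__(self, data, creator=None):
        self.data = data
        self.creator = creator  # None for input variables

class Function:
    """Base class: shared call logic; derived classes redefine forward()."""
    def __call__(self, variable):
        self.input = variable
        return Variable(self.forward(variable.data), creator=self)

    def forward(self, x):
        raise NotImplementedError

class FunctionLinear(Function):
    """Computes sum_i w_i * x_i + b."""
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def forward(self, xs):
        return sum(w * x for w, x in zip(self.weights, xs)) + self.bias

class FunctionTanh(Function):
    """Applies the Tanh activation to a scalar."""
    def forward(self, r):
        return math.tanh(r)

x = Variable([1.0, 2.0])
r = FunctionLinear([0.5, -0.25], 0.1)(x)  # weighted sum plus bias
y = FunctionTanh()(r)                      # activation applied to r
```

Following `y.creator`, then `r.creator`, walks backward from the output to the functions that produced it, which is the kind of traversal a backward pass needs.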

C. Constrained Breadth-first Search
As we discussed in Section I-B, a tree-structured LSTM graph is generated for each mini-batch. Fig. 4 depicts the model of a few tree-structured LSTM graphs for mini-batches. As usual, a breadth-first search (BFS) is applied for the recursive search of the tree structure. However, other procedures in our model, such as loss, MSE (Mean-Square Error), and Tanh, should be skipped before the program reaches the LSTM tree, as shown in the lower-left side of Fig. 4. We therefore modify the recursion algorithm as shown in Algorithm 1. Broadly, breadth-first search is an algorithm for the traversal of tree (or graph) data structures. BFS begins at the tree root and searches all of the neighbor nodes at the present depth before it proceeds to the nodes at the next depth.
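One possible reading of this constrained traversal is sketched below. It is not Algorithm 1 itself: the `Node` class, the node labels, and the choice to filter nodes by a `kind` string are assumptions made for illustration, and the sketch assumes the graph is a tree, so no visited set is kept.

```python
from collections import deque

class Node:
    """Hypothetical graph node with a kind label and child references."""
    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = list(children)

def constrained_bfs(root, skip_kinds=("loss", "mse", "tanh")):
    """Traverse breadth-first, but report only nodes whose kind is not skipped."""
    visited = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.kind not in skip_kinds:
            visited.append(node.kind)
        # Children are enqueued either way, so the search still reaches the LSTM tree.
        queue.extend(node.children)
    return visited

lstm_tree = Node("lstm_root", [Node("lstm_left"), Node("lstm_right")])
graph = Node("loss", [Node("mse", [Node("tanh", [lstm_tree])])])
order = constrained_bfs(graph)
```

Here the loss, MSE, and Tanh nodes are passed over, and the LSTM nodes are reported level by level, root first.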
[Algorithm 1: only the condition "is_last_backward = false" survives from the listing.]

Truncated Backpropagation Through Time (TBPTT), an online version of BPTT, is proposed in [16]. TBPTT works analogously to BPTT, but the sequence is processed one time step at a time, and periodically the BPTT update is performed back for a fixed number of time steps. In [16], the accumulation stops after a fixed number of time steps. Truncated BPTT performs well if the truncated chains are sufficient for learning the recursive target functions.

B. LSTM
Long short-term memory (LSTM) [7] is a family of recurrent neural networks. Like other recurrent neural networks, LSTM has feedback connections. The memory cell itself is controlled by a forget gate, which can reset the memory unit with a sigmoid function. In detail, given sequence data x_1, ..., x_T, the gates are defined as follows:
where f_t is the forget gate, i_t the input gate, o_t the output gate, and g_t the input modulation gate. P_f, P_i, and P_o denote the peephole weights for the forget, input, and output gates, respectively. The peephole connections introduced in [17] enable the LSTM cell to inspect its current internal states. Then, the backpropagation of the LSTM at the current time step t is as follows:

C. Adam
Adam [20] stands for adaptive moment estimation. Adam optimization combines the ideas of momentum optimization and RMSProp. Like momentum optimization, Adam keeps track of an exponentially decaying average of past gradients. Like RMSProp, it keeps track of an exponentially decaying average of past squared gradients.
In steps 1, 2, and 5, Adam closely resembles both momentum optimization and RMSProp.
The only difference is that step 1 of Adam computes an exponentially decaying average rather than the exponentially decaying sum used in momentum optimization. These two quantities are equivalent except for a constant factor.
Steps 3 and 4 are a technical detail. Because m and s are initialized at 0, they are biased toward 0 at the start of training. Steps 3 and 4 help boost m and s in the early phase of training.
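The five steps discussed above can be sketched as follows, under the assumption that they correspond to the standard Adam update: the two moving averages (steps 1 and 2), the two bias corrections (steps 3 and 4), and the parameter update (step 5). The function name `adam_step` and the default values of `lr` and `eps` are illustrative, and the sign convention of m may differ from the listing the text refers to.

```python
import math

def adam_step(theta, grad, m, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on plain Python lists; t is the 1-based step count."""
    # Step 1: exponentially decaying average of past gradients.
    m = [beta1 * m_i + (1.0 - beta1) * g_i for m_i, g_i in zip(m, grad)]
    # Step 2: exponentially decaying average of past squared gradients.
    s = [beta2 * s_i + (1.0 - beta2) * g_i * g_i for s_i, g_i in zip(s, grad)]
    # Steps 3 and 4: bias corrections, which boost m and s early in training.
    m_hat = [m_i / (1.0 - beta1 ** t) for m_i in m]
    s_hat = [s_i / (1.0 - beta2 ** t) for s_i in s]
    # Step 5: parameter update scaled by the corrected second moment.
    theta = [t_i - lr * mh / math.sqrt(sh + eps)
             for t_i, mh, sh in zip(theta, m_hat, s_hat)]
    return theta, m, s

theta, m, s = [0.0], [0.0], [0.0]
theta, m, s = adam_step(theta, [3.0], m, s, t=1)
```

At t = 1 the bias corrections exactly undo the initialization at zero, so the first step has a magnitude of almost exactly `lr` regardless of the gradient scale.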

VI. EXPERIMENT
In this section, we describe the experimental results of training on and generating a sine wave. In the experiment, we use a workstation with an Intel(R) Xeon(R) CPU E5-2620 v4 (2.10 GHz) and 252 GB of RAM.
Adam uses the moving average of the gradient, m_t, as well as v_t, the moving average of the squared gradient adopted by RMSProp and AdaDelta.
An optimization problem requires a search for good hyperparameters. The hyperparameters are the variables to be decided, and the cost to be optimized is the validation set error. To evaluate our method, we generate a sine wave with random noise drawn from a normal distribution. Then, we apply curve fitting to the generated sine wave.
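The data generation step can be sketched as follows. The number of points, the noise standard deviation, the random seed, and the function names are assumptions for illustration; the text does not state the values used in the experiment.

```python
import math
import random

def noisy_sine(n_points, noise_std, rng):
    """Sample one period of sin(x) and add zero-mean normal noise to each value."""
    xs = [2.0 * math.pi * i / n_points for i in range(n_points)]
    ys = [math.sin(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys

def mean_squared_error(pred, target):
    """Mean-square error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

rng = random.Random(42)  # fixed seed so the sketch is reproducible
xs, ys = noisy_sine(n_points=200, noise_std=0.1, rng=rng)
clean = [math.sin(x) for x in xs]
noise_level = mean_squared_error(clean, ys)  # near noise_std ** 2 on average
```

A curve-fitting model trained on `ys` cannot be expected to push its error far below `noise_level`, which gives a rough reference point for the validation loss.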
For hyperparameter tuning, we use three test cases. In the first case, β1 and β2 are fixed to 0.9 and 0.999. In the second case, β1 ranges from 0.89 to 0.91 and β2 ranges from 0.9985 to 0.9995. In the third case, β1 ranges from 0.88 to 0.92 and β2 ranges from 0.9980 to 0.9999. Figs. 6, 7, and 8 show the curve-fitting results of the three test cases. Fig. 9 compares the validation loss of the three test cases. Test case 3, with β1 ranging from 0.88 to 0.92 and β2 ranging from 0.9980 to 0.9999, has the best performance: its validation loss decreases rapidly compared with the other two cases.
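The randomization of β1 and β2 can be sketched as follows for test case 3. Several details are assumptions, because the text does not specify them: the noise is centered on 0.9 and 0.999, the standard deviations are 0.01 and 0.0005, out-of-range draws are clipped to the stated ranges, and the bias correction uses the values drawn at the current step. The toy cost and the function names are also hypothetical.

```python
import math
import random

def randomized_beta(center, low, high, sigma, rng):
    """Add normal noise to center and clip the result to [low, high]."""
    noisy = center + rng.gauss(0.0, sigma)
    return min(max(noisy, low), high)

def randomized_adam_step(theta, grad, m, s, t, rng, lr=0.001, eps=1e-8):
    """Scalar Adam step that redraws beta1 and beta2 on every call."""
    beta1 = randomized_beta(0.9, 0.88, 0.92, 0.01, rng)          # assumed sigma
    beta2 = randomized_beta(0.999, 0.9980, 0.9999, 0.0005, rng)  # assumed sigma
    m = beta1 * m + (1.0 - beta1) * grad
    s = beta2 * s + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    s_hat = s / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / math.sqrt(s_hat + eps)
    return theta, m, s

rng = random.Random(0)
theta, m, s = 5.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2.0 * theta  # gradient of the toy cost theta ** 2
    theta, m, s = randomized_adam_step(theta, grad, m, s, t, rng)
```

Because the randomization only touches the decay rates, the same wrapper could in principle be placed around the decay rate of RMSProp or the momentum coefficient of NAG.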
The results in Fig. 9 suggest that early stopping may be applicable. Early stopping is another approach to regularizing iterative learning algorithms, including Adam and Gradient Descent: training is stopped as soon as the validation loss reaches a minimum. Early stopping is a simple and powerful regularization technique. Table I shows the validation loss of the three test cases. At step 5 with epoch size 750, the elapsed time of the third case, with β1 ranging from 0.88 to 0.92 and β2 ranging from 0.9980 to 0.9999, is 0.0493266, which is 1.81 times faster than the second case and 3.81 times faster than the first case.
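A minimal sketch of early stopping follows. The `patience` parameter is an assumption added because a minimum can only be recognized after the loss has stopped improving for some epochs, and the loss values are made up for the example; they are not results from Table I.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, best_loss), stopping once the loss has not
    improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # stop training; later epochs are never run
    return best_epoch, best_loss

# Hypothetical validation losses, one per epoch.
losses = [0.9, 0.5, 0.3, 0.35, 0.4, 0.45, 0.2]
best_epoch, best_loss = early_stopping(losses)
```

In this made-up trace, training stops after three epochs without improvement and the model from epoch 2 is kept, even though a later epoch would have been better. That is the usual trade-off when choosing the patience.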

VII. DISCUSSION
In the process of discussing a series of optimization algorithms, a question now arises: which algorithm should one choose?
Schaul et al. [27] present a comparative study of a large number of optimization algorithms across a wide range of learning tasks. According to this study, although the family of optimization algorithms with adaptive learning rates, such as RMSProp and AdaDelta, works fairly robustly, no single best algorithm has emerged.
On the other hand, a drawback common to most hyperparameter optimization algorithms is that a training experiment must run to completion before they can retrieve any information from it. A more sophisticated (automated) search is therefore usually much less efficient than a manual search by a human practitioner, partly because a human can often recognize early that a set of hyperparameters is completely pathological. Broadly, in this context, the choice of which algorithm to use largely depends on the user's familiarity with the algorithm.
Generally, adaptive optimization algorithms are recommended. However, Wilson et al. [28] pointed out that AdaGrad, RMSProp, and Adam generalize poorly on some datasets. It follows that we may want to stick to alternatives such as Momentum optimization or Nesterov Accelerated Gradient until researchers have a better understanding of this issue. In this situation, we can conclude that our method is helpful and practical because it is independent of the optimization algorithm.

VIII. CONCLUSION
Adam (Adaptive Moment Estimation) is one of the promising techniques for parameter optimization in deep learning. In this paper, we propose a novel random search method for Adam that randomizes the parameters β1 and β2. Random noise generated from a normal distribution is added to β1 and β2 every time the update function is called. In the experiment, we implemented a binary tree-structured LSTM and an Adam optimizer function.
Many research efforts have sought to address the challenge of optimizing deep models by adapting the learning rate for each model parameter. However, there is currently no consensus on which algorithm is best. Our method of randomized hyperparameter tuning is independent of the optimization method and can therefore be applied to various algorithms such as NAG, AdaGrad, and RMSProp. In the best case, randomized hyperparameter tuning with β1 ranging from 0.88 to 0.92 and β2 ranging from 0.9980 to 0.9999 is 3.81 times faster than the fixed parameters β1 = 0.9 and β2 = 0.999. We conclude that adding random noise to the fixed parameters β1 and β2 is effective and reasonable compared with a naive manual search.