Indoor Localization and Navigation based on Deep Learning using a Monocular Visual System

Nowadays, artificial vision systems rely on computer systems to analyze the acquired data and carry out crucial tasks such as localization and navigation. For successful navigation, the robot must interpret the acquired data and determine its position to decide how to move through the environment. This paper proposes a visual localization and navigation approach for autonomous indoor mobile robots. A convolutional neural network and background modeling are used to locate the system in the environment. Object detection is based on copy-move detection, an image forensic technique that extracts features from the image to identify similar regions. An adaptive threshold is proposed to handle illumination changes. The detected object is classified by a control deep neural network so that it can be evaded. A U-Net model is implemented to track the path trajectory. The experimental results were obtained from real data and support the efficiency of the proposed algorithm. The adaptive threshold addresses illumination variation issues in object detection.

Keywords: Visual localization; visual navigation; autonomous navigation; feature extractor; object detection


I. INTRODUCTION
Computer systems are important for autonomous mobile robots to perform crucial tasks using sensors to extract useful information from the environment [1]. Vision-based sensors are powerful tools, providing the capability to interact with the environment [2], [3]. Cameras are used as sensing equipment for environment perception in artificial vision systems; the relevant features extracted from the images enable autonomous navigation, 3D mapping, visual robot localization, and object recognition [4], [5], [6].
Visual perception describes the environment by extracting key-points around the system. The extracted key-points can give a mobile robot full autonomy [6], [7], [8].
The robot localization problem involves estimating the absolute current position of the system, and also its position relative to an object [2], [9], [10]. Localization is mandatory for successful indoor navigation, since it situates the system and the goal on a map.
Many visual localization methods are based on deep learning extractors. In [11], Perez, Caballero and Merino proposed a method extending Monte Carlo localization, which extracts information from the images to perform visual recognition within a map-based localization approach. Fu et al. [12] applied 3D mapping with an RGB-D camera, using an improved ORB algorithm as a feature extractor and measuring the similarity of the descriptors with the Hamming distance. Chen and Cheng proposed a cloud-based visual SLAM in [13], using oriented FAST and ORB features to build the dense map and estimate the position. Chiang, Hsia and Hsu [14] proposed a stereo vision self-localization system that applies a trigonometric function to find coarse distances from the robot to the landmark; these distances are then fed to a neural network to increase the precision of the distance measurement. Xu, Chou and Dong [15] presented a multi-sensor indoor localization algorithm that integrates a convolutional neural network (CNN) based on image retrieval to determine the location of the robot. Xie, Wang, Li and Tang [16] use CNN-based robot localization to recognize the scene, reducing the error when relocating the robot.
Object detection is necessary to predict the position of obstacles and avoid them, and to move through the environment using landmarks extracted from the images to detect objects and predict the next state of the system. Many approaches to object detection use image feature extractors based on edge, color and texture features. Edge feature methods aim to identify key-points from brightness changes in images. Color feature detectors compare the nearest regions around key-points. Texture feature methods use regions of interest to classify pixel intensity levels in a neighborhood; an example is copy-move detection algorithms, which search for similar areas in images [17], [18], [19], [20], [21]. Ahn and Chung [22] combined the advantages of the multiscale Harris corner detector and SIFT descriptors for object detection, using depth range features for metric information. Zheng, Barahimi, Aoun and Amar [23] present a boosted convolutional neural network for object recognition, using boosted blocks in a succession of convolutional layers. Pertusa, Gallego and Bernabeu [24] propose a smartphone application that performs object recognition using a CNN.
Robot navigation uses images to determine the system position, with key-points as representations of the environment, and to detect obstacles [4], [5], [6], [11], [25]. Indoor navigation incorporates some knowledge of the environment, in particular a space representation such as a map. Navigation approaches are divided into three groups according to the map representation: map-based navigation, where the system depends on a previously created model or topological map; map-building-based navigation, where a topological representation is constructed while the robot navigates the environment; and mapless navigation, where the system has no spatial representation of the environment [26].
Visual navigation describes the environment using landmarks in the image to build the map, locate the robot and detect obstacles. Sineglazov and Ischenko [27] used the SURF and RANSAC algorithms to locate landmarks and a neural network to determine the object coordinates. Yilmaz and Gupta [28] focus on combining signals from visual and inertial sensors for an indoor environment; their approach is based on visual odometry and a building information model. Shao et al. [29] fuse the visual and inertial navigation systems and apply a Kalman filter to reduce the errors of visual navigation. Sineglazov and Ischenko [30] analyze the characteristic features of different methods to develop an algorithmic maintenance for intellectual visual navigation based on neural networks.
The present approach addresses indoor map-based localization and navigation problems. A deep convolutional neural network is used to locate the system on the map, and the pose estimation is based on background modeling for feature extraction. A copy-move detection algorithm is applied to detect obstacles, and a neural network is used to evade them. Furthermore, a U-Net model is used for path trajectory tracking. The paper is organized as follows: Section II presents the proposed method for visual localization and navigation; Section III presents the experimental results and analysis; Section IV contains the conclusions.

II. PROPOSED METHOD
The present paper provides a solution to the indoor navigation problem, including map-based robot localization in the environment and trajectory planning using visual information from the mobile robot. Fig. 1 presents a general diagram of the proposed algorithm.

A. Robot Localization
Robot visual localization interprets the image information to estimate the current position of the system. Artificial neural networks are computational models capable of adapting their behavior in response to stimuli from the environment [31]. The proposed visual localization stage is based on a deep neural network (Fig. 2). The convolutional deep neural network works as a feature extractor and classifies the image to locate the system in the environment; afterward, the robot position is measured using the extracted interest points. The architecture of the deep neural network consists of three blocks: the first block has two CNN blocks with 64 filters, a (3,3) kernel size, a stride of (1,1), a ReLU activation function, and average pooling; the second block has two CNN blocks with 128 filters, a (5,5) kernel size, a stride of (2,2), a ReLU activation function, and average pooling; the third block has two CNN blocks with 256 filters, a (5,5) kernel size, a stride of (2,2), a ReLU activation function, and average pooling. A Dense block is then used after the Flatten stage to classify eight classes. An Adam optimizer with a learning rate of 0.0005 is used to get more accurate results.
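The layer sizes listed above can be checked with a short shape calculation. The sketch below is illustrative only: the 224x224 input, the 'same' padding and the 2x2 pooling window are assumptions that the text does not state, and the helper names are hypothetical.

```python
import math

# (filters, kernel, stride) of the two convolutions in each block, as listed in the text.
BLOCKS = [
    (64, 3, 1),
    (128, 5, 2),
    (256, 5, 2),
]


def feature_map_sizes(input_size, blocks=BLOCKS, pool=2):
    """Return [(spatial_size, channels), ...] after each block.

    Assumptions (not stated in the text): square inputs, 'same' padding,
    so a convolution outputs ceil(size / stride) regardless of the kernel
    size, and average pooling with a 2x2 window and stride 2.
    """
    size = input_size
    sizes = []
    for filters, _kernel, stride in blocks:
        for _ in range(2):              # two convolutional layers per block
            size = math.ceil(size / stride)
        size = size // pool             # pooling halves the spatial size
        sizes.append((size, filters))
    return sizes


def flatten_length(input_size):
    """Number of values reaching the Dense block after the Flatten stage."""
    size, channels = feature_map_sizes(input_size)[-1]
    return size * size * channels


if __name__ == "__main__":
    print(feature_map_sizes(224))       # hypothetical 224x224 input
    print(flatten_length(224))
```

Under these assumptions the strided blocks shrink the feature maps quickly, which keeps the Dense block small; a different padding or input size would change the numbers but not the procedure.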
Illumination changes influence the image scene in indoor visual localization. The neural network training process applies image processing (gamma transformation, logarithmic transformation, blurring, rotation and translation) to reduce the classification error and increase the accuracy for each area of the environment, such as rooms or corridors. The localization stage is completed by estimating the position of the robot.
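As an illustration of the intensity transformations named above, the following minimal sketch applies a gamma and a logarithmic transformation to 8-bit grayscale values. The scaling constants and function names are assumptions chosen for the example, not the settings used in the experiments.

```python
import math


def gamma_transform(pixel, gamma):
    """Gamma correction of one 8-bit intensity: 255 * (p / 255) ** gamma."""
    return round(255 * (pixel / 255) ** gamma)


def log_transform(pixel):
    """Logarithmic transformation, scaled so that 255 maps to 255."""
    c = 255 / math.log(256)
    return round(c * math.log(1 + pixel))


def augment(image, gamma):
    """Return gamma- and log-transformed copies of a grayscale image.

    The image is a list of rows of 8-bit intensities.
    """
    gamma_image = [[gamma_transform(p, gamma) for p in row] for row in image]
    log_image = [[log_transform(p) for p in row] for row in image]
    return gamma_image, log_image


if __name__ == "__main__":
    darker, compressed = augment([[0, 64, 128, 255]], gamma=2.2)
    print(darker, compressed)
```

A gamma below 1 brightens dark regions and a gamma above 1 darkens them, so varying it simulates illumination changes in the training images.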
Subsequently, the background model is used as a reference image. The diagram of the background subtraction is shown in Fig. 3. The main idea is to extract points from the image to generate a background model and to analyze the image frames using an adaptive threshold to detect objects when the system is static. On the other hand, copy-move detection is used to detect objects while the system is moving [32]. One of the most frequent image forgery attacks consists of duplicating a fragment and placing it elsewhere in the image. Copy-move detection is a common image forensic technique that detects tampered zones containing a duplicated region of the digital image in order to verify its originality [33], [34], [35], [36], [37], [38].
Key-points on the background model are used for the identification of possible objects.
B is the background model, I_i is the i-th frame, and N is the number of frames. A subtraction between the current frame and the reference image is applied to estimate the position of the system. This process identifies pixel intensity variations between the current frame and the background to detect possible objects. Stable and invariant image features are then extracted to obtain the position of the robot and detect obstacles, using an adaptive threshold based on copy-move detection, as in [35], [39]. The reference image S from the current image sequence and the one from the last image sequence are segmented into blocks, and the Discrete Cosine Transform (DCT) is applied [40].
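The background subtraction step can be sketched as follows. The pixel-wise mean over the N frames is an assumption about how B is built, and the fixed threshold is only a stand-in for the adaptive threshold described in the text; the function names are hypothetical.

```python
def background_model(frames):
    """Pixel-wise mean of N grayscale frames (each frame is a list of rows)."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]


def foreground_mask(frame, background, threshold):
    """Mark pixels whose absolute difference from the background exceeds the threshold."""
    return [[1 if abs(p - b) > threshold else 0 for p, b in zip(f_row, b_row)]
            for f_row, b_row in zip(frame, background)]


if __name__ == "__main__":
    frames = [[[10, 10], [10, 10]], [[12, 12], [12, 12]]]
    model = background_model(frames)
    print(foreground_mask([[11, 200], [11, 11]], model, threshold=20))
```

Pixels set to 1 in the mask are candidates for a possible object; the remaining pixels are treated as background.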
The first stage of thresholding is obtained from C, using the variance σ² of the values of each block C over the N frames.
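A minimal sketch of the two ingredients of this stage, a block DCT and a variance, is given below. It uses the orthonormal DCT-II and the population variance; how the paper combines them into the threshold is not reproduced here, and the function names are hypothetical.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a sequence."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        total = sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i, v in enumerate(values))
        out.append(scale * total)
    return out


def dct_2d(block):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def block_variance(block):
    """Population variance of all values in a block."""
    values = [v for row in block for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


if __name__ == "__main__":
    coefficients = dct_2d([[1, 2], [3, 4]])
    print(coefficients, block_variance(coefficients))
```

For a flat block all the energy ends up in the first (DC) coefficient, so the spread of the remaining coefficients is a simple indicator of texture in the block.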
The Discrete Wavelet Transform (DWT) is applied to each block B and divides the image into signal components through level filters: approximations (low frequency) and details (high frequency). The approximation sub-band is used as a noise filter and to obtain characteristic features from the image.
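The decomposition into an approximation and detail sub-bands can be illustrated with a one-level 2D Haar transform. The Haar wavelet is an assumption made for the example, since the wavelet family is not specified in the text, and the sub-band names are hypothetical.

```python
def haar_dwt_2d(block):
    """One-level orthonormal 2D Haar transform of a block with even height and width.

    Returns four sub-bands: the approximation (low frequency) and three
    detail bands (differences between columns, between rows, and diagonal).
    """
    approx, col_diff, row_diff, diag = [], [], [], []
    for r in range(0, len(block), 2):
        a_row, c_row, r_row, d_row = [], [], [], []
        for c in range(0, len(block[0]), 2):
            tl, tr = block[r][c], block[r][c + 1]
            bl, br = block[r + 1][c], block[r + 1][c + 1]
            a_row.append((tl + tr + bl + br) / 2)
            c_row.append((tl - tr + bl - br) / 2)
            r_row.append((tl + tr - bl - br) / 2)
            d_row.append((tl - tr - bl + br) / 2)
        approx.append(a_row)
        col_diff.append(c_row)
        row_diff.append(r_row)
        diag.append(d_row)
    return approx, col_diff, row_diff, diag


if __name__ == "__main__":
    print(haar_dwt_2d([[1, 2], [3, 4]]))
```

Keeping only the approximation sub-band discards the high-frequency details, which is why it acts as a noise filter.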
The feature extraction allows the detection and recognition of the object, through the correlation between the images.
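One common way to measure the correlation between two sets of extracted features is the Pearson coefficient. The sketch below assumes the features have been flattened into equal-length vectors; it is not necessarily the correlation measure used by the authors.

```python
import math


def correlation(u, v):
    """Pearson correlation coefficient between two equal-length feature vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm = math.sqrt(sum((a - mean_u) ** 2 for a in u) *
                     sum((b - mean_v) ** 2 for b in v))
    # A constant vector has no variation, so the correlation is reported as 0.
    return cov / norm if norm else 0.0


if __name__ == "__main__":
    print(correlation([1, 2, 3], [2, 4, 6]))
```

Values close to 1 indicate that two regions vary together and are therefore likely to show the same object.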
The descriptors are computed using the block statistics.
One feature vector is extracted from the current reference image and another from the last reference image. The second stage of the adaptive threshold avoids errors in object detection and localization. Each matrix of the current image is shifted to the right by columns and down by rows, as in [39], and the descriptors are computed from the shifted matrices.
The thresholds are then generated as in [39], and the descriptors are compared to obtain the interest points.
The comparison uses two conditions, a and b, on the thresholds of [39]: one bounds an absolute difference and the absolute mean by thresholds, and the other bounds a value between two thresholds. The object is detected and localized from the changes in pixel values over the image sequence. A neural control network is then applied for object classification and evasion: it recognizes the detected objects and performs the control tasks needed to evade them.
Once the interest points are detected, the robot position is estimated. The camera position O is at the bottom center of the image, as in Fig. 4.
The distance h between the camera and the object is given in centimeters; one value is acquired by the ultrasonic sensor and another is calculated with the camera.
For the ultrasonic sensor, the measurement depends on the start time and the final time of the pulse. The width of the detected obstacle is then obtained.
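For the ultrasonic measurement, the usual time-of-flight relation can be sketched as below. The speed of sound constant and the function name are assumptions for illustration; the sensor model and its calibration are not given in the text.

```python
SPEED_OF_SOUND_CM_PER_S = 34300  # approximate speed of sound in air at about 20 °C


def ultrasonic_distance_cm(t_start, t_end):
    """Distance in centimeters from the start and final time of the echo pulse.

    The pulse travels to the object and back, so the path length is halved.
    Times are in seconds.
    """
    return (t_end - t_start) * SPEED_OF_SOUND_CM_PER_S / 2


if __name__ == "__main__":
    print(ultrasonic_distance_cm(0.0, 0.001))  # echo received after 1 ms
```

The value can then be compared with the distance calculated from the camera image.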

B. Robot Navigation
The neural control network classifies the detected objects and performs the control tasks to evade them. Path planning is implemented as a control system using a neural network approach with predefined controls to evade the recognized object, considering a system with multiple inputs and multiple outputs.
The system considered involves both non-moving and moving rigid objects, and the robot state is defined by its position (p), rotation matrix (q), velocity (v), and angular velocity (w).

Each control defines a force-torque pair (F, τ) acting on the mass center B, from which the motion equations are defined. The system uses a set of predefined controls that are evaluated against the current system state.
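For reference, the standard Newton-Euler equations of a rigid body driven by a force-torque pair (F, τ) are given below. This is a textbook form, not necessarily the one used by the authors: the mass m and the inertia tensor J are symbols introduced here, and w, τ and J are assumed to be expressed in the world frame.

```latex
\begin{aligned}
\dot{p} &= v, & m\,\dot{v} &= F,\\
\dot{q} &= [w]_{\times}\, q, & J\,\dot{w} + w \times (J\,w) &= \tau ,
\end{aligned}
```

Here $[w]_{\times}$ denotes the skew-symmetric matrix of w, so that the first column describes the kinematics of the state (p, q) and the second its dynamics (v, w).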

The set of predefined controls is denoted U.
Evaluating the states selects the control used to evade the object and track the trajectory. The trajectory is predefined using reinforcement learning with a cost function, based on the Euclidean distance, that assigns priorities to the movements.
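The role of the Euclidean cost in prioritizing movements can be illustrated with a small ranking function. The grid moves and names below are hypothetical, and the sketch shows only the cost-based ordering, not the reinforcement learning procedure itself.

```python
import math

# Hypothetical set of grid moves; the actual predefined controls are not listed in the text.
MOVES = {"forward": (0, 1), "left": (-1, 0), "right": (1, 0), "backward": (0, -1)}


def rank_moves(position, goal, blocked=()):
    """Order the available moves by the Euclidean distance to the goal after the move."""
    candidates = []
    for name, (dx, dy) in MOVES.items():
        nxt = (position[0] + dx, position[1] + dy)
        if nxt in blocked:
            continue  # skip cells occupied by a detected obstacle
        candidates.append((math.dist(nxt, goal), name))
    return [name for _cost, name in sorted(candidates)]


if __name__ == "__main__":
    print(rank_moves((0, 0), (0, 3)))
    print(rank_moves((0, 0), (0, 3), blocked={(0, 1)}))
```

The first entry of the returned list is the highest-priority movement; when an obstacle blocks it, the next cheapest movement takes its place.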
Then the trajectory is tracked by the behavioral cloning network. The system learns the trajectory and reproduces the user's control, using the knowledge of a human pilot to train a convolutional neural network (CNN) that receives the current camera frame as input and outputs the next state. The idea is to reinforce the knowledge of the trajectory using a U-Net model [41], [42].
Successful path tracking depends on the path and on specific object recognition. Path recognition in an image requires semantic segmentation. Path detection describes the system's movements in the environment [43]. The predicted class P(class) represents the coincidence between the input image and the predicted class, taking into account the probability that a specific class exists and the detected object (in this case, the trajectory). During training, the categorical cross-entropy is used as the loss function, where y is the training image and ȳ is the label. The categorical cross-entropy compares the distribution of the predictions with the true distribution, in which the probability of the true class is 1 and that of every other class is 0.
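The categorical cross-entropy mentioned above has the standard form of minus the sum, over classes, of the true probability times the logarithm of the predicted probability. A minimal sketch for a single pixel or sample follows; the function names and the clipping constant are assumptions for the example.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss between a one-hot label and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


if __name__ == "__main__":
    probabilities = softmax([0.5, 2.0, 0.1])
    print(categorical_cross_entropy([0, 1, 0], probabilities))
```

Because the label is 1 only for the true class, the loss reduces to the negative logarithm of the probability predicted for that class.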

III. EXPERIMENTAL RESULTS AND ANALYSIS
In this section the results are presented for the visual robot localization, object detection and trajectory tracking using a mobile robot with a Raspberry Pi 4 and a Raspberry V2 module (Fig. 5).
The proposed neural networks were trained under the Windows 10 operating system using the TensorFlow and PyTorch deep learning frameworks, on an Intel® Core i7-6700 CPU @ 3.40 GHz with 8 GB of memory and an NVIDIA GeForce GTX 960 graphics card.

A. Robot Localization
The algorithm for robot localization uses features extracted from images as landmarks to determine where the robot is, by comparing these features with the environment image.
The class coincidence in Table I shows good performance, with an accuracy of 88%, giving high efficiency for robot localization. To increase the performance of the localization neural network, more images for each class have to be added to the dataset.
The training dataset uses 1532 images from Room1, 1665 images from Room2, 1869 from Hall and 2052 from Room3; the validation dataset uses 1002 images from Room1, 1105 images from Room2, 1157 from Hall and 2500 from Room3; and the test dataset uses 600 images. In addition, image pre-processing (translation, rotation, and blurring) was applied to increase the classification accuracy. The neural network for localization, presented in Fig. 6, reaches an accuracy of 95%, reducing the classification error. The position of the system is then estimated by extracting key points from the background model and from the object detection. The proposed approach tracks all the detected key points over the frame sequence, obtaining the robot position and locating the obstacle as in Fig. 7. The main advantage is that points are detected and matched despite changes in illumination or system position. The proposed algorithm is evaluated using the accuracy and the feature ratio sequence rs.
In the feature ratio sequence rs (30), np are the detected points on the current image, which are compared against the detected points on the previous image frame. To analyze the effectiveness of the object detection algorithm, it is compared with popular algorithms in Table II. The proposed method performs well, tracking 97% of the detected features. Fig. 8 shows the behavior of the precision over a sequence, where the proposed method reaches 90%. The extracted features can be tracked without losing them over the frame sequence; the ratio sequence varies because the robot position changes.
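A feature ratio sequence of this kind can be computed as sketched below. Dividing the current count by the previous one is an assumption about the direction of the ratio, and the counts in the example are hypothetical.

```python
def ratio_sequence(point_counts):
    """Ratio of the detected points in each frame to those in the previous frame.

    Assumes every count is positive, so that no division by zero occurs.
    """
    return [current / previous
            for previous, current in zip(point_counts, point_counts[1:])]


if __name__ == "__main__":
    print(ratio_sequence([100, 97, 97]))  # hypothetical point counts per frame
```

A ratio close to 1 means that almost all features found in the previous frame are still being tracked.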
The experiment was carried out with the robot under different illumination conditions in the indoor environment. An advantage of the proposed algorithm is feature detection and tracking over frame sequences, which is useful for detecting and tracking moving objects. Fig. 8 shows that the extracted features can be tracked without losing information regardless of changes in system position and illumination, an advantage over conventional methods. Once the obstacle is located, a neural network classifies the object for its evasion. The effectiveness of the object classification is evaluated using the confusion matrix in Table III. The accuracy and the recall of the control neural network are computed from statistical values of the confusion matrix, where TP are the true positives, FP are the false positives and FN the false negatives.
The object classification neural network has high accuracy. In Table IV, the proposed algorithm performs well, with 93% accuracy and 98% recall for object classification.
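Metrics built from true positives (TP), false positives (FP) and false negatives (FN) are usually computed as sketched below. The ratio TP / (TP + FP) is normally called precision; treating it as the first metric is an assumption, and the counts in the example are hypothetical rather than taken from Table III.

```python
def precision(tp, fp):
    """Fraction of the positive predictions that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of the actual positives that are detected."""
    return tp / (tp + fn)


if __name__ == "__main__":
    # Hypothetical counts for one object class.
    print(precision(tp=90, fp=10), recall(tp=90, fn=30))
```

A high recall with a lower precision would indicate that few obstacles are missed but some detections are false alarms.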
The trajectory is generated using reinforcement learning and tracked by the U-Net model. To verify the accuracy of the U-Net model trained for trajectory detection, data acquired with the proposed system was used. The dataset contains 1500 real images of the path; in addition, image pre-processing (translation, rotation, and blurring) was applied to increase the accuracy. Furthermore, a segmentation of the image detects the path to decide the control actions needed to track a trajectory.
The accuracy of the U-Net model in Fig. 9 is 90% for path detection. The semantic segmentation recognizes the predicted class in the image and is used to track the trajectory and avoid collisions. The U-Net was trained using 1500 images and labels under different environmental conditions to get better performance.
To verify the applicability and accuracy of the algorithm, the image segmentation is presented in Fig. 10. The red region marks the recognized path.

IV. CONCLUSIONS
This paper proposes an indoor navigation method that applies a self-localization solution and a novel object detection. The approach achieves high precision in feature tracking, which improves object detection and evasion.
A map-based approach is used to self-locate the system on the map using a neural network. The accuracy of the localization neural network is >90%. In addition, the use of a deep learning neural network reduces the error in the system localization.
Furthermore, object detection is used to estimate the position of the system. The object detection has >90% accuracy owing to the feature extractor and to an adaptive threshold that solves problems with illumination changes. The proposed solution tracks the object over the image frames; the system then classifies the obstacle and uses reinforcement learning to evade it.
A U-Net model is applied to detect the path and track a trajectory; the semantic segmentation of the image also reduces the possibility of a collision. Semantic segmentation is an accurate process for identifying the path.
For efficient object detection, the system has to be in a dynamic environment; that is, either an object has to move or the system has to be moving toward the goal. The system navigates in unknown environments, but it has to be trained to localize itself because supervised learning is used; path detection likewise requires training the system in the environment.
The proposed algorithm provides a novel solution to the navigation problem. In addition, the detection of similar zones in the frames increases the accuracy and precision of the object detection stage, and the detected features are used as landmarks to estimate the position of the system. This algorithm can be used to build a map representation of the environment. Landmarks can be set on the map to train the system with reinforcement learning to set the locations.
The proposed algorithm has many applications, such as car and pedestrian detection for autonomous driving, surveillance robots, face tracking for recognition, car tracking for speed infractions, explorer robots, and background subtraction for surveillance systems.