Forest Fires Detection using Deep Transfer Learning

Abstract: Forests are vital ecosystems composed of plant and animal species that have evolved over many years to coexist. These ecosystems are often threatened by wildfires, which can start naturally, as a result of lightning strikes, or be caused unintentionally by humans. In general, human-caused fires are more severe and expensive to fight because they are frequently located in inaccessible areas. Wildfires can spread quickly and become extremely dangerous, damaging homes and facilities and killing people and animals. Early discovery of wildfires is vital to protect lives, property, and resources. Advanced imaging technologies can play a key role in detecting wildfires earlier. By applying deep learning (DL) to a dataset of images (collected using drones, planes, and satellites), we aim to automate forest fire detection. In this paper, we focus on building a DL model specifically to detect wildfires, using transfer learning from the best pretrained DL computer vision architectures currently available, such as Xception, DenseNet, MobileNet, MobileNetV2, and NASNetMobile. Our proposed approach attained a detection rate of more than 99.9% across multiple metrics, indicating that it could be used in real-world forest fire detection applications.


I. INTRODUCTION
Forests are one of the most important natural resources on our planet. They support a diverse range of plant and animal life, play an important role in climate regulation, and provide numerous economic and social benefits. However, forests are vulnerable to damage and destruction, and wildfires are one of their most serious threats [1]. A wildfire can start accidentally (with a spark from a campfire), or it can be deliberately set by someone who intends to cause harm. Wildfires can spread quickly, destroying everything in their path. They can also cause significant environmental damage, such as the death of trees and other plants, the removal of soil, and the release of harmful emissions into the atmosphere [2].
In recent years, wildfires have become increasingly severe. Wildfires raged through Algeria, Tunisia, and Morocco in mid-July 2021, as well as Italy and Greece in the Mediterranean. The fires, which are believed to have been started by arsonists, burned through thousands of acres of land, killing dozens of people and injuring many more [3], [4]. The fires were especially devastating to the local economies, causing widespread agricultural damage as well as the destruction of businesses and homes. Furthermore, the tourism industry suffered as many people canceled their trips to the affected areas. Despite the efforts of firefighters and volunteers, the fires kept burning for weeks, leaving a trail of destruction in their wake.
We can do a lot to reduce the risk of wildfires, such as properly managing forests and using fire-resistant materials when building houses and other structures. In addition, we need to address the root cause of these fires.
Detecting wildfires at an early stage is critical for protecting people, property, and resources. Imaging technology has the potential to aid in this early detection. High-altitude drones, aircraft, and satellites can detect wildfire heat signatures by imaging the fire area from above [5].
In this paper, we aim to build the most accurate DL model for forest fire detection by applying transfer learning to the best-performing and best-known pre-trained computer vision architectures available today, such as VGG, InceptionV3, ResNet50, InceptionResNetV2, Xception, DenseNet, MobileNet, and NASNetMobile.
The rest of the paper is organized as follows. The second section focuses on the techniques used, while the third section deals with related work. The fourth section presents our proposed method. Before concluding, the fifth section examines and discusses our study's findings.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 8, 2022, www.ijacsa.thesai.org

A. Computer Vision (CV)
CV is the process of extracting useful information from digital images. This information can be used for tasks such as object recognition, scene description, and motion tracking [6]. There are two kinds of computer vision algorithms: low-level and high-level. Low-level algorithms work on individual pixels, whereas high-level ones work on more abstract features such as edges and corners. Low-level algorithms are typically more accurate, but they are also more complex and require more processing power. High-level algorithms are less accurate, but they are faster and easier to implement. Deep learning, for its part, has shown great promise in this area and has been used to achieve impressive results in tasks such as object recognition and scene understanding [7]. Deep learning-based computer vision can recognize objects in images, recognize faces, and read text. It can also be used for automatic image tagging and classification [8].

B. Convolutional Neural Networks (CNN)
CNNs are a type of deep learning (DL) network, which means they are made up of multiple layers of neurons organized in a hierarchical structure. A deep learning network's goal is to learn representations of input data by gradually extracting more and more information from it. Because of their ability to learn features of the input data, CNNs are particularly well-suited for computer vision tasks [9]. This is accomplished through a process known as feature extraction, which involves identifying the important features in the data and representing them in a way that the network can learn from. In contrast, traditional machine learning algorithms require the programmer to explicitly specify which features the algorithm should use [10].
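To make feature extraction concrete, the following minimal sketch applies a single hand-written 3x3 filter to a toy image, followed by a ReLU and 2x2 max pooling. It is an illustration only: the function names, the toy image, and the filter values are assumptions made for this example, and in a real CNN the filter values are learned during training rather than fixed by hand.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation of a grayscale image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses and set negative ones to zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool2x2(feature_map):
    """2x2 max pooling with stride 2 (an odd trailing row or column is dropped)."""
    h = len(feature_map) // 2 * 2
    w = len(feature_map[0]) // 2 * 2
    return [
        [max(feature_map[r][c], feature_map[r][c + 1],
             feature_map[r + 1][c], feature_map[r + 1][c + 1])
         for c in range(0, w, 2)]
        for r in range(0, h, 2)
    ]


# Toy 6x6 image: dark left half (0), bright right half (9).
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]

# Hand-written vertical-edge filter (illustrative values, not learned).
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

response = conv2d(image, vertical_edge)      # strong response where the edge lies
features = max_pool2x2(relu(response))       # smaller map that keeps the strongest responses
print(features)
```

The filter responds only where dark pixels meet bright ones, and pooling shrinks the map while keeping that response, which is the kind of intermediate representation a CNN passes on to its later layers.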
CNNs have the ability to learn features that are specific to the task at hand, which is one of their advantages [11], [12]. A CNN is made up of several layers, each with its own function. The first layer is the input layer, which receives input data in the form of a matrix of numerical values and feeds it into a series of convolutional and pooling layers. This stack of convolutional and pooling layers forms the convolutional base, which acts as the feature extractor. The output layer is the final layer in a CNN; it produces the network's results (see Fig. 1). We list below the most successful deep learning CNN architectures for image recognition:
 VGG is a deep learning architecture (created by the Visual Geometry Group) that was developed for the ILSVRC contest (ImageNet Large Scale Visual Recognition Challenge). VGG16 is made up of 16 weight layers (13 convolutional layers and three fully connected layers), while the VGG19 model has 19 layers [13].
 Inception is a deep learning model that performed well in the ImageNet competition. It is made up of a deep convolutional network with a large number of layers (more than 20) [14].
 ResNet (Residual neural network) is a deep neural network that performed well in the ILSVRC. It is made up of a deep convolutional network with a large number of layers (more than 100). One of ResNet's key features is the so-called "identity shortcut connection", which skips one or more layers [15].
 InceptionResNetV2 is a variant of the Inception family, designed to improve its performance on the ImageNet dataset. The model combines Inception modules with residual connections [16].
 Xception (Extreme Inception) is a deep learning model based on a CNN with a large number of layers (more than 150) [17].
 DenseNet is a CNN model with a large number of layers (more than 500), designed to increase the number of connections between layers. This helps to improve the overall accuracy of the network [18].
 MobileNet is a family of CNN architectures for building neural networks that run on mobile devices. It is designed to be efficient and lightweight, making it suitable for a wide range of mobile devices. The MobileNet50 version has a depth of 50 layers and can be used for both classification and detection tasks [19].
 NASNet (Neural Architecture Search Network) is a CNN model trained on the ImageNet dataset [20]. It automates network architecture engineering by searching for the architecture that achieves the best performance on a certain task, automatically configuring the number of layers, the number and type of neurons in each layer, and the overall architecture of the network. The NASNetMobile version is suitable for mobile devices [21].
III. RELATED WORK
Dutta S. et al [22] proposed a hybrid architecture of separable convolution neural networks and digital image processing employing thresholding and segmentation for reliably detecting small-scale forest burning, which generally heralds the beginning of more terrible catastrophes. Performance examination of the test data on the suggested design provided outstanding results in terms of high sensitivity (98.10%) and specificity (87.09%).
Aslan S. et al [23] proposed a smoke detection approach based on Deep Convolutional Generative Adversarial Neural Networks (DCGANs). In order to ensure a robust representation of sequences with and without smoke, the training framework includes regular training of a DCGAN with real pictures and noise vectors, as well as training the discriminator separately using smoke images without the generator. With a TNR of 99.45% and a TPR of 86.23%, the suggested approach is able to identify smoke pictures in real time with minimal false positives.
Wang Y. et al [24] proposed a forest fire image identification system based on traditional image processing methods and convolutional neural networks, and an adaptive pooling methodology was established to identify fire automatically. Using this technique, the features of the fire flame may be segmented and learned in advance. It has been shown in experiments that the adaptive pooling convolutional neural network approach has greater performance and a higher recognition rate, with an accuracy as high as 90.7%.
Chen Y. et al [25] proposed a UAV-based forest fire detection approach based on a convolutional neural network method in order to identify a probable fire in its early stages. Experimentation with generated flames in an indoor testbed proves that the suggested fire detection system works.
We will make a comparative study between a wide range of deep learning models, practically all of those that have demonstrated their effectiveness in computer vision, in order to propose the most accurate model possible for forest fire detection.
IV. PROPOSED METHOD
Our proposed solution is to detect wildfires before they spread out of control using drones and deep learning algorithms. Drones are used to fly over forests and identify hot spots that could spark a wildfire. Once a hot spot has been identified, the deep learning algorithm can be used to determine whether it is a wildfire, and notify the authorities via a cloud server (Fig. 2).
This scheme has several advantages over traditional methods. First, drones can fly over large areas much faster than ground crews. Second, the deep learning algorithm can identify wildfires much more accurately than human observers can. Third, the use of drones and deep learning algorithms can help to protect firefighters and their assistants and keep them safe from danger by alerting them earlier.
Our main contribution in this paper is to build the most accurate DL model specifically for detecting wildfires in forests, starting from the best-performing DL computer vision architectures available at the time and leveraging the knowledge already captured in pre-trained models. These models are already trained to recognize certain categories, and we narrow their knowledge to focus on only two categories (Fire or Non-Fire).

A. Dataset
A common challenge in deep learning is obtaining datasets that are sufficiently large and diverse in nature for the task at hand [26], [27]. The dataset used for training our models is comprised of a large number of images captured of wildfires in different locations around the world, as well as images of forest landscapes with no fire. It was constructed by mixing and merging multiple smaller datasets from search engines and Kaggle [28], resulting in 4661 images in our new dataset; 2525 images with the label "no fire" and 2136 images with the label "fire", after cleaning some corrupted images.
In addition, we performed data augmentation on the dataset, allowing us to significantly increase the size of our training dataset and, as a result, the quality of the trained models [29]. With data augmentation, we added new data to the dataset that is similar to the original data, but with some slight modifications, which can improve the performance of neural networks learning from data and improve their accuracy [30]. We used different data augmentation techniques [31], such as:
1) Random rotation: this can help to reduce overfitting by creating new images that are rotated versions of existing images. This also gives the model a chance to learn how to recognize objects from different angles.
2) Horizontal and vertical mirroring: they can also help to reduce overfitting by providing the model with new images that are mirror images of existing images. This can also help the model learn to recognize objects that may be upside down or rotated in different orientations.
3) Gaussian blur: it can help to improve the robustness of the model by making the images less detailed and more forgiving of small changes. This can help the model to generalize better to new data.
4) Pixel level augmentation: it can help to improve the model's ability to learn from small changes in the input data. This can be useful for learning from data that may be noisy or have low resolution.
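The four augmentation techniques listed above can be sketched on a small grayscale image stored as a nested list. This is a toy illustration under stated assumptions, not the pipeline used in the experiments: rotation is restricted to multiples of 90 degrees for simplicity, the 3x3 blur kernel and the noise amplitude are arbitrary choices, and all function names are hypothetical.

```python
import random


def rotate90(img, k):
    """Rotate a 2D image clockwise by k * 90 degrees."""
    img = [list(row) for row in img]
    for _ in range(k % 4):
        img = [list(row) for row in zip(*img[::-1])]
    return img


def mirror(img, horizontal=True):
    """Flip left-right (horizontal=True) or top-bottom (horizontal=False)."""
    if horizontal:
        return [row[::-1] for row in img]
    return [list(row) for row in img[::-1]]


def _clamp(index, upper):
    """Keep an index inside [0, upper] so border pixels reuse their nearest neighbour."""
    return min(max(index, 0), upper)


def gaussian_blur(img):
    """3x3 Gaussian blur with weights summing to 16."""
    k = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
    h, w = len(img), len(img[0])
    return [
        [
            sum(k[i][j] * img[_clamp(r + i - 1, h - 1)][_clamp(c + j - 1, w - 1)]
                for i in range(3) for j in range(3)) / 16
            for c in range(w)
        ]
        for r in range(h)
    ]


def pixel_noise(img, rng, amplitude=10):
    """Add uniform integer noise in [-amplitude, amplitude], clipped to 0-255."""
    return [[min(255, max(0, p + rng.randint(-amplitude, amplitude))) for p in row]
            for row in img]


def augment(img, rng):
    """Apply a random combination of the four transformations."""
    img = rotate90(img, rng.randrange(4))
    if rng.random() < 0.5:
        img = mirror(img, horizontal=True)
    if rng.random() < 0.5:
        img = mirror(img, horizontal=False)
    if rng.random() < 0.5:
        img = gaussian_blur(img)
    return pixel_noise(img, rng)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = [[0, 50, 100], [150, 200, 250], [30, 60, 90]]
augmented = [augment(original, rng) for _ in range(4)]  # four new variants of one image
```

Each call to `augment` yields a slightly different variant of the same image, which is how augmentation enlarges a training set without collecting new photographs.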

B. Building the Models
A model can be trained in a variety of ways. In this section, we look at transferring pre-trained models to a new task. In transfer learning, a model is first trained on a large dataset, such as ImageNet. This model is then used as the "base model" for another model trained on a smaller, more specialized dataset (see Fig. 3); in our case, images of fires and forests. The new model is then tuned to better fit the dataset on which it is being used. This process can be repeated, with the final model trained on a dataset even smaller than the previous one. This process of model training is commonly referred to as fine-tuning [32]. We used this same strategy for each of the state-of-the-art models, importing the pre-trained DL model class [33] while ensuring that we could add our own custom input and output layers according to our data. Reusing the weights learned during the initial training saves a considerable amount of time and storage and limits the complexity of the part of the model that must be trained.
Afterward, we appended a new classifier (a fully connected block and an output layer) to the imported pre-trained model so that learning on the new task could take place; the fully connected block consists of a flatten layer and a dense layer with 512 neurons [34]. The output layer uses the sigmoid activation function with a single output neuron, matching the binary label in our data (Fire or Non-Fire). Finally, we trained our models on 80% of the augmented dataset (training set) and validated the results on the remaining 20% (validation set).
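The structure of the new classifier can be sketched as a plain forward pass: flatten the feature maps produced by the frozen base, apply a dense layer with 512 neurons, and finish with one sigmoid neuron. Several details below are assumptions made for illustration and are not stated in the text: the ReLU activation in the hidden layer, the 0.5 decision threshold, the reading of the output as the probability of fire, the toy 2x2x8 feature map, and the random weights that stand in for trained ones.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def flatten(feature_maps):
    """Flatten a (height, width, channels) nested list into one vector."""
    return [v for row in feature_maps for pixel in row for v in pixel]


def classifier_head(feature_maps, w_hidden, b_hidden, w_out, b_out):
    """Flatten -> Dense(512, ReLU assumed) -> Dense(1, sigmoid)."""
    hidden = relu(dense(flatten(feature_maps), w_hidden, b_hidden))
    return sigmoid(dense(hidden, w_out, b_out)[0])


rng = random.Random(42)
n_features, n_hidden = 2 * 2 * 8, 512  # toy feature map size, 512 hidden neurons

# Random stand-ins for the weights that training would learn.
w_hidden = [[rng.gauss(0, 0.05) for _ in range(n_features)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [[rng.gauss(0, 0.05) for _ in range(n_hidden)]]
b_out = [0.0]

# Stand-in for the output of the frozen convolutional base on one image.
feature_maps = [[[rng.random() for _ in range(8)] for _ in range(2)] for _ in range(2)]

p_fire = classifier_head(feature_maps, w_hidden, b_hidden, w_out, b_out)
label = "Fire" if p_fire >= 0.5 else "Non-Fire"  # threshold is an assumption
```

During training only the weights of this head would be updated, while the convolutional base keeps the weights it learned on ImageNet.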

A. Hardware and Software Characteristics
In order to get our results, we used TensorFlow on an HPC system with the following hardware specifications:

B. Evaluation Metrics
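It is necessary to have a proper evaluation metric in place in order to find the best model during the training phase [36]. When evaluating deep learning models, certain metrics must be used, such as Accuracy, Precision, Recall, and Loss. In order to calculate these metrics, four different parameters are used [37]:
 True Positive (TP): the number of positive-class records correctly classified as positive.
 True Negative (TN): the number of negative-class records correctly classified as negative.
 False Positive (FP): the number of negative-class records incorrectly classified as positive.
 False Negative (FN): the number of positive-class records incorrectly classified as negative.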

1) Accuracy:
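Accuracy is the percentage of correctly classified items. It is the most basic and common evaluation metric for classification tasks: the ratio of correctly predicted labels to all predicted labels [34]:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives.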

2) Loss:
Loss is a measure of how far the model's output is from the desired output. The lower the loss, the better the model is performing [38]. Cross-entropy is a commonly used loss for classification problems. For binary classification over N samples, it is calculated as:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \quad (2)$$
where $y_i$ is the true label of sample $i$ and $p_i$ is its predicted probability. The cross-entropy loss assesses the performance of a classifier by penalizing incorrect predictions.
A higher cross-entropy loss indicates that the classifier is making more incorrect predictions, or making them with more confidence.
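As a worked illustration, the sketch below computes the binary form of the cross-entropy loss, the average over samples of -[y log(p) + (1 - y) log(1 - p)]. The binary form is assumed here because the model has a single sigmoid output; the function name, the clipping constant `eps`, and the sample values are illustrative choices.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


labels = [1, 0, 1, 0]  # 1 = fire, 0 = no fire (assumed encoding)

# Confident and correct predictions give a loss close to zero.
confident_right = binary_cross_entropy(labels, [0.99, 0.01, 0.98, 0.02])

# Confident but wrong predictions are penalized heavily.
confident_wrong = binary_cross_entropy(labels, [0.10, 0.90, 0.20, 0.80])

print(round(confident_right, 4), round(confident_wrong, 4))
```

The second call produces a much larger value than the first, even though both use the same labels, because the loss grows quickly as the predicted probability moves away from the true label.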

3) Precision:
Precision is the ratio of correctly predicted positive instances to all instances predicted as positive. True positives (TP) are those instances that are correctly identified as positive by the model [39].
A model with high precision raises few false alarms, while a model with low precision misclassifies many negative examples as positive. The Precision metric is given by:
$$\text{Precision} = \frac{TP}{TP + FP} \quad (3)$$
where FP (false positives) is the number of negative instances incorrectly predicted as positive.

4) Recall:
Recall is the ratio of correctly predicted positive instances to all actual positive instances (see formula (4)). It is also known as the true positive rate or sensitivity. A model with high recall is capable of detecting most positive instances [39]. A model with low recall is not informative, as it classifies most positive instances as negative.
$$\text{Recall} = \frac{TP}{TP + FN} \quad (4)$$
In general, Recall should be used alongside other metrics such as Precision and Accuracy to get a complete picture of a model's performance.
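The reason recall should not be read in isolation can be shown with a small sketch that computes accuracy, precision, and recall from predicted and true labels. The label encoding (1 for the positive class), the function name, and the toy data are assumptions made for this example.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 positive and 6 negative samples

# A degenerate model that labels every image as positive.
always_positive = classification_metrics(y_true, [1] * 10)

# A model that makes one false negative and one false positive.
balanced = classification_metrics(y_true, [1, 1, 1, 0, 0, 0, 0, 0, 0, 1])

print(always_positive)  # recall is perfect, precision and accuracy are poor
print(balanced)
```

The model that always predicts the positive class reaches a recall of 1.0 while its precision and accuracy drop to 0.4, which is why perfect recall combined with low precision points to a high false-positive rate rather than a good detector.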

5) The number of parameters:
The number of parameters in deep learning refers to the number of trainable variables of the neural network, namely its weights and biases; it depends on the size of the network (the number and width of its layers). It reflects the learning capacity of the model: a deep learning model with a large number of parameters has the capacity to learn more complex patterns than a model with fewer parameters [40]. The number of parameters also determines the amount of memory required to store the model: a model with more parameters requires more memory than a model with fewer parameters [41]. The number of parameters in our models differs from that of the original pre-trained models owing to the change in the fully connected layers (the convolutional base was left unmodified, as it was frozen).
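The effect of the new fully connected layers on the parameter count can be checked with simple arithmetic: a dense layer with n inputs and m neurons has n x m weights plus m biases. The sketch below adds a Flatten, Dense(512), Dense(1) head to two convolutional bases. The base counts are the figures commonly reported for the Keras models without their top layers, and the flattened feature length is assumed to equal the channel count of the last convolutional layer (a 1x1 spatial map); both are assumptions of this example, not values stated in the text.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per neuron in a fully connected layer."""
    return n_inputs * n_units + n_units


def head_params(n_features, hidden_units=512):
    """Flatten (no parameters) -> Dense(hidden_units) -> Dense(1)."""
    return dense_params(n_features, hidden_units) + dense_params(hidden_units, 1)


# (assumed convolutional-base parameters, assumed flattened feature length)
BACKBONES = {
    "VGG16": (14_714_688, 512),
    "ResNet50": (23_587_712, 2048),
}

totals = {name: base + head_params(features)
          for name, (base, features) in BACKBONES.items()}

for name, total in totals.items():
    print(f"{name}: {total:,} parameters ({total / 1e6:.2f} million)")
```

Under these assumptions the totals come to about 14.98 million for VGG16 and 24.64 million for ResNet50, which is of the same order as the counts discussed for these two models and shows how small the new head is relative to the frozen base.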

C. Evaluating the Results
In Table I and Figures 4-7, we present the results obtained on multiple metrics (accuracy, loss, precision, and recall), along with the number of parameters for each model (this number differs from the original pre-trained models because of our new classifier). To maximize effectiveness, we trained all of the models for one hundred epochs. The results show that ResNet50, VGG16, and VGG19 have the highest accuracies, precisions, and recalls and the lowest losses, achieving near-perfect scores. MobileNet and DenseNet came in second place, with more than 97% in three metrics (accuracy, precision, and recall) but a loss of around 5 to 6%. MobileNetV2 achieves close results: more than 96% in the three metrics and a loss of more than 9%. Xception follows, with more than 94% in the three metrics and a high loss of 14 to 15%. ResNet50V2 obtained mediocre results, around 84-85% in the three metrics with losses of up to 34%, even though its first version (ResNet50) performed well (a high loss reflects larger errors, meaning that the model fits the data poorly). NASNetMobile and InceptionV3 performed similarly to ResNet50V2 in all metrics. Finally, the mixed model InceptionResNetV2 performed the worst in the accuracy, precision, and loss metrics but reached the best score in the recall metric (100%). This shows that the model has a low false-negative rate (down to zero) but a high false-positive rate, as reflected in its low precision. At the end of this discussion, the best models retained are ResNet50, VGG16, and VGG19. We then compared their numbers of parameters: ResNet50 (~24.6 million), VGG16 (~14.9 million), and VGG19 (~20.2 million).
If we are looking for the model with the greatest learning capacity, ResNet50 is the right candidate. On the other hand, if we are targeting a lightweight model to deploy on resource-limited, battery-powered devices such as a drone or an IoT device [42], VGG16 is the most suitable of the three: it has about 40% fewer parameters than ResNet50 (~14.9 million versus ~24.6 million). With slightly lower performance, DenseNet is the best lightweight model after VGG16, with only 7.5 million parameters. If we prioritize model size, MobileNet is the best choice, with only 3.7 million parameters and an accuracy close to 98%. MobileNetV2 is the lightest model in this case study, with only 2.9 million parameters and an accuracy of more than 96%, just behind its first version, MobileNet.
For the other models, ResNet50V2 has about the same number of parameters as ResNet50, while Xception and InceptionV3 have ~24.6 million and ~21.9 million respectively, but produced modest results. Despite its high number of parameters (55.1 million), InceptionResNetV2 is the poorest model in our case study, indicating that deeper networks or more neurons do not always produce the best results. NASNetMobile is a lightweight model intended for edge devices (~4.8 million parameters); however, its performance is insufficient for our purposes.
Fig. 8 shows predicted image samples demonstrating that our system can almost perfectly distinguish between fire and the normal forest state, regardless of the features and variety of objects present (people, snow, different types of trees, etc.). Fig. 9 shows the images incorrectly predicted by the VGG16 and ResNet50 models (these images were collected from the Web and were not seen by the models in either training or validation); the images themselves readily explain why the models predicted the wrong labels. The most likely explanation for these errors is that the images can deceive even the human eye; in the first image, the sun and its radiation in the clouds and on the lake can easily be misinterpreted as fire because we see the same [...]
No system is perfect, but these findings show that deep learning can be extremely accurate in detecting wildfires, with a success rate of more than 99.9%. Thanks to its ability to identify the unique signatures emitted by wildfires, this can provide an early warning of a wildfire, allowing fire crews to be dispatched to the scene before it spreads too far. This solution could be a valuable tool for fire departments and other emergency responders in identifying and responding to wildfires.
In the future, we will deal with the incorrectly predicted cases, in which the system can confuse flames with the sun, fog and clouds, or smoke. Furthermore, we will try to implement these models as the feature extractor backbone in other DL algorithms such as the R-CNN (Region-Based Convolutional Neural Networks) family [43], SSD (Single Shot Detector) [44], or YOLO (You Only Look Once) [45], [46], in order to detect not only fires but also their precise coordinates.

VI. CONCLUSION
Deep learning has revolutionized computer vision by enabling computers to learn from data to recognize patterns and classify objects with high accuracy. This has led to the development of powerful computer vision algorithms and applications that can detect and identify objects in photos and videos. The approach proposed in this paper builds a deep learning model specifically for detecting wildfires in forests using the transfer learning technique. The discussion of our results identified VGG16 and ResNet50 as the most relevant models for our problem; they achieve scores of more than 99.9% in accuracy, recall, and precision, with a loss down to 0.19% for ResNet50 and 0.48% for VGG16. Fire departments and other emergency responders may benefit from these techniques to better identify and control wildfires before they spread too far. In future work, we will try to improve and extend these models by using object detection approaches such as the R-CNN family, SSD, and YOLO to identify fires together with their precise location coordinates.