Ransomware: Analysis of Encrypted Files

Abstract: Ransomware is a type of malware that damages a system by encrypting the files stored on the computer. To regain access, the victim has to pay a ransom in exchange for a key that decrypts the data. Once the malware is running on a machine, the user usually cannot stop it in time and may lose all of their files. One goal of this work is to detect ransomware in real time, based on the files it encrypts, and thereby to minimize the cost of lost files. We analyze a received file without opening it or viewing its contents; this scan can prevent ransomware from spreading through the system. Most ransomware is delivered in “.exe” format, but in this work we also consider other file formats that can carry malware, for example .doc or .docx, .xls or .xlsx, .ppt or .pptx, and .jpg. Indeed, an attacker may focus only on files that contain useful data. In this paper, we classify files as suspicious or normal from their headers, without opening them. To do so, we first analyze each extension separately (.docx, .exe, .pptx, .xlsx, .jpg, etc.) by identifying its header and signature. We then take several files with different extensions and analyze them with a program that detects whether a file is benign or suspicious.


I. INTRODUCTION
In recent years, ransomware attacks have continued to grow exponentially around the world; their cost keeps falling, and they target many different sectors.
Researchers and cybersecurity specialists are still looking for an effective and reliable way to detect these attacks, or at least to slow their growth. Many solutions exist, but none is 100% reliable, because attackers keep up with new technologies and use increasingly sophisticated techniques to bypass protection mechanisms.
This study examines the behavior of ransomware and the way in which it encrypts files. Ransomware can reach a device in various formats, such as .exe, .docx, and .ppt.
A user may open a .docx file without realizing that it is unsafe and contains metadata that can damage the computer. We therefore aim to analyze files, without opening them, before and after ransomware encryption in order to distinguish a normal file from a suspicious one.
In this paper, we study files in order to differentiate a normal file from a suspicious one. Section II reviews the state of the art on file analysis to give an idea of current research on this subject. Section III presents our objectives and the methodology we use to distinguish a normal file from a suspicious one. Section IV discusses our results. Finally, Section V concludes and outlines some perspectives.

II. STATE OF THE ART
Attackers are very inventive when targeting a victim, and emails are most often (in more than 90% of cases) the route [1] by which the attacker reaches the target. Fig. 1 explains how ransomware attacks a machine. Ransomware detection techniques [2]-[5] are becoming more and more competitive, and each researcher has their own method. For the detection of ransomware, or malware in general, from file headers, several researchers focus on a single file type such as PE (Portable Executable) files [6]-[8], but there is not enough research on detecting ransomware using the headers of different extensions.
The authors of [9] proposed a new classification model based on machine learning techniques to detect and classify malicious and benign PE files from their header information. The experimental results showed that the Random Forest algorithm yields a higher accuracy (99.68%) than the other algorithms. The tests were performed on 211,067 malware samples obtained from the VirusShare database [10]. Manavi and Hamzeh [11] presented a method for detecting ransomware using the PE header. They used a convolutional neural network (CNN) to identify ransomware by converting the header bytes into 32×32-pixel images. Using the header is advantageous, but transforming it into an image requires a network with additional layers in order to extract its features.
To detect ransomware, the authors of [12] used a static method based on the bytes extracted from the header of the executable file, using an LSTM network to build the detection model.  214 | P a g e www.ijacsa.thesai.org Modifying the file header changes its structure. They therefore extracted the headers of the executable files, processed the byte sequence that makes up each header with the LSTM network, and separated the ransomware samples from the benign samples to build the model. With this technique, they detected ransomware with 93.25% accuracy without running the program.
Subedi et al. [13] employed data mining techniques to recognize and detect ransomware families using both static and dynamic analysis at three different levels: assembly, function calls and library. They also created an analytical tool that uses reverse engineering to create signatures for identifying ransomware families. Arabo et al. [14] proposed a dynamic analysis approach to gather ransomware API properties, which are then used to test 9 machine learning classifiers and a neural network. The goal of that research is to understand the link between a process's behavior and its nature, in order to detect whether it is ransomware. With a detection rate of 75.01%, Random Forest surpasses the other classifiers. The benefit of this technique is that it does not require a signature database, but rather a collection of ransomware and non-ransomware data. The detection rate of the classifiers could be improved with a better dataset.
Lee et al. [15] utilized machine learning techniques to detect and classify infected files before encrypted files were moved to a backup disk. In their design, the training step runs on the backup system, which examines files from various users and of various types and determines file entropy thresholds. These thresholds are transmitted to client hosts in order to decide whether a new version of a file is encrypted or not. The authors of [16] suggest a two-stage mixed ransomware detection approach that combines a Markov model with the Random Forest technique. Random Forest has the best detection rate, 97.3%.
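To make the idea of a file entropy threshold concrete, the following minimal Python sketch computes the Shannon entropy of a byte sequence and compares it with a cut-off. It is not the method of [15]: the function names and the threshold value of 7.5 bits per byte are assumptions chosen only for illustration.

```python
import math
import os
from collections import Counter

# Illustrative cut-off only; this is NOT a value taken from [15].
HYPOTHETICAL_THRESHOLD = 7.5


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for empty or constant input, at most 8.0."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def looks_encrypted(data: bytes, threshold: float = HYPOTHETICAL_THRESHOLD) -> bool:
    """Flag content whose byte distribution is close to uniform."""
    return shannon_entropy(data) > threshold


text = b"plain text repeats itself a lot " * 256
noise = os.urandom(len(text))  # stand-in for encrypted content
print(round(shannon_entropy(text), 2), looks_encrypted(text))
print(round(shannon_entropy(noise), 2), looks_encrypted(noise))
```

Compressed containers such as ZIP-based documents also have high entropy, so a single global cut-off of this kind would probably need to be tuned per file type.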
The paper [17] emphasizes the capabilities of behavior-based detection mechanisms to identify crypto ransomware, demonstrating the limitations of signature-based detection approaches. In [18], Nieuwenhuizen proposed a ransomware detection scheme using behavior analysis and machine learning. Although the specific features were not revealed, the feature set included properties such as payload persistence, anti-system restoration, stealth methods, environment mapping, network traffic, and privilege elevation, extracted from the behavior of a malicious setup. The author employed the support vector machine (SVM) method as the classification technique, in addition to behavioral features related to data transformation, such as large-scale file encryption.
The effect of certain ransomware families on the Windows platform is demonstrated and analyzed by Mohammad [19]. He deduces that most ransomware families behave in a similar way in how they affect file system and registry entities. Furthermore, all types of ransomware generate files in the Windows system files and rename other files. For the experiments, the author used Windows 7, Oracle VirtualBox VM, Cuckoo sandbox, and a virtual Windows 10. The author concludes that monitoring system file and registry activities can protect against ransomware. He also mentions that Windows 10 is more effective than Windows 7 against malware. The main recommendation is to back up company or individual data regularly.

III. METHODOLOGY
As mentioned at the beginning, our goal is to detect whether a file is suspicious or not, regardless of its content, from its header, whose expected value is determined by the file's extension. This allows us to detect ransomware from encrypted files in real time.
It is well known that each extension has a fixed header according to the standards. If a file's header differs from the standard one for its extension, we deduce that the file is suspicious.
To achieve our goal, we took several files with different extensions. For a given extension, for example ".docx", opening several files with that extension in a hexadecimal editor (Hex Editor Neo [20]) shows that they share the same signature, also called the "magic number". A close study of Microsoft Office files shows that the signatures of the legacy and current formats differ: the "x" added at the end of the extension makes a considerable difference. The differences between .doc and .docx (and likewise between .xls and .xlsx, and between .ppt and .pptx) are shown in Table II. A DOCX file is actually a ZIP file containing all the XML files associated with the document.
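The remark that a DOCX file is a ZIP container can be checked with Python's standard zipfile module. The sketch below is only an illustration: it builds a tiny ZIP archive in memory as a stand-in for a real .docx (the helper names and the dummy XML content are assumptions) and inspects its first bytes. The legacy signature used for comparison is the .doc header quoted in Section IV.

```python
import io
import zipfile

# Header of the legacy container used by .doc files (value quoted in Section IV).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local-file-header signature that opens any non-empty ZIP archive ("PK\x03\x04").
ZIP_MAGIC = bytes.fromhex("504B0304")


def build_zip_stand_in() -> bytes:
    """Build a tiny in-memory ZIP with one XML member, as a stand-in for a .docx."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # A real DOCX package keeps its main part under this name; the content here is dummy.
        archive.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


def container_kind(data: bytes) -> str:
    """Name the container family suggested by the leading bytes."""
    if data.startswith(ZIP_MAGIC):
        return "zip-based (e.g. .docx, .xlsx, .pptx)"
    if data.startswith(OLE2_MAGIC):
        return "ole2 (e.g. .doc, .xls, .ppt)"
    return "unknown"


sample = build_zip_stand_in()
print(sample[:4].hex(" ").upper())             # 50 4B 03 04
print(container_kind(sample))                  # zip-based (e.g. .docx, .xlsx, .pptx)
print(zipfile.is_zipfile(io.BytesIO(sample)))  # True
```

Because every ZIP-based format shares these four bytes, they cannot by themselves tell a .docx from a .xlsx or from a plain .zip archive.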

File size
DOC: The DOC format has a greater size than the DOCX format.
DOCX: The DOCX format has a smaller size than the DOC format.

According to [21], the header is always at the beginning of the file and is exactly 512 bytes in length. Fig. 2 and Fig. 3 show some information about the headers of the ".docx" and ".doc" files, respectively.

IV. RESULTS AND DISCUSSION
We took a corpus [22] that contains a large number of files with different extensions and encrypted them with a Python program, appending an '.enc' extension to each encrypted file to distinguish it from the clear file. As an example, we took four files for each extension (.doc, .docx, .ppt, .pdf) and obtained the following results. For the " *.doc " files (Fig. 6), those on the left are clear files, and their header is the normal one [D0 CF 11 E0 A1 B1 1A E1]. On the right, the added extension " *.doc.enc " indicates encrypted files (each is the encrypted version of a clear file, e.g. "1.doc.enc" is the encrypted version of "1.doc"), and their headers are different. Notably, each encrypted file has a different header from the others: [2F 2E 02 89 56 38 DD AA], [A7 C5 DE D5 24 9D FA E2], etc.
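The experiment described above can be reproduced in miniature. The sketch below is not the authors' encryption program, which is not shown in the paper: as an explicit assumption, it uses a toy XOR cipher with a SHA-256 keystream (not secure, a stand-in only) and a fake 512-byte ".doc" file, and it prints the first eight bytes before and after encryption. All function names are hypothetical.

```python
import hashlib
import os
import tempfile

DOC_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # clear .doc header quoted in the text


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Stand-in cipher (NOT the authors' program, NOT secure): XOR with a SHA-256 keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))


def header_hex(path: str, length: int = 8) -> str:
    """Return the first `length` bytes of a file as spaced upper-case hex."""
    with open(path, "rb") as handle:
        return handle.read(length).hex(" ").upper()


def encrypt_file(path: str, key: bytes) -> str:
    """Write an encrypted copy next to `path` with '.enc' appended and return its path."""
    with open(path, "rb") as handle:
        data = handle.read()
    enc_path = path + ".enc"
    with open(enc_path, "wb") as handle:
        handle.write(toy_encrypt(data, key))
    return enc_path


with tempfile.TemporaryDirectory() as folder:
    clear_path = os.path.join(folder, "1.doc")
    with open(clear_path, "wb") as handle:
        handle.write(DOC_MAGIC + b"\x00" * 504)  # fake body; only the header matters here
    enc_path = encrypt_file(clear_path, os.urandom(16))
    clear_header = header_hex(clear_path)
    enc_header = header_hex(enc_path)
    print(os.path.basename(clear_path), clear_header)  # 1.doc D0 CF 11 E0 A1 B1 1A E1
    print(os.path.basename(enc_path), enc_header)      # differs on every run (random key)
```

With a fresh random key on each run, the encrypted header changes from run to run, which mirrors the observation that no two encrypted files share a header.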
The same holds for the " *.docx " files (Fig. 7). As Fig. 6, Fig. 7 and Fig. 8 show, the signature of a clear file and that of its encrypted version are not the same; this is also the case for the other extensions. We also notice that the signature is fixed for all clear files with a given extension, whereas for encrypted files it is not fixed and differs from one file to another. We wrote a Python program to carry out this study and to detect from its signature whether a file is suspicious or normal.
Each extension has a "magic byte". We built our dataset as a dictionary with the file extension as the key and its "magic byte" as the value, and we then analyze the file against it. If the file does not contain the corresponding signature, i.e. its header differs from the one recorded in our dataset, we deduce that it is a suspicious file. We also handle the case of a file without an extension: the program analyzes the header and, if it finds no corresponding signature, reports that the file is suspicious; otherwise, if everything is normal, the result is: "This is a benign file, its extension is: … ". Fig. 9 shows the result for a file without an extension that is benign. In the example of Fig. 11, the result is "this is a suspicious file" even though the file has the extension ".doc". Indeed, attackers sometimes send files that look normal and carry a legitimate extension although the file has been infected by ransomware. Here, our program analyzed the header of the given file, identified its signature, and found that it does not match the normal signature. We also tested our program on files encrypted by ransomware that appends the ".lmas" extension. As an example, Fig. 12 shows the file "formation.xlsx.lmas", in which ".lmas" is added after the ".xlsx" extension. We obtained the following result:
Fig. 12. Analysis of a file encrypted by Ransomware.
As Fig. 12 shows, the ransomware infected the file "formation.xlsx": the file was encrypted, and the attacker added the extension ".lmas" to its name.
We know that ".xlsx" has a fixed signature (see Table I). In Fig. 12, the first 4 bytes are identical to the first 4 bytes of a normal xlsx file (50 4B 03 04), but the next 4 bytes differ. Our program was therefore able to detect, without opening the file, that it had been encrypted by ransomware and is thus suspicious.
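The dictionary-based check described in this section can be sketched as follows. This is a minimal illustration, not the authors' program: the signature values are widely documented ones assumed here because Table I is not reproduced in the text (in particular, the 8-byte value for .docx/.xlsx/.pptx is the header commonly listed for files written by Microsoft Office), the function names are hypothetical, and the last four bytes of the ".lmas" example header are made up.

```python
import os

# Extension -> expected leading bytes ("magic bytes"). Assumed values: the
# authors' Table I is not available here, so widely documented ones are used.
OLE2 = bytes.fromhex("D0CF11E0A1B11AE1")   # legacy Office header quoted in the text
OOXML = bytes.fromhex("504B030414000600")  # assumed 8-byte header of Office-written OOXML files
SIGNATURES = {
    ".doc": OLE2, ".xls": OLE2, ".ppt": OLE2,
    ".docx": OOXML, ".xlsx": OOXML, ".pptx": OOXML,
    ".pdf": b"%PDF",
    ".jpg": bytes.fromhex("FFD8FF"),
    ".exe": b"MZ",
}
HEADER_LENGTH = max(len(sig) for sig in SIGNATURES.values())


def matching_prefix_length(header: bytes, signature: bytes) -> int:
    """Count how many leading bytes of `header` agree with `signature`."""
    count = 0
    for got, want in zip(header, signature):
        if got != want:
            break
        count += 1
    return count


def classify_header(filename: str, header: bytes) -> str:
    """Classify a file from its name and leading bytes only."""
    extension = os.path.splitext(filename)[1].lower()
    expected = SIGNATURES.get(extension)
    if expected is not None:
        candidates = [extension] if header.startswith(expected) else []
    else:
        # Missing or unknown extension (e.g. ".lmas"): try every known signature.
        candidates = [ext for ext, sig in SIGNATURES.items() if header.startswith(sig)]
    if candidates:
        return "This is a benign file, its extension is: " + " or ".join(candidates)
    return "This is a suspicious file"


def classify_file(path: str) -> str:
    """Read only the first bytes of `path`; the rest of the file is never parsed."""
    with open(path, "rb") as handle:
        return classify_header(os.path.basename(path), handle.read(HEADER_LENGTH))


# Hypothetical header: first 4 bytes as in the ".lmas" example, last 4 made up.
lmas_header = bytes.fromhex("504B0304") + bytes.fromhex("9A3C7E01")
print(classify_header("formation.xlsx.lmas", lmas_header))  # This is a suspicious file
print(matching_prefix_length(lmas_header, OOXML), "of", len(OOXML), "signature bytes match")  # 4 of 8 ...
print(classify_header("1.doc", bytes.fromhex("2F2E02895638DDAA")))  # This is a suspicious file
print(classify_header("unnamed", b"%PDF-1.7"))  # This is a benign file, its extension is: .pdf
```

Two caveats of this sketch: several extensions share one signature, so a file without an extension can only be narrowed down to a family (".doc or .xls or .ppt"), and very short signatures such as "MZ" can match encrypted data by chance. Files written by tools other than Microsoft Office may also differ in bytes 5 to 8 and would then be flagged by the assumed 8-byte value.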

V. CONCLUSION
In this work, we wrote a Python program that distinguishes a suspicious file from a normal one. We started by studying the headers of files with different extensions separately; we then extracted the header of each file and compared the headers of normal files with those of their encrypted versions. From this study, we deduce that a normal file has a fixed signature for its extension; once that signature is changed, the file is suspicious.
In future work, we will conduct a dynamic analysis by executing ransomware files in a simulated environment. This will allow us to extract ransomware-encrypted files and analyze them in order to develop and implement our own neural network. This network will be trained to identify ransomware files by first learning the characteristics extracted from the ransomware-encrypted files, and then using that knowledge to detect ransomware when a "vulnerable" file is downloaded onto a victim's device.