New Technique to Ensure Data Integrity for Archival Files Storage (DIFCS)

In this paper we develop an algorithm that increases the security of using the HMAC function (Keyed-Hash Message Authentication Code) to ensure data integrity when exchanging archival files. Hash functions are a very strong tool in information security. The algorithm is safe and quick, and it will allow the University of Tabuk (UT) authorities to be sure that the data of archival documents is not changed or modified by unauthorized personnel while being transferred over the network; it will also increase the efficiency of the network in which archived files are exchanged. The basic issues of hash functions and data integrity are presented as well. The developed algorithm is effective and easy to implement, using the HMAC algorithm to guarantee data integrity for scanned archival documents in the document management system.


I. INTRODUCTION
Information and data security in the different systems at UT is one of the most critical issues for the university authorities. Ensuring that data in these systems is not modified in an unauthorized fashion is a fundamental goal. UT departments use different kinds of information systems: the academic system, the ERP system, the document management system, etc. None of these systems has a tool to guarantee the integrity of its data.
Data integrity is one of the fundamental components of information security. Data integrity tools ensure that data (documents, messages, emails, files, etc.) cannot be changed, modified or deleted by unauthorized personnel, thereby ensuring accuracy and consistency.
When a message is sent through the local network or the Internet to a receiver, data integrity tools are used to ensure that the message was not altered and that it is identical to the one sent by the sender. There are many tools to ensure data integrity, such as the parity bit, checksum, encryption and hash functions. Hash functions are among the most used tools because they are simple, fast and free of charge.
Ensuring data integrity is already an important part of data exchange in telecommunications and networking systems. For UT, the use of the DIFCS (Data Integrity File Checking System) algorithm will guarantee that data stored in all applications is safe and reliable. This solution will also increase the safety of the university information systems in a convenient and effective way. Additionally, the DIFCS algorithm will increase the effectiveness of the whole file archive system. Our improved DIFCS algorithm relies on the HMAC function to ensure data integrity and authentication, and we add further techniques to increase the effectiveness of the algorithm in the local network.

II. HASH FUNCTIONS
A hash function is a function $h: M \to Y$ that has, as a minimum, two properties:
• it compresses a sequence $m \in M$ of bits of arbitrary length, including the empty sequence, into a sequence $h(m) \in Y$ of constant (fixed) length,
• for any $m \in M$ it is easy to compute $h(m)$.
The hash function transformation of the message $m = m_1 \| m_2 \| \dots \| m_t$, divided into fixed-length blocks $m_1, m_2, \dots, m_t$, can be described as follows (see Fig. 1):

$$H_0 = IV, \qquad H_i = \varphi(m_i, H_{i-1}), \quad i = 1, 2, \dots, t, \qquad h(m) = \psi(H_t)$$

where $IV$ is an initial value, $H_i$ is a chaining variable, $\varphi$ is a compression function (also called a round function) and $\psi$ is an output transformation. As a result we obtain $h(m)$ of fixed length. In cryptographic literature [2,5] the resulting sequence $h(m)$ has been given a wide variety of names: hash result, hash code, hash total, imprint, fingerprint, message digest, cryptographic checksum, authenticator, authentication tag, compression, compressed encoding, condensation, Message Integrity Code (MIC), etc. In the sequel $h(m)$ will be called the hash result.

Notation used for HMAC:
h: the hash function, MD5 or SHA-1
b: the number of bits in a block of the hash function
IV: the initial value for the hash function
m: the data input to HMAC
Y_i: the i-th block of m, 0 ≤ i ≤ (l-1)
l: the number of blocks in m after padding
n: the length of the hash code
K: the secret key; if the length of K is greater than b, then K = h(K)
K+: K padded with zeros so that the result has b bits
ipad: the inner pad; the byte 36 (in hexadecimal) repeated b/8 times
opad: the outer pad; the byte 5c (in hexadecimal) repeated b/8 times
h(m): the value of the HMAC; its length is n bits, where the maximum value for n depends on the hash function used, MD5 or SHA-1
Y: the set of all possible hash results

www.ijacsa.thesai.org
The structural model of the hash function is presented in Fig. 1 [2]. It works well if the last block $m_t$ has the same length as each of the previous blocks $m_1, m_2, \dots, m_{t-1}$. If this is not the case, extra bits must be appended to the input string before hashing to make $m_t$ as long as $m_1, m_2, \dots, m_{t-1}$.
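As an illustration of the general model, the following minimal Python sketch iterates a stand-in compression function over zero-padded blocks. The block size, the initial value, the 8-byte chaining length, the use of truncated SHA-256 as the compression function and the identity output transformation are all assumptions chosen for illustration; they do not correspond to any real hash function design.

```python
import hashlib

BLOCK_SIZE = 16        # bytes per block m_i (illustrative assumption)
IV = b"\x00" * 8       # initial value H_0 (illustrative assumption)


def pad(m: bytes) -> bytes:
    """Append zero bytes so the last block is as long as the others."""
    remainder = len(m) % BLOCK_SIZE
    if remainder:
        m += b"\x00" * (BLOCK_SIZE - remainder)
    return m


def compress(block: bytes, chaining: bytes) -> bytes:
    """Stand-in compression function: maps (m_i, H_{i-1}) to an 8-byte H_i."""
    return hashlib.sha256(chaining + block).digest()[:8]


def toy_hash(m: bytes) -> bytes:
    """Iterate the compression function over all blocks of the padded input."""
    padded = pad(m)
    h = IV
    for start in range(0, len(padded), BLOCK_SIZE):
        h = compress(padded[start:start + BLOCK_SIZE], h)
    return h  # output transformation taken to be the identity here


if __name__ == "__main__":
    print(toy_hash(b"archival document").hex())
```

The result always has the same length regardless of the input length. Plain zero padding is ambiguous (for example, `b"a"` and `b"a\x00"` pad to the same block), which is one reason practical hash functions use more careful padding rules.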

[Fig. 1 labels: input data m, hash result h(m)]

Data integrity ensures the accuracy and consistency of data stored or transmitted from one point to another. There are many methods for ensuring data integrity, both physical and logical.
Physical tools include RAID (Redundant Array of Independent Disks); logical tools include the parity bit, CRC, checksum, encryption and hash functions. In this paper we improve a logical tool that uses a hash function to ensure the data integrity of archived documents and files.
We focus on ensuring data integrity by using hash functions. We explain several algorithms that use hash functions (here the SHA-256 hash algorithm) to ensure data integrity and, in some cases, additional properties such as authentication and confidentiality.

Algorithm 1:
Process file $m_j$ with the SHA-256 hash function $h$ to calculate the hash result $h(m_j)$. Save file $m_j$ in the archive folder and save $y_j = h(m_j)$ in the secure folder of hash results. To read $m_j$ from its original folder, hash $m_j$ with the same hash function $h$ to calculate the actual value $x_j = h(m_j)$. If $y_j = x_j$ then the file was not changed; otherwise the file was changed. This algorithm requires downloading the original file and the hash result each time from the files storage and the hash storage, which are usually located on a server, and this decreases the effectiveness of the whole reading process. There is also no confidentiality for the files and no authentication of the file source, so a Man in the Middle attack can be a serious threat.
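Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: two in-memory dictionaries stand in for the archive folder and the secure folder of hash results, and the function names `archive_file` and `is_unchanged` are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hash result of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def archive_file(file_id: str, content: bytes,
                 archive: dict, hash_store: dict) -> None:
    """Save m_j in the archive and y_j = h(m_j) in the hash store."""
    archive[file_id] = content
    hash_store[file_id] = sha256_hex(content)


def is_unchanged(file_id: str, archive: dict, hash_store: dict) -> bool:
    """Recompute x_j = h(m_j) and compare it with the stored y_j."""
    x_j = sha256_hex(archive[file_id])
    return hmac.compare_digest(x_j, hash_store[file_id])


if __name__ == "__main__":
    archive, hash_store = {}, {}
    archive_file("doc-1", b"scanned page", archive, hash_store)
    print(is_unchanged("doc-1", archive, hash_store))   # file untouched
    archive["doc-1"] = b"scanned page (edited)"
    print(is_unchanged("doc-1", archive, hash_store))   # file modified
```

Because no secret is involved, anyone who can replace both the file and its stored hash result passes the check, which is the weakness noted for this algorithm.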

Algorithm 2:
Process file $m_j$ with a hash function $h$ to calculate the hash result $h(m_j)$ and encrypt it with the private key $k_d$. Save file $m_j$ in the archive folder and save $k_d(y_j) = k_d(h(m_j))$ in the secure folder of hash results. To read $m_j$ from its original folder, hash $m_j$ with the same hash function $h$ to calculate the actual value $x_j = h(m_j)$, and decrypt $k_d(y_j)$ with the system public key $k_e$ to recover $y_j$. If $y_j = x_j$ then the file was not changed; otherwise the file was changed.

Algorithm 3:
Process file $m_j$ with a hash function $h$ to calculate the hash result $h(m_j)$ and encrypt it with the secret symmetric key $k$. Save file $m_j$ in the archive folder and save $k(y_j) = k(h(m_j))$ in the secure folder of hash results. To read $m_j$ from its original folder, hash $m_j$ with the same hash function $h$ to calculate the actual value $x_j = h(m_j)$. Decrypt $k(y_j)$ with the same symmetric key $k$ to recover $y_j$. If $y_j = x_j$ then the file was not changed; otherwise the file was changed.

[Figure labels: files storage, hash storage]

This algorithm requires downloading the original file and the hash result each time from the files storage and the hash storage, which are usually located on a server, and this decreases the effectiveness of the whole reading process. There is also no confidentiality for the files. On the other hand, authentication of the file source is ensured.

Algorithm 4:
Append a secret bit string $p$ to $m_j$ and then process $m_j \| p$ with a hash function $h$ to calculate the hash result $h(m_j \| p)$. Save file $m_j$ in the archive folder and save $y_j = h(m_j \| p)$ in the secure folder of hash results. To read $m_j$ from its original folder, append the secret $p$ to $m_j$ and hash $m_j \| p$ with the same hash function $h$ to calculate the actual value $x_j = h(m_j \| p)$. If $y_j = x_j$ then the file was not changed; otherwise the file was changed. This algorithm requires downloading the original file and the hash result each time from the files storage and the hash storage, which are usually located on a server, and this decreases the effectiveness of the whole reading process. There is also no confidentiality for the files, but authentication of the file source is ensured. An additional powerful cryptographic property is obtained: even for $m_1 = m_2$ the stored hash results can differ, $h(m_1) \neq h(m_2)$, when different secrets are appended.
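A minimal Python sketch of Algorithm 4 follows, assuming SHA-256 as $h$ and byte strings for the file and the secret; the function name `secret_suffix_hash` and the 32-byte secret length are hypothetical choices. It also shows the property stated above: two identical files hashed with different secrets give different hash results.

```python
import hashlib
import hmac
import secrets


def secret_suffix_hash(content: bytes, p: bytes) -> str:
    """Compute h(m_j || p): hash the file followed by the secret p."""
    return hashlib.sha256(content + p).hexdigest()


if __name__ == "__main__":
    m1 = m2 = b"identical archival file"
    p1 = secrets.token_bytes(32)   # secret for the first file (assumed length)
    p2 = secrets.token_bytes(32)   # a different secret for the second file

    y1 = secret_suffix_hash(m1, p1)
    y2 = secret_suffix_hash(m2, p2)
    print(y1 != y2)                # same content, different hash results

    # Verification on read: recompute with the same secret and compare.
    x1 = secret_suffix_hash(m1, p1)
    print(hmac.compare_digest(x1, y1))
```

Without knowledge of $p$, a party who alters the file cannot recompute a matching hash result.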
If the saved files must be kept secret, an additional operation can be applied in which $m_j$ is encrypted with a symmetric or asymmetric encryption algorithm.
In this paper we use a special case of the fourth algorithm: HMAC (Keyed-Hash Message Authentication Code), which is used as a cryptographic authentication tool.

IV. HMAC
The main goals behind the HMAC construction [20] are:
• To use available hash functions without modifications; in particular, hash functions that perform well in software, and for which the code is freely and widely available.
• To preserve the original performance of the hash function without incurring a significant degradation.
• To use and handle keys in a simple way.
• To have a well-understood cryptographic analysis of the strength of the authentication mechanism based on reasonable assumptions on the underlying hash function, and to allow easy replacement of the underlying hash function if a faster or more secure one becomes available.
HMAC requires a cryptographic hash function, which we denote by $h$, and a secret key $K$. We assume $h$ to be a cryptographic hash function where data is hashed by iterating a basic compression function on $l$ blocks of data. We denote by $b$ the bit-length of such blocks (where $l \cdot b$ equals the length of $m$ in bits after padding), and by $n$ the bit-length of hash outputs ($n = 128$ bits for MD5, $n = 160$ bits for SHA-1). The authentication key $K$ can be of any length up to $b$, the block length of the hash function. Applications that use keys longer than $b$ bits first hash the key using $h$ and then use the resulting $n$-bit string as the actual key to HMAC. In any case, the minimal recommended length for $K$ is $n$ bits (the hash output length).
HMAC can be calculated as follows (Fig. 6):

1) Append zeros to the end of $K$ to create a $b$-bit string $K^+$.
2) XOR (bitwise exclusive-OR) $K^+$ with ipad to produce the $b$-bit block $S_i$.
3) Append $m$ to $S_i$.
4) Apply $h$ to the stream generated in step 3.
5) XOR $K^+$ with opad to produce the $b$-bit block $S_o$.
6) Append the hash result calculated in step 4 to $S_o$.
7) Apply $h$ to the stream generated in step 6 and output the result.
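The seven steps can be sketched directly in Python and checked against the standard library's `hmac` module. The sketch assumes SHA-256 as the underlying hash function $h$ (so $b/8 = 64$ bytes) and places the zero padding after the key bytes, which is what RFC 2104 specifies; the function name `hmac_sha256_manual` is hypothetical.

```python
import hashlib
import hmac


def hmac_sha256_manual(key: bytes, message: bytes) -> bytes:
    """Compute HMAC-SHA-256 step by step."""
    block_size = hashlib.sha256().block_size      # b/8 = 64 bytes for SHA-256

    # Keys longer than b bits are first hashed with h.
    if len(key) > block_size:
        key = hashlib.sha256(key).digest()

    # Step 1: pad K with zero bytes to a full block, giving K+.
    k_plus = key.ljust(block_size, b"\x00")

    # Steps 2-4: S_i = K+ XOR ipad, then hash S_i || m.
    s_i = bytes(byte ^ 0x36 for byte in k_plus)
    inner = hashlib.sha256(s_i + message).digest()

    # Steps 5-7: S_o = K+ XOR opad, then hash S_o || inner hash result.
    s_o = bytes(byte ^ 0x5C for byte in k_plus)
    return hashlib.sha256(s_o + inner).digest()


if __name__ == "__main__":
    key = b"example secret key"
    msg = b"archival document"
    reference = hmac.new(key, msg, hashlib.sha256).digest()
    print(hmac.compare_digest(hmac_sha256_manual(key, msg), reference))
```

In practice the library function `hmac.new` should be used; the manual version is shown only to make the construction explicit.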
Because different fixed values of ipad and opad are used and the hash function is applied twice, we avoid the situation where XORing $K^+$ with ipad or $K^+$ with opad yields an all-zero value. The key for HMAC [21] can be of any length (keys longer than $b$ bits are first hashed using $h$). However, a key shorter than $n$ bits is strongly discouraged, as it would decrease the security strength of the function. Keys longer than $n$ bits are acceptable, but the extra length would not significantly increase the function's strength. A longer key may be advisable if the randomness of the key is considered weak. Keys need to be chosen randomly (or using a cryptographically strong pseudo-random generator seeded with a random seed) and periodically refreshed. Current attacks do not indicate a specific recommended frequency for key changes, as these attacks are practically infeasible. However, periodic key refreshment is a fundamental security practice that helps against potential weaknesses of the function as well as of the keys, and therefore limits the damage of an exposed key.
In our research we focus on ensuring data integrity by using HMAC [19]. HMAC can be used with any hash function, so in this paper we recommend using at least SHA-256, which is still secure against brute-force attacks. In the future we recommend using hash results of even 1024 bits in length.
In any document management system, each department in the organization has to archive its uploaded files in a central archival warehouse. In implementing such a solution we face two important issues: ensuring data integrity for archived files during transmission, and the performance of the network over which these files are transferred from the server to the local computers.

Usually each department has access only to its own archived files and not to the files of the whole archival warehouse. The improved algorithm we developed relies on this fact: most of the files requested by a department's user were uploaded by the same department. In this paper we implement an efficient algorithm that ensures data integrity and authentication for the archived files and at the same time ensures better network performance. The HMAC algorithm is used to ensure data integrity and authentication, and a temporary local storage of the most used archival files on the local PC increases the efficiency of the network.
In the proposed solution, uploaded files are saved in two storage locations: on the local PC of the uploading user (LPC) and on the Central Archive Server (SAV). Additionally, on both SAV and LPC we apply HMAC with a secret key.

Uploading process:
When the user uploads a file $F$ on his LPC, the file is saved in the temporary matrix storage on the LPC and is also sent to and saved on the SAV server. This saving process is shown in Fig. 7, where each file $F$ has a unique identifier $f_{id}$ that identifies $F$ on both LPC and SAV. The LPC calculates the hash result $h_{k_{id}}(F)$ for $F$ using the HMAC algorithm with a random, unique secret key $k_{id}$. It then encrypts $h_{k_{id}}(F)$ and $k_{id}$ with the user's private key $LPC_{kd}$ to ensure authentication, and the result is encrypted with the SAV public key $SAV_{ke}$ to ensure confidentiality.
The encrypted result $c$ and $F$ are sent together through the network to SAV.
On SAV, $c$ is decrypted with the SAV private key $SAV_{kd}$ and then decrypted again with the LPC public key $LPC_{ke}$ to recover $h_{k_{id}}(F)$ and $k_{id}$. Using the key $k_{id}$ recovered from $c$, SAV calculates the hash result $h'_{k_{id}}(F)$ for $F$ using HMAC. SAV compares the received hash result $h_{k_{id}}(F)$ with the calculated one $h'_{k_{id}}(F)$. If they are equal, then $F$, $f_{id}$ and $k_{id}$ are saved on SAV; otherwise SAV must send a request to retransmit everything from the LPC.
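The integrity check at the heart of the uploading process can be sketched as follows. This is a simplified illustration under stated assumptions: HMAC-SHA-256 is used, the two public-key encryption layers that protect $h_{k_{id}}(F)$ and $k_{id}$ in transit are omitted (the values are passed in the clear), a dictionary stands in for SAV storage, and the function names `lpc_prepare_upload` and `sav_accept_upload` are hypothetical.

```python
import hashlib
import hmac
import secrets


def lpc_prepare_upload(file_bytes: bytes) -> tuple[bytes, bytes]:
    """LPC side: pick a random unique key k_id and compute h_kid(F)."""
    k_id = secrets.token_bytes(32)
    tag = hmac.new(k_id, file_bytes, hashlib.sha256).digest()
    return k_id, tag


def sav_accept_upload(store: dict, f_id: str, file_bytes: bytes,
                      k_id: bytes, tag: bytes) -> bool:
    """SAV side: recompute h'_kid(F) and save F only if it matches."""
    expected = hmac.new(k_id, file_bytes, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, tag):
        return False               # SAV would request a retransmission
    store[f_id] = (file_bytes, k_id)
    return True


if __name__ == "__main__":
    sav_store = {}
    f = b"scanned archival document"
    k_id, tag = lpc_prepare_upload(f)
    print(sav_accept_upload(sav_store, "f-001", f, k_id, tag))
    print(sav_accept_upload(sav_store, "f-002", f + b"x", k_id, tag))
```

In a full implementation the pair ($h_{k_{id}}(F)$, $k_{id}$) would be protected by the private-key and public-key operations described above before leaving the LPC.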

Downloading process:
There are two situations. In the first, file $F$ with $f_{id}$ and $k_{id}$ exists on the LPC, so the LPC requests only $h_{k_{id}}(F)$ from SAV. When the LPC requires a file $F$ with identifier $f_{id}$ from SAV, the steps are performed as shown in Fig. 8.
If additional security such as confidentiality is required, a symmetric key algorithm is used to ensure the confidentiality of $F$. A public key algorithm is used to exchange the secret key between SAV and the user working on the LPC.
In our improved algorithm DIFCS we strengthened the cryptographic characteristics of the whole process of saving a file and its hash result on the server and reading files and their hash results from the same server. Comparing the cryptographic characteristics of the developed algorithm with the other algorithms mentioned in this paper, we can conclude the following:
a. In the DIFCS algorithm the original file is saved on the local machine, so it is not necessary to download the original file each time from the files storage, which is usually located on a server. This increases the effectiveness of the whole archive file retrieval process.
b. Authentication of the file source is ensured.
In future work, we will develop the algorithm into a distributed algorithm, in which archival files are distributed and saved in different places according to a known mechanism. Such a development will increase the efficiency of the system.

Fig. 1. General model of the hash function h

III. DATA INTEGRITY
Any information system is deemed secure if it has at least three properties: confidentiality, data integrity and availability. Data integrity is therefore one of the most important aspects of data security.

Fig. 7. Sending and saving process of file F to SAV

Fig. 8(a) shows the first download situation, in which the LPC requests only $h_{k_{id}}(F)$ from SAV. Fig. 8(b) shows the second, in which the LPC has only $f_{id}$ and needs file $F$ together with $k_{id}$ and $h_{k_{id}}(F)$.
[Fig. 8 decision step: if $h'_{k_{id}}(F) = h_{k_{id}}(F)$, retrieve the file; else, if $h'_{k_{id}}(F) \neq h_{k_{id}}(F)$, resend.]

c. An additional powerful cryptographic property is obtained: for two messages $m_1$ and $m_2$ with $m_1 = m_2$, the stored hash results can still differ, $h(m_1) \neq h(m_2)$, because each file is hashed with its own secret key.
d. Confidentiality for the files or hash results can be implemented according to the user requirements.

VI. CONCLUSIONS
In this research we developed a new algorithm called DIFCS, which uses the HMAC function to ensure data integrity and authentication for archival file systems. DIFCS also uses a new technique for retrieving archive files and checking whether they are authentic. The main function of DIFCS is to increase the efficiency of the file archival system and the local network. Such an algorithm ensures data integrity for archived files and protects them against unauthorized manipulation and Man in the Middle attacks. It also ensures authentication between LPC and SAV.