Arabic Cursive Characters Distributed Recognition Using the DTW Algorithm on BOINC: Performance Analysis

Abstract: Volunteer computing, or volunteer grid computing, is a very promising infrastructure that provides substantial computing and storage power without any prior cost or investment. Such infrastructures result from the federation of many geographically dispersed computers and/or LAN computers over the Internet. The Berkeley Open Infrastructure for Network Computing (BOINC) is considered the best-known volunteer computing infrastructure. In this paper, we are interested in distributing Arabic OCR (Optical Character Recognition) based on the DTW (Dynamic Time Warping) algorithm over BOINC, in order to show once more that volunteer computing provides promising infrastructures for speeding up compute-intensive algorithms and applications, and Arabic OCR based on the DTW algorithm in particular. Two properties make Arabic OCR based on the DTW algorithm attractive: first, its ability to correctly recognize words or sub-words, without any prior segmentation, from a reference library of isolated characters; second, its good immunity against a wide range of noise. First results confirm that BOINC constitutes an interesting and promising framework for speeding up Arabic OCR based on the DTW algorithm.


I. INTRODUCTION
Arabic OCR based on the DTW algorithm provides very good recognition and segmentation rates. One of the advantages of the DTW algorithm is its ability to correctly recognize words or connected characters without prior segmentation. In our previous studies on high- and medium-quality documents [2], [3], [5], we obtained an average recognition rate of more than 98% and a segmentation rate of more than 99%. The purpose of the DTW algorithm is to perform the optimal time alignment between a reference pattern and an unknown pattern in order to ease the evaluation of their similarity. Unfortunately, the drawback of DTW is its computational complexity [6], [7], [8]. Consequently, several solutions and approaches have been proposed to speed up the DTW algorithm [1], [4], [5], [7], [8], [9].
In this paper we show and confirm, through an experimental study, how volunteer computing, which has the advantage of being costless, can also substantially reduce the execution time of Arabic OCR using the DTW algorithm. More specifically, we show how BOINC can achieve this. The remainder of this paper is organized as follows. Section II describes Arabic OCR using the DTW algorithm. Section III gives an overview of volunteer computing, especially BOINC (Berkeley Open Infrastructure for Network Computing). The proposed approach and the corresponding performance evaluation are detailed in Section IV. Concluding remarks and some future investigations are presented in Section V.

II. MECHANISM OF THE ARABIC OCR BASED ON THE DTW ALGORITHM
Words in Arabic are inherently written in blocks of connected characters. Recognition approaches therefore usually require a prior segmentation of these blocks into separate characters, and many researchers have indeed segmented Arabic words into isolated characters before performing the recognition phase. The appeal of the DTW technique, however, lies in its ability to perform the recognition efficiently without prior segmentation [2,4].
We consider in this paper a reference library of R trained characters forming the Arabic alphabet in some given fonts, denoted by $C_r$, $r = 1, 2, \ldots, R$. The technique consists in using the DTW pattern-matching method to match an input character against the reference library. The input character is recognized as the reference character that provides the best time alignment; namely, character A is recognized as $C_k$ if the summation distance $D(A, C_k)$ corresponding to the matching of A to reference character $C_k$ satisfies the following equation [4,5]:

$$D(A, C_k) = \min_{1 \le r \le R} \{ D(A, C_r) \}$$
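The isolated-character rule above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Euclidean local distance, the symmetric step pattern, and the names `dtw_distance` and `recognize` are assumptions introduced only for illustration.

```python
import math


def dtw_distance(a, b):
    """Cumulative DTW distance between two sequences of feature vectors.

    Assumptions (not from the paper): Euclidean local distance and the
    classic step pattern with predecessors (i-1, j), (i, j-1), (i-1, j-1).
    """
    n, m = len(a), len(b)
    inf = math.inf
    # s[i][j]: least cumulative distance aligning a[:i+1] with b[:j+1]
    s = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(a[i], b[j])
            if i == 0 and j == 0:
                s[i][j] = d
                continue
            best = inf
            if i > 0:
                best = min(best, s[i - 1][j])
            if j > 0:
                best = min(best, s[i][j - 1])
            if i > 0 and j > 0:
                best = min(best, s[i - 1][j - 1])
            s[i][j] = d + best
    return s[n - 1][m - 1]


def recognize(unknown, library):
    """Return the label k whose reference minimizes D(unknown, C_k)."""
    return min(library, key=lambda k: dtw_distance(unknown, library[k]))


# Toy library with two hypothetical reference characters.
library = {
    "C1": [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    "C2": [(5.0, 5.0), (5.0, 5.0)],
}
# A time-stretched version of C1.
unknown = [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
label = recognize(unknown, library)
```

In this toy case the stretched input aligns with `C1` at zero cumulative distance, which shows why time warping tolerates variations in character width.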
Let T be a given connected sequence of Arabic characters to be recognized. T is composed of a sequence of N feature vectors $T_i$, $1 \le i \le N$, which represent the concatenation of sub-sequences of feature vectors, each representing an unknown character to be recognized. As portrayed in Fig. 1, text T lies on the time axis (the X-axis) in such a manner that feature vector $T_i$ is at time i on this axis. The reference library is portrayed on the Y-axis, where reference character $C_r$ is of length $l_r$, $1 \le r \le R$. Let $S(i, j, r)$ represent the cumulative distance at point $(i, j)$ relative to reference character $C_r$.

http://ijacsa.thesai.org/

The objective is to detect, simultaneously and dynamically, the number of characters composing T and to recognize these characters. There surely exist a number k and indices $(m_1, m_2, \ldots, m_k)$ such that $C_{m_1} \oplus C_{m_2} \oplus \cdots \oplus C_{m_k}$ represents the optimal alignment to text T, where $\oplus$ denotes the concatenation operation. The warping path from point $(1, 1, m_1)$ to point $(N, l_{m_k}, m_k)$ that represents the optimal alignment is therefore of minimum cumulative distance, that is:

$$S(N, l_{m_k}, m_k) = \min_{1 \le r \le R} \{ S(N, l_r, r) \}$$

This path, however, is not continuous, since it spans many different characters in the distance matrix. We must therefore allow, at any time, the transition from the end of one reference character to the beginning of another reference character. The end of reference character $C_r$ is first reached whenever the warping function reaches point $(i, l_r, r)$, $i = \lceil\ \rceil, \ldots, N$. As we can see from Fig. 1, the ends of reference characters $C_1$, $C_2$, $C_3$ are first reached at times 3, 4, 3 respectively. The end points of reference characters are shown in Fig. 1 inside diamonds, and points at which transitions occur are within a circle. The warping function always reaches the ends of the reference characters. At each time i, we allow the start of the warping function at the beginning of each reference character, along with the addition of the smallest cumulative distance of the end points found at time $i - 1$ [4,5]. The resulting functional equations therefore cover two cases: the progression of the warping function inside a reference character, and its entry at the beginning of a reference character from the best end point of the previous time instant.
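As a hedged illustration only, one common way to write functional equations of this kind is given below, writing $C_r$ for the r-th reference character, $l_r$ for its length, and $S(i, j, r)$ for the cumulative distance. The local distance $D(i, j, r)$ between the i-th feature vector of the text and the j-th feature vector of $C_r$, the set of allowed predecessors $(i-1, j)$, $(i-1, j-1)$, $(i-1, j-2)$, and the initialization at $i = 1$ are assumptions of this sketch, not forms confirmed by the source.

```latex
\begin{align*}
S(i, j, r) &= D(i, j, r) + \min\bigl\{ S(i-1, j, r),\; S(i-1, j-1, r),\; S(i-1, j-2, r) \bigr\},
  && 2 \le j \le l_r, \\
S(i, 1, r) &= D(i, 1, r) + \min\Bigl\{ S(i-1, 1, r),\; \min_{1 \le k \le R} S(i-1, l_k, k) \Bigr\},
\end{align*}
% for i = 2, ..., N and r = 1, ..., R, with the convention S(i-1, 0, r) = \infty.
% Assumed initialization at the first time instant:
\begin{align*}
S(1, 1, r) &= D(1, 1, r), & S(1, j, r) &= \infty \quad \text{for } 2 \le j \le l_r .
\end{align*}
```

The second equation is the one that encodes the transition rule: the term $\min_{k} S(i-1, l_k, k)$ is the smallest cumulative distance among the end points found at time $i - 1$.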

These functional equations are completed by boundary conditions. To trace back the warping function and the optimal alignment path, we have to memorize the transition times among reference characters. This can easily be accomplished by a procedure built on a function trace min, which returns the element corresponding to the term that minimizes the functional equations. The functioning of this algorithm is portrayed in Fig. 1 by means of two vectors: the first gives, for each time i, the reference character with the least cumulative distance at time i, and the second provides the link to the start of this reference character in the text T. The heavily marked path through the distance matrix represents the optimal alignment of text T to the reference library. We observe that the text is recognized as $C_1 \oplus C_3$ [5].
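The recognition of connected characters with trace-back can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' procedure: the local distance is Euclidean, the allowed predecessors are (i-1, j), (i-1, j-1) and (i-1, j-2), paths may only start at the first feature vector of a reference character, and the names `recognize_connected`, `best_char` and `link` are hypothetical stand-ins for the two vectors mentioned above.

```python
import math


def recognize_connected(text, library):
    """Recognize a connected sequence without prior segmentation.

    text:    list of feature vectors (one per time instant)
    library: list of reference characters, each a list of feature vectors
    Returns the indices of the recognized reference characters, in order.
    """
    n, inf = len(text), math.inf
    best_end = [inf] * n   # least cumulative distance over end points at time i
    best_char = [0] * n    # reference character achieving best_end[i]
    link = [0] * n         # time at which that character started in the text
    prev = prev_start = None
    for i in range(n):
        cur = [[inf] * len(c) for c in library]
        cur_start = [[i] * len(c) for c in library]
        for r, c in enumerate(library):
            for j in range(len(c)):
                d = math.dist(text[i], c[j])
                if i == 0:
                    # Assumed boundary condition: start at the first vector.
                    if j == 0:
                        cur[r][0] = d
                    continue
                # Candidate predecessors as (cumulative distance, start time).
                cands = [(prev[r][j], prev_start[r][j])]
                if j >= 1:
                    cands.append((prev[r][j - 1], prev_start[r][j - 1]))
                if j >= 2:
                    cands.append((prev[r][j - 2], prev_start[r][j - 2]))
                if j == 0:
                    # Transition from the best end point found at time i-1.
                    cands.append((best_end[i - 1], i))
                cost, start = min(cands)
                cur[r][j] = d + cost
                cur_start[r][j] = start
        # Memorize the best end point and its starting time for time i.
        for r in range(len(library)):
            if cur[r][-1] < best_end[i]:
                best_end[i] = cur[r][-1]
                best_char[i] = r
                link[i] = cur_start[r][-1]
        prev, prev_start = cur, cur_start
    if n == 0 or best_end[n - 1] == inf:
        return []
    # Trace back through the memorized transition times.
    result, i = [], n - 1
    while i >= 0:
        result.append(best_char[i])
        i = link[i] - 1
    return result[::-1]


# Toy example with one-dimensional feature vectors.
refs = [
    [(0.0,), (1.0,), (2.0,)],   # index 0
    [(5.0,), (5.0,)],           # index 1
    [(9.0,), (8.0,), (7.0,)],   # index 2
]
connected = [(0.0,), (1.0,), (1.0,), (2.0,), (9.0,), (8.0,), (7.0,)]
recognized = recognize_connected(connected, refs)
```

Only two columns of cumulative distances are kept in memory at a time; the trace-back needs nothing more than the two per-time vectors, which is why memorizing the transition times is sufficient to recover the segmentation.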

III. VOLUNTEER COMPUTING AND BOINC
Volunteer computing is a form of distributed computing in which members of the general public volunteer processing and storage resources to scientific research projects. Early volunteer computing projects include the Great Internet Mersenne Prime Search [12], SETI@home [10], Distributed.net [11] and Folding@home [13]. Today the approach is being used in many areas, including high energy physics, molecular biology, medicine, astrophysics, and climate dynamics. This type of computing can provide great power (SETI@home, for example, has accumulated 2.5 million years of CPU time in 7 years of operation). However, it requires attracting and retaining volunteers, which places many demands both on projects and on the underlying technology.
The Berkeley Open Infrastructure for Network Computing (BOINC) is a framework for deploying distributed computing platforms based on volunteer computing. It is developed at the U.C. Berkeley Space Sciences Laboratory and is released under an open source license.
BOINC is the evolution of the original SETI@home project, which started in 1999 and attracted millions of participants worldwide [16].
This middleware provides a generic framework for implementing distributed computation applications within a heterogeneous environment. The system is designed as a software platform utilizing computing resources from volunteer computers [14].
BOINC software is divided into two main components: the server side and the client side. BOINC allows computing resources to be shared among different autonomous projects. A BOINC project is identified by a unique URL, which BOINC clients use to register with it. Every BOINC project must run a host with the server side of the BOINC software.
BOINC software on the server side comprises several components: one or more scheduling servers that communicate with BOINC clients, one or more data servers that distribute input files and collect output files, a web interface for participants and project administrators and a relational database that stores information about work, results, and participants.
BOINC software provides powerful tools to manage the applications run by a project. For instance, it makes it easy to define different application versions for different target architectures. A "workunit" describes a computation to be performed, associating a (unique) name with an application and the corresponding input files. Not all kinds of applications are suitable for deployment on BOINC. Ideally, candidate applications should exhibit "independent parallelism" (they are divisible into parallel parts with few or no data dependencies) and a low data/compute ratio, since output files will be sent through a typically slow commercial Internet connection [16].
In this paper, we show how volunteer computing such as BOINC can speed up the execution time of Arabic OCR based on the DTW algorithm.

IV. THE DTW DATA DISTRIBUTION OVER BOINC
The Arabic OCR procedure based on the DTW algorithm, described in Section II, offers many ways on which one could base its parallelization or distribution. The idea of the proposed approach is to take advantage of the ample computing power provided by BOINC to speed up the execution of the DTW algorithm.
A BOINC project uses a set of servers to create, distribute, record, and aggregate the results of a set of tasks that the project needs to perform to accomplish its goal. Each task consists of evaluating a data set, called a workunit. The servers distribute the tasks and corresponding workunits to clients (software that runs on computers that people permit to participate in the project). When a computer running a client would otherwise be idle (in the context of volunteer computing, a computer is deemed idle if its screensaver is running), it spends the time working on the tasks that a server has assigned to the client. When the client has finished a task, it returns the result to the server. If the user of the computer begins to use it again, the client is interrupted and the task it is processing is paused while the computer executes programs for the user. When the computer becomes idle again, the client resumes the task it was working on when it was interrupted.
To be added into a BOINC project, applications must incorporate some interaction with the BOINC client: they must notify the client about start and finish, and they must allow for renaming of any associated data files, so that the client can relocate them in the appropriate part of the guest operating system and avoid conflicts with workunits from other projects [16].
We propose to split the binary image of a given Arabic text to be recognized, in an optimal way, into a set of binary sub-images and then to distribute them among volunteer computers that are already subscribed to our project. BOINC uses a simple but rich set of abstractions: files, applications, and data. A project defines application versions for various platforms (Windows, Linux/x86, Mac OS/X, etc.).
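The splitting step can be sketched as follows. The paper does not specify how the optimal split is computed, so this Python sketch makes two labeled assumptions: the image is cut at blank pixel rows so that no text line is divided, and the resulting bands are grouped into contiguous, near-equal shares, one per computer. The names `split_into_lines` and `assign_to_workers` are hypothetical and are not part of BOINC.

```python
def split_into_lines(image):
    """Cut a binary image (list of rows of 0/1 pixels) at blank rows.

    Returns a list of (top, bottom) row ranges, bottom exclusive, one per
    band of consecutive rows that contain at least one ink pixel.
    """
    bands, top = [], None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and top is None:
            top = y                     # a new text line begins
        elif not has_ink and top is not None:
            bands.append((top, y))      # the current text line ends
            top = None
    if top is not None:
        bands.append((top, len(image)))
    return bands


def assign_to_workers(bands, n_workers):
    """Group bands into contiguous, near-equal shares, one per worker."""
    groups = [[] for _ in range(n_workers)]
    for k, band in enumerate(bands):
        groups[k * n_workers // len(bands)].append(band)
    return groups


# Tiny 7-row image containing three text lines separated by blank rows.
page = [
    [0, 0],
    [1, 0],
    [1, 1],
    [0, 0],
    [0, 1],
    [0, 0],
    [1, 1],
]
line_bands = split_into_lines(page)
shares = assign_to_workers(line_bands, 2)
```

Each share would then become the input of one workunit. Because DTW recognizes each text line independently of the others, this kind of split keeps the data dependencies between workunits at zero.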
An application can consist of an arbitrary set of files. A workunit represents the inputs to a computation: the application (but not a particular version), a set of references to input files, and sets of command-line arguments and environment variables. Each workunit has parameters such as computing, memory and storage requirements, and a soft deadline for completion. A result represents the outcome of a computation; it consists of a reference to a workunit and a list of references to output files.
Files can be replicated; the description of a file includes a list of URLs from which it may be downloaded or to which it may be uploaded. When the BOINC client communicates with a scheduling server, it reports completed work and receives an XML document describing a collection of the above entities. The client then downloads and uploads files and runs applications; it maximizes concurrency, using multiple CPUs when possible and overlapping communication and computation.

A. Experimental Study
A significant reduction in the elapsed time, defined as the time elapsing from the start until the completion of the text recognition, can be achieved by using a distributed architecture. This effect is measured by the speedup factor, defined as the ratio of the elapsed time in sequential mode with just one processor to the elapsed time using the distributed architecture.
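The definition can be made concrete with a short Python sketch that uses the figures reported in this study for 16 computers (an execution time of 1450 seconds and a speedup factor of 15). The sequential time is not stated in the text; the value computed below is only what those two figures imply. The parallel efficiency (speedup divided by the number of computers) is a standard derived quantity added here for illustration, and the function names are hypothetical.

```python
def speedup(t_sequential, t_distributed):
    """Speedup factor: sequential elapsed time over distributed elapsed time."""
    return t_sequential / t_distributed


def efficiency(speedup_factor, n_computers):
    """Fraction of the ideal (linear) speedup that is actually obtained."""
    return speedup_factor / n_computers


# Figures reported in the experimental study for 16 computers.
t_distributed = 1450.0      # seconds
reported_speedup = 15.0
n_computers = 16

# Sequential time implied by the two reported figures (not stated in the text).
implied_t_sequential = reported_speedup * t_distributed

eff = efficiency(reported_speedup, n_computers)
```

Under these figures the implied sequential time is 21750 seconds, and the efficiency is 15/16, that is, about 94% of the ideal linear speedup.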
Next, we consider only the case where the volunteer computers participating in the work are homogeneous, that is, all the interconnected computers are identical in terms of computing power, hardware configuration and operating system. We ran several experiments on several specific printed Arabic texts.
Our experiments aim at showing that volunteer computing indeed offers an interesting infrastructure for speeding up the execution of Arabic OCR based on the DTW algorithm.
To this end, we studied the effect of distributing approximately one hundred (100) similar printed Arabic text pages over a variable number of volunteer computers. The speedup factor increases with the number of computers used.
With 16 computers, the execution time is 1450 seconds and the speedup factor reaches 15. This result is very interesting because, in this case, the proposed OCR system is able to recognize more than 830 characters per second, a rate that can be compared with existing commercial Arabic OCR systems [2].
Consequently, volunteer computing constitutes an interesting infrastructure for drastically reducing the execution time of Arabic OCR based on the DTW algorithm. Moreover, thanks to the ample computing power provided by such infrastructures, we can now consider improving the recognition rate of our system by adding complementary approaches or techniques.

V. CONCLUSION AND PERSPECTIVE
This paper has shown how volunteer computing offers an interesting infrastructure for substantially reducing the execution time of Arabic OCR based on the DTW algorithm. The experiments conducted confirm that such infrastructures can greatly help in building a powerful Arabic OCR system based on the combination (integration) of strong complementary approaches or techniques that require substantial computing power.
Several investigations are under way, especially on how to exploit BOINC on a large scale and how to improve the recognition rate of Arabic OCR based on the DTW algorithm, in order to build a powerful Arabic OCR system.
The experimental conditions were as follows:
• The number of pages is 100.
• The number of lines per page is 7.
• The average number of characters per line is 55.
• The number of characters per page is 369.
• The reference library contains 103 characters.
• We used 16 dedicated homogeneous workers, all with the same configuration: 3 GHz CPU frequency, 512 MB of RAM, running Windows XP Professional.

Fig. 2
Figs. 2 and 3 illustrate the results of our experiments. These figures show, in particular, the results discussed in the experimental study of Section IV.