Recovering UML2 Sequence Diagrams from Execution Traces

Reverse engineering is a proven and efficient technique for automatically generating UML2 models from object-oriented legacy systems with missing or obsolete documentation. Two techniques are used to perform it: dynamic and static analysis. Dynamic analysis collects information while the system is running, whereas static analysis inspects the source code. Because of polymorphism and dynamic binding, dynamic analysis is preferred over static analysis for extracting dynamic models that represent the behavior of a system. In this paper, we present a new methodology that uses Colored Petri Nets (CPNs) to recover UML2 Sequence Diagrams (SDs). First, it generates execution traces corresponding to the different scenarios that represent the system behavior. Then, CPNs are used to model and analyze these execution traces in order to extract a UML2 sequence diagram. Our case study illustrates the process and shows that sequence diagrams can be extracted with good accuracy.
Keywords: Execution traces; Reverse engineering; UML2; Sequence Diagram; Colored Petri Nets


I. INTRODUCTION
Today, object-oriented systems are becoming increasingly large and complex, which increases the cost of their development and maintenance. According to [1], software maintenance represents 50% to 75% of the total cost. Despite the progress made in software engineering and development methods, many legacy systems still suffer from problems such as the unavailability of the original developers, obsolete development methods, outdated documentation and code that does not comply with the design. In the software lifecycle, understanding the architecture and behavior of a system is the main task of the maintenance phases. It is a tedious and time-consuming task that mobilizes considerable human resources. As mentioned in [2], up to 60% of maintenance time is spent on understanding the software. Therefore, it is important to develop techniques for obtaining an abstract representation that facilitates the understanding of these systems.
A proven and effective technique for addressing this problem is reverse engineering of UML2 models. It can be defined as the process of analyzing the source code of a system and representing it in models with a higher level of abstraction. Reverse engineering is mostly used to extract high-level abstraction models or semantics from the source code [3], and thereby helps in understanding existing systems. The IEEE-1219 standard [4] considers reverse engineering a technological solution for dealing with legacy systems. For object-oriented software, the most used modeling language is UML (Unified Modeling Language) [5]. Dynamic models are as important as static models because they make it possible to understand the behavior of the system. One of the major UML dynamic models is the SD, which can represent complex interactions between the objects of a system [6]. As described in [7], dynamic analysis removes the ambiguity of message sending when inheritance, delegation, polymorphism, dynamic links and reflection are used intensively. For this reason, we focus on this type of analysis.
This paper draws on our previous work [8,9] to propose a new, more coherent and more precise approach for reverse engineering UML2 SDs. The new approach makes it possible to extract the conditions of the combined fragment operators alt, opt and loop. For this purpose, the generation of execution traces and the modeling with CPNs have been improved: the CPNs used are smaller and more coherent. However, the approach does not currently apply to multi-threaded systems.
The remainder of this paper is organized as follows. Section II reviews related work. Section III gives the background on reverse engineering of UML2 SDs using CPNs. Section IV outlines the proposed approach. Section V presents a case study. Finally, Section VI concludes and points out some directions for future work.
II. RELATED WORK
Reverse engineering is defined as "the process of identifying and analysis of software's system components, their interrelationships, and the representation of their entities at a higher level of abstraction" [10]. Reverse engineering aims to discover the technological principles of a system through the analysis of its structure and behavior.
In the literature, existing approaches fall into two main categories depending on the type of analysis used: static and dynamic. Static analysis consists of analyzing the source code or the binaries to generate UML dynamic diagrams, without running the system. Several approaches perform reverse engineering through static analysis [11, 12, 13, 14]. One of the main works based on static analysis is [14]. In this work, the authors present an algorithm that builds UML 2.0 sequence diagrams from Java code using control flow graphs. The algorithm transforms such a graph into an SD. First, it associates subgraphs with the sequence diagram fragment operators alt, opt, break and loop. After that, a series of transformations is applied to the obtained SD in order to make it more understandable.
Dynamic analysis consists of running the program to obtain, in the form of an execution trace, the information needed to create sequence diagrams. These traces record the values of the program variables, the state of the execution stack, the occurrences of created objects, the signatures of the called methods, information about threads and any other execution information considered useful. As a result, objects under execution can be observed. Several works use dynamic analysis. In [15], an approach for extracting SDs from dynamic information of object-oriented programs is presented. To reduce repetitions in the execution trace, four rules are used to optimize the size of the execution traces by detecting similar sub-trees and merging them. The authors of [16] propose an approach that extracts UML High Level Sequence Diagrams (HLSDs) from Java code by constructing control flow graphs. They propose a method for switching between the general control flow graph (CFG) and UML sequence diagrams. The combination process is done by analyzing the different states of the system. In [17], an approach based on dynamic analysis using Labeled Transition Systems (LTSs) is proposed. The LTSs are used to model execution traces in order to facilitate their analysis. For each trace, a corresponding LTS is generated. These LTSs are then merged to obtain a single LTS that models the behavior of the system. Finally, an HLSD is generated from the obtained LTS using regular expressions. The approaches listed above were able to extract SDs that represent the system behavior. However, the diagrams obtained are incomplete and suffer from several problems, including information filtering problems such as those listed in the catalog of abstractions and filtering in the context of reverse engineering of sequence diagrams [18].
In addition, these approaches fail to extract the conditions in the combined fragment operators like loop, opt and alt.

III. BACKGROUND OF THE APPROACH
In this section, we first explain what a sequence diagram in UML 2.x is. Second, we give some definitions regarding execution traces and how they are obtained. Finally, we introduce CPN and how we used it to represent an HLSD.

A. UML2 Sequence Diagrams
The SD is a behavioral diagram that specifies, in chronological order, the interactions that exist between a group of objects. It was significantly changed in UML 2.0 [5]. Indeed, the sequence diagram in UML2 is considered a partially ordered collection of events, which introduces new concepts such as combined fragments, parallelism and asynchronism, and allows the definition of more complex behaviors.
An HLSD is obtained by combining Basic Sequence Diagrams (BSDs) using interaction operators. The most commonly used combined fragment operators in the UML2 sequence diagram are seq to express sequence, opt for optional behavior, alt for alternatives and loop for iterative actions.

B. Execution Traces
Dynamic analysis starts by generating traces. These traces are then analyzed to extract an HLSD. In our approach, an execution trace is generated for each scenario. In what follows, we introduce a set of definitions that are necessary to understand the approach.
Def. 1: A trace line is a method invocation or a control structure.
Def. 2: A method invocation is a triplet T1=<Sender, Message, Receiver> where:
Sender is the caller object, expressed in the form package:class:object.
Message is the invoked method of the receiver object, expressed in the form methodName(par1, par2, …).
Receiver is the called object, expressed in the form package:class:object.
• The objects s1 and s2 (respectively, r1 and r2) are equivalent if they are instances of the same class.
• The messages m1 and m2 concern the same method and have the same arguments.
Def. 5: An execution trace is a set of trace lines.
An example of execution traces in the format described above is shown in Table 1. Each scenario corresponds to one trace (e.g., Trace1 refers to Scenario1). The trace Trace1 is composed of lines L0 to L5. Lines L6 to L10 belong to Trace2. Pack1 represents the package to which classes A and B belong. m1() to m5() refer to the method invocations (messages) of objects a, b, c, and d.
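To make Def. 1 and Def. 2 concrete, the following Python sketch parses one trace line and checks the equivalence conditions stated in the two bullets above. It is only an illustration under stated assumptions: the " | " field separator, the control keywords, and all class and function names (ObjectRef, MethodInvocation, ControlStructure, parse_trace_line, equivalent) are hypothetical and are not taken from the tool described in this paper.

```python
from dataclasses import dataclass
from typing import Union

# Assumed set of keywords that open a control-structure trace line.
CONTROL_KEYWORDS = {"LOOP", "ALT", "OPT", "IF", "ELSE"}


@dataclass(frozen=True)
class ObjectRef:
    """An object written as package:class:object (Def. 2)."""
    package: str
    cls: str
    name: str


@dataclass(frozen=True)
class MethodInvocation:
    """The triplet <Sender, Message, Receiver> of Def. 2."""
    sender: ObjectRef
    message: str
    receiver: ObjectRef


@dataclass(frozen=True)
class ControlStructure:
    """A control-structure trace line, kept as its raw label."""
    label: str


TraceLine = Union[MethodInvocation, ControlStructure]


def parse_object(text: str) -> ObjectRef:
    package, cls, name = (part.strip() for part in text.split(":"))
    return ObjectRef(package, cls, name)


def parse_trace_line(line: str) -> TraceLine:
    """Parse one trace line (Def. 1): a method invocation or a control structure."""
    first_field = line.split("|", 1)[0].strip()
    if first_field in CONTROL_KEYWORDS:
        return ControlStructure(line.strip())
    sender, message, receiver = (f.strip() for f in line.split("|"))
    return MethodInvocation(parse_object(sender), message, parse_object(receiver))


def equivalent(t1: MethodInvocation, t2: MethodInvocation) -> bool:
    """Same sender class, same receiver class, same method and arguments."""
    same_sender = (t1.sender.package, t1.sender.cls) == (t2.sender.package, t2.sender.cls)
    same_receiver = (t1.receiver.package, t1.receiver.cls) == (t2.receiver.package, t2.receiver.cls)
    return same_sender and same_receiver and t1.message == t2.message
```

With this sketch, two invocations of m1() between different instances of the same classes are reported as equivalent, while the object names themselves are ignored.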

C. CPN
Petri nets are a formal modeling language used to represent the dynamic behavior of different kinds of systems (computer, industrial, telecommunications ...) [19]. They were first introduced in 1962 by the German mathematician and computer scientist Carl Adam Petri. CPN is an extension of Petri nets. This extension considerably reduces the size of the net compared with modeling with ordinary Petri nets. It allows places to be distinguished by attaching a color to them. A Petri net block is a subnet of the Petri net with one initial place and one final place. These places refer respectively to the precondition and the post-condition of the subnet. In [20], CPNs are used to integrate scenarios represented in the form of SDs. The authors use four combined fragment operators (conditional, sequential, iterative and concurrent) to combine scenarios. CPNs are suitable for our approach because they can be transformed easily into an HLSD (see Fig. 1).
Transitions represent a BSD, an operator such as seq or opt, or a control type as defined in Def. 3. Places can represent a state of the system or the beginning or the end of the operators alt and loop. Colors are used to distinguish between traces. Fig. 1 shows how a CPN can be transformed easily into an HLSD and vice versa. Pi and Pf represent respectively the initial and the final place of the Petri net block. The first transition, a | m1() | b, corresponds to the first BSD, in which the object a of class A calls the object b of class B with the message m1(). The place LOOP | BEGIN represents the state before the start of the operator loop and leads to two transitions. The transition IF | BEGIN | C1 enters the loop; it fires when the loop condition C1 is true. Next comes the place ALT | BEGIN, which also leads to two transitions. The first, labeled IF | BEGIN | C2, covers the case where the condition of alt is satisfied and leads to the transition b | m2() | a, which describes the message m2() sent by the object b of class B to the object a of class A. The transition ELSE | BEGIN | C2 is the second transition of alt and occurs when its condition is not satisfied. The transition a | m3() | b refers to the BSD in which the object a calls the object b using the message m3(). The second transition of loop, ELSE | BEGIN, marks its end and occurs when the loop condition is not satisfied.
In this section, we have presented and explained the most important concepts about SD, trace and CPN. This is the background on which our approach is based.

IV. PROPOSED APPROACH
Our objective is to extract SDs from the execution traces of an object-oriented system using CPNs.
In this section, we present our approach for reverse engineering the HLSD. As illustrated in Fig. 2, the approach is divided into four main steps: trace collection, trace filtering, trace merging and HLSD extraction. Each step is described in detail in the following subsections.

A. Trace Collection
In order to generate an accurate HLSD, our approach uses dynamic analysis. In [6], dynamic analysis is described as more efficient than static analysis for reverse engineering UML dynamic models such as SDs. Dynamic analysis is based on analyzing execution traces. These traces can be generated using several techniques [2], including instrumentation of the source code, instrumentation of the bytecode, or the use of a customized debugger. In our approach, we use bytecode instrumentation.
We chose AspectJ [21] as the trace collection tool. This tool targets Java software systems. It makes it possible to report the information created during the execution of the program, including method invocations, object occurrences, messages sent and received between objects, loops and conditions. The behavior of the system depends strongly on the input data entered by the user. Therefore, it is necessary to identify the majority of input variable values in order to cover all system behavior. This can be done with the help of a functional expert of the system. After the system has been run with different input values, execution traces are generated, each corresponding to a given scenario. Since execution trace formats differ from one tool to another, we have developed a tool that standardizes the format of execution traces. This format is defined in Definitions 1, 2, and 3. The objective of this tool is to format the execution traces so that they are easier to merge.

B. Trace Filtering
The generated execution traces contain a lot of information about all the classes composing the system. For example, these classes can be divided into three types: data access classes, business classes and presentation classes. The business classes are the classes that describe the business logic of the system. The objective of the trace filtering step is to concentrate on the trace lines that describe this behavior and to ignore the others. We have developed an algorithm that deletes the trace lines that belong to data access or presentation classes.
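The filtering step can be sketched in Python as follows. This is a minimal illustration, not the algorithm we implemented: it assumes that trace lines are text of the form Sender | Message | Receiver with objects written as package:class:object, and that a line is dropped as soon as its sender or its receiver belongs to an ignored package. The function name filter_trace and the example method names are hypothetical.

```python
from typing import Iterable, List, Set


def filter_trace(lines: Iterable[str], ignored_packages: Set[str]) -> List[str]:
    """Keep control-structure lines and invocations that avoid ignored packages."""
    kept = []
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        # A method invocation has three fields whose first and last are
        # package:class:object references; anything else is kept untouched.
        is_invocation = len(fields) == 3 and ":" in fields[0] and ":" in fields[2]
        if is_invocation:
            sender_package = fields[0].split(":")[0].strip()
            receiver_package = fields[2].split(":")[0].strip()
            if sender_package in ignored_packages or receiver_package in ignored_packages:
                continue  # the line involves a data access or presentation class
        kept.append(line)
    return kept


trace = [
    "BLL:Vendor:v | createSale() | BLL:Sale:s",
    "BLL:Sale:s | save() | DAL:SaleDAL:d",
    "LOOP | BEGIN | C1",
]
business_only = filter_trace(trace, {"DAL"})
```

Here business_only keeps the first and third lines: the call into the DAL package is removed, while the control-structure line is preserved so that loops and conditions survive filtering.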

C. Trace Merging
As mentioned before, a single execution of the system does not allow an accurate description of all the system behavior. Therefore, the system must be run several times, which generates different traces. The challenge is to merge these traces in order to identify the behavior of the system as a whole. In [22], several well-defined merging techniques are listed.
For merging execution traces, we choose to use CPNs. The process is done in two successive sub-steps: first CPN initialization, then CPN merging.

1) CPN Initialization:
In this sub-step, a basic CPN is generated for each execution trace. All the trace lines are transformed into transitions of the CPN, except those that refer to the start or the end of an iterative control structure, such as LOOP | START and LOOP | END. These trace lines are transformed into places. This reduces the size of the CPN and makes it more consistent. The same color is assigned to all places that belong to the same execution trace. We use these colors to differentiate between scenarios: each color refers to a specific scenario. This allows an HLSD to be subdivided into less complex HLSDs, which makes the behavior of the system easier to understand.
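The initialization rule can be illustrated with the following Python sketch, which builds a linear CPN from one trace. It is a simplified illustration under our own assumptions: the Node and CPN classes and the function init_cpn are hypothetical, loop boundaries are recognized by the labels LOOP | START, LOOP | BEGIN and LOOP | END, and unlabeled connector nodes are inserted whenever two places or two transitions would otherwise be adjacent, so that places and transitions alternate as in any Petri net.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    kind: str             # "place" or "transition"
    label: str            # "" for connector nodes added by this sketch
    color: Optional[str]  # only places carry the color of their trace


@dataclass
class CPN:
    nodes: List[Node] = field(default_factory=list)
    arcs: List[Tuple[int, int]] = field(default_factory=list)  # (source index, target index)

    def append(self, kind: str, label: str, color: str) -> None:
        """Append a node after the last one, keeping places and transitions alternating."""
        if self.nodes and self.nodes[-1].kind == kind:
            other = "transition" if kind == "place" else "place"
            self._link(Node(other, "", color if other == "place" else None))
        self._link(Node(kind, label, color if kind == "place" else None))

    def _link(self, node: Node) -> None:
        self.nodes.append(node)
        if len(self.nodes) > 1:
            self.arcs.append((len(self.nodes) - 2, len(self.nodes) - 1))


def is_loop_boundary(label: str) -> bool:
    fields = [f.strip() for f in label.split("|")]
    return len(fields) >= 2 and fields[0] == "LOOP" and fields[1] in {"START", "BEGIN", "END"}


def init_cpn(trace_lines: List[str], color: str) -> CPN:
    """Build a basic CPN: loop boundaries become places, other lines become transitions."""
    cpn = CPN()
    cpn.append("place", "Pi", color)
    for line in trace_lines:
        label = line.strip()
        kind = "place" if is_loop_boundary(label) else "transition"
        cpn.append(kind, label, color)
    cpn.append("place", "Pf", color)
    return cpn
```

For example, init_cpn(["Pack1:A:a | m1() | Pack1:B:b", "LOOP | BEGIN | C1", "Pack1:B:b | m2() | Pack1:A:a", "LOOP | END"], "yellow") yields a chain that starts at Pi and ends at Pf, in which every place is colored yellow.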
2) CPN Merging:
In this sub-step, all the CPNs corresponding to execution traces are merged to obtain a single CPN. The kBehavior algorithm [23] is used for this purpose. This algorithm has points in common with the well-known Ktail algorithm [24]. Both algorithms are used to construct a finite-state automaton (FSA) that abstracts execution traces. They iteratively merge equivalent states in order to generalize the resulting FSA. kBehavior can reuse an already learned path and adapt it in the FSA to newly generated traces, which is not the case for Ktail. In our approach, we have developed an algorithm called adapted kBehavior, which is closely based on kBehavior. It is a new version adapted to deal with CPNs. Adapted kBehavior does not need to preprocess the new traces it receives. However, it needs to explore the already generated CPN each time it learns a new trace. When a new sequence of transitions and places needs to be added to the CPN, it must be ensured that this sequence is not already present in the CPN. Since the generated CPN is generally deterministic, traversing the CPN is inexpensive and the additional cost generated by this method remains reasonable.
To make the CPN more coherent, a final transformation is carried out. This transformation concerns the processing of iterative behavior. It adds two test transitions after the place LOOP | BEGIN | CONDITION. The first transition, labeled IF | BEGIN | CONDITION, is executed when the condition of the loop is satisfied. The second transition, labeled LOOP | END, is executed otherwise. This transition leads to the place labeled LOOP | END and consequently indicates the end of the loop. The output place of the last transition inside the loop no longer refers to the end of the loop but to its beginning. The label of this place is changed by removing the indication of its condition in order to avoid redundancy, as illustrated in Fig. 3.
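The idea of sharing common behavior between scenarios while keeping track of colors can be illustrated with a much simpler technique than adapted kBehavior. The Python sketch below is a plain prefix-tree merge, not kBehavior and not our adapted algorithm: it only merges the common prefixes of linear label sequences and records which colors (scenarios) traverse each node. The function name merge_by_prefix and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def merge_by_prefix(traces: Dict[str, List[str]]) -> dict:
    """Merge label sequences into a prefix tree; each node records the colors using it.

    `traces` maps a color (one per scenario) to the ordered labels of its trace.
    """
    root = {"label": "Pi", "colors": set(), "children": {}}
    for color, labels in traces.items():
        node = root
        node["colors"].add(color)
        for label in labels:
            # Reuse the existing child when this step was already learned,
            # otherwise create a new branch for the new behavior.
            child = node["children"].setdefault(
                label, {"label": label, "colors": set(), "children": {}}
            )
            child["colors"].add(color)
            node = child
    return root


merged = merge_by_prefix({
    "yellow": ["a | m1() | b", "b | m2() | a"],
    "green": ["a | m1() | b", "a | m3() | b"],
})
```

In merged, the shared step a | m1() | b is stored once and carries both colors, while the two diverging steps each carry a single color. Unlike kBehavior, this sketch never merges equivalent states that appear later in the traces, so it generalizes far less.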

D. HLSD Extraction
In this step, we can easily build an HLSD by mapping the resulting CPN using the following transformation rules.
• Rule 1: all names of objects in the CPN are transformed into lifelines in the SD.
• Rule 2: a transition T1 with the method invocation a:A | m1() | b:B is transformed into a BSD where object a:A sends message m1() to object b:B.
• Rule 3: a place P1 that contains the operator ALT | BEGIN, OPT | BEGIN or LOOP | BEGIN refers respectively to a BSD with the operator alt, opt or loop.
• Rule 4: the CPN paths coming after the place ALT | BEGIN and ending on the transition ALT | END are transformed into combined fragments with the operator alt.
• Rule 5: the CPN paths coming after the place OPT | BEGIN and ending on the transition OPT | END are transformed into combined fragments with the operator opt.
The rules listed above can be combined to map a CPN to an HLSD representing system behavior (Fig. 2).
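Rules 1 and 2 can be sketched in Python as follows. This is an illustration only: it assumes that transition labels are text of the form Sender | Message | Receiver, that control transitions start with one of the keywords IF, ELSE, LOOP, ALT or OPT, and the function name extract_lifelines_and_messages is hypothetical. Rules 3 to 5, which build the combined fragments, are not covered.

```python
from typing import List, Tuple

# Assumed keywords that mark a control transition rather than a method invocation.
CONTROL_KEYWORDS = {"IF", "ELSE", "LOOP", "ALT", "OPT"}


def extract_lifelines_and_messages(
    transition_labels: List[str],
) -> Tuple[List[str], List[Tuple[str, str, str]]]:
    """Apply Rule 1 (objects become lifelines) and Rule 2 (invocations become messages)."""
    lifelines: List[str] = []
    messages: List[Tuple[str, str, str]] = []
    for label in transition_labels:
        fields = [f.strip() for f in label.split("|")]
        if len(fields) != 3 or fields[0] in CONTROL_KEYWORDS:
            continue  # control transitions carry no message
        sender, message, receiver = fields
        for obj in (sender, receiver):
            if obj not in lifelines:  # keep the order of first appearance
                lifelines.append(obj)
        messages.append((sender, message, receiver))
    return lifelines, messages


lifelines, messages = extract_lifelines_and_messages(
    ["a:A | m1() | b:B", "IF | BEGIN | C2", "b:B | m2() | a:A"]
)
```

In this example, lifelines contains a:A and b:B, and messages contains the two invocations in trace order; the control transition IF | BEGIN | C2 is skipped.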

V. CASE STUDY
To test and illustrate the different steps of our approach, we have developed an application called Sales. It allows vendors to create sales of several articles and to print an invoice, a delivery slip or a pay slip. All these operations are saved in a database. The application, developed in Java, exhibits the different types of behavior (iterative, optional, sequential and alternative) that our case study aims to recover. It has a layered architecture with two layers, business logic and data access, and is therefore structured in two packages: BLL (business logic layer) and DAL (data access layer). The BLL package contains six business classes (see Listing 1): Vendor, Sale, Calculation, Invoice, Payslip, and Delivery. The DAL package is composed of the following classes (see Listing 2): VendorDAL, SaleDAL, InvoiceDAL, PayslipDAL, and DeliveryDAL.
To create a sale, the vendor places an order that starts a new sale. The vendor can add articles repeatedly and calculate the total amount of the sale (repetitive behavior). When the vendor completes the sale, he chooses to print either a delivery slip or an invoice to be signed (alternative behavior). Finally, if the customer wants it, a pay slip is printed (optional behavior).

A. Trace Collection
At this stage, the generated execution traces that represent the behavior of the system are organized in text files according to the format proposed by our approach. To do that, we use a Java program that we have developed. It takes as input the traces generated after instrumentation with AspectJ and converts them to the required format, as shown in Table II, Table III and Table IV.

B. Trace Filtering
The input of this step is the set of formatted traces, each corresponding to a given scenario. These traces include the calls of all objects that belong to the packages BLL and DAL. In this case study, we consider that the main behavior of our application lies in the business logic layer. We therefore ignore all trace lines that refer to the data access layer (trace lines in red). To do so, the algorithm deletes all trace lines that include the package DAL. Therefore, the trace lines in red in Table II are deleted. The final traces contain only trace lines that involve the BLL package.

C. Trace Merging
This step consists of generating a corresponding CPN for every filtered trace (Fig. 4, 5 and 6). The transitions of these CPNs are the events generated by the system, such as method invocations and tests. The places of a CPN contain only the starts and ends of control structures: LOOP | BEGIN |, LOOP | END, ALT | BEGIN |, ALT | END. To do this, we have developed an algorithm that transforms every trace into a CPN.
First, our algorithm creates the initial place that represents the start of the CPN. Then, if it finds a method invocation, a corresponding transition is created and attached to the CPN. If a loop control structure is found, it creates places to indicate the start and the end of the iteration and creates transitions corresponding to the method invocations between them. When an alternative control structure is found, the algorithm checks whether there are trace lines with IF and ELSE. If so, it creates two places labeled ALT | BEGIN | CONDITION and ALT | END that indicate the start and the end of the if-else test. After that, if there is a method invocation after the trace line IF | BEGIN, a transition with the same label is created, followed by another transition with the method invocation. Otherwise, a transition with the label ELSE | BEGIN is created. When only IF is found without ELSE, the algorithm creates two places labeled OPT | BEGIN | CONDITION and OPT | END. The places Pi and Pf are added to the CPN to indicate respectively the initial place and the final place. To simplify the CPNs, all repeated method invocations between the places "LOOP | BEGIN" and "LOOP | END" are deleted. All places representing trace lines of the same trace are colored with the same color. These colors are used to differentiate between the scenarios.
After the CPNs that refer to scenario1, scenario2 and scenario3 are merged using adapted kBehavior, a new CPN is generated (Fig. 7). This CPN includes different paths with different colors. Scenario1 is yellow, scenario2 is green and scenario3 is red.
The condition C1 holds when the variable i is less than nbr_article, while the condition C2 holds when the variable isinvoice is true. We then apply our last transformation to the loop places to make the obtained CPN more coherent (Fig. 8).
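The choice between the alt and opt operators described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function receives the trace lines of a single conditional block without nested conditionals, the lines start with the keywords IF or ELSE as in the labels used in this paper, and the function name conditional_operator and the example method names are hypothetical.

```python
from typing import List


def conditional_operator(block_lines: List[str]) -> str:
    """Return "alt" when the block has IF and ELSE, "opt" when it has IF only."""
    first_fields = [line.split("|", 1)[0].strip() for line in block_lines]
    if "IF" not in first_fields:
        raise ValueError("not a conditional block: no IF trace line found")
    return "alt" if "ELSE" in first_fields else "opt"


invoice_or_delivery = [
    "IF | BEGIN | C2",
    "BLL:Sale:s | printInvoice() | BLL:Invoice:i",
    "ELSE | BEGIN",
    "BLL:Sale:s | printDelivery() | BLL:Delivery:d",
]
payslip_only = [
    "IF | BEGIN | C3",
    "BLL:Sale:s | printPayslip() | BLL:Payslip:p",
]
```

With these hypothetical blocks, the first is classified as alt and the second as opt, which matches the places ALT | BEGIN | CONDITION and OPT | BEGIN | CONDITION that the algorithm creates.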

D. HLSD Extraction
The objective of this step is to extract the HLSD that represents the system behavior. For that, we use the transformation rules described in Section IV.D to transform the final CPN into an HLSD (Fig. 9).
The approach, as shown in the following figure, is able to extract an HLSD with the main UML2 fragment operators (seq, opt, alt and loop). Unlike other approaches using dynamic analysis, our approach succeeds in extracting the conditions corresponding to the combined fragment operators loop, opt and alt. In addition, our approach can be generalized to all object-oriented languages since it only uses text files for execution traces.
112 | Page www.ijacsa.thesai.org
The colors make the behavior of the system easier to understand by subdividing it into several HLSDs.

VI. CONCLUSION
Our work proposes a new methodology for recovering a UML2 HLSD from execution traces using CPNs. We first presented the background on SDs, CPNs and reverse engineering. Then, we defined several concepts that are essential for understanding our new approach. The approach starts by generating and collecting traces. These traces are then filtered and represented as CPNs in order to merge them. The merging is performed using an adapted version of the kBehavior algorithm that we have created. The resulting CPNs are less complex and more coherent than the CPNs in [8,9]. The final CPN uses colors to differentiate between paths that represent different scenarios of the behavior of the system, which facilitates the understanding of the system. The approach succeeds in extracting SD fragment operators such as seq, loop, alt and opt. It also extracts the UML2 operator conditions relating to alt, opt and loop, which is not the case in [8,9].
Our future work is to extract the fragment operator par, which is important for multi-threaded systems. In addition, we will try to address the extraction of other UML2 diagrams, such as state diagrams and activity diagrams.