Scheduling of Distributed Algorithms for Low Power Embedded Systems

Recently, the advent of embedded multicore processors has created interesting technologies for power management. Systems consisting of low-power and highly efficient cores create new possibilities for the optimization of power consumption. However, new design methods dedicated to these technologies need to be developed. In this paper we present a method of static task scheduling for low-power real-time embedded systems. We assume that the system is specified as a distributed algorithm and is then implemented using a multicore embedded processor with low-power processing capabilities. We propose a new scheduling method that creates an optimal or suboptimal schedule. The goal of optimization is to minimize power consumption while satisfying all time constraints or keeping the quality of service as high as possible. We present experimental results, obtained for sample systems, that show the advantages of our method.

Keywords: embedded system; distributed algorithm; task scheduling; big.LITTLE; low-power system


I. INTRODUCTION
Embedded systems are dedicated computer-based systems that are highly optimized for a given application. Besides cost and performance, power consumption is one of the most important issues considered in the optimization of embedded systems. The design of energy-efficient embedded systems is especially important for battery-operated devices. Minimizing power consumption is nevertheless always important, because it reduces the cost of running and cooling the system. It was observed that power demands are increasing rapidly, yet battery capacity cannot keep up [1].
Embedded systems are usually real-time systems, i.e. time constraints are defined for some tasks. Therefore, power optimization should take into consideration that all time requirements have to be met. In general, higher performance requires more power; hence, the optimization of an embedded system should consider the trade-off between power, performance and cost. The performance of the system may be increased by applying a distributed architecture. The function of the system is specified as a set of tasks, and then the optimal architecture is searched for during the co-design process [2]. A distributed architecture may consist of different processors, dedicated hardware modules, memories, buses and other components. Recently, the advent of embedded multicore processors has created an interesting alternative to dedicated architectures. First, the co-design process may be reduced to task scheduling for multiprocessor systems. Second, advanced technologies for power management, like DVFS (Dynamic Voltage and Frequency Scaling) or ARM big.LITTLE [3], create new possibilities for designing low-power embedded systems.
Although there are many synthesis methods for low-power embedded systems [4], the problem of optimally mapping a distributed specification onto a multicore processor is a variant of the resource-constrained project scheduling problem (RCPSP) [5] rather than of co-synthesis. Since the RCPSP is NP-complete, only heuristic approaches may be applied to real-life systems. To the best of our knowledge, there are no synthesis methods that take the ARM big.LITTLE architecture into consideration as a target platform for real-time embedded systems. Only some work on run-time scheduling has been done [6].
Most RCPSP approaches are dedicated to the task graph specification of a system. But in many cases, especially in the case of embedded software, more general distributed models [7] would be more convenient. It turned out that the function of a real-time distributed system may be efficiently specified as a distributed echo algorithm [8]. Moreover, such a specification may also be statically scheduled [9].
In this paper we present a novel method for the synthesis of a power-aware scheduler for real-time embedded systems. We assume that the function of the system is specified as a distributed echo algorithm [10] that should be executed by a multicore processor supporting the ARM big.LITTLE technology. The goal of the static scheduling is to reduce power consumption by moving some tasks to low-power cores (LPCs), while critical tasks are assigned to high-performance cores (HPCs) in order to satisfy all time constraints. The proposed method is dedicated to high-performance embedded computing systems.

II. RELATED WORK
The problem of designing low-power embedded systems has attracted researchers for many years. One direction of this research is finding a low-power architecture by optimizing the allocation of resources and the task assignment according to the power consumption (e.g. COSYN-LP [11], SLOPES [12], LOPOCOS [13]). An overview of some power-aware co-design methods is presented in [4]. But all the above methods create a dedicated hardware/software architecture and cannot be applied to multicore processors.
Another direction of research concerning low-power design is to develop methodologies that take into consideration the dynamic reduction of power consumption during runtime. AVR (Average Rate heuristic) [14] is a task scheduling method for variable-speed processors. Dynamic Power Management [15] tries to assign optimal power-saving states. Other methods reduce power consumption by efficiently using voltage-scalable processors [16]. All the above methods are based on the power-aware scheduling approach called YDS. They dynamically schedule a set of tasks by selecting the proper speed for each task. ARM big.LITTLE uses only 2 predefined speeds, thus it is hardly possible to adapt the above methods to this technology.
There are many scheduling methods for real-time embedded systems. Earliest Deadline First (EDF) [17] and Least Laxity First (LLF) [18] are among the most efficient dynamic scheduling methods. But these methods are dedicated to homogeneous architectures (SMP). A discussion of the problems of task scheduling in real-time systems is presented in [19]. Most of these methods optimize the schedule length.
Embedded software consists of a given set of tasks. Usually it is possible to estimate task parameters such as execution time, power consumption and memory requirements. Most static scheduling methods are based on a specification represented as a task graph [20]. But in many cases it is difficult to specify the function as a task graph, and other models, e.g. distributed algorithms [7], are more suitable. It seems that the echo algorithm [10] would be attractive for this purpose.
To the best of our knowledge, there is neither a scheduling method for real-time systems specified as a distributed echo algorithm nor a static scheduling method that optimizes energy consumption in embedded systems based on the big.LITTLE platform.

III. PRELIMINARIES
We assume that the target embedded system is based on a multicore processor with LPCs and HPCs. An LPC requires less power to execute tasks, but execution times are longer. An HPC executes tasks faster but consumes more energy. We consider soft real-time systems, i.e. all tasks should be executed before the specified deadline, but it is acceptable to slightly exceed the deadline. In this case the quality of service (QoS) decreases with increasing delay. The goal of optimization is to find a schedule for which the power consumption is minimal while the time constraints are satisfied or the QoS is maximal. Since we consider a shared-memory architecture, transmissions between tasks may be neglected.

A. Echo algorithms
Echo algorithms [18] are a class of wave algorithms [7] used for describing distributed computations. The system is specified as a set of tasks communicating by message passing. One task is an initiator, which starts all computations. After finishing its execution, the initiator sends explorer messages to all neighbours. After receiving the first explorer message, a task stores the source node as its activator and, after execution, sends explorer messages to all neighbour nodes except the activator. Once all tasks have finished execution, every task that did not become an activator executes again to compute an echo message, which is sent only to its activator. Each task, after receiving echo messages from all the tasks it activated, executes again and sends an echo message to its activator. Finally, the initiator should receive all echo messages and then it computes the final result. Fig. 1 presents a sample echo algorithm consisting of 10 processes. Assume that task 0 is the initiator. Therefore this task will be executed first. Then, tasks 1, 5 and 4 should be executed. It should be noted that the order of activation of tasks depends on the execution times of the tasks, e.g. task 6 may be activated by task 5, but it may also be activated by task 7 in the case when tasks 1, 2, 3 and 7 finish their execution before task 5 finishes. Thus, scheduling on heterogeneous processors is complex even when the execution times of all tasks are known, e.g. are estimated, or when the worst-case execution times are assumed.
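The dependence of the activation order on execution times can be illustrated with a minimal simulation sketch. It is not the scheduler proposed in this paper: it assumes an unlimited number of cores, zero message latency and the same execution time in both phases, and the 4-node graph and the names `explore` and `echo_finish` are hypothetical (the topology of Fig. 1 is not reproduced here).

```python
import heapq

def explore(graph, exec_time, initiator):
    """Exploration phase: return each task's activator and finish time.

    Assumes unlimited cores and zero message latency, so a task starts
    as soon as its first explorer message arrives.
    """
    activator = {initiator: None}
    finish = {initiator: exec_time[initiator]}
    events = [(finish[initiator], initiator)]  # (finish time, task)
    while events:
        t, node = heapq.heappop(events)
        for nb in graph[node]:
            if nb not in activator:  # only the first explorer counts
                activator[nb] = node
                finish[nb] = t + exec_time[nb]
                heapq.heappush(events, (finish[nb], nb))
    return activator, finish

def echo_finish(activator, finish, exec_time):
    """Echo phase: time at which the initiator computes the final result."""
    children = {n: [] for n in activator}
    root = None
    for node, parent in activator.items():
        if parent is None:
            root = node
        else:
            children[parent].append(node)
    explored = max(finish.values())  # echo starts after all tasks finished

    def done(node):
        # A task echoes once all tasks it activated have echoed.
        start = max([explored] + [done(c) for c in children[node]])
        return start + exec_time[node]

    return done(root)

# Hypothetical 4-task system; task 3 is reachable through task 1 or task 2.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
fast_2 = {0: 1, 1: 5, 2: 2, 3: 1}  # task 2 finishes before task 1
fast_1 = {0: 1, 1: 2, 2: 5, 3: 1}  # task 1 finishes before task 2
act_a, fin_a = explore(graph, fast_2, 0)
act_b, fin_b = explore(graph, fast_1, 0)
total_a = echo_finish(act_a, fin_a, fast_2)
```

Swapping the execution times of tasks 1 and 2 changes the activator of task 3, which is the effect described above for task 6 in Fig. 1.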

B. Functional Specification of Distributed Systems
We assume that the system is specified as a collection of sequential processes coordinating their activities by sending messages. The specification is represented by a graph G = {V, E}, where V is a set of nodes corresponding to the processes and E is a set of edges. Edges exist only between nodes corresponding to communicating processes. Tasks are activated when the required set of events appears. As a result, the task may generate other events. External input events will be called requests (Q), external output events are responses (O) and internal events correspond to messages (M). The function of the system is specified as finite sequences of activation of processes. There is a finite set of all possible events, comprising requests, responses and messages. System activity is defined as the following function:


where C is an event expression (a logical expression consisting of logical operators and Boolean variables representing events), Ω = [ω_L, ω_H] are the workloads of the activated process defined for the LPC and HPC respectively, and Π = [π_L, π_H] defines the power consumption.

C. ARM big.LITTLE technology
ARM big.LITTLE technology is an architecture where high-performance CPU cores are combined with the most efficient ones. In this way peak-performance capacity, higher sustained performance and increased parallel processing performance are achieved at significantly lower average power. It was shown that with this technology it is possible to save up to 75% of CPU energy in low to moderate performance systems and to increase performance by 40% in highly threaded workloads.
Three different methods of applying big.LITTLE technology for minimizing the power consumption were proposed [3]:
1) In cluster switching, LPCs are grouped into a "little cluster", while HPCs are arranged into a "big cluster". The system uses only one cluster at a time. If at least one HPC is required then the system switches to the "big cluster", otherwise the "little cluster" is used. The unused cluster is powered off.
2) In the CPU migration approach, LPCs and HPCs are paired. Only one core of each pair is used at a time, while the other is switched off. At any time it is possible to switch between the paired cores.
3) The most powerful model is Global Task Scheduling (GTS). In this model all cores are available at the same time, i.e. tasks may be scheduled on all HPCs as well as LPCs.
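The difference between the three models can be summarized by how many cores of each kind may run tasks simultaneously. The sketch below only encodes the description given above; the function name `active_cores`, the mode strings and the `big_needed` parameter are hypothetical and do not correspond to any ARM or operating-system API.

```python
def active_cores(mode, n_little, n_big, big_needed=0):
    """Return (little, big): cores that can run tasks at the same time.

    big_needed is the number of tasks that currently require an HPC
    (an assumption of this sketch, used to pick the cluster or pairing).
    """
    if mode == "cluster_switching":
        # Only one cluster is powered; any HPC demand selects the big one.
        return (0, n_big) if big_needed > 0 else (n_little, 0)
    if mode == "cpu_migration":
        # Cores are paired; each pair runs either its LPC or its HPC.
        pairs = min(n_little, n_big)
        on_big = min(big_needed, pairs)
        return (pairs - on_big, on_big)
    if mode == "gts":
        # Global Task Scheduling: every core is available.
        return (n_little, n_big)
    raise ValueError(f"unknown mode: {mode}")
```

For example, with 4 LPCs, 4 HPCs and one task that needs an HPC, cluster switching leaves all LPCs unused, CPU migration keeps three LPCs and one HPC active, and GTS exposes all eight cores to the scheduler.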
Our approach is dedicated to the global task scheduling model. GTS is the most flexible and the most efficient method of applying the big.LITTLE architecture. Moving tasks between HPCs and LPCs is fast; it requires less time than a DVFS state transition or an SMP load balancing action.

IV. POWER-AWARE SCHEDULING
The draft of our power-aware scheduling algorithm is given in Fig. 2. First, the list of schedulable tasks (S_list) consists of the initiator only. Then the time marker (T) is initialized to 0. The main loop schedules the successive tasks, ordered according to their priorities. The priority of each task is based on the laxity (L), defined as the difference between the task start times obtained using the ALAP (As Late As Possible) and ASAP (As Soon As Possible) methods, assuming the deadline (TL). These methods are applied assuming an unlimited number of cores. The Sort() method orders all schedulable tasks according to increasing laxity.
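The laxity computation can be sketched as follows, under the assumption that the precedence relations are already fixed as a directed acyclic graph (in an echo algorithm they depend on the activation order, so this is a snapshot for one such order). The function name `laxities`, the `preds` dictionary and the sample values are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def laxities(preds, exec_time, deadline):
    """Laxity = ALAP start - ASAP start, assuming unlimited cores.

    preds maps each task to the tasks that must finish before it starts.
    """
    order = list(TopologicalSorter(preds).static_order())
    succs = {n: [] for n in order}
    for n in order:
        for p in preds.get(n, ()):
            succs[p].append(n)
    asap, alap = {}, {}
    for n in order:  # earliest start: after the latest predecessor ends
        asap[n] = max((asap[p] + exec_time[p] for p in preds.get(n, ())),
                      default=0)
    for n in reversed(order):  # latest start that still meets the deadline
        latest_end = min((alap[s] for s in succs[n]), default=deadline)
        alap[n] = latest_end - exec_time[n]
    return {n: alap[n] - asap[n] for n in order}

# Hypothetical 4-task example with deadline TL = 12.
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
exec_time = {"a": 2, "b": 4, "c": 1, "d": 3}
lax = laxities(preds, exec_time, deadline=12)
ordered = sorted(lax, key=lax.get)  # increasing laxity, as Sort() does
```

Tasks a, b and d lie on the critical path and share the smallest laxity, so they are considered before task c.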
Tasks with the lowest laxity are scheduled first. If the laxity is higher than the difference between the task execution times on the LPC and the HPC, then the task is scheduled on an LPC (if any LPC is available). If the laxity is lower than this difference, then the task is scheduled on an HPC (if any HPC is available). If no HPC is available, the task is scheduled on an LPC (if any LPC is available). If the difference between the time limit (deadline) and the time marker is greater than or equal to the system execution time obtained from the ASAP method (in the version for LPCs), then the task is scheduled on an LPC. If none of the above conditions is fulfilled, the task stays in S_list and scheduling will be attempted again in the next time frame.
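One possible reading of these rules, applied in the order in which they are listed, is sketched below. The function name `choose_core` and its parameters are hypothetical, and the requirement that an LPC be free for the last rule is an assumption of this sketch, not something stated above.

```python
def choose_core(laxity, t_lpc, t_hpc, lpc_free, hpc_free,
                deadline, now, asap_lpc_length):
    """Return "LPC", "HPC", or None (task stays in S_list).

    t_lpc / t_hpc are the task execution times on each core type;
    asap_lpc_length is the system execution time from the LPC-only
    ASAP schedule.
    """
    diff = t_lpc - t_hpc  # extra time paid for running on the slow core
    if laxity > diff and lpc_free:
        return "LPC"      # enough slack to absorb the slower core
    if laxity < diff and hpc_free:
        return "HPC"      # too little slack, use the fast core
    if not hpc_free and lpc_free:
        return "LPC"      # no HPC available, fall back to an LPC
    if deadline - now >= asap_lpc_length and lpc_free:
        return "LPC"      # the whole system would still fit on LPCs
    return None           # retry in the next time frame
```

With the values of the example in Section V (deadline of 37 ms, LPC-only ASAP length of 61 ms), the last rule does not apply at T=0, so a task with high laxity waits when no LPC is free rather than occupying an HPC.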
Finally, all scheduled tasks are removed from S_list. Before starting the next iteration of the main loop, the next tasks are added to S_list using the NextReadyTasks() method. The tasks are chosen according to the rules of the distributed echo algorithm. When all cores are busy or S_list is empty, the time marker is moved to the next time frame (function NextAvailableTimeFrame()), i.e. the nearest time at which any core finishes executing a task.
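The time-marker update can be sketched in a few lines. Only the function name comes from the text above; the signature and the representation of busy cores as a list of finish times are assumptions made for illustration.

```python
def next_available_time_frame(now, busy_until):
    """Return the nearest time after `now` at which a core becomes free.

    busy_until holds the finish time of the task running on each busy
    core. If nothing is running, the time marker does not move.
    """
    later = [t for t in busy_until if t > now]
    return min(later) if later else now
```

For instance, with the time marker at 3 and cores busy until 5, 9 and 3, the marker moves to 5.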
The presented algorithm is a greedy approach. It first tries to reduce the power consumption by assigning tasks to LPCs whenever possible. Although it is a heuristic, we observed that in most cases it is able to find a solution for which all time constraints are satisfied.

V. EXAMPLE
Assume that the target embedded system is based on a multicore processor with 2 LPCs and 4 HPCs. The sample system specification (Fig. 1) consists of 10 tasks that are executed twice, the first time in the exploration phase and the second time during the echo phase. The initiator is defined as task 0. It starts the computations in the exploration mode and returns the final result after execution in the echo mode. Assume that the soft deadline is equal to 37 ms.
The algorithm starts with S_list={0} and T=0. During the first pass all tasks are initially scheduled using the ASAP and ALAP methods. The results are given in Table II. During the exploration phase tasks are identified by the task number; for the echo mode tasks are identified by adding the suffix "e" to the task number. It may be observed that according to the ASAP_L method (scheduling on LPCs), the minimal system execution time is equal to 61 ms and it requires 5 cores. The energy consumption equals 368 mJ. The initial ASAP_L scheduling gives information about the minimal execution time using LPCs only. It also gives the solution with minimal energy consumption. The initial ASAP_H scheduling returns the following results: execution time = 35 ms, energy consumption = 824 mJ, and 5 HPCs required. These results specify the fastest solution, which consumes the maximal power.

VI. EXPERIMENTAL RESULTS
The efficiency of our method was evaluated using the example from Fig. 1 as well as other examples consisting of 25 and 45 tasks. Unfortunately, there are no standard benchmark sets for echo algorithms. There are also no similar scheduling approaches that may be compared with ours. Therefore the classical list scheduling and ASAP methods were chosen for comparison.
Tables III, IV and V present the results obtained for all sample algorithms using our method (EchoLPS) and list scheduling. Two different big.LITTLE architectures were examined: the first consists of 4 LPCs and 2 HPCs, the second of 2 LPCs and 4 HPCs. For each architecture 4 different deadlines were examined. The mildest constraint was chosen in such a way that all tasks may be scheduled on LPCs. Such systems serve as a reference, as the systems with the highest power savings. The experimental results show how the tightening of time constraints affects the energy consumption. It should be noted that, although our method is a heuristic, solutions satisfying the deadline were found in all cases. However, EchoLPS does not guarantee the fulfilment of hard real-time constraints.
For comparison, the results obtained using classical list scheduling are given. List scheduling first assigns tasks to HPCs, i.e. it tries to find the fastest solution. The lists of tasks are ordered according to a priority that is based on ALAP-ASAP values. We may observe that for comparable results (as far as the execution time is concerned) the solution found using list scheduling consumes significantly more energy than the solutions obtained using our method.
For reference we also performed scheduling of all sample systems using the List Scheduling, ASAP and ALAP methods. Table VI presents the results. For each solution the minimal number of LPCs or HPCs was found. Using List Scheduling it was possible to find the lowest energy consuming solutions. In some cases the solutions are faster than those obtained using our method, but more LPCs are required. The ASAP/ALAP methods usually found the fastest solutions, but these methods do not minimize the number of cores required to execute the tasks.

VII. CONCLUSIONS
In this paper a power-aware static scheduling method for embedded systems was presented. The method schedules real-time tasks on a multicore processor with power management capabilities. We applied our method to processors supporting ARM big.LITTLE technology, but the method may also be adapted to DVFS. The method gives better results than classical scheduling methods adapted to low-power embedded systems.
The method assumes that the system is specified in the form of a distributed echo algorithm. Such a specification is more general than the task graphs used in most existing static scheduling methods for real-time embedded systems. To the best of our knowledge, this is the first static scheduling method for real-time embedded software specified as an echo algorithm.
Our future work will concentrate on extending our method to systems specified using other classes of distributed algorithms, to systems using other power management technologies, and to adaptive systems [21] that consider dynamic power optimization. Another direction of our work is to schedule a set of applications on the same system. Further interesting results may be obtained by developing quasi-static or quasi-dynamic scheduling methods for distributed specifications. Such methods may be applicable to systems where the execution times of tasks are not known or are difficult to estimate.
The presented method uses a simple heuristic to find the best trade-off between the power consumption and the efficiency of the system. Although the method gives quite good results, we will consider applying more sophisticated optimization methods such as constraint logic programming, mathematical programming [22] and developmental genetic programming [23].

www.ijacsa.thesai.org

TABLE I. TASK CHARACTERISTICS

TABLE II. INITIAL TASK SCHEDULING

TABLE III. RESULTS FOR 10 TASKS

TABLE VI. RESULTS FOR LIST SCHEDULING, ASAP AND ALAP