Bond Portfolio Analysis with Parallel Collections in Scala

In this paper, we report the results of new experiments that test the performance of Scala parallel collections in finding the fair value of riskless bond portfolios on commodity multicore platforms. We developed four algorithms, each of two kinds, in Scala and ran them for one to 1024 portfolios, each with a variable number of bonds with daily to yearly cash flows and maturities of 1 to 30 years. We ran each algorithm 11 times at each workload size on three different multicore platforms. We systematically observed the differences and tested them for statistical significance. All the parallel algorithms exhibited super-linear speedup and super-efficiency consistent with maximum performance expectations for scientific computing workloads. The first-order effort or "naive" parallel algorithms were easiest to write since they followed directly from the serial algorithms. We found we could improve upon the naive approach with second-order efforts, namely, fine-grain parallel algorithms, which showed the overall best, statistically significant performance, followed by coarse-grain algorithms. To our knowledge these results have not been presented elsewhere.


INTRODUCTION
A review of the high performance computing literature suggests opportunities and challenges to exploit parallelism to solve compute-intensive problems.[1] [2] [3] Proponents of functional programming have long maintained that elaboration of the lambda calculus lends itself to mathematical expressiveness and avoids concurrency hazards (e.g., side effects, managing threads, etc.) that are the bane of shared-state parallel computing.[4] Yet parallel functional programming has remained largely outside the mainstream programming community.[5] One could conceivably argue that parallel functional programming was ahead of its time, preceding the era of inexpensive multicore processors in which some investigators have observed that the "free lunch is over" since clock speeds have been decreasing or at least not increasing significantly, necessitating a turn toward parallel programming.[6] Enter Scala [7], a relatively new, general-purpose language which runs on the Java Virtual Machine (JVM) and hence, desktops, browsers, servers, cell phones, tablets, set-tops, and lately, GPUs [8] [9] [10], a related topic we do not explore here (see the section, "Conclusions and Future Directions"). Scala blends object-oriented and functional styles with shared-nothing, task-level parallelism based on the actor model.[7] Parallel collections [11] [12] are recent additions that provide data-level parallelism [3] through a simple, functional extension of the ordinary, non-parallel collections of Scala. While the use of parallel collections has potential to improve programmer productivity and greatly facilitate a transition to parallel programming, no independent study has investigated whether parallel collections scale in terms of run-time performance on commodity hardware, taking into account end-to-end processing that involves I/O, which is typically a prerequisite for and often the bottleneck of practical applications.
Coleman, et al., conducted end-to-end experiments to find the fair value of riskless bond portfolios using task-level parallelism via map-reduce.[13] [14] In this paper, we take a new, different tack on the same problem that applies data-level parallelism via parallel collections. We were motivated to use bond portfolio analysis, first, because computational finance workloads can be very large.[15] Second, bond portfolio pricing theory is fairly transparent.[16] Finally, bonds inform or are closely related to other financial instruments, including annuities, mortgage securities, bond derivatives, and interest rate swaps, which are among the most heavily traded financial contracts in the world.[17] Thus, computational methods and performance results from this class of problem would likely have implications beyond bonds and finance. Indeed, the experiments with Scala parallel collections using eight algorithms on three different hardware platforms show super-linear speedup and super-efficiency consistent with the maximum performance expectations for scientific computing workloads. While the data suggest that the more modern processors are also more efficient, overall the fine-grain algorithms significantly outperform the others in runtime, which interests and surprises us considering the presumed overhead of this approach. The coarse-grain algorithms are next best, followed by the "naïve" algorithms. The findings we report here using parallel collections are new and have not been reported elsewhere or by others. All the source code is available online for review, download, and testing (see section, "Appendix - Source Code").

A. Parallel collections - a primer
Scala has standard, template data structures called collections, which include lists, arrays, ranges, and vectors, among others. Scala collections are different from the ones it also inherits from the Java standard library in that the Scala versions are typically immutable with methods to operate on the data elements using functional objects. For instance, to multiply every element of a range collection by two using the map method, we have the snippet below (where "scala>" is the Scala interactive shell prompt):
scala> (1 to 5).map(x => x * 2)
Vector(2, 4, 6, 8, 10)
Snippet 1. Maps sequential range.
www.ijacsa.thesai.org
The parameter, x => x * 2, an anonymous function literal object, receives each element of the range collection as an immutable value parameter, x, multiplies it by two, and map copies the result into a new collection, Vector.
To run the same operation in parallel, we apply the .par method to the range before calling map, that is, (1 to 5).par.map(x => x * 2). Here map invokes the function literal object on the range using the machine's parallel resources. The parallel collections map method returns a parallel vector, ParVector, in which the ordering of the return results is unspecified because of the asynchronous nature of parallel execution. From a programmer's point of view, virtually no effort is involved to parallelize the code. There are no new programming constructs to learn and apply, and algorithm redesign and code refactoring are not demanded. There is furthermore no need to write special test cases to verify the results since in principle the serial (non-parallel) implementation is the test case. While the result ordering may need to be addressed, in general, parallel collections are a potential windfall for programmer productivity and transitioning to parallel programming.
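As a minimal sketch of the point above, the following compares the serial and parallel forms of the same map. It assumes Scala 2.12 or earlier, where .par is part of the standard library; on Scala 2.13 and later the separate scala-parallel-collections module and the commented import would be needed. The object and method names are illustrative only.

```scala
// Scala 2.13+: add the scala-parallel-collections module and uncomment:
// import scala.collection.parallel.CollectionConverters._

object ParPrimer {
  // The same function literal is shared by both versions.
  val double: Int => Int = x => x * 2

  // Serial map over a range.
  def serial(n: Int): List[Int] = (1 to n).map(double).toList

  // Parallel map: the only change is the call to .par.
  def parallel(n: Int): List[Int] = (1 to n).par.map(double).toList
}
```

Because the serial version serves as the reference, a check can compare the two results after sorting, which sidesteps any question of result ordering.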
The research question is whether use of .par scales, enabling speedup and efficiency on a non-trivial problem on commodity hardware. For bond portfolio analysis, the functional nature of parallel collections makes implementation of the pricing equations straightforward. In the "naïve" case, we simply reuse the pricing function object from the serial algorithm with no other changes to the code other than to apply .par, just as we did in the above snippet. However, we go further and explore whether we can obtain further improvements using fine-grain and coarse-grain algorithms.

B. Pricing theory
For purposes of this paper, we are considering only simple bonds [16], $b_i$, defined by the five-tuple $b_i = (i, C, n, T, M)$, where i is an integer which plays no part in bond pricing except to uniquely identify the bond in an inventory which we describe below; C is the coupon amount paid one or more times; n is the payment frequency of coupons per annum; T is the time to maturity in years; and M is the face value due at maturity. The sum of the net present values of these cash flows, C and M, is the fair value of the bond. Thus, the fair value, $P(b_i, r)$, of a bond, $b_i$, is the net present value of its cash flows, which is functionally defined by Equation 2. The parameter, r, is the time-dependent yield curve, the general discussion of which is beyond the scope of this paper. Without loss of generality, we use the United States Treasury on-the-run bond yield curve, which we observe once. We interpolate between the tenors (i.e., Treasury maturity dates) using polynomial curve fitting, the coefficients of which we cache and apply for all bonds in the inventory.
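For orientation, one conventional way to write the net present value of a simple bond is sketched below. It is an assumption, not necessarily the exact expression used in this study: it supposes that C is the amount of each coupon payment, that there are nT payments at times $t_k = k/n$, and that the interpolated yield is compounded n times per annum.

```latex
P(b_i, r) \;=\; \sum_{k=1}^{nT} \frac{C}{\left(1 + \dfrac{r(t_k)}{n}\right)^{k}}
\;+\; \frac{M}{\left(1 + \dfrac{r(T)}{n}\right)^{nT}},
\qquad t_k = \frac{k}{n}
```

Under this form, a bond whose coupon equals $M \cdot y / n$ prices exactly at its face value M when the yield curve is flat at y.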
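A portfolio is a collection of instruments, in our case, bonds. The fair value, $P(\varphi_j)$, of a portfolio, $\varphi_j$, with a basket of Q bonds is functionally defined as follows:
$$P(\varphi_j) = \sum_{q=1}^{Q} P(b_{i_q}, r) \qquad (3)$$
where $b_{i_q}$ is the q-th bond in the basket.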
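The pricing theory can be sketched in a few lines of Scala. The sketch assumes a conventional periodic-compounding discount formula for the bond price and a yield curve represented as a plain function of time; the names Bond, YieldCurve, bondPrice, and portfolioPrice are hypothetical and are not taken from the study's source code.

```scala
object BondPricing {
  // Simple bond as the five-tuple (i, C, n, T, M) described in the text.
  final case class Bond(id: Int, coupon: Double, freq: Int, maturity: Double, face: Double)

  // Assumed representation: annualized yield as a function of time in years.
  type YieldCurve = Double => Double

  // Net present value of the coupons plus the face value
  // (assumed form: yield compounded freq times per annum).
  def bondPrice(b: Bond, r: YieldCurve): Double = {
    val periods = math.round(b.freq * b.maturity).toInt
    val coupons = (1 to periods).map { k =>
      val t = k.toDouble / b.freq
      b.coupon / math.pow(1.0 + r(t) / b.freq, k)
    }.sum
    coupons + b.face / math.pow(1.0 + r(b.maturity) / b.freq, periods)
  }

  // Portfolio value: the sum of the bond prices, aggregated with foldLeft.
  def portfolioPrice(bonds: Seq[Bond], r: YieldCurve): Double =
    bonds.foldLeft(0.0)((acc, b) => acc + bondPrice(b, r))
}
```

With a flat zero yield the price reduces to the undiscounted sum of cash flows, which gives a convenient sanity check.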

C. Bond portfolio generation
We generate simple bonds that model a wide range of computational scenarios. The goals are to 1) produce a sufficient number of bonds to mimic realistic fixed-income portfolios and 2) avoid biases in commercial-grade bonds that depend on prevailing market conditions. Specifically, in Equations 4a-4d, • is an integer uniform random deviate in the range of [0, s-1], and s is the size of the respective collection. We invoke Equations 4a-4d a total of 5,000 times to produce the bond inventory, V, which we store in an indexed persistent database that we describe below.
We generate a portfolio by first selecting its size, that is, the number of bonds, Q, per the equation below.
η is a Gaussian deviate with a mean of zero and a standard deviation of one. v and σ are configurable parameters set to 60 and 20, respectively. Finally, we construct a basket of Q bonds for a portfolio, $\varphi_j$. We use the equation below to specify a bond id or primary key,
i = • (6)
where • is an integer uniform random deviate in the range of [1, |V|] and |V|=5,000 is the size of the bond inventory. We generate a universe, U, of bond portfolios where |U|=100,000. The bond portfolios are also stored in a database indexed by j, a unique portfolio id.
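The portfolio construction can be sketched as follows. The form of the size equation is an assumption here, namely Q = v + σ·η rounded to an integer and kept at one or more; the parameter values and the id range are those stated in the text, and the object and method names are hypothetical.

```scala
import scala.util.Random

object PortfolioGenerator {
  val v = 60.0             // configurable parameter v, from the text
  val sigma = 20.0         // configurable parameter sigma, from the text
  val inventorySize = 5000 // |V|, the size of the bond inventory

  // Assumed size equation: Q = v + sigma * eta, rounded and kept >= 1.
  def portfolioSize(rng: Random): Int =
    math.max(1, math.round(v + sigma * rng.nextGaussian()).toInt)

  // Basket of Q bond ids, each drawn uniformly from [1, |V|].
  def basket(rng: Random): List[Int] =
    List.fill(portfolioSize(rng))(1 + rng.nextInt(inventorySize))
}
```

Passing a seeded Random makes a generated universe reproducible across runs, which is one possible design choice for repeatable experiments.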

D. Database design
We store the bonds, $b_i$, and portfolios, $\varphi_j$ (which also contain the result of Equation 3), in MongoDB, an indexed, document-oriented, client-server database.[18] As we noted above, $\varphi_j$ does not contain bond objects, $b_i$, but the bond primary key, i. In MongoDB parlance, the bonds are linked to portfolios rather than embedded in them. In other words, the database is organized in third normal object form (3ONF).[19] Thus, to evaluate Equation 3, a total of 2+Q accesses are necessary: one access to fetch $\varphi_j$; Q fetches to retrieve each $b_i$; and finally, one store to update the portfolio, $\varphi_j$, with its price. The figure below gives the class diagram, as it is stored in the document repository. Although this design is consistent with best practices for data modeling, we could reduce the number of database accesses at the expense of redundancy through denormalization. However, we decided to forgo this optimization in the interests of establishing a baseline of performance for future reference.

E. Algorithms
We develop two classes of algorithms: serial and parallel. There are three types of parallel algorithms: "naïve," fine-grain, and coarse-grain. Each serial and parallel algorithm comes in two kinds: composite and memory-bound. The composite kind, represented by the notation {io+compute}, overlaps access to the database with the evaluation of Equation 2 and Equation 3. The memory-bound kind, represented by the notation {io}+{compute}, separates the two phases. In other words, we measure the I/O ({io}) and compute ({compute}) runtimes separately, first caching all the bonds by portfolio in memory and only then evaluating Equation 2 and Equation 3. The separate I/O and compute runtimes furthermore provide insight into the maximum compute and I/O performance potentials. In each case, the algorithms evaluate the same collection of portfolios, U' ⊂ U, which has been randomly sampled from the database. We give here only snippets from the source code. See the appendix to access the complete source.

F. Serial algorithms
We invoke the composite serial algorithm as the snippet below suggests.
val outputs = inputs.map(price)
Snippet 3. Maps input of randomly sampled portfolio key ids to price results
The object, inputs, is a collection of portfolio ids and outputs is a collection of portfolio prices. (The "val" declaration means that outputs is an immutable value object.) The parameter, price, is a named function object with the declaration:
def price(input: Data): Data
Snippet 4. Price the collection of randomly sampled portfolio ids serially
This means price receives a Data object as an input parameter and returns a Data object. We wrote the Data object for use by all the algorithms of this study. It contains the portfolio id, a list of bonds, and a result object which itself contains the portfolio price and diagnostic information about the run. On input in this case, the Data object has set only the portfolio id. On output, Data has the portfolio id and the result object defined.
The function object, price, accesses the 3ONF repository to retrieve a portfolio by its id and then retrieves the bond objects, pricing them according to Equation 2 and then summing the prices according to Equation 3 using the foldLeft method. (For readers who may be unfamiliar with functional programming, "folding" is a common operation for aggregating elements. The foldLeft method is a serial aggregator, traversing the collection left-to-right, that is, from the element at index zero to the element at the last index. The analogous foldRight traverses the collection from right-to-left using recursion that is not tail-recursive. We prefer foldLeft as opposed to foldRight to avoid the problem of stack overflow.) The serial memory-bound algorithm is virtually identical to the composite algorithm as the snippet below suggests. The method, loadPortfsFoldLeft, loads a random sample of n portfolios from the database and uses foldLeft to aggregate the corresponding bonds. Thus, in this case, the inputs value is a collection of Data objects, each containing a list of bond objects. The parameter, price, is a function object, the same one used in the composite serial algorithm.
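For readers new to folding, the small sketch below shows foldLeft used as an aggregator and makes the traversal direction of foldLeft and foldRight visible by recording the order in which elements are visited. The prices are made-up values chosen only for illustration.

```scala
object FoldDemo {
  // Made-up bond prices, for illustration only.
  val prices = List(101.5, 98.25, 100.0)

  // foldLeft starts from the seed 0.0 and visits elements from index zero upward.
  val total: Double = prices.foldLeft(0.0)(_ + _)

  // Recording the visit order shows the direction of each traversal.
  val leftOrder: List[Double] =
    prices.foldLeft(List.empty[Double])((seen, p) => seen :+ p)
  val rightOrder: List[Double] =
    prices.foldRight(List.empty[Double])((p, seen) => seen :+ p)
}
```

Here leftOrder matches the original list, while rightOrder is its reverse, since foldRight reaches the last element first.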

G. Naïve algorithms
The naïve algorithms are so-called because, as a first-order effort, they "naively" use .par. They are virtually identical to the serial algorithms. That is, we have the snippet below for the composite case. Notice that the memory-bound kind uses loadPortfsParFold (i.e., rather than loadPortfsFoldLeft), which accesses the database and loads the portfolios in parallel using a parallel collection. It uses Scala's par.fold method. This method aggregates like its serial version, foldLeft, except par.fold does so in parallel with non-deterministic ordering. The figure above shows how loadPortfsParFold works. Namely, we start with an empty List collection. Here, for the sake of demonstration, portfolios #999 and #5 are being loaded into memory from the database by the "query" operation. The "++" nodes are binary operations that merge partial lists of bond objects until a complete list is merged at the root in O(log N) time. At the top of the merge tree we have the fully merged in-memory List collection of portfolio data objects. In this depiction, the value, 17, represents a portfolio id chosen for demonstration purposes. Thus, the outer list contains portfolio data objects, each of which contains a list of bond objects. Note that this parallel memory-caching algorithm is not "embarrassingly parallel" as the data lists must be merged.
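The merge tree described above can be sketched with par.fold, using the empty List as the neutral element and "++" as the merge operation. This is a minimal sketch and not the study's loadPortfsParFold: the Data class and the query function are simplified, hypothetical stand-ins for the Data object and the database access. It assumes Scala 2.12 or earlier; on 2.13+ the scala-parallel-collections module and the commented import are needed.

```scala
// Scala 2.13+: add the scala-parallel-collections module and uncomment:
// import scala.collection.parallel.CollectionConverters._

object ParLoad {
  // Hypothetical stand-ins for the Data object and the database query.
  final case class Data(portfolioId: Int, bondIds: List[Int])
  def query(portfolioId: Int): Data =
    Data(portfolioId, List(portfolioId, portfolioId + 1))

  // Each id becomes a one-element list; "++" merges partial lists pairwise,
  // starting from the empty list as the neutral element.
  def loadSketch(ids: List[Int]): List[Data] =
    ids.par.map(id => List(query(id))).fold(List.empty[Data])(_ ++ _)
}
```

List concatenation is associative and the empty list is its identity, which is what fold requires of its operator and seed when partial results are combined in an unspecified grouping.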

H. Fine-grain algorithms
In a second-order effort to improve the naïve application of .par, we developed fine-grain algorithms of the composite and memory-bound kinds. Unlike the naïve algorithm, the fine-grain algorithm uses a parallel collection within the pricing function object. In other words, we have a parallel collection within a parallel collection.
The inner parallel collection has a bondPrice function object to price the bonds by their id (i.e., it makes a query to the database) per Equation 2 using par.map and a sum function object to reduce (i.e., accumulate) the bond prices in parallel using par.reduce. In effect, we have the snippet below of the price function.
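A minimal sketch of a fine-grain pricing function in this style is given below. It is not the study's code: the database query and Equation 2 are replaced by a hypothetical in-memory lookup, and the names priceById, price, and priceAll are illustrative. It assumes Scala 2.12 or earlier; on 2.13+ the scala-parallel-collections module and the commented import are needed.

```scala
// Scala 2.13+: add the scala-parallel-collections module and uncomment:
// import scala.collection.parallel.CollectionConverters._

object FineGrain {
  // Hypothetical stand-in for the per-bond database query and Equation 2.
  val priceById: Map[Int, Double] = Map(1 -> 100.0, 2 -> 95.5, 3 -> 102.25)
  def bondPrice(bondId: Int): Double = priceById(bondId)

  // Inner parallel collection: price each bond, then reduce by summation.
  // Note: reduce fails on an empty list, so a basket needs at least one bond.
  def price(bondIds: List[Int]): Double =
    bondIds.par.map(bondPrice).reduce(_ + _)

  // Outer parallel collection over portfolios: a parallel collection
  // within a parallel collection.
  def priceAll(portfolios: List[List[Int]]): List[Double] =
    portfolios.par.map(price).toList
}
```

Floating-point addition is not strictly associative, so sums produced by a parallel reduce may differ from the serial sum in the last few bits; comparisons against the serial result are best made with a tolerance.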
Bond prices flow directly to their reduction in an O(log N) processing tree.Thus, like parallel I/O, the workload is not "embarrassingly parallel" as the figure below suggests.
The memory-bound algorithm is similar except that it uses the parallel I/O query tree to access the database and cache the bonds in memory.

I. Coarse-grain algorithms
The other algorithms created a parallel collection of input portfolios whose size was independent of the number of processor cores.The idea of the parallel coarse-grain algorithm is to "chunk" the portfolios as a second-order effort to the naïve application of .par.That is, we create a parallel collection whose size is proportional to the number of processors.
The design of parallel collections does not provide a direct way to bind the pricing function to a core.This is part of parallel collections design philosophy: the programmer focuses on the functional specification and the parallel collection distributes it across the cores.
Nevertheless, the programmer can control the chunk size by making the input collection a List of a List of portfolio ids. For example, for a four-core platform like the W3540 we study, which has eight hyper-threads, the containing List has eight List elements.
Each element has |U'|/c portfolios, where c is the number of hyper-threads reported by the JVM. For u=1024 portfolios, each element in the containing List is a List of 128 portfolios. The pricing function object is then passed this list with 128 portfolios, which it processes serially to evaluate Equation 2 and Equation 3.
To compute the size of the contained List, we use the Java class, Runtime. It has a method, availableProcessors(). However, this method returns the number of hyper-threads, not the number of cores. As far as we know there is no way to get the number of cores except manually from the OEM datasheets, which we rely on for calculating the efficiency (see below). Otherwise, programmatically we use the Runtime class.
The coarse-grain composite algorithm loads the bond objects by portfolio just as the naïve algorithm does, except that it does so in "chunks" on demand. The memory-bound algorithm, like its naïve and fine-grain counterparts, uses the parallel I/O query tree to cache the bonds in memory.
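The chunking idea can be sketched as follows. This is an illustration under assumptions rather than the study's implementation: pricePortfolio is a hypothetical placeholder for the serial evaluation of Equation 2 and Equation 3, and the chunk count is taken from Runtime.getRuntime.availableProcessors(). It assumes Scala 2.12 or earlier; on 2.13+ the scala-parallel-collections module and the commented import are needed.

```scala
// Scala 2.13+: add the scala-parallel-collections module and uncomment:
// import scala.collection.parallel.CollectionConverters._

object CoarseGrain {
  // Hypothetical placeholder for serially pricing one portfolio id.
  def pricePortfolio(id: Int): Double = id * 1.0

  // Split the ids into at most c chunks of (nearly) equal size.
  def chunk(ids: List[Int], c: Int): List[List[Int]] = {
    val size = math.max(1, math.ceil(ids.size.toDouble / c).toInt)
    ids.grouped(size).toList
  }

  // Chunks run in parallel; portfolios inside a chunk are priced serially.
  def priceAll(ids: List[Int]): List[Double] = {
    val c = Runtime.getRuntime.availableProcessors() // hardware threads
    chunk(ids, c).par.map(_.map(pricePortfolio)).toList.flatten
  }
}
```

With 1024 ids and c = 8, chunk yields eight lists of 128 ids each, matching the worked example in the text.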

A. Environment
The test environment consisted of three hardware platforms with different Intel multicore processors. The table below shows the system configurations, with the clock speed in GHz and year of introduction by the Intel Corporation. All platforms run Microsoft Windows 7. The code was compiled by Eclipse 3.7.1 using the Scala IDE plugin version 2.0.0. The code was executed with the 64-bit JVM. We used MongoDB, version 1.8.3. Although MongoDB is accessed through TCP/IP, the database server runs on the same host as the Scala code. We indexed the portfolio and bond documents on their key ids.

B. Runs and trials
We instrument the code and make the following measurements. For each algorithm by its kind in Table 2, we make a total of 11 trial invocations of the code to obtain stable run-time statistics, following [20]. Each trial starts a new JVM, the code of which allocates new JVM objects and opens new database connections. The trial ends when the algorithm ends and the code exits, terminating the JVM, which closes the database connections and causes the operating system to recycle the JVM objects. A given set of trials, taken together, we call a run. There is a run for u = 2^x portfolios (i.e., the problem size) where x ∈ [0..10]. The run, u=1024, we call the terminal run. Note: #4 in Table 2 is not an actual run; it is derived by adding the measurements for #2 and #3 for the respective runs. For each run at a given problem size, we analyze the measurements for statistical significance as we describe below. We also graph the run-times using the median value of the run.

C. Speed-up and efficiency calculations
$T_1$ is the serial time of a serial algorithm. $T_N$ is the time using parallel collections.
Given $T_1$ and $T_N$, where N is the number of cores, we have the speedup, R:
$$R = T_1 / T_N$$
The efficiency, e, is
$$e = R / N \qquad (9)$$
In this case, N is the number of cores, which we obtained from the OEM datasheets online.[21] [22] [23]
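The two ratios are simple enough to state directly in code. The sketch below uses hypothetical function names; super-linear speedup corresponds to R > N, or equivalently e > 1.

```scala
object Speedup {
  // Speedup: serial time divided by parallel time.
  def speedup(t1: Double, tn: Double): Double = t1 / tn

  // Efficiency: speedup divided by the number of cores.
  def efficiency(t1: Double, tn: Double, cores: Int): Double =
    speedup(t1, tn) / cores

  // Super-linear speedup means R > N, that is, e > 1.
  def isSuperLinear(t1: Double, tn: Double, cores: Int): Boolean =
    efficiency(t1, tn, cores) > 1.0
}
```

As a made-up example, a serial time of 60 seconds and a parallel time of 12 seconds on four cores give R = 5 and e = 1.25, which would count as super-linear.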

D. Statistical significance calculations
After obtaining the runtimes, we observe the differences and test them for statistical significance in the indicated direction. That is, if the median runtime of algorithm A is less than the median runtime of algorithm B, we have the null hypothesis, $H_0$, stated in terms of the expected runtimes of A and B, where E is expectation. To conservatively estimate the p value, we used the one-tailed Mann-Whitney test.[24] We report (see the appendix) the rank sum statistic, S,
$$S = \sum_i R(T_i) \qquad (11)$$
where $R(T_i)$ is the rank of runtime $T_i$. Since there are 11 observations for each algorithm, the one-tailed threshold for p=0.05 is the rank sum $S_{.05}=101$. This value can be found in Table A7 in [24]. Thus, for $S < S_{.05}$, we reject $H_0$.
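The rank sum can be computed as sketched below. This is an illustrative implementation and not the study's code: it pools the two samples, assigns 1-based ranks with tied values sharing their average rank (a common convention, assumed here), and sums the ranks belonging to the first sample. The threshold of 101 is the value quoted in the text for 11 observations per algorithm.

```scala
object RankSum {
  // Rank sum S of sample a within the pooled sample; ties get the average rank.
  def rankSum(a: Seq[Double], b: Seq[Double]): Double = {
    val pooled = (a.map(x => (x, true)) ++ b.map(x => (x, false))).sortBy(_._1)
    // Group 1-based positions by value so tied values share the mean position.
    val ranked = pooled.zipWithIndex.groupBy(_._1._1).values.flatMap { group =>
      val avgRank = group.map(_._2 + 1).sum.toDouble / group.size
      group.map { case ((_, fromA), _) => (fromA, avgRank) }
    }
    ranked.collect { case (true, rank) => rank }.sum
  }

  // One-tailed threshold at p = 0.05 for 11 vs. 11 observations, from the text.
  val threshold = 101.0

  def rejectsNull(a: Seq[Double], b: Seq[Double]): Boolean =
    rankSum(a, b) < threshold
}
```

If every runtime of A is smaller than every runtime of B, A occupies ranks 1 through 11 and S takes its minimum value of 66, well below the threshold.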
We compare each of our eight algorithms relative to one another and test the differences for statistical significance.To make the report more accessible, we give the frequency count for the number of times an algorithm is found to be statistically significantly faster than another algorithm.Again, the rank sums, S, algorithm by algorithm for each hardware platform, can be found in the appendix.
We present graphical evidence for performance over the range of u mentioned above for each algorithm on each platform.We assess the statistical significance and present tabular data only for the terminal run, u=1024.

IV. RESULTS
The table below gives the kind of algorithms symbolized in the graphs and tables that follow.

A. Naïve results
The results for the naïve treatments are summarized in the next three graphs, one for the W3540, i7, and i3, respectively.
The number of portfolios, or problem size, is u = 2^x. The speedup, R, is on the left axis, and the efficiency, e, is on the right axis. The table below gives the terminal run results. $T_N$ is the median run-time in seconds. Note: In general, the median operator does not distribute over addition. Namely, median({io} + {compute}) ≠ median({io}) + median({compute}). For example, for the W3540, $T_N$({io} + {compute}) = 12.39 whereas $T_N$({io}) + $T_N$({compute}) = 8.74 + 3.52 = 12.26.
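The note about medians can be illustrated with a tiny made-up example of three trials, in which the trial with the median I/O time is not the trial with the median compute time. The timings below are invented for illustration and are not measurements from the study.

```scala
object MedianDemo {
  // Median of an odd-sized sample (the study uses 11 trials per run).
  def median(xs: Seq[Double]): Double = xs.sorted.apply(xs.size / 2)

  // Made-up per-trial timings in seconds.
  val io = Seq(8.0, 9.0, 10.0)
  val compute = Seq(3.0, 1.0, 5.0)

  // Per-trial totals: 11.0, 10.0, 15.0
  val total: Seq[Double] = io.zip(compute).map { case (a, b) => a + b }

  val medianOfSum: Double = median(total)                  // 11.0
  val sumOfMedians: Double = median(io) + median(compute)  // 9.0 + 3.0 = 12.0
}
```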

B. Fine-grain results
The results for the fine-grain algorithms are summarized in the next three graphs, one for the W3540, i7, and i3, respectively. The table below gives the results for the terminal run.

C. Coarse-grain results
The results for the coarse-grain algorithms are summarized in the next three graphs, one for the W3540, i7, and i3, respectively. Note that the algorithms are not defined when the number of portfolios is less than the number of hyper-threads. The table below gives results for the terminal run.

D. Statistical significance results
The table below gives the counts in which an algorithm is statistically significantly faster than another algorithm and kind. The details underlying this table are in the appendix, "Sorted Rank Sums." To read the above table, choose the kind of algorithm (composite vs. memory-bound) and read across for the type of algorithm. For example, the composite serial algorithm ran slower than every other algorithm on the W3540, i7, and i3 platforms. Hence, there are zero (0) values across the composite serial row. The memory-bound naïve algorithm ran faster than three algorithms on the W3540 and two algorithms on the i7 and i3, respectively.
The memory-bound serial algorithm outperformed one algorithm on each platform: these slower algorithms were the composite serial algorithms. Evidently, loading all the portfolios into memory significantly improves even the serial performance.
See the appendix, "Sorted Rank Sums" for the specific counts.

V. DISCUSSION
The graphs show that for larger problem sizes, u, the composite and memory-bound algorithms performed better than I/O processing alone, which is the least efficient, but worse than compute by itself, which is the most efficient. The slopes of these graphs generally point toward increasing speedup and efficiency for larger u.
Tables IV-VI show evidence for high levels of overlap between compute and I/O. For instance, the ratios of T{compute} / T{io+compute} and T{compute} / (T{io} + T{compute}) found in these tables are often around 80% or higher.
Table VII nevertheless indicates that the memory-bound algorithms tend generally to give statistically significantly better runtimes compared to the composite algorithms. In other words, caching the portfolios in memory upfront seems to give better performance than loading them as they are needed.
Table VII also suggests that, across all the algorithms, runs tend to be significantly more efficient on the i3, followed respectively by the W3540 and the i7.
Finally, the data in Table VI show the fine-grain algorithms give statistically significantly better runtimes followed respectively by coarse-grain and the naïve algorithms across different platforms.

VI. CONCLUSIONS AND FUTURE DIRECTIONS
This study has found that bond portfolio analysis using parallel collections achieves super-linear speedup and super-efficiency with as few as u=64 portfolios across different multicore processors. The data suggest that the "naïve" application of parallel collections can be improved significantly, foremost with the fine-grain algorithm, which we find interesting. That is, portfolio analysis is "embarrassingly parallel," but the fine-grain and I/O parallel algorithms are not, since they contain inherent dependencies that necessitated the use of parallel merge-trees.
The data point toward greater speedup and efficiency for larger problem sizes, u>1024. The terminal run analyzed only about 1% of the portfolios. Additional research could consider how to harness multiple hosts and/or GPUs to price all portfolios.
Future work might also compare and contrast map-reduce versus parallel collections as well as possibly consider how to improve the I/O performance.

= {0.005, 0.01, 0.02, 0.03, 0.04, 0.05}, where the elements of n are payment frequencies, T are maturities, and δ are coefficients. We derive the parameters for a bond object from the bond generator equations below:

Figure 1. Third normal object form (3ONF) of the database
val inputs = loadPortfsFoldLeft(n)
val outputs = inputs.map(price)
Snippet 5. Serially load the bonds in memory, then price portfolios serially
val outputs = inputs.par.map(price)
Snippet 6. Price the collection of randomly sampled portfolio ids in parallel
We have the snippet below for the memory-bound kind.
val inputs = loadPortfsParFold(n)
val outputs = inputs.par.map(price)
Snippet 7. First, load the bonds into memory in parallel by portfolio id, then price the portfolios in parallel

Figure 3. Accessing and pricing bonds then reducing prices in parallel.