Abstract

The hardware structure of a processing element used for optimization of an investment strategy for financial markets is presented. It is shown how this processing element can be multiply implemented on the massively parallel FPGA-machine RIVYERA. This leads to a speedup of a factor of about 17,000 in comparison to one single high-performance PC, while saving more than 99% of the consumed energy. Furthermore, it is shown for a special security and different time periods that the optimized investment strategy delivers an outperformance between 2 and 14 percent in relation to a buy and hold strategy.

1. Introduction

The goal of technical financial market analysis is to predict the development of indices, stocks, funds, and other securities by evaluating the charts of the past. A method to find such predictions can lead to an investment strategy. Many well-known chart-analysis methods (e.g., Elliott waves [1], Bollinger Bands [2]) try to extract patterns from the charts, expecting that such patterns will come up in similar ways again in the future. There are more than 100 different chart-analysis methods, but their success is doubted [3]. In most cases, the current development of the markets significantly affects the quality of the different investment strategies. Since the business volume per year on the worldwide stock markets is more than USD 35 trillion [4], it is not surprising that successful investment strategies are the focus of intensive research.

In general, there are many indicators influencing the chart of a security. Those are not only economic and political indicators but also psychological ones. It is very difficult to decide which weight should be assigned to which indicator, the more so as there are known and unknown tradeoffs between different indicators. Furthermore, weights change over time. Recent papers [5–8] try to apply data mining methods to historical market rates in order to find investment strategies that perform significantly above the average. This approach is extremely compute-intensive since millions of quotations are fixed worldwide every day. Even with the use of high-performance computers, a reduction of this amount of data is required. But as shown in the literature [5–8], data mining helps to keep the essential information content in order to arrive at successful investment strategies.

In this paper, we present an investment strategy using a novel data mining method, which is discussed in Section 2. It results in a performance significantly above average for certain periods. It is based on the idea of an iterative search for an optimal set of indicator weights in the space of all possible weights of the indicators. Since this space grows exponentially with the number of indicators, the method is very compute-intensive but can optimally be parallelized. Therefore, an FPGA implementation seems to be very promising, because the computational core can be kept very simple and small in hardware. The two phases of the method have been implemented on the FPGA-based massively parallel computer RIVYERA. The RIVYERA architecture and its idea of efficiently exploiting 128 modern FPGAs in parallel are explained in Section 3. Section 4 describes the architecture and the implementation of one processing element for the main computation. The speedup achieved by such an FPGA approach is investigated in Section 5. For different time intervals, the advantage in comparison to a single buy-and-hold strategy is determined. We do not want to discuss the investment strategy itself in this paper. Instead, the main focus here lies on the improvement in terms of speed, energy, and cost efficiency of the new method in comparison to an implementation on a sequential computer architecture. Section 6 summarizes the results and concludes the paper.

2. The Process of Optimizing an Investment Strategy for Securities

For a single security a successful strategy for buying and selling is desired. For simplicity, in this paper let be an investment fund that can be traded without trading costs (there are several discount brokers offering such funds, e.g., Vanguard in USA, InvestSMART Financial Services Pty Ltd in Australia, European Bank for Fund Services GmbH in Germany). Since the taxation regulations are varying in different countries, these are not considered here either.

We consider indicators that might have influence on the chart of . Typical indicators are S&P500, Nikkei225, EuroStoxx50, EUR/USD, and so forth. In other methods of technical analysis itself is used as an indicator as well. We consider a time interval of the past consisting of subsequent trading days . should be large enough to get significant results (e.g., ).

is an matrix, where is the percentage difference of the indicator from to . The vector is the th row of the matrix : . The required data for such a matrix can either be collected or downloaded from some trading platform in the internet.

At time we assume a cash capital of one million EUR and a depot with pieces of the security . The value of one million has been chosen in order to be able to abstract from rounding errors. The results for other starting values can be computed proportionally. Generally, let be the cash money, the number of pieces of in the depot, and the total property at day (). The fund considered here has exactly one market price per day. Therefore, the following condition holds: We are looking for a function that computes the decision of buying or selling a certain amount of from the values of known up to . The output of is on the one hand the decision either to buy, to do nothing, or to sell, and, on the other hand, the amount for the first and the last case. The optimal function of this kind is the one that maximizes the value of . This approach is motivated by the assumption of technical analysis that a successful strategy of the past will also be successful in the future.

In order to simplify the search for , we consider only functions of the kind . The weight vector is denoted by . We define as the vector which yields a maximum value for . A positive value of is a buy indication, a negative value a sell indication. The amount of to be traded is . A buy at day is limited to in order not to overdraw the cash account and a sell is limited to , accordingly. The decision to restrict the search to functions of the kind is based on the assumption that the influence of the different indicators is almost linear. Although this cannot be proven here, the results with this simplification are already remarkable. However, it is still worth investigating modifications of this method with nonlinear functions.

There is one problem with investment funds: at that point in time at the day where the trading decision is made, the exact value of is not known. Therefore, at this point of time it is not possible to compute the exact number of pieces to be traded without overdrawing the cash account. For a buy order, we therefore transmit not the number of pieces but the amount of money for which we want to buy pieces. Vice versa, for a sell order, we should transmit the exact number of pieces to be sold.

Let be the amount of money for which pieces of should be bought at and the number of pieces to be sold at . Obviously, the following condition holds: .

Furthermore, if , it holds: And if ,

It is computationally infeasible to determine with a brute force approach, even on a supercomputer. If one considers only 100 different values for each of 8 indicators, then there are $100^8 = 10^{16}$ different combinations of those. Using a calibration period of 26 weeks with 5 trading days per week, the required number of computations of the function would be on the order of $10^{18}$. As specified in the following examinations, even the presented RIVYERA implementation would require 377 days to evaluate that number of combinations. Instead of such a brute force method, we use an iterative approach: initially, a very rough grid of 16 values per indicator is used. For each of the weight vectors, the final property is computed. The areas in the grid where is relatively large are the targets of the next iteration: in the neighborhood of the corresponding weights, the grid is refined. If a component of a promising weight vector is at the boundary of the grid, then the grid is extended in this direction. In the same way, we keep on refining the grid. Already after 100 iteration steps, the results are satisfying. In each iteration step weight vectors are considered, resulting in calculations of the function plus the resulting calculation of the development of the depot value under the assumption that the corresponding buying and selling decisions are taken into account. The process of refining the grid is part of the host system and not specified in this paper.
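Since the host-side refinement is not specified here, the following Python sketch only illustrates the general idea of a coarse-to-fine grid search with boundary extension. The function name `refine_search`, the window-halving rule, and all parameters are assumptions made for illustration, not the procedure actually used on the host.

```python
import itertools

def refine_search(score, n_dims, lo=-1.0, hi=1.0, points=5, iterations=20):
    """Coarse-to-fine grid search (illustrative only).

    Returns (best_score, best_weights). The halving rule and the
    boundary handling are assumptions, not taken from the paper.
    """
    bounds = [(lo, hi)] * n_dims
    best = (float("-inf"), None)
    for _ in range(iterations):
        # One axis of equally spaced candidate values per dimension.
        axes = []
        for a, b in bounds:
            step = (b - a) / (points - 1)
            axes.append([a + k * step for k in range(points)])
        # Exhaustively score every grid point; this is the part that
        # the paper offloads to the FPGAs.
        for w in itertools.product(*axes):
            s = score(w)
            if s > best[0]:
                best = (s, w)
        # Re-centre each axis on the best weight found so far.
        new_bounds = []
        for (a, b), wi in zip(bounds, best[1]):
            half = (b - a) / 2
            tol = 1e-9 * (b - a)
            if abs(wi - a) < tol or abs(wi - b) < tol:
                # Best value lies on the boundary: keep the width and
                # shift the window, which extends the grid that way.
                new_bounds.append((wi - half, wi + half))
            else:
                # Interior optimum: halve the window around it.
                new_bounds.append((wi - half / 2, wi + half / 2))
        bounds = new_bounds
    return best

# Toy objective whose maximum (0.3, -2.5) lies partly outside the
# initial grid, so the boundary extension is exercised.
toy = lambda w: -((w[0] - 0.3) ** 2 + (w[1] + 2.5) ** 2)
best_score, best_w = refine_search(toy, n_dims=2)
print(best_w)
```

Each iteration evaluates `points ** n_dims` candidates, which is why the per-iteration evaluation is the natural target for hardware acceleration.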

The high computational effort is caused by the exhaustive search concerning the evaluation of . This calculation can be remarkably accelerated by an FPGA-based implementation. Hence, the concrete task is the following: How can we accelerate the identification of the optimal weight vector which maximizes out of a given set of weight vectors?

3. FPGA-Based Hardware Platform RIVYERA

Introduced in 2008, the massively parallel FPGA-based hardware platform RIVYERA [9] is the direct successor of the COPACOBANA, presented in 2006 for the cost-optimized breaking of 56-bit DES ciphers in less than two weeks [10]. Besides applications in cryptanalysis (e.g., [11]), RIVYERA finds its applications in the fields of bioinformatics [12–14] and now stock market analysis, as described in this paper.

For the application presented here, the specific RIVYERA S3-5000 is used, distributed by SciEngines GmbH [15]. RIVYERA is designed to be a completely scalable system consisting of two basic elements. Firstly, the in-built multiple FPGA-based supercomputer provides the resources for parallel high-performance applications (Figure 1, right side). Secondly, a standard server grade mainboard equipped with an Intel Core i7-930 processor, 12 GB of RAM, and 2 TB of hard disk space, provides the resources for quick pre- and postprocessing purposes (Figure 1, left side). The RIVYERA S3-5000 is powered by two 650 W supplies and packed in a standard rack mountable 3 U housing. It is running a standard Linux operating system and, therefore, presents an independent system. The details are discussed briefly in the following.

The FPGA-based supercomputer consists of a backplane and up to 16 FPGA cards (fully equipped for the application described in this paper). Each FPGA card is equipped with eight user configurable Xilinx Spartan3-5000 type FPGAs and one additional FPGA as communication controller. In total, these are 128 user configurable FPGAs. Additionally, a DRAM module with a capacity of 32 MB is directly attached to each user FPGA.

All FPGAs are connected by a systolic-like bus system. Each FPGA on an FPGA card is connected with two neighbors forming a ring including the communication controller. The FPGA card slots are connected to each neighboring slot as well on the backplane, providing the connections between the communication controllers on each FPGA card. The communication is physically realized by high-throughput symmetric LVDS point-to-point connections. The communication of the FPGA-based computer to the host-mainboard follows a connection via PCIe controller card directly to a communication controller on a chosen FPGA card. For applications requiring a higher bandwidth from the host system to the FPGA-based computer, more than one PCIe controller may be attached to other FPGA cards as well. For a configuration as used for this application, the measured net bandwidth from the host to the FPGA computer reaches up to 66 MB/s. Of course, the latency will be different dependent on which clients are communicating with each other, according to the length of the communication chain.

For application development, the RIVYERA provides an API for each of the two basic elements, that is, an API controlling the data transfer between the host software and the FPGAs including broadcast facilities, and an API for the user defined hardware configuration of the FPGAs controlling the data transfer to other FPGAs and the host as well.

A picture of the RIVYERA S3-5000 is shown in Figure 2.

4. Processor Architecture

The FPGA-based part of the presented algorithm is based on exhaustive searches. As different weight vectors can be evaluated independently, the algorithm is suitable for massive parallelization. Therefore, the following description of the technical implementation only considers a single FPGA. Assuming uniform programming of all available FPGAs and an equally divided search space, the computational speed rises approximately linearly with the number of FPGAs. In accordance with the RIVYERA platform, the implementation presented here is optimized for Xilinx Spartan3-5000 FPGAs [9, 16].

The key aspect concerning the identification of valuable weight vectors is the calculation of the score for every possible element of the search space. Since these evaluations are the fundamental issue of the computational effort, the success of creating an efficient processor architecture is directly linked to the performance of the underlying implementation of the scoring function. Thus, the main objective, and therefore starting point for the design of the processing element, should be the creation of a scoring unit with a high throughput.

4.1. Scoring Pipeline

The evaluation of consists of repetitive computations of the sequences and . Therefore, the throughput of the scoring unit is directly connected to the performance of the computation of these two sequences. Thus, despite the high spatial cost, the advantages of a pipeline architecture are persuasive. The implementation presented here is based on pipelines that yield a new pair in every clock cycle. As the values and are defined recursively, the pipeline has to wait for its own outputs. Thus, to avoid idle time, scores for different weight vectors are evaluated concurrently, where is the length of the longest cyclic path. Hence, is given by the number of clock cycles that are necessary to compute and from and .

Basically, the structure can be subdivided into three segments. The first one is described by the function . Assuming indicators, the calculation of needs multiplications and additions. The corresponding structure for is shown in Figure 3. As all following calculations directly depend on , this computation is part of the longest path of the pipeline. Hence, the additions should be arranged in a way that only the minimum number of steps (1 multiplication and additions) is required. However, the path is not an element of the longest cyclic path because and do not depend on the outputs of the pipeline. A different arrangement of the additions has no effect on .

To reduce resource usage, the buy order size and the sell order size are combined into a general order size . A negative value indicates a sell order, and a positive value denotes a buy order. Instead of the sequence , the pipeline calculates the values as this makes it possible to evaluate with just one addition. The calculation of is given by the following instruction:

When the evaluation of is finished, the order size at day is computed as shown in Figure 4. The intermediate result is restricted to by the usage of two multiplexers and corresponding comparators. The total property and the negative depot value do not depend on and, thus, can be calculated in parallel to its evaluation. Hence, the longest path of the pipeline is extended by the multiplication and the comparator chain.

After the calculation of the order size, new cash and depot values are computed. The value that identifies the day of the historical data set rises by 1 until the end of the given time period is reached. In this case, is set to 0, which implies the start of the evaluation of a new weight vector. The sequences and are reset to the default values and . In the same clock cycle, the sum of and is calculated and transmitted to the multiplexer that refers to :

In comparison to a multiplication, a division is much more expensive with regard to resource usage [16]. As a consequence, the quotient is realized as the multiplication . On the one hand, this implies the additional calculation and storage of inverse elements. On the other hand, every calculation needs to be done only once and can be outsourced to the host system. Likewise, the additional memory usage can be disregarded, as we will see in Section 4.3.
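The scoring recursion described in this section can be sketched in software as follows. This is a simplified floating-point model, not the hardware design: the scaling of the order size by the total property, the start with an empty depot, and trading at the price of the same day are assumptions made for illustration, and the names `score`, `diffs`, and `prices` are hypothetical.

```python
def score(weights, diffs, prices, cash=1_000_000.0, pieces=0.0):
    """Final total property after trading through the whole period.

    diffs[i]  -- indicator percentage differences used on day i
    prices[i] -- market price of the security on day i
    """
    # Inverse prices are computed once up front (on the host in the
    # paper), so the loop itself contains no division.
    inv_prices = [1.0 / p for p in prices]
    for d, price, inv_price in zip(diffs, prices, inv_prices):
        # Linear trading signal: weighted sum of the indicators.
        f = sum(w * x for w, x in zip(weights, d))
        depot_value = pieces * price
        total = cash + depot_value
        # ASSUMPTION: order size proportional to the total property.
        order = f * total
        # Clamp so that neither cash nor depot is overdrawn
        # (two multiplexers and comparators in hardware).
        order = max(-depot_value, min(cash, order))
        cash -= order
        pieces += order * inv_price  # multiply instead of divide
    return cash + pieces * prices[-1]

# One indicator, two days: the signal says "buy with everything" on
# day 0, then the price rises from 100 to 110.
result = score([1.0], [[1.0], [0.0]], [100.0, 110.0])
print(result)
```

A positive `order` moves money from cash into the depot and a negative one moves it back, so a single signed quantity covers both the buy and the sell case.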

As discussed, the algorithm is trivially parallelizable. The computational speed depends linearly on the number of FPGAs. Likewise, this statement can be applied to the number of pipelines. But how many pipelines can be synthesized on an FPGA, and are there further possibilities to increase that number?

4.2. Optimized Fixpoint Representation

All in all, one scoring pipeline is built of multiplications, additions, 2 subtractions, 3 comparators, and 5 multiplexers where is the number of indicators (see Figure 6); in total, these are operations. A Spartan3-5000 FPGA consists of 8,320 Configurable Logic Blocks which can be separated into 33,280 Slices [16]. Additionally, 104 dedicated $18 \times 18$ bit multipliers can be assigned for synthesis.

The allocation report of two synthesis results is shown in Table 1. A single precision floating point representation of all variables is assumed in both cases. Using 32 multipliers, 8 indicators yield a consumption of 25% of the available slices. Assuming that 10% of the slices are reserved for further control units, three pipelines can be synthesized on the FPGA. In the case of 16 indicators, an additional 17% is required. The drastic increase results from the 8 additional adders and multipliers and the comparatively high spatial cost of floating point units [16]. An important point is the synchronization of the different pipeline stages. For example, the third stage (see Figure 5) receives, amongst others, the input values and . While is given, is only known after several calculations. Hence, to provide synchronicity, the transfer of is delayed using shift registers. The longest cyclic path consists of 2 additions, 3 multipliers, 2 comparators, and 3 multiplexers. As an extension of the pipeline implies the requirement of more shift registers, the path should be as short as possible. Optimized in terms of space, the longest cyclic path comprises clock cycles. All in all, only two pipelines are possible in this case.

To counter that problem, a fixpoint representation is introduced in the following. The idea is motivated by the fact that many of the given values are located in limited ranges. For example, the daily price fluctuations in rarely exceed the interval . That is the reason why the values of are stored in 18 bits, of which 12 bits encode the fractional part; the new codomain is the interval $[-32, 32)$ with a precision of $2^{-12}$. Likewise, the elements of the weight vector are stored in 18 bits. While a fractional part of 12 bits seems to be the best tradeoff between overflow immunity on the one hand and precision on the other hand, the range of the weight vectors may be determined specifically for every use case. Cash, depot value, and stock prices are stored as integer values in cents. The inverse prices are multiplied by $2^{32}$ and also stored as integer values.
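To make the 18-bit format concrete, the following Python sketch models a signed fixpoint value with 12 fractional bits. The two's complement interpretation, the saturation on overflow, and the truncating shift after the multiplication are assumptions; the text only fixes the word width and the number of fractional bits.

```python
FRAC_BITS = 12   # fractional bits, as stated in the text
WORD_BITS = 18   # matches the inputs of the dedicated multipliers
MIN_RAW = -(1 << (WORD_BITS - 1))      # -131072, i.e. -32.0
MAX_RAW = (1 << (WORD_BITS - 1)) - 1   #  131071, i.e. just below 32.0

def to_fixed(x):
    """Quantize a real number to 18 bits (ASSUMED: saturating)."""
    raw = round(x * (1 << FRAC_BITS))
    return max(MIN_RAW, min(MAX_RAW, raw))

def to_float(raw):
    """Convert a raw fixpoint value back to a real number."""
    return raw / (1 << FRAC_BITS)

def fixed_dot(weights_raw, diffs_raw):
    """Weighted sum of indicators in fixpoint arithmetic.

    Each 18x18 bit product carries 24 fractional bits, so the
    accumulated sum is shifted right by 12 to return to the format.
    """
    acc = sum(w * d for w, d in zip(weights_raw, diffs_raw))
    return acc >> FRAC_BITS

half = to_fixed(0.5)
two = to_fixed(2.0)
print(to_float(fixed_dot([half], [two])))  # 0.5 * 2.0 in fixpoint
```

Because both operands fit into 18 bits, every product maps onto one dedicated multiplier without additional logic.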

The 18-bit representation of and promotes the efficient usage of the dedicated multipliers. Furthermore, the transfer from floating point to fixpoint units leads to a considerable decrease of the allocated resources. The length of the longest cyclic path can be reduced to 37. As shown in Table 2, the available resources suffice for up to 6 pipelines per FPGA.

4.3. FPGA Overview

The pipelines are triggered synchronously. The trading period and the corresponding historical data are set globally for all scoring units. Since independent score evaluations are calculated in parallel in every pipeline, the value of has to change only once every clock cycles. To trigger the pipeline in the th recursion, the historical information of day is necessary. This set consists of the vector of price fluctuations and the values and . To transfer these values within one clock cycle, the historical data of day is stored in a single Block RAM word. Such a word consists of 18 bits per indicator plus 64 further bits, for example, 208 bits for 8 indicators. The Spartan3-5000 provides 104 RAM blocks with 1,872 Kbit in total [16]. This is obviously enough in our case, as it suffices for over 9000 days relating to 8 indicators.
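The memory figures can be checked with a few lines of Python. The sketch assumes that the 1,872 figure denotes Kbit of Block RAM, as in the Spartan-3 data sheet, and that each price fluctuation occupies 18 bits as described in Section 4.2; the variable names are chosen for illustration.

```python
INDICATORS = 8
BITS_PER_INDICATOR = 18
WORD_BITS = 208                  # stated in the text for 8 indicators
BLOCK_RAM_BITS = 1872 * 1024     # ASSUMED unit: Kbit, not KB

# Bits of a word not used by the indicator fluctuations.
other_bits = WORD_BITS - INDICATORS * BITS_PER_INDICATOR
# Number of trading days whose data fits into the Block RAM.
days = BLOCK_RAM_BITS // WORD_BITS
print(other_bits, days)
```

Under these assumptions the result is consistent with the statement that the Block RAM suffices for over 9000 days.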

As the optimization is based on an exhaustive search, it is necessary to determine the search space. The declared objective is to identify the optimal weight combination for 8 indicators. 8 possible candidates are given for every indicator. So, the search space is declared by an $8 \times 8$ matrix. Every row describes an indicator and consists of 8 values. Each of these values can be used as a weight for the corresponding indicator. As there are 8 possible candidates for each of 8 indicators, the number of possible weight vectors is $8^8 \approx 16.7$ million. So, one FPGA is able to calculate the optimal weight vector out of 16.7 million combinations. One unique combination for every pipeline has to be calculated in every clock cycle. To accomplish this, every possible combination is declared by a 24-bit identifier in the range of $0$ to $2^{24} - 1$. An equally divided subspace is assigned to every pipeline. The weight vector is extracted by masking the identifier. The bits to show the position of coefficient of the weight vector . For example, the identifier references the matrix items    for and for . This interpretation is very efficient as the effort of bit masking is comparatively small. Thus, 6 different weight vectors can be selected in a single clock cycle and assigned to the pipelines.
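The identifier decoding can be illustrated with a short Python sketch. The bit layout used here (3 bits per indicator, with the lowest bits selecting the candidate of the first indicator) is an assumption made for illustration; the function name `decode` and the example matrix are hypothetical.

```python
BITS_PER_INDEX = 3   # 8 candidates per indicator -> 3 bits each
N_INDICATORS = 8     # 8 * 3 = 24 identifier bits in total

def decode(identifier, candidates):
    """Map a 24-bit identifier to a weight vector by shift and mask.

    candidates[i][j] is the j-th candidate weight of indicator i.
    ASSUMED layout: bits 3i..3i+2 select the column for indicator i.
    """
    weights = []
    for i in range(N_INDICATORS):
        col = (identifier >> (BITS_PER_INDEX * i)) & 0b111
        weights.append(candidates[i][col])
    return weights

# Example matrix whose entries encode their own position (10*row + col).
candidates = [[10 * i + j for j in range(8)] for i in range(8)]
print(decode(0b101_010, candidates))
```

Counting the identifier up by one per clock cycle then enumerates every weight combination exactly once, and splitting the identifier range into equal parts yields the subspace of each pipeline.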

As the data flow is synchronous, the scores of all pipelines are calculated at the same time. Assuming 6 pipelines, 6 results are returned per clock cycle. Obviously, it is neither possible nor does it make sense to store $8^8$ values. Likewise, the effort to maintain a list of the best scores is too high, as it implies the sorting of 6 results into the list in a single clock cycle. The examination of this problem shows that a good tradeoff is the storage of the best result of every pipeline. Utilizing 6 pipelines and 128 FPGAs, 768 results are evaluated in every iteration. This set seems to be widespread enough to calculate new weight coefficients for the next iteration. An overview of the FPGA structure is shown in Figure 7.

5. Results and Performance Analysis

For further research, we will now consider results and performance for a certain security, the investment fund DWS Convertibles, ISIN DE0008474263, that is operating internationally.

As described in Section 2 (referred to as the calibration phase in the following), the optimal weight vector is determined for the security and furthermore for a randomly chosen time period of 26 weeks (calibration time interval). We chose 8 indicators that widely represent the current economic environment: S&P 500, DAX, EuroStoxx 50, ASX 200, Nikkei 225, Hang Seng, S&P 500 Future, and EUR/USD. The goal is to find with the maximal value of .

The computational effort with 8 indicators is already rather high. In this paper, we refrain from investigating more indicators since, on the one hand, these 8 indicators represent the activities on the international stock markets to a high degree, and on the other hand, the results with this restriction are already remarkable.

We now focus on the investment strategy where at day for indicators the are calculated and then from the volume of buying or selling orders is computed based on the value of . To determine the quality of the vector , we test it in a different period of time referred to as the evaluation phase. Of course, this only makes sense for a time interval (the so-called evaluation time interval) which does not overlap with the calibration time interval. We have chosen three different evaluation time intervals of 26 weeks as well. The question is whether or not the new investment strategy gives an outperformance in comparison to a buy-and-hold strategy. Buy-and-hold means is bought at the beginning of the evaluation time interval and sold at the end.

Figure 8 shows an example of the chart of the security in comparison to the performance of our investment strategy with the same security within three different evaluation time intervals of 26 weeks each, . (2009-09-14–2010-03-15) is a period where tendency for the fund is rising. (2010-09-27–2011-03-28) is a period without a clear tendency and (2011-03-28–2011-09-26) is a period where tendency for the fund is falling.

The values of had been determined for each in the iterative way described previously. The resulting investment strategy was then applied for the evaluation time intervals , where and . The chart shows the performance of the monetary assets in the evaluation time interval using investment strategy .

In all time intervals, an outperformance of the investment strategies over between (see in Figure 8) and (see in Figure 8) can be seen. Although this is no proof in a mathematical sense that such an investment strategy can be applied to arbitrary securities in arbitrary time periods, it seems to be very promising to further improve the method described here.

Considering computing performance as well, the RIVYERA or similar computer architectures are perfectly suited for such research. Table 3 shows a comparison between the RIVYERA-based approach and a PC version of the algorithm implemented in C. The test system uses an Intel Core i7–970 with MHz, an ASRock X58 Extreme mainboard and 8 GB GeIL DIMM DDR3-1066 RAM. The implementation uses all cores of the processor. In addition, the improved number representation is used in the PC version as well.

pairs are calculated in every clock cycle on RIVYERA where is the number of pipelines per FPGA. The clock rate of the implementation is 50 MHz. Assuming 8 indicators, 6 pipelines can be synthesized on an FPGA. With 128 FPGAs, this yields $128 \cdot 6 \cdot 50 \cdot 10^6 = 38.4$ billion pairs per second. Measurements of the PC version indicate that 2.26 million calculations per second are possible on the specified test system. The conclusion is a speedup of about 17,000. While RIVYERA requires up to 1300 W, 300 W is assumed for a standard PC. Accordingly, the energy consumed for the same amount of computation is reduced by more than 99%.
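The throughput, speedup, and energy figures follow from the numbers given in this section, as the following Python sketch shows. It assumes all 128 FPGAs run 6 pipelines at 50 MHz and that energy per unit of work is power divided by throughput; the variable names are chosen for illustration.

```python
FPGAS = 128
PIPELINES_PER_FPGA = 6
CLOCK_HZ = 50e6
PC_PAIRS_PER_SECOND = 2.26e6
RIVYERA_WATTS = 1300
PC_WATTS = 300

# One pair per pipeline per clock cycle.
rivyera_rate = FPGAS * PIPELINES_PER_FPGA * CLOCK_HZ
speedup = rivyera_rate / PC_PAIRS_PER_SECOND
# Energy RIVYERA needs relative to the PC for the same workload:
# the PC runs `speedup` times longer at PC_WATTS.
energy_ratio = RIVYERA_WATTS / (speedup * PC_WATTS)

print(f"{rivyera_rate / 1e9:.1f} billion pairs/s")
print(f"speedup ~ {speedup:.0f}")
print(f"energy saving ~ {100 * (1 - energy_ratio):.2f}%")
```

The energy comparison is per workload, not per second: RIVYERA draws more power than the PC but finishes the same work about 17,000 times sooner.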

Of course, such a comparison yields a number of questions. Intel declares 76.8 GFLOPs for i7–970 [17]. The presented FPGA design needs operations for indicators. Assuming that the PC version manages to work with the same number of operations, one could deduce that the referred processor reaches up to billion pairs per second. This is more than 1,000 times faster than the actual implementation. So, what is the reason for this gap?

In fact, the computing power of the processor is not the bottleneck. The main problem is located in the intensive memory communication. Even a cache-optimized version needs several RAM accesses (and of course many cache accesses) to calculate a single pair. The pipeline structure cannot directly be translated but only be simulated by further memory instructions. In contrast, the FPGA Block RAM modules are triggered in parallel to the actual calculations. Thus, there is absolutely no latency concerning memory operations. This is a reason why this algorithm is very suitable for massively parallel computing. A further interesting issue would be a comparison in performance using GPGPU.

The time complexity of the presented algorithm also differs between the platforms. On standard processors, the complexity is where denotes the maximum number of possible coefficients for one indicator. The factor occurs because the evaluation of needs multiplications and additions that have to be executed sequentially. A RIVYERA pipeline calculates one pair per clock cycle in every case. So, there is obviously no such dependency. Therefore, the time complexity is . However, the dependency on is not erased by this approach. While the size of a standard processor remains constant for an increasing , more adders and multipliers are necessary for an FPGA-based implementation. According to this, the spatial complexity is . As this may lead to fewer pipelines per FPGA, an indirect influence on the runtime cannot be ruled out.

6. Conclusion

The FPGA-machine RIVYERA is very suitable for the optimization of the investment strategy presented in this paper. A speedup of about 17,000 and an energy saving of more than 99% in comparison to one single high-performance PC have been determined. For the investment fund and the time periods reviewed, the investment strategy optimized with RIVYERA delivers a significant outperformance in relation to a buy-and-hold strategy.

Several other securities for different time periods were tested. Although always the same, simple indicators were used, the optimization of the investment strategy by using RIVYERA delivered almost in every case a significant outperformance.