Abstract

Hardware acceleration in high performance computer systems has a particular interest for many engineering and scientific applications in which a large number of arithmetic operations and transcendental functions must be computed. In this paper a hardware architecture for computing direct kinematics of robot manipulators with 5 degrees of freedom (5 D.o.f) using floating-point arithmetic is presented for 32, 43, and 64 bit-width representations and it is implemented in Field Programmable Gate Arrays (FPGAs). The proposed architecture has been developed using several floating-point libraries for arithmetic and transcendental functions operators, allowing the designer to select (pre-synthesis) a suitable bit-width representation according to the accuracy and dynamic range, as well as the area, elapsed time and power consumption requirements of the application. Synthesis results demonstrate the effectiveness and high performance of the implemented cores on commercial FPGAs. Simulation results have been addressed in order to compute the Mean Square Error (MSE), using the Matlab as statistical estimator, validating the correct behavior of the implemented cores. Additionally, the processing time of the hardware architecture was compared with the same formulation implemented in software, using the PowerPC (FPGA embedded processor), demonstrating that the hardware architecture speeds-up by factor of 1298 the software implementation.

1. Introduction

Currently, there exists a real demand in many robotics applications for higher operational speeds, and the solutions improving performance would have clear benefits in terms of manufacturing efficiency, precision, and processing time. In general, a fully operational robotic system running in real-time requires the repeated execution of a variety of complex algorithms, which involve the use of several transcendental functions and arithmetic operations. Most of these algorithms, if not all, need to be computed within milliseconds (ms) or microseconds (μs) in order to keep several real-time constraints. Such algorithms require massive computing power that surpasses the capabilities of many sequential computers [1].

A classical but important problem in the robotic manipulator is the direct kinematics. The direct kinematics allows the Cartesian location of the end effector to be calculated from measured values of the joint angles [2]. This mathematical formulation frequently needs a large number of arithmetic and trigonometric computations that must ideally be performed in a floating point, because of precision requirement [2]. Design and implementation of floating-point arithmetic operations in FPGAs has a relevant importance in a variety of scientific applications (such as robotics) due to the large dynamic range for representing real numbers, which permits suitable representation of both very small and large numbers in a fixed bit-width with proper precision. Additionally, the software implementations of direct kinematics computations are very expensive in processing time due to the sequential behavior of general purpose processors (GPPs).

Floating-point-based algorithms for arithmetic operators are commonly implemented on software and executed in microprocessors. Typically this solution requires to pay a performance penalty given that the conventional approaches require to perform the data transfer between the ALU (Arithmetic Logic Unit) and the program and the instruction memories. This problem, well known as von Neumann bottleneck, has been partially overcome by using multicores microprocessors reducing the execution time. Recently, Graphic Processor Units (GPUs) have been used for implementing complex algorithms taking advantage of the parallel floating-point units and increasing in this way their throughput in execution time. Although the GPU- based implementations achieves a noticeable speed-up, it is important to point out the following aspects: (a) the GPU-based solution presents bandwidth bottlenecks when all the source data are accessed from global memory or when simultaneous accesses from different threads to memory have to be addressed, (b) GPUs are not tailored for specific applications and commonly are difficult to fine-tune for executing only the operations required by an algorithm, and (c) these integrated circuits operate at high frequencies leading to large power consumption. This is a drawback for embedded system applications.

Due to the high capacity of parallel processing, FPGAs are now being used to accelerate processes such as digital signal processing, image processing, robotics, encryption and decryption, and communication protocol decoding [3, 4]. Robot tasks with higher operational speeds are one of the applications that require high processing capabilities. Therefore, the FPGA implementation of direct kinematics can be an important solution in order to achieve several real-time constraints.

There are two important aspects that must be considered when parallel processing computations using floating-point operators are implemented in FPGA: (a) the tradeoff between the need of reasonable accuracy and the cost of logic area and (b) the choice of a suitable format such that dynamic range is large enough to guarantee that saturation will not occur for a general-purpose application.

Current advances in VLSI technology raised the density integration fast enough for allowing the designers to develop directly in hardware several algorithms commonly implemented on software, thus, obtaining an expressive processing speed-up [5]. Moreover, computation of direct kinematics has high parallelization capabilities, and then, the performance can be improved by implementing it on FPGAs. In this way, a hardware architecture of direct kinematics taking advantage of these features could be useful in robotics applications that require high-speed movements.

In this paper, an FPGA implementation for computing the direct kinematics of a spheric robot with five degrees of freedom (5 D.o.f) is described. The hardware architecture considers a floating-point arithmetic, parameterizable by bit-width, allowing the designer to choose a suitable format according to the available hardware resources, accuracy and dynamic range requirements. The main contributions of this work can be summarized as follows: (a) the proposed architecture makes use of several floating-point arithmetic and trigonometric libraries, allowing the performance to be improved in comparison with previous works implementations, in which the floating-point operations are executed in software using DSPs, GPUs, or CPUs (see Section 2), (b) this work presents an error analysis for different bit-width representations (32, 43, and 64 bits), allowing the designer to analyze the tradeoff between the bit-width representation, which directly affects the cost in logic area and the accuracy requirements. Comparison of the processing time between the hardware architecture and a software implementation, using a PowerPC embedded microprocessor, is also presented, (c) the proposed hardware architecture has been developed taking into account a resource constrained methodology, allowing the arithmetic and trigonometric units to be scheduled between different states in order to achieve the lowest execution time according to the available hardware resources.

Section 2 presents the related works covering hardware implementations of direct and inverse kinematics. Section 3 describes the direct kinematics mathematical formulation. Section 4 describes the IEEE-754 standard for the floating-point number representation and a tradeoff analysis of the FPGA implementation of the floating-point operators required by the direct kinematics. Section 5 presents the FPGA implementations and, before concluding, Section 6 presents the synthesis and simulation results.

Several previous works have presented hardware architectures for implementing only the servo control loop of robotic manipulators, being in these cases the direct and inverse kinematics computed on software [69]. In [6, 7] a hardware structure for controlling a SCARA robotic manipulator is developed, relieving the computational cost of general purpose microprocessor by implementing on FPGAs the servo control loop of the robot manipulator.

In [8, 9] a hardware-software codesign for controlling an articulated robot arm is presented, using a NIOS processor for implementing the inverse kinematics. In the same context, in [10] the case study is a neural controller for 3 D.o.f parallel robot for milling. This controller is based on neural model of the inverse dynamics of the manipulator, trained on data collected with the use of a computed torque stabilizing controller, and the authors propose that good candidates for hardware implementation are those fragments of an algorithm that can be calculated using only fixed-point operations, due to the fact that they require less FPGA resources and, therefore, are faster whenever are compared with floating-point modules. The other parts of the algorithm (including those working with floating-point) are good candidates to be implemented in an embedded processor. Additionally, [11] proposes hardware solutions based on an ARM processor and fixed-point FPGA modules for computing the trigonometric and square root functions of inverse kinematics. They are based on existing pipeline arithmetic circuits, which employ the CORDIC (Coordinate Rotation Digital Computer) algorithm.

Taking into account previous works, the hardware/software codesign is a very applied approach. In this direction, [12] implements the direct and inverse kinematics of a multifinger hand system, computing the floating-point arithmetic and trigonometric operations (for instance inverse kinematics) using a high-speed DSP processor and several hardware peripherals (PWM controllers and interfaces for data communication), connected among them throughout a NIOS embedded processor. On the other hand, [13] shows the implementation of the direct kinematic and force feedback algorithm using both CORDIC (for fixed-point) and Xilinx arithmetic library. This work has measured the error of the direct kinematics with respect to a PC implementation, in which numerical analysis shows that Cartesian position error is always below 0.1 mm and the error in orientation is less than 0.001 deg. Although a 32 bit fixed-point data representation is used, details about the same (for example, bit number for fractional part) are not described.

In [14] a hardware implementation of inverse kinematics and a servo controller for a robot manipulator are proposed, allowing the FPGA to compute the high complex computations, such as the transcendental functions, as well as exploring the parallel capabilities of the inverse kinematics. As previous approaches, all hardware embedded operations are performed in fixed-point.

In [15] a generic controller for a multiple-axis motion system (that can be applied to a manipulator) is implemented, which includes several modules such as velocity profile generator, interpolation calculator, inverse kinematics calculator, PID controller, among others. In this case, the trigonometric operations (such as sin, cos, arc tang and arc cosine) have been also implemented by using lookup tables and a fixed-point arithmetic representation.

Additionally, [16] shows the direct and inverse kinematics computation of a 6 D.o.f space manipulator using an ARM processor and FPGA coprocessor. Additionally, it considers the hardware implementation of a pipelined CORDIC library in an FPGA device (using fixed-point representation). The experiment shows that the absolute accuracy of the end-effector is less than 3 mm error. In [17] an optimized inverse kinematics approach is proposed for controlling an arm of a virtual human, in which an FPGA device is used for accelerating the computations. In order to improve the performance, a floating to fixed point conversion is performed; however, it imposes a limitation on the dynamic range of the operations.

In summary, almost all previous works describe a hardware-software codesign for implementing the direct and inverse kinematics, in which critical parts are developed in hardware accelerators, developed in FPGAs. On the other hand, these previous works do not consider explicitly floating-point arithmetic for performing the computations using appropriated arithmetic libraries for FPGA, and this can become very important as required by several engineering applications such as robotic manipulators, in which high accuracy and a large dynamic range are important requirements, apart from attending both good performance and low cost in logic area. In this context, taking the importance and dramatical growth of embedded application for automation, control, and robotics areas [18], the full hardware implementation of a kinematics is very important to be researched, in terms of comparing the cost, performance and precision with respective software implementations, specially developed over FPGA embedded processors. Additionally, most of the previous works do not show the error analysis comparing both hardware and software results of the same kinematics, and only few ones measure the error of the direct/inverse kinematics (for instance, [13, 16]) using only fixed-point representation.

3. Background

3.1. Direct Kinematics

A robot manipulator can be described as a series of links, which connect the end effector to the base, with each link connected to the next by an actuated joint. Attaching a coordinate frame to each link, the relationship between two links can be described with a homogeneous transformation matrix—an matrix. Therefore, a sequence of these matrices relates the based to the hand of the manipulator (see (1)) [2].

Thus, the direct kinematics problem is summarized by finding a homogeneous matrix transformation that relates the end effector pose (position and orientation) to the based coordinate frame, knowing the angles and displacement between the links and the geometric parameters of the manipulator [2].

The joint coordinate frames have to be assigned by using a rational convention, associated to a zero position (where all joint variables are set to zero). So, it is assumed that for the assignment of coordinate frames to each link the manipulator has to be moved to its zero position. The zero position of the manipulator is the position where all joint variables are zero. This procedure may be useful to check if the zero positions of the model constructed are the same as those used by the controller, avoiding the need of introducing constant deviations to the joint variables (joint positions). Subsequently the -axis of each joint should be made coincident with the joint axis. This convention is used by many authors and in many robot controllers [2, 19]. For a prismatic joint, the direction of the -axis is in the direction of motion, and its sense is away from the joint. For a revolute joint, the sense of the -axis is towards the positive direction of rotation around the -axis. The positive direction of rotation of each joint can be easily found by moving the robot and reading the joint positions on the robot controller display. According to the [2, 19], the base coordinate frame (robot reference) may be assigned with axes parallel to the world coordinate frame. The origin of the base frame is coincident with the origin of joint 1 (first joint). This assumes that the axis of the first joint is normal to the plane. This location for the base frame coincides with many manufacturers' defined base frame. Afterwards coordinate frames are attached to the link at its distal joint (joint farthest from the base). A frame is internal to the link it is attached to (there is no movements relative to it), and the succeeding link moves relative to it. Thus, coordinate frame is at joint , that is, the joint that connects link to link . The origin of the frame is placed as following: if the joint axes of a link intersect, then the origin of the frame attached to the link is placed at the joint axes intersection; if the joint axes are parallel or do not intersect, then the frame origin is placed at the distal joint; subsequently, if a frame origin is described relative to another coordinate frame by using more than one direction, then it must be moved to make use of only one direction if possible. Thus, the frame origins will be described using the minimum number of link parameters. The -axis or the -axis have their direction according to the convention used to parameterize the transformations between links. At this point the homogeneous transformations between joints must have already been determined. The other axis ( or ) can be determined using the right-hand rule. A coordinate frame can be attached to the end of the final link, within the end-effector or tool, or it may be necessary to locate this coordinate frame at the tool plate and have a separate hand transformation. The -axis of the frame is in the same direction as the -axis of the frame assigned to the last joint (). The end-effector or tool frame location and orientation is defined according to the controller conventions. The homogeneous transformations between joints follow the Denavit-Hartemberg convention (D-H), making two consecutive links related through 4 basic transformations that only depend on the link geometry [2]. The transformations are:(1)a rotation about the axis by the angle between links (), (2)a translation along the axis of the distance between the links (), (3)a translation along the axis (rotated axis) of the length of the link (), (4)a rotation about the axis of the twist angle ().

, , , and represent the D-H parameters of the link . Identifying these parameters, the matrix that relates two consecutive links can be obtained. Multiplying the sequence of the matrices (see (1)) a homogeneous matrix transformation is obtained, solving the direct kinematics problem.

3.2. Direct kinematics for a Spherical Robot Manipulator

The robot described in this paper has a spherical topology with 5 degrees of freedom, with two rotational joints in the base followed by a prismatic joint and two rotational joints located in manipulator hand. The first three joints are responsible for the wrist position and the two last ones are responsible for the hand or tool orientation. Figure 1 shows the proposed robot topology.

Figure 2 shows the assigned coordinate frame to each link following the convention D-H for the manipulator depicted in Figure 1. Table 1 presents the D-H parameters.

Table 1 can be used for computing the matrix that relates two consecutive axis frames of the robot, and finally, by multiplying the sequence of the matrices, a homogeneous matrix transformation is obtained. The matrix can be formulated as shown in Equation (3), where the entries of follow a notation that represents, for example, as the projection of the axis of the last coordinate frame in the axis of the base frame, and is the component of the origin coordinates of the last frame represented in the base frame.

By using the notation, the matrix is summarized by the (4) to (15). In these equations the notation and was used.

It can be observed in the (4) to (15) that some intermediate computations are found in two equations. For example, the intermediate computation () is presented in the (4) and (7). In this way, the intermediate computation should be computed only once, saving processing time.

Although parameter is equal to zero, it will be taken into account for future calibration purposes.

4. Description and Analysis of the Floating-Point Operators

4.1. IEEE-754 Format

The IEEE-754 format [20] is a floating-point number representation characterized by three components: a sign , a biased exponent with bit-width and a mantissa with bit-width as shown in Figure 3. A zero sign bit denotes a positive number and a one sign bit denotes a negative number. A constant (bias) is added to the exponent in order to make the exponent's range nonnegative and the mantissa represents the magnitude of the number. The standard also includes extensive recommendations for advanced exception handling, additional operations (such as trigonometric functions), and expression evaluation for achieving reproducible results.

This standard allows the user to work not only with the 32-bit single precision and 64-bit double precision, but also with a suitable precision according to the application. This is suitable for supporting variable precision floating-point operations [21].

4.2. An Architectural Approach for Floating-Point Operators

As stated in (4) to (15), the direct kinematics computation of the spherical robot described above requires the computation of add/sub and multiplication arithmetic operators, as well as the sine and cosine transcendental functions. Previous works covering hardware implementations of floating-point transcendental functions on FPGAs are based on CORDIC algorithms using simple and double precision formats [2224]. In [25] a HOTBM method for computing trigonometric functions in FPGAs is presented, achieving a reduced area and high performance without sacrificing accuracy. However, these works have not received enough attention on the cost in area associated with the precision level, as well as, their respective error analysis.

In this work, the Taylor series expansion has been implemented on FPGAs for computing, in an integrated approach, the sine, cosine, and arctangent functions, using a floating-point arithmetic based on the IEEE-754 standard. Our previous results point out that the Taylor expansion approach has a lower execution time and a lower cost in logic area than the CORDIC-based solution. However, the CORDIC algorithm presents a better performance in terms of precision [2628]. As will be explained below, this work focuses on solving the resource and timing constrains (see Section 4); therefore, the choice of using a Taylor expansion approach for computing the floating-point trigonometric operators can be justified. This work considers accuracy as a design criterion and provides an error analysis associated to the bit-width representation and the area cost, as well as, an error analysis associated to the number of iterations of the Taylor series. These results are useful for choosing a suitable format for representing real numbers according to general-purpose applications.

The Taylor series of sine, cosine and arctangent functions, around 0 are given, respectively, as

The factors and are precomputed and stored in a ROM avoiding additional operations. Before the first iteration, the term is computed and stored. In each iteration one floating-point addition of the accumulated approximation with the th term is necessary. Additionally, for computing the th term two floating-point multiplications are needed: (1) for computing either or for sine/arctangent or cosine, respectively, and (2) for computing either for sine, for arctangent and for cosine.

Figure 4 shows the proposed architecture for the Taylor series expansion. This approach was achieved using only one , an , a Finite State Machine (FSM), and three ROMs for storing the precomputed and factors. The op signal chooses between a sin, cos or atan functions. The FSM synchronizes the , units and alternates the signal. At the first iteration, the factor is computed and stored in a register and fed back to the FSM. Afterward, the precomputed coefficients are selected in order to compute the current term either for sine, for arctangent or for cosine. The unit computes and accumulates both the addition or subtraction operations. After iterations, the ready signal indicates a valid output.

Table 2 shows the Mean Square Error (MSE) results of the arithmetic and transcendental functions architectures for different bit-width representations (24, 32, 43, and 64 bits), using the Matlab results as statistical estimators. The Taylor architecture uses 5 nonzero powers (5 terms of the series expansion). As expected, the best results were achieved using the double precision format (64 bits); however, one can expect a large cost in logic area.

Table 3 presents the MSE and latency in clock cycles for the Taylor series expansion, using a simple precision format (32 bits). It can be observed that the smaller error achieved by the FPTaylor core is over , for five powers ( polynomial degree). With more than five powers in the series expansion, the error due to the Runge's phenomenon is increased [29] when using polynomial interpolation with polynomials of high degree.

5. The FPGA Implementations

The FPGA implementation of the direct kinematics is carried out using Time-Constrained Scheduling (TC) [30], as showed in Figure 5. In fact, the circuit explicitly implements (4) to (15). To do that, some strategies can be used in order to allocate the respective hardware resources, achieving a previously defined cost function, which can represent time or/and resource restrictions. Time constrained and resource constrained scheduling algorithms are well-known techniques for leading with obtaining a suitable time execution and allocating the available hardware resources (for instance, arithmetic operators). For achieving the circuit depicted in Figure 5 (from (4) to (15)), only resource restrictions were imposed, where 4 multipliers and two add/sub units were used, apart from 8 trigonometric units. With the proposed scheduling the lowest time execution was tried to be accomplished. In contrast, a time constrained scheduling would need more arithmetic units for accomplishing a predefined restriction in execution time. Anyway, this kind of algorithms does not guarantee an optimal solution [30].

The control of the overall architecture was achieved by a Finite State Machine (FSM), and the overall architecture was described in VHDL using the Xilinx ISE 10.1 development tool, making use of the floating-point arithmetic units. Eight cores for computing transcendental functions were required: four for computing the sin function and the other four for computation the cos function. The necessary addition and subtraction operations were achieved by applying the and cores presented in [28] (see Section 3). For computing the sin and cos functions, a Taylor series expansion was implemented, using a and a cores, with a similar architecture to the presented in [26]. The and were achieved using 10 and 11 nonzero powers for cosine and sine respectively and an elapsed time of 23 clock cycles for cosine and 26 for sine.

It can be noticed that the orientation and position results are calculated from the following steps: in step 3 the and , in step 4 the parameter , in step 5 the and , in step 6 the parameter is computed, in step 7 the and , in step 10 the and , and finally in step 11 the and (see Figure 5).

In order to compare the performance of the proposed architecture, the direct kinematics was also implemented in software using a C code language running on an FPGA embedded PowerPC processor, using the Xilinx Platform Studio development tool. This approach allows the designer for having a just comparison between both hardware and software developments, which are running in the same environment (namely, the same FPGA with the same clock). To do that, two hardware peripherals were added to the PLB bus of the PowerPC processor: the first one for counting the clock cycles and the second one for accomplishing a RS-232 serial communication to a Matlab environment. The counter is used for measuring the processing time in software by means of enable and stop ports (both addressed by the PLB bus), which are controlled by specific software instructions. Figure 6 shows the connection between the PowerPC processor and the hardware peripherals.

6. Results

6.1. Synthesis Results

Synthesis results are shown in Table 4 for the Xilinx Virtex2 family (chip xc2vp30) using the ISE 10.1 development tool. It can be observed that the cost and performance of the architecture are presented in Figure 5 for different bit-width (exponent and mantissa), including both the simple precision and double precision (IEEE-754 standard). As expected, large bit-width representations are more expensive in terms of area if compared with small bit-width representations. Additionally, the performance is better when using small bit-widths.

It can be seen that the 43 and 64 bit-widths representations exceeded the hardware resources available in the specific Virtex2 FPGA (see Table 4). However, it is important to take into account that the xc2vp30 is not the largest device of the Xilinx Virtex2 FPGA family. Modern FPGAs devices have a large number of configurable logic blocks (CLB), dedicated DSP blocks, among other facilities which are suitable for mapping complex algorithms directly in hardware. Table 5 shows the cost in logic area of the same architecture using a Virtex5 FPGA (chip xc5vlx110t). It is important to take into account that the Virtex5 FPGA family uses embedded DSP48Es blocks, which considers bits multipliers. It can be observed that the operational frequency has been increased; however the double precision representations exceed the available number of LUTs. However, as will be explained below, simulation results point out that the single precision format (32 bits) achieves similar error results to large bit-width representations. Therefore, one can conclude that the 32 bit implementation is suitable in terms of hardware resources consumption and error associated for computing the direct kinematics of a 5 D.o.f spheric robot.

6.2. Simulation Results

The implemented direct kinematics hardware architecture (see Figure 5) was simulated using the ModelSim 6.3 g simulator tool. A simulation environment to validate the architecture behavior was developed in Matlab. This developed tool created the floating-point test vectors for different bit-widths. The binary floating-point results were analyzed in the simulator environment in order to calculate the Mean Square Error (MSE) of the implemented architecture, for which the set of Matlab results in double precision arithmetic were used as a statistical estimator.

For computing the MSE of the hardware architecture, 100 input values were used. The input ranges were −90° to 90° for the angles , , and , −35° to 20° for the angle , and 160 mm to 760 mm for the parameter.

Table 6 shows the MSE for different bit-width representations. Notice that the magnitude order of MSE does not have significant changes. Simulations using a bit-width representation smaller than a 32 bit-width representation did not present satisfactory results, due to saturation in the floating-point operators.

The software implementation (see Figure 6) was validated using a simulation environment developed in Matlab and the serial RS232 interface for sending the input values and receiving the output values (position, orientation, and time processing in clock cycles). The aim of this simulation is to compare the processing time of the direct kinematics between software and hardware approaches and not to compare the error in the computations.

The elapsed time using the hardware architecture for each variable and each step is shown in Table 7. The overall direct kinematics is computed in 67 clock cycles. According to synthesis results, the maximum operational frequency is around 54 MHz for a single precision format, thus, all the computations are performed in 1.24 μs. The software implementation has a processing time of 1.61036 ms for all computations of the direct kinematics in floating-point using a single precision. For comparison purposes we have only used the obtained operational frequency on the VirtexII FPGA family (54 MHz) due to the fact that this device has a PowerPC embedded processor. It can be observed that the hardware architecture is over 1298 times faster than the software implementation, which means a considerable speed-up in the processing time of the direct kinematics. These elapsed time results can be dependent on the operation scheduling used in the architecture.

7. Conclusion

An FPGA implementation of the direct kinematics of a spherical robot manipulator using floating-point units was presented. The proposed architecture was designed using a Time-Constrained Scheduling, using 8 cores for computing transcendental functions (sine and cosine), and sharing 4 multiplier floating-point cores and 2 addition/subtraction floating-point cores for computing the arithmetical operations. Synthesis results show that the proposed hardware architecture for direct kinematics of robots is feasible in modern FPGAs families, in which there are plenty of logic elements available.

The overall computations were successfully performed and were validated using the Matlab results as a statistical estimator. The computation time to implement the direct kinematics in hardware is over 1.24 μs, whereas the same formulation implemented in software using a PowerPC processor needs 1.61 ms, obtaining a considerable speed-up in the processing time.

As future works, a hardware architecture for computing the inverse kinematics of the proposed robot manipulator will be implemented using the floating-point cores currently developed, such as division, square root, and arctangent functions. In addition, we intend to perform a performance comparison of the proposed hardware architecture with a multicore embedded processor approach.