Abstract
In finite element methods, numerical simulation of the problem requires the generation of a linear system based on an integral form of a problem. Using C++ metaprogramming techniques, a method is developed that allows writing code that stays close to the mathematical formulation. We explain the specifics of our method, which relies on the Boost.Proto framework to simplify the evaluation of our language. Some practical examples are elaborated, together with an analysis of the performance. The abstraction overhead is quantified using benchmarks.
1. Introduction
The application of the finite element method (FEM) requires the discretization of the integral form of the partial differential equations that govern the physics, thus transforming the problem into a set of algebraic equations that can be solved numerically. In essence, the resulting numerical model only depends on the governing equations and the set of basis and test functions. From the point of view of a model developer, it would therefore be ideal to only need to specify these parameters in a concise form that closely resembles the mathematical formulation of the problem, without sacrificing computational efficiency.
The current work is part of Coolfluid 3 [1], a C++ framework intended primarily for the solution of fluid dynamics problems and available as open source under the LGPL v3 license. The system is designed around a Component class that can provide a dynamic interface through Python scripting and a GUI. Model developers will typically write a set of components to build a solver for a given set of equations, using functionality provided by our framework and external libraries. Problemdependent settings such as the mesh, boundary conditions, and model parameters can then be controlled through Python or a GUI, though these aspects are beyond the scope of the current paper. In this context, we provide an Embedded Domain Specific Language (EDSL) that can be used to implement finite element solver components. It uses a notation similar to the mathematical formulation of the weak form of a problem, allowing the programmer to focus on the physics rather than coding details. We assume a typical finite element workflow, where a global linear system is assembled from element matrix contributions. Our language can also describe boundary conditions and field arithmetic operations.
All language elements consist of standard C++ code, so they easily embed into the components of our framework. An extension mechanism is available, allowing any developer to implement his own language elements without touching the library code. The implementation uses the Boost.Proto [2] framework, which builds on template meta programming techniques for generating efficient code. The actual performance overhead will be quantified and compared to the global solution time for some applications.
The automatic generation of code based on an intuitive specification of the physical problem is of course a feature that is highly desirable for any numerical simulation framework. One example of previous work providing such functionality is OpenFOAM [3], a C++ library primarily intended for fluid mechanics. It allows for the easy expression of differential equations by providing an embedded tensor manipulation language, specific to finite volume discretizations. The FEniCS project [4] provides an easy and efficient method for developing finite element discretizations. The approach differs from ours in that the language is embedded into Python and compiled using a justintime compiler. It also offers the option of automated analytical evaluation of integrals [5]. Another example is the FEEL++ project [6], which also provides a domain specific language embedded into C++ using expression templates.
Notwithstanding the existence of all these excellent projects, we decided to implement our own framework for tight integration with the Coolfluid data structures and to support our workflow of building userconfigurable components. The implementation relies on expression templates, just like FEEL++, but we use Boost.Proto to handle the expression template generation and to define a grammar for our language. This results in simpler code and has a limited impact on runtime performance, as we will show. Even though our language is not as featurecomplete as the more established FEEL++ and FEniCS projects, we do believe that the easy integration of userdefined extensions into the language is a unique feature. The purpose of the current paper is to present how Proto can be used to construct a language for finite element modeling—based on a simple example building a mass matrix—and to show the capabilities and performance of the language we developed. Aspects that are new compared to previous work in the field are the use of Proto as a new way to construct the language and the possibility to add userdefined terminals to the language.
This paper is structured as follows. In Section 2 the mathematical formulation of the integral form will be laid out in order to clearly define the class of problems that is supported. Next, in Section 3 the mechanics for constructing and interpreting the EDSL are explained in detail, based on a simple application. Section 4 contains some application examples and Section 5 a performance analysis. Finally, in Section 6 we present our conclusions and suggest future work.
2. Finite Element Discretization
In this section, we introduce the notation used for finite element discretizations, starting from the Poisson problem as an example. A more generic and much more thorough introduction to the finite element method is given in, for example, [7]. Considering a domain with boundary , the differential equations describing the Poisson problem are
To transform these continuous equations into a discrete problem, the unknown is interpolated using a linear combination of shape functions with the unknown coefficients :
The shape functions depend only on the spatial coordinates and to simplify the notation we will from here on just write instead of . The shape functions have a local support and the interpolation happens on a perelement basis, so is equal to the number of nodes in an element. The interpolation can also be written in vector form as . Here, is a row vector with the shape functions associated with unknown and is the column vector of the unknown coefficients for the element.
To obtain a linear system for the discrete problem, we need to multiply the equations with a weighting function and integrate over the domain. Using the Galerkin method (i.e., the weighting functions are the shape functions) as well as integrating perelement yields the following linear system for the Poisson problem, using the weak formulation here:
Due to the local support of the shape functions, the integral is written as a sum of integrals over all the elements. Each element integral results in an element matrix on the left hand side and an element vector on the right hand side. Summing up the contributions of all elements that share a given node assembles a global linear system that has the total number of nodes as dimension.
The indices are useful for problems with multiple unknowns. In this case, each unknown can be associated with its own shape function, and the element matrices and vectors are built up of blocks associated with each variable and equation.
The matrix assembly procedure is the same for all problems; only the values of the element matrices depend on the initial equation (1). This observation drives the current work: we will present a convenient language to write out the element equations and automate the steps that are common for all finite element discretizations. The code for the Poisson problem is presented in Listing 1, where lines and map directly to the element matrix and vector, as defined in (4), and lines and represent the assembly into the global system. Lines and form the core of our language, and the techniques for implementing this will be explained in detail in the next section. The remainder of Listing 1 will be further explained in Section 4.1.

From this simple example, the basic requirements for our language can be identified: provide access to the shape functions and their derivatives, as well as nodal values of the variables; control the evaluation of integrals; and allow the application of boundary conditions. Finally, we must be able to express the assembly into global matrices and vectors. The embedding of such a language into C++ is the main topic of the remainder of this paper.
3. Construction of the Language
Our language consists of three levels of implementation, as shown in Figure 1. The top level contains the language itself, which consists of keywords that the user combines into an expression. Because the language is embedded, only valid C++ expressions are allowed, but we can define the interpretation freely via operator overloading. This is the job of the algorithm implementation layer. Here, the language is parsed and appropriate actions—called semantic actions in language grammar terms—are linked to each part of the input. Some of the actions may result in calls to a shape function library or matrix operations, so the final layer contains all external libraries needed to execute these.
The remainder of this section will focus on each layer in more detail, using a simple standalone example. This allows us to easily demonstrate the techniques used in Coolfluid 3 while avoiding the need to explain the details of our mesh structure and linear algebra interface, which are beyond the scope of this paper. The example will result in a program that builds the mass matrix for a finite element problem with a single scalar variable. It is a standalone program that depends only on the Boost libraries and on the Eigen library [8] for matrix operations.
The mass matrix is assembled from the outer product of the weight and shape functions; that is, The shape function has no index referring to a specific variable here, since we only consider a single, unspecified scalar variable. We want to ensure that the following code evaluates (5): here, for_each_element represents a generic function, taking the finite element mesh structure as first argument and a Proto expression as second argument. The expression consists here of a call to element_quadrature to evaluate the integral over an element and the += operator should be interpreted as an assembly into the mass matrix M. The outer product of the shape function vectors maps directly to transpose(N)*N. In what follows, we show how we can evaluate this expression.
3.1. The Language Layer
Using the Boost.Proto [2] library, expressions are constructed from socalled “terminals.” These are C++ objects that are instances of the Proto terminal class, which overloads all of the C++ operators. Combining different terminals using operators builds an expression that can be evaluated later. This expression is analogous to the expression templates used in, for example, FEEL++ [6]. Listing 2 presents a minimal program that allows compiling the expression for (5). Line contains the expression, built from the terminals defined at lines to . Because all operations are defined on a Proto terminal, the code compiles. The result of line is an expression tree, retaining the concrete type of each element in the expression. This is the same kind of data structure that we encounter in any expression template library, but it is generated automatically by Proto here.

Figure 2 shows the expression tree corresponding to line . Leaf nodes in this tree correspond to Proto terminals. Each function call operator (denoted by ()) has two child nodes: the terminal representing the function itself (e.g., transpose) and the argument to the function. The terminal that represents the function is used to link it with the correct code of the function, based on the unique type of that terminal. This implies that functions can easily be renamed by creating a new terminal of the same type. The binary operators += and * also have two children, corresponding to the left and right hand side.
So far, the expression on line does nothing. The parsing and the implementation of the appropriate actions to evaluate this are the subject of the next section.
3.2. The Algorithm Implementation Layer
The algorithm implementation layer in Figure 1 takes care of the interpretation of expressions and the execution of the associated actions. In this section, we take a topdown approach for evaluating our example expression. To this end, we first define a class—grammar in Proto terminology—capable of parsing the expression. The class defined in Listing 3 describes the language that we use in our example expression. The grammar is defined entirely in a templated type from which fem_grammar derives. This notation is actually the embedded domain specific language provided by Proto itself for defining a grammar. The or_ construct allows us to list a series of rules (each a Proto grammar in itself) that are checked in sequence. The when statement first describes what kind of syntax to expect, followed by a second argument—the semantic action—that describes how to evaluate the expression matching the when clause. In the case of terminals, we can use the type to distinguish them. The grammar can be used to check the validity of an expression at compile time, using the proto::matches metafunction, which does not evaluate the expression but simply checks if it matches a given grammar, taking into account only the when clauses. This functionality can be used to generate compilation errors that can help in finding errors in expressions, but this is not further discussed in the current work.

The main use of the grammars is the actual evaluation of the expressions. To this end, we can use fem_grammar as a C++ functor on an expression, in which case the grammar acts as a Proto “transform” (i.e., a kind of functor), evaluating the expression by executing the semantic actions embedded into the grammar. We will now describe each rule in the grammar in detail.
First, at line , shape function expressions are matched and evaluated. In (6), shape functions appear as the terminal N. The grammar matches this using proto::terminal with the appropriate type on line . The second argument in the when clause (line ) performs the evaluation. All the grammar code is inside template parameters, so everything is actually a type. Line is then not really a function call, but the type of a function that returns something of the type eval_shape_func and takes proto::_data as argument. This function is never really defined, but Proto uses the type to call an eval_shape_func functor and pass its arguments. The argument consists of context data here—the “data” block from Figure 1—and will be discussed in detail later.
The implementation of the eval_shape_func functor is given in Listing 4. To be usable in a Proto grammar, a functor must inherit the proto::callable class. Lines to are used by Proto to compute the result type of the functor. In C++11, it is possible to obtain this automatically, so this part of the code can be omitted. The function itself is defined on lines and on line we return the shape value functions, which are just cached in the context data. As will be seen later, this cached value is actually computed in the element_quadrature function.

The next grammar rule, on line in Listing 3, matches and evaluates the matrix transpose. In the matching rule on line , it is prescribed that a function call using a terminal of type transpose_tag and with as argument an expression that matches fem_grammar is the correct form of a transpose expression. To evaluate the argument, we make a recursive call to fem_grammar itself (using _child1 to isolate the function argument in the expression) and apply the eval_transpose functor to the result. Listing 5 shows the relevant code for evaluating the transpose. It is a functor that takes any matrix type from the Eigen library [8] as argument and calls the transpose function on line .

The rule describing the integration over an element starts on line in Listing 3. On line , we impose that only expressions of the plus_assign type, that is, +=, can be used as an argument to the element_quadrature function. This expression is then picked apart on line using the _left and _right Proto operations, where the value of the left hand side (an Eigen matrix representing the global system matrix), the right hand side expression, and the context data are passed to the eval_element_quadrature functor. This functor is defined in Listing 6. On line , the numerical integration is performed by looping over the Gauss points defined for the element. First, the shape function values and Jacobian determinant are computed and stored in the context data (lines and ). Next, the element matrix is updated with the value of the expression at the Gauss point, multiplied with the appropriate weight and the Jacobian determinant. The expression itself is evaluated using the call: Here, our grammar is defaultconstructed (i.e., the first ()) and then called using three arguments: the expression, the state (0 here), and the context data. The state is similar to the data, but we chose not to use it in the present work and just pass an integer. By supplying the context data, any shape function evaluations that occur in expr will use the correct value at the current Gauss point as computed on line . This also means that the shape function is evaluated only once, no matter how many times it occurs in the expression.

In this example, the element quadrature function also places the result in the global system matrix (i.e., M in (6)). This happens on lines , where the global node index for every entry in the element matrix is obtained from the context data and the corresponding global matrix entry is updated.
The final rule in the grammar (line in Listing 3) is a fallback to the default C++ behavior. It tells Proto to perform the default C++ action when none of the earlier rules are matched, recursively calling fem_grammar to interpret any subtrees. In the case of our example expression, it ensures that the matrix product is evaluated using the operators defined in the Eigen library, after evaluating the left and right operand using fem_grammar.
The definition of the context data used by the functors is given in Listing 7. It holds the knowledge about the concrete element type and the finite element mesh that we use. We are completely free to define this type as we see fit, since Proto passes it along as a template parameter. In our example, we provide storage for results such as the shape function values and the node coordinates. The data changes for each element and is updated by the set_element function that is called by the element looping algorithm (still to be discussed). This function updates the node coordinates and the mapping between local element node index and global node index. The entire class is templated on the element type, which allows using highperformance fixed size matrices from the Eigen library.

Using predefined transforms from the Proto framework and userdefined function objects allows for the creation of grammars which can call themselves transforms recursively. This concise domain specific language describing grammars is a key feature of Proto: it makes it possible to indicate how to evaluate an expression without resorting to complicated metaprogramming techniques, resulting in code that is easier to read and maintain. The complexity of the grammar and the expressions it can parse is limited only by the time and memory available for the compilation.
To obtain a working code, we still need to implement the loop over the elements and generate the data that is used in expression evaluation. The data is templated on the concrete element type, but as can be seen from Listing 14 the mesh data lacks this information at compiletime. This is logical, since in a real application the user loads the mesh, so the software cannot know what concrete type is used at compile time. To select the element at run time, we need to generate code for all the element types that are supported by the solver. In this example, we will support both 1D line elements and 2D triangle elements. The difference in the mesh data lies in the number of columns in the coordinates and connectivity tables, so we can use that to determine which mesh we are using. By checking for each supported element type if the mesh matches, we can execute the correct code at runtime. We use the MPL functor defined in Listing 8 to match the correct element type to the mesh. The functor is executed for each item in a list of allowed elements (line ), resulting in cleaner code than a long list of ifelse statements. Once the element type check at line passes, the data can be constructed and we can start the loop, updating the data for each element at line . The actual expression is executed using the grammar at line , in the same way as in the previously discussed element integration functor (Listing 6, line ). The expression itself remains a template parameter, allowing us to write any kind of Proto expression in a call to for_each_element and keeping all compiletime information. Both of these properties are key advantages over a similar system that could be set up using virtual functions. By generating code for a list of predefined element types, the technique can be seen as a special case of generative programming [9], where the code to generate is defined through the MPL vector of element types.

In Coolfluid 3, the process is more complicated, since we can have different shape functions for the geometry and each variable that appears in an expression. This means we must now generate code for every possible combination of shape functions. We chose to organize the data so that each unknown that appears in an expression has its own data structure, aggregated into a global data structure for the expression. This approach makes it possible to have a unified data structure supporting expressions with an arbitrary number of variables. Another complication is the possibility to mix different element types in one mesh, for example, triangles and quadrilaterals. We solve this by organizing the mesh into sets that contain only one element type and then deal with each set in turn. The complete algorithm to create the data and loop over a set of elements of the same type is as in Algorithm 1.

At this point we might wonder what happens if we try to evaluate an expression that makes no sense (e.g., N + transpose). Unfortunately, this often results in a very long list of compiler messages (430 lines for the example), exposing Proto implementation details and long template types. In the current example, it would be easy to fix using compiletime error messages, but for a more complex language grammar this is difficult and we have not yet implemented a satisfactory error handling system. We do intend to handle some common cases, and already errors from Eigen indicating incompatible matrix expressions come through, but they are often obscured by a great number of other errors.
3.3. External Libraries
The example code uses the simplest possible data structures: the mesh simply contains arrays for the coordinates and the element nodes. The shape functions are simplified to only provide the functionality needed for the examples and they also include a fixed set of Gauss points for numerical integration. Listings 13 and 14 show the code for a first order 1D line element of the Lagrange family and the mesh data structure, respectively.
In Coolfluid 3, we provide our own mesh data structure, shape function library, and numerical integration framework. For operations at the element level, the Eigen [8] library is used, since it provides highly optimized routines for small, dense matrices. Finally, sparse matrix operations—required for the solution of the linear system arising from the elementwise assembly—are performed using the Trilinos [10] library.
3.4. User Defined Terminals
If we want to extend the language with new functionality, this currently requires modification to the grammar (Listing 3). Since this is part of the programming framework, it is not something that is supposed to be modified by the user of the language. To provide more flexibility, we allow users to define new terminals that can be used in expressions without modifying the grammar. To the best of our knowledge, this is a unique feature for this kind of embedded language. We will show here how it works, by providing an overview of how this could be added to our example code. We return to the definition of terminals, tagged using empty classes as in Listing 9. We created a terminal with the template class user_op as type. Its use as a function (line ) can be matched using the grammar: proto::function<proto::terminal< user_ op<proto::_> >, proto::vararg<proto::_> >Here, proto::_ is a wildcard that matches any type and vararg allows the function to have any number of arguments. This kind of construct is exactly what we need to make my_op a user defined terminal, though we still have to provide a way to evaluate the expression. Adding a when clause as in Listing 10 yields the code that we need to add to fem_grammar in Listing 3. This uses fem_grammar to evaluate all function arguments and then calls the evaluate_user_op functor to evaluate the call (not listed here for the sake of brevity). In evaluate_user_op, we use the template argument to user_op (my_callable here) as a functor to evaluate the operator. This allows the user to define how the terminal should be evaluated without ever needing to touch the core grammar, by simply implementing the correct function call operator in the type that is passed to user_op.


In Coolfluid 3, this technique is used to implement specialized code for certain solvers, directly setting entries of the element matrix when analytical expressions are known. Some core functions, such as the gradient and divergence operations, are also defined this way. Listing 11 shows the implementation for the divergence operation. On line , we have the function call implementation, where var is expanded into the data associated with the unknown for which we want to compute the divergence. This data is then used to access the gradient matrix (line ), much like we computed the shape function value in our previous example (i.e., line in Listing 6). The terminal is statically created on line , using the MakeSFOp metafunction to avoid a long Proto type name. We can easily overload the function signature using additional function call operators. This is used on line to provide an alternative divergence call that does not take a mapped coordinate as an argument but evaluates the divergence at the current quadrature point instead.

3.5. Integration into a Framework
So far, we have focused on building a small language, with the for_each_element looping function as entry point. In Coolfluid 3, our EDSL is used to implement reusable solver components, so before showing the concrete examples in Section 4 we have to present a few concepts about our framework and how it interfaces with our EDSL.
Each class in Coolfluid 3 is derived from the Component base class, which offers infrastructure to set options at run time and holds an arbitrary number of child components, thus creating a tree. A particular subclass Action implements the Command pattern [11], and we derive a class from that for working with expressions, called ProtoAction. Expressions can then be added to an object tree by creating a ProtoAction and setting its expression using the set_expression function. The framework executes it transparently just like any other Action. Expressions in Coolfluid 3 can loop over either elements or nodes, using different grammars. The node expressions are primarily used for setting boundary and initial conditions and to update the solution.
We also need a way of accessing the mesh and the unknowns, which we do by making use of FieldVariable terminals. A basic action for looping over nodes could be added to a parent component as in Listing 12, in this case setting the temperature to 288 K. Here, the temperature is defined on line , indicating that it is stored as variable T in the field temperature_field. This seems redundant here, but fields can store more than one variable. Furthermore, the parser needs to be able to distinguish each variable at compile time, which is why each variable has a unique number in the first template argument. Each distinct variable that is used in the same expression must have a different number. The second template argument specifies whether we are dealing with a scalar or a vector. We then create the expression component and set its expression on line , choosing an expression that loops over nodes and simply assigns the value 288 to the temperature field here. Boundary and initial conditions are provided in a similar fashion, but there the component is parametrized on the field and variable names, so it can be reused in any solver.



3.6. Compatibility with Matrix Expression Templates
At the element level, matrix calculations are delegated to the Eigen library, which uses its own expression templates. As it turns out, such libraries cause problems when embedded into a Proto expression tree. A library like Eigen will, in the case of complex operations, create temporary matrices on the fly. When a temporary object is passed on through the functors that evaluate our expressions we return references to these temporary objects, resulting in memory errors. The problem is common to all modern matrix expression template libraries, and Iglberger et al. [12] review some cases where these temporaries might appear. To handle this issue, we preprocess the expressions using a special grammar: whenever a matrix product is found, the multiplication expression is modified to store a temporary matrix—allocated only once—that can hold the result of the multiplication. Any matrix multiplication result is then stored in the Proto expression itself and a reference to it is returned during matrix product evaluation. The preprocessing happens by calling an additional functor in the element looping function, so the whole process is transparent to the user of the expressions and allows writing matrix arithmetic of arbitrary complexity.
4. Application Examples
In this section we work out a few examples to illustrate the mapping between the mathematical formulation and the code for different problems.
4.1. Poisson Problem
The Poisson problem was already used as introductory example in Section 2. Here, we provide some more details on the code from Listing 1 that was skipped in the introduction.
First, there is the declaration of the unknown f and the source term g on lines and , respectively. This follows the same mechanism as in Listing 12, using a distinct number to distinguish the variables at compile time. On line , we create an Action component to hold the expression. Finally, the expression is set on line . The elements_expression function indicates that the loop will happen over the elements and is comparable to the for_each_element function in the simplified example. On line , the applicable element types are chosen (all elements from the first order Lagrange family here). The actual expression starts on line , with a call to group to combine multiple expressions into one. This is an advantage if multiple steps are required in the assembly, since they can be combined into one loop over the elements. The assembly makes use of an element matrix _A and element vector _a, which are set to zero on line .
An expression of the form _A(f) returns only the rows of the element matrix that refer to the equation for (i.e., all rows in this case). If we pass a second variable as argument, the returned matrix only contains the columns that refer to it, so _A(f, f) would return a block that only contains the contributions from the equation (again, all rows and columns in this case). This notation is convenient in the presence of multiple variables to select only the relevant entries in an element matrix. The size of the element matrix is also determined by looking for blocks like this, so here we only use _A(f) to indicate that the element matrix contains only a single equation for . The same applies to the element right hand side vector _a using square brackets.
The element integral fills these matrices on lines and , in a similar fashion to the simple example. The assembly into the global system is a separate step and occurs on lines and . The system_matrix and system_rhs terminals keep track of wrapper objects for a Trilinos matrix and vector, using a generic linear solver API that is part of Coolfluid. They are initialized in a base class that provides the functionality for working with expressions and linear systems.
The code in Listing 1 builds a component that assembles the linear system. At run time, it can be combined with other components to add initial and boundary conditions and to solve the linear system, all of which can be controlled from a Python script.
4.2. NavierStokes Equations Using Chorin’s Method
The NavierStokes equations for incompressible flow with velocity vector , pressure , and kinematic viscosity are
We will first solve this system of equations using Chorin’s method [13], since this will allow us to easily compare performance later on with the same example from the FEniCS project. Chorin proposed an iterative time stepping scheme, computing first an auxiliary velocity , followed by the pressure and finally the corrected velocity at the new time step. The weak form (written again for a single element) for the equation is
The superscript indicates the known solution at time step . The index represents the th component of the velocity here, which is a scalar in the case of the interpolated value and a vector of values for each node of the element for the unknowns . This means that, for each value of , we insert a square block with dimension equal to the number of nodes into the element matrix and a corresponding segment into the element right hand side vector. We split up the element matrix as indicated by the braces, where indicates the block corresponding to the rows and columns of component of the velocity. The code to build this linear system is presented in Listing 15.

To get a stable solution, we will interpolate the velocity and pressure using second and first order shape functions, respectively (the TaylorHood element). If we restrict the solver to triangles, we can obtain this using the typedef on line . The actual shape function is chosen at run time, but we enforce the use of second order for the velocity here through the third constructor argument for the velocity variable on line . We use separate actions for the system matrix and right hand side assembly, since in this case the matrix coefficients do not change with time, so we can run the assembly procedure only once. The definition of the element matrices and is provided on lines and . The Laplacian is written exactly the same as in the Poisson problem (Section 4.1), but a new aspect is the introduction of the index _i when indexing into the element matrix. It is automatically expanded into a loop over the number of physical dimensions, addressing the diagonal blocks of the element matrices as defined in (10). The same applies for the mass matrix on line , and both matrices are combined and assembled into the global system on line . The auxiliary_lss component provides access to the terminals related to the linear system and the time. Here, invdt returns a reference to the inverse of the time step, which is set by the user running the simulation.
In the right hand side expression on line , the index _i is used again, using u[_i] to get each component of the interpolated velocity vector . The last term represents the advection, and it requires access to a single component of the nodal values vector . We store the nodal values for a vector variable as a matrix, with each column corresponding to one physical component of the vector. This matrix is obtained using the nodal_values function while individual columns can be addressed using _col. The notation is a bit verbose, but if the user determines this to be a problem it is easy to introduce a user defined terminal for the advection operation.
The next step in Chorin’s algorithm calculates the pressure, through the following Poisson problem: Listing 16 shows the code for this system, again using two assembly actions. For the right hand side assembly on line , we used the divergence function that was defined in Listing 11. We could also write this in terms of the nodal values matrix like this: element_quadrature(_a[p] += transpose(N(p)) *nabla(u)[_i]*_col(nodal_values(u),_i))

On line , we call the lit function on invdt. This is actually a Proto function that constructs a terminal inplace, and it is needed to delay the evaluation of the minus sign, which would otherwise be evaluated right away by C++, resulting in the storage of a copy of the negative inverse timestep at expression creation, rather than a reference to the current value.
The final step of the algorithm updates the velocity, using the gradient of the newly calculated pressure: The system matrix is the mass matrix, so as seen in Listing 17 we assemble it in its own action. Note the similarity here with the standalone example equation (6). The gradient function on line is defined using the user defined function mechanism, and just as is the case with the divergence it can be written using nodal values as well: transpose(N(u)) *(u[_i]  lit(correction_lss>dt()) *(nabla(p)[_i]*nodal_values(p))0)

The implementation of Chorin’s method shows how different systems can be combined to solve a problem with multiple unknowns, each interpolated using a different shape function. The coding of the assembly procedure remains concise and follows the structure of the mathematical equations.
4.3. PSPG/SUPG Stabilized Incompressible NavierStokes
As an alternative to the use of the TaylorHood element used in the previous example, we can use equal interpolation order for the pressure and velocity, provided that we add a stabilization term to the continuity equation. We follow the method presented in [14], which adds PSPG and SUPG stabilization, as well as a bulkviscosity term. When using CrankNicolson time stepping, the method is second order accurate in both time and space. We also start from the skew symmetric momentum equation, yielding the following form to replace (9): In the absence of stabilization, the discretization of the skew symmetric form preserves the kinetic energy [15]. The weak form of the equations for a single element, after discretization in time, can be written as We applied a theta scheme here for the time discretization, solving for the difference between two time steps, . The vector of unknowns is arranged in blocks for each unknown, that is, first the pressures for all nodes, then the velocities in the direction, and so on. This results in a corresponding blocked structure for the element matrices and , where the blocks are given by Each stabilization term is multiplied with a corresponding coefficient . Splitting the equation into blocks helps in managing the added complexity due to the stabilization and skewsymmetric terms. Block , for example, represents the pressure Laplacian that arises from the PSPG stabilization. The SUPG terms in blocks , , and are all written as a modification to the weighting function, adding more weight to upstream nodes. The bulk viscosity and skew symmetric terms in the momentum equation fill the offdiagonal blocks, as indicated by the use of both indices and . Equation (14) is assembled into a single linear system using the elements expression defined in Listing 18 (showing only the part relevant to the assembly itself).

Since only one linear system is involved, the code is integrated into the framework in the same way as in Listing 1. The value of the coefficients depends on local element properties [16]. We calculate them using a userdefined terminal compute_tau, passing a reference to a double for each coefficient (line ). On line , we use indices _i and _j to create a nested loop over the dimension of the problem, filling all blocks. The right hand side from (14) is built by applying the element matrix to the current element unknowns , represented by _x. On line , we divide by to avoid the use of the scheme on the continuity equation, improving stability. Finally, on line we write the system matrix as in the left hand side of (14).
5. Performance Analysis
In this section we discuss the results of some performance tests, using the application examples from the previous section. Table 1 lists our system characteristics. The PSPG/SUPG tests were run on a cluster with 28 nodes, connected using 1 Gb Ethernet. To avoid any difficulties in installing DOLFIN on the cluster, the tests comparing our results with DOLFIN were run on a separate desktop computer.
5.1. Poisson Problem
Our first performance test concerns the Poisson problem. The element equations (4) can easily be calculated analytically when using linear shape functions, allowing a comparison between manually coded versions, a code using a virtual function interface to the shape functions and code generated using our language. Additionally, we compare with a specialized userdefined terminal containing the manually coded version and with DOLFIN [4] from the FEniCS project. Problem (4) is completed with boundary conditions and a source term identical to the Poisson demo case from FEniCS: When using linear shape functions, the solution for the discrete system captures the analytical solution up to machine precision at the nodes. As an illustration, the code for the specialized userdefined terminal is presented in Listing 19. The terminal assemble_triags can then be used to directly assemble the linear system as shown on line . The manually coded version uses the same algorithm, but here we also loop over elements directly, avoiding the use of Proto entirely.

We first run a test on the unit square, divided into 1000 parts in both the and direction. Each square cell is divided into two triangles and first order shape functions from the Lagrange family are used. Table 2 summarizes the results, with labeling as follows: “Proto” is the code usng our EDSL (see Listing 1); “Proto specialized” is the userdefined terminal from Listing 19; “Manual” is the manually coded assembly loop; “Virtual” is the code using the virtual function interface to the shape functions as it is available in Coolfluid 3; and finally “DOLFIN” is the code generated by the FEniCS project demo. All timings represent the average of 10 assembly runs. Given the large problem size, 10 runs are representative and the variation between subsequent averaged runs was less than 2%. Due to the simplicity of the Poisson problem, the insertion into the global sparse matrix structure can actually be more expensive than the evaluation of the element integrals. We run the benchmark using both the Trilinos backend (using an Epetra CRS matrix) and a “dummy” matrix—not storing any data—to properly time the assembly procedure. As seen from Table 2, the overhead of the matrix insertion is about 0.3 s in Coolfluid 3 and 0.8 s in DOLFIN, that is, at least of the order of the time it takes to compute the element matrix itself. When comparing the timings for the dummy matrix, we see that the generic Proto code—which uses second order Gauss quadrature—is more than 5 times slower than the manually coded version. The difference between the specialized and the manual versions is much smaller. Since the specialized code still uses the Proto element looping mechanism, we can conclude that its inherent overhead is small. We confirmed this by profiling the assembly with gperftools (http://gperftools.googlecode.com/), generating the call graphs shown in Figure 3. Each graph starts in the execute method of the relevant Action class. On the left, the generic Proto code is seen to be mostly inlined into the element looper, with only 15% of the time spent in calls to other functions (mostly the shape functions). The large absolute execution time (numbers next to the arrows) is due to the extra matrix operations involved in the second order quadrature. For the specialized function (middle graph), some operations related to index bookkeeping are performed in the loop function, while the rest is executed in the userdefined PoissonTriagAssembly operator. Finally, in the manual version on the right, all operations are in the specific implementation of the execute function, resulting in 10 fewer ticks for the overall execution time.
The virtual function implementation performs much worse than any other method. Like Proto, it allows writing generic code for all elements but does so by using virtual function calls and dynamically allocated matrices. This makes the method much slower than the Proto code. A separate test comparing dynamically and statically sized matrices (size ) from the Eigen library shows a slowdown by a factor of about 8 for dynamic matrices when running matrixmatrix and matrixvector multiplications, reinforcing the results from Table 2.
In DOLFIN, the right hand side and the matrix are computed using separate assembly loops. The presented timing is the total time spent on assembly, divided by the number of assembles executed. For this particular case, Proto and DOLFIN result in the same performance. This is surprising, since the more advanced integration routines in DOLFIN detect here that first order quadrature is sufficient, while our Proto code assumes the worst case and applies second order integration.
The above observation leads us to perform the test in 3D, using a unit cube with 100 segments in each direction and built up of tetrahedra. We only compare the Proto and DOLFIN versions, since this code can be applied in 3D without modification. The results are shown in Table 3. The effect of the quadrature order is obvious here, with our second order quadrature being almost three times slower than the first order quadrature performed by DOLFIN. To check if this is the only effect, we temporarily forced our code to use first order quadrature, that is, using one quadrature point instead of four. The speedup is as expected, but we do emphasize that it is only obtained after modification of the integration code: we do not have a method for determining the integration order based on the terms appearing in the equations. Instead, our integration system assumes there is a mass matrix term in the equation and proceeds to choose the integration order based on the shape function. If the performance penalty is significant, as is mostly the case with simple problems, it is possible to use a userdefined terminal to override the integration method, or even to avoid numerical integration altogether.
We also include some results for hexahedral elements, where the second order quadrature is necessary. The element matrix dimension is also doubled, resulting in longer computation times for the matrix operations. We see that this is compensated here when inserting into the sparse matrix, due to the lower number of elements (each hexahedron represents 6 tetrahedra).
5.2. Chorin’s Method
In Chorin’s method, there are a total of 6 different assembly loops to be run: one for each system matrix, and one for each right hand side. Even though the matrices only need to be assembled once for the simulation, we present timings here for comparison purposes. Table 4 summarizes the results. We see that, except for the auxiliary matrix assembly, DOLFIN is faster every time when the “dummy” matrix is used, with a very large discrepancy for the correction matrix assembly. This is due to an optimization in DOLFIN, which recognizes the coefficients of the element matrix can easily be precomputed, avoiding quadrature. Our code does not allow this optimization to happen automatically, although it is of course possible to add a userdefined terminal. In the auxiliary matrix, the same term appears, but here it is divided by , causing DOLFIN to apply quadrature.
The Protogenerated code is currently suboptimal for the assemblies of the right hand sides. This is due to some missed chances for matrix reuse: the advection operation in (10), for example, is calculated once for every component. While this effect is significant when we eliminate the influence of the linear system, it is much less apparent when looking at the results for Epetra matrices and vectors in the last column of Table 4. This leads us to conclude that our performance level is adequate for practical use and the element matrix and vector calculations will not be a dominant factor in the total solution time.
5.3. Channel Flow Simulation
In the last performance test, we take a look at a practical example, using the PSPG/SUPG stabilized NavierStokes formulation from Listing 18. The test problem is the flow between two infinite flat plates, that is, a 3D channel flow with two periodic directions. We initialize the flow using a laminar solution with centerline Reynolds number of 11250 with respect to the channel halfheight. We apply periodic boundary conditions in the stream and spanwise directions and a noslip condition at the walls. The average timings for the first 100 timesteps (initial Courant number: 0.57) are presented in Table 5. We ran the test using hexahedra and tetrahedra, where the test on tetrahedra also used a specialized code wrapped into a userdefined terminal (“Tetra specialized” in the table). The linear system was solved using the Belos Block GMRES method from Trilinos, preconditioned using ML algebraic multigrid. We tweaked the settings to obtain the fastest possible solution time. We see that the solution of the system takes about 10 times as long as its assembly using our EDSL. This shows that, even for the relatively complicated assembly expressions of (15), our language can be used to assemble the system efficiently. Any further optimization should first focus on the linear system solution before the assembly will become a bottleneck. In this context, it should be noted that the solution of the linear system does not scale as well as the assembly. This may be related to the relatively slow communication (1 Gb Ethernet) between the nodes.
The userdefined code for tetrahedra results in a further speedup factor of 2.5. In this case, the code was reused from a previous version of the solver, written only for tetrahedra. A domain specific language can also assist in developing handtuned code, however: using the language we can first easily specify the generic formulation and then check the element matrices of manually coded optimizations against the automatically generated matrices.
6. Conclusion and Future Work
We presented a domain specific language for the implementation of finite element solvers, embedded in C++. The language mirrors the mathematical notation faithfully, providing a clean separation between numerics and equations. Our work is set apart from other work in this area by the use of the Boost.Proto library and the possibility to implement user defined terminals. Proto uses concise grammars to describe and extend the functionality of the language, as explained in detail using the standalone example. The addition of user defined terminals allows using handoptimized code when possible, while staying within the automated framework for element looping.
We also analyzed the performance, demonstrating—in our opinion—acceptable abstraction overhead when compared to manual implementations and FEniCS. A large scale test with the PSPG/SUPG method for the incompressible NavierStokes equations showed that assembly took up to 10 % of the linear system solution time. This makes the assembly only the second largest consumer of CPU time by a large margin. It is our opinion that the sacrifice of some speed in the assembly is acceptable in view of the reduced turnaround time for model development.
Possible directions for future development include changes to the numerical integration framework, to better deduce the required quadrature order. On a more technical level, some parts of the code could be simplified by using new features of the C++11 standard, such as variadic templates and automatic return type deduction. Better error handling can also be looked into. One final interesting possibility is the investigation of expression optimization techniques. Using grammars, it is theoretically possible to scan the expressions for recurring matrix products and calculate each product only once, for example.
Appendix
Code Download Information
All of the code used in this work is available under open source licenses. Coolfluid 3 is licensed under the LGPL version 3 and available from https://github.com/coolfluid/coolfluid3/. Most benchmarks are in the plugins/UFEM/test/demo directory. The test comparing dynamically and statically sized Eigen matrices is at test/math/ptesteigenvsmatrixt.cpp. The code for the standalone example can be found in the repository https://github.com/barche/eigenproto/, where fem_example.cpp contains the complete running program.
We also had to adapt DOLFIN, adding a “dummy" linear algebra backend to be able to measure without any overhead from the linear system backend. This code can be found at https://bitbucket.org/barche/dolfin/.
Finally, the code for benchmarks (including FEniCS tests) used in this paper is in the repository https://github.com/barche/coolfluid3benchmarks/.
Nomenclature
:  Block of the element right hand side vector for the th component of the vector variable 
:  Block of the element matrix corresponding to rows for the th component of vector variable and the columns of the th component 
:  Unknown in the Poisson problem 
:  Value of unknown interpolated by shape functions 
:  Vector of values at the element nodes for variable 
:  Source term for the Poisson problem 
:  Element shape function vector for variable 
:  Number of nodes in an element 
:  Number of elements in the mesh 
:  Pressure 
:  Time 
:  Velocity vector 
:  The problem domain boundary 
:  Kinematic viscosity 
:  The problem domain. 
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The authors thank the Coolfluid 3 development team for the many hours of work and helpful discussions, without which this work would not have been possible. In particular they thank Tiago Quintino for laying out the basic framework; Willem Deconinck for the work on the mesh structure; and Quentin Gasper for the work on the GUI.