An Efficient Algorithm for On-the-Fly Data Race Detection Using an Epoch-Based Technique
Data races represent the most notorious class of concurrency bugs in multithreaded programs. To detect data races precisely and efficiently during the execution of multithreaded programs, the epoch-based FASTTRACK technique has been employed. However, FASTTRACK has time and space complexities that depend on the maximum parallelism of the program, because it must still partially maintain expensive data structures such as vector clocks. This paper presents an efficient algorithm, called iFT, that uses only the epochs of the access histories. Unlike FASTTRACK, our algorithm requires only $O(1)$ operations to maintain an access history and locate data races, without any switching between epochs and vector clocks. We implement this algorithm on top of the Pin binary instrumentation framework and compare it with other on-the-fly detection algorithms, including FASTTRACK, which uses a state-of-the-art happens-before analysis algorithm. Empirical results using the PARSEC benchmark show that iFT reduces the average runtime and memory overhead to 84% and 37%, respectively, of those of FASTTRACK.
Synchronization in parallel or multithreaded programs is an enforcing mechanism used to coordinate thread execution and manage shared data in various computational systems, including HPC (High Performance Computing). However, multithreaded programs may contain synchronization defects such as data races, which occur when two concurrent threads access a shared memory location without explicit synchronization, and at least one of them is a write. It is well known that data races are the hardest defect to handle in multithreaded programs, because of their nondeterministic interleaving of concurrent threads [1–4].
Dynamic techniques for detecting data races are usually classified into postmortem methods [4, 5], which analyze traced information or replay the program after execution, and on-the-fly methods, which use one of the following techniques: happens-before analysis (like FASTTRACK, SigRace, Djit+, ThreadSanitizer, etc. [9–13]), lockset analysis (like Eraser), or hybrid analysis (like VisualThread, Helgrind+ [16–18], MultiRace, ACCULOCK, RaceTrack, etc.).
The main drawback of dynamic detection techniques is the additional overhead of monitoring program execution and analyzing every conflicting memory operation. A sampling approach was introduced to reduce the overhead of dynamic data race detection. Sampling-based techniques [23–25] can be performed efficiently when testing multithreaded programs via local thread burst-sampling or a global execution time sampling strategy. Although they significantly reduce runtime overheads, these techniques are ineffective in detecting data races when the sampling rates are low.
FASTTRACK is a state-of-the-art happens-before algorithm and is an improved version of the Djit+ algorithm with vector clocks (VCs) [26, 27]. This technique exploits the idea that the full generality of VCs is often unnecessary for data race detection. The technique replaces heavyweight VCs with a lightweight identifier, called an epoch, that uses only the tuple of the clock value and the thread id. Epoch-based happens-before analysis decreases the runtime and memory overhead of almost all VC operations from $O(n)$ to $O(1)$ in the detection of data races, where $n$ designates the maximum number of simultaneously active threads during an execution. However, FASTTRACK requires a time and space overhead of $O(n)$ for the shared read accesses to shared memory locations. Therefore, the overhead problem still exists, because even a small fraction of shared read accesses makes it difficult to dynamically analyze programs with a large number of concurrent threads.
This paper presents an efficient algorithm, called iFT, that uses only epochs to detect data races. Thus, iFT represents an improvement over the FASTTRACK method. Our algorithm maintains only two epochs of earlier read accesses to shared memory locations, instead of the full VCs, using the left-of-relation. Thus, it requires only $O(1)$ runtime and memory overhead to maintain the access history and locate data races, without any switching between epochs and VCs, unlike FASTTRACK. Furthermore, the technique is guaranteed to report a subset of the data races detected by FASTTRACK.
We implement the new algorithm on top of the Pin instrumentation framework, which uses a just-in-time (JIT) compiler to recompile target program binaries for dynamic instrumentation. To compare the accuracy of iFT for on-the-fly data race detection, we also implement two other detection algorithms, Djit+ and FASTTRACK, on top of the same framework, and employ the same optimized VC primitives. We compare the efficiency of iFT with Djit+ and FASTTRACK, which use a happens-before analysis to detect data races. The experimental results on C/C++ benchmarks using Pthreads show that our algorithm reduces the runtime and memory overheads compared with the other algorithms, while soundly detecting data races similar to those detected by FASTTRACK.
In summary, the contributions of our work are as follows:
(i) iFT provides a significant improvement in efficiency, exhibiting an $O(1)$ runtime and memory overhead for each access history, whereas FASTTRACK requires $O(n)$ VC operations.
(ii) iFT matches the well-established precision of FASTTRACK, although it uses only two epochs instead of the full VCs for earlier read accesses to shared memory locations.
(iii) iFT reduces the average runtime and memory overhead to 84% and 37%, respectively, of those of FASTTRACK.
The remainder of this paper is organized as follows. Section 2 discusses important concepts of happens-before analysis with VCs, and Section 3 introduces the FASTTRACK algorithm and its limitations. We present our improved algorithm in Section 4 and evaluate it empirically in Section 5 by comparing with existing techniques for data race detection. We introduce some related work in Section 6 and conclude our argument in Section 7.
On-the-fly methods of detecting data races typically use VCs to precisely analyze the happens-before relation. This section presents important rules for allocating VCs to the concurrent thread segments introduced in this paper and describes how VCs represent the happens-before relation during the execution of multithreaded programs.
2.1. Execution of Multithreaded Programs
In this work, we consider multithreaded programs using the POSIX thread standard (Pthread) as a model of concurrent threads. Pthread is widely used not only on C/C++ applications, but also on many Unix-like operating systems (Linux, Solaris, Mac OS, FreeBSD, etc.), because it provides various APIs and libraries for creating, manipulating, and synchronizing threads.
In a multithreaded program, a block of a thread that is executed serially is represented as a thread segment. Thus, a thread can be represented as a set of thread segments. A thread segment is delimited by thread operations that can take one of the following forms:
(i) init($t$) models the creation of a thread segment and the start of the execution of thread $t$.
(ii) fork($t$, $u$) models the creation of a thread segment of thread $u$ from the current thread segment and the start of a new thread segment on the same thread $t$.
(iii) join($t$, $u$) models the termination of a thread segment of thread $u$ and the creation of a new thread segment on the same thread $t$ from the current thread segment.
A thread segment contains a finite sequence that consists of at least one event. An event takes one of the following forms:
(i) Access events rd($x$) and wr($x$). The former models the reading of a shared memory location $x$, and the latter models the updating of $x$.
(ii) Mutual exclusion events acq($m$) and rel($m$). The former models the acquisition of a lock $m$ to enter a critical section. The latter models the release of the lock $m$ to leave a critical section and the start of a new thread segment on the same thread.
(iii) Condition variable synchronization events wait and signal. The former models the wait for a condition variable until another thread wakes the waiting thread and the subsequent start of a new thread segment on the same thread. The latter models the wake-up of a thread waiting on the condition variable and the start of a new thread segment on the same thread.
(iv) Barrier event. This models the waiting of multiple threads until the number of waiting threads reaches the barrier count and the start of a new thread segment on each of the waiting threads.
In this work, we refer to the above thread operations and the events other than access events as synchronization primitives.
2.2. VC-Based Happens-Before Analysis
Happens-before analysis uses a representation of Lamport’s happens-before relation to determine the logical concurrency between two thread segments. According to this relation, if a thread segment $s_i$ must happen at an earlier time than another thread segment $s_j$, we say that $s_i$ happens before $s_j$, denoted by $s_i \rightarrow s_j$. If neither $s_i \rightarrow s_j$ nor $s_j \rightarrow s_i$ is satisfied, we say that $s_i$ is concurrent with $s_j$, denoted by $s_i \parallel s_j$.
VCs are widely used to analyze the happens-before relation, because they capture the execution order of thread segments and the synchronization order of thread operations and events. A vector clock $V: \mathit{Tid} \rightarrow \mathit{Nat}$ records a clock value for each thread while the program is executing. Thus, each thread segment of a thread $t$ maintains a VC $C_t$, which has $n$ entries if the maximum number of active threads in the execution of a multithreaded program is $n$. The VCs of the thread segments are partially ordered ($\sqsubseteq$) pointwise, with a minimum element ($\bot_V$) and an associated join operation ($\sqcup$), used by the synchronization primitives, that takes pointwise maximums. For instance, the entry $C_t(u)$ for any thread $u$ stores the latest clock value of $u$ that happened before the current synchronization primitive of $t$.
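During program execution, the VCs of the thread segments are maintained according to the following rules:
(i) init($t$): $C_t \leftarrow \bot_V$; $C_t(t) \leftarrow 1$;
(ii) fork($t$, $u$): $C_u \leftarrow C_u \sqcup C_t$; $C_u(u) \leftarrow C_u(u) + 1$; $C_t(t) \leftarrow C_t(t) + 1$;
(iii) join($t$, $u$): $C_t \leftarrow C_t \sqcup C_u$; $C_u(u) \leftarrow C_u(u) + 1$;
(iv) acq($t$, $m$): $C_t \leftarrow C_t \sqcup L_m$, where $L_m$ is a vector clock for each lock $m$;
(v) rel($t$, $m$): $L_m \leftarrow C_t$; $C_t(t) \leftarrow C_t(t) + 1$;
The other synchronization events, wait, signal, and barrier, can be modeled in terms of the operations above.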
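The VC maintenance rules can be illustrated with a minimal Python sketch. It assumes the FastTrack-style update rules (pointwise-maximum join on synchronization, followed by a clock increment that starts a new thread segment); the class name `VCState` and the helpers `vc_join` and `vc_leq` are hypothetical names chosen for this illustration and are not taken from the paper's implementation.

```python
def vc_join(a, b):
    """Pointwise maximum of two vector clocks of equal length."""
    return [max(x, y) for x, y in zip(a, b)]


def vc_leq(a, b):
    """Pointwise partial order: True if every entry of a is <= that of b."""
    return all(x <= y for x, y in zip(a, b))


class VCState:
    """Thread clocks C[t] and lock clocks L[m] for at most n threads.

    Assumption: thread ids are the integers 0..n-1.
    """

    def __init__(self, n):
        self.n = n
        self.C = {}  # thread id -> vector clock (list of ints)
        self.L = {}  # lock id   -> vector clock (list of ints)

    def init(self, t):
        # Start from the minimum VC and open the first thread segment of t.
        self.C[t] = [0] * self.n
        self.C[t][t] = 1

    def fork(self, t, u):
        # The child u inherits what t has seen; both start a new segment.
        self.C[u] = vc_join(self.C.get(u, [0] * self.n), self.C[t])
        self.C[u][u] += 1
        self.C[t][t] += 1

    def join(self, t, u):
        # t learns everything u has done; u's clock advances.
        self.C[t] = vc_join(self.C[t], self.C[u])
        self.C[u][u] += 1

    def acq(self, t, m):
        # Acquiring m orders t after the last release of m.
        self.C[t] = vc_join(self.C[t], self.L.get(m, [0] * self.n))

    def rel(self, t, m):
        # Publish t's clock on the lock, then start a new segment on t.
        self.L[m] = list(self.C[t])
        self.C[t][t] += 1
```

For example, after `init(0)` and `fork(0, 1)` the two clocks are incomparable under `vc_leq`, so the two thread segments are concurrent; a `rel(0, "m")` followed by `acq(1, "m")` makes thread 1's clock dominate the clock that thread 0 held at the release.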
Figure 1 represents a multithreaded execution with synchronization primitives as a directed acyclic graph, called a Partial Order Execution Graph (POEG) [10, 11]. In the POEG, a vertex is either a thread operation or a synchronization event, and an arc represents a logical thread segment started by the synchronization primitives. The dashed lines indicate the synchronization order in the execution of the program. The events $r$ and $w$, represented by small disks on the arcs, denote read and write events at a shared memory location, respectively. The numbers attached to each thread segment and event name indicate an observed order, and the VCs are allocated for each thread segment by the above rules.
Using the VCs of each thread segment, we can directly analyze the happens-before relation between any two thread segments. If the clock value of a thread segment $s_i$ of thread $t$ is less than or equal to the corresponding clock value of another thread segment $s_j$, we can conclude that $s_i$ happens before $s_j$. If the relation holds in neither direction, $s_i$ is concurrent with $s_j$. Formally, $s_i \rightarrow s_j \iff C_{s_i}(t) \le C_{s_j}(t)$, and $s_i \parallel s_j$ if neither $s_i \rightarrow s_j$ nor $s_j \rightarrow s_i$ holds. Here, $s_i \rightarrow s_j$ means that thread segment $s_j$ was synchronized from the earlier thread segment $s_i$ by one of the synchronization primitives. Then, $s_i$ is partially ordered with $s_j$, and the pair is never involved in any race. Finally, the happens-before analysis locates a data race during the execution of a multithreaded program whenever any two events on two concurrent thread segments access a shared memory location, and at least one of the events is a write.
Definition 1. Given two access events $e_i$ and $e_j$ to a shared memory location $x$ from two distinct thread segments $s_i$ and $s_j$, respectively, if the two events are not synchronized (i.e., neither $s_i \rightarrow s_j$ nor $s_j \rightarrow s_i$) and at least one of the events is a write, there exists a data race between $e_i$ and $e_j$.
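Definition 1 can be read as a small predicate. The following Python sketch is an illustration under stated assumptions: each access carries a snapshot of the full VC of its thread segment, and the names `Access`, `happens_before`, and `is_data_race` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Access:
    location: str          # shared memory location being accessed
    is_write: bool         # True for a write event, False for a read event
    vc: Tuple[int, ...]    # VC snapshot of the issuing thread segment


def happens_before(a, b):
    """True if VC a is pointwise less than or equal to VC b."""
    return all(x <= y for x, y in zip(a, b))


def is_data_race(e1, e2):
    """Two accesses race if they touch the same location, at least one
    of them is a write, and neither is ordered before the other."""
    if e1.location != e2.location:
        return False
    if not (e1.is_write or e2.is_write):
        return False
    return (not happens_before(e1.vc, e2.vc)
            and not happens_before(e2.vc, e1.vc))
```

A read with VC (2, 0) and a write with VC (1, 1) on the same location are unordered and therefore race, whereas the same read and a write with VC (2, 1) are ordered and do not.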
For example, in Figure 1, consider two events and on two different thread segments and , respectively. The two events constitute a data race, because neither nor is satisfied, as and , and therefore .
3. FastTrack Algorithm
VC-based happens-before techniques, such as Djit+, require $O(n)$ space to maintain the VCs for each thread segment and access history and also require $O(n)$ time for VC operations (e.g., join, copy, and comparison).
FASTTRACK, which improves on Djit+, exploits the insight that the full generality of VCs is often unnecessary for data race detection. The key ideas behind this insight are as follows: (1) all writes to a shared memory location $x$ are totally ordered by a happens-before analysis, which assumes no data races have been detected on $x$ so far, and (2) writing to $x$ could potentially conflict with the last read of $x$ performed by any other thread, although reads are not totally ordered, even in race-free programs. By exploiting these results, FASTTRACK replaces heavyweight VCs with a lightweight identifier for a thread segment, called an epoch, using only the tuple of clock value $c$ and thread id $t$, denoted by $c@t$. Thus, FASTTRACK reduces the runtime and space overhead of almost all VC operations from $O(n)$ to $O(1)$ in the detection of data races.
For a shared memory location $x$, the FASTTRACK algorithm defines an access history using two entries:
(i) $R_x$: it records a VC for all concurrent read events or an epoch for the last read event of $x$.
(ii) $W_x$: it records only an epoch for the last write event to $x$.
FASTTRACK reports data races by analyzing $R_x$ and $W_x$ and maintains epochs or VCs by updating the access histories. For the algorithm, some notation is used to analyze the access history using the epoch. The function $E(t)$ is shorthand for $C_t(t)@t$, and $c@t \preceq V$ denotes that the epoch $c@t$ happens before a vector clock $V$, where $c@t \preceq V$ if and only if $c \le V(t)$.
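The constant-time comparison between an epoch and a vector clock is the core of the epoch representation. A minimal Python sketch, assuming thread ids index directly into the VC; the names `Epoch`, `current_epoch`, and `epoch_leq` are hypothetical:

```python
from typing import NamedTuple, Sequence


class Epoch(NamedTuple):
    clock: int  # clock value c
    tid: int    # thread id t


def current_epoch(C: Sequence[int], t: int) -> Epoch:
    """Epoch of thread t given its vector clock C: the pair C[t]@t."""
    return Epoch(C[t], t)


def epoch_leq(e: Epoch, V: Sequence[int]) -> bool:
    """Epoch c@t happens before VC V iff c <= V[t]; a single comparison."""
    return e.clock <= V[e.tid]
```

Because `epoch_leq` inspects one entry of the VC, its cost does not depend on the number of threads, unlike a full pointwise VC comparison.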
When a new event $e$ occurs on a thread segment of thread $t$, the algorithm for reporting data races and maintaining each entry is as follows.
Upon a Read Event of $x$ by Thread $t$
(1) If the epoch of the current event is the same as that of $R_x$, $R_x = E(t)$, the algorithm takes no action.
(2) If $R_x \neq E(t)$, then the algorithm checks $W_x \preceq C_t$ to report a data race between an earlier write event and $e$.
(3) If $R_x \preceq C_t$ is satisfied, only $E(t)$ is kept in $R_x$. Otherwise, $R_x$ is updated to a representation that maintains a full VC.
Upon a Write Event to $x$ by Thread $t$
(1) If the epoch of the current event is the same as that of $W_x$, $W_x = E(t)$, then the algorithm takes no action.
(2) If $W_x \neq E(t)$, then the algorithm checks $W_x \preceq C_t$ to report a data race between an earlier write event and $e$.
(3) If there exists only one epoch in $R_x$, then the algorithm checks $R_x \preceq C_t$ to report a data race between an earlier read event and $e$. Otherwise, the algorithm checks $R_x \sqsubseteq C_t$ for a full VC maintained in $R_x$.
(4) The previous epoch or VC is removed from $R_x$, and $E(t)$ is inserted into $W_x$.
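The read and write rules of FASTTRACK can be sketched in Python as follows. This is a simplified illustration of the steps listed above, not the reference implementation: epochs are plain `(clock, tid)` tuples, a list stands for the full VC of the Read Shared state, and `VarState`, `on_read`, `on_write`, and `BOTTOM` are hypothetical names. The $O(n)$ paths are marked in comments.

```python
BOTTOM = (0, 0)  # assumed minimal epoch: precedes every vector clock


def leq(e, C):
    """Epoch (c, t) happens before VC C iff c <= C[t]."""
    return e[0] <= C[e[1]]


class VarState:
    """Access history of one location: W is an epoch, R is an epoch
    (exclusive reads) or a list used as a full VC (Read Shared state)."""

    def __init__(self):
        self.W = BOTTOM
        self.R = BOTTOM


def on_read(x, t, C, races):
    e = (C[t], t)
    if x.R == e:                          # same epoch: nothing to do
        return
    if not leq(x.W, C):                   # earlier write not ordered before us
        races.append(("write-read", x.W, e))
    if isinstance(x.R, list):
        x.R[t] = C[t]                     # already shared: O(1) entry update
    elif leq(x.R, C):
        x.R = e                           # reads still totally ordered
    else:
        vc = [0] * len(C)                 # O(n): switch to a full VC
        vc[x.R[1]] = x.R[0]
        vc[t] = C[t]
        x.R = vc


def on_write(x, t, C, races):
    e = (C[t], t)
    if x.W == e:                          # same epoch: nothing to do
        return
    if not leq(x.W, C):
        races.append(("write-write", x.W, e))
    if isinstance(x.R, list):
        # O(n): compare the whole read VC with the current VC
        if any(r > c for r, c in zip(x.R, C)):
            races.append(("read-write", tuple(x.R), e))
    elif not leq(x.R, C):
        races.append(("read-write", x.R, e))
    x.R = BOTTOM                          # drop the previous epoch or VC
    x.W = e
```

With two concurrent threads whose clocks are `[2, 0]` and `[1, 1]`, a read by each thread followed by a write from thread 1 reports one read-write race, and a further read by thread 0 reports a write-read race.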
Table 1 explains how the FASTTRACK algorithm reports data races and manages the access history during the execution of the program shown in Figure 1. Initially, $W_x$ starts from $\bot_e$, indicating that the shared memory location $x$ has not yet been written. When the first read event occurs, its epoch is recorded in $R_x$ instead of a full VC, where the second component of the epoch is the thread id of the reading thread. When the second read event, on a concurrent thread segment, accesses $x$, it shares $x$ with the first read event, because the recorded epoch does not happen before the current VC; we then say that $x$ is in a Read Shared state. In this state, as each read may constitute one or more data races with a later write event, the clocks of all shared reads of $x$ are kept in $R_x$. Thus, $R_x$ switches to a VC representation to record the clocks of the last reads by the two thread segments in Table 1. With this adaptive switching between epochs and VCs in $R_x$, FASTTRACK greatly reduces the overhead of the VC operations.
When a subsequent read event occurs on a thread that already has an entry in $R_x$, its clock is directly updated in the corresponding entry of $R_x$, although $R_x$ maintains a VC for the Read Shared state. Thus, the updating takes $O(1)$ time. A data race is reported because Definition 1 is satisfied (i.e., the read and the write are ordered in neither direction) when a write event to $x$ occurs. The VC of prior read events in $R_x$ is removed by resetting $R_x$ to $\bot_e$, and the epoch of the write event is stored in $W_x$. When a later read of $x$ occurs, only the epoch of that read is kept in $R_x$, because the read event is not shared with any others, and a data race is reported. Finally, three concurrent events give rise to two data races, because they are not ordered by the happens-before relation.
A common problem with using VCs for happens-before analysis is the space and time overhead, which depends on the number of threads in the multithreaded programs, whereas the FASTTRACK algorithm provides a significant performance improvement over the lockset analysis by utilizing the lightweight epoch clock. Moreover, it suggests the design of a hybrid technique with both precision and efficiency, such as ACCULOCK. However, there is further room for improvement, because the algorithm requires $O(n)$ VC operations to guarantee no loss of precision when shared data enters the Read Shared state, as happens for the shared reads in Figure 1. Therefore, the overhead problem still exists, because the shared read accesses make it difficult to dynamically analyze programs with a large number of concurrent threads.
4. Efficient Data Race Detection
FASTTRACK precisely reports data races with significantly improved performance, because epochs require only constant space and constant time for almost all VC operations. However, the algorithm still needs VC operations whenever a shared memory location has shared read events on concurrent thread segments. As this situation makes it impractical to dynamically analyze programs with a large number of concurrent threads, the overhead problem remains, with the space overhead being more critical than the time overhead. Thus, we improve the FASTTRACK algorithm to reduce this overhead.
Our improved FASTTRACK (iFT) algorithm reports data races in a constant amount of time and space, even in the worst case, because it maintains only two epochs instead of full VCs for $R_x$ using the left-of-relation. The notion of the left-of-relation was originally suggested by Mellor-Crummey. Mellor-Crummey’s technique maintains two concurrent read events in an access history to detect data races with a write event. Techniques based on the left-of-relation can verify that a program is free of data races, although they maintain only two read events in each access history, because they locate at least one data race (if any exist). However, Mellor-Crummey’s technique does not support synchronization primitives other than fork/join operations, such as thread locking and wait-signals. Moreover, the left-of-relation does not directly apply to VC-based detectors, because VCs cannot analyze the logical position of thread segments, unlike Mellor-Crummey’s OS labeling.
We define the left-of-relation as a partial ordering of two concurrent thread segments $s_i$ and $s_j$, for two distinct events $e_i$ on $s_i$ and $e_j$ on $s_j$ in an execution graph such as the POEG of Figure 1, where the events are not related by the happens-before relation. To apply the left-of-relation to the iFT algorithm, we use a breadth value $b$ instead of the thread id $t$ of the original FASTTRACK algorithm. The breadth value is produced by performing a left-to-right preorder numbering, or an English Order numbering of the EH labeling scheme, and is used to identify the position of a current thread relative to its sibling threads. If a thread segment $s_i$ precedes another thread segment $s_j$ in this numbering and $s_i \parallel s_j$ in an execution of a multithreaded program, $b$ for $s_i$ is less than $b$ for $s_j$.
Thus, an epoch of a thread segment is redefined as the tuple of clock value $c$ and breadth value $b$, denoted by $c@b$. Now, the left-of-relation between any two thread segments is simply analyzed by comparing their breadth values from each epoch.
Definition 2. Given two read events and to a shared memory location on two concurrent thread segments and , respectively, if for is less than for , one says that is left of , denoted by . Formally,where represents the event type (read or write) of . By applying the left-of-relation, we employ the leftmost event, denoted by , and rightmost event, denoted by , concepts to maintain only two concurrent events in . We use and to denote the leftmost event and rightmost event, respectively, in . If the current event satisfies , is the leftmost event. This event is recorded in instead of , where the prior event always satisfies the left-of-relation with in ; therefore, . Similarly, the current event is the rightmost event and is recorded in instead of , if it satisfies .
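The leftmost and rightmost bookkeeping of Definition 2 can be sketched in Python. The sketch assumes that epochs carry a breadth value in place of a thread id and that the incoming read is concurrent with both stored reads; `Epoch`, `left_of`, and `update_extremes` are hypothetical names, not the paper's IsMostL and IsMostR routines.

```python
from typing import NamedTuple, Tuple


class Epoch(NamedTuple):
    clock: int    # clock value c
    breadth: int  # breadth value b of the issuing thread


def left_of(a: Epoch, b: Epoch) -> bool:
    """a is left of b when its breadth value is smaller.
    Assumption: a and b belong to concurrent thread segments."""
    return a.breadth < b.breadth


def update_extremes(left: Epoch, right: Epoch, e: Epoch) -> Tuple[Epoch, Epoch]:
    """Return the new (leftmost, rightmost) pair after a read e that is
    concurrent with both stored reads; a read that falls between the
    two extremes leaves the pair unchanged."""
    if left_of(e, left):
        left = e
    elif left_of(right, e):
        right = e
    return left, right
```

Only two comparisons of integers are needed per read, which is why the Read Shared state no longer requires a per-thread VC in this scheme.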
We now provide a detailed description of how iFT locates three kinds of data races for concurrent events: read-write races, write-write races, and write-read races.
Read-Write Races. Detection is possible because a write event to a shared memory location $x$ can conflict with prior read events of $x$ performed by any other thread. To detect read-write races, we consider two read states: the Exclusive state, where a read event of $x$ is performed exclusively on a thread segment, and the Read Shared state, where $x$ has read events that are shared by two or more concurrent thread segments. In the Exclusive state, because read events of $x$ occur on the same thread, they are totally ordered, and the epoch of the last read event is recorded in $R_x$. Read events of $x$ that are shared by multiple threads are unordered, and each read event may constitute a data race with a later write event. Thus, if $x$ is in the Read Shared state, two epochs of the two concurrent read events are recorded in $R_x$ by the left-of-relation.
Using $R_x$, which maintains only two epochs instead of a full VC, iFT detects data races as well as FASTTRACK, because it locates one or two of the read-write data races.
Lemma 3. If data races exist between earlier reads and a current write event, iFT locates one or two of those located by FASTTRACK.
Proof. Two distinct shared read events toward are kept in and by the left-of-relation. Since , we guarantee the following: (1)If and , then FT reports a data race between and , because , and neither nor is satisfied.(2)If and , then FT reports a data race between and , because , and neither nor is satisfied.(3)If and , then FT reports two data races between and both shared read events.(4)If and , then FT fails to report any data races.
Figure 2 shows three examples of read-write data races during the execution of a multithreaded program with nondeterministic interleaving of concurrent threads. In Figure 2(a), three shared read events, , , and , happen before the two write events, and . The leftmost event and the rightmost event are kept in and , respectively, by the left-of-relation. Thus, FT can report a data race between and , because and . When occurs on thread segment that is concurrent with the others, FT reports two data races , .
In Figure 2(b), and are synchronized by a lock variable , and is also synchronized with by a lock variable . For the execution of Figure 2(b), FT records two read events and in and , respectively. It reports only the data race between and , because is satisfied. Therefore, by the synchronization between and . FT records from in instead of if the acquiring lock is reserved, because by the thread interleaving . Finally, FT reports two read-write data races , for the execution.
In Figure 2(c), there are two kinds of synchronization events, locking and a signal-wait. Because is satisfied by lock variable , FT records in as the leftmost event, and is recorded in . Thus, FT locates no data races, because by the acquiring lock , and by the signal-wait event. If a pair of wait and signal events does not occur between and , FT obviously locates the data race , as it analyzes that the rightmost event is concurrent with .
Lemma 4. If data races exist between the read events recorded in $R_x$ and a current write event, the races located by iFT are a subset of those located by FASTTRACK.
Proof. Suppose that the same fixed program execution order is provided to both analyses. Let () be the set of races located by FT (FASTTRACK), and let () be the read events recorded in by FT (FASTTRACK). Because in the execution order, we guarantee the following:(1)If , then is satisfied because it is impossible to satisfy .(2)If , then is satisfied because cannot be satisfied by Lemma 3.Therefore, is satisfied.
For example, in Figure 2(a), the three data races , , located by FT are a subset of the five data races , , , , located by FASTTRACK.
Write-Write Races. These involve two concurrent write events to $x$. All write events to $x$ are totally ordered, with the assumption that no data races have been detected on $x$. Thus, iFT records the epoch of the write event in $W_x$ and locates a write-write race between $W_x$ and a later write event to $x$ by analyzing the epoch of $W_x$ and the current VC of the write event, $W_x \preceq C_t$.
Write-Read Races. These involve a write event to $x$ that is concurrent with a later read event of $x$. iFT locates such a data race by analyzing $W_x \preceq C_t$.
Lemma 5. If FASTTRACK locates a write-write race or a write-read race during the execution of a program, iFT can locate the data race from the same fixed execution.
Proof. Let () be a write event recorded in by FT (FASTTRACK). Then, holds, because in the execution order, and both analyses employ only to analyze .
Algorithm 1 presents the pseudocode for FT, which consists of three algorithms: ReadCheck, WriteCheck, and Maintain. ReadCheck and WriteCheck mainly focus on filtering events, reporting data races, and maintaining an access history for a shared memory location whenever an event on thread segment accesses . To report data races, we use the inversion of , denoted by , to catch instances where the current event is concurrent with a prior event. In ReadCheck and WriteCheck, denotes that neither nor is satisfied. IsOrdered is used by Maintain to check the happens-before relation between the current event and prior events in . Maintain manages access histories for every and employs IsMostL and IsMostR to maintain only two concurrent events in by applying the left-of-relation.
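As a rough illustration of the three routines described above, the following Python sketch shows one way ReadCheck, WriteCheck, and Maintain could fit together. It is not the paper's Algorithm 1: the function names `read_check`, `write_check`, `maintain_reads`, and `ordered`, the `History` class, and the choice to index the VC by breadth value are assumptions made for this example.

```python
from typing import NamedTuple, Optional


class Epoch(NamedTuple):
    clock: int
    breadth: int


def ordered(e: Optional[Epoch], C) -> bool:
    """True if e is absent or happens before the current VC C
    (assumption: C is indexed by breadth value)."""
    return e is None or e.clock <= C[e.breadth]


class History:
    """Access history of one location: at most two read epochs
    (leftmost, rightmost) and one write epoch."""

    def __init__(self):
        self.left = None
        self.right = None
        self.write = None


def maintain_reads(h, e, C):
    # Keep only stored reads concurrent with e, then retain the
    # leftmost and rightmost epochs by breadth value.
    keep = [p for p in (h.left, h.right) if not ordered(p, C)]
    keep.append(e)
    h.left = min(keep, key=lambda p: p.breadth)
    rightmost = max(keep, key=lambda p: p.breadth)
    h.right = rightmost if rightmost != h.left else None


def read_check(h, b, C, races):
    e = Epoch(C[b], b)
    if e in (h.left, h.right):            # same epoch: filtered out
        return
    if not ordered(h.write, C):
        races.append(("write-read", h.write, e))
    maintain_reads(h, e, C)


def write_check(h, b, C, races):
    e = Epoch(C[b], b)
    if h.write == e:                      # same epoch: filtered out
        return
    if not ordered(h.write, C):
        races.append(("write-write", h.write, e))
    for r in (h.left, h.right):           # at most two epoch comparisons
        if not ordered(r, C):
            races.append(("read-write", r, e))
    h.left = h.right = None
    h.write = e
```

Every path performs a bounded number of integer comparisons and stores at most three epochs per location, which is the constant-overhead property the text attributes to iFT; the price is that reads falling between the two extremes are not individually reported.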
Table 2 shows the changing state of an access history for detecting the data races appearing in Figure 1 using the FT algorithm, where we assume that the breadth values are allocated as , , and . In the figure, the epoch of read event on , 2@0, is recorded in , as the read event of is performed exclusively. When the rightmost read occurs on , enters the Read Shared state. The epoch of (1@1) is recorded with the epoch of , instead of the full VC of FASTTRACK in Table 1, because is less than , and therefore . Because is the last read event on thread when the event occurs, the epoch of the prior leftmost event is updated to the epoch of , 3@0. When occurs on , the data race is reported, as for the FASTTRACK algorithm. However, FT only compares two epochs in without any VC operations. FT also reports the data race and two data races , , as does the FASTTRACK algorithm, when and occur. Consequently, the results in Table 2 show that FT detects apparent data races as well as FASTTRACK, although the new algorithm maintains only two epochs for concurrent read events in .
Theorem 6. iFT efficiently and soundly locates data races if it maintains only two epochs in $R_x$.
Proof. The iFT algorithm has $O(1)$ time and space overheads for detecting data races, because it removes the switching between epochs and VCs for $R_x$ of the FASTTRACK algorithm by maintaining only two concurrent epochs for the Read Shared state of $x$. From Lemmas 3, 4, and 5, the algorithm soundly locates data races because it reports a subset including at least one of the data races located by the FASTTRACK algorithm.
We empirically evaluated the efficiency and precision of iFT in comparison with other dynamic detection algorithms that use the happens-before analysis. The experimental results show that our technique not only soundly reports data races, but also reduces the time and space overhead of data race detection for programs with a large number of concurrent threads.
5.1. Implementation and Experimentation
We implemented the iFT algorithm and two other dynamic detection algorithms on top of the Pin instrumentation framework, which uses a JIT compiler to recompile target program binaries for dynamic instrumentation. Building a lightweight tool for monitoring memory access is easier with Pin than with other dynamic binary instrumentation frameworks, such as Valgrind. The two algorithms used for comparison are Djit+ (a high performance VC-based happens-before analysis algorithm) and FASTTRACK (a state-of-the-art happens-before analysis algorithm).
Figure 3 depicts the architecture of the detectors. Each detector consists of an Instrumentor and a Race Detector to report data races during program execution. The Instrumentor consists of two modules: ThreadMonitor and EventMonitor. These, respectively, track thread operations and event instances for every shared memory location considering synchronization primitives. The Race Detector performs the thread identification routines to generate and manage VCs for each active thread segment, as well as the detection routines to report data races.
The thread identification routines employ the VC primitives discussed in Section 2. These are commonly used to analyze the happens-before relation in the detection routines of all algorithms. A lock-free algorithm was used in the detection routines to remove the centralized bottleneck of access histories. Whenever the Instrumentor catches one of the thread operations or events, it calls either the thread identifier routines or the detection routines to add instrumentation at each interesting point of the running target binaries. Because the Instrumentor and Race Detector use only the shadow binaries of the target programs, which are generated by the JIT compiler of the Pin framework, no source code annotation is required to monitor memory access events or synchronization primitives.
To support the correct identification of concurrent thread segments, we used a special structural table for each thread. The table consists of four items of information: the system thread id, Pthread id, Pin thread id, and clock value. The system thread id is the thread id allocated by the operating system, and the Pthread id is allocated by Pthread functions such as pthread_create(). The Pin thread id is the logical identifier created in sequence whenever the Pin framework catches a thread start operation. Thus, we employed the Pin thread id as the breadth value of an epoch ($b$) in the iFT algorithm. The clock value is used to form a VC of a thread segment using synchronization primitive operations.
Our experimentation focused on comparing the soundness and the efficiency of on-the-fly data race detection in programs with a large number of concurrent threads. To evaluate the iFT algorithm, we compared the data races reported by each detector and measured the execution time and the memory consumed by the execution instances of a set of C/C++ benchmarks using Pthread. For this purpose, we used 12 applications from the PARSEC 2.1 benchmark suite. These target different areas, including HPC, with applications such as data mining, financial analysis, and computer vision. All applications were executed with the default simulation inputs of the PARSEC benchmark suite to produce proper runtime overheads and memory consumption.
Before conducting the experiments, we investigated the benchmark applications in terms of the frequency of access events and synchronization primitives. The results of this analysis with the FASTTRACK algorithm are given in Table 3. We used sim-medium simulation inputs in the execution of each application. In the table, “Same Epoch” means that read/write events to a shared memory location $x$ have been filtered out by FASTTRACK as they occurred after the first read/write event on the same thread segment. “Exclusive” indicates that only epochs were used to locate data races, because read/write events exclusively accessed $x$. “Shared” indicates the Read Shared state in which $x$ has shared read events being performed by concurrent thread segments. “VC Scan” indicates that a current write event was compared with the full VC in $R_x$ when $x$ had entered the Read Shared state. Thus, two memory operations, Shared and VC Scan, require VC operations that incur $O(n)$ time and space overheads in FASTTRACK.
From this investigation, we can see that 78.3% of all operations and events were read events and 21.6% were write events. Other operations and events accounted for less than 0.1% of the total. These results reaffirm that almost all of the work in data race detection involves tracing access events to shared memory locations, because these account for more than 99% of operations in the benchmarks. Fortunately, most of these memory operations are filtered out before they can affect the tracing of events for data race detection. For example, in the table, 90.7% of read events and 80.2% of write events occurred in the same epoch. VC operations are rarely needed, accounting for an average of only 1.3% of all read/write events. Thus, the switching approach in FASTTRACK is quite effective in improving the performance of happens-before analysis.
The implementation and experimentation were carried out on a system with two 2.4 GHz Intel Xeon quad-core processors and 32 GB of memory under Linux kernel 2.6. We installed the most recent version of the Pin framework (Version 2.12), and the applications were compiled with gcc 4.4.4 for all detectors. We used a programmed logging method to measure the execution time and memory consumption of each application. This method uses system files in the proc directory, which provide real-time information on the system, including meminfo, iomem, and cpuinfo. The average runtime and memory overheads of all applications were measured over ten executions under each detector. Figure 4 shows the resulting analyzed information, such as thread creation, detected data races, execution time, and memory consumption, during an execution of the x264 application using our implemented iFT detector.
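As an illustration of this kind of proc-based logging, the following sketch parses text in the format of /proc/meminfo. It is only an assumption about how such a logger could look: the paper does not state which fields were sampled or how often, and the helper names `parse_meminfo`, `used_kb`, and `read_meminfo` are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: integer value} mapping.

    Values are in kB for the lines on which the kernel prints a unit.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue                      # skip lines without "name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_kb(fields):
    """System memory in use, taken here as MemTotal minus MemFree."""
    return fields["MemTotal"] - fields["MemFree"]


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Sampling `used_kb(read_meminfo())` before and during a run, and dividing by the value measured for the uninstrumented run, would give a memory overhead factor of the kind reported in this section.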
5.2. Results and Analysis
We collected the reported data races to evaluate the precision of iFT. All three detectors were implemented on the same Pin framework for a fair comparison. All applications of the PARSEC benchmark were run with sim-medium simulation inputs, and two real applications were run with a server program and several client programs. The two real applications used for the experimentation are MySQL (an open source DBMS) and Cherokee (an open source web server application). These applications were tested repeatedly until all warnings reported by each detector had been fixed. The number of data races located by the three detectors is given in Table 4.
All of the detectors reported that there were no data races in six of the applications in the PARSEC benchmarks: blackscholes, dedup, facesim, raytrace, swaptions, and vips. This agrees with prior research, which considered an implementation of FASTTRACK on top of the DynamoRIO instrumentation framework. Djit+ and FASTTRACK reported exactly the same data races for all applications, as found in [6, 20], because these two detectors provide identical precision. Similarly, iFT reported the same data races as FASTTRACK, with the exception of the bodytrack and x264 applications.
All the detectors located a data race in canneal and fluidanimate, which use user-defined synchronization functions, such as atomic and barrier_wait. They reported two data races in ferret; these were caused by a shared counter variable and a shared Boolean flag for a queue in the application. The three detectors reported four data races for streamcluster. These were caused by the same user-defined synchronization, barrier_wait, and by object pointers to a shared structure used without explicit synchronization. All of the detectors reported eight data races in MySQL, due to object pointers to a shared structure used without proper synchronization and to shared flags for thread termination. The three detectors located seven data races in Cherokee. One data race in Cherokee was the result of log corruption similar to a well-known bug in Apache’s logging code (Apache bug #25520).
For bodytrack, all detectors found six data races, which were caused by the initialization of objects in shared structures without synchronization and the misuse of condition variables. Djit+ and FASTTRACK also reported two data races involving two kinds of unprotected counter variables for a user-defined wait-notify operation, whereas iFT reported only one of these data races. iFT located two data races for x264, caused by two pointers in different functions that referred to a shared structure and its members. The pointers allowed the shared memory locations to be accessed concurrently by read/write events from each function without proper synchronization. The other detectors reported three data races, including the two detected by iFT; the third was caused by the same bug via a pointer to the same shared structure.
In bodytrack and x264, the iFT algorithm can exempt shared read events that are neither the leftmost nor the rightmost events from the events relevant to data race detection. Hence, iFT reported fewer data races for these two applications, and the reported data races were a subset of those given by FASTTRACK. For example, in the result of x264, a prior read access to a shared structure in one file (frame.c) was removed from the access history, because a new read access to the same shared structure in another file (analyse.c) occurred on the leftmost thread. iFT reported only a data race between the leftmost read access and a later write access to the same shared structure in a third file (encoder.c), whereas FASTTRACK reported two data races between these read accesses and the later write. However, iFT located the missed data race after we had fixed the previously reported data race by using a local pointer variable.
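From this experiment, we can conclude that iFT is sound, because its precision is established relative to the well-established precision of FASTTRACK: every data race reported by iFT was also reported by FASTTRACK, and the data races missed in one run were located once the previously reported ones had been fixed.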
We measured the runtime and memory consumption of the benchmarks under the three detectors to evaluate the efficiency of iFT. Figure 5 depicts the measured runtime and memory overhead results for 11 applications of PARSEC with sim-medium simulation inputs. The graph shows the average runtime and memory overheads for each of the detectors as a proportion of the original run. We excluded facesim from the efficiency test, because it is a representative long-running application that uses a small number of concurrent threads and naturally incurs quite high runtime and memory overheads for on-the-fly data race detection.
Figure 5: (a) Average runtime overhead; (b) average memory overhead.
From Figure 5(a), almost all of the iFT results are lower than those of the other detectors. iFT incurred an average runtime overhead of 8.5x, whereas FASTTRACK and Djit+ required average runtime overheads of 9.2x and 11.2x, respectively. In particular, iFT required markedly lower runtime overheads for two applications, dedup and ferret, which use more than 20 active threads during program execution. For instance, iFT incurred an average runtime overhead of 23.5x for dedup, whereas FASTTRACK and Djit+ incurred average runtime overheads of 27.6x and 37.3x, respectively. In the case of ferret, the runtime overhead of iFT was 7.5x, while FASTTRACK and Djit+ incurred average runtime overheads of 10x and 16x, respectively. Several applications, such as blackscholes, canneal, and raytrace, have lower overheads than the others because of their model of parallelism (e.g., fork-join parallelism).
In Figure 5(b), we see that iFT incurred an average memory overhead of 4.3x, whereas FASTTRACK incurred an average memory overhead of 6.0x. This means that iFT reduced the average memory overhead to 58% of that of Djit+ and 72% of that of FASTTRACK for the 11 applications. If we consider only the three applications that use several tens of dynamic threads, iFT incurred an average memory overhead of 1.9x, while FASTTRACK required an average memory overhead of 5.4x. Thus, the proposed iFT reduced the average memory overhead to 37% of that recorded by FASTTRACK.
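The relation between the overhead factors and the reported percentages can be checked with a short calculation. The sketch below assumes that an overhead factor is the instrumented measurement divided by the native one and that the relative figure is the ratio of two such factors; the function names are illustrative only.

```python
def overhead_factor(instrumented, native):
    """Overhead as a multiple of the original run, e.g. 4.3 for 4.3x."""
    return instrumented / native


def relative_overhead(detector_x, baseline_x):
    """Overhead of one detector as a percentage of another detector's."""
    return 100.0 * detector_x / baseline_x


# Average memory overhead over the 11 applications: iFT 4.3x, FASTTRACK 6.0x.
memory_pct = relative_overhead(4.3, 6.0)
print(f"{memory_pct:.0f}%")   # prints "72%"

# Runtime overhead for ferret: iFT 7.5x, FASTTRACK 10x.
ferret_pct = relative_overhead(7.5, 10.0)
print(f"{ferret_pct:.0f}%")   # prints "75%"
```

Because the published factors are rounded averages, ratios recomputed from them can differ by a few percentage points from figures computed on the unrounded measurements.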
We measured the average memory consumption for the two real applications under our Pin framework. The results appear in Figure 6. For the experiments, MySQL used 78 threads during a 60-second execution, and Cherokee used 126 threads. We employed four monitoring steps, Native, Pin-only, Monitoring, and Detecting, to show how much additional overhead was incurred by the instrumentation work under the Pin framework. Native means the original execution without our Pin framework, and Pin-only indicates the results measured when the applications were run on the Pin framework without monitoring and instrumentation work. Monitoring means that only the thread executions and memory accesses were traced under the Pin framework. Detecting means that we measured the memory consumption of the applications when executed under the three detectors implemented on top of the Pin framework.
In Figure 6, we see that Pin-only incurred an average memory consumption of 2.2x and Monitoring incurred an average memory consumption of 2.6x. iFT incurred an average memory consumption of 2.8x, whereas FASTTRACK incurred an average memory consumption of 3.6x. This means that iFT reduced the average memory consumption to 62% of that of Djit+ and 76% of that of FASTTRACK for the two applications. If we exclude the Pin-only step, which consumed 1,128 MB on average, iFT incurred an average memory consumption of 1.7x, while FASTTRACK required an average memory consumption of 2.3x. For the two real applications, iFT reduced the average memory consumption to 49% of that of FASTTRACK.
We chose the x264 application from the PARSEC benchmark for an additional comparison, because it employs a different number of concurrent threads to process the virtual pipelined stages for each input frame. In contrast, the other applications use a fixed number of threads, even with different inputs. The comparison used all six simulation inputs provided by the PARSEC suite, because these lead to an increasing number of threads with each input.
Figure 7 depicts the measured runtime and memory overhead results for the x264 application. In the experiment, iFT incurred an average runtime overhead of 6.6x, whereas the other detectors averaged more than 8x slowdown. In particular, in the executions with the sim-large input (256 threads), iFT reduced the runtime overhead to 74% of that of the other detectors. iFT also performs well in reducing the memory overhead, averaging just 1.3x, whereas the memory overhead of the other detectors exceeded that of iFT by more than 95%. Under iFT, the application ran with the native input using 1,024 concurrent threads, but the other detectors ran out of memory with the native input because of the 32 GB limitation of our system. In this case, iFT required a runtime overhead of 11.5x and a memory overhead of 1.6x to locate two data races. It is noteworthy that the superior performance of iFT results from eliminating the VC operations used in the FASTTRACK algorithm.
Figure 7: (a) Average runtime overheads; (b) average memory overheads.
The results in Figure 7 show that iFT reduced the memory overhead by a factor of 11.4 and gave a speedup of 1.3x compared to the other dynamic detectors. The overheads of iFT were similar to those of the other algorithms for small inputs, as x264 uses fewer than 20 threads for these inputs. However, with the larger inputs, iFT reduced the runtime and memory overheads compared to the other detectors. For example, iFT required just 82% of the runtime and 8% of the memory overhead of FASTTRACK for these larger inputs. The results emphasize again that iFT is practically useful for detecting data races on-the-fly in programs with a large number of concurrent threads.
The empirical results from Table 4 to Figure 7 show that our iFT algorithm is a sound and practical method for on-the-fly data race detection, because it reduces the average runtime and memory overhead to 84% and 37%, respectively, of those recorded by FASTTRACK.
6. Related Work
Most prior dynamic techniques have focused on detecting data races more precisely or efficiently. Since FASTTRACK was introduced, several detectors have been designed to combine lockset analysis with happens-before analysis by leveraging the lightweight nature of epochs.
ACCULOCK was the first solution to use this combined approach, achieving performance comparable to that of FASTTRACK with limited false positives. This detector applies a new, efficient lockset algorithm to FASTTRACK to enforce a thread locking discipline. It uses the notion of potential data races, called -races, in which any two concurrent read/write events access a shared memory location without a common lock. The detector is less sensitive to thread interleaving under thread locking, as it excludes from the VCs the subset of happens-before relations established by lock acquisitions and releases. However, ACCULOCK still requires operations that depend on the maximum parallelism of the program to maintain an access history and locate data races, similar to FASTTRACK.
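The lockset condition behind such potential data races can be sketched as follows. This is a simplified illustration, not ACCULOCK's algorithm: the `Access` record and `is_potential_race` are hypothetical names, and the two accesses are assumed to have already been found concurrent by a separate happens-before check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One access to a shared location (hypothetical record layout)."""
    thread: int
    is_write: bool
    locks: frozenset   # locks held by the thread at the time of the access


def is_potential_race(a, b):
    """Lockset test for two accesses assumed to be concurrent.

    The pair is a potential race if the accesses come from different
    threads, at least one is a write, and no common lock protects both.
    """
    if a.thread == b.thread:
        return False
    if not (a.is_write or b.is_write):
        return False
    return not (a.locks & b.locks)


read_m = Access(thread=1, is_write=False, locks=frozenset({"m"}))
write_m = Access(thread=2, is_write=True, locks=frozenset({"m"}))
write_n = Access(thread=2, is_write=True, locks=frozenset({"n"}))

protected = is_potential_race(read_m, write_m)    # common lock "m": False
unprotected = is_potential_race(read_m, write_n)  # no common lock: True
```

A pure lockset test of this kind flags a pair of accesses regardless of the interleaving actually observed, which is why it can also produce false positives.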
ThreadSanitizer is another hybrid detector based on the same combined approach. This detector provides improved precision in the detection of data races by handling the finer details of thread synchronization and the race patterns that appear in C/C++ applications. However, unlike ACCULOCK, it uses VCs to analyze the happens-before relation and multiple locksets for concurrent writes. Thus, the detector incurs the same time and memory overhead as earlier hybrid detectors such as MultiRace. Recently, a new version of ThreadSanitizer was released (but not reported officially). It includes the FASTTRACK algorithm and uses epochs instead of the VCs of the old version.
In our prior work, we presented an on-the-fly race detector for OpenMP programs. This detector uses a thread identification technique to analyze the happens-before relation and a data race detection protocol that utilizes lockset analysis. A significant improvement in efficiency was obtained because the left-of relation was also applied to the protocol, and the detector is able to report data races precisely for OpenMP programs with a large number of concurrent threads. However, our prior detector may lose its soundness or efficiency when handling general threading models, like Pthread, because it considers only structured fork-join parallel programming models, such as OpenMP.
7. Conclusion
There is a trade-off between efficiency and precision in the detection of data races using happens-before or lockset analysis. FASTTRACK is the fastest happens-before analysis algorithm and provides performance comparable to that of lockset analysis. However, there is still room for improvement, as the algorithm requires some VC operations. In this paper, we presented an improved FASTTRACK algorithm, called iFT, that uses only the epochs in each access history by applying the left-of relation. This algorithm is practically sound, needing only an O(1) runtime and memory overhead to maintain an access history and providing precision similar to that of the well-established FASTTRACK algorithm.
We implemented our algorithm as a Pin-tool on top of the Pin instrumentation framework and compared it empirically with other detection algorithms, including FASTTRACK. Empirical results from a set of C/C++ benchmarks showed that our iFT algorithm is a practical and sound method for on-the-fly data race detection, reducing the average runtime and memory overhead to 84% and 37%, respectively, of those required by FASTTRACK. This low overhead of the iFT algorithm is significant, because the algorithm can be used for on-the-fly detection based on both happens-before analysis and a hybrid technique, as presented here for an empirical comparison of efficiency. Thus, we believe that the lightweight iFT algorithm can be applied to production systems that include fault tolerance techniques and to testing tools for developing dependable software as well as safety-critical software such as avionics and nuclear power systems. Future work will focus on improving the iFT algorithm via a hybrid detection technique, similar to that of ACCULOCK but without the false positive problem, and on enhancing its precision to handle more varied synchronization primitives, as in ThreadSanitizer.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2014R1A1A2060082).
References
U. Banerjee, B. Bliss, Z. Ma, and P. Petersen, “A theory of data race detection,” in Proceedings of the Workshop on Parallel and Distributed Systems: Testing and Debugging (PADTAD '06), pp. 69–78, ACM, New York, NY, USA, 2006.
E. Pozniansky and A. Schuster, “Efficient on-the-fly data race detection in multithreaded C++ programs,” in Proceedings of the 9th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '03), pp. 179–190, ACM, June 2003.
A. Muzahid, D. Suárez, S. Qi, and J. Torrellas, “SigRace: signature-based data race detection,” ACM SIGARCH Computer Architecture News, vol. 37, no. 3, pp. 337–348, 2009.
J. Mellor-Crummey, “On-the-fly detection of data races for programs with nested fork-join parallelism,” in Proceedings of the ACM/IEEE Conference on Supercomputing (Supercomputing '91), pp. 24–33, ACM, New York, NY, USA, November 1991.
J. J. Harrow, “Runtime checking of multithreaded applications with visual threads,” in SPIN Model Checking and Software Verification: 7th International SPIN Workshop, Stanford, CA, USA, August 30 - September 1, 2000. Proceedings, vol. 1885 of Lecture Notes in Computer Science, pp. 331–342, Springer, Berlin, Germany, 2000.
A. Jannesari, B. Kaibin, V. Pankratius, and W. F. Tichy, “Helgrind+: an efficient dynamic race detector,” in Proceedings of the IEEE International Symposium on Parallel & Distributed Processing (IPDPS '09), pp. 1–13, IEEE Computer Society, Rome, Italy, May 2009.
X. Xie and J. Xue, “Acculock: accurate and efficient detection of data races,” in Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '11), pp. 201–212, IEEE Computer Society, Chamonix, France, April 2011.
R. O'Callahan and J.-D. Choi, “Hybrid dynamic data race detection,” in Proceedings of the 9th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '03), pp. 167–178, ACM, New York, NY, USA, June 2003.
K. Zhai, B. Xu, W. K. Chan, and T. H. Tse, “CARISMA: a context-sensitive approach to race-condition sample-instance selection for multithreaded applications,” in Proceedings of the 21st International Symposium on Software Testing and Analysis (ISSTA '12), pp. 221–231, ACM, New York, NY, USA, July 2012.
R. Baldoni and M. Raynal, “Fundamentals of distributed computing: a practical tour of vector clock systems,” IEEE Distributed Systems Online, vol. 3, no. 2, 2002.
M. A. Bender, J. T. Fineman, S. Gilbert, and C. E. Leiserson, “On-the-fly maintenance of series-parallel relationships in fork-join multithreaded programs,” in Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '04), pp. 133–144, ACM, New York, NY, USA, June 2004.
C. Bienia and K. Li, “PARSEC 2.0: a new benchmark suite for chip-multiprocessors,” in Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, June 2009.
M. Olszewski, Q. Zhao, D. Koh, J. Ansel, and S. Amarasinghe, “Aikido: accelerating shared data dynamic analyses,” ACM SIGARCH Computer Architecture News, vol. 40, no. 1, pp. 173–184, 2012.
O.-K. Ha, I.-B. Kuh, G. M. Tchamgoue, and Y.-K. Jun, “On-the-fly detection of data races in OpenMP programs,” in Proceedings of the Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (PADTAD '12), pp. 1–10, ACM, New York, NY, USA, 2012.