Table of Contents Author Guidelines Submit a Manuscript
Scientific Programming
Volume 2015 (2015), Article ID 937694, 20 pages
Research Article

Cache Locality-Centric Parallel String Matching on Many-Core Accelerator Chips

1Department of Computer Science and Engineering, Myongji University, 116 Myongji Ro, Cheo-In Gu, Yong In, Kyungki Do 449-728, Republic of Korea
2Korea Institute of Science and Technology Information (KISTI), 245 Dae Hak Ro, Yu Seong Gu, Daejeon 305-806, Republic of Korea

Received 2 April 2015; Revised 6 August 2015; Accepted 7 September 2015

Academic Editor: Bronis R. de Supinski

Copyright © 2015 Nhat-Phuong Tran et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Aho-Corasick (AC) algorithm is a multiple patterns string matching algorithm commonly used in computer and network security and bioinformatics, among many others. In order to meet the highly demanding computational requirements imposed on these applications, achieving high performance for the AC algorithm is crucial. In this paper, we present a high performance parallelization of the AC on the many-core accelerator chips such as the Graphic Processing Unit (GPU) from Nvidia and the Intel Xeon Phi. Our parallelization approach significantly improves the cache locality of the AC by partitioning a given set of string patterns into multiple smaller sets of patterns in a space-efficient way. Using the multiple pattern sets, intensive pattern matching operations are concurrently conducted with respect to the whole input text data. Compared with the previous approaches where the input data is partitioned amongst multiple threads instead of partitioning the pattern set, our approach significantly improves the performance. Experimental results show that our approach leads up to 2.73 times speedup on the Nvidia K20 GPU and 2.00 times speedup on the Intel Xeon Phi compared with the previous approach. Our parallel implementation delivers up to 693 Gbps throughput performance on the K20.