Lattice Quantum Chromodynamic (QCD) models subatomic interactions based on a four-dimensional discretized space–time continuum. The Lattice QCD computation is one of the grand challenges in physics especially when modeling a lattice with small spacing. In this work, we study the implementation of the main kernel routine of Lattice QCD that dominates the execution time on the Cell Broadband Engine. We tackle the problem of efficient SIMD execution and the problem of limited bandwidth for data transfers with the off-chip memory. For efficient SIMD execution, we present runtime data fusion technique that groups data processed similarly at runtime. We also introduce analysis needed to reduce the pressure on the scarce memory bandwidth that limits the performance of this computation. We studied two implementations for the main kernel routine that exhibit different patterns of accessing the memory and thus allowing different sets of optimizations. We show the attributes that make one implementation more favorable in terms of performance. For lattice size that is significantly larger than the local store, our implementation achieves 31.2 GFlops for single precision computations and 16.6 GFlops for double precision computations on the PowerXCell 8i, an order of magnitude better than the performance achieved on most general-purpose processors.