Abstract
Traditionally, numerical analysts have evaluated the performance of algorithms by counting floating-point operations. On the algorithmic side, tremendous strides have been made; many algorithms now require only a few floating-point operations per mesh point. On the hardware side, however, memory system performance is improving much more slowly than processor performance. The result is a mismatch in capabilities: algorithm design has minimized the work per data item, but hardware design depends on executing an increasingly large number of operations per data item. The available results suggest how important memory bandwidth is to overall performance: they show that STREAM benchmark results are a much better indicator of performance than peak performance numbers. The chapter illustrates the performance limitations caused by insufficient memory bandwidth with a discussion of sparse matrix-vector multiply, a critical operation in many iterative methods used in implicit CFD codes. It also focuses on the per-processor performance of compute nodes used in parallel computers. Experiments have shown that PETSc-FUN3D has good scalability. In fact, since good per-processor performance reduces the fraction of time spent computing as opposed to communicating, achieving the best per-processor performance is a critical prerequisite to demonstrating uninflated parallel performance.
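To make the bandwidth argument concrete, the following is a minimal sketch of a sparse matrix-vector multiply in compressed sparse row (CSR) storage, together with a rough estimate of how many floating-point operations it performs per byte of memory traffic. It is an illustration only, not the PETSc-FUN3D implementation. The function names, the 8-byte values, the 4-byte indices, and the assumption that the matrix is streamed from memory exactly once while the input vector is read once are all assumptions made here.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Each stored nonzero costs one multiply and one add,
        # but also one value load and one column-index load.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def spmv_flops_per_byte(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Rough arithmetic intensity of CSR SpMV for a square matrix.

    Assumed traffic model (hypothetical, for illustration): the matrix
    is streamed once, x is read once, and y is written once.
    """
    flops = 2 * nnz
    bytes_moved = (
        nnz * (value_bytes + index_bytes)  # nonzero values and column indices
        + (n_rows + 1) * index_bytes       # row pointers
        + 2 * n_rows * value_bytes         # read x once, write y once
    )
    return flops / bytes_moved


# Small tridiagonal example: [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
intensity = spmv_flops_per_byte(n_rows=3, nnz=7)
```

Under this traffic model the intensity can never exceed 2 flops per 12 bytes, or about 0.17 flops per byte, no matter how large the matrix is. Under these assumptions the kernel's rate is therefore set by sustained memory bandwidth, not by the processor's peak floating-point rate.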