Abstract
Scientific programmers are accustomed to expressing in their programs the “who” (variable declarations) and the “what” (operations), in some sequentialized order, and leaving to the systems software and hardware the questions of “when” and “where”. This act of delegation is appropriate at small scales, since programmer management of pipelines, multiple functional units, and multilevel caches is presently beyond reward, and the depth and complexity of such performance-motivated architectural developments are sure to increase. However, disregard for the differential costs of accessing different locations in memory (the “flat memory” model) can put unnecessary amounts of synchronization and data motion on the critical path of program execution. Different organizations of an algorithm that lead to mathematically equivalent results can have very different levels of exposed synchronization and data motion, and algorithmicists of the future will have to be conscious of and adapt to the distributed and hierarchical aspects of memory architecture.
Plenty of examples of architecturally motivated algorithmic adaptations can be given today; we illustrate herein with examples from recent aerodynamics simulations. For this purpose, pseudo-transient Newton-Krylov-Schwarz methods are briefly introduced and their parallel scalability in bulk synchronous SPMD applications is explored. We also indicate some fundamental limitations of bulk synchronous implicit solvers and propose asynchronous forms of nonlinear Schwarz methods as perhaps better adapted both to massively parallel architectures and strongly nonuniform applications. Suitably adapted PDE solvers seem to be readily extrapolated to the 100 Tflop/s capabilities envisioned in the coming decade. Making use of some novel quantitative metrics for the memory access efficiencies of high performance applications (“memtropy”) and for the local strength of nonlinearity (“tensoricity”) in applications with spatially nonuniform characteristics, we propose a migration path for scientific and engineering simulations towards the distributed and hierarchical Teraflops world, and we consider what simulations in this world will look like.