Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

Jeff Bilmes*, Krste Asanovic†, Chee-Whye Chin‡, Jim Demmel§
{bilmes,krste,cheewhye,demmel}@cs.berkeley.edu
CS Division, University of California at Berkeley, Berkeley CA 94720
International Computer Science Institute, Berkeley CA 94704

* CS Division, University of California at Berkeley and the International Computer Science Institute. The author acknowledges the support of JSEP contract F49620-94-C-0038.
† CS Division, University of California at Berkeley and the International Computer Science Institute. The author acknowledges the support of ONR URI Grant N00014-92-J-1617.
‡ CS Division, University of California at Berkeley. The author acknowledges the support of ARPA contract DAAL03-91-C-0047 (University of Tennessee Subcontract ORA4466.02).
§ CS Division and Mathematics Dept., University of California at Berkeley. The author acknowledges the support of ARPA contract DAAL03-91-C-0047 (University of Tennessee Subcontract ORA4466.02), ARPA contract DAAH04-95-1-0077 (University of Tennessee Subcontract ORA7453.02), DOE grant DE-FG03-94ER25219, DOE contract W-31-109-Eng-38, NSF grant ASC-9313958, and DOE grant DE-FG03-94ER25206.

Abstract

Modern microprocessors can achieve high performance on linear algebra kernels but this currently requires extensive machine-specific hand tuning. We have developed a methodology whereby near-peak performance on a wide range of systems can be achieved automatically for such routines. First, by analyzing current machines and C compilers, we've developed guidelines for writing Portable, High-Performance, ANSI C (PHiPAC, pronounced "fee-pack"). Second, rather than code by hand, we produce parameterized code generators. Third, we write search scripts that find the best parameters for a given system. We report on a BLAS GEMM compatible multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k. The resulting routines are competitive with vendor-optimized BLAS GEMMs.

1 Introduction

The use of a standard linear algebra library interface, such as BLAS [LHKK79, DCHH88, DCDH90], enables portable application code to obtain high performance provided that an optimized library (e.g., [AGZ94, KHM94]) is available and affordable.

Developing an optimized library, however, is a difficult and time-consuming task. Even excluding algorithmic variants such as Strassen's method [BLS91] for matrix multiplication, these routines have a large design space with many parameters such as blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules. Furthermore, these parameters have complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors.

Various strategies can be used to produce optimized routines for a given platform. For example, the routines could be manually written in assembly code, but fully exploring the design space might then be infeasible, and the resulting code might be unusable or sub-optimal on a different system.

Another commonly used method is to code using a high-level language but with manual tuning to match the underlying architecture [AGZ94, KHM94]. While less tedious than coding in assembler, this approach still requires writing machine-specific code which is not performance-portable across a range of systems.

Ideally, the routines would be written once in a high-level language and fed to an optimizing compiler for each machine. There is a large literature on relevant compiler techniques, many of which use matrix multiplication as a test case [WL91, LRW91, MS95, ACF95, CFH95, SMP+96].¹ While these compiler heuristics generate reasonably good code in general, they tend not to generate near-peak code for any one operation. A high-level language's semantics might also obstruct aggressive compiler optimizations. Moreover, it takes significant time and investment before compiler research appears in production compilers, so these capabilities are often unavailable. While both microarchitectures and compilers will improve over time, we expect it will be many years before a single version of a library routine can be compiled to give near-peak performance across a wide range of machines.

¹ A longer list appears in [Wol96].

We have developed a methodology, named PHiPAC, for developing Portable High-Performance linear algebra libraries in ANSI C. Our goal is to produce, with minimal effort, high-performance linear algebra libraries for a wide range of systems. The PHiPAC methodology has three components. First, we have developed a generic model of current C compilers and microprocessors that provides guidelines for producing portable high-performance ANSI C code. Second, rather than hand code particular routines, we write parameterized generators [ACF95, MS95] that produce code according to our guidelines. Third, we write scripts that automatically tune code for a particular system by varying the generators' parameters and benchmarking the resulting routines.
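To make the second and third components concrete, the following is a minimal sketch of a parameterized generator in this style. It is illustrative only, not PHiPAC's actual generator; the routine name and unrolling scheme are our own. A search script would run it with several unrolling depths, compile each output, and keep the fastest:

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative parameterized generator (not PHiPAC's own):
       emit a dot-product routine unrolled by a factor U.  The
       emitted code keeps its accumulator in a local variable and
       uses base-plus-offset addressing, per the guidelines of
       Section 2.1. */
    static void emit_dot(int U)
    {
        int u;
        printf("float dot(const float *x, const float *y, int n)\n{\n");
        printf("    float s = 0.0f;\n    int i;\n");
        printf("    for (i = 0; i + %d <= n; i += %d) {\n", U, U);
        for (u = 0; u < U; u++)
            printf("        s += x[%d] * y[%d];\n", u, u);
        printf("        x += %d; y += %d;\n    }\n", U, U);
        printf("    for (; i < n; i++) s += *x++ * *y++;\n");
        printf("    return s;\n}\n");
    }

    int main(int argc, char **argv)
    {
        emit_dot(argc > 1 ? atoi(argv[1]) : 4);  /* unrolling depth */
        return 0;
    }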
We have found that writing a parameterized generator and search scripts for a routine takes less effort than hand-tuning a single version for a single system. Furthermore, with the PHiPAC approach, development effort can be amortized over a large number of platforms. And by automatically searching a large design space, we can discover winning yet unanticipated parameter combinations.

Using the PHiPAC methodology, we have produced a portable, BLAS-compatible matrix multiply generator. The resulting code can achieve over 90% of peak performance on a variety of current workstations, and is sometimes faster than the vendor-optimized libraries. We focus on matrix multiplication in this paper, but we have produced other generators, including dot-product, AXPY, and convolution, which have similarly demonstrated portable high performance.

We concentrate on producing high quality uniprocessor libraries for microprocessor-based systems because multiprocessor libraries, such as [CDD+96], can be readily built from uniprocessor libraries. For vector and other architectures, however, our machine model would likely need substantial modification.

Section 2 describes our generic C compiler and microprocessor model, and develops the resulting guidelines for writing portable high-performance C code. Section 3 describes our generator and the resulting code for a BLAS-compatible matrix multiply. Section 4 describes our strategy for searching the matrix multiply parameter space. Section 5 presents results on several architectures comparing against vendor-supplied BLAS GEMM. Section 6 describes the availability of the distribution, and discusses future work.

2 PHiPAC

By analyzing the microarchitectures of a range of machines, such as workstations and microprocessor-based SMP and MPP nodes, and the output of their ANSI C compilers, we derived a set of guidelines that help us attain high performance across a range of machine and compiler combinations [BAD+96].

From our analysis of various ANSI C compilers, we determined we could usually rely on reasonable register allocation, instruction selection, and instruction scheduling. More sophisticated compiler optimizations, however, including pointer alias disambiguation, register and cache blocking, loop unrolling, and software pipelining, were either not performed or not very effective at producing the highest quality code.

Although it would be possible to use another target language, we chose ANSI C because it provides a low-level, yet portable, interface to machine resources, and compilers are widely available. One problem with our use of C is that we must explicitly work around pointer aliasing, as described below. In practice, this has not limited our ability to extract near-peak performance.

We emphasize that for both microarchitectures and compilers we are determining a lowest common denominator. Some microarchitectures or compilers will have superior characteristics in certain attributes, but, if we code assuming these exist, performance will suffer on systems where they do not. Conversely, coding for the lowest common denominator should not adversely affect performance on more capable platforms.

For example, some machines can fold a pointer update into a load instruction while others require a separate add. Coding for the lowest common denominator dictates replacing pointer updates with base plus constant offset addressing where possible. In addition, while some production compilers have sophisticated loop unrolling and software pipelining algorithms, many do not. Our search strategy (Section 4) empirically evaluates several levels of explicit loop unrolling and depths of software pipelining. While a naive compiler might benefit from code with explicit loop unrolling or software pipelining, a more sophisticated compiler might perform better without either.
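As an illustration of the base-plus-constant-offset point, compare the two hypothetical copy loops below (both assume n is a multiple of four). The first performs four pointer updates per iteration; the second performs one, and every access can use register-plus-immediate addressing on any machine:

    /* Pointer-update style: four integer adds per iteration on
       machines without auto-increment addressing. */
    void copy4_ptr(float *dst, const float *src, int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            *dst++ = *src++;
            *dst++ = *src++;
            *dst++ = *src++;
            *dst++ = *src++;
        }
    }

    /* Base-plus-constant-offset style: one pointer update per
       group; each load and store is base + small constant. */
    void copy4_off(float *dst, const float *src, int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            dst[0] = src[0];
            dst[1] = src[1];
            dst[2] = src[2];
            dst[3] = src[3];
            dst += 4; src += 4;
        }
    }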
2.1 PHiPAC Coding Guidelines

The following paragraphs exemplify PHiPAC C code generation guidelines. Programmers can use these coding guidelines directly to improve performance in critical routines while retaining portability, but this does come at the cost of less maintainable code. This problem is mitigated in the PHiPAC approach, however, by the use of parameterized code generators.

Use local variables to explicitly remove false dependencies.

Casually written C code often over-specifies operation order, particularly where pointer aliasing is possible. C compilers, constrained by C semantics, must obey these over-specifications, thereby reducing optimization potential. We therefore remove these extraneous dependencies.

For example, the following code fragment contains a false Read-After-Write hazard:

    a[i] = b[i] + c;
    a[i+1] = b[i+1] * d;

The compiler may not assume &a[i] != &b[i+1] and is forced to first store a[i] to memory before loading b[i+1]. We may re-write this with explicit loads to local variables:

    float f1, f2;
    f1 = b[i]; f2 = b[i+1];
    a[i] = f1 + c;
    a[i+1] = f2 * d;

The compiler can now interleave execution of both original statements, thereby increasing parallelism.
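For reference, here is the transformed fragment wrapped in a self-contained routine; the function name and signature are ours, for illustration only:

    /* Updates a[i] and a[i+1] from b without the false
       Read-After-Write hazard: the explicit loads into f1 and f2
       tell the compiler that the second statement does not depend
       on the first store, even if a and b might alias. */
    void update_pair(float *a, const float *b, int i, float c, float d)
    {
        float f1 = b[i];
        float f2 = b[i+1];
        a[i]   = f1 + c;
        a[i+1] = f2 * d;
    }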
Therefore, we use pointer updates rather than subscript expressions. Rather than: We explicitly keep values in local variables to reduce for (i=...) {row_ptr=&p[i*col_stride];...} memory bandwidth demands. For example, consider the following 3-point FIR lter code: we produce: while (...) { for (i=...) {...row_ptr+=col_stride;} *res++ = filter[0]*signal[0]+ filter[1]*signal[1]+ Minimize branches, avoid magnitude compares. filter[2]*signal[2]; signal++; } Branches are costly, especially on modern superscalar processors. Therefore, we unroll loops to amortize The compiler will usually reload the lter values every branch cost and use C do fg while (); loop control loop iteration because of potential aliasing with res.We whenever possible to avoid any unnecessary compiler- can remove the alias by preloading the lter into local produced loop head branches. variables that may be mapped into registers: Also, on many microarchitectures, it is cheaper to perform equality or inequality loop termination tests float f0,f1,f2; than magnitude comparisons. For example, instead of: f0=filter[0];f1=filter[1];f2=filter[2]; while ( ... ) { for (i=0,a=start_ptr;i