Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

Jeff Bilmes*, Krste Asanovic†, Chee-Whye Chin‡, Jim Demmel§
{bilmes,krste,cheewhye,demmel}@cs.berkeley.edu
CS Division, University of California at Berkeley, Berkeley CA 94720
International Computer Science Institute, Berkeley CA 94704

* CS Division, University of California at Berkeley and the International Computer Science Institute. The author acknowledges the support of JSEP contract F49620-94-C-0038.
† CS Division, University of California at Berkeley and the International Computer Science Institute. The author acknowledges the support of ONR URI Grant N00014-92-J-1617.
‡ CS Division, University of California at Berkeley. The author acknowledges the support of ARPA contract DAAL03-91-C-0047 (University of Tennessee Subcontract ORA4466.02).
§ CS Division and Mathematics Dept., University of California at Berkeley. The author acknowledges the support of ARPA contract DAAL03-91-C-0047 (University of Tennessee Subcontract ORA4466.02), ARPA contract DAAH04-95-1-0077 (University of Tennessee Subcontract ORA7453.02), DOE grant DE-FG03-94ER25219, DOE contract W-31-109-Eng-38, NSF grant ASC-9313958, and DOE grant DE-FG03-94ER25206.

Abstract

Modern microprocessors can achieve high performance on linear algebra kernels but this currently requires extensive machine-specific hand tuning. We have developed a methodology whereby near-peak performance on a wide range of systems can be achieved automatically for such routines. First, by analyzing current machines and C compilers, we've developed guidelines for writing Portable, High-Performance, ANSI C (PHiPAC, pronounced "fee-pack"). Second, rather than code by hand, we produce parameterized code generators. Third, we write search scripts that find the best parameters for a given system. We report on a BLAS GEMM compatible multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k. The resulting routines are competitive with vendor-optimized BLAS GEMMs.

1 Introduction

The use of a standard linear algebra library interface, such as BLAS [LHKK79, DCHH88, DCDH90], enables portable application code to obtain high performance provided that an optimized library (e.g., [AGZ94, KHM94]) is available and affordable.

Developing an optimized library, however, is a difficult and time-consuming task. Even excluding algorithmic variants such as Strassen's method [BLS91] for matrix multiplication, these routines have a large design space with many parameters such as blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules. Furthermore, these parameters have complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors.

Various strategies can be used to produce optimized routines for a given platform. For example, the routines could be manually written in assembly code, but fully exploring the design space might then be infeasible, and the resulting code might be unusable or sub-optimal on a different system.

Another commonly used method is to code using a high-level language but with manual tuning to match the underlying architecture [AGZ94, KHM94]. While less tedious than coding in assembler, this approach still requires writing machine-specific code which is not performance-portable across a range of systems.

Ideally, the routines would be written once in a high-level language and fed to an optimizing compiler for each machine. There is a large literature on relevant compiler techniques, many of which use matrix multiplication as a test case [WL91, LRW91, MS95, ACF95, CFH95, SMP+96].¹ While these compiler heuristics generate reasonably good code in general, they tend not to generate near-peak code for any one operation. A high-level language's semantics might also obstruct aggressive compiler optimizations. Moreover, it takes significant time and investment before compiler research appears in production compilers, so these capabilities are often unavailable. While both microarchitectures and compilers will improve over time, we expect it will be many years before a single version of a library routine can be compiled to give near-peak performance across a wide range of machines.

¹ A longer list appears in [Wol96].

We have developed a methodology, named PHiPAC, for developing Portable High-Performance linear algebra libraries in ANSI C. Our goal is to produce, with minimal effort, high-performance linear algebra libraries for a wide range of systems. The PHiPAC methodology has three components. First, we have developed a generic model of current C compilers and microprocessors that provides guidelines for producing portable high-performance ANSI C code. Second, rather than hand code particular routines, we write parameterized generators [ACF95, MS95] that produce code according to our guidelines. Third, we write scripts that automatically tune code for a particular system by varying the generators' parameters and benchmarking the resulting routines.
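To make the second and third components concrete, the following is a minimal sketch of a parameterized generator in this style. It is illustrative only, not PHiPAC's actual generator; the routine name and unrolling scheme are our own. A search script would run it with several unrolling depths, compile each output, and keep the fastest:

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative parameterized generator (not PHiPAC's own):
       emit a dot-product routine unrolled by a factor U.  The
       emitted code keeps its accumulator in a local variable and
       uses base-plus-offset addressing, per the guidelines of
       Section 2.1. */
    static void emit_dot(int U)
    {
        int u;
        printf("float dot(const float *x, const float *y, int n)\n{\n");
        printf("    float s = 0.0f;\n    int i;\n");
        printf("    for (i = 0; i + %d <= n; i += %d) {\n", U, U);
        for (u = 0; u < U; u++)
            printf("        s += x[%d] * y[%d];\n", u, u);
        printf("        x += %d; y += %d;\n    }\n", U, U);
        printf("    for (; i < n; i++) s += *x++ * *y++;\n");
        printf("    return s;\n}\n");
    }

    int main(int argc, char **argv)
    {
        emit_dot(argc > 1 ? atoi(argv[1]) : 4);  /* unrolling depth */
        return 0;
    }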
We have found that writing a parameterized generator and search scripts for a routine takes less effort than hand-tuning a single version for a single system. Furthermore, with the PHiPAC approach, development effort can be amortized over a large number of platforms. And by automatically searching a large design space, we can discover winning yet unanticipated parameter combinations.

Using the PHiPAC methodology, we have produced a portable, BLAS-compatible matrix multiply generator. The resulting code can achieve over 90% of peak performance on a variety of current workstations, and is sometimes faster than the vendor-optimized libraries. We focus on matrix multiplication in this paper, but we have produced other generators, including dot-product, AXPY, and convolution, which have similarly demonstrated portable high performance.

We concentrate on producing high quality uniprocessor libraries for microprocessor-based systems because multiprocessor libraries, such as [CDD+96], can be readily built from uniprocessor libraries. For vector and other architectures, however, our machine model would likely need substantial modification.

Section 2 describes our generic C compiler and microprocessor model, and develops the resulting guidelines for writing portable high-performance C code. Section 3 describes our generator and the resulting code for a BLAS-compatible matrix multiply. Section 4 describes our strategy for searching the matrix multiply parameter space. Section 5 presents results on several architectures comparing against vendor-supplied BLAS GEMM. Section 6 describes the availability of the distribution, and discusses future work.

2 PHiPAC

By analyzing the microarchitectures of a range of machines, such as workstations and microprocessor-based SMP and MPP nodes, and the output of their ANSI C compilers, we derived a set of guidelines that help us attain high performance across a range of machine and compiler combinations [BAD+96].

From our analysis of various ANSI C compilers, we determined we could usually rely on reasonable register allocation, instruction selection, and instruction scheduling. More sophisticated compiler optimizations, however, including pointer alias disambiguation, register and cache blocking, loop unrolling, and software pipelining, were either not performed or not very effective at producing the highest quality code.

Although it would be possible to use another target language, we chose ANSI C because it provides a low-level, yet portable, interface to machine resources, and compilers are widely available. One problem with our use of C is that we must explicitly work around pointer aliasing, as described below. In practice, this has not limited our ability to extract near-peak performance.

We emphasize that for both microarchitectures and compilers we are determining a lowest common denominator. Some microarchitectures or compilers will have superior characteristics in certain attributes, but, if we code assuming these exist, performance will suffer on systems where they do not. Conversely, coding for the lowest common denominator should not adversely affect performance on more capable platforms.

For example, some machines can fold a pointer update into a load instruction while others require a separate add. Coding for the lowest common denominator dictates replacing pointer updates with base plus constant offset addressing where possible. In addition, while some production compilers have sophisticated loop unrolling and software pipelining algorithms, many do not. Our search strategy (Section 4) empirically evaluates several levels of explicit loop unrolling and depths of software pipelining. While a naive compiler might benefit from code with explicit loop unrolling or software pipelining, a more sophisticated compiler might perform better without either.
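As an illustration of the base-plus-constant-offset point, compare the two hypothetical copy loops below (both assume n is a multiple of four). The first performs four pointer updates per iteration; the second performs one, and every access can use register-plus-immediate addressing on any machine:

    /* Pointer-update style: four integer adds per iteration on
       machines without auto-increment addressing. */
    void copy4_ptr(float *dst, const float *src, int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            *dst++ = *src++;
            *dst++ = *src++;
            *dst++ = *src++;
            *dst++ = *src++;
        }
    }

    /* Base-plus-constant-offset style: one pointer update per
       group; each load and store is base + small constant. */
    void copy4_off(float *dst, const float *src, int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            dst[0] = src[0];
            dst[1] = src[1];
            dst[2] = src[2];
            dst[3] = src[3];
            dst += 4; src += 4;
        }
    }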
2.1 PHiPAC Coding Guidelines

The following paragraphs exemplify PHiPAC C code generation guidelines. Programmers can use these coding guidelines directly to improve performance in critical routines while retaining portability, but this does come at the cost of less maintainable code. This problem is mitigated in the PHiPAC approach, however, by the use of parameterized code generators.

Use local variables to explicitly remove false dependencies.

Casually written C code often over-specifies operation order, particularly where pointer aliasing is possible. C compilers, constrained by C semantics, must obey these over-specifications, thereby reducing optimization potential. We therefore remove these extraneous dependencies.

For example, the following code fragment contains a false Read-After-Write hazard:

    a[i] = b[i] + c;
    a[i+1] = b[i+1] * d;

The compiler may not assume &a[i] != &b[i+1] and is forced to first store a[i] to memory before loading b[i+1]. We may re-write this with explicit loads to local variables:

    float f1, f2;
    f1 = b[i]; f2 = b[i+1];
    a[i] = f1 + c;
    a[i+1] = f2 * d;

The compiler can now interleave execution of both original statements, thereby increasing parallelism.
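For reference, here is the transformed fragment wrapped in a self-contained routine; the function name and signature are ours, for illustration only:

    /* Updates a[i] and a[i+1] from b without the false
       Read-After-Write hazard: the explicit loads into f1 and f2
       tell the compiler that the second statement does not depend
       on the first store, even if a and b might alias. */
    void update_pair(float *a, const float *b, int i, float c, float d)
    {
        float f1 = b[i];
        float f2 = b[i+1];
        a[i]   = f1 + c;
        a[i+1] = f2 * d;
    }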
Therefore, we use pointer updates rather than subscript expressions. Rather than: We explicitly keep values in local variables to reduce for (i=...) {row_ptr=&p[i*col_stride];...} memory bandwidth demands. For example, consider the following 3-point FIR lter code: we produce: while (...) { for (i=...) {...row_ptr+=col_stride;} *res++ = filter[0]*signal[0]+ filter[1]*signal[1]+ Minimize branches, avoid magnitude compares. filter[2]*signal[2]; signal++; } Branches are costly, especially on modern superscalar processors. Therefore, we unroll loops to amortize The compiler will usually reload the lter values every branch cost and use C do fg while (); loop control loop iteration because of potential aliasing with res.We whenever possible to avoid any unnecessary compiler- can remove the alias by preloading the lter into local produced loop head branches. variables that may be mapped into registers: Also, on many microarchitectures, it is cheaper to perform equality or inequality loop termination tests float f0,f1,f2; than magnitude comparisons. For example, instead of: f0=filter[0];f1=filter[1];f2=filter[2]; while ( ... ) { for (i=0,a=start_ptr;i