LTRF: Latency-Tolerant GPU Register File (ASPLOS'18 talk)

Filetype: PPTX | File size: 0.41 MB | Source: people.inf.ethz.ch

Posted on 01 Sep 2022

Partial capture of text on file.

Register file size limits GPU scalability
• Register file (RF) already accounts for 60% of on-chip storage
• But there is still demand for more registers to achieve maximum performance and concurrency
   [Chart, register file size in KB, axis 0 to 1600: Available Register File; Average Required Register File (2.3x); Maximum Required Register File (5.9x)]
• Future slow memory accesses call for more threads
   • Multi-socket, multi-GPU, RDMA, NVM, etc.
• Compiler optimizations call for more registers per thread
   • Loop unrolling, thread coarsening, etc.
Need mechanisms to expand RF capacity (without large area/power overheads)

How to make register files larger?
• Emerging technologies [Jing’13][Mao’14][Wang’15][Abdel-Majid’17]
• Register file compression [Lee’15]
• Register file virtualization [Jion’15][Vijaykumar’16][Kloosterman’17]
• Common challenge: Latency overhead
   • Example: 8x larger register file with NTV TFET
   [Chart: Normalized IPC (axis 0 to 2) for lavaMD, lbm, leukocyte, myocyte, NN, sad, sgemm, STO, WP, GMEAN; series: Ideal (no latency overhead) and Real (5.3x slower)]
Goal: Tolerate register file latencies

Contributions
• Latency Tolerant Register File (LTRF)
   • “2-level” main register file + register cache
   • Performs prefetch ops while executing other warps
   • Paves the way for several power/area optimizations
• Compiler-driven Register Prefetching
   • Break control flow graph into “prefetch subgraphs”
   • Prefetch registers at the beginning of each subgraph
   • Interval analysis to identify prefetch subgraphs
LTRF tolerates up to 6x slower register files
Example LTRF use case: 8× larger RF, 34.8% higher performance

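The compiler-driven prefetching idea on the Contributions slide can be illustrated with a small sketch. It is a simplified, greedy partition in the spirit of interval analysis, not the LTRF compiler pass itself: a block joins the current subgraph only if all of its predecessors are already inside (so the subgraph has a single entry) and the combined register set still fits the register cache. The function name `prefetch_subgraphs`, the dictionary-based CFG encoding, the example CFG, and the `capacity` parameter are all hypothetical choices made for illustration.

```python
from collections import deque


def prefetch_subgraphs(succs, regs, entry, capacity):
    """Greedily split a CFG into single-entry subgraphs whose register
    working set fits in `capacity` register-cache entries.

    succs: dict block -> list of successor blocks (hypothetical encoding)
    regs:  dict block -> set of registers the block touches
    Returns dict header -> (blocks in subgraph, registers to prefetch).
    """
    # Build predecessor sets from the successor lists.
    preds = {b: set() for b in succs}
    for b, targets in succs.items():
        for t in targets:
            preds[t].add(b)

    assigned = set()
    subgraphs = {}
    worklist = deque([entry])
    while worklist:
        header = worklist.popleft()
        if header in assigned:
            continue
        if len(regs[header]) > capacity:
            raise ValueError(f"block {header!r} alone exceeds the cache capacity")
        blocks, members = [header], {header}
        working_set = set(regs[header])
        assigned.add(header)

        # Grow the subgraph until no further block can be absorbed.
        changed = True
        while changed:
            changed = False
            for b in succs:
                if b in assigned or b == entry:
                    continue
                single_entry = bool(preds[b]) and preds[b] <= members
                fits = len(working_set | regs[b]) <= capacity
                if single_entry and fits:
                    blocks.append(b)
                    members.add(b)
                    working_set |= regs[b]
                    assigned.add(b)
                    changed = True

        subgraphs[header] = (blocks, working_set)
        # Unabsorbed successors become headers of later subgraphs.
        for b in blocks:
            for t in succs[b]:
                if t not in assigned and t not in worklist:
                    worklist.append(t)
    return subgraphs


# Hypothetical CFG: entry -> loop -> {body -> loop, exit}
example_succs = {"entry": ["loop"], "loop": ["body", "exit"],
                 "body": ["loop"], "exit": []}
example_regs = {"entry": {"r0", "r1"}, "loop": {"r1", "r2"},
                "body": {"r2", "r3", "r4"}, "exit": {"r0", "r5"}}

for header, (blocks, prefetch) in prefetch_subgraphs(
        example_succs, example_regs, "entry", capacity=4).items():
    print(header, blocks, sorted(prefetch))
```

With a capacity of 4, the example yields three subgraphs headed by `entry`, `loop`, and `exit`; the loop header and its body share one subgraph, so their four registers would be prefetched once on entry to it rather than fetched on demand inside the loop.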
Outline
• Background and challenges
• The case for compiler-driven register prefetching in GPUs
• LTRF architecture and compiler support
• Evaluation methodology
• Results

Register file caching [Gebhart’ISCA11]
• Promising approach for latency tolerant register files
   [Diagram, left to right: Main Register File (multiple banks); Crossbar; Register File Cache (multiple banks); Crossbar; Operand Collector; SIMD Units; with the Warp Scheduler above]
Unfortunately, classic demand fetch and replace yields low hit rate in register caches
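One way to build intuition for the low hit rate of demand fetch and replace is a toy model, sketched below. It assumes an LRU replacement policy, a cache shared by all warps, and round-robin warp scheduling in which each warp re-reads the same few registers; none of these details come from the slides, and the function name `demand_fetch_hit_rate` and its parameters are hypothetical. It is not a model of the Gebhart design or of the workloads measured in the talk.

```python
from collections import OrderedDict


def demand_fetch_hit_rate(num_warps, regs_per_warp, cache_entries, rounds):
    """Hit rate of an LRU register cache filled purely on demand.

    Warps run round-robin; on each turn a warp reads all of its
    `regs_per_warp` registers. A miss fetches the register and, if the
    cache is full, evicts the least recently used entry.
    """
    cache = OrderedDict()  # keys are (warp, register); order tracks recency
    hits = accesses = 0
    for _ in range(rounds):
        for warp in range(num_warps):
            for reg in range(regs_per_warp):
                key = (warp, reg)
                accesses += 1
                if key in cache:
                    hits += 1
                    cache.move_to_end(key)  # mark as most recently used
                else:
                    if len(cache) >= cache_entries:
                        cache.popitem(last=False)  # evict the LRU entry
                    cache[key] = True
    return hits / accesses


# One warp fits in the cache; four interleaved warps do not.
print(demand_fetch_hit_rate(num_warps=1, regs_per_warp=4, cache_entries=8, rounds=10))
print(demand_fetch_hit_rate(num_warps=4, regs_per_warp=4, cache_entries=8, rounds=10))
```

In this toy setting a single warp misses only on its first four reads, while four interleaved warps evict each other's registers before they are reused, so every access misses. Real workloads are less pathological, but the sketch shows why interleaving many warps works against a small demand-filled cache.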
