Source: people.inf.ethz.ch
Register file size limits GPU scalability
• Register file (RF) already accounts for 60% of on-chip storage
• But, there is still demand for more registers to achieve maximum performance and concurrency
[Chart: register file capacity in KB, axis 0 to 1600. Bars: Available Register File; Average Required Register File (2.3x); Maximum Required Register File (5.9x)]
• Future slow memory accesses call for more threads
  • Multi-socket, multi-GPU, RDMA, NVM, etc.
• Compiler optimizations call for more registers per thread
  • Loop unrolling, thread coarsening, etc.
Need mechanisms to expand RF capacity (without large area/power overheads)

How to make register files larger?
• Emerging technologies [Jing’13][Mao’14][Wang’15][Abdel-Majid’17]
• Register file compression [Lee’15]
• Register file virtualization [Jion’15][Vijaykumar’16][Kloosterman’17]
• Common challenge: Latency overhead
  • Example: 8x larger register file with NTV TFET
[Chart: Normalized IPC for lavaMD, lbm, leukocyte, myocyte, NN, sad, sgemm, STO, WP, GMEAN. Series: Ideal (no latency overhead) and Real. Annotation: 5.3x slower]
Goal: Tolerate register file latencies

Contributions
• Latency Tolerant Register File (LTRF)
  • “2-level” main register file + register cache
  • Performs prefetch ops while executing other warps
  • Paves the way for several power/area optimizations
• Compiler-driven Register Prefetching
  • Break control flow graph into “prefetch subgraphs”
  • Prefetch registers at the beginning of each subgraph
  • Interval analysis to identify prefetch subgraphs
LTRF tolerates up to 6x slower register files
Example LTRF use case: 8× larger RF, 34.8% higher performance

Outline
• Background and challenges
• The case for compiler-driven register prefetching in GPUs
• LTRF architecture and compiler support
• Evaluation methodology
• Results

Register file caching [Gebhart, ISCA’11]
• Promising approach for latency tolerant register files
[Diagram: Warp Scheduler; Main Register File (multiple banks); Crossbar; Register File Cache (multiple banks); Crossbar; Operand Collectors; SIMD Units]
Unfortunately, classic demand fetch and replace yields low hit rate in register caches
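
To make the demand-fetch point concrete, here is a minimal toy sketch in Python of a register cache that is filled only on a miss and replaces entries in LRU order. It is not the simulator or the policy evaluated in the slides: the function names, the round-robin access trace, and the choice of LRU are all assumptions made for illustration. It only shows how interleaving many warps can push the combined working set past the cache capacity.

```python
from collections import OrderedDict


def demand_fetch_hit_rate(accesses, capacity):
    """Hit rate of a register cache filled on demand with LRU replacement.

    accesses: iterable of (warp_id, register_id) pairs in issue order.
    capacity: number of register entries the cache holds (assumed value).
    """
    cache = OrderedDict()  # key order tracks recency, oldest first
    hits = 0
    total = 0
    for key in accesses:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
            continue
        if capacity <= 0:
            continue  # nothing can be cached
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used entry
        cache[key] = True  # demand fetch from the main register file
    return hits / total if total else 0.0


def round_robin_trace(num_warps, regs_per_warp, rounds):
    """Hypothetical trace: warps take turns, each touching all its registers."""
    return [
        (warp, reg)
        for _ in range(rounds)
        for warp in range(num_warps)
        for reg in range(regs_per_warp)
    ]


if __name__ == "__main__":
    trace = round_robin_trace(num_warps=4, regs_per_warp=4, rounds=3)
    for entries in (16, 8):
        print(entries, demand_fetch_hit_rate(trace, entries))
```

In this toy trace the four warps touch 16 distinct registers per round. A 16-entry cache hits on every access after the first round (32 of 48), while an 8-entry cache never hits, because each register is evicted before its warp is scheduled again. Real workloads reuse registers less regularly, so these numbers say nothing about measured hit rates.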