You can find my publications here.
• Energy-Efficient and Resilient GPU Computing
My work [HPCA'15] characterizes voltage noise phenomena and performs root-cause analysis of voltage droops in GPU architectures using simulation infrastructure. The insights from this analysis differ from the conventional wisdom established for CPUs, and they inspire a hierarchical voltage-smoothing mechanism that effectively suppresses voltage noise with negligible performance overhead. The work also explores compiler-assisted voltage-smoothing mechanisms. Our vision is that a co-designed resilient architecture can smooth and tolerate voltage noise, thereby improving energy efficiency.
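To give a flavor of the hierarchical idea, here is a toy software sketch (not the paper's hardware design): a fast per-SM response handles shallow droops, while a slower chip-wide response handles deep droops. The thresholds and action names are purely illustrative assumptions.

```python
# Toy two-level voltage-droop response (illustrative only).
# A fast local mechanism throttles one SM's issue rate on a shallow droop;
# a slower global mechanism staggers activity across SMs on a deep droop.
LOCAL_DROOP = 0.95    # per-SM throttle threshold (assumed, in volts)
GLOBAL_DROOP = 0.90   # chip-wide response threshold (assumed, in volts)

def smooth(voltage_samples):
    """Return the smoothing action chosen for each sensed voltage sample."""
    actions = []
    for v in voltage_samples:
        if v < GLOBAL_DROOP:
            actions.append("global-stagger")   # slow, chip-wide smoothing
        elif v < LOCAL_DROOP:
            actions.append("local-throttle")   # fast, per-SM smoothing
        else:
            actions.append("none")             # voltage within margin
    return actions

print(smooth([1.0, 0.93, 0.88, 0.97]))
# -> ['none', 'local-throttle', 'global-stagger', 'none']
```

The point of the hierarchy is that the cheap local response absorbs most droops, so the costly global response (and its performance overhead) is invoked only rarely.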
• Multi-GPU Computing System
A single GPU is no longer sufficient for today's large-scale applications such as deep learning and big data analytics, and industry is already deploying multi-GPU systems for exascale computing, so optimizing the energy efficiency of such systems is important. My research studies how to build high-performance, efficient multi-GPU systems and how to schedule big data analytics and deep learning workloads on them. We first characterize the energy-performance trade-offs of different workloads across GPU architectures. Our ultimate goal is a balanced multi-GPU engine, comprised of multiple commercial off-the-shelf GPUs and a smartly designed runtime system, that runs big data analytics applications with substantially better energy efficiency than current commercial multi-GPU systems.
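As a hypothetical illustration of how such characterization data could drive a runtime decision, the sketch below picks a GPU count by minimizing energy-delay product over measured (runtime, power) profiles. The profile numbers are made up; real values would come from the characterization study.

```python
# Illustrative only: choose a multi-GPU configuration from per-workload
# (runtime, power) profiles by minimizing energy-delay product (EDP).
profiles = {
    # gpus: (runtime_s, avg_power_w) -- fabricated example numbers
    1: (100.0, 250.0),
    2: (55.0, 480.0),
    4: (32.0, 900.0),
}

def best_config(profiles):
    """Return the GPU count with the lowest EDP = energy (J) * delay (s)."""
    def edp(item):
        _, (runtime, power) = item
        return (runtime * power) * runtime
    gpus, _ = min(profiles.items(), key=edp)
    return gpus

print(best_config(profiles))  # -> 4 for these example profiles
```

EDP is just one candidate metric; a runtime could equally optimize energy under a deadline, or delay under a power cap, depending on the deployment's constraints.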
• Resilient Accelerator-Rich System
As emerging workloads continue to demand more computational horsepower, accelerators are receiving significant attention due to their superior performance and energy efficiency. Building resilient systems is becoming paramount in the face of increasing CMOS (un)reliability issues. However, because accelerators are often integrated into the system as black-box third-party IP components, a fault in one or more accelerators could threaten system reliability if there are no established failure semantics for how an error propagates from the accelerator to the main CPU. Therefore, we are studying a new architectural design paradigm that specifies how faults in accelerators can be isolated and contained from the rest of the system. We advocate maintaining reliability at the system level, centered around a hardened CPU, rather than at the individual accelerator level, by separating the accelerators and the CPU into resilience domains: the accelerators reside in the weak domain and only perform error detection, while the CPU resides in the strong domain and performs checkpointing and recovery. This broader principle is a generic paradigm that can be applied to any accelerator-based system; it simplifies accelerator design and enables scalable, flexible integration of accelerators.