Title: Deployment-Aware Neural Architecture Search and Serving of Deep Neural Networks
Date: September 24, 2024
Time: 11:00 AM - 1:00 PM EDT
Location: KACB 3402
Virtual: https://gatech.zoom.us/my/alindkhare?pwd=cS9iV1pWbzI4R0dNVUJPRmtoV3FFUT09
Name: Alind Khare
School of Computer Science
College of Computing
Georgia Institute of Technology
Committee:
Alexey Tumanov (Advisor) - School of Computer Science, Georgia Institute of Technology,
Ling Liu - School of Computer Science, Georgia Institute of Technology,
Ada Gavrilovska - School of Computer Science, Georgia Institute of Technology,
Tushar Krishna - School of Electrical and Computer Engineering & School of Computer Science, Georgia Institute of Technology
Myungjin Lee - Cisco Research
Abstract:
Deep Neural Networks (DNNs) are leading the current AI wave and power numerous real-world interactive applications. They are increasingly deployed on the critical path of production applications in data centers and at the edge, running across heterogeneous hardware and low-power embedded devices. As a result, these applications operate under dynamic and unpredictable deployment conditions, such as bursty traffic, while still demanding high service quality in terms of both DNN model accuracy and interactive response times. This creates a tension between accuracy and latency requirements that must be carefully balanced.
My thesis addresses these challenges and resolves this tension by innovating (a) novel NAS techniques that yield optimal accuracy across multiple latency targets, and (b) inference serving systems that enable dynamic latency/accuracy tradeoffs for interactive applications in real time. The thesis comprises two complementary thrusts: (a) Deployment-Aware Neural Architecture Search (NAS), in which the thesis introduces novel hardware-aware NAS techniques (CompOFA, DES, and SuperFedNAS) that produce DNN architectures specialized for different hardware at lower training cost, in both centralized and federated settings; and (b) Real-Time DNN Inference Serving Systems, in which the thesis introduces SuperServe, an inference serving system that enables fine-grained decision-making in DNN serving and reactively trades off accuracy to meet latency requirements in real time. On a trace derived from the real-world Microsoft Azure Functions workload, SuperServe achieves 4.67% higher accuracy at the same latency SLO attainment and 2.85x higher latency SLO attainment at the same accuracy.
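To make the mechanism concrete, the sketch below illustrates the kind of decision an SLO-aware serving system makes over a weight-shared supernet: among subnets spanning the accuracy/latency Pareto frontier, pick the most accurate one whose profiled latency still fits the request's remaining latency budget. This is a minimal, hypothetical illustration; the subnet names, accuracy/latency numbers, and the pick_subnet interface are all illustrative assumptions, not the thesis's actual API.

from dataclasses import dataclass

@dataclass
class Subnet:
    name: str
    accuracy: float    # validation accuracy of this supernet subnetwork (illustrative)
    latency_ms: float  # profiled inference latency on the target hardware (illustrative)

# A weight-shared supernet exposes many subnets spanning the
# accuracy/latency Pareto frontier (hypothetical profile numbers).
SUBNETS = [
    Subnet("tiny",   accuracy=0.71, latency_ms=4.0),
    Subnet("small",  accuracy=0.74, latency_ms=7.5),
    Subnet("medium", accuracy=0.77, latency_ms=14.0),
    Subnet("large",  accuracy=0.80, latency_ms=27.0),
]

def pick_subnet(slo_budget_ms: float) -> Subnet:
    """Return the most accurate subnet whose profiled latency fits the
    remaining SLO budget; fall back to the fastest subnet if none fits."""
    feasible = [s for s in SUBNETS if s.latency_ms <= slo_budget_ms]
    if not feasible:
        return min(SUBNETS, key=lambda s: s.latency_ms)
    return max(feasible, key=lambda s: s.accuracy)

# Under light load the full budget is available; under bursty traffic,
# queueing eats into it and the scheduler degrades accuracy gracefully.
print(pick_subnet(slo_budget_ms=30.0).name)  # -> "large"
print(pick_subnet(slo_budget_ms=10.0).name)  # -> "small"

The same selection logic also suggests why a deployment-aware supernet helps at training time: one training run yields the whole table of subnets, so each hardware target or SLO regime can be served without retraining a model per latency target.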
Put together, the thesis introduces an efficient AI stack for the practical deployment of DNNs across diverse scenarios, including heterogeneous hardware and bursty, unpredictable traffic. Through innovations in both systems and ML, the stack provides stronger latency SLO guarantees, higher accuracy, and better resource efficiency.