3. Course Objectives
This class explores key concepts in system support for machine learning, deep learning, and large language model (LLM) workloads. The objectives of this class are threefold:
(1) to understand the key properties of these workloads.
(2) to study the fundamental mechanisms and policies implemented in contemporary training and inference frameworks.
(3) to examine how these frameworks have evolved to push the boundaries of performance, scalability, and programmability.
4. Prerequisites and Requirements
Required prerequisites: Operating Systems, Computer Architecture
5. Grading
The grading breakdown is as follows: Paper reviews (10%), Presentation (20%), Midterm (20%), Final (20%), Term project (30%).
Please note that these weights are subject to change.
8. Course Schedule
[Week 1] Introduction
- 9/1: Course Introduction
- 9/3: Basics for Scheduling and Memory Management
[Week 2] Conference Travel (SIGCOMM)
- 9/8: No class
- 9/10: No class
[Week 3] Data Preprocessing
- 9/15: Basics for Data Preprocessing, MinIO (VLDB'21), Revamper (ATC'21)
- 9/17: FastFlow (VLDB'23), FusionFlow (VLDB'24)
[Week 4] Single-GPU Training
- 9/22: Basics for Single-GPU Training
- 9/24: Zico (ATC'21), Nimble (NeurIPS'20)
[Week 5] Multi-GPU & Multi-node Training
- 9/29: Basics for Distributed Training, ZeRO (SC'20), Parallax (EuroSys'19), GPipe (NeurIPS'19)
- 10/1: PipeDream (SOSP'19), Megatron-LM (SC'21)
[Week 6] National Holiday (Chuseok)
- 10/6: No class
- 10/8: No class
[Week 7] Multi-GPU & Multi-node Training
- 10/13: ByteScheduler (SOSP'19), BytePS (OSDI'20)
- 10/15: Alpa (OSDI'22), Metis (ATC'24)
[Week 8] Midterm Exam
[Week 9] Failure Recovery & Reliability
- 10/27: GEMINI (SOSP'23), Universal Checkpointing (ATC'25)
- 10/29: DeepXplore (SOSP'17), TRAINCHECK (OSDI'25)
[Week 10] Memory Oversubscription
- 11/3: Basics for Memory Oversubscription, vDNN (MICRO'16), Checkmate (MLSys'20), Capuchin (ASPLOS'20)
- 11/5: ZeRO-Offload (ATC'21), HUVM (ATC'22)
[Week 11] Scheduler & Cluster Manager
- 11/10: Basics for Scheduler & Cluster Manager, Philly (ATC'19), MLaaS in the Wild (NSDI'22)
- 11/12: Gavel (OSDI'20), Pollux (OSDI'21)
[Week 12] (LLM) Serving Systems
- 11/17: AlpaServe (OSDI'23), DeepPlan (EuroSys'23)
- 11/19: Basics for LLM Serving, FlashAttention (NeurIPS'22)
[Week 13] (LLM) Serving Systems
- 11/24: Orca (OSDI'22), PagedAttention (SOSP'23)
- 11/26: DistServe (OSDI'24), Sarathi-Serve (OSDI'24)
[Week 14] (LLM) Serving Systems
- 12/1: InfiniGen (OSDI'24), FlexGen (ICML'23) or ORBITFLOW (VLDB'26)
- 12/3: No class (University Foundation Day)
[Week 15] Final Exam
[Week 16] Project Presentation
9. Course Format
This course will be based on reading papers and engaging in research-oriented discussions. Each student is expected to: (1) give a 30-minute presentation on one of the papers (selected through a paper bidding process) during the semester, (2) solve system design questions in the midterm and final exams, and (3) submit reviews for a small subset of the papers covered during the semester. Students will also be required to form groups and work with me to identify a small research topic, implement and evaluate their idea, and write a short research paper.
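11. Learning Support for Students with Disabilities
- Coursework: real-time captioning (hearing impairments), course assistance (developmental disabilities), note-taking (all disability types), etc.
- Exams: extended exam time (all disability types, as needed), enlarged-print exam papers (visual impairments), etc.
- For any additional requests, contact the Support Center for Students with Disabilities (279-2434).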