Awesome AI System

May 14, 2026

This repo is motivated by awesome tensor compilers.

Contents

Paper-Code

Researcher

| Name | University | Homepage |
| --- | --- | --- |
| Ion Stoica | UC Berkeley | Website |
| Joseph E. Gonzalez | UC Berkeley | Website |
| Matei Zaharia | UC Berkeley | Website |
| Zhihao Jia | CMU | Website |
| Tianqi Chen | CMU | Website |
| Stephanie Wang | UW | Website |
| Xingda Wei | SJTU | Website |
| Zeyu Min | SJTU | Website |
| Xin Jin | PKU | Website |
| Harry Xu | UCLA | Website |
| Anand Iyer | Georgia Tech | Website |
| Ravi Netravali | Princeton | Website |
| Christos Kozyrakis | Stanford | Website |
| Christopher Ré | Stanford | Website |
| Tri Dao | Princeton | Website |
| Mosharaf Chowdhury | UMich | Website |
| Shivaram Venkataraman | Wisc | Website |
| Hao Zhang | UCSD | Website |
| Yiying Zhang | UCSD | Website |
| Ana Klimovic | ETH | Website |
| Fan Lai | UIUC | Website |
| Lianmin Zheng | UC Berkeley | Website |
| Ying Sheng | Stanford | Website |
| Zhuohan Li | UC Berkeley | Website |
| Woosuk Kwon | UC Berkeley | Website |
| Zihao Ye | University of Washington | Website |
| Amey Agrawal | Georgia Tech | Website |

LLM Serving Framework

| Title | Github |
| --- | --- |
| MLC LLM | Star |
| TensorRT-LLM | Star |
| xFasterTransformer | Star |
| CTranslate2 (low latency) | Star |
| llama2.c | Star |

LLM Evaluation Platform

| Title | Github | Website |
| --- | --- | --- |
| FastChat | Star | Website |

LLM Robustness and Debugging

| Title | Paper | Github | Pub. & Date |
| --- | --- | --- | --- |
| WFGY 1.0: Self-healing LLM Systems Framework | DOI, PDF | Star | Tech report, Oct 13 2025 |

LLM Inference (System Side)

| Title | Paper | Github | WebSite | Pub. & Date |
| --- | --- | --- | --- | --- |
| SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling | arXiv | Star | - | SIGMETRICS'26 |
| HydraServe: Minimizing Cold Start Latency for Serverless LLM Serving in Public Clouds | arXiv | Star | - | NSDI'26 |
| BulletServe: Boosting LLM Serving through Spatial-Temporal GPU Resource Sharing | arXiv | Star | - | ASPLOS'26 |
| Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market | arXiv | | - | SOSP'25 |
| DynaPipe: Dynamic Layer Redistribution for Efficient Serving of LLMs with Pipeline Parallelism | arXiv | Star | - | NeurIPS'25 |
| DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction | arXiv | Star | - | SOSP'25 |
| Pie: A Programmable Serving System for Emerging LLM Applications | arXiv | Star | - | SOSP'25 |
| KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models | arXiv | Star | - | SOSP'25 |
| XSched: Preemptive Scheduling for Diverse XPUs | arXiv | Star | - | OSDI'25 |
| TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference | arXiv | Star | - | arXiv'25 |
| ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production | arXiv | Star | - | arXiv'25 |
| Resource Multiplexing in Tuning and Serving Large Language Models | arXiv | Star | - | ATC'25 |
| RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference | arXiv | Star | - | arXiv, May 2025 |
| SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting | arXiv | Star | - | ISCA'25 |
| LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading | arXiv | Star | - | ISCA'25 |
| Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving | arXiv | Star | - | SIGMOD'25 |
| Marconi: Prefix Caching for the Era of Hybrid LLMs | arXiv | Star | - | MLSys'25 |
| SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs | arXiv | Star | - | EuroSys'25 Best Paper |
| NeuStream: Bridging Deep Learning Serving and Stream Processing | arXiv | Star | - | EuroSys'25 |
| Towards End-to-End Optimization of LLM-based Applications with Ayo | arXiv | Star | - | ASPLOS'25 |
| NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | arXiv | Star | - | MLSys'25 |
| CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | arXiv | Star | - | EuroSys'25 Best Paper |
| Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow | arXiv | Star | - | ASPLOS'25 |
| GLINTHAWK: A Two-Tiered Architecture for High-Throughput LLM Inference | arXiv | Star | - | arXiv, Jan 2025 |
| Queue Management for SLO-Oriented Large Language Model Serving | arXiv | Star | - | SoCC'24 |
| NanoFlow: Towards Optimal Large Language Model Serving Throughput | arXiv | Star | - | OSDI'25 |
| PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | arXiv | Star | - | SOSP'24 |
| LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | arXiv | Star | - | SOSP'24 |
| Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | arXiv | Star | - | MLSys'24 |
| LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference | arXiv | Star | - | ISCA'24 |
| Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference | arXiv | Star | - | ISCA'24 |
| Prompt Cache: Modular Attention Reuse for Low-Latency Inference | arXiv | Star | - | MLSys'24 |
| Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | arXiv | Star | - | OSDI'24 |
| DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | arXiv | Star | - | OSDI'24 |
| Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | arXiv | Star | - | July 2024 |
| Llumnix: Dynamic Scheduling for Large Language Model Serving | arXiv | Star | - | OSDI'24 |
| Parrot: Efficient Serving of LLM-based Application with Semantic Variables | arXiv | Star | - | OSDI'24 |
| CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming | arXiv | Star | - | SIGCOMM'24 |
| Efficiently Programming Large Language Models using SGLang | arXiv | Star | - | Jan 2024 |
| Efficient Memory Management for Large Language Model Serving with PagedAttention | arXiv | Star | - | SOSP'23 |
| SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification | arXiv | Star | - | Dec 2023 |
| Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference | - | Star | - | PPoPP'24 |
| Efficiently Programming Large Language Models using SGLang | arXiv | Star | - | NeurIPS'24 |
| Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | arXiv | Star | - | VLDB'24 |

Compiler

| Title | Paper | Github | WebSite | Pub. & Date |
| --- | --- | --- | --- | --- |
| Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling | arXiv | Star | - | SOSP'25 |
| Mirage: A Multi-Level Superoptimizer for Tensor Programs | arXiv | Star | - | OSDI'25 |

Attention

| Title | Paper | Github | WebSite | Pub. & Date |
| --- | --- | --- | --- | --- |
| UltraAttn: Efficiently Parallelizing Attention through Hierarchical Context-Tiling | arXiv | Star | - | SC'25 |
| TASP: Topology-aware Sequence Parallelism | arXiv | Star | - | arXiv'25 |
| Ring Attn | | Star | - | |

RAG And ANNS

| Title | Paper | Github | WebSite | Pub. & Date |
| --- | --- | --- | --- | --- |
| HedraRAG: Co-Optimizing Generation and Retrieval for Heterogeneous RAG Workflows | arXiv | Star | - | SOSP'25 |
| LEANN: A Low-Storage Vector Index | arXiv | Star | - | arXiv'25 |
| OdinANN: Direct Insert for Consistently Stable Performance in Billion-Scale Graph-Based Vector Search | arXiv | Star | - | FAST'26 |
| Achieving Low-Latency Graph-Based Vector Search via Aligning Best-First Search Algorithm with SSD | arXiv | Star | - | OSDI'25 |
| Quake: Adaptive Indexing for Vector Search | arXiv | Star | - | OSDI'25 |
| Hermes: Algorithm-System Co-design for Efficient Retrieval Augmented Generation At-Scale | arXiv | Star | - | ISCA'25 |
| PathWeaver: A High-Throughput Multi-GPU System for Graph-Based Approximate Nearest Neighbor Search | arXiv | Star | - | ATC'25 |
| In-Storage Acceleration of Retrieval Augmented Generation as a Service: Artifact Evaluation README | arXiv | Star | - | ISCA'25 |
| RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving | arXiv | Star | - | ISCA'25 |

RLHF

| Title | Paper | Github | WebSite | Pub. & Date |
| --- | --- | --- | --- | --- |
| Optimizing RLHF Training for Large Language Models with Stage Fusion | arXiv | Star | - | NSDI'25 |
| HybridFlow: A Flexible and Efficient RLHF Framework | arXiv | Star | - | EuroSys'25 |
| ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation | arXiv | Star | - | June 2024 |
| OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework | arXiv | Star | - | May 2024 |

Video

| Title | Paper | Github | WebSite | Pub. & Date |
| --- | --- | --- | --- | --- |
| Katz: Efficient Workflow Serving for Diffusion Models with Many Adapters | arXiv | Star | - | ATC'25 |
| PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via Pool-Based Pipeline Parallelism | arXiv | Star | - | Nov. 2024 |
| xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism | arXiv | Star | - | Nov. 2024 |
| FastVideo | arXiv | Star | - | Dec. 2024 |

LLM Inference (AI Side)

| Title | Paper | Github | WebSite | Pub. & Date |
| --- | --- | --- | --- | --- |
| InferCept: Efficient Intercept Support for Augmented Large Language Model Inference | arXiv | Star | - | ICML'24 |
| Online Speculative Decoding | arXiv | Star | - | ICML'24 |
| MuxServe: Flexible Spatial-Temporal Multiplexing for LLM Serving | arXiv | Star | - | ICML'24 |
| BitDelta: Your Fine-Tune May Only Be Worth One Bit | arXiv | Star | - | Feb 2024 |
| Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads | arXiv | Star | - | Jan 2024 |
| LLMCompiler: An LLM Compiler for Parallel Function Calling | arXiv | Star | - | Dec 2023 |
| Mamba: Linear-Time Sequence Modeling with Selective State Spaces | arXiv | Star | - | Dec 2023 |
| Teaching LLMs memory management for unbounded context | arXiv | Star | - | Oct 2023 |
| Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | arXiv | Star | - | Feb 2024 |
| EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation | arXiv | Star | - | Jan 2024 |

LLM MoE

| Title | Paper | Github | WebSite | Pub. & Date |
| --- | --- | --- | --- | --- |
| Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference | arXiv | Star | - | ISCA'24 |
| SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models | arXiv | Star | - | MLSys'24 |
| ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling | arXiv | Star | - | EuroSys'24 |

LoRA

| Title | Paper | Github | WebSite | Pub. & Date |
| --- | --- | --- | --- | --- |
| LoRAFusion: Efficient LoRA Fine-Tuning for LLMs | arXiv | Star | - | EuroSys'26 |
| dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving | arXiv | Star | - | OSDI'24 |
| S-LoRA: Serving Thousands of Concurrent LoRA Adapters | arXiv | Star | - | Nov 2023 |
| Punica: Serving multiple LoRA finetuned LLM as one | arXiv | Star | - | Oct 2023 |

Framework

Parallelism Training

Training

Communication

Serving-Inference

MoE

GPU Cluster Management

Schedule and Resource Management

Optimization

GNN

Fine-Tune

Energy

Misc

Contribute

We encourage all contributions to this repository. Open an issue or send a pull request.