Awesome-KV-Cache-Management

December 5, 2025

News

📢 Multi-turn KV Strategies and Benchmark Released (2025-07-18): "LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues [PDF][Dataset]" 🚀

📢 Two Papers Accepted (2025-05-20): Our survey has been accepted by TMLR 2025, and our numerical benchmark paper has been accepted by ACL 2025! 🚀

📢 Numerical Benchmark Released (2025-02-18): "Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models [PDF][Dataset]", which proposes the NumericBench benchmark to assess LLMs' numerical reasoning! 🚀

A Survey on Large Language Model Acceleration based on KV Cache Management [PDF]

Haoyang Li 1, Yiming Li 2, Anxin Tian 2, Tianhao Tang 2, Zhanchao Xu 4, Xuejia Chen 4, Nicole Hu 3, Wei Dong 5, Qing Li 1, Lei Chen 2

1Hong Kong Polytechnic University, 2Hong Kong University of Science and Technology, 3The Chinese University of Hong Kong, 4Huazhong University of Science and Technology, 5Nanyang Technological University.

  • This repository is dedicated to recording KV Cache Management papers for LLM acceleration. The survey will be updated regularly. If you find this survey helpful for your work, please consider citing it.
  @article{li2024surveylargelanguagemodel,
      title={A Survey on Large Language Model Acceleration based on KV Cache Management}, 
      author={Haoyang Li and Yiming Li and Anxin Tian and Tianhao Tang and Zhanchao Xu and Xuejia Chen and Nicole Hu and Wei Dong and Qing Li and Lei Chen},
      journal={arXiv preprint arXiv:2412.19442},
      year={2024}
  }
  • If you would like to include your paper or suggest any modifications to this survey and repository, please feel free to send an email to haoyang-comp.li@polyu.edu.hk or open an issue with your paper's title, category, and a brief summary highlighting its key techniques. Thank you!

Taxonomy and Papers


Token-level Optimization

KV Cache Selection

Static KV Cache Selection

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | Static KV Cache Selection | ICLR | Link | |
| 2024 | SnapKV: LLM Knows What You are Looking for Before Generation | Static KV Cache Selection | NeurIPS | Link | Link |
| 2024 | A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression | Static KV Cache Selection | EMNLP | Link | Link |
| 2024 | In-context KV-Cache Eviction for LLMs via Attention-Gate | Static KV Cache Selection | arXiv | Link | |

Dynamic Selection with Permanent Eviction

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2025 | Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs | Dynamic Selection with Permanent Eviction | arXiv | Link | Link |
| 2025 | SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator | Dynamic Selection with Permanent Eviction | MLSys | Link | |
| 2024 | Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | Dynamic Selection with Permanent Eviction | MLSys | Link | |
| 2024 | BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference | Dynamic Selection with Permanent Eviction | arXiv | Link | Link |
| 2024 | NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time | Dynamic Selection with Permanent Eviction | ACL | Link | Link |
| 2023 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | Dynamic Selection with Permanent Eviction | NeurIPS | Link | Link |
| 2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | Dynamic Selection with Permanent Eviction | NeurIPS | Link | |

Dynamic Selection without Permanent Eviction

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2025 | LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues | Dynamic Selection without Permanent Eviction | arXiv | Link | |
| 2024 | InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
| 2024 | Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | Dynamic Selection without Permanent Eviction | ICML | Link | Link |
| 2024 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | Dynamic Selection without Permanent Eviction | arXiv | Link | |
| 2024 | Squeezed Attention: Accelerating Long Context Length LLM Inference | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
| 2024 | RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
| 2024 | Human-like Episodic Memory for Infinite Context LLMs | Dynamic Selection without Permanent Eviction | arXiv | Link | |
| 2024 | ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression | Dynamic Selection without Permanent Eviction | arXiv | Link | |
| 2024 | Loki: Low-rank Keys for Efficient Sparse Attention | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |

KV Cache Budget Allocation

Layer-wise Budget Allocation

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | Layer-wise Budget Allocation | arXiv | Link | Link |
| 2024 | PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference | Layer-wise Budget Allocation | Findings | Link | Link |
| 2024 | DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs | Layer-wise Budget Allocation | ICLR sub. | Link | |
| 2024 | PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation | Layer-wise Budget Allocation | arXiv | Link | Link |
| 2024 | SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction | Layer-wise Budget Allocation | arXiv | Link | Link |

Head-wise Budget Allocation

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference | Head-wise Budget Allocation | arXiv | Link | |
| 2024 | Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective | Head-wise Budget Allocation | ICLR sub. | Link | |
| 2024 | Unifying KV Cache Compression for Large Language Models with LeanKV | Head-wise Budget Allocation | arXiv | Link | |
| 2024 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | Head-wise Budget Allocation | arXiv | Link | |
| 2024 | Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning | Head-wise Budget Allocation | arXiv | Link | Link |
| 2024 | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | Head-wise Budget Allocation | arXiv | Link | Link |

KV Cache Merging

Intra-layer Merging

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2025 | ZeroMerge: Parameter-Free KV Cache Compression for Memory-Efficient Long-Context LLMs | Intra-layer Merging | arXiv | Link | Link |
| 2024 | Compressed Context Memory for Online Language Model Interaction | Intra-layer Merging | ICLR | Link | Link |
| 2024 | LoMA: Lossless Compressed Memory Attention | Intra-layer Merging | arXiv | Link | |
| 2024 | Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | Intra-layer Merging | ICML | Link | Link |
| 2024 | CaM: Cache Merging for Memory-efficient LLMs Inference | Intra-layer Merging | ICML | Link | Link |
| 2024 | D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | Intra-layer Merging | arXiv | Link | |
| 2024 | AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | Intra-layer Merging | arXiv | Link | Link |
| 2024 | LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference | Intra-layer Merging | EMNLP | Link | Link |
| 2024 | Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks | Intra-layer Merging | arXiv | Link | |
| 2024 | CHAI: Clustered Head Attention for Efficient LLM Inference | Intra-layer Merging | arXiv | Link | |

Cross-layer Merging

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | Cross-layer Merging | arXiv | Link | Link |
| 2024 | KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cross-Layer Sharing | Cross-layer Merging | arXiv | Link | Link |

KV Cache Quantization

Fixed-precision Quantization

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead | Fixed-precision Quantization | arXiv | Link | Link |
| 2024 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | Fixed-precision Quantization | arXiv | Link | |
| 2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | Fixed-precision Quantization | ICML | Link | Link |
| 2022 | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | Fixed-precision Quantization | NeurIPS | Link | Link |

Mixed-precision Quantization

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2025 | Quantize What Counts: Bit Allocation Insights Informed by Spectral Gaps in Keys and Values | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More | Mixed-precision Quantization | arXiv | Link | |
| 2024 | GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization | Mixed-precision Quantization | arXiv | Link | |
| 2024 | ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification | Mixed-precision Quantization | arXiv | Link | |
| 2024 | ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache | Mixed-precision Quantization | arXiv | Link | |

Outlier Redistribution

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Massive Activations in Large Language Models | Outlier Redistribution | arXiv | Link | Link |
| 2024 | QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | Outlier Redistribution | arXiv | Link | Link |
| 2024 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | Outlier Redistribution | arXiv | Link | Link |
| 2024 | SpinQuant: LLM Quantization with Learned Rotations | Outlier Redistribution | arXiv | Link | Link |
| 2024 | DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs | Outlier Redistribution | NeurIPS | Link | Link |
| 2024 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | Outlier Redistribution | ICML | Link | Link |
| 2024 | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | Outlier Redistribution | EMNLP | Link | Link |
| 2024 | AffineQuant: Affine Transformation Quantization for Large Language Models | Outlier Redistribution | arXiv | Link | Link |
| 2024 | FlatQuant: Flatness Matters for LLM Quantization | Outlier Redistribution | arXiv | Link | Link |
| 2024 | AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration | Outlier Redistribution | MLSys | Link | Link |
| 2023 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | Outlier Redistribution | arXiv | Link | Link |
| 2023 | Training Transformers with 4-bit Integers | Outlier Redistribution | NeurIPS | Link | Link |

KV Cache Low-rank Decomposition

Singular Value Decomposition

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2025 | Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression | Singular Value Decomposition | arXiv | Link | |
| 2024 | Effectively Compress KV Heads for LLM | Singular Value Decomposition | arXiv | Link | |
| 2024 | Eigen Attention: Attention in Low-Rank Space for KV Cache Compression | Singular Value Decomposition | arXiv | Link | Link |
| 2024 | Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference | Singular Value Decomposition | arXiv | Link | |
| 2024 | LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy | Singular Value Decomposition | arXiv | Link | |
| 2024 | ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference | Singular Value Decomposition | arXiv | Link | Link |
| 2024 | Palu: Compressing KV-Cache with Low-Rank Projection | Singular Value Decomposition | arXiv | Link | Link |
| 2024 | Loki: Low-rank Keys for Efficient Sparse Attention | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |

Tensor Decomposition

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression | Tensor Decomposition | ACL | Link | Link |

Learned Low-rank Approximation

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference | Learned Low-rank Approximation | arXiv | Link | Link |
| 2024 | MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection | Learned Low-rank Approximation | arXiv | Link | |

Model-level Optimization

Attention Grouping and Sharing

Intra-Layer Grouping

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Optimised Grouped-Query Attention Mechanism for Transformers | Intra-Layer Grouping | ICML | Link | |
| 2024 | Weighted Grouped Query Attention in Transformers | Intra-Layer Grouping | arXiv | Link | |
| 2024 | QCQA: Quality and Capacity-aware grouped Query Attention | Intra-Layer Grouping | arXiv | Link | Non-official Link |
| 2024 | Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention | Intra-Layer Grouping | arXiv | Link | Link |
| 2023 | GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values | Intra-Layer Grouping | NeurIPS | Link | |
| 2023 | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | Intra-Layer Grouping | EMNLP | Link | Link |
| 2019 | Fast Transformer Decoding: One Write-Head is All You Need | Intra-Layer Grouping | arXiv | Link | |

Cross-Layer Sharing

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Reducing Transformer Key-Value Cache Size with Cross-Layer Attention | Cross-Layer Sharing | arXiv | Link | Non-official Link |
| 2024 | Layer-Condensed KV Cache for Efficient Inference of Large Language Models | Cross-Layer Sharing | ACL | Link | Link |
| 2024 | Beyond KV Caching: Shared Attention for Efficient LLMs | Cross-Layer Sharing | arXiv | Link | Link |
| 2024 | MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding | Cross-Layer Sharing | arXiv | Link | Link |
| 2024 | Cross-layer Attention Sharing for Large Language Models | Cross-Layer Sharing | arXiv | Link | |
| 2024 | A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference | Cross-Layer Sharing | arXiv | Link | |
| 2024 | Lossless KV Cache Compression to 2% | Cross-Layer Sharing | arXiv | Link | |
| 2024 | DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion | Cross-Layer Sharing | NeurIPS | Link | |
| 2024 | Value Residual Learning For Alleviating Attention Concentration In Transformers | Cross-Layer Sharing | arXiv | Link | Link |

Architecture Alteration

Enhanced Attention

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | Enhanced Attention | arXiv | Link | Link |
| 2022 | Transformer Quality in Linear Time | Enhanced Attention | ICML | Link | |
| 2024 | Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention | Enhanced Attention | arXiv | Link | |

Augmented Architecture

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | You Only Cache Once: Decoder-Decoder Architectures for Language Models | Augmented Architecture | arXiv | Link | Link |
| 2024 | Long-Context Language Modeling with Parallel Context Encoding | Augmented Architecture | ACL | Link | Link |
| 2024 | XC-CACHE: Cross-Attending to Cached Context for Efficient LLM Inference | Augmented Architecture | Findings | Link | |
| 2024 | Block Transformer: Global-to-Local Language Modeling for Fast Inference | Augmented Architecture | arXiv | Link | Link |

Non-transformer Architecture

Adaptive Sequence Processing Architecture

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2023 | RWKV: Reinventing RNNs for the Transformer Era | Adaptive Sequence Processing Architecture | Findings | Link | Link |
| 2024 | Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Adaptive Sequence Processing Architecture | arXiv | Link | Link |
| 2023 | Retentive Network: A Successor to Transformer for Large Language Models | Adaptive Sequence Processing Architecture | arXiv | Link | Link |
| 2024 | MCSD: An Efficient Language Model with Diverse Fusion | Adaptive Sequence Processing Architecture | arXiv | Link | |

Hybrid Architecture

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | MixCon: A Hybrid Architecture for Efficient and Adaptive Sequence Modeling | Hybrid Architecture | IOS Press | Link | |
| 2024 | GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression | Hybrid Architecture | arXiv | Link | Link |
| 2024 | RecurFormer: Not All Transformer Heads Need Self-Attention | Hybrid Architecture | arXiv | Link | |

System-level Optimization

Memory Management

Architectural Design

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2025 | eLLM: Elastic Memory Management Framework for Efficient LLM Serving | Architectural Design | arXiv | Link | |
| 2025 | Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving | Architectural Design | ACM | Link | |
| 2024 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Architectural Design | arXiv | Link | Link |
| 2024 | Unifying KV Cache Compression for Large Language Models with LeanKV | Architectural Design | arXiv | Link | |
| 2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Architectural Design | SOSP | Link | Link |

Prefix-aware Design

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2025 | FlashForge: Ultra-Efficient Prefix-Aware Attention for LLM Decoding | Prefix-aware Design | arXiv | Link | |
| 2024 | ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition | Prefix-aware Design | ACL | Link | Link |
| 2024 | MemServe: Flexible MemPool for Building Disaggregated LLM Serving with Caching | Prefix-aware Design | arXiv | Link | |

Scheduling

Prefix-aware Scheduling

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2025 | Echo: Efficient Co-Scheduling of Hybrid Online-Offline Tasks for Large Language Model Serving | Prefix-aware Scheduling | arXiv | Link | |
| 2024 | BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching | Prefix-aware Scheduling | arXiv | Link | |
| 2024 | SGLang: Efficient Execution of Structured Language Model Programs | Prefix-aware Scheduling | NeurIPS | Link | Link |

Preemptive and Fairness-oriented Scheduling

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2025 | FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling | Preemptive and Fairness-oriented Scheduling | arXiv | Link | |
| 2024 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | Preemptive and Fairness-oriented Scheduling | arXiv | Link | |
| 2023 | Fast Distributed Inference Serving for Large Language Models | Preemptive and Fairness-oriented Scheduling | arXiv | Link | |

Layer-specific and Hierarchical Scheduling

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2025 | Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving | Architectural Design | ACM | Link | |
| 2025 | Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints | Architectural Design | arXiv | Link | |
| 2024 | LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management | Layer-specific and Hierarchical Scheduling | arXiv | Link | Link |
| 2024 | Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | Layer-specific and Hierarchical Scheduling | USENIX ATC | Link | |
| 2024 | ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching | Layer-specific and Hierarchical Scheduling | ISCA | Link | |
| 2024 | Fast Inference for Augmented Large Language Models | Layer-specific and Hierarchical Scheduling | arXiv | Link | |

Hardware-aware Design

Single/Multi-GPU Design

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2025 | gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling | Single/Multi-GPU Design | arXiv | Link | |
| 2025 | FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference | Single/Multi-GPU Design | arXiv | Link | |
| 2025 | Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management | Single/Multi-GPU Design | arXiv | Link | |
| 2024 | Hydragen: High-Throughput LLM Inference with Shared Prefixes | Single/Multi-GPU Design | arXiv | Link | Link |
| 2024 | DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference | Single/Multi-GPU Design | arXiv | Link | |
| 2024 | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | Single/Multi-GPU Design | OSDI | Link | Link |
| 2024 | Multi-Bin Batching for Increasing LLM Inference Throughput | Single/Multi-GPU Design | arXiv | Link | |
| 2024 | Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters | Single/Multi-GPU Design | arXiv | Link | Link |
| 2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Single/Multi-GPU Design | SOSP | Link | Link |
| 2022 | Orca: A Distributed Serving System for Transformer-Based Generative Models | Single/Multi-GPU Design | OSDI | Link | |

I/O-based Design

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs | I/O-based Design | arXiv | Link | Link |
| 2024 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | I/O-based Design | arXiv | Link | |
| 2024 | Fast State Restoration in LLM Serving with HCache | I/O-based Design | arXiv | Link | |
| 2024 | Compute Or Load KV Cache? Why Not Both? | I/O-based Design | arXiv | Link | |
| 2024 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | I/O-based Design | arXiv | Link | |
| 2024 | FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | I/O-based Design | arXiv | Link | Link |
| 2023 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | I/O-based Design | arXiv | Link | Link |
| 2022 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | I/O-based Design | NeurIPS | Link | Link |

Heterogeneous Design

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2025 | HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading | Heterogeneous Design | arXiv | Link | |
| 2025 | Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs | Heterogeneous Design | arXiv | Link | |
| 2024 | NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | Heterogeneous Design | arXiv | Link | |
| 2024 | FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines | Heterogeneous Design | arXiv | Link | |
| 2024 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Heterogeneous Design | arXiv | Link | |
| 2024 | InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management | Heterogeneous Design | arXiv | Link | |
| 2023 | Fast Distributed Inference Serving for Large Language Models | Heterogeneous Design | arXiv | Link | |
| 2024 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | Heterogeneous Design | arXiv | Link | |
| 2023 | Stateful Large Language Model Serving with Pensieve | Heterogeneous Design | arXiv | Link | |

SSD-based Design

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference | SSD-based Design | arXiv | Link | |
| 2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | SSD-based Design | ICML | Link | Link |

Datasets and Benchmarks

Please refer to our paper for detailed information on this section.