Benchmarking KV-Cache Optimizations across Task Quality and System Performance for Long-Context Serving [Experiment, Analysis & Benchmark]

Title details

Agrawal, Nikita; Mayer, Ruben:
Benchmarking KV-Cache Optimizations across Task Quality and System Performance for Long-Context Serving [Experiment, Analysis & Benchmark].
Bayreuth, 2026. - 13 pp.
DOI: https://doi.org/10.15495/EPub_UBT_00009365

Abstract

Large language model serving is increasingly limited by KV-cache growth under long-context workloads, yet existing KV-cache compression techniques are difficult to compare because they were evaluated on different models, tasks, budgets, and serving stacks. This paper presents a workload-aware benchmark of representative KV-cache optimization mechanisms spanning quantization, pruning, and merging, including KIVI, TurboQuant, SnapKV, and CaM, evaluated on LongBench-style multi-document QA, single-document QA, few-shot learning, and summarization workloads using Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3. The benchmark measures task quality, mean output throughput, mean time-to-first-token, and realized compression ratio across context-length buckets. The results show that the compression ratio alone is a poor predictor of end-to-end performance. KIVI4 provides the most stable quality across models, SnapKV delivers the strongest long-context throughput, and CaM yields large gains on selected QA workloads but exhibits substantial workload sensitivity in both quality and realized compression ratio. These findings motivate workload-aware selection of KV-cache mechanisms rather than one-size-fits-all compression and provide deployment guidance for long-context serving systems.

Further details

Publication type: Preprint, Postprint
Additional information: Submitted to: Proceedings of the VLDB Endowment, ISSN 2150-8097
Keywords: KV-caching; Memory-bound computing; LLMs; Long-context inference
University institutions: Fakultäten > Fakultät für Mathematik, Physik und Informatik > Institut für Informatik > Lehrstuhl Angewandte Informatik X > Lehrstuhl Angewandte Informatik X - Univ.-Prof. Dr. Ruben Mayer
Result of work at the UBT: Yes
DDC subjects: 000 Computer science, information science, general works > 004 Computer science
Date deposited: 13 Jun 2026 21:00
Last modified: 13 Jun 2026 21:00
URI: https://eref.uni-bayreuth.de/id/eprint/98838