New top story on Hacker News: Show HN: KVSplit – Run 2-3× longer contexts on Apple Silicon

66 by dipampaul17 | 7 comments on Hacker News.
I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- The configurations use the same number of bits, but K8V4 is 7× better for quality

This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.

Implementation was straightforward:

1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied existing quantization logic separately to K and V tensors
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with -mlong-calls flag to avoid vectorization issues)

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://ift.tt/Q8TP74l
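The memory figures above can be sanity-checked with a little arithmetic. The sketch below is not taken from the KVSplit code. It assumes the 8-bit and 4-bit modes use ggml-style block quantization (32 elements per block plus one FP16 scale, as in llama.cpp's Q8_0 and Q4_0 formats) and that the baseline is an FP16 cache; the post states neither. The TinyLlama shape used for the size estimate (22 layers, 4 KV heads, head dimension 64) is likewise an assumption, and all function names are illustrative.

```python
# Back-of-the-envelope KV cache sizing. Assumptions (not stated in the post):
# block quantization with 32 elements per block and one FP16 scale per block,
# and an FP16 baseline cache.

BLOCK_SIZE = 32      # elements per quantization block (assumed)
SCALE_BITS = 16      # one FP16 scale per block (assumed)
FP16_BITS = 16.0     # baseline bits per cached element


def bits_per_element(payload_bits: int) -> float:
    """Effective bits per element once the per-block scale is included."""
    return (BLOCK_SIZE * payload_bits + SCALE_BITS) / BLOCK_SIZE


def kv_reduction(key_bits: int, val_bits: int) -> float:
    """Fraction of KV cache memory saved relative to FP16 keys and values."""
    quantized = bits_per_element(key_bits) + bits_per_element(val_bits)
    return 1.0 - quantized / (2 * FP16_BITS)


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, key_bpe: float, val_bpe: float) -> float:
    """Total cache size in bytes; grows linearly with seq_len."""
    elements_per_tensor = n_layers * n_kv_heads * head_dim * seq_len
    return elements_per_tensor * (key_bpe + val_bpe) / 8


if __name__ == "__main__":
    reduction = kv_reduction(8, 4)
    print(f"K8V4 reduction: {reduction:.1%}")
    print(f"K4V8 reduction: {kv_reduction(4, 8):.1%}")
    # Context that fits in the same memory budget scales as 1 / (1 - reduction).
    print(f"Context multiplier: {1 / (1 - reduction):.2f}x")

    # Hypothetical TinyLlama-like shape at an 8K context (assumed values).
    shape = dict(n_layers=22, n_kv_heads=4, head_dim=64, seq_len=8192)
    fp16 = kv_cache_bytes(**shape, key_bpe=FP16_BITS, val_bpe=FP16_BITS)
    k8v4 = kv_cache_bytes(**shape, key_bpe=bits_per_element(8),
                          val_bpe=bits_per_element(4))
    print(f"FP16 cache: {fp16 / 2**20:.1f} MiB, K8V4 cache: {k8v4 / 2**20:.1f} MiB")
```

Under these assumptions the effective widths are 8.5 and 4.5 bits per element, so both K8V4 and K4V8 save 1 - 13/32, or about 59%, which matches the reported reduction and explains why the two configurations differ only in perplexity. The same ratio gives roughly a 2.5× longer context for a fixed cache budget, in line with the 2-3× claim.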
