<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>omrimallis</title>
    <description>My personal blog</description>
    <link>https://astro-paper.pages.dev/</link>
    <item>
      <title>Techniques for KV Cache Optimization in Large Language Models</title>
      <link>https://astro-paper.pages.dev/posts/techniques-for-kv-cache-optimization/</link>
      <guid isPermaLink="true">https://astro-paper.pages.dev/posts/techniques-for-kv-cache-optimization/</guid>
      <description>This post explores techniques for optimizing the Key-Value (KV) cache in large language models, from Grouped-query attention to PagedAttention and distributed cache management.</description>
      <pubDate>Sun, 25 Feb 2024 08:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Understanding how LLM inference works with llama.cpp</title>
      <link>https://astro-paper.pages.dev/posts/understanding-how-llm-inference-works-with-llama-cpp/</link>
      <guid isPermaLink="true">https://astro-paper.pages.dev/posts/understanding-how-llm-inference-works-with-llama-cpp/</guid>
      <description>In this post we will understand how large language models (LLMs) answer user prompts by exploring the source code of llama.cpp, a C++ implementation of LLaMA, covering subjects such as tokenization, embedding, self-attention and sampling.</description>
      <pubDate>Sat, 11 Nov 2023 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>IOPS, the silent killer of cloud databases</title>
      <link>https://astro-paper.pages.dev/posts/iops-the-silent-killer-of-cloud-databases/</link>
      <guid isPermaLink="true">https://astro-paper.pages.dev/posts/iops-the-silent-killer-of-cloud-databases/</guid>
      <description>Despite advancements in cloud infrastructure and storage technology, IOPS is still a significant bottleneck for cloud databases. This post explains the source of this bottleneck and techniques to solve it.</description>
      <pubDate>Sun, 20 Aug 2023 08:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>