<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Karthik Kaiplody]]></title><description><![CDATA[AI engineer, writing about production AI systems and inference.]]></description><link>https://karthikkaiplody.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!Pxh9!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F448faa80-5f84-46a9-a640-fcfe8c0eb21a_2012x2012.jpeg</url><title>Karthik Kaiplody</title><link>https://karthikkaiplody.substack.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 15 May 2026 13:52:25 GMT</lastBuildDate><atom:link href="https://karthikkaiplody.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Karthik Kaiplody]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[karthikkaiplody@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[karthikkaiplody@substack.com]]></itunes:email><itunes:name><![CDATA[Karthik Kaiplody]]></itunes:name></itunes:owner><itunes:author><![CDATA[Karthik Kaiplody]]></itunes:author><googleplay:owner><![CDATA[karthikkaiplody@substack.com]]></googleplay:owner><googleplay:email><![CDATA[karthikkaiplody@substack.com]]></googleplay:email><googleplay:author><![CDATA[Karthik Kaiplody]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Measuring AI Agent Latency Beyond Single-Call Benchmarks]]></title><description><![CDATA[Part 2 of 3 &#8212; The Experiments]]></description><link>https://karthikkaiplody.substack.com/p/measuring-ai-agent-latency-beyond</link><guid 
isPermaLink="false">https://karthikkaiplody.substack.com/p/measuring-ai-agent-latency-beyond</guid><dc:creator><![CDATA[Karthik Kaiplody]]></dc:creator><pubDate>Sun, 10 May 2026 23:07:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IXqr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7f4a02-6dd7-4096-a12e-d9dc5d697f7d_1472x684.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://karthikkaiplody.substack.com/p/from-request-to-silicon-inside-an?r=1v1pma">Part 1</a> covered the architecture. This post covers the experiments, what two benchmark tiers showed, and why the results looked completely different depending on which one you ran.</em></p><div><hr></div><p><strong>Same GPU. Same model. One optimization enabled.</strong></p><p>If you only ran the standard gateway-direct test, you would see identical numbers and have no reason to prefer one configuration over the other. 
The agent end-to-end tier is what shows the actual difference.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IXqr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7f4a02-6dd7-4096-a12e-d9dc5d697f7d_1472x684.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IXqr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7f4a02-6dd7-4096-a12e-d9dc5d697f7d_1472x684.png 424w, https://substackcdn.com/image/fetch/$s_!IXqr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7f4a02-6dd7-4096-a12e-d9dc5d697f7d_1472x684.png 848w, https://substackcdn.com/image/fetch/$s_!IXqr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7f4a02-6dd7-4096-a12e-d9dc5d697f7d_1472x684.png 1272w, https://substackcdn.com/image/fetch/$s_!IXqr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7f4a02-6dd7-4096-a12e-d9dc5d697f7d_1472x684.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IXqr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7f4a02-6dd7-4096-a12e-d9dc5d697f7d_1472x684.png" width="1456" height="677" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da7f4a02-6dd7-4096-a12e-d9dc5d697f7d_1472x684.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:677,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75805,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://karthikkaiplody.substack.com/i/197144840?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7f4a02-6dd7-4096-a12e-d9dc5d697f7d_1472x684.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IXqr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7f4a02-6dd7-4096-a12e-d9dc5d697f7d_1472x684.png 424w, https://substackcdn.com/image/fetch/$s_!IXqr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7f4a02-6dd7-4096-a12e-d9dc5d697f7d_1472x684.png 848w, https://substackcdn.com/image/fetch/$s_!IXqr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7f4a02-6dd7-4096-a12e-d9dc5d697f7d_1472x684.png 1272w, https://substackcdn.com/image/fetch/$s_!IXqr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7f4a02-6dd7-4096-a12e-d9dc5d697f7d_1472x684.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div class="callout-block" data-callout="true"><p>Quick recap from <a href="https://karthikkaiplody.substack.com/p/from-request-to-silicon-inside-an?r=1v1pma">Part 1</a>: A user query hits a proxy gateway, which routes it to a model server running on a cloud GPU. The agent doesn&#8217;t call the model once; it calls it 3 to 5 times per task, with each call carrying the full conversation history. A single-call benchmark doesn&#8217;t capture any of that.</p></div><h2>Experiment Design: Three Optimization Paths</h2><p>I tested three vLLM server configurations (or &#8220;arms&#8221;) on a single Lambda A10G-24GB using <strong>Qwen2.5-3B-Instruct</strong>. 
The goal was to isolate how different infrastructure flags handle agentic workloads.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z0tW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c58a999-c136-4704-8b43-25165e337917_1472x1216.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z0tW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c58a999-c136-4704-8b43-25165e337917_1472x1216.png 424w, https://substackcdn.com/image/fetch/$s_!Z0tW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c58a999-c136-4704-8b43-25165e337917_1472x1216.png 848w, https://substackcdn.com/image/fetch/$s_!Z0tW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c58a999-c136-4704-8b43-25165e337917_1472x1216.png 1272w, https://substackcdn.com/image/fetch/$s_!Z0tW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c58a999-c136-4704-8b43-25165e337917_1472x1216.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z0tW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c58a999-c136-4704-8b43-25165e337917_1472x1216.png" width="1456" height="1203" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c58a999-c136-4704-8b43-25165e337917_1472x1216.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1203,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:137652,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://karthikkaiplody.substack.com/i/197144840?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c58a999-c136-4704-8b43-25165e337917_1472x1216.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z0tW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c58a999-c136-4704-8b43-25165e337917_1472x1216.png 424w, https://substackcdn.com/image/fetch/$s_!Z0tW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c58a999-c136-4704-8b43-25165e337917_1472x1216.png 848w, https://substackcdn.com/image/fetch/$s_!Z0tW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c58a999-c136-4704-8b43-25165e337917_1472x1216.png 1272w, https://substackcdn.com/image/fetch/$s_!Z0tW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c58a999-c136-4704-8b43-25165e337917_1472x1216.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Prompt Growth in Agentic Loops</h2><p>Unlike a simple chatbot, an agent maintains a running transcript of every tool call and search result. By the third turn of a conversation, the prompt often exceeds 2,000 tokens.</p><p>In a default setup, the GPU reads this entire 2,000-token prompt in one uninterrupted block. 
While it&#8217;s doing that, every other request in the system is forced to wait &#8212; a scheduling bottleneck that compounds under concurrency.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Sqia!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216989f0-7c4b-4439-889f-18f78981fa81_1472x1072.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Sqia!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216989f0-7c4b-4439-889f-18f78981fa81_1472x1072.png 424w, https://substackcdn.com/image/fetch/$s_!Sqia!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216989f0-7c4b-4439-889f-18f78981fa81_1472x1072.png 848w, https://substackcdn.com/image/fetch/$s_!Sqia!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216989f0-7c4b-4439-889f-18f78981fa81_1472x1072.png 1272w, https://substackcdn.com/image/fetch/$s_!Sqia!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216989f0-7c4b-4439-889f-18f78981fa81_1472x1072.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Sqia!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216989f0-7c4b-4439-889f-18f78981fa81_1472x1072.png" width="1456" height="1060" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/216989f0-7c4b-4439-889f-18f78981fa81_1472x1072.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1060,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:168282,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://karthikkaiplody.substack.com/i/197144840?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216989f0-7c4b-4439-889f-18f78981fa81_1472x1072.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Sqia!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216989f0-7c4b-4439-889f-18f78981fa81_1472x1072.png 424w, https://substackcdn.com/image/fetch/$s_!Sqia!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216989f0-7c4b-4439-889f-18f78981fa81_1472x1072.png 848w, https://substackcdn.com/image/fetch/$s_!Sqia!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216989f0-7c4b-4439-889f-18f78981fa81_1472x1072.png 1272w, https://substackcdn.com/image/fetch/$s_!Sqia!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216989f0-7c4b-4439-889f-18f78981fa81_1472x1072.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This is the same proxy-to-model path from Part 1, now under concurrent load. The growing transcript is what turns a latency problem into a scheduling problem.</p><h2>Technical Context: The Prefill Phase</h2><p>LLM inference happens in two stages:</p><ul><li><p><strong>Prefill:</strong> The model reads and processes the entire prompt. This is computationally expensive, and the cost grows with prompt length.</p></li><li><p><strong>Generation:</strong> The model writes the response token-by-token.</p></li></ul><p>Chunked prefill forces the system to process that long prompt in pieces, letting other requests run in between. 
This keeps concurrent requests from stalling behind a long prefill block.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5cEq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b5cc1b-b88a-4e4c-9583-60b8cabd5ab4_1472x532.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5cEq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b5cc1b-b88a-4e4c-9583-60b8cabd5ab4_1472x532.png 424w, https://substackcdn.com/image/fetch/$s_!5cEq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b5cc1b-b88a-4e4c-9583-60b8cabd5ab4_1472x532.png 848w, https://substackcdn.com/image/fetch/$s_!5cEq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b5cc1b-b88a-4e4c-9583-60b8cabd5ab4_1472x532.png 1272w, https://substackcdn.com/image/fetch/$s_!5cEq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b5cc1b-b88a-4e4c-9583-60b8cabd5ab4_1472x532.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5cEq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b5cc1b-b88a-4e4c-9583-60b8cabd5ab4_1472x532.png" width="1456" height="526" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03b5cc1b-b88a-4e4c-9583-60b8cabd5ab4_1472x532.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:526,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58143,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://karthikkaiplody.substack.com/i/197144840?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b5cc1b-b88a-4e4c-9583-60b8cabd5ab4_1472x532.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5cEq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b5cc1b-b88a-4e4c-9583-60b8cabd5ab4_1472x532.png 424w, https://substackcdn.com/image/fetch/$s_!5cEq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b5cc1b-b88a-4e4c-9583-60b8cabd5ab4_1472x532.png 848w, https://substackcdn.com/image/fetch/$s_!5cEq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b5cc1b-b88a-4e4c-9583-60b8cabd5ab4_1472x532.png 1272w, https://substackcdn.com/image/fetch/$s_!5cEq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b5cc1b-b88a-4e4c-9583-60b8cabd5ab4_1472x532.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><blockquote><p><strong>Side note:</strong> The direction this points toward is separating prefill and decode onto different hardware entirely, an approach known as disaggregated prefill/decode. The two phases have different GPU requirements. Worth knowing this exists if you&#8217;re thinking about where inference optimization goes next.</p></blockquote><h2>Setting Targets: SLOs and Performance Hypotheses</h2><p>I defined these Service Level Objectives (SLOs) before looking at the data. 
An SLO written after the fact isn&#8217;t a goal &#8212; it&#8217;s just a description of what happened.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jjKq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78b54e2f-7207-4ea9-9d92-dfe38e4c59a2_1472x414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jjKq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78b54e2f-7207-4ea9-9d92-dfe38e4c59a2_1472x414.png 424w, https://substackcdn.com/image/fetch/$s_!jjKq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78b54e2f-7207-4ea9-9d92-dfe38e4c59a2_1472x414.png 848w, https://substackcdn.com/image/fetch/$s_!jjKq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78b54e2f-7207-4ea9-9d92-dfe38e4c59a2_1472x414.png 1272w, https://substackcdn.com/image/fetch/$s_!jjKq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78b54e2f-7207-4ea9-9d92-dfe38e4c59a2_1472x414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jjKq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78b54e2f-7207-4ea9-9d92-dfe38e4c59a2_1472x414.png" width="1456" height="410" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78b54e2f-7207-4ea9-9d92-dfe38e4c59a2_1472x414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:410,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:47992,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://karthikkaiplody.substack.com/i/197144840?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78b54e2f-7207-4ea9-9d92-dfe38e4c59a2_1472x414.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jjKq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78b54e2f-7207-4ea9-9d92-dfe38e4c59a2_1472x414.png 424w, https://substackcdn.com/image/fetch/$s_!jjKq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78b54e2f-7207-4ea9-9d92-dfe38e4c59a2_1472x414.png 848w, https://substackcdn.com/image/fetch/$s_!jjKq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78b54e2f-7207-4ea9-9d92-dfe38e4c59a2_1472x414.png 1272w, https://substackcdn.com/image/fetch/$s_!jjKq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78b54e2f-7207-4ea9-9d92-dfe38e4c59a2_1472x414.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Pre-Experiment Predictions</h3><ul><li><p><strong>Chunked prefill should reduce tail latency:</strong> Agents send long prompts but give short answers &#8212; exactly the scheduling pattern this flag is designed for.</p></li><li><p><strong>Speculative decoding is unlikely to help here:</strong> It works best on predictable outputs. Agent tool-call responses are JSON with variable structure, so the draft model&#8217;s acceptance rate should be low, which adds overhead rather than saving time.</p></li><li><p><strong>Prefix caching should activate consistently:</strong> Every agent starts with the same ~900-token system prompt. vLLM should reuse the cached KV blocks for that prefix instead of recomputing them on every call.</p></li></ul><h2>Functional Validation: The Golden Set</h2><p>Speed is irrelevant if the agent is producing wrong answers. 
Before running load tests, I ran a &#8220;Golden Set&#8221; of 10 fixed queries to verify accuracy.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aPnh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cb9acb-6911-4f54-8a87-21430575287a_1472x520.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aPnh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cb9acb-6911-4f54-8a87-21430575287a_1472x520.png 424w, https://substackcdn.com/image/fetch/$s_!aPnh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cb9acb-6911-4f54-8a87-21430575287a_1472x520.png 848w, https://substackcdn.com/image/fetch/$s_!aPnh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cb9acb-6911-4f54-8a87-21430575287a_1472x520.png 1272w, https://substackcdn.com/image/fetch/$s_!aPnh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cb9acb-6911-4f54-8a87-21430575287a_1472x520.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aPnh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cb9acb-6911-4f54-8a87-21430575287a_1472x520.png" width="1456" height="514" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54cb9acb-6911-4f54-8a87-21430575287a_1472x520.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:514,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60223,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://karthikkaiplody.substack.com/i/197144840?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cb9acb-6911-4f54-8a87-21430575287a_1472x520.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aPnh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cb9acb-6911-4f54-8a87-21430575287a_1472x520.png 424w, https://substackcdn.com/image/fetch/$s_!aPnh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cb9acb-6911-4f54-8a87-21430575287a_1472x520.png 848w, https://substackcdn.com/image/fetch/$s_!aPnh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cb9acb-6911-4f54-8a87-21430575287a_1472x520.png 1272w, https://substackcdn.com/image/fetch/$s_!aPnh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cb9acb-6911-4f54-8a87-21430575287a_1472x520.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The one failure (query gs-010) was a model-level limitation: the 3B model answered from its own knowledge instead of calling the search tool.</p><h2>Tiered Benchmarking: Gateway vs. End-to-End</h2><p>When concurrency increased, the infrastructure differences became visible.</p><h3>Tier 1: Gateway-Direct</h3><p>This is what standard benchmarks measure, and showing a null result here is the point, not a limitation.</p><p>16 concurrent users each sent a single isolated call. p95 is the response time that 95% of requests finish within; only the slowest 5% take longer.</p><p><strong>Finding: Baseline and Chunked Prefill were identical at 2.4s (p95).</strong></p><blockquote><p>If you stopped here, you would conclude the optimization made no difference. 
That conclusion would be wrong &#8212; gateway-direct measures a single isolated call, not the multi-call sequence an agent actually runs.</p></blockquote><h3>Tier 2: Agent End-to-End</h3><p>4 concurrent users each ran full agent tasks (3&#8211;5 sequential model calls).</p><p><strong>Finding: Baseline latency reached 15.0s, while Chunked Prefill held at 5.8s.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L10D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72f78ec-4db9-4535-8b6a-844b2148c050_1472x850.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L10D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72f78ec-4db9-4535-8b6a-844b2148c050_1472x850.png 424w, https://substackcdn.com/image/fetch/$s_!L10D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72f78ec-4db9-4535-8b6a-844b2148c050_1472x850.png 848w, https://substackcdn.com/image/fetch/$s_!L10D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72f78ec-4db9-4535-8b6a-844b2148c050_1472x850.png 1272w, https://substackcdn.com/image/fetch/$s_!L10D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72f78ec-4db9-4535-8b6a-844b2148c050_1472x850.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L10D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72f78ec-4db9-4535-8b6a-844b2148c050_1472x850.png" width="1456" height="841" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c72f78ec-4db9-4535-8b6a-844b2148c050_1472x850.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:841,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82091,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://karthikkaiplody.substack.com/i/197144840?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72f78ec-4db9-4535-8b6a-844b2148c050_1472x850.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L10D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72f78ec-4db9-4535-8b6a-844b2148c050_1472x850.png 424w, https://substackcdn.com/image/fetch/$s_!L10D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72f78ec-4db9-4535-8b6a-844b2148c050_1472x850.png 848w, https://substackcdn.com/image/fetch/$s_!L10D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72f78ec-4db9-4535-8b6a-844b2148c050_1472x850.png 1272w, https://substackcdn.com/image/fetch/$s_!L10D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72f78ec-4db9-4535-8b6a-844b2148c050_1472x850.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>The &#8220;tail&#8221; latency (p95, the threshold only the slowest 5% of requests exceed) is roughly 2.6&#215; lower with chunked prefill (15.0s vs. 5.8s). In the baseline, long-prompt requests add 9+ seconds of tail latency; chunked prefill avoids this by forcing the scheduler to share resources between chunks.</em></p><h2>Speculative Decoding and Unit Economics</h2><h3>Rejection Overhead</h3><p>Speculative decoding came in at 14.0s p95: slightly better than baseline (15.0s) but 2.4&#215; behind chunked prefill (5.8s), and with a 15% throughput drop at the gateway tier. Because agent responses (JSON and tool calls) are structured but variable, the draft model kept guessing wrong. 
Each wrong guess forced the main model to redo the work, adding overhead instead of saving time.</p><h3>Cost Analysis</h3><p>The cost per task appeared high ($0.036), but that figure was inflated because server setup time dominated the short 15-minute test window. At a sustained production rate, the cost drops to $0.003 per task, easily beating the $0.01 target.</p><h2>Summary</h2><ul><li><p><strong>Chunked prefill handles the tail:</strong> It didn&#8217;t make the GPU faster &#8212; it made the scheduler fairer.</p></li><li><p><strong>Speculative decoding is workload-dependent:</strong> It works well for predictable outputs &#8212; code completion, templated generation. For variable agent responses, it can be a net negative.</p></li><li><p><strong>Benchmark the loop:</strong> Measuring the model alone gives you an incomplete picture. To understand agent performance, you must measure the full back-and-forth.</p></li></ul><p><em>Next up: Part 3 covers the observability layer &#8212; the four dashboards that made these results interpretable.</em></p><p><a href="https://github.com/karthikkaiplody/agent-infer-stack">GitHub repo</a></p>
]]></content:encoded></item><item><title><![CDATA[From Request to Silicon: Inside an Instrumented LLM Inference Stack]]></title><description><![CDATA[Part 1 of 3 &#8212; The Architecture]]></description><link>https://karthikkaiplody.substack.com/p/from-request-to-silicon-inside-an</link><guid isPermaLink="false">https://karthikkaiplody.substack.com/p/from-request-to-silicon-inside-an</guid><dc:creator><![CDATA[Karthik Kaiplody]]></dc:creator><pubDate>Mon, 04 May 2026 00:20:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Mqoi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22c3e5c-e349-4c3b-9abe-feb6264c20e0_2720x2080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><strong>Most inference benchmarks give you API-level numbers. They don&#8217;t show you what&#8217;s happening inside the stack when your agent is actually running.</strong></p></blockquote><p>Is the prefix cache being hit? Is a long prompt blocking shorter requests in the scheduler queue? Is the optimization you enabled actually doing anything for your specific workload?</p><p>Those questions only have answers if the instrumentation is already in place. 
I built an end-to-end instrumented inference stack, ran controlled experiments against it, and wrote up what I found.</p><p><em>First in a three-part series on building an observable LLM inference stack, running controlled experiments, and understanding what the numbers actually mean for agentic workloads.</em></p><div><hr></div><h2>What I Built</h2><p>The core is an <strong>inference gateway</strong>&#8212;it accepts standard OpenAI-compatible API calls and routes them to different backends: <strong>vLLM</strong> on a Lambda GPU, Modal&#8217;s serverless endpoint, or a <strong>local Ollama</strong> instance. 
Every layer emits metrics so you can see exactly what&#8217;s happening at each hop.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mqoi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22c3e5c-e349-4c3b-9abe-feb6264c20e0_2720x2080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mqoi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22c3e5c-e349-4c3b-9abe-feb6264c20e0_2720x2080.png 424w, https://substackcdn.com/image/fetch/$s_!Mqoi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22c3e5c-e349-4c3b-9abe-feb6264c20e0_2720x2080.png 848w, https://substackcdn.com/image/fetch/$s_!Mqoi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22c3e5c-e349-4c3b-9abe-feb6264c20e0_2720x2080.png 1272w, https://substackcdn.com/image/fetch/$s_!Mqoi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22c3e5c-e349-4c3b-9abe-feb6264c20e0_2720x2080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mqoi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22c3e5c-e349-4c3b-9abe-feb6264c20e0_2720x2080.png" width="1456" height="1113" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b22c3e5c-e349-4c3b-9abe-feb6264c20e0_2720x2080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1113,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:220305,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://karthikkaiplody.substack.com/i/196277512?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22c3e5c-e349-4c3b-9abe-feb6264c20e0_2720x2080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Mqoi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22c3e5c-e349-4c3b-9abe-feb6264c20e0_2720x2080.png 424w, https://substackcdn.com/image/fetch/$s_!Mqoi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22c3e5c-e349-4c3b-9abe-feb6264c20e0_2720x2080.png 848w, https://substackcdn.com/image/fetch/$s_!Mqoi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22c3e5c-e349-4c3b-9abe-feb6264c20e0_2720x2080.png 1272w, https://substackcdn.com/image/fetch/$s_!Mqoi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22c3e5c-e349-4c3b-9abe-feb6264c20e0_2720x2080.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><strong>Figure 1:</strong> <em>The full request path &#8212; Nginx &#8594; FastAPI gateway &#8594; SSH tunnel &#8594; Lambda vLLM on an A10G GPU. LangSmith captures agent traces. Prometheus collects metrics from the gateway and vLLM engine. Grafana visualizes all of it.</em></p></blockquote><h3>Project Structure</h3><p>The project is intentionally modular to allow for &#8220;pluggable&#8221; observability:</p><pre><code><code>gateway/        &#8594; Request routing, metrics, logging
agent/          &#8594; LangGraph ReAct agent + test runner
ui/             &#8594; Chainlit chat interface
observability/  &#8594; Prometheus, Grafana, OTel Collector
deployments/    &#8594; Lambda and Modal deployment scripts
experiments/    &#8594; vLLM server configurations
loadtest/       &#8594; Load testing scripts
data/samples/   &#8594; Experiment results &#8212; committed to the repo</code></code></pre><p>This isn&#8217;t a production system. It's a testbed for running controlled experiments &#8212; change one variable, run the same tests, read the same dashboards, and understand the tradeoff.</p><div><hr></div><h2>Following a Request Through the Stack</h2><p>The clearest way to explain this architecture is to trace a single request end-to-end.</p><h3><strong>01. The user sends a message</strong></h3><p>The Chainlit UI sends the prompt to the agent. At this exact moment, <strong>LangSmith</strong> begins recording the session automatically&#8212;no code changes required, just an environment variable. Every subsequent tool call and LLM response is captured in this trace.</p><h3><strong>02. The agent decides what to do</strong></h3><p>This is a <strong>LangGraph ReAct agent</strong>. Unlike a standard chatbot, it doesn&#8217;t call the LLM once and exit. It follows an iterative loop:</p><ol><li><p>Call LLM to select a tool.</p></li><li><p>Execute the tool.</p></li><li><p>Send results back to the LLM.</p></li><li><p>Repeat until a final answer is reached.</p></li></ol><blockquote><p><strong>The Context Load:</strong> One user message typically triggers <strong>3 to 5 separate LLM calls</strong>. By the third turn, the context can exceed <strong>2,000 tokens</strong> before the model even generates its first word. This is why agentic workloads are an entirely different beast for inference infrastructure compared to simple chat.</p></blockquote><h3><strong>03. The gateway receives the request</strong></h3><p>The FastAPI gateway intercepts a standard <code>POST /v1/chat/completions</code>. 
Before routing, it performs three critical tasks:</p><ul><li><p><strong>Header Inspection:</strong> Reads the <code>X-Technique</code> header (e.g., baseline, chunked_prefill).</p></li><li><p><strong>Telemetry:</strong> Records a Prometheus metric tagged with that specific technique.</p></li><li><p><strong>Logging:</strong> Writes a structured entry for later auditing.</p></li></ul><blockquote><p><strong>The Design Choice:</strong> By using the <code>X-Technique</code> header, a single Grafana dashboard shows all three experiment arms side-by-side without changing a single line of observability code between runs.</p></blockquote><h3><strong>04. The backend runs inference</strong></h3><p>The request is routed to one of three backends depending on the experiment configuration:</p><ul><li><p><strong>Lambda A10G:</strong> Primary engine for metrics.</p></li><li><p><strong>Ollama:</strong> Local development at zero cost.</p></li><li><p><strong>Modal:</strong> On-demand serverless scaling.</p></li></ul><p>The inference server handles the heavy lifting: scheduling, memory management, and KV cache orchestration. Its internal metrics are the &#8220;heartbeat&#8221; of this observability layer.</p><h3><strong>05. The GPU does the work</strong></h3><p>Finally, we reach the silicon. This is where tensor operations and token generation happen. The <strong>vLLM metrics endpoint</strong> allows us to see into the black box&#8212;specifically scheduler behavior and cache statistics.</p><p><em>Note: While we track engine metrics, kernel-level profiling (like CUDA stream analysis) is intentionally out of scope for this series.</em></p><div><hr></div><h2>Modularity by Design</h2><p>Adding a new backend is a single-file change. The gateway uses a <code>BaseBackend</code> interface, allowing <code>VLLMBackend</code> or <code>SGLangBackend</code> to be implemented independently.</p><pre><code><code>backends:
  vllm-lambda-baseline:
    url: "http://localhost:${BASELINE_PORT}"
    gpu: "a10g-24gb"
    server_profile: "baseline"
    spec_decoding: false
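  # Illustrative sketch only, not taken from the repo: what a second experiment
  # arm might look like if it reuses the same keys as the baseline entry above.
  # The entry name, the CHUNKED_PREFILL_PORT variable, and the "chunked_prefill"
  # profile value are assumptions that mirror the X-Technique values; the point
  # is that only the server profile (and the port it listens on) changes.
  vllm-lambda-chunked-prefill:
    url: "http://localhost:${CHUNKED_PREFILL_PORT}"
    gpu: "a10g-24gb"
    server_profile: "chunked_prefill"
    spec_decoding: false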
</code></code></pre><p>The practical reason for this? A controlled experiment requires exactly one variable to change. Swapping a backend shouldn&#8217;t touch routing logic, and changing a server flag shouldn&#8217;t require a new observability setup.</p><h2>The Agent Workload</h2><p>The agent has two tools: web search through Tavily, and a task list stored in a local JSON file.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jsz2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071e51cd-ba95-4a3c-9301-0f512344a03d_1862x997.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jsz2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071e51cd-ba95-4a3c-9301-0f512344a03d_1862x997.png 424w, https://substackcdn.com/image/fetch/$s_!jsz2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071e51cd-ba95-4a3c-9301-0f512344a03d_1862x997.png 848w, https://substackcdn.com/image/fetch/$s_!jsz2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071e51cd-ba95-4a3c-9301-0f512344a03d_1862x997.png 1272w, https://substackcdn.com/image/fetch/$s_!jsz2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071e51cd-ba95-4a3c-9301-0f512344a03d_1862x997.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jsz2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071e51cd-ba95-4a3c-9301-0f512344a03d_1862x997.png" width="1456" height="780" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/071e51cd-ba95-4a3c-9301-0f512344a03d_1862x997.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:780,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:185830,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://karthikkaiplody.substack.com/i/196277512?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071e51cd-ba95-4a3c-9301-0f512344a03d_1862x997.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jsz2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071e51cd-ba95-4a3c-9301-0f512344a03d_1862x997.png 424w, https://substackcdn.com/image/fetch/$s_!jsz2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071e51cd-ba95-4a3c-9301-0f512344a03d_1862x997.png 848w, https://substackcdn.com/image/fetch/$s_!jsz2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071e51cd-ba95-4a3c-9301-0f512344a03d_1862x997.png 1272w, https://substackcdn.com/image/fetch/$s_!jsz2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071e51cd-ba95-4a3c-9301-0f512344a03d_1862x997.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><strong>Figure 2:</strong> <em>The Chainlit interface. A response like this comes from 3&#8211;4 sequential LLM calls through the gateway &#8212; tool selection, tool execution, reading the result, generating the final answer.</em></p></blockquote><p>What makes this interesting for infrastructure is <strong>concurrency</strong>. When several users run agent tasks simultaneously, each generates multiple sequential LLM calls with long contexts. 
This creates the exact conditions where scheduler decisions become visible in the data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xzsB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f32313-9179-4a61-a3b7-22951ce891eb_1864x916.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xzsB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f32313-9179-4a61-a3b7-22951ce891eb_1864x916.png 424w, https://substackcdn.com/image/fetch/$s_!xzsB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f32313-9179-4a61-a3b7-22951ce891eb_1864x916.png 848w, https://substackcdn.com/image/fetch/$s_!xzsB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f32313-9179-4a61-a3b7-22951ce891eb_1864x916.png 1272w, https://substackcdn.com/image/fetch/$s_!xzsB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f32313-9179-4a61-a3b7-22951ce891eb_1864x916.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xzsB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f32313-9179-4a61-a3b7-22951ce891eb_1864x916.png" width="1456" height="716" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5f32313-9179-4a61-a3b7-22951ce891eb_1864x916.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:716,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:195627,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://karthikkaiplody.substack.com/i/196277512?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f32313-9179-4a61-a3b7-22951ce891eb_1864x916.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xzsB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f32313-9179-4a61-a3b7-22951ce891eb_1864x916.png 424w, https://substackcdn.com/image/fetch/$s_!xzsB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f32313-9179-4a61-a3b7-22951ce891eb_1864x916.png 848w, https://substackcdn.com/image/fetch/$s_!xzsB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f32313-9179-4a61-a3b7-22951ce891eb_1864x916.png 1272w, https://substackcdn.com/image/fetch/$s_!xzsB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f32313-9179-4a61-a3b7-22951ce891eb_1864x916.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><blockquote><p><strong>Figure 3:</strong> <em>A LangSmith trace &#8212; agent &#8594; ChatOpenAI &#8594; tool decision &#8594; manage_tasks &#8594; agent &#8594; ChatOpenAI. Each row is a separate step with its own timing and token count.</em></p></blockquote><div><hr></div><h2>What&#8217;s Coming</h2><ul><li><p><strong>Part 2</strong> covers the three-arm experiment: baseline, chunked prefill, and speculative decoding.
We&#8217;ll walk through why the benchmark tier matters more than the optimization.</p></li><li><p><strong>Part 3</strong> walks through the dashboard layers with real experiment data, including the internal vLLM metrics that confirmed our prefix caching hypotheses.</p></li></ul><div><hr></div><p><strong>The repo is at <a href="https://github.com/karthikkaiplody/agent-infer-stack">github.com/karthikkaiplody/agent-infer-stack</a>. </strong><em>RUNBOOK.md in the repo has the full setup instructions.</em></p><p><em>Part 2 drops next week.</em></p>]]></content:encoded></item></channel></rss>