暂无内容。
暂无内容。

生产级模型 API 的性能观察

Benchmarking gpt-oss, Llama 4, Gemma 3
5
分钟阅读
2026 年 5 月 18 日
分享文章

围绕“生产级模型 API 的性能观察”,拆解企业如何把知识库、智能客服、销售自动化、SOP 助手、模型 API 和行业 Agent 做成可上线、可维护、可复用的业务系统。

先从真实业务问题开始

企业引入 AI 不应只停留在模型、概念或演示效果上。更可靠的路径,是先明确岗位、流程、数据来源、权限边界和目标指标,再判断应该用知识库、智能客服、销售自动化、SOP 助手、模型 API 还是行业专用 Agent 来解决问题。

把方案做成可上线系统

模伐方块科技会把需求拆成可执行的交付清单:资料整理、知识库结构、提示词与工作流、接口接入、权限设置、日志记录、人工复核和培训文档。这样项目不是一次性 Demo,而是能被团队每天使用、持续迭代的业务系统。

适合优先落地的场景

  • 行业知识库与智能客服,解决资料查询、售前问答、售后工单和内部支持。
  • 销售与营销自动化,覆盖获客、跟进、话术、转化和复盘。
  • 企业内部 SOP 与培训助手,把老员工经验、制度文档和操作流程沉淀下来。
  • 报表、合同、邮件和会议纪要自动化,减少重复白领工作。
  • 制造、电商、法律、医疗、教育、金融等行业专用 Agent,用于质检、选品、合规、风控和数据分析。

交付后继续运营

AI 项目上线后,需要持续看使用率、准确率、响应速度、人工接管、成本和业务结果。我们会帮助客户建立复盘机制,让有效流程沉淀为可复用模块,再逐步进入订阅式软件能力和长期维护。

下一步

如果你正在评估「生产级模型 API 的性能观察」相关方向,可以从一次业务诊断开始。带上你的业务流程、客户资料、现有工具和希望优化的指标,我们会判断最适合先落地的 AI 应用路径。

                                                                                                                                   Throughput metrics for models.

Model Mean TTFT (ms) Mean TPOT (ms) Mean ITL (ms)
GPT-OSS-20B 165.50 8.20 8.66
Gemma 3 27B 669.20 19.15 19.72
Llama 4 Scout 17B-16E Instruct 317.03 19.54 20.71

                                                                                                                                Latency metrics for different models.

GPT‑OSS‑20B delivered substantially higher throughput, achieving roughly 2.5× the request and token throughput of both Gemma 3 and Llama 4. It also streamed responses faster, sustaining approximately 121 tokens per second per user—more than double the other models.

Gemma 3 27B‑it and Llama 4 Scout 17B‑16E showed similar streaming characteristics, but with significantly longer time‑to‑first‑token (TTFT). Gemma 3, in particular, exhibited the highest TTFT, indicating that prefill computation was a primary bottleneck under these conditions.

Short prompt workloads emphasize token generation efficiency over prefill latency. Under this regime, GPT‑OSS‑20B’s kernel and attention optimizations translated directly into a more interactive, responsive experience.

Concurrency Ramping and Saturation Behavior

Baseline performance captures only part of the picture. In production, systems must absorb traffic bursts and sustain concurrency without degrading user experience. To evaluate this, prompts filtered from the ShareGPT dataset were sent to single-replica endpoints with progressively increasing maximum concurrency. The same prompt set was used for consistency, and endpoints were reset between runs to eliminate caching effects.

As concurrency increases, the serving stack must handle overlapping workloads, larger batches, and growing contention for GPU memory and compute. The point at which requests begin to fail, latency rises, and throughput plateaus or declines marks the threshold where user experience degrades and additional scaling is required.

Concurrency ramping used identical inputs (1–63 tokens) and a maximum output length of 100 tokens across GPT-OSS-20B, Gemma 3 27B, and LLaMa 4 Scout 17B-16E Instruct. The results are shown in the graphs below.

Figure 1: Token throughput against maximum concurrency for the three models

Figure 1 shows GPT-OSS outperforming Gemma and LLaMa at concurrencies below 1600. Beyond this point, both Gemma and GPT exhibit degraded response times, with rising instability and batch loss. Although Gemma 27B and GPT-OSS-20B scale sharply at low concurrency, both are single-GPU models, which likely limits their ability to sustain higher concurrent loads relative to LLaMa.

LLaMa, by contrast, is a larger Mixture-of-Experts model spanning four GPUs. This leads to slower initial scaling due to routing and synchronization overhead, but supports acceptable performance and request success rates up to roughly 4000 concurrent in-flight requests. LLaMa peaks early (around 500 concurrent requests) before gradually declining in token throughput, unlike Gemma’s flatter plateau or GPT’s later peak before instability.

From a production perspective, GPT-OSS offers the best balance of throughput, scalability, and stability. Gemma saturates early with the lowest token throughput, while LLaMa trades efficiency for greater concurrency tolerance.

A graph with a red lineAI-generated content may be incorrect.
Figure 2: Request throughput (requests per second) against maximum concurrent requests for GPT-OSS-20B

GPT shows the steepest initial scaling, increasing from ~35 req/s at concurrency 32 to over 100 req/s at a maximum concurrency of 256. Throughput then continues to climb, reaching a stable plateau of ~130–136 req/s between ~800 and ~1400 concurrency. The plateau is roughly twice as fast as Gemma’s, indicating superior utilisation of GPU resources and more efficient scheduling. The fluctuations present at high concurrency are due to batches and requests being lost as the service loses stability.

A graph with green linesAI-generated content may be incorrect.
Figure 3: Request throughput (requests per second) against maximum concurrent requests for Gemma 27B

For the Gemma model, request throughput increases from ~8 req/s at concurrency 16 to ~49 req/s at 200, driven by improved batching efficiency. Throughput then rises gradually, reaching a flat plateau of ~56–59 req/s between 400 and ~1400 concurrency, indicating early saturation and limited benefit from additional inflight requests.

Minor fluctuations appear at higher concurrency, including a dip around 1500 concurrency (~53 req/s) caused by request loss. Beyond ~1600 concurrency, throughput becomes unstable and request success rates drop sharply, signaling unreliable operation.

A graph with a line going upAI-generated content may be incorrect.

The LLaMA model shows slower scaling at low concurrency, increasing from ~9 req/s at concurrency 16 to a peak of ~70 req/s at ~512 concurrency. Unlike GPT and Gemma, LLaMA reaches its maximum throughput earlier, after which throughput gradually declines. Throughput decreases slowly from ~70 req/s at 512 to ~62–64 req/s in the 1000–2500 concurrency range. A significant drop occurs at 3250 concurrent requests corresponding to one of the runs at that concurrency losing a large batch of requests.

What These Results Mean in Practice

From a production perspective, three insights stand out:

  1. Throughput alone is not enough. Models can appear performant at low concurrency yet collapse once saturation is reached.
  2. Concurrency has a breakpoint. Beyond this point, adding users leads to request loss rather than higher throughput.
  3. Architecture matters. Single‑GPU models deliver excellent efficiency until they don’t. Multi‑GPU MoE models trade peak efficiency for higher concurrency tolerance.

For most production inference deployments, GPT‑OSS‑20B provided the best balance of throughput, scalability, and stability within a single‑replica setup. Gemma 3 saturated too early for high‑load scenarios, while Llama 4 favored concurrency tolerance over raw efficiency.

Deliver Real-time Inference at Scale with Radiant’s Inference Delivery Network

These results underscore a broader truth: model choice is only half the equation. The model serving stack which includes scheduling, batching, isolation, and orchestration, ultimately determines how much usable performance and economic value can be extracted from GPU infrastructure under real load. In production environments, inefficiencies at this layer compound quickly, translating into unstable latency, wasted capacity, or premature scaling.

Radiant’s inference platform is designed with infrastructure-grade discipline to address these realities. It delivers predictable performance under sustained concurrency, exposes clear saturation points to inform replica and capacity planning, and maintains high GPU utilization without compromising stability or user experience. This combination enables sovereign entities, enterprises, telcos and AI builders to move beyond best-case benchmarks and deploy inference systems that remain reliable, efficient, and economically defensible as demand scale. Talk to an Inference Infrastructure Expert

常见问题

暂无内容。

操作指南

暂无内容。

相关文章