A Practical Path to Deploying Open-Source Model APIs and Compute Operations

May 18, 2026

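This article looks at a practical path to deploying open-source model APIs and compute operations, and breaks down how companies can turn knowledge bases, intelligent customer service, sales automation, SOP assistants, model APIs, and industry agents into business systems that can ship, be maintained, and be reused.

Start from real business problems

Enterprise AI adoption should not stop at models, concepts, or demo results. A more reliable path is to first define the roles, processes, data sources, permission boundaries, and target metrics, and then decide whether a knowledge base, intelligent customer service, sales automation, an SOP assistant, a model API, or an industry-specific agent is the right tool for the problem.

Turn the solution into a production system

模伐方块科技 breaks requirements down into an executable delivery checklist: document preparation, knowledge base structure, prompts and workflows, API integration, permission settings, logging, human review, and training documentation. The result is not a one-off demo but a business system the team can use every day and keep iterating on.

Scenarios to prioritize

  • Industry knowledge bases and intelligent customer service, covering document lookup, pre-sales Q&A, after-sales tickets, and internal support.
  • Sales and marketing automation, covering lead generation, follow-up, sales scripts, conversion, and review.
  • Internal SOP and training assistants that capture veteran employees' experience, policy documents, and operating procedures.
  • Automation of reports, contracts, email, and meeting minutes to reduce repetitive white-collar work.
  • Industry-specific agents for manufacturing, e-commerce, legal, healthcare, education, and finance, used for quality inspection, product selection, compliance, risk control, and data analysis.

Keep operating after delivery

After an AI project goes live, it needs ongoing tracking of usage, accuracy, response speed, human takeover, cost, and business results. We help customers set up a review process so that effective workflows become reusable modules, which can then move step by step into subscription software capabilities and long-term maintenance.

Next steps

If you are evaluating work related to "a practical path to deploying open-source model APIs and compute operations," you can start with a business diagnostic. Bring your business processes, customer materials, existing tools, and the metrics you want to improve, and we will identify the AI application path best suited to go live first.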

According to NVIDIA's benchmarks, Nemotron 3 Nano performs competitively against models such as Qwen 3 and gpt-oss-20b on reasoning tasks. Thanks to its sparsely activated mixture-of-experts (MoE) architecture, it activates only a fraction of its parameters per token, which reduces the per-token compute and memory bandwidth needed during inference.

[Figure: Nemotron Nano vs Qwen 3 vs gpt-oss]

How to run Nemotron 3 on an H100 GPU

Prerequisites

To get started, create a GPU virtual machine (VM) on Radiant.

We selected the NVIDIA H100 for this tutorial because it offers a strong combination of availability and cost-efficiency. Upgrading to H200 or B200 hardware would unlock higher performance and larger context windows, but the steps in this guide remain the same across these architectures.

NVIDIA has released the post-trained and pre-trained BF16 variants as well as the quantized FP8 version. We’ll be running the leaner FP8 model on an H100 GPU for this tutorial.
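As a rough sanity check on why the FP8 variant is the leaner choice, the weight footprint can be estimated from the parameter counts in the model name (30B total, with A3B read as roughly 3B active per token). This is a minimal sketch: the counts are approximations taken from the name, and the figures ignore the KV cache, activations, and runtime overhead.

```python
# Back-of-the-envelope weight footprint, using parameter counts read off the
# model name "30B-A3B" (assumed: ~30e9 total, ~3e9 active per token).
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gb(num_params: float, dtype: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_gb(TOTAL_PARAMS, dtype):.0f} GB of weights")

print(f"active fraction per token: ~{ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
```

Under these assumptions the FP8 weights take about half the space of BF16, which leaves more of an 80 GB H100 free for the KV cache.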

Quick Tip

Use the initialization script during VM creation to pre-install the NVIDIA CUDA drivers and PyTorch.

Step 1: SSH into your VM and set up the environment

apt install python3.12-venv
python3.12 -m venv nemo-env
source nemo-env/bin/activate

Step 2: Install the latest vLLM

pip install -U "vllm>=0.12.0"

Step 3: Download the Nemotron 3 reasoning parser

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8/resolve/main/nano_v3_reasoning_parser.py

Step 4: Run the vLLM server

We will serve the nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model.

VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --port 8000 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --kv-cache-dtype fp8
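Loading the weights can take a while, so it helps to wait for the server before sending requests. The sketch below polls an HTTP endpoint until it returns 200. The helper name `wait_until_ready` and its defaults are hypothetical, and it assumes the server exposes a `/health` route on port 8000 as configured above.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url, timeout_s=900.0, interval_s=5.0,
                     fetch=urllib.request.urlopen, sleep=time.sleep):
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with fetch(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # server not accepting requests yet; keep waiting
        sleep(interval_s)
    return False


# Example (run on the VM once `vllm serve` has been started):
# wait_until_ready("http://localhost:8000/health")
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a live server.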

Here’s a snapshot of the GPU instance, showing about 73 GB of VRAM in use.
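That figure is close to what vLLM's memory reservation would predict. vLLM reserves a fraction of GPU memory for weights and the KV cache, controlled by `--gpu-memory-utilization` (0.9 unless overridden), so the reading likely reflects that reservation and not the minimum the model needs. A quick check, assuming an 80 GB H100 that reports about 81,559 MiB of total memory (an assumed value; read yours from nvidia-smi):

```python
# Assumptions: vLLM's default --gpu-memory-utilization of 0.9, and an H100
# reporting 81,559 MiB of total memory (check nvidia-smi on your VM).
TOTAL_MIB = 81_559
GPU_MEMORY_UTILIZATION = 0.9

reserved_mib = TOTAL_MIB * GPU_MEMORY_UTILIZATION
print(f"expected reservation: ~{reserved_mib:,.0f} MiB")
```

If you need to share the GPU, lowering `--gpu-memory-utilization` or `--max-model-len` should shrink this reservation.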

[Figure: How much VRAM memory for Nemotron 3 Nano]

Step 4: Test the model with cURL

You can interact with the model using standard tools like curl. Replace VM-IP with your VM's IP address.

curl http://VM-IP:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
       "model":"nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
       "messages":[{"role": "user", "content": "Explain Jevon'\''s Paradox in a single sentence"}],
       "chat_template_kwargs": {"enable_thinking": false}
   }' | jq -r '."choices"[0]."message"."content"'

Step 5: Install Jupyter Notebook and query the model with the OpenAI Python SDK

pip install notebook
jupyter notebook --allow-root --no-browser --ip=0.0.0.0


from openai import OpenAI

# Point the client at the local vLLM server. Any placeholder key works
# unless the server was started with --api-key.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "5.9 - 5.11"},
]

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    messages=messages,
    # Disable the reasoning trace for a short, direct answer.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

print(response.choices[0].message.content)
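To judge the model's reply to the "5.9 - 5.11" prompt, it helps to have the exact answer at hand. A minimal check using Python's decimal module:

```python
from decimal import Decimal

# Exact decimal arithmetic for the prompt "5.9 - 5.11".
expected = Decimal("5.9") - Decimal("5.11")
print(expected)  # 0.79

# Binary floats can differ from 0.79 in the last digits, so compare
# a float result with a tolerance instead of with ==.
print(abs((5.9 - 5.11) - 0.79) < 1e-9)  # True
```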

Step 6: Tool Calling with Nemotron 3 Nano

Define a tip calculator tool and let the model call it. You can use other recipes in the same way to build agentic workflows with Nemotron 3 Nano.

from openai import OpenAI

client = OpenAI(
   base_url="http://localhost:8000/v1",
   api_key="EMPTY"
)

TOOLS = [
   {
       "type": "function",
       "function": {
           "name": "calculate_tip",
           "parameters": {
               "type": "object",
               "properties": {
                   "bill_total": {
                       "type": "integer",
                       "description": "The total amount of the bill"
                   },
                   "tip_percentage": {
                       "type": "integer",
                       "description": "The percentage of tip to be applied"
                   }
               },
               "required": ["bill_total", "tip_percentage"]
           }
       }
   }
]
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My bill is $50. What will be the amount for 15% tip?"},
]

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    messages=messages,
    tools=TOOLS,
    temperature=0.6,
    top_p=0.95,
    max_tokens=512,
    stream=False,
)

message = response.choices[0].message
print(message.reasoning_content)  # the model's reasoning trace
print(message.tool_calls)         # the requested calculate_tip call and its JSON arguments
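The model only requests the tool call; your code has to run it and send the result back. The sketch below shows one way to do that. The `calculate_tip` body, the `LOCAL_TOOLS` registry, and the `run_tool_call` helper are all hypothetical, since the tutorial defines only the tool's schema, and the call id is a made-up stand-in.

```python
import json


def calculate_tip(bill_total: int, tip_percentage: int) -> float:
    """Hypothetical local implementation matching the tool schema."""
    return bill_total * tip_percentage / 100


LOCAL_TOOLS = {"calculate_tip": calculate_tip}


def run_tool_call(name: str, arguments_json: str, call_id: str) -> dict:
    """Execute one tool call and build the `tool` message to send back."""
    args = json.loads(arguments_json)  # arguments arrive as a JSON string
    result = LOCAL_TOOLS[name](**args)
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}


# Stand-in for one entry of response.choices[0].message.tool_calls.
msg = run_tool_call("calculate_tip", '{"bill_total": 50, "tip_percentage": 15}', "call_0")
print(msg["content"])  # 7.5
```

With a live response, you would pass `call.function.name`, `call.function.arguments`, and `call.id` for each entry in `tool_calls`, append the assistant message and the resulting tool message to `messages`, and call `client.chat.completions.create` again to get the final answer.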

How fast is Nemotron 3 Nano?

We were very impressed with the generation speed of Nemotron 3 Nano, which outpaced most other open-source models. On the server side, vLLM reported 223 tokens per second on an H100 GPU for a single request, compared with the 158 tokens per second we observed with gpt-oss-120b earlier this year.

[Figure: Nemotron 3 Nano tokens per second]

We also collected the vLLM metrics in Prometheus and visualized them in a Grafana dashboard, which showed generation speeds of 185 tokens/second. Although this differs somewhat from the figure reported in the vLLM terminal, it reinforces the high throughput of Nemotron's architecture and vLLM's efficient continuous batching.
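A dashboard rate of this kind is usually derived from a token counter sampled at two points in time. The sketch below computes an average rate from two scrapes of the server's Prometheus text output. It assumes the counter is named `vllm:generation_tokens_total` and that samples carry no trailing timestamp. The helper names and the scrape values are illustrative, not measured data.

```python
def read_counter(metrics_text: str, name: str) -> float:
    """Sum all samples of a counter in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        metric, _, value = line.rpartition(" ")
        if metric == name or metric.startswith(name + "{"):
            total += float(value)
    return total


def tokens_per_second(before: str, after: str, elapsed_s: float,
                      name: str = "vllm:generation_tokens_total") -> float:
    """Average generation rate between two scrapes taken `elapsed_s` apart."""
    return (read_counter(after, name) - read_counter(before, name)) / elapsed_s


# Illustrative scrapes, 10 seconds apart.
scrape_0 = 'vllm:generation_tokens_total{model_name="m"} 1000.0\n'
scrape_1 = 'vllm:generation_tokens_total{model_name="m"} 3000.0\n'
print(tokens_per_second(scrape_0, scrape_1, elapsed_s=10.0))  # 200.0
```

Because the rate is averaged over the whole interval, any time spent idle or on prompt processing lowers it, which may be one reason a dashboard value can sit below the peak figure printed in the terminal.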

Similarly, time-to-first-token (TTFT) stayed below 100 ms, which indicates better responsiveness than models such as Qwen 3 and gpt-oss.

[Figure: Nemotron 3 Nano tokens per second]
[Figure: Nemotron 3 Nano time-to-first-token (TTFT)]

Initial Impressions

We ran a few standard tests to see how Nemotron 3 Nano handles common logic and coding prompts, including math reasoning tasks.

Tokenization & Logic:

Prompt: How many 'r's in “strawberry”?

Nemotron 3:

[Figure: Nemotron Strawberry]

Prompt: How many 'l's in “strawberry”?

Nemotron 3: The model got this one wrong, stating that the word “strawberry” contains one letter “l” when it contains none.

[Figure: Nemotron NVIDIA]

Mathematical Reasoning:

Prompt: Find all saddle points of the function $f(x, y) = x^3 + y^3 - 3x - 12y + 20$.

Nemotron 3: The model correctly applied the second derivative test and identified the saddle points without errors in the calculation steps.
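For reference, the second derivative test for this function works out as follows. This is a hand-worked check, not the model's transcript.

```latex
\begin{aligned}
f_x &= 3x^2 - 3 = 0 \;\Rightarrow\; x = \pm 1, \qquad
f_y = 3y^2 - 12 = 0 \;\Rightarrow\; y = \pm 2,\\
D(x, y) &= f_{xx} f_{yy} - f_{xy}^2 = (6x)(6y) - 0 = 36xy,\\
D(1, 2) &= 72 > 0,\quad f_{xx}(1, 2) = 6 > 0 \quad \text{(local minimum)},\\
D(-1, -2) &= 72 > 0,\quad f_{xx}(-1, -2) = -6 < 0 \quad \text{(local maximum)},\\
D(1, -2) &= D(-1, 2) = -72 < 0 \quad \text{(saddle points)}.
\end{aligned}
```

The function therefore has two saddle points, at (1, -2) and (-1, 2).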

[Figure: Nemotron Math Performance]

Prompt: Find the area of the region enclosed by the graphs of the given equations “y=x, y=2x, and y=6-x”. Use vertical cross-sections.

Nemotron 3: The model got this one wrong: the correct answer is 3, not 11/4.
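The correct value can be confirmed with vertical cross-sections. The three lines meet at (0, 0), (2, 4), and (3, 3), so the upper boundary changes from y = 2x to y = 6 - x at x = 2:

```latex
\begin{aligned}
A &= \int_0^2 (2x - x)\,dx + \int_2^3 \bigl((6 - x) - x\bigr)\,dx\\
  &= \left[\tfrac{x^2}{2}\right]_0^2 + \left[6x - x^2\right]_2^3\\
  &= 2 + (9 - 8) = 3.
\end{aligned}
```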

[Figure: Nemotron 3 Nano Math]

Code Generation:

We asked for Python code for the Snake game. Nemotron 3 Nano got it right on the first try, with a perfect simulation of the game. Here’s a snapshot of the game.

Prompt: Create an SVG of a smiling dog

Again, Nemotron 3 Nano was able to create the SVG correctly on the first pass.

Overall, Nemotron 3 Nano performs reliably for its size. It demonstrates strong logical consistency and multi-step reasoning capabilities, suggesting that the hybrid architecture is effective at maintaining coherence without the computational cost of a dense 70B model.

Scale your AI on Radiant

Deploying hybrid models like Nemotron 3 benefits significantly from robust, high-performance infrastructure. Radiant provides the flexibility and top-tier compute required to support dynamic workloads, helping you bridge the gap between initial prototyping and production deployment.

  • GPU Instances: Gain instant access to top-tier GPUs required for efficient inference.
  • Supercomputers: Instant, bare-metal GPU clusters connected by Infiniband networking.
  • Inference Endpoints: Integrate the latest open-source models into your applications via scalable, low-latency APIs.
  • GPU Clusters: Orchestrate high-performance clusters for fine-tuning or training foundation models at scale.
  • 模伐 AI Console: License the same platform that powers Radiant to build your own AI-centric GPU compute cloud.
