开源模型 API 与算力运维的落地路径

模伐方块科技研究院

Technical Marketing

分钟阅读

2026 年 5 月 18 日

分享文章

围绕“开源模型 API 与算力运维的落地路径”，拆解企业如何把知识库、智能客服、销售自动化、SOP 助手、模型 API 和行业 Agent 做成可上线、可维护、可复用的业务系统。

先从真实业务问题开始

企业引入 AI 不应只停留在模型、概念或演示效果上。更可靠的路径，是先明确岗位、流程、数据来源、权限边界和目标指标，再判断应该用知识库、智能客服、销售自动化、SOP 助手、模型 API 还是行业专用 Agent 来解决问题。

把方案做成可上线系统

模伐方块科技会把需求拆成可执行的交付清单：资料整理、知识库结构、提示词与工作流、接口接入、权限设置、日志记录、人工复核和培训文档。这样项目不是一次性 Demo，而是能被团队每天使用、持续迭代的业务系统。

适合优先落地的场景

行业知识库与智能客服，解决资料查询、售前问答、售后工单和内部支持。
销售与营销自动化，覆盖获客、跟进、话术、转化和复盘。
企业内部 SOP 与培训助手，把老员工经验、制度文档和操作流程沉淀下来。
报表、合同、邮件和会议纪要自动化，减少重复白领工作。
制造、电商、法律、医疗、教育、金融等行业专用 Agent，用于质检、选品、合规、风控和数据分析。

交付后继续运营

AI 项目上线后，需要持续看使用率、准确率、响应速度、人工接管、成本和业务结果。我们会帮助客户建立复盘机制，让有效流程沉淀为可复用模块，再逐步进入订阅式软件能力和长期维护。

下一步

如果你正在评估「开源模型 API 与算力运维的落地路径」相关方向，可以从一次业务诊断开始。带上你的业务流程、客户资料、现有工具和希望优化的指标，我们会判断最适合先落地的 AI 应用路径。

By activating only a fraction of its parameters per token, Nemotron 3 Super delivers the "brainpower" of a frontier model while fundamentally changing the cost curve for long-context agentic work. This makes it ideal for tool calling use-cases, data science workflows, semiconductor design automation, molecular understanding in drug discovery, deep literature search, and AI-native companies looking to automate workflows without the overhead of proprietary models. The model can be fine-tuned locally for domain-specific applications.

Nemotron 3 Super outperforms models such as Qwen 3 and gpt-oss-120b in reasoning tasks and is also significantly faster (2.2x throughput of gpt-oss-120b and 2x of qwen3-122b). This massive speed advantage stems from its architectural trifecta: while OpenAI’s open-source models suffer from the quadratic compute costs of standard Transformers, Nemotron's Mamba layers process long sequences with linear efficiency. When combined with Latent MoE (routing to multiple experts without the usual compute penalty) and MTP (generating several words simultaneously), the model effectively bypasses the memory bandwidth bottlenecks that throttle its peers.

‍

How to run Nemotron 3 on an H100 GPU

Prerequisites

To get started, create a GPU virtual machine (VM) on Radiant AI 应用落地.

We have selected 2x NVIDIA H100 GPUs for this tutorial as it is a strong combination of availability and cost-efficiency in the market. While upgrading to H200 or B200 hardware would unlock superior performance and larger context windows, the steps outlined in this guide remain consistent across these architectures. We’ll be running the FP8 model for this tutorial.

Quick Tip: Use the initialization script during VM creation to pre-install NVIDIA CUDA drivers, PyTorch.

Step 1: SSH into your VM and set up the environment

apt install python3.12-venv
python3.12 -m venv nemo-env
source nemo-env/bin/activate

Step 2: Install the latest vLLM

pip install -U vllm==0.17.1 torch==2.10.0 flashinfer-python==0.6.4 flashinfer-cubin==0.6.4 'nvidia-cutlass-dsl>=4.4.0.dev1' --extra-index-url https://download.pytorch.org/whl/cu128

Step 3: Download the Nemotron 3 Super Parser

wget "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8/resolve/main/super_v3_reasoning_parser.py"

Step 4: Run the vLLM server

We’re serving nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 via port # 8000 and setting the tensor parallelism to 2 for the dual GPUs used.

vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --async-scheduling \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 1 \
  --data-parallel-size 1 \
  --swap-space 0 \
  --trust-remote-code \
  --attention-backend TRITON_ATTN \
  --gpu-memory-utilization 0.9 \
  --enable-chunked-prefill \
  --max-num-seqs 512 \
  --served-model-name nemotron \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin "./super_v3_reasoning_parser.py" \
  --reasoning-parser super_v3

Here’s a snapshot of the GPU instance to show memory usage of about 76GB VRAM each

‍Step 4: Test the model with cURL

You can interact with the model using curl

curl http://VM-IP:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model":"nemotron",
        "messages":[{"role": "user", "content": "Explain Jevon'\''s Paradox in a single sentence"}],
        "chat_template_kwargs": {"enable thinking": false}
    }' | jq -r '."choices"[0]."message"."content"'

Step 5: Install Jupyter Notebook for ease of interaction and run OpenAI Python SDK‍

pip install notebook
jupyter notebook --allow-root --no-browser --ip=0.0.0.0

‍

from openai import OpenAI

client = OpenAI(
    base_url="http://VM-IP:8888/v1",
    api_key="EMPTY"
)
messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "5.9 - 5.11"}
    ]
response = client.chat.completions.create(model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8", messages=messages, extra_body={"chat_template_kwargs": {"enable_thinking": False}})

print(response.choices[0].message.content)

‍

How fast is Nemotron Super?

We benchmarked Nemotron 3 Super by using the vllm bench serve command. The mean output token throughput was 329 tokens per second, demonstrating excellent inference performance, especially when compared to the 158 tokens/s throughput we observed with gpt-oss-120b and 223 tokens/s with Nemtron 3 Nano in our previous tests.

NVIDIA Nemotron Super Throughput and Latency

Initial Impressions

We ran a few standard tests to see how Nemotron 3 Super handles common logic and coding prompts, including math reasoning tasks.

Prompt: How many 'r's in “strawberry”?

Nemotron Super:

Prompt: How many 'l's in “strawberry”?

Nemotron Super: The model got this right, unlike our previous experience with Nemotron Nano.

Mathematical Reasoning:

Prompt: Find all saddle points of the function $f(x, y) = x^3 + y^3 - 3x - 12y + 20$.

Nemotron Super: The model correctly applied the second derivative test and identified the saddle point without errors in the calculation steps.

Prompt: 企业 AI Agent the area of the region enclosed by the graphs of the given equations “y=x, y=2x, and y=6-x”. Use vertical cross-sections

Nemotron Super: Again, Nemotron Super beat the Nano version with the correct answer, 3.

Code Generation:

Prompt: We asked for Python code for the Snake game. Nemotron Super got it right at the first try with a perfect simulation of the game. Here’s a snapshot of the game

‍

Prompt: Create an SVG of a smiling dog

Nemotron 3 Super was able to create the SVG correctly on the first pass.

Altogether, Nemotron 3 Super is a high-performance open source model . NVIDIA has demonstrated that smarter routing through larger, sparse hybrid architectures can deliver frontier performance with the nimbleness of smaller models.

Build Big on The AI 应用落地 Designed for Abundance

For organizations that are deploying advanced MoE models like Nemotron 3 Super requires infrastructure engineered for speed, control, and scale. Radiant’s comprehensive AI 应用落地 allows you to train faster, fine-tune easily, deploy anywhere, and operate with total control.

GPU 实例: Launch top-tier, pre-configured NVIDIA GPU instances in seconds. With suspend-and-resume capabilities for idle workloads, you can reduce costs by up to 80% compared to traditional hyperscalers.
Supercomputers: Provision multi-node clusters of thousands of best-in-class NVIDIA GPUs connected by high-speed InfiniBand or RoCE networking for massive distributed training.
Inference Delivery Network (IDN): Deploy models globally with our model-routing layer that minimizes latency and enforces data-sovereignty. Choose between Serverless Endpoints for scale-to-zero flexibility or Dedicated Endpoints for strict isolation and predictable performance.
弹性容器: Free your ML teams from infrastructure management. Run existing container workflows natively while the platform handles sub-second cold starts and auto-scaling from zero to thousands of GPUs in real time.

Start Building on Radiant Today

常见问题

暂无内容。

操作指南