Google Cloud Vertex AI streaming output is very slow, gemini ...

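Problem description
When making streaming prediction calls from infrastructure located in Silicon Valley, USA, to the Vertex AI API (aiplatform.googleapis.com) with the model gemini-3-pro-preview, we observe abnormally high latency before the first token of the response stream. Time to first token (TTFT) consistently exceeds 17 seconds, whereas it should normally be under 2 seconds.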
server address: 142.250.191.42
1. Basic Ping Tests (Connectivity & Baseline Latency)
Run these commands from the affected server/client in Silicon Valley.
(base) [root@usa-gg-test01 ~]# ping aiplatform.googleapis.com
PING aiplatform.googleapis.com (142.250.191.42) 56(84) bytes of data.
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=1 ttl=118 time=2.67 ms
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=2 ttl=118 time=2.62 ms
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=3 ttl=118 time=2.64 ms
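To put the ping result next to the reported TTFT, the round-trip times can be averaged and compared with the 17 s figure. The snippet below is a minimal sketch: the helper name `mean_rtt_ms` is hypothetical, and the sample lines are copied from the output above. ICMP round-trip time only reflects the network path to the Google front end, so a tiny ratio here suggests (but does not prove) that the delay is not in the network.

```python
import re

# The three samples from the ping output above.
PING_OUTPUT = """\
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=1 ttl=118 time=2.67 ms
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=2 ttl=118 time=2.62 ms
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=3 ttl=118 time=2.64 ms
"""


def mean_rtt_ms(ping_output: str) -> float:
    """Average the 'time=... ms' values found in Linux ping output."""
    rtts = [float(value) for value in re.findall(r"time=([\d.]+) ms", ping_output)]
    if not rtts:
        raise ValueError("no RTT samples found in ping output")
    return sum(rtts) / len(rtts)


if __name__ == "__main__":
    rtt = mean_rtt_ms(PING_OUTPUT)
    ttft_ms = 17_000  # lower bound on the TTFT reported in the problem description
    print(f"mean RTT: {rtt:.2f} ms")
    print(f"RTT as a share of TTFT: {rtt / ttft_ms:.4%}")
```

With a mean RTT of about 2.64 ms, the network round trip accounts for a very small fraction of the 17 s wait.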
2. Python code test
Using the model: gemini-3-pro-preview
import json
import time

import requests


def stream_gemini_content():
    api_key = 'xxx'
    url = "https://aiplatform.googleapis.com/v1/publishers/google/models/gemini-3-pro-preview:streamGenerateContent?alt=sse"
    headers = {
        "x-goog-api-key": api_key,
        "Content-Type": "application/json"
    }
    data = {
        "contents": [{
            "role": "user",
            "parts": [{
                # Prompt (in Chinese): "Please tell a 200-character story;
                # do not use reasoning, answer directly."
                "text": "请讲一个 200 字的故事，不要用推理，直接回答。"
            }]
        }],
        "generationConfig": {
            "thinkingConfig": {
                "includeThoughts": False
            }
        }
    }
    print(f"begin requests: {url} ...")
    start_time = time.time()
    first_token_time = None
    last_chunk_time = None
    try:
        with requests.post(url, headers=headers, json=data, stream=True) as response:
            if response.status_code != 200:
                print(f"status: {response.status_code}")
                print(response.text)
                return
            print("-" * 50)
            for line in response.iter_lines():
                # Skip the blank separator lines between SSE events.
                if not line:
                    continue
                decoded_line = line.decode('utf-8').strip()
                if not decoded_line.startswith("data: "):
                    continue
                json_str = decoded_line[6:]
                if json_str == "[DONE]":
                    break
                try:
                    now = time.time()
                    # The first "data:" event marks the time to first token.
                    if first_token_time is None:
                        first_token_time = now
                        print(f"\n[total] first token TTFT: {(now - start_time) * 1000:.2f} ms")
                        print("-" * 50)
                        last_chunk_time = now
                    chunk_data = json.loads(json_str)
                    candidates = chunk_data.get("candidates", [])
                    # Per-chunk timings, kept for optional logging.
                    total_elapsed = (now - start_time) * 1000
                    chunk_gap = (now - last_chunk_time) * 1000 if last_chunk_time else 0
                    last_chunk_time = now
                    if candidates:
                        content = candidates[0].get("content", {})
                        parts = content.get("parts", [])
                        if parts:
                            text_chunk = parts[0].get("text", "")
                            print(text_chunk, end="", flush=True)
                except json.JSONDecodeError as e:
                    # Report malformed chunks instead of silently dropping them.
                    print(f"\n[warn] could not parse chunk: {e}")
    except requests.RequestException as e:
        # Report network/HTTP errors instead of silently swallowing them.
        print(f"request failed: {e}")
    end_time = time.time()
    print("\n\n" + "-" * 50)
    print(f"total time: {(end_time - start_time) * 1000:.2f} ms")


if __name__ == "__main__":
    stream_gemini_content()
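The timing logic in the script above can also be separated from the HTTP call so that it can be checked without hitting the API. The following is a minimal sketch, not part of the original test: `measure_sse_stream` is a hypothetical helper, and unlike the script above it measures TTFT to the first chunk that carries non-empty text rather than to the first `data:` line, which is a deliberate assumption about what "first token" should mean.

```python
import json
import time
from typing import Callable, Iterable, List, Optional, Tuple


def measure_sse_stream(
    lines: Iterable[bytes],
    start: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], List[float], str]:
    """Return (ttft_ms, chunk_gaps_ms, full_text) for a stream of SSE byte lines.

    `start` is the clock reading taken just before the request was sent.
    """
    ttft_ms: Optional[float] = None
    gaps_ms: List[float] = []
    pieces: List[str] = []
    last = start
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data: "):
            continue  # blank separators and SSE comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        now = clock()
        chunk = json.loads(payload)
        # Concatenate the text parts of the first candidate, if any.
        text = ""
        for candidate in chunk.get("candidates", [])[:1]:
            for part in candidate.get("content", {}).get("parts", []):
                text += part.get("text", "")
        if not text:
            continue  # metadata-only chunk: does not count as a token
        if ttft_ms is None:
            ttft_ms = (now - start) * 1000
        else:
            gaps_ms.append((now - last) * 1000)
        last = now
        pieces.append(text)
    return ttft_ms, gaps_ms, "".join(pieces)
```

With `requests`, it could be called as `measure_sse_stream(response.iter_lines(), start=t0)`, where `t0 = time.perf_counter()` is read immediately before `requests.post(...)`. Passing a fake `clock` makes the function easy to exercise against canned SSE lines.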
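The code test is very slow: a 200-character story takes more than 17 s.

Google Cloud Vertex AI streaming output is very slow: with the gemini-3-pro-preview model, the first streamed chunk takes more than 17 s to arrive. Is there a good solution?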
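One factor that may be worth ruling out: `includeThoughts: False` only controls whether thought summaries are returned in the response. It does not by itself stop the model from reasoning before it emits the first visible token, so a thinking model can show a high TTFT even over a fast network path. The sketch below builds a variant of the request body that asks for a lower thinking level. It assumes that this endpoint accepts a `thinkingLevel` field under `generationConfig.thinkingConfig` for gemini-3-pro-preview with the value `"low"`; that is an assumption to verify against the current Vertex AI documentation. `build_payload` is a hypothetical helper name.

```python
import json
from typing import Optional


def build_payload(prompt: str, thinking_level: Optional[str] = None) -> dict:
    """Build a streamGenerateContent request body, optionally with a thinking level."""
    thinking_config = {"includeThoughts": False}
    if thinking_level is not None:
        # ASSUMPTION: the endpoint accepts `thinkingLevel` for this model;
        # check the current Vertex AI docs before relying on it.
        thinking_config["thinkingLevel"] = thinking_level
    return {
        "contents": [{
            "role": "user",
            "parts": [{"text": prompt}],
        }],
        "generationConfig": {
            "thinkingConfig": thinking_config,
        },
    }


if __name__ == "__main__":
    baseline = build_payload("Tell a 200-character story.")
    low_thinking = build_payload("Tell a 200-character story.", thinking_level="low")
    print(json.dumps(baseline, indent=2))
    print(json.dumps(low_thinking, indent=2))
```

Sending both bodies with the same prompt (as `json=` in `requests.post`) and comparing the measured TTFT would show whether the wait comes from the model's reasoning phase rather than from the network or the streaming transport.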
