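A Road the Game Industry Traveled More Than 30 Years Ago
Let me start with a bit of game history.
In 1991, Sid Meier released Civilization and took turn-based strategy (TBS) to a peak that has lasted to this day. Deep, complex, endlessly replayable: in every turn the player slowly thinks, plans, and decides, one step and then the next.
Just one year later, in 1992, Westwood Studios released Dune II: Battle for Arrakis, which established the basic conventions of RTS (real-time strategy): resource gathering, base building, tech trees, fog of war, and mouse-driven real-time control. Its sales were not spectacular, but nearly every later RTS was built on its design template.
Three years later, in 1995, Westwood's Command & Conquer made RTS a worldwide commercial hit for the first time. With multiplayer battles, cinematic storytelling, and fluid controls, RTS broke out of enthusiast circles and into the mainstream.
Three years after that, in 1998, Blizzard's StarCraft reached the top and turned RTS into a premier competitive genre, with an influence that has lasted more than two decades.
From Civilization to StarCraft, in just seven years, strategy games completed a thorough paradigm shift: from the turn-based "I take a step, the system takes a step" to the real-time "hands never off the mouse, eyes never off the battlefield." The core change was not that games became more complex, but that the player's hands no longer waited for the system to resolve a turn.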
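Looking one level deeper, though, this was not just an accidental evolution of the game industry. A game is a self-contained, simplified small world. People were initially forced to take part in it turn by turn, not because turn-based play was more fun, but because the technology of the time could not support real time. Once compute, input devices, and engine technology caught up, the mode of participation began to slide from turn-based to real-time. So even in this simplified small world, you can already see why the move from turn-based to real-time was inevitable:
- Turn-based: the product of discretizing the real world, cutting continuous time into "turns" that can be handled at leisure.
- The real world: all interaction already happens in real time at extremely high frequency. Nothing in the physical world is "waiting for the system to resolve a turn."
One level deeper still, real-time is essentially just turn-based at a much higher frequency: micro-turn based. Engineering practice confirmed this long ago. In the early days of RTC (Real-time Communication, here meaning real-time voice), the smallest unit was the individual voice packet: the continuous audio stream was cut into millisecond-scale packets, transmitted, and stitched back together. Today's RTM (Real-time Model) follows the same path, "continuous input and output streams split into micro-turns," breaking the input and output streams into very small turns. "Real time," in the end, just means "turns that are small enough."
Go deeper again and it starts to sound philosophical, like floating-point versus integer numbers in computing, or the quantized versus the continuous in physics. But in the context of this piece, the conclusion fits in one sentence: turn-based is coarse-grained discretization and real-time is fine-grained discretization. Real-time approaches the continuity of the real world, but it is still discrete at its core.
Games moving from turn-based to real-time was never a "style choice." Once technology allowed it, the way people and systems take part inevitably returned toward how the real world actually works.
Today the AI industry is walking this road again.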
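AI Is Walking the Same Road: From TBM to RTM
Only after the game industry moved from Turn-Based to Real-Time could real-time interactive genres such as FPS, RTS, MOBA, and STG flourish. The AI industry is now moving from TBM to RTM, and this is the core dichotomy this piece sets out to establish:
- Turn-Based Model (TBM): works by turns. You say something, the model waits until you finish, freezes all perception while it generates, and when it is done you speak again. Segment by segment, back and forth, like exchanging letters.
- Real-Time Model (RTM): works by streams. Input and output flow at the same time, and the model advances at the user's pace. In conversation that means listening, speaking, and answering in sync; in creation and editing it means acting, planning, and picking up the next step in sync; in embodied action it means perceiving, deciding, and executing in sync. RTM covers far more than shallow listen-and-speak applications. The deeper and broader battleground is creation, editing, and action: scenarios that have to be done together.
These are two fundamentally different paradigms, not fast and slow points on one continuous spectrum. A TBM is still a TBM however fast it is, and an RTM is still an RTM however slow it is. The difference lies not in response latency but in the model's relationship to time.
The AI most people have come to know over the past few years, the generation of chat LLMs such as GPT, Claude, and Gemini, is essentially TBM. The LLM industry as a whole has polished the TBM path to a very high standard.
But in scenarios where user operations are dense, intent is continuous, and there is no room for waiting, TBM breaks down. What these scenarios need is not a smarter TBM but an RTM: a model that works by streams, is always present, and advances in sync with the user's timeline.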
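Video editing is a natural battleground for RTM:
- Dense operations: an editor makes dozens of interactions per minute, scrubbing the timeline, cutting clips, adding transitions, adjusting volume.
- Continuous intent: the tail of one action often already contains the start of the next. Right after adding a J-cut, the editor will deal with the transition; after adjusting a stretch of BGM, the next step is very likely to duck vocals. Intent is a chain, not a series of isolated requests.
- Waiting goes against human nature: creative flow is fragile. One loading spinner, one "Thinking…", and the inspiration is gone.
What a TBM offers here is "ask the AI, wait three seconds while it thinks, get a suggestion." What an RTM has to deliver is the thing a TBM cannot: "I act, you pick it up," laying out the possibilities for the next step the instant my hand starts to move.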
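What We Are Building: PACE
One of the core technical lines we have been working on recently is called PACE internally: Predictive Action Chain Engine. We also mentioned it in Z Potential's coverage of us in March. It is essentially an RTM built for video editing. The problem it addresses is not "how does AI get smarter" but "how does AI truly stay present": between each moment of user action and the next, it lays out the possibilities in advance.
PACE in three lines:
- Predict creative intent from context in real time
- Provide intelligent action guidance, serving as the "predictive brain" of an Agent product
- Respond to editing action chains with low latency
And the way we build PACE is not "train a general model first, then bolt it onto an editing product" but model-application integration (模应一体): the model and the application are designed together and iterated together from day one. Every class of action in the product, every timing constraint, every user pause feeds directly back into the model's training objective; in the other direction, the model's capability boundary directly determines how much predictive interaction the product can carry. This is not "a model plus a shell" but a complete real-time system built for creative work. Model-application integration is the basic precondition for this engine to run at all, and it is the sharpest dividing line between us and the "wrap a UI around a general LLM" approach.
We chose editing as the landing point for real-time AI because it is one of the few genuinely creative arenas of the AI era. Users do not come to have AI do the work for them; they come to create together with AI. The core pleasure of editing, writing, and design has always been "I have an idea, and I make it real." If AI takes the replacement route here, it destroys the joy of creating. What it should do instead is human-AI cocreation: be the partner at the creator's side who is always online, always aware of context, and always half a step ahead. That is why editing is almost bound to be one of the most important applications of real-time models: it pushes the demands for both "real-time" and "collaboration" to the extreme.
In my previous piece, 《除夕前的 48 小时极限》 (roughly, "The 48-Hour Limit Before Lunar New Year's Eve"), I laid out our bet on the Video-editing Agent. PACE is the concrete answer to that bet at the engine layer: a real-time engine built for cocreation.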
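We Are Not the Only Ones on This Road
Zoom out a little and you will find that far more teams than ours are heading toward RTM.
Audio was the first field to make RTM work. Full-duplex conversational RTMs such as Moshi, PersonaPlex, and Nemotron VoiceChat have, over the past year or two, achieved simultaneous input and output streams. Why audio first? Because audio is inherently a stream in time and does not fit turns. The moment a TBM "waits for the user to finish before answering," the exchange stops feeling like a conversation; conversations between people have never worked that way.
Embodied intelligence has moved just as decisively. Physical Intelligence's π0, an RTM for robot control, meets the real-time requirement with an explicit two-layer architecture: in the foreground, a fast policy that never stops running catches every perception-action loop; in the background, slower reasoning breaks down long-horizon goals and complex plans. A conversation that lags a beat is merely a bad experience; a robot that lags a beat hits things. Embodied intelligence's extreme real-time demands forced this architecture into being.
In general multimodal dialogue, Thinking Machines recently published a piece titled Interaction Models that gives the RTM paradigm a thorough naming and exposition (their term is "Interaction Model," which is essentially the same thing as RTM). Its arguments, such as "interactivity must be part of the model" and "time-aligned micro-turns," are in the same vein as the audio and embodied work.
So you find that audio, embodied AI, general dialogue, and video editing, four RTM explorations in different modalities, by different teams, in different places, are converging on the same paradigm. This makes us somewhat more confident in PACE's direction: it is not an all-or-nothing bet but a move in step with an undercurrent of this era.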
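The Shared Foundation: System 1 / System 2
Behind these seemingly unrelated explorations there is a shared theoretical framework serving as the foundation: System 1 and System 2.
The System 1 / System 2 terminology was introduced by psychologists Keith Stanovich and Richard West in their work on dual-process theory, and was later adopted by Nobel laureate Daniel Kahneman, widely regarded as a founder of behavioral economics, in Thinking, Fast and Slow, which is how it became widely known outside psychology. It makes a fairly plain point: two systems cooperate in the human mind. S1 is unreflective, intuitive reaction: fast, cheap, always on. S2 is slow thinking, reasoning, and planning: expensive, scarce, and in need of being summoned.
Interestingly, this psychological framework has genuinely shaped the design intuition of this generation of engineers. Whether it is full-duplex audio models, π0 in embodied intelligence, or Thinking Machines' Interaction Model, the researchers behind them approach problems with an S1/S2 lens: which tasks belong to the always-on fast brain, and which should be handed to the slow brain to work through asynchronously. A cognitive model from psychology is being used by engineers as an architecture diagram.
PACE is an instance of the same pattern in video editing: the fast brain lays out the "next step" at every moment in real time, and the slow brain streams in more complex long-horizon judgments.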
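A Judgment
Pulling the whole thread together:
Moshi is an RTM for audio conversation.
π0 is an RTM for embodied intelligence.
PACE is an RTM for creation.
Thinking Machines' Interaction Model is an RTM for general multimodal dialogue.
Different modalities, different teams, and different places all arrive at the same paradigm, RTM (underpinned by System 1/2 plus real-time micro-turns). This is no coincidence; it is where the evolution of AI product form points at this stage.
For the past few years everyone was busy competing on how smart the model is, how many parameters it has, and how much it can write. All of that is competition inside TBM. But when it comes to actually solving problems at the application layer, you find that "smart" is only a necessary condition; "present" is what matters. The model has to be on the same timeline as the user, inside the same action chain, within the same sequence of intent. What counts is not how elegant the answer is, but how timely the reaction and how natural the handoff.
PACE is our answer to this: an RTM built through model-application integration.
The main battleground of AI over the next five years will no longer be the contest for the top spot within TBM but RTM: agents that can truly collaborate with people in real time (Human-AI Collaboration) and cocreate in real time (Human-AI Cocreation), inside your workflow, inside your action chain, before your intent has fully taken shape. It is not here to replace you; it is here to collaborate and create with you.
From the moment a new paradigm is established to the point where it truly defines a category, it often takes several years. This wave of AI will most likely follow a similar rhythm. This is not a moment for FOMO; it is a window for doing the work solidly.
We are on this road.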
A Path the Game Industry Walked 30 Years Ago
Let me start with a bit of game history.
In 1991, Sid Meier released Civilization, bringing turn-based strategy games to a peak that still feels alive today. It was deep, complex, and endlessly replayable. Players could think, plan, decide, and then move to the next turn.
Just one year later, in 1992, Westwood Studios released Dune II: Battle for Arrakis. It established the foundation of real-time strategy: resource gathering, base building, tech trees, fog of war, and mouse-driven real-time control. Its sales were not explosive, but almost every later RTS stood on its design template.
Three years after that, in 1995, Westwood made RTS a global commercial hit with Command & Conquer: multiplayer battles, cinematic storytelling, and fluid control brought the genre out of enthusiast circles and into the mainstream.
Then in 1998, Blizzard's StarCraft pushed RTS to an esports summit and shaped the next two decades.
From Civilization to StarCraft, strategy games completed a seven-year paradigm shift: from "I take a step, the system takes a step" to "hands on the mouse, eyes on the battlefield." The core change was not that games became more complex. It was that players no longer had to wait for the system to resolve a turn.
This was not just a random evolution in game style. A game is an independent, simplified world. People first participated in that world through turns not because turns were naturally better, but because the technology of the time could not carry real time. Once compute, input devices, and engine technology caught up, participation naturally moved from turn-based to real-time.
- Turn-based systems are the result of discretizing the continuous world into slow, manageable turns.
- The real world is already happening as high-frequency real-time interaction. Physics does not wait for a turn to settle.
Real time is, at a deeper level, simply higher-frequency turns: micro-turn based. This has long been true in engineering. Early RTC systems chopped continuous speech into tiny audio packets, transmitted them, and stitched them back together. Today's RTM, or Real-time Model, follows the same path: continuous input and output streams split into micro-turns. "Real time" is ultimately "turns small enough to feel continuous."
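To make the "micro-turn" idea concrete, here is a minimal Python sketch of the RTC-style framing described above: a continuous sample stream is cut into fixed-length frames and stitched back together. The 16 kHz sample rate, the 20 ms frame size, and the function names are illustrative assumptions, not details of any particular RTC stack.

```python
from typing import Iterable, Iterator, List


def split_into_micro_turns(
    samples: List[float], sample_rate_hz: int, frame_ms: int
) -> Iterator[List[float]]:
    """Yield consecutive frames of `frame_ms` milliseconds (the last may be shorter)."""
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame is shorter than one sample")
    for start in range(0, len(samples), frame_len):
        yield samples[start:start + frame_len]


def reassemble(frames: Iterable[List[float]]) -> List[float]:
    """Concatenate frames back into one continuous stream."""
    out: List[float] = []
    for frame in frames:
        out.extend(frame)
    return out


# One second of a synthetic 16 kHz signal (assumed values, for illustration only).
signal = [float(i % 100) for i in range(16_000)]
frames = list(split_into_micro_turns(signal, sample_rate_hz=16_000, frame_ms=20))
```

Each frame here is one very small "turn": the stream stays discrete, but the turns are short enough that the listener never perceives them.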
The game industry did not move from turns to real time as a matter of taste. When technology allowed it, human-system participation returned toward the native shape of the real world.
AI is now walking the same road.
AI Is Moving from TBM to RTM
The game industry moved from Turn-Based to Real-Time, enabling FPS, RTS, MOBA, STG, and many other real-time interaction forms. AI is now moving from TBM to RTM. This is the core distinction I want to draw:
- Turn-Based Model, or TBM: the model works by turns. You speak, it waits for you to finish, freezes perception, generates a response, and then waits again. It feels like writing letters back and forth.
- Real-Time Model, or RTM: the model works by streams. Input and output flow at the same time, and the model advances with the user's rhythm. In conversation, it listens and speaks synchronously. In creation and editing, it moves with your hands and picks up your intent. In embodied action, it perceives, decides, and acts continuously.
These are fundamentally different paradigms, not just points on a speed spectrum. A fast TBM is still TBM. A slow RTM is still RTM. The difference is not latency alone, but the model's relationship to time.
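The structural difference can be sketched in a few lines of Python. This is a toy contrast under stated assumptions, not how any production model is implemented: `respond` is a hypothetical stand-in for the model, and the input is a list of short chunks.

```python
from typing import Callable, Iterable, Iterator, List


def turn_based(chunks: Iterable[str], respond: Callable[[List[str]], str]) -> List[str]:
    """TBM shape: consume the whole input first, then produce one response."""
    seen = list(chunks)  # nothing is emitted until the turn has ended
    return [respond(seen)]


def real_time(chunks: Iterable[str], respond: Callable[[List[str]], str]) -> Iterator[str]:
    """RTM shape: emit an output after every micro-turn of input."""
    seen: List[str] = []
    for chunk in chunks:
        seen.append(chunk)
        yield respond(seen)  # output interleaves with input


def echo_last(seen: List[str]) -> str:
    """Hypothetical 'model': reports how much input it has seen so far."""
    return f"after {len(seen)} chunk(s): {seen[-1]}"
```

Making `respond` faster shortens the wait in `turn_based`, but it still emits nothing until the input ends; only the loop structure in `real_time` changes the relationship to time.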
The AI products most people know from the past few years, including GPT, Claude, and Gemini-style chat LLMs, are essentially TBMs. The industry has refined that path to a very high level.
But TBM breaks down in scenarios where users operate densely, intent is continuous, and waiting is unacceptable. Those scenarios do not need a smarter TBM. They need an RTM: a model that works by streams, stays present, and moves on the same timeline as the user.
Video editing is a natural battlefield for RTM:
- Dense operation: an editor may interact dozens of times per minute, dragging timelines, cutting clips, adding transitions, and adjusting audio.
- Continuous intent: the tail of one action often contains the beginning of the next. After adding a J-cut, the editor may immediately need to handle a transition. After adjusting background music, they may need to duck vocals. Intent is a chain, not a pile of isolated prompts.
- Waiting breaks creation: a loading spinner or a "thinking..." moment can interrupt the fragile state of creative flow.
What TBM cannot do here is exactly what RTM must do: respond in the moment, take over the next beat, and place possibilities in front of the user just as their hand starts to move.
What We Are Building: PACE
One of our core technical directions is called PACE: Predictive Action Chain Engine. It is, in essence, an RTM born for video editing. The problem it solves is not "how can AI become smarter?" but "how can AI truly be present?" Between the current user action and the very next moment, it prepares likely next actions ahead of time.
PACE can be summarized in three lines:
- Predict creative intent from context in real time.
- Provide intelligent action guidance as the predictive brain of an Agent product.
- Respond to editing action chains with low latency.
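As a rough illustration of what "predicting the next link in an action chain" can mean, the sketch below uses a first-order frequency model over editing actions. It is a deliberately simple stand-in and not a description of how PACE works; the class name and the action labels (`j_cut`, `transition`, and so on) are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


class NextActionPredictor:
    """Counts which action tends to follow which, and suggests the most frequent."""

    def __init__(self) -> None:
        self._counts: Dict[str, Counter] = defaultdict(Counter)
        self._last: Optional[str] = None

    def observe(self, action: str) -> None:
        """Record one user action and update the transition counts."""
        if self._last is not None:
            self._counts[self._last][action] += 1
        self._last = action

    def suggest(self, k: int = 1) -> List[str]:
        """Return up to k likely next actions given the most recent one."""
        if self._last is None:
            return []
        return [a for a, _ in self._counts[self._last].most_common(k)]


predictor = NextActionPredictor()
for action in ["j_cut", "transition", "adjust_bgm", "duck_vocals",
               "j_cut", "transition", "adjust_bgm", "duck_vocals", "j_cut"]:
    predictor.observe(action)
```

Because `suggest` is a dictionary lookup, it can run on every user action without adding a perceptible wait, which is the property the third bullet asks for.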
We are not building PACE by training a generic model first and wrapping it in an editing interface later. We are building it through model-application integration: the model and the product are designed and iterated together from day one. Every action type, timing constraint, and user pause in the product feeds back into the model objective; the model's capability boundary, in turn, shapes how much predictive interaction the product can expose.
This is not "a model plus a shell." It is a complete real-time system designed for creative scenarios. Model-application integration is the condition that lets this engine run, and it is also the biggest difference between our path and simply wrapping a general LLM in a UI.
We chose editing as the landing point for real-time AI because it is one of the few true creative arenas of the AI era. Users are not here to have AI replace their work; they are here to create together with AI. The joy of editing, writing, and design lies in having an idea and making it real. If AI follows a pure replacement path here, it destroys the pleasure of creation. Its better role is human-AI cocreation: the always-on collaborator beside the creator, aware of context and half a step faster.
PACE is our answer at the engine layer: a real-time engine built for cocreation.
We Are Not the Only Ones Moving This Way
Zooming out, many teams are moving toward RTM.
Audio got there early. Full-duplex conversational RTMs such as Moshi, PersonaPlex, and Nemotron VoiceChat have already shown simultaneous input and output streams. Audio had to move first because audio itself is a time stream. Once a model waits for the user to finish before replying, the feeling of conversation collapses.
Embodied intelligence is also moving firmly in this direction. Physical Intelligence's π0, an RTM for robot control, uses a clear two-layer architecture: a fast always-on policy catches every perception-action loop, while a slower reasoning layer handles long-horizon goals and complex planning. A late response in chat is annoying; a late response in robotics can cause a collision.
General multimodal interaction is converging too. Thinking Machines recently published Interaction Models, which names and explains this paradigm in detail. Their language around interaction being part of the model and time-aligned micro-turns is essentially the same underlying direction.
Audio, embodied AI, general multimodal interaction, and video editing are different modalities and different teams, but their explorations are converging on the same paradigm. That makes us more confident in PACE: it is not a lonely bet, but part of the hidden current of this era.
The Shared Foundation: System 1 / System 2
Behind these seemingly separate explorations sits a shared theoretical foundation: System 1 and System 2.
The System 1 / System 2 terminology was introduced by psychologists Keith Stanovich and Richard West in their work on dual-process theory, and later popularized beyond psychology by Daniel Kahneman in Thinking, Fast and Slow. It describes a simple idea: the mind has two cooperating systems. System 1 is intuitive, fast, cheap, and always on. System 2 is slower reasoning and planning: expensive, scarce, and summoned when needed.
This psychology framework has genuinely influenced engineering intuition. Full-duplex audio models, embodied RTMs, and Thinking Machines' interaction model all carry an S1/S2 flavor: which tasks should go to the always-on fast brain, and which should be handled asynchronously by the slow brain?
PACE is this same pattern applied to video editing. The fast brain lays out the next possible action in real time; the slow brain streams in more complex long-horizon judgment.
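The fast/slow split can be sketched as a small simulation. This is an assumption-laden toy, not PACE's architecture: the slow path is modeled with a tick counter instead of real concurrency, and `FAST_TABLE`, `SlowPlanner`, and the action names are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lookup standing in for an always-on fast policy (System 1).
FAST_TABLE: Dict[str, str] = {"j_cut": "add_transition", "adjust_bgm": "duck_vocals"}


def fast_policy(action: str) -> str:
    """Answer immediately on every action."""
    return FAST_TABLE.get(action, "no_suggestion")


class SlowPlanner:
    """Simulated System 2: a result becomes available `latency_ticks` after submit."""

    def __init__(self, latency_ticks: int) -> None:
        self.latency_ticks = latency_ticks
        self._job: Optional[Tuple[int, int]] = None  # (ready_at_tick, history_length)

    def busy(self) -> bool:
        return self._job is not None

    def submit(self, history: List[str], now: int) -> None:
        self._job = (now + self.latency_ticks, len(history))

    def poll(self, now: int) -> Optional[str]:
        """Return the finished plan if it is ready, otherwise None."""
        if self._job is not None and now >= self._job[0]:
            _, n = self._job
            self._job = None
            return f"long-horizon plan from first {n} action(s)"
        return None


def run_session(actions: List[str], latency_ticks: int = 3) -> List[Tuple[str, str]]:
    planner = SlowPlanner(latency_ticks)
    history: List[str] = []
    log: List[Tuple[str, str]] = []
    for tick, action in enumerate(actions):
        history.append(action)
        log.append(("fast", fast_policy(action)))  # the fast path never waits
        result = planner.poll(tick)
        if result is not None:
            log.append(("slow", result))  # slow result is merged in when ready
        if not planner.busy():
            planner.submit(history, tick)
    return log
```

The design point is that the loop never blocks on the planner: every action gets a fast suggestion, and the slower judgment arrives a few ticks later without interrupting the stream.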
A Judgment
Pulling the thread together:
Moshi is an RTM for audio conversation.
π0 is an RTM for embodied intelligence.
PACE is an RTM for creation.
Thinking Machines' Interaction Model is an RTM for general multimodal interaction.
Different modalities, different teams, and different locations are converging on RTM: System 1/2 plus real-time micro-turns. This is not coincidence. It is the direction AI product form points to at this stage.
In the past few years, the industry competed over model intelligence, parameter scale, and context length. Those are competitions inside TBM. But when AI reaches the application layer, intelligence is only a prerequisite. Presence becomes the key. The model must live on the same timeline as the user, inside the same action chain, before intent is fully formed.
PACE is our answer: an RTM built through model-application integration.
The next five years of AI will not only be a race inside TBM. It will be about RTM: agents that can collaborate and cocreate with people in real time, inside their actions, before their intent has fully crystallized. It is not here to replace you. It is here to create with you.
When a new paradigm is named, it still takes years to define a category. This is not a moment for FOMO. It is a window for doing the work carefully.
We are on that road.