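This tool automatically splits a user-provided natural-language script into multiple short-video segment scripts of n seconds each, while preserving visual continuity, character consistency, and action continuity. It is intended for mainstream AI video generation models such as Runway, Pika, Sora, Wan, and Stable Video.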

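  • Requirement: Suppose I have a script that runs roughly two minutes and want to generate the matching short video with an AI model.

  • Technical limitation: Current models can generate only 5-10 seconds of video at a time, so a two-minute video has to be stitched together from many 5-second clips.

  • Task and challenges: Stitching requires splitting the original script first. Each split piece should run as close as possible to 5-10 seconds (depending on the model), and consecutive clips must stay continuous; otherwise scenes, actions, and characters will fail to line up once the generated clips are merged.

    Actions and speech rate in the story also affect duration, so many situations need to be considered: for example, elderly characters move slowly, an angry shout is delivered faster than normal speech, and running is faster than walking.

    This is the task the agent performs. The user supplies only the script; the agent breaks it down with a range of techniques and returns the resulting script fragments. The user then hands those fragments to a model (Runway, Pika, Sora, Wan, Stable Video, etc.) for generation and finally merges the clips into a complete video with the appropriate tooling.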
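The point about actions and speech rate changing duration can be illustrated with a small rule-based estimator. This is only a sketch under assumptions: the `Beat` type, the `estimate_seconds` function, and every number in the tables below are hypothetical and chosen for illustration; the project itself states that it prefers LLM-based estimation over hand-written rules.

```python
from dataclasses import dataclass

# Hypothetical baseline values; none of these numbers come from the project.
BASE_CHARS_PER_SECOND = 4.0                      # assumed speaking rate for Chinese dialogue
SPEECH_RATE = {"neutral": 1.0, "angry": 1.3, "sad": 0.8}
ACTION_SECONDS = {"walk": 3.0, "run": 1.5, "sit_down": 2.0}
PACE = {"adult": 1.0, "elderly": 1.5, "child": 0.9}  # multiplier on duration


@dataclass
class Beat:
    kind: str                 # "dialogue" or "action"
    content: str              # the spoken line, or an action key such as "walk"
    emotion: str = "neutral"
    performer: str = "adult"


def estimate_seconds(beat: Beat) -> float:
    """Estimate how long one dialogue line or action takes on screen."""
    pace = PACE.get(beat.performer, 1.0)
    if beat.kind == "dialogue":
        # Angry speech is assumed faster, so the same line takes less time.
        rate = BASE_CHARS_PER_SECOND * SPEECH_RATE.get(beat.emotion, 1.0)
        return round(len(beat.content) / rate * pace, 2)
    # Unknown actions fall back to an assumed 2 seconds.
    return round(ACTION_SECONDS.get(beat.content, 2.0) * pace, 2)


if __name__ == "__main__":
    print(estimate_seconds(Beat("action", "walk", performer="elderly")))
    print(estimate_seconds(Beat("dialogue", "你会不会不来了", emotion="angry")))
```

A lookup table like this is easy to audit but brittle; it mainly serves as a sanity check on whatever durations an LLM proposes.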

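For detailed usage, see the GitHub project (features are still being developed). If you have better ideas, you are welcome to contribute or open a feature request.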

Script Storyboard Agent: Architecture Design and Implementation (v1.x) | 余一叶知秋尽

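Script Storyboard Agent (MVP)

Current version: 1.0.0 (MVP). It processes a single short script at a time and keeps no state across scripts.

Design principles

  1. Core functionality first; keep complexity down

  2. Use the LLM wherever possible and minimize hand-written rules

  3. Maintain basic continuity without over-engineering

  4. Keep the output format uniform so it is easy to extend later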

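Core features included

  1. Basic script parsing (scenes, characters, dialogue)

  2. Simple shot splitting (based on scenes and dialogue)

  3. Forced 5-second splitting (no complex merging)

  4. Basic prompt generation (template + LLM)

  5. Basic quality checks (duration, basic continuity)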
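Feature 3, the forced 5-second split without merging, might look like the following. This is a minimal sketch, not the project's implementation: the `Shot` and `Fragment` types and the `split_shots` function are hypothetical, and only the `frag_001`-style identifier format is taken from the sample output later in this article.

```python
from dataclasses import dataclass
from typing import List

MAX_FRAGMENT_SECONDS = 5.0


@dataclass
class Shot:
    shot_id: str
    description: str
    duration: float  # estimated seconds


@dataclass
class Fragment:
    fragment_id: str
    shot_id: str
    description: str
    duration: float


def split_shots(shots: List[Shot], limit: float = MAX_FRAGMENT_SECONDS) -> List[Fragment]:
    """Cut every shot into pieces no longer than `limit` seconds.

    Short shots are kept as their own fragment; nothing is merged.
    """
    fragments: List[Fragment] = []
    for shot in shots:
        remaining = shot.duration
        while remaining > 1e-9:
            piece = min(remaining, limit)
            fragment_id = f"frag_{len(fragments) + 1:03d}"
            fragments.append(
                Fragment(fragment_id, shot.shot_id, shot.description, round(piece, 2))
            )
            remaining -= piece
    return fragments


if __name__ == "__main__":
    shots = [Shot("shot_001", "rainy street corner", 12.0),
             Shot("shot_002", "close-up of the book", 3.0)]
    for frag in split_shots(shots):
        print(frag.fragment_id, frag.shot_id, frag.duration)
```

Because nothing is merged, a 12-second shot followed by a 3-second shot yields four fragments rather than three; the smarter merging strategy is listed among the features deferred to later versions.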

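Features not yet included (planned for later versions)

  1. Complex continuity management

  2. Multi-model adaptation

  3. Intelligent merging strategy

  4. Advanced sentiment analysis

  5. Automatic correction and rollback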

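Core goal

Split a script of any length into N visually coherent script units (Shots) of 5 seconds each for Sora video generation.

  • Automated conversion: an end-to-end automated flow from script to AI video instructions

  • Intelligent segmentation: handle the 5-second limit intelligently while keeping the narrative coherent

  • Continuity assurance: basic consistency of characters, scenes, and props

  • High-quality output: professional-grade prompts for AI video generation

Core challenges:

  1. Handling multiple input formats (natural language, AI storyboards, structured scenes, standard screenplays)
  2. Accurately estimating a reasonable duration for each visual or dialogue element
  3. Ensuring visual and narrative continuity between 5-second fragments
  4. Generating prompts optimized for Sora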
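Part of challenge 3 can be checked with plain rules once fragments exist. The sketch below is an assumption about what a basic check could look like, not the project's reviewer: the `check_fragments` function is hypothetical, while the field names (`fragment_id`, `duration`, `audio_prompt`, `audio_id`, `duration_seconds`, `previous_audio_id`) follow the sample API output shown later in this article.

```python
from typing import Dict, List, Optional


def check_fragments(fragments: List[Dict], max_seconds: float = 5.0) -> List[str]:
    """Return human-readable issues found by simple rule checks."""
    issues: List[str] = []
    previous_audio_id: Optional[str] = None
    for frag in fragments:
        fid = frag["fragment_id"]
        # Rule 1: the video fragment must fit the model's clip length.
        if frag["duration"] > max_seconds:
            issues.append(f"{fid}: duration {frag['duration']}s exceeds {max_seconds}s")
        audio = frag.get("audio_prompt")
        if audio is None:
            continue
        # Rule 2: audio and video should cover the same time span.
        if abs(audio["duration_seconds"] - frag["duration"]) > 0.01:
            issues.append(f"{fid}: audio and video durations differ")
        # Rule 3: each audio clip should point at the one before it.
        if audio.get("previous_audio_id") != previous_audio_id:
            issues.append(f"{fid}: broken audio chain")
        previous_audio_id = audio["audio_id"]
    return issues


if __name__ == "__main__":
    sample = [
        {"fragment_id": "frag_001", "duration": 4.0,
         "audio_prompt": {"audio_id": "audio_001", "duration_seconds": 4.0,
                          "previous_audio_id": None}},
        {"fragment_id": "frag_002", "duration": 5.2,
         "audio_prompt": {"audio_id": "audio_002", "duration_seconds": 5.2,
                          "previous_audio_id": "audio_001"}},
    ]
    for issue in check_fragments(sample):
        print(issue)
```

Rules like these catch only structural problems; whether two consecutive prompts describe the same outfit or the same prop still needs an LLM or a human reviewer.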

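Use cases

  • Scenario 1: Content creators

    User profile: short-video creators, independent media bloggers, content marketers
    Typical needs: quickly turn story scripts into video content, generate short-video material in batches, and lower the barrier and time cost of video production

  • Scenario 2: Film and TV previsualization

    User profile: independent filmmakers, student crews, low-budget production teams
    Typical needs: storyboard preview and visualization of a script, and proof of concept at a controlled cost

  • Scenario 3: Game story production

    User profile: independent game developers, visual novel authors
    Typical needs: rapid prototyping of in-game cutscenes and low-cost plot demos

  • Scenario 4: Education and training content

    User profile: online education providers, corporate training departments, paid-knowledge creators
    Typical needs: turn text courses into video courses, and update and iterate on teaching content quickly

  • Scenario 5: Advertising and marketing material

    User profile: advertising agencies, e-commerce companies, brand marketing teams
    Typical needs: quickly generate product demo videos adapted to multiple platforms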

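Inputs and outputs

| Stage | Input | Output | Key processing |
| --- | --- | --- | --- |
| Script parsing | Raw text | Structured script | LLM extracts scenes, characters, and dialogue |
| Shot splitting | Structured script | Shot sequence | Split shots on changes in dialogue and action |
| Video segmentation | Shot sequence | Video fragments | Cut into 5-second pieces while keeping continuity |
| Instruction conversion | Video fragments | Prompt instructions | LLM generates the visual description |
| Quality review | Prompt instructions | Review report | Basic rule checks |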
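The five stages above chain naturally, each consuming the previous stage's output. The skeleton below is a sketch under heavy assumptions: every function name is hypothetical, and the two LLM-backed stages are replaced by trivial stubs so that the example runs without any model or network access.

```python
from typing import Any, Callable, Dict, List, Tuple


def parse_script(raw_text: str) -> Dict:
    # Stub for the LLM parser: treat each non-empty line as one scene.
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    return {"scenes": [{"text": line} for line in lines]}


def split_into_shots(script: Dict) -> List[Dict]:
    # Stub: one shot per scene with an assumed 4-second duration.
    return [{"shot_id": f"shot_{i:03d}", "text": scene["text"], "duration": 4.0}
            for i, scene in enumerate(script["scenes"], start=1)]


def segment_shots(shots: List[Dict]) -> List[Dict]:
    # Stub: the stub shots are already short, so each becomes one fragment.
    # A real segmenter would cut longer shots at the 5-second limit.
    return [{"fragment_id": f"frag_{i:03d}", "text": shot["text"],
             "duration": shot["duration"]}
            for i, shot in enumerate(shots, start=1)]


def to_prompts(fragments: List[Dict]) -> List[Dict]:
    # Stub for the LLM prompt writer: wrap the text in a fixed template.
    return [{**frag, "prompt": f"Cinematic shot: {frag['text']}"} for frag in fragments]


def review(prompts: List[Dict]) -> Dict:
    # Basic rule check: flag fragments longer than 5 seconds.
    issues = [p["fragment_id"] for p in prompts if p["duration"] > 5.0]
    return {"passed": not issues, "issues": issues, "fragments": prompts}


PIPELINE: List[Tuple[str, Callable[[Any], Any]]] = [
    ("script parsing", parse_script),
    ("shot splitting", split_into_shots),
    ("video segmentation", segment_shots),
    ("instruction conversion", to_prompts),
    ("quality review", review),
]


def run_pipeline(raw_text: str) -> Dict:
    data: Any = raw_text
    for _name, stage in PIPELINE:
        data = stage(data)
    return data


if __name__ == "__main__":
    report = run_pipeline("Rain falls on a street corner.\nA girl wipes a wet book.")
    print(report["passed"], len(report["fragments"]))
```

Keeping the stages as a plain list of callables makes it straightforward to swap a stub for an LLM-backed implementation, or to insert an extra stage, without touching the runner.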

Input script

{
"script": "剧本标题:《雨中的约定》\n时长:约30秒\n场景:城市街角咖啡店外,雨天\n角色:\n- 林小雨(女,20岁,学生,抱着一本湿漉漉的书)\n- 陈阳(男,22岁,兼职外卖员,穿着黄色雨衣)\n\n[开场]\n(雨声淅沥,镜头从灰蒙蒙的天空下摇,聚焦在咖啡店外的长椅上。林小雨蹲在长椅旁,用手帕擦拭一本被雨水浸湿的诗集,神情焦急。)\n林小雨(自言自语,带着哭腔):\n\"明明说好今天还书的……这雨下得,他会不会不来了?\"\n\n[镜头切换]\n(陈阳骑着电动车冲进雨幕,后座外卖箱里露出一角蓝色封面的书。他刹车太急,差点撞上长椅,林小雨的书掉进水洼。)\n陈阳(手忙脚乱捡书,抬头):\n\"对不起!这书……是你的?\"\n\n[特写]\n(两本书并排躺在水洼里——林小雨的《飞鸟集》,陈阳外卖箱里的同款书,封面上贴着\"借阅卡:陈阳→林小雨\"。)\n林小雨(愣住,突然笑了):\n\"你迟到十分钟,但……书没湿透。\"\n\n陈阳(挠头,从雨衣里掏出干毛巾裹住书):\n\"我跑了两条街找防水袋……诗里说'雨是云的眼泪',可我不想让你哭。\"\n\n[结尾]\n(雨渐小,阳光穿透云层。林小雨翻开书,里面夹着一张电影票根,日期是下周三。陈阳脱下雨衣罩在她头上,两人并肩跑向屋檐,笑声渐远。)\n画外音(林小雨的旁白):\n\"有些约定,会迟到,但永远不会缺席。\"\n\n[黑屏,字幕浮现]\n\"雨会停,而故事才刚刚开始。\"\n\n风格:清新治愈,带点幽默,适合短视频平台传播。\n核心冲突:用\"湿书\"和\"迟到\"制造小误会,通过\"同款书\"和\"电影票\"暗示双向暗恋,雨天象征情感转折。"
}

API output (the prompt instructions after segmentation):

{
"task_id": "HL202603061716542571",
"status": "success",
"success": true,
"data": {
"instructions": {
"metadata": {
"generated_at": "2026-03-06T17:28:57.958958",
"version": "mvp_1.0",
"video_model": "runway_gen2",
"audio_model": "XTTSv2",
"total_prompts": 13,
"converter_type": "LLMPromptConverter"
},
"project_info": {
"title": "AI视频项目",
"total_fragments": 13,
"total_duration": 48.0,
"source_fragments": [
"frag_001",
"frag_002",
"frag_003",
"frag_004",
"frag_005",
"frag_006",
"frag_007",
"frag_008",
"frag_009",
"frag_010",
"frag_011",
"frag_012",
"frag_013"
]
},
"fragments": [
{
"fragment_id": "frag_001",
"prompt": "Cinematic wide shot: overcast sky with low, heavy gray clouds; cold fine rain falling diagonally, soaking the urban street; wet bluish-gray brick pavement reflecting faint ambient light; green matte-metal bench with light-gray cushion at street corner, facing red-and-white coffee shop sign; deep brown wooden eaves with silent copper wind chime; shallow puddles ripple gently; soft natural lighting, realistic texture, shallow depth of field, Fujifilm Superia film grain, 35mm cinematic color grading.\n\n全景镜头:灰蒙蒙低云压境,细密斜织的冷雨笼罩街道;青砖地面积水泛微光;街角绿色哑光金属长椅配浅灰坐垫,正对红底白字咖啡店招牌;深褐色木质屋檐悬垂静止铜风铃;自然柔光,真实质感,浅景深,胶片颗粒感,电影级调色。",
"negative_prompt": "cartoon, anime, 3D render, text, logo, watermark, deformed hands, extra limbs, blurry face, low resolution, oversaturated, artificial lighting, studio set, people in frame",
"duration": 4.0,
"model": "runway_gen2",
"style": "cinematic realism, Fujifilm Superia aesthetic, atmospheric rain mood, subtle film grain, shallow depth of field",
"requires_special_attention": false,
"audio_prompt": {
"audio_id": "audio_001",
"prompt": "Natural rainfall as base layer — soft, steady drizzle with gentle spatial stereo spread; occasional subtle water drip from eaves (left channel); distant muffled footstep on wet bricks (right channel, fading in/out); warm ambient tone, no electronic artifacts, high-fidelity field recording quality, immersive binaural feel.\n\n以自然雨声为基底——轻柔持续的淅沥雨声,具备空间立体感;偶有屋檐水滴声(左声道);远处模糊湿砖脚步声(右声道,渐入渐出);整体温暖通透,无电子合成感,高保真环境录音品质,沉浸式双耳声场。",
"negative_prompt": "music, speech, voiceover, synth tones, reverb-heavy echo, clipping, distortion, silence, abrupt cuts",
"model_type": "AudioLDM_3",
"voice_type": "narration",
"audio_style": "realistic",
"voice_character": null,
"voice_description": "ambient field recording, ultra-clean, ASMR-grade spatial fidelity, organic timbre",
"speed": 1.0,
"pitch_shift": 0.0,
"emotion": "neutral",
"stability": 0.7,
"duration_seconds": 4.0,
"sound_attributes": {
"intensity": 0.8,
"reverb": 0.3
},
"format": "wav",
"sample_rate": 24000,
"seed": 42719,
"scene_context": "Urban street corner outside a coffee shop, afternoon rain, quiet contemplative atmosphere, cinematic realism",
"previous_audio_id": null
}
},
{
"fragment_id": "frag_002",
"prompt": "medium shot, cinematic lighting, soft rain falling, Lin Xiaoyu squatting beside a matte green metal bench, wearing a cream-colored long dress with damp hem clinging to her legs, light gray knitted cardigan with rolled sleeves, shoulder of outer layer slightly damp, shoulder-length black hair slightly wet, thin-framed glasses, anxious expression softening into gentle smile, cobalt-blue hardcover 'Gitanjali' poetry book with gold-embossed English title, yellowed pages, water stain spreading on front cover, white library card taped at bottom right corner with handwritten 'Chen Yang → Lin Xiaoyu' in ink with rightward arrow and slight bleed, indigo movie ticket stub ('Little Forest: Summer', 'Next Wednesday 19:00') tucked in flyleaf, red-and-white coffee shop sign visible in background, dark brown wooden eaves with silent copper wind chime above, natural ambient rain sound, subtle water drip and distant footsteps\n\n中景:林小雨蹲在绿色金属长椅旁,米白长裙下摆微湿贴腿,浅灰针织开衫袖口微卷,肩头微湿,齐肩黑发微湿,戴细框眼镜,神情由焦虑转为温柔笑意;《飞鸟集》钴蓝色硬壳精装封面烫金英文标题,纸页微泛黄,封面有水渍晕染痕迹;借阅卡纯白,手写‘陈阳→林小雨’,箭头向右,墨迹微洇,贴于封面右下角;电影票根靛蓝色,印有《小森林·夏》及‘下周三19:00’,边缘微卷,夹在扉页;背景可见红底白字咖啡店招牌,深褐色木质屋檐悬垂铜制风铃(未响);环境细雨淅沥,自然生活音效",
"negative_prompt": "blurry, deformed hands, extra limbs, text errors, distorted face, low resolution, cartoonish, anime style, photorealistic exaggeration, harsh shadows, overexposed, synthetic voice, electronic music, silence",
"duration": 3.0,
"model": "runway_gen2",
"style": "cinematic realism, shallow depth of field, Fujifilm Superia color grading, soft focus background, naturalistic lighting",
"requires_special_attention": false,
"audio_prompt": {
"audio_id": "audio_002",
"prompt": "Natural rain ambiance with gentle water drip and faint distant footsteps; warm, intimate, breathable vocal delivery of Lin Xiaoyu's line: '明明说好今天还书的……这雨下得,他会不会不来了?' — tender, slightly breathy, youthful female voice with quiet anxiety resolving into hopeful softness; no reverb-heavy studio tone, no pitch correction, no background music\n\n以自然雨声为基底,叠加细腻水滴与远处脚步声;林小雨台词:‘明明说好今天还书的……这雨下得,他会不会不来了?’——温暖私密、略带气息感的少女声线,语调从轻忧渐转柔和期待;无混响过重录音室感,无音高校正,无人工合成感,无背景音乐",
"negative_prompt": "robotic voice, loud thunder, overlapping speech, laughter, music, echo, distortion, silence, AI artifacts, metallic resonance",
"model_type": "XTTSv2",
"voice_type": "character_dialogue",
"audio_style": "realistic",
"voice_character": "Lin Xiaoyu",
"voice_description": "young Chinese female, early 20s, clear diction, soft timbre, gentle breath support, slight nasal resonance, emotionally nuanced, natural cadence",
"speed": 0.95,
"pitch_shift": 0.0,
"emotion": "tender",
"stability": 0.7,
"duration_seconds": 3.0,
"sound_attributes": {
"intensity": 0.8,
"reverb": 0.3
},
"format": "wav",
"sample_rate": 24000,
"seed": 42719,
"scene_context": "Urban street corner, rainy afternoon, outside a red-and-white coffee shop, under wooden eaves, gentle rain falling on green metal bench",
"previous_audio_id": "audio_001"
}
},
{
"fragment_id": "frag_003",
"prompt": "medium shot, slightly close with shallow depth of field: Lin Xiaoyu crouches beside a matte green metal bench, left hand gently cradling the cobalt-blue hardcover 'Gitanjali' — gold-embossed English title, yellowed pages, visible water stain bloom on front cover; white library card affixed to lower right corner with handwritten 'Chen Yang → Lin Xiaoyu' in blurred ink and rightward arrow; indigo movie ticket stub for 'Little Forest: Summer', 'Next Wednesday 19:00' printed clearly, tucked into flyleaf; she wears a creamy-white midi skirt and light-gray knitted cardigan, shoulder of outer layer slightly damp, black-framed glasses, shoulder-length wet black hair; soft anxious expression shifting subtly toward tender warmth; ambient overcast daylight, gentle rain falling, red-and-white coffee shop sign visible behind her, deep brown wooden eaves with silent copper wind chime overhead\n\n中景偏近镜头(带轻微浅焦):林小雨蹲在哑光绿色金属长椅旁,左手轻托钴蓝色硬壳精装《飞鸟集》——烫金英文标题,纸页微泛黄,封面有水渍晕染痕迹;纯白借阅卡贴于封面右下角,手写‘陈阳→林小雨’,墨迹微洇,箭头向右;靛蓝色电影票根夹在扉页,印有《小森林·夏》及‘下周三19:00’;她穿米白长裙配浅灰针织开衫,素色外套肩头微湿,戴细框眼镜,齐肩黑发微湿;神情由焦虑悄然转为温柔笑意;环境为阴天午后,细雨淅沥,身后可见红底白字咖啡店招牌,头顶是深褐色木质屋檐与静止铜风铃",
"negative_prompt": "blurry face, deformed hands, extra limbs, text errors, distorted book cover, mismatched colors, missing water stain, altered library card position or text, incorrect ticket date, glossy plastic bench, neon lighting, cartoon style, photorealistic exaggeration, motion blur, lens flare, watermark, logo, signature",
"duration": 5.2,
"model": "runway_gen2",
"style": "cinematic realism, Fujifilm Superia 400 film grain, natural color grading, soft directional overcast light, shallow focus storytelling",
"requires_special_attention": false,
"audio_prompt": {
"audio_id": "audio_003",
"prompt": "gentle rain ambience layered with subtle water drip and distant muffled footsteps; warm, organic, non-synthetic texture; no dialogue, only atmospheric diegetic sound matching scene continuity — light rainfall intensity, faint resonance from metal bench and wooden eaves, quiet breath presence, emotionally neutral yet tender tonal quality\n\n以自然雨声为基底,叠加细微水滴声与远处模糊脚步声;温暖通透、无电子合成感;无台词,仅环境拟音,匹配场景连续性——雨势轻柔,金属长椅与木檐带来轻微环境共振,可感知安静呼吸感,情绪中性而含温柔质地",
"negative_prompt": "speech, music, voiceover, synthetic tones, reverb-heavy echo, sudden loud sounds, wind howl, thunder, birdsong, traffic noise, distortion, clipping",
"model_type": "AudioLDM_3",
"voice_type": "narration",
"audio_style": "realistic",
"voice_character": "",
"voice_description": "calm, intimate, analog-tape warmth, low dynamic range, natural room tone integration",
"speed": 1.0,
"pitch_shift": 0.0,
"emotion": "tender",
"stability": 0.7,
"duration_seconds": 5.2,
"sound_attributes": {
"intensity": 0.8,
"reverb": 0.3
},
"format": "wav",
"sample_rate": 24000,
"seed": 42719,
"scene_context": "urban café corner bench in light rain, overcast afternoon, poetic quiet intimacy between two characters just before dialogue begins",
"previous_audio_id": "audio_002"
}
},
{
"fragment_id": "frag_004",
"prompt": "Cinematic wide shot: urban street corner at late afternoon, light drizzle falling, overcast sky with soft diffused daylight. A matte green metal bench sits roadside, gray cushion slightly damp, subtle scratches on armrests. Red-and-white coffee shop sign visible in background, deep brown wooden eaves with silent copper wind chime overhead. Warm dry light patch on ground beneath eaves. No people in frame. Natural rain ambiance with gentle water drip and distant muffled footsteps. Shot on ARRI Alexa, shallow depth of field, Kodak Portra color grade, realistic texture detail.\n\n全景镜头:城市街角,下午时分,细雨淅沥,灰蒙蒙天光微亮。哑光绿色金属长椅静置路边,浅灰色坐垫微湿,扶手处有细微划痕。背景可见红底白字咖啡店招牌,深褐色木质屋檐悬垂铜制风铃(未响),檐下地面呈暖色干燥光斑。自然雨声基底,叠加轻柔水滴与远处模糊脚步声。",
"negative_prompt": "people, faces, text overlays, logos, motion blur, lens flare, cartoon, anime, illustration, low resolution, grainy, deformed, extra limbs, disfigured, blurry, jpeg artifacts, out of frame, cropped, watermark, signature",
"duration": 3.5,
"model": "runway_gen2",
"style": "cinematic realism, Kodak Portra film aesthetic, shallow depth of field, natural lighting, atmospheric moisture detail",
"requires_special_attention": false,
"audio_prompt": {
"audio_id": "audio_004",
"prompt": "Natural ambient audio: gentle steady rainfall on pavement and metal surfaces, occasional soft water drip from eaves, faint distant footstep echoes on wet concrete — all warm, organic, non-electronic, no reverb-heavy processing, balanced spatial layering, high-fidelity field recording quality.\n\n自然环境音:持续轻柔的雨落于人行道与金属表面声,偶有屋檐水滴声,远处模糊湿润路面脚步声——整体温暖通透、无电子合成感,空间层次清晰,高保真实地录音质感。",
"negative_prompt": "speech, music, synthetic tones, distortion, clipping, silence, abrupt cuts, artificial reverb, pitch shifting, robotic voice, echo overload",
"model_type": "AudioLDM_3",
"voice_type": "narration",
"audio_style": "realistic",
"voice_character": null,
"voice_description": "warm, analog-tape-like tonal balance, ultra-clean transient response, immersive binaural-ready spatialization",
"speed": 1.0,
"pitch_shift": 0.0,
"emotion": "neutral",
"stability": 0.7,
"duration_seconds": 3.5,
"sound_attributes": {
"intensity": 0.8,
"reverb": 0.3
},
"format": "wav",
"sample_rate": 24000,
"seed": 4279,
"scene_context": "Urban street corner at dusk, light rain, empty green metal bench under coffee shop eaves, ambient stillness punctuated by natural hydro-acoustic textures.",
"previous_audio_id": "audio_003"
}
},
{
"fragment_id": "frag_005",
"prompt": "medium shot, cinematic lighting, rain-soaked urban street corner,陈阳 crouching beside a puddle next to a matte green metal bench, wearing a vivid yellow reflective raincoat (polyester, silver reflective strips glowing faintly), shoulders darkened by rainwater, short messy hair with water droplets, deep blue hooded sweatshirt visible under coat, black sport pants and muddy canvas shoes, expression earnest and slightly flustered, shallow depth of field, soft bokeh background showing red-and-white coffee shop sign and brown wooden eaves with silent copper wind chime, natural overcast daylight, gentle rain falling, water ripples in puddle, cinematic realism, film grain texture, 35mm lens\n\n中景:陈阳蹲在长椅旁水洼边,明黄色反光雨衣肩部湿透发暗,银色反光条微光;短发凌乱带水珠,内搭深蓝连帽卫衣,黑色运动裤与帆布鞋沾泥水,表情憨厚又急切;背景为哑光绿色金属长椅、红底白字咖啡店招牌、深褐色木质屋檐与静止铜风铃;自然阴天光线,细雨淅沥,水洼泛涟漪",
"negative_prompt": "blurry, deformed hands, extra limbs, text, logo, watermark, cartoon, 3d render, cgi, anime, low resolution, oversaturated, harsh shadows, studio lighting, dry ground, sunny sky, umbrella, other people, smiling broadly, static pose, no motion blur on raindrops",
"duration": 2.8,
"model": "runway_gen2",
"style": "cinematic realism, Fujifilm ETERNA film stock, shallow depth of field, naturalistic color grading, subtle motion in rain and fabric",
"requires_special_attention": false,
"audio_prompt": {
"audio_id": "audio_005",
"prompt": "gentle rainfall layered with occasional water drip from eaves and soft muddy footstep shuffle, warm ambient tone, organic acoustic texture, no electronic artifacts, subtle spatial reverb suggesting open street corner under partial shelter, consistent light rain intensity matching 'light drizzle' phase, no dialogue, no music, pure environmental realism\n\n以自然雨声为基底,叠加细微屋檐滴水声与轻缓泥泞脚步声,整体温暖通透,无电子合成感,空间混响体现街角半遮蔽环境,雨势处于‘淅沥’阶段,无人声、无音乐,纯环境音效",
"negative_prompt": "dialogue, speech, music, synth tones, echo-heavy, distorted, metallic, wind howling, thunder, birdsong, traffic noise, footsteps too loud or rhythmic",
"model_type": "AudioLDM_3",
"voice_type": "narration",
"audio_style": "realistic",
"voice_character": "",
"voice_description": "ambient field recording style, high-fidelity binaural-like spatial clarity, natural decay, analog warmth",
"speed": 1.0,
"pitch_shift": 0.0,
"emotion": "neutral",
"stability": 0.7,
"duration_seconds": 2.8,
"sound_attributes": {
"intensity": 0.8,
"reverb": 0.3
},
"format": "wav",
"sample_rate": 24000,
"seed": 5042871,
"scene_context": "urban street corner outside a coffee shop during light rain, green metal bench, red-and-white sign, brown wooden eaves, copper wind chime, puddle glistening at feet",
"previous_audio_id": "audio_004"
}
},
{
"fragment_id": "frag_006",
"prompt": "medium shot, slight low angle, Chen Yang half-crouching on a matte green metal bench outside a café, holding a damp hardcover copy of 'Gitanjali' with cobalt-blue cloth cover, gold-embossed English title, water-stained surface, yellowed pages slightly curled; he wears a vivid yellow reflective raincoat (polyester, silver reflective stripes), dark blue hooded sweatshirt underneath, black athletic pants, muddy canvas sneakers; rain falling gently, soft natural lighting, warm ambient tone, shallow depth of field, cinematic realism, Fujifilm ETERNA film stock\n\n中景:中景镜头(略带仰角):陈阳半蹲未起,手中托着湿漉漉的《飞鸟集》,钴蓝色封皮,烫金英文标题,封面有水渍晕染痕迹,纸页微泛黄且微卷;他身穿明黄色反光雨衣(聚酯纤维材质,带银色反光条),内搭深蓝连帽卫衣,黑色运动裤与沾泥帆布鞋;背景为哑光绿色金属长椅、红底白字咖啡店招牌及深褐色木质屋檐,细雨轻落,自然暖调光线,电影感写实风格",
"negative_prompt": "blurry, deformed hands, extra limbs, text errors, distorted face, cartoon, 3d render, cgi, anime, low resolution, jpeg artifacts, overexposed, underexposed, flat lighting, no rain, dry book, wrong book color, missing water stains, incorrect borrow card position, no cobalt blue, no yellow raincoat, no cinematic grain",
"duration": 3.0,
"model": "runway_gen2",
"style": "cinematic realism, Fujifilm ETERNA color science, shallow depth of field, natural rain ambiance, warm-cool contrast balance",
"requires_special_attention": false,
"audio_prompt": {
"audio_id": "audio_006",
"prompt": "gentle rainfall, subtle water drip from eaves, faint wet footsteps approaching, light breath and fabric rustle, warm and intimate acoustic space, no music, no synthetic tones, ultra-realistic ASMR-grade environmental fidelity\n\n轻柔雨声,屋檐细微滴水声,轻微湿脚步声由远及近,衣物摩擦与呼吸声,温暖亲密的声场空间,无配乐,无电子合成音,超写实ASMR级环境保真度",
"negative_prompt": "speech, dialogue, voiceover, music, bass boost, distortion, reverb-heavy, artificial echo, robotic tone, silence, loud thunder, wind howl",
"model_type": "AudioLDM_3",
"voice_type": "character_dialogue",
"audio_style": "realistic",
"voice_character": "Chen Yang",
"voice_description": "young male voice, gentle timbre, slightly breathy, warm midrange, mild nasal resonance, sincere and tender delivery",
"speed": 1.0,
"pitch_shift": 0.0,
"emotion": "tender",
"stability": 0.7,
"duration_seconds": 3.0,
"sound_attributes": {
"intensity": 0.8,
"reverb": 0.3
},
"format": "wav",
"sample_rate": 24000,
"seed": 672941,
"scene_context": "a rainy afternoon at a street-corner café, green metal bench, cobalt-blue poetry book in hand, yellow raincoat glistening under soft light, intimate moment before dialogue begins",
"previous_audio_id": "audio_005"
}
},
{
"fragment_id": "frag_007",
"prompt": "Cinematic wide shot: urban street corner in soft afternoon rain, glistening bluish-gray cobblestone pavement, matte green metal bench centered in frame with light-gray cushion and subtle scratches on armrests, red-and-white coffee shop sign visible in background, warm dry light patch under deep brown wooden eaves with silent copper wind chime, gentle rain falling, water droplets glistening on surfaces, atmospheric depth, shallow depth of field, Fujifilm ETERNA film stock, natural lighting, ultra-detailed texture, 8K resolution\n\n全景镜头:城市街角,细雨淅沥,青灰石板路泛着微光;绿色金属长椅静置画面中央,哑光绿金属框架,浅灰坐垫,扶手处有细微划痕;背景可见红底白字咖啡店招牌;深褐色木质屋檐下地面呈暖色光斑,檐角悬垂未响铜制风铃;整体氛围湿润宁静,光影通透,胶片质感",
"negative_prompt": "people, faces, text, logos, cartoon, anime, 3D render, CGI, blurry background, overexposed, low resolution, grainy, distorted perspective, motion blur, lens flare, watermark, signature",
"duration": 4.0,
"model": "runway_gen2",
"style": "Cinematic realism, Fujifilm ETERNA color science, shallow depth of field, natural rain ambiance, tactile material detail",
"requires_special_attention": false,
"audio_prompt": {
"audio_id": "audio_007",
"prompt": "Natural rainfall with gentle intensity, intermittent water drip from eaves, distant muffled footstep on wet cobblestone, warm ambient tone, no electronic artifacts, no speech, no music, high-fidelity field recording quality, spatially balanced stereo, subtle reverb matching urban brick-and-metal environment\n\n自然雨声为基底,轻柔持续,间歇性屋檐水滴声,远处湿滑石板路上模糊脚步声,整体温暖通透,无电子合成感,无语音、无音乐,高保真环境录音品质,符合街角砖石与金属材质的空间混响",
"negative_prompt": "speech, music, synthetic tones, distortion, clipping, wind noise, birdsong, traffic, voices, dialogue, sudden loud sounds",
"model_type": "AudioLDM_3",
"voice_type": "character_dialogue",
"audio_style": "realistic",
"voice_character": null,
"voice_description": "calm, organic, immersive, analog-tape warmth, spatially accurate",
"speed": 1.0,
"pitch_shift": 0.0,
"emotion": "neutral",
"stability": 0.7,
"duration_seconds": 4.0,
"sound_attributes": {
"intensity": 0.8,
"reverb": 0.3
},
"format": "wav",
"sample_rate": 24000,
"seed": 7042,
"scene_context": "Urban street corner at afternoon, light rain, green metal bench under coffee shop eaves, cobblestone ground, wooden roof with copper wind chime",
"previous_audio_id": "audio_006"
}
},
{
"fragment_id": "frag_008",
"prompt": "Medium shot: Lin Xiaoyu sits on the left end of a matte green metal bench, wearing a creamy white long dress and a light gray knitted cardigan, cotton-linen blend fabric, shoulder of cardigan slightly damp; soft afternoon rain falling, shallow depth of field, warm ambient light from nearby coffee shop sign (red background with white Chinese characters), deep brown wooden eaves above, copper wind chime hanging silently, subtle water droplets on her black shoulder-length hair, thin-framed glasses, anxious expression softening into gentle smile; 'Lin Xiaoyu: Mingming shuo hao jin tian hai shu de... Zhe yu xia de, ta hui bu hui bu lai le?' — voiceover in natural Mandarin tone, calm yet tender, slight breathiness, ambient rain and distant water drip layered beneath\n\n中景:林小雨坐在绿色金属长椅左端,米白长裙配浅灰针织开衫,素色棉麻混纺,肩头微湿;午后细雨,浅景深,咖啡店红底白字招牌暖光映照,深褐色木檐悬垂静默铜风铃;齐肩黑发微湿、戴细框眼镜,神情由焦虑渐转为温柔笑意;台词:‘明明说好今天还书的……这雨下得,他会不会不来了?’",
"negative_prompt": "blurry face, deformed hands, extra limbs, text overlay, watermark, cartoon, anime, 3D render, photorealistic exaggeration, harsh lighting, overexposed skin, synthetic voice, robotic speech, background music, laughter, crowd noise, thunder, wind howling",
"duration": 4.5,
"model": "runway_gen2",
"style": "cinematic realism, Fujifilm Superia 400 film grain, soft focus edges, natural color grading, shallow depth of field, emotionally grounded framing",
"requires_special_attention": false,
"audio_prompt": {
"audio_id": "audio_008",
"prompt": "Natural Mandarin female voice, gentle and introspective, slight breathy texture, calm pacing with quiet emotional weight; ambient rain (light drizzle), occasional water drip from eaves, faint distant footstep on wet pavement; no music, no reverb-heavy processing — warm, intimate, lifelike acoustic space matching outdoor café corner under light rain\n\n自然中文女声,温柔内省,略带气息感,语速舒缓而富有情绪分量;环境音:轻柔雨声、屋檐水滴声、远处湿地面脚步声;无人声配乐,无强烈混响,温暖亲密,真实还原街角咖啡店外细雨氛围",
"negative_prompt": "robotic voice, pitch instability, clipping, echo chamber effect, background music, laughter, crowd noise, thunder, wind gusts, synthesized tones",
"model_type": "XTTSv2",
"voice_type": "character_dialogue",
"audio_style": "realistic",
"voice_character": "Lin Xiaoyu",
"voice_description": "young adult female, clear diction, soft timbre, slight breathiness, warm mid-range, natural Mandarin accent with gentle intonation",
"speed": 0.95,
"pitch_shift": 0.0,
"emotion": "tender",
"stability": 0.7,
"duration_seconds": 4.5,
"sound_attributes": {
"intensity": 0.8,
"reverb": 0.3
},
"format": "wav",
"sample_rate": 24000,
"seed": 8742,
"scene_context": "Outdoor corner bench at a small café under light rain, late afternoon, urban residential street, gentle atmosphere with emotional stillness and quiet anticipation",
"previous_audio_id": "audio_007"
}
},
{
"fragment_id": "frag_009",
"prompt": "medium shot, slight rack focus: Chen Yang straightens up, wearing a vivid matte-yellow reflective raincoat with silver reflective strips gleaming with cool-toned sheen under overcast daylight; his right hand holds an empty semi-transparent waterproof bag; left hand lifts from his knee, fingertips gently touching the cobalt-blue hardcover of 'Stray Birds', pages slightly curled, faint water stain outline visible on cover; his gaze hasn't fully turned toward Lin Xiaoyu yet; chest rises subtly; he speaks in a warm, slightly breathless tone: 'I ran down two streets to find a waterproof bag...'; soft natural rain ambiance, subtle water drip and distant footsteps; shallow depth of field, cinematic color grading, realistic texture detail, Fujifilm ETERNA film stock aesthetic\n\n中景镜头(轻微跟焦):陈阳刚直起身,明黄色反光雨衣饱和度鲜明,银色反光条在阴天微光中泛冷调光泽;他右手拎着一只半透明防水袋(已空),左手正从膝上抬起,指尖轻触《飞鸟集》钴蓝封皮边缘,书页微翘,水渍轮廓初显;他目光尚未完全转向林小雨,胸廓微起伏,语气温和而略带喘息:'我跑了两条街找防水袋……'",
"negative_prompt": "blurry face, deformed hands, extra limbs, text errors, watermark, logo, cartoon, 3d render, anime, low resolution, oversaturated background, unnatural lighting, floating objects, duplicate characters, no book, missing raincoat, incorrect book color, wrong jacket color, no water stain, no reflective strips",
"duration": 3.2,
"model": "runway_gen2",
"style": "cinematic realism, Fujifilm ETERNA film look, shallow depth of field, soft overcast lighting, emotionally grounded performance",
"requires_special_attention": false,
"audio_prompt": {
"audio_id": "audio_009",
"prompt": "A young male voice, warm timbre, gentle breath support, mild exertion resonance (as after light running), clear diction, tender sincerity, slight vocal fatigue — delivering the line 'I ran down two streets to find a waterproof bag...' with natural cadence and emotional warmth; layered beneath: ambient rainfall (light-to-moderate intensity), occasional water drip from awning, faint distant footstep echoes on wet pavement; no reverb overload, no electronic artifacts, warm analog tonality\n\n青年男性嗓音,温暖音色,气息柔和带轻微运动后喘息感,吐字清晰,真挚温柔,略带声带微疲感——自然说出台词:'我跑了两条街找防水袋……',节奏舒缓、情绪温厚;背景叠加自然雨声(中低强度)、屋檐滴水声、远处湿滑路面脚步回响;无过度混响,无电子合成感,模拟暖调类比录音质感",
"negative_prompt": "robotic voice, exaggerated emotion, pitch instability, background music, echo distortion, clipping, silence gaps, synthetic tones, overlapping speech, non-human voice, whispering, shouting",
"model_type": "XTTSv2",
"voice_type": "character_dialogue",
"audio_style": "realistic",
"voice_character": "Chen Yang",
"voice_description": "male, 22 years old, warm baritone, slightly breathy, earnest and humble tone, subtle nasal resonance, natural Mandarin accent with gentle intonation",
"speed": 0.95,
"pitch_shift": 0.0,
"emotion": "tender",
"stability": 0.7,
"duration_seconds": 3.2,
"sound_attributes": {
"intensity": 0.8,
"reverb": 0.3
},
"format": "wav",
"sample_rate": 24000,
"seed": 892473,
"scene_context": "Outdoor corner café setting, overcast afternoon, light rain tapering off, metal bench, wooden eaves, copper wind chime silent, ambient moisture in air",
"previous_audio_id": "audio_008"
}
},
{
"fragment_id": "frag_010",
"prompt": "medium shot, cinematic shallow depth of field with gentle rack focus, Chen Yang (22, short messy hair with water droplets, wearing a vivid yellow reflective raincoat with silver reflective strips, polyester fabric, dark blue hooded sweatshirt underneath, black sport pants, muddy canvas shoes) steadily holds 'Stray Birds' in his left hand — hardcover book with cobalt-blue cover, visible water stain spreading softly from top-left corner, gold-embossed English title, slightly yellowed pages, white library card neatly affixed to bottom-right corner of cover, handwritten 'Chen Yang → Lin Xiaoyu' in black ink with rightward arrow and faint ink bleed; he gazes intently at Lin Xiaoyu (20, shoulder-length black hair slightly damp, thin-framed glasses, wearing off-white linen-cotton midi dress and light gray knitted cardigan, subtle moisture on shoulders), his expression sincere and tender, breath calm, voice soft yet certain: 'The poem says \"rain is the cloud's tears\", but I don't want you to cry.', warm ambient light, shallow green metal bench in background (matte green frame, light gray cushion, faint scratches on armrest), red-and-white coffee shop sign visible behind, soft rain falling, naturalistic lighting, film grain texture, realistic detail, 8K\n\n中景镜头(持续轻微跟焦):陈阳左手已稳稳托起《飞鸟集》,钴蓝封皮水渍清晰可见,借阅卡‘陈阳→林小雨’完好无损贴于封面右下角;他双目凝视林小雨,眼神真挚,呼吸稍缓,语气温柔而笃定:'诗里说\"雨是云的眼泪\",可我不想让你哭。'",
"negative_prompt": "blurry face, deformed hands, extra limbs, text errors, distorted book cover, mismatched clothing colors, missing water stain, misplaced library card, no rain, cartoonish style, low resolution, glare, overexposure, synthetic textures, floating objects, duplicate characters, watermark, logo",
"duration": 2.8,
"model": "runway_gen2",
"style": "cinematic realism, Fujifilm ETERNA film stock aesthetic, soft natural lighting, emotionally grounded, intimate human scale",
"requires_special_attention": false,
"audio_prompt": {
"audio_id": "audio_010",
"prompt": "A tender male voice, warm timbre, gentle articulation, slight breath control, emotionally grounded delivery of: 'The poem says \"rain is the cloud's tears\", but I don't want you to cry.' — layered under gentle ambient rain, distant water drip, subtle footstep echo on wet pavement, no reverb overload, natural vocal warmth, no electronic artifacts, studio-quality clarity with environmental authenticity\n\n温柔的男声,音色温暖,吐字轻柔而有控制,略带气息感,情感真挚地念出:'诗里说\"雨是云的眼泪\",可我不想让你哭。'——背景叠加自然雨声、远处水滴声、湿润路面细微脚步回响,无过度混响,人声温暖自然,无电子失真,录音室级清晰度与环境真实感并存",
"negative_prompt": "robotic voice, exaggerated emotion, pitch instability, background music, laughter, crowd noise, distortion, clipping, silence gaps, unnatural pauses, AI-sounding artifacts",
"model_type": "XTTSv2",
"voice_type": "character_dialogue",
"audio_style": "realistic",
"voice_character": "Chen Yang",
"voice_description": "young adult male, warm baritone, slightly husky from light breath control, sincere and unpolished, gentle cadence, native Mandarin speaker with neutral Beijing-influenced accent",
"speed": 0.95,
"pitch_shift": 0.0,
"emotion": "tender",
"stability": 0.7,
"duration_seconds": 2.8,
"sound_attributes": {
"intensity": 0.8,
"reverb": 0.3
},
"format": "wav",
"sample_rate": 24000,
"seed": 42719,
"scene_context": "Outdoor corner café setting, light rain tapering, green metal bench, warm ambient light under wooden eaves, intimate two-character moment",
"previous_audio_id": "audio_009"
}
},
{
"fragment_id": "frag_011",
"prompt": "Cinematic wide shot: urban street corner in soft afternoon rain, matte green metal bench with light gray cushion sits beside glistening wet asphalt, facing a red-background sign with crisp white Chinese characters 'COFFEE & POETRY', shallow depth of field, realistic lighting with gentle overcast diffusion, raindrops glisten on bench surface and puddles form near curb, subtle water drip from deep brown wooden eaves above, copper wind chime hangs silently, warm ambient tone, film grain texture, 8K resolution, shot on ARRI Alexa Mini LF\n\n全景:城市街角,绿色金属长椅静置在湿漉漉的柏油路旁,正对红底白字‘COFFEE & POETRY’咖啡店招牌;深褐色木质屋檐悬垂未响铜风铃,地面有暖色光斑;细雨淅沥,水珠在长椅扶手与沥青路面微微反光",
"negative_prompt": "people, faces, text other than 'COFFEE & POETRY', cartoon, anime, illustration, blurry, low-res, deformed bench, dry pavement, sunny sky, lens flare, logo, watermark, modern signage, neon lights",
"duration": 4.0,
"model": "runway_gen2",
"style": "cinematic realism, atmospheric rain photography, Wong Kar-wai color grading (teal-amber balance), shallow focus, naturalistic lighting",
"requires_special_attention": false,
"audio_prompt": {
"audio_id": "audio_011",
"prompt": "Gentle rainfall layered with subtle water dripping from wooden eaves and distant muffled footsteps on wet asphalt; warm ambient tone, no electronic artifacts, organic texture, light reverb simulating open street corner under partial shelter, consistent intensity at 0.8, natural decay between drips, no dialogue or voice\n\n自然雨声为基底,叠加屋檐水滴声与远处湿滑柏油路上模糊的脚步声;整体温暖通透,无电子合成感,空间感模拟街角半遮蔽环境,水滴间歇自然,无台词、无人声",
"negative_prompt": "speech, singing, music, thunder, wind howling, traffic noise, birds, mechanical sounds, distortion, silence",
"model_type": "AudioLDM_3",
"voice_type": "character_dialogue",
"audio_style": "realistic",
"voice_character": null,
"voice_description": "calm, organic, spatially grounded, gently immersive, analog warmth",
"speed": 1.0,
"pitch_shift": 0.0,
"emotion": "neutral",
"stability": 0.7,
"duration_seconds": 4.0,
"sound_attributes": {
"intensity": 0.8,
"reverb": 0.3
},
"format": "wav",
"sample_rate": 24000,
"seed": 4117,
"scene_context": "Urban street corner in light rain, coffee shop exterior, green metal bench, red-white sign, wooden eaves, copper wind chime, damp asphalt",
"previous_audio_id": "audio_010"
}
},
{
"fragment_id": "frag_012",
"prompt": "medium shot, cinematic lighting, rain-soaked urban street corner, green matte-metal bench with light-gray cushion, red-and-white coffee shop sign in background, deep brown wooden eaves with silent copper wind chime,陈阳 enters briskly from right frame wearing vivid yellow reflective raincoat (polyester, silver reflective strips on shoulders and back), dark blue hooded sweatshirt underneath, black sport pants, muddy canvas sneakers, water droplets on short messy hair, expression earnest and slightly flustered, gentle overcast daylight, natural rain ambiance, shallow depth of field, realistic texture detail, Fujifilm Superia film grain\n\n中景,电影感布光,雨润的城市街角,哑光绿色金属长椅配浅灰坐垫,背景是红底白字咖啡店招牌,深褐色木制屋檐悬垂静默铜风铃,陈阳从画面右侧快步走入,身穿明黄色反光雨衣(聚酯纤维材质,肩背处银色反光条),内搭深蓝连帽卫衣,黑色运动裤与沾泥帆布鞋,短发凌乱带水珠,神情憨厚又急切,柔和阴天自然光,浅景深,真实质感细节,富士Superia胶片颗粒感",
"negative_prompt": "cartoon, anime, 3D render, text, logo, watermark, deformed hands, extra limbs, blurry face, low resolution, oversaturated, artificial lighting, studio backdrop, no rain, dry pavement, sunny weather",
"duration": 3.2,
"model": "runway_gen2",
"style": "cinematic realism, Fujifilm Superia aesthetic, shallow depth of field, naturalistic color grading, subtle motion blur on entry",
"requires_special_attention": false,
"audio_prompt": {
"audio_id": "audio_012",
"prompt": "Natural rainfall with gentle drip and soft footsteps on wet pavement, warm ambient tone, no synthetic elements, subtle reverb suggesting open street corner under eaves, light footstep rhythm matching brisk walk, consistent rain intensity at medium-loud level, emotionally neutral yet tender undertone\n\n自然雨声为基底,叠加轻柔滴水声与湿滑路面脚步声,整体温暖通透,无电子合成感,轻微混响暗示街角屋檐下空间,脚步节奏匹配快步行走,雨势维持中等强度,情绪中性而隐含温柔",
"negative_prompt": "scream, music, voiceover, dialogue, laughter, thunder, wind howl, mechanical noise, digital distortion, silence",
"model_type": "AudioLDM_3",
"voice_type": "character_dialogue",
"audio_style": "realistic",
"voice_character": "Chen Yang",
"voice_description": "young male voice, warm timbre, slightly breathy, gentle urgency, unpolished sincerity, 22 years old, Mandarin native speaker",
"speed": 1.0,
"pitch_shift": 0.0,
"emotion": "tender",
"stability": 0.7,
"duration_seconds": 3.2,
"sound_attributes": {
"intensity": 0.8,
"reverb": 0.3
},
"format": "wav",
"sample_rate": 24000,
"seed": 42791,
"scene_context": "Urban street corner outside a coffee shop, light rain falling, green metal bench, red-and-white sign, wooden eaves — Chen Yang rushes in to meet Lin Xiaoyu",
"previous_audio_id": "audio_011"
}
},
{
"fragment_id": "frag_013",
"prompt": "medium shot transitioning to close-up (slow dolly-in): Lin Xiaoyu and Chen Yang running side by side under golden-hour sunlight and lingering rain, shallow depth of field blurs background into soft motion streaks; Lin Xiaoyu wears a米white cotton-linen midi dress and light gray knitted cardigan, shoulder of outerwear slightly damp, shoulder-length black hair damp with raindrops, thin-framed glasses, expression shifting from anxious to tender smile; Chen Yang wears a vivid yellow reflective raincoat (polyester, silver reflective strips), dark blue hooded sweatshirt underneath, black sport pants and muddy canvas sneakers, short messy hair with water droplets, earnest and urgent expression; warm ambient light, gentle rain mist, subtle golden backlighting, cinematic natural lighting, film grain texture, 35mm anamorphic lens aesthetic\n\n中景转特写镜头(缓慢推近):林小雨与陈阳并肩奔跑于斜阳与残雨之间,背景虚化为流动光斑;林小雨穿米白长裙配浅灰针织开衫,素色外套肩头微湿,齐肩黑发微湿、戴细框眼镜,神情由焦虑转为温柔笑意;陈阳穿明黄色反光雨衣(聚酯纤维材质,带银色反光条),内搭深蓝连帽卫衣,黑色运动裤与沾泥水的帆布鞋,短发凌乱带水珠,表情憨厚又急切;暖调环境光,细雨薄雾,斜阳金边轮廓光,电影感自然光效,胶片颗粒质感,35mm变形宽银幕镜头风格",
"negative_prompt": "deformed, distorted, disfigured, poorly drawn face, extra limbs, missing limbs, floating limbs, mutated hands, disconnected limbs, malformed hands, blurry, bad anatomy, bad proportions, extra legs, extra arms, extra head, cloned face, worst quality, low quality, text, signature, watermark, username, artist name, jpeg artifacts, cartoon, 3d, cgi, render, illustration, drawing, painting, anime, overexposed, underexposed, flat lighting, harsh shadows, synthetic sound, electronic tone, voiceover, narration, dialogue, speech, talking",
"duration": 4.8,
"model": "runway_gen2",
"style": "cinematic realism, Fujifilm ETERNA film stock, shallow focus, golden hour atmosphere, gentle rain ambiance, emotionally resonant framing",
"requires_special_attention": false,
"audio_prompt": {
"audio_id": "audio_013",
"prompt": "Natural rain ambience (light drizzle fading into sparse drip), wet footsteps on pavement (two distinct rhythmic patterns: light feminine steps and heavier masculine strides), distant copper wind chime faint resonance (no ring), warm ambient air tone, no speech, no music, no synthetic elements — pure organic acoustic layering, high-fidelity field recording quality\n\n自然雨声基底(淅沥渐疏为滴答),湿润路面脚步声(两组节奏分明:轻盈女性步频与沉实男性步频),远处铜风铃极微弱泛音(未实际发声),温暖空气底噪,无人声、无音乐、无电子合成音——纯有机声场分层,高保真实地录音品质",
"negative_prompt": "speech, dialogue, singing, music, melody, beat, synth, electronic, distortion, clipping, reverb-heavy, artificial, robotic, voice, whisper, breath noise, cough, laugh",
"model_type": "AudioLDM_3",
"voice_type": "character_dialogue",
"audio_style": "realistic",
"voice_character": "",
"voice_description": "organic, warm, spatially accurate, binaural-ready, ultra-clean field recording style",
"speed": 1.0,
"pitch_shift": 0.0,
"emotion": "tender",
"stability": 0.7,
"duration_seconds": 4.8,
"sound_attributes": {
"intensity": 0.8,
"reverb": 0.3
},
"format": "wav",
"sample_rate": 24000,
"seed": 892473,
"scene_context": "Urban street corner outside a café at golden hour, light rain ending, two characters running together under soft backlight and lingering mist, emotional warmth and quiet intimacy",
"previous_audio_id": "audio_012"
}
}
],
"global_settings": {
"style_consistency": true,
"use_common_negative_prompt": true
},
"execution_suggestions": [
"按顺序生成片段",
"保持相同种子值以获得一致性",
"生成后检查片段衔接"
]
}
},
"message": null,
"processing_time_ms": 519072,
"created_at": "2026-03-06T17:21:05.051369",
"completed_at": "2026-03-06T17:29:44.123977"
}

Integration

Installing dependencies

# Pick the latest release and download the whl package (https://github.com/neopen/video-shot-agent/releases)
# e.g. https://github.com/neopen/video-shot-agent/releases/download/v0.1.3-beta/hengshot-0.1.3-py3-none-any.whl
# Ollama is used by default; other platforms need their own client package
pip install hengshot-0.1.3-py3-none-any.whl
# Install the package for your LLM provider
# pip install langchain-openai   # OpenAI or DeepSeek
# pip install dashscope          # Qwen

Environment configuration

  1. Copy the example file: cp .env.example .env

  2. Edit the .env file and fill in your real settings

# .env - actual configuration file
# ================= Application =================
APP__LANGUAGE=zh

# ================= API =================
# Server host (the HOST environment variable is also supported)
API__HOST=localhost
# Server port (the PORT environment variable is also supported)
API__PORT=8000

# ================= Default LLM =================
LLM__DEFAULT__BASE_URL=https://api.openai.com/v1
LLM__DEFAULT__API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
LLM__DEFAULT__MODEL_NAME=gpt-4-turbo-preview
LLM__DEFAULT__TEMPERATURE=0.7
LLM__DEFAULT__TIMEOUT=30
LLM__DEFAULT__MAX_RETRIES=3
LLM__DEFAULT__MAX_TOKENS=8192

# ================= Fallback LLM =================
LLM__FALLBACK__BASE_URL=http://localhost:11434
LLM__FALLBACK__MODEL_NAME=qwen3:4b
LLM__FALLBACK__TEMPERATURE=0.1
LLM__FALLBACK__TIMEOUT=300
LLM__FALLBACK__MAX_TOKENS=8192
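
The double underscores in these variable names read like a nesting delimiter (section, profile, field). How the package actually loads them is not shown in this article, so the following is only a sketch of that convention; `nest_env` is a hypothetical helper built on the standard library, not part of hengshot.

```python
from typing import Any, Dict, Mapping


def nest_env(env: Mapping[str, str], delimiter: str = "__") -> Dict[str, Any]:
    """Group flat KEY__SUBKEY=value pairs into nested dicts (keys lower-cased)."""
    nested: Dict[str, Any] = {}
    for key, value in env.items():
        parts = [part.lower() for part in key.split(delimiter)]
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


sample = {
    "API__PORT": "8000",
    "LLM__DEFAULT__MODEL_NAME": "gpt-4-turbo-preview",
    "LLM__FALLBACK__MODEL_NAME": "qwen3:4b",
}
config = nest_env(sample)
print(config["llm"]["fallback"]["model_name"])
```

Values stay as strings here; a real loader would also convert numeric fields such as the port and timeout.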

1. Use as a Python library

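import asyncio

# The import path is an assumption; check the GitHub project for the actual module layout.
from hengshot import ShotConfig, generate_storyboard


async def basic_usage():
    """Basic usage example."""
    script = """
    场景:现代办公室
    时间:下午3点
    人物:小李(程序员)
    动作:小李正在写代码,突然接到电话,表情惊讶
    """

    # Custom LLM configuration
    custom_config = ShotConfig(
        model_name="qwen3:4b",              # a model served by the local Ollama instance
        base_url="http://localhost:11434",  # assumes Ollama is running locally
        temperature=0.2,
    )

    # Simple call
    result = await generate_storyboard(
        script_text=script,
        config=custom_config,
    )
    print(f"Done, task ID: {result.get('task_id')}")
    print(f"Success: {result.get('success', False)}")
    print(f"Storyboard fragments: {result.get('data', {})}")

    return result


if __name__ == "__main__":
    asyncio.run(basic_usage())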

2. Integrate into a web application (API)

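from fastapi import FastAPI, HTTPException

# The import path is an assumption; check the GitHub project for the actual module layout.
from hengshot import ShotConfig, generate_storyboard

app = FastAPI()


@app.post("/api/generate-storyboard")
async def generate_storyboard_endpoint(script_text: str):
    """Web API endpoint that generates a video storyboard."""
    # A plain `str` parameter is read from the query string; use a request
    # model instead if scripts are long.

    # Custom LLM configuration
    custom_config = ShotConfig(
        model_name="qwen3:4b",              # a model served by the local Ollama instance
        base_url="http://localhost:11434",  # assumes Ollama is running locally
        temperature=0.2,
    )

    try:
        return await generate_storyboard(
            script_text=script_text,
            config=custom_config,
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Generation failed: {e}")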

3. Integrate as a LangGraph node

from typing import Any, Dict, Optional

from langgraph.graph import END, StateGraph
from pydantic import BaseModel, Field

# The import path is an assumption; check the GitHub project for the actual module layout.
from hengshot import generate_storyboard


# State schema
class StoryboardState(BaseModel):
    script_text: str = Field(description="Input script text")
    task_id: Optional[str] = Field(default=None, description="Task ID")
    storyboard_result: Optional[Dict[str, Any]] = Field(default=None, description="Storyboard generation result")
    next_step: str = Field(default="", description="Hint for the next step")


# Storyboard generation node
async def storyboard_generator_node(state: StoryboardState) -> Dict[str, Any]:
    """Storyboard generation node for a LangGraph workflow."""
    try:
        result = await generate_storyboard(
            script_text=state.script_text,
            task_id=state.task_id,
        )

        return {
            "storyboard_result": result,
            "next_step": "storyboard_generated",
        }
    except Exception as e:
        return {
            "storyboard_result": {"error": str(e)},
            "next_step": "error",
        }


# Example workflow
def create_storyboard_workflow():
    workflow = StateGraph(StoryboardState)

    # Add the node
    workflow.add_node("generate_storyboard", storyboard_generator_node)

    # Set the entry point and finish after the single node
    workflow.set_entry_point("generate_storyboard")
    workflow.add_edge("generate_storyboard", END)

    return workflow.compile()


# Usage example
async def run_langgraph_example():
    app = create_storyboard_workflow()

    # Initial state
    initial_state = StoryboardState(
        script_text="一个男孩在公园里放风筝,天空很蓝...",
        task_id="storyboard_task_001",
    )

    # Run the workflow
    final_state = await app.ainvoke(initial_state)

    return final_state

Response:

{
"success": true,
"data": {
"instructions": {},
"continuity_issues": [],
"audit_report": {}
},
"errors": {},
"task_id": "HL202602101908249720",
"workflow_status": "completed"
}

4. Integrate into an A2A system

from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# The import path is an assumption; check the GitHub project for the actual module layout.
from hengshot import generate_storyboard


@dataclass
class A2ATask:
    """A2A task record."""
    task_id: str
    script_content: str
    priority: int = 1
    metadata: Optional[Dict[str, Any]] = None


class StoryboardA2AAgent:
    """A2A agent that generates storyboards."""

    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self.task_queue: List[A2ATask] = []

    async def process_task(self, task: A2ATask) -> Dict[str, Any]:
        """Process one A2A task."""
        try:
            # Call the storyboard agent
            result = await generate_storyboard(
                script_text=task.script_content,
                task_id=task.task_id,
            )

            return {
                "agent_id": self.agent_id,
                "task_id": task.task_id,
                "status": "completed",
                "result": result,
                "metadata": task.metadata or {},
            }
        except Exception as e:
            return {
                "agent_id": self.agent_id,
                "task_id": task.task_id,
                "status": "failed",
                "error": str(e),
            }


class A2AOrchestrator:
    """Orchestrator for the A2A system."""

    def __init__(self):
        self.agents: Dict[str, StoryboardA2AAgent] = {}

    def register_agent(self, agent: StoryboardA2AAgent):
        self.agents[agent.agent_id] = agent

    async def dispatch_task(self, task: A2ATask, agent_id: Optional[str] = None):
        """Dispatch a task to the named agent, or pick one if none is named."""
        if agent_id and agent_id in self.agents:
            agent = self.agents[agent_id]
        elif self.agents:
            # Naive selection (no real load balancing): take the first registered agent
            agent = next(iter(self.agents.values()))
        else:
            raise RuntimeError("No agent registered")

        return await agent.process_task(task)


# Usage example
async def a2a_demo():
    # Create the orchestrator
    orchestrator = A2AOrchestrator()

    # Register the storyboard agent
    storyboard_agent = StoryboardA2AAgent(agent_id="storyboard_agent_001")
    orchestrator.register_agent(storyboard_agent)

    # Create a task
    task = A2ATask(
        task_id="a2a_task_001",
        script_content="早晨,一个女孩在咖啡馆读书,阳光透过窗户...",
        priority=1,
        metadata={"user_id": "user123", "project": "短片制作"},
    )

    # Dispatch and process the task
    result = await orchestrator.dispatch_task(task)

    print(f"Task result: {result}")
    return result

Technical architecture (multi-agent)

Agent framework: LangGraph (designed for multi-agent state machines)

graph TB
    %% Input and output
    Input[Raw script text] --> Parser
    FinalOutput[AI video instructions] --> End[Done]

    %% Core pipeline
    subgraph "Pipeline"
        Parser[Script parsing agent]
        Splitter[Shot splitting agent]
        Fragmenter[Video fragmenting agent]
        Converter[Instruction conversion agent]
        Auditor[Quality audit agent]
    end

    %% Data flow
    Parser --> Splitter
    Splitter --> Fragmenter
    Fragmenter --> Converter
    Converter --> Auditor
    Auditor --> FinalOutput

    %% Supporting services
    subgraph "Supporting services"
        LLM[LLM service]
        StateDB[(State store)]
    end

    %% Service calls
    Parser -.-> LLM
    Converter -.-> LLM
    Fragmenter -.-> StateDB
    Converter -.-> StateDB
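
| Agent | Responsibility | Input → Output | Notes (MVP scope) |
| --- | --- | --- | --- |
| 1. Script parsing agent | Parses scripts of any format into unified structured narrative units, extracting scenes, characters, actions, dialogue, props, and emotions | Raw text → structured sequence (scenes, characters, dialogue, etc.) | Extracts basic elements only; no complex duration estimation |
| 2. Shot splitting agent | Divides the script into the smallest shots that can stand on their own, based on emotional intensity and pacing changes, and assigns each a reasonable duration | Structured object → timestamped shot sequence (JSON) | Splits by simple rules (scene change, dialogue switch); no complex cinematography |
| 3. Video fragmenting agent | Cuts shots at 5-second granularity, preferring points where the action is at rest, and outputs fragments that meet the AI video model's requirements | Shot sequence → fragment sequence within the AI video length limit | Only shots longer than 5 s are cut, with a simple split; no smart merging of shots |
| 4. Instruction conversion agent | Generates a model-specific text-to-video prompt for each fragment, covering composition, lighting, motion, and so on | Fragment content + continuity anchors → instructions usable directly for AI video generation | Uses "template + LLM optimization" to keep complexity down |
| 5. Quality audit agent | Checks the whole sequence for continuity, duration compliance, character drift, and similar issues, and outputs fix suggestions | AI fragment sequence → audit report (pass, or a list of fix suggestions) | Checks hard rules only (duration, basic continuity); no deep aesthetic judgment |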

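End-to-end collaboration sequence diagram

sequenceDiagram
    participant User as User
    participant System as Orchestrator
    participant Parser as Script parsing
    participant Splitter as Shot splitting
    participant Fragmenter as Video fragmenting
    participant Converter as Instruction conversion
    participant Auditor as Quality audit

    User->>System: Submit script
    System->>Parser: Start parsing
    Parser-->>System: Return structured script
    System->>Splitter: Start shot splitting
    Splitter-->>System: Return shot sequence
    System->>Fragmenter: Start 5-second fragmenting
    Fragmenter-->>System: Return fragment sequence
    System->>Converter: Generate AI instructions
    Converter-->>System: Return prompt sequence
    System->>Auditor: Start quality audit
    Auditor-->>System: Return audit result

    alt Audit passed
        System->>User: Return complete AI instructions
    else Audit failed
        System->>User: Return errors and suggestions
    end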
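
The sequence above amounts to a linear pipeline with a pass/fail branch at the end. A minimal sketch of that control flow follows; the five stage functions are hypothetical stubs standing in for the agents (no LLM is involved), and `run_pipeline` is an illustrative name rather than the project's API.

```python
from typing import Any, Callable, Dict, List, Tuple

Stage = Callable[[Any], Any]


def run_pipeline(script: str, stages: List[Tuple[str, Stage]]) -> Dict[str, Any]:
    """Run the stages in order, feeding each one the previous stage's output."""
    data: Any = script
    for _name, stage in stages:
        data = stage(data)
    report = data  # the last stage is expected to return the audit report
    if report["status"] == "passed":
        return {"success": True, "data": report["instructions"]}
    return {"success": False, "errors": report["violations"]}


# Stub stages (hypothetical): each mimics the shape of one agent's output.
def parse(text: str) -> Dict[str, Any]:
    return {"elements": [{"content": text, "duration": 4.0}]}


def split(script: Dict[str, Any]) -> List[Dict[str, Any]]:
    return [{"description": e["content"], "duration": e["duration"]} for e in script["elements"]]


def fragment(shots: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    return list(shots)  # this stub never cuts a shot


def convert(fragments: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    return [{"prompt": f["description"], "duration": f["duration"]} for f in fragments]


def audit(instructions: List[Dict[str, Any]]) -> Dict[str, Any]:
    violations = [i for i in instructions if i["duration"] > 5.0]
    status = "failed" if violations else "passed"
    return {"status": status, "instructions": instructions, "violations": violations}


STAGES = [("parse", parse), ("split", split), ("fragment", fragment),
          ("convert", convert), ("audit", audit)]
result = run_pipeline("一个男孩在公园里放风筝", STAGES)
print(result["success"])
```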

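1. Script parsing agent

Core goal: convert scripts of any format into a unified structured intermediate representation, performing format conversion and semantic extraction.

Four script types are supported:

  • Natural-language scripts
  • Standard screenplays
  • AI-generated storyboard scripts
  • Structured scene scripts

graph TD
    A[Raw script text] --> B(LLM parsing)
    B --> C{Parsed successfully?}
    C -->|Yes| D[Structured JSON]
    C -->|No| E[Error handling]
    D --> F[Add metadata]
    F --> G[Output structured script]
    E --> H[Return error message]

  1. Detect the input format automatically and handle each script type accordingly

  2. Parse with the LLM, taking script complexity, type, and parsing confidence into account

  3. Extract semantic elements (characters, actions, dialogue, scenes)

  4. Convert to the unified data structure

  5. Preserve the original semantics; no timestamps are added at this stage (the per-element estimated_duration in the sample output is only a rough estimate)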

sequenceDiagram
    participant User as User
    participant API as Orchestrator
    participant Parser as Script parsing agent
    participant LLM as LLM service

    User->>API: POST request {text: "script text"}
    API->>Parser: Invoke parsing task
    Parser->>LLM: Send parsing request
    Note over Parser,LLM: Prompt the LLM to extract<br/>scenes, characters, dialogue, actions
    LLM-->>Parser: Return structured JSON
    Parser->>Parser: Add metadata
    Parser-->>API: Return parsing result
    API-->>User: Return structured script

Sample output

{
"metadata": {
"parsed_at": "2024-01-20T10:30:00Z",
"version": "mvp_1.0",
"source_type": "text"
},
"title": "深夜对话",
"characters": [
{
"name": "张三",
"description": "中年男性,神情紧张",
"key_traits": ["穿灰色夹克", "声音低沉", "性格谨慎"]
},
{
"name": "李四",
"description": "年轻女性,警觉性高",
"key_traits": ["长发", "穿黑色外套", "敏锐观察力"]
}
],
"scenes": [
{
"id": "scene_001",
"location": "客厅",
"description": "昏暗的客厅,只有一盏台灯亮着",
"time_of_day": "night",
"elements": [
{
"id": "elem_001",
"type": "ACTION",
"sequence": 1,
"estimated_duration": 3.0,
"confidence": 0.9,
"content": "环顾四周",
"character": "张三",
"target_character": null,
"description": "张三紧张地环顾客厅四周",
"intensity": 0.7,
"emotion": "fear"
},
{
"id": "elem_002",
"type": "DIALOGUE",
"sequence": 2,
"estimated_duration": 2.0,
"confidence": 0.95,
"content": "你听到了吗?",
"character": "张三",
"target_character": "李四",
"description": "张三压低声音询问李四",
"intensity": 0.6,
"emotion": "fear"
},
{
"id": "elem_003",
"type": "ACTION",
"sequence": 3,
"estimated_duration": 4.0,
"confidence": 0.85,
"content": "走向窗边",
"character": "李四",
"target_character": null,
"description": "李四小心翼翼地走向窗边查看",
"intensity": 0.8,
"emotion": "neutral"
},
{
"id": "elem_004",
"type": "SCENE",
"sequence": 4,
"estimated_duration": 2.5,
"confidence": 0.8,
"content": "窗外传来微弱声响",
"character": null,
"target_character": null,
"description": "窗外传来微弱的、类似树枝摩擦的声音",
"intensity": 0.4,
"emotion": "neutral"
}
]
},
{
"id": "scene_002",
"location": "客厅窗边",
"description": "窗帘微动,月光从缝隙透入",
"time_of_day": "night",
"elements": [
{
"id": "elem_005",
"type": "DIALOGUE",
"sequence": 5,
"estimated_duration": 3.0,
"confidence": 0.9,
"content": "外面好像有人...",
"character": "李四",
"target_character": "张三",
"description": "李四压低声音,语气紧张",
"intensity": 0.7,
"emotion": "fear"
}
]
}
],
"stats": {
"total_elements": 5,
"total_duration": 14.5,
"dialogue_count": 2,
"action_count": 2
}
}
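
The stats block can be derived mechanically from the elements. The sketch below shows one way to do that, assuming the same field names as the sample; `compute_stats` is a hypothetical helper, and it counts only elements whose type is ACTION or DIALOGUE, so the SCENE element contributes to the totals but to neither counter.

```python
from collections import Counter
from typing import Any, Dict, List


def compute_stats(scenes: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate element counts and total estimated duration across all scenes."""
    elements = [elem for scene in scenes for elem in scene["elements"]]
    by_type = Counter(elem["type"] for elem in elements)
    return {
        "total_elements": len(elements),
        "total_duration": sum(elem["estimated_duration"] for elem in elements),
        "dialogue_count": by_type["DIALOGUE"],
        "action_count": by_type["ACTION"],
    }


# Types and durations taken from the five elements in the sample above.
scenes = [
    {"id": "scene_001", "elements": [
        {"type": "ACTION", "estimated_duration": 3.0},
        {"type": "DIALOGUE", "estimated_duration": 2.0},
        {"type": "ACTION", "estimated_duration": 4.0},
        {"type": "SCENE", "estimated_duration": 2.5},
    ]},
    {"id": "scene_002", "elements": [
        {"type": "DIALOGUE", "estimated_duration": 3.0},
    ]},
]
stats = compute_stats(scenes)
print(stats)
```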

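Open issues and challenges

The LLM's ability to recognize order

  • Strength: understands semantic relations and can infer logical order
  • Weakness: cannot map elements precisely back to positions in the original text
  • Accuracy: roughly 70-85% (depending on script complexity)

Workaround: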

# parse_with_ai, estimate_position and calculate_order_quality are helpers that are not shown here.
def mvp_order_handling(raw_text: str):
    # 1. Let the LLM parse, explicitly asking it to keep the original order
    ai_result = parse_with_ai(raw_text, instruction="保持原始顺序")

    # 2. Attach a simple position marker
    for element in ai_result["elements"]:
        element["order_index"] = estimate_position(raw_text, element)

    # 3. Sort by the position marker
    sorted_elements = sorted(ai_result["elements"], key=lambda e: e["order_index"])

    # 4. Flag elements that need human review
    for element in sorted_elements:
        if element.get("order_confidence", 1) < 0.6:
            element["needs_human_review"] = True

    return {
        "elements": sorted_elements,
        "order_quality": calculate_order_quality(sorted_elements),
        "review_needed": any(e.get("needs_human_review") for e in sorted_elements),
    }
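
The position helper is not defined in the snippet above. One plausible way to fill it in is plain substring matching against the raw text, sketched below under that assumption; `attach_order` and the 0.5 fallback confidence are illustrative choices, not the project's implementation.

```python
from typing import Any, Dict, List


def attach_order(raw_text: str, elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Set order_index / order_confidence on each element and return them sorted."""
    last_index = -1
    for element in elements:
        content = element.get("content", "")
        position = raw_text.find(content) if content else -1
        if position >= 0:
            # Exact match: use the character offset as the order marker.
            element["order_index"] = position
            element["order_confidence"] = 1.0
        else:
            # Paraphrased by the LLM: keep it right after the previous element
            # and give it a confidence below the 0.6 review threshold.
            element["order_index"] = last_index + 1
            element["order_confidence"] = 0.5
        last_index = element["order_index"]
    return sorted(elements, key=lambda e: e["order_index"])


raw = "张三环顾四周。张三说:你听到了吗?李四走向窗边。"
llm_elements = [
    {"content": "走向窗边"},
    {"content": "环顾四周"},
    {"content": "窗外有声响"},  # not present verbatim in the raw text
]
ordered = attach_order(raw, llm_elements)
print([e["content"] for e in ordered])
```

Elements that cannot be matched verbatim end up flagged for review by the threshold check, which is the behavior the workaround relies on.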

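Evolution path and challenges:

graph LR
    A[MVP: simple ordering<br/>LLM inference + array index] --> B[V1.1: improved positions<br/>backend text matching]
    B --> C[V1.2: exact positions<br/>tokenization + offset calculation]
    C --> D[V2.0: smart validation<br/>cross-checking across models]
    style A fill:#e1f5fe
    style D fill:#c8e6c9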

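2. Shot splitting agent

Core goal: split the structured script into visually independent shots, taking emotional pacing and plausible durations into account

flowchart TD
    A[Structured script] --> B[Extract scenes]
    B --> C[Iterate over scene elements]
    C --> D{Element type?}
    D -->|New scene| E[Create new shot]
    D -->|Dialogue switch| F[Create new shot]
    D -->|Same action| G[Extend current shot]
    E --> H[Append to shot sequence]
    F --> H
    G --> H
    H --> I[Compute timestamps]
    I --> J[Output shot sequence]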
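
The flowchart's three branches can be sketched as a small rule-based loop. This is an illustration under simplifying assumptions, not the agent's code: "same action" is taken to mean consecutive ACTION elements by the same character, every other element opens a new shot, and `split_into_shots` is a hypothetical name.

```python
from typing import Any, Dict, List


def split_into_shots(scenes: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Open a shot per scene or element change, extend it for a continued action."""
    shots: List[Dict[str, Any]] = []
    clock = 0.0
    for scene in scenes:
        current = None  # a new scene always starts a new shot
        for elem in scene["elements"]:
            same_action = (
                current is not None
                and elem["type"] == "ACTION"
                and current["type"] == "ACTION"
                and current["main_character"] == elem.get("character")
            )
            if same_action:
                current["duration"] += elem["estimated_duration"]
                current["element_ids"].append(elem["id"])
            else:
                current = {
                    "id": f"shot_{len(shots) + 1:03d}",
                    "scene_id": scene["id"],
                    "type": elem["type"],
                    "main_character": elem.get("character"),
                    "start_time": clock,
                    "duration": elem["estimated_duration"],
                    "element_ids": [elem["id"]],
                }
                shots.append(current)
            clock += elem["estimated_duration"]
    return shots


demo = [
    {"id": "scene_001", "elements": [
        {"id": "elem_001", "type": "ACTION", "character": "张三", "estimated_duration": 3.0},
        {"id": "elem_002", "type": "ACTION", "character": "张三", "estimated_duration": 1.0},
        {"id": "elem_003", "type": "DIALOGUE", "character": "张三", "estimated_duration": 2.0},
    ]},
    {"id": "scene_002", "elements": [
        {"id": "elem_004", "type": "ACTION", "character": "李四", "estimated_duration": 4.0},
    ]},
]
shots = split_into_shots(demo)
print([(s["id"], s["start_time"], s["duration"]) for s in shots])
```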

Sample output

{
"metadata": {
"generated_at": "2024-01-20T10:35:00Z",
"version": "mvp_1.0",
"parser_type": "shot_splitter_v1"
},
"script_reference": {
"title": "深夜对话",
"total_elements": 5,
"original_duration": 14.5
},
"shots": [
{
"id": "shot_001",
"scene_id": "scene_001",
"description": "张三紧张地环顾客厅四周",
"start_time": 0.0,
"duration": 3.0,
"shot_type": "medium_shot",
"main_character": "张三",
"element_ids": ["elem_001"],
"confidence": 0.8
},
{
"id": "shot_002",
"scene_id": "scene_001",
"description": "张三压低声音询问李四",
"start_time": 3.0,
"duration": 2.0,
"shot_type": "close_up",
"main_character": "张三",
"element_ids": ["elem_002"],
"confidence": 0.9
},
{
"id": "shot_003",
"scene_id": "scene_001",
"description": "李四小心翼翼地走向窗边查看",
"start_time": 5.0,
"duration": 4.0,
"shot_type": "medium_shot",
"main_character": "李四",
"element_ids": ["elem_003"],
"confidence": 0.8
}
],
"stats": {
"shot_count": 3,
"total_duration": 9.0,
"avg_shot_duration": 3.0,
"close_up_count": 1,
"wide_shot_count": 0
}
}

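Open issue: estimating durations accurately (actions, dialogue, scenes, and so on). Planned evolution from the MVP:

graph LR
    A[MVP: fixed parameters<br/>simple classification<br/>accepts 30% error] --> B[V1: configurable parameters<br/>emotion detection<br/>20% error]
    B --> C[V2: machine-learning model<br/>personalized adaptation<br/>15% error]
    C --> D[V3: real-time adjustment<br/>optimized from generation feedback<br/>10% error]
    style A fill:#e1f5fe
    style D fill:#c8e6c9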
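
To make the MVP stage ("fixed parameters, simple classification") concrete, here is a rough sketch of a rule-based estimator. Every number in it (speaking rate, emotion multipliers, action base times, the slow-character factor) is a placeholder assumption chosen for illustration, not a value taken from the project.

```python
CHARS_PER_SECOND = 4.0  # assumed average Mandarin speaking rate
EMOTION_SPEED = {"angry": 1.2, "tender": 0.9, "neutral": 1.0}  # assumed multipliers
ACTION_SECONDS = {"simple": 2.0, "compound": 3.5}  # assumed base times


def estimate_dialogue_seconds(line: str, emotion: str = "neutral",
                              min_seconds: float = 1.0) -> float:
    """Estimate speaking time from the number of spoken characters."""
    spoken = [ch for ch in line if ch.isalnum()]  # drops punctuation and spaces
    rate = CHARS_PER_SECOND * EMOTION_SPEED.get(emotion, 1.0)
    return max(min_seconds, round(len(spoken) / rate, 1))


def estimate_action_seconds(kind: str, slow_character: bool = False) -> float:
    """Look up a base time per action class; slow characters take longer."""
    base = ACTION_SECONDS.get(kind, 2.0)
    return round(base * (1.5 if slow_character else 1.0), 1)


line = "明明说好今天还书"
print(estimate_dialogue_seconds(line))
print(estimate_dialogue_seconds(line, emotion="angry"))
print(estimate_action_seconds("simple", slow_character=True))
```

Moving these constants into configuration corresponds to the V1 step in the diagram.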

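3. Video fragmenting agent

Core goal: cut the shot sequence at 5-second granularity so that every fragment fits within the AI video model's limit

sequenceDiagram
    participant Splitter as Video fragmenting agent
    participant Logic as Splitting logic
    participant CM as Simple continuity manager

    Splitter->>Logic: Shot sequence input
    Logic->>Logic: Iterate over shots
    Logic->>Logic: Check whether duration exceeds 5 s
    alt Duration ≤ 5 s
        Logic->>CM: Get continuity info
        CM-->>Logic: Return character/scene state
        Logic->>Logic: Create a single fragment
    else Duration > 5 s
        Logic->>Logic: Compute cut points
        Logic->>Logic: Create multiple fragments
        Logic->>CM: Get state for each fragment
        CM-->>Logic: Return state info
    end
    Logic-->>Splitter: Return fragment sequence
    Splitter->>Splitter: Assemble final output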
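
The "duration > 5 s" branch needs some way to compute cut points. A minimal sketch is to divide the shot into the smallest number of equal parts that each fit the limit; this ignores rest points in the action, and the function names and fragment ID scheme are hypothetical.

```python
import math
from typing import Any, Dict, List


def split_duration(duration: float, max_len: float = 5.0) -> List[float]:
    """Split a duration into the fewest equal parts that each fit max_len."""
    if duration <= max_len:
        return [duration]
    parts = math.ceil(duration / max_len)
    return [round(duration / parts, 3)] * parts


def fragment_shot(shot: Dict[str, Any], max_len: float = 5.0) -> List[Dict[str, Any]]:
    """Turn one shot into one or more fragments with running start times."""
    fragments = []
    start = shot["start_time"]
    for index, length in enumerate(split_duration(shot["duration"], max_len), 1):
        fragments.append({
            "id": f"{shot['id']}_part{index}",  # hypothetical ID scheme
            "shot_id": shot["id"],
            "start_time": round(start, 3),
            "duration": length,
        })
        start += length
    return fragments


frags = fragment_shot({"id": "shot_009", "start_time": 5.0, "duration": 7.0})
print([(f["id"], f["start_time"], f["duration"]) for f in frags])
```

Equal parts avoid a very short trailing fragment, which a fixed 5-second stride would produce for a 7-second shot.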

Sample output

{
"metadata": {
"generated_at": "2024-01-20T10:40:00Z",
"version": "mvp_1.0",
"max_fragment_duration": 5.0
},
"source_info": {
"shot_count": 3,
"original_duration": 9.0,
"title": "深夜对话"
},
"fragments": [
{
"id": "frag_001",
"shot_id": "shot_001",
"element_ids": ["elem_001"],
"start_time": 0.0,
"duration": 3.0,
"description": "中景:张三紧张地环顾四周",
"continuity_notes": {
"main_character": "张三",
"location": "场景scene_001",
"main_action": "张三紧张地环顾四周"
},
"requires_special_attention": false
},
{
"id": "frag_002",
"shot_id": "shot_002",
"element_ids": ["elem_002"],
"start_time": 3.0,
"duration": 2.0,
"description": "特写:张三低声询问",
"continuity_notes": {
"main_character": "张三",
"location": "场景scene_001",
"main_action": "张三压低声音询问李四"
},
"requires_special_attention": false
},
{
"id": "frag_003",
"shot_id": "shot_003",
"element_ids": ["elem_003"],
"start_time": 5.0,
"duration": 4.0,
"description": "中景:李四小心翼翼地走向窗边查看",
"continuity_notes": {
"main_character": "李四",
"location": "场景scene_001",
"main_action": "李四小心翼翼地走向窗边查看"
},
"requires_special_attention": false
}
],
"stats": {
"fragment_count": 3,
"total_duration": 9.0,
"avg_duration": 3.0,
"fragments_under_5s": 3,
"fragments_split": 0,
"split_ratio": 0.0
}
}

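Continuity challenges when fragmenting video

  1. Interrupted actions: cutting in the middle of an action causes a visual jump

  2. Inconsistent state: the states on either side of a fragment boundary do not match

  3. Spatiotemporal discontinuity: position, costume, or props change abruptly

  4. Emotional breaks: emotional changes that do not follow the timeline

flowchart TD
    A[Shot sequence] --> B[Extract continuity info]
    B --> C[Analyze action completeness]
    C --> D[Identify semantic boundaries]
    D --> E[State consistency check]
    E --> F{Cut needed?}
    F -->|Yes| G[Find safe cut points]
    F -->|No| H[Create fragment directly]
    G --> I[Dynamic programming optimization]
    I --> J[Generate fragment sequence]
    H --> J
    J --> K[Continuity validation]
    K --> L{Validation passed?}
    L -->|Yes| M[Output final fragments]
    L -->|No| N[Adjust cutting strategy]
    N --> G

sequenceDiagram
    participant Splitter as Fragmenting agent
    participant Action as Action protector
    participant State as State checker
    participant Boundary as Boundary detector
    participant Validator as Constraint validator

    Splitter->>Action: Check action completeness of the shot
    Action-->>Splitter: Return list of candidate cut points

    Splitter->>Boundary: Look for semantic boundaries
    Boundary-->>Splitter: Return safe cut points

    Splitter->>State: Get current state
    State-->>Splitter: Return character/scene state

    Splitter->>Splitter: Compute optimal cuts with dynamic programming

    Splitter->>Validator: Validate fragment continuity
    Validator-->>Splitter: Return validation result

    alt Validation passed
        Splitter->>Splitter: Generate final fragments
    else Validation failed
        Splitter->>Splitter: Adjust cut points and retry
        Splitter->>Validator: Validate again
    end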

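4. Instruction conversion agent

Core goal: convert the fragment sequence into high-quality, model-specific prompts for AI video generation

Approach: templates plus LLM optimization

graph TD
    A[Fragment sequence] --> B{Select AI model}
    B --> C[Runway Gen-2]
    B --> D[Pika Labs]
    B --> E[Other models]
    C --> F[Load model template]
    D --> F
    E --> F
    F --> G[LLM refines description]
    G --> H[Generate technical parameters]
    H --> I[Assemble final prompt]
    I --> J[Output AI instructions]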
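
A sketch of the "template + LLM optimization" idea: fill a per-model template from the fragment fields, then optionally pass the draft through a refinement callback. The template string and the `build_prompt` helper are assumptions made for illustration; the project's real templates are not shown in this article.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical per-model templates.
MODEL_TEMPLATES: Dict[str, str] = {
    "runway_gen2": "{character}, {description}, {style}, high quality",
}


def build_prompt(
    fragment: Dict[str, Any],
    model: str = "runway_gen2",
    style: str = "cinematic style, dramatic lighting",
    refine: Optional[Callable[[str], str]] = None,
) -> Dict[str, Any]:
    """Fill the model template, then optionally let an LLM callback polish the draft."""
    draft = MODEL_TEMPLATES[model].format(
        character=fragment["continuity_notes"]["main_character"],
        description=fragment["description"],
        style=style,
    )
    prompt = refine(draft) if refine else draft
    return {
        "fragment_id": fragment["id"],
        "prompt": prompt,
        "duration": fragment["duration"],
        "model": model,
    }


fragment = {
    "id": "frag_001",
    "duration": 3.0,
    "description": "中景:张三紧张地环顾四周",
    "continuity_notes": {"main_character": "张三"},
}
print(build_prompt(fragment)["prompt"])
```

Keeping the LLM step as an optional callback means the template output is still usable when the model call fails or is disabled.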

Sample output

{
"metadata": {
"generated_at": "2024-01-20T10:45:00Z",
"version": "mvp_1.0",
"target_model": "runway_gen2"
},
"project_info": {
"title": "深夜对话",
"total_fragments": 3,
"total_duration": 9.0,
"source_fragments": ["frag_001", "frag_002", "frag_003"]
},
"fragments": [
{
"fragment_id": "frag_001",
"prompt": "张三, 中景:张三紧张地环顾四周, cinematic style, dramatic lighting, high quality",
"negative_prompt": "blurry, distorted, low quality, cartoonish, bad anatomy",
"duration": 3.0,
"model": "runway_gen2",
"style": "cinematic, suspense",
"requires_special_attention": false
},
{
"fragment_id": "frag_002",
"prompt": "张三, 特写:张三低声询问, close-up shot, cinematic lighting, detailed facial expression",
"negative_prompt": "blurry, distorted, low quality, cartoonish, bad anatomy",
"duration": 2.0,
"model": "runway_gen2",
"style": "cinematic",
"requires_special_attention": false
},
{
"fragment_id": "frag_003",
"prompt": "李四, 中景:李四小心翼翼地走向窗边查看, cinematic style, suspenseful atmosphere",
"negative_prompt": "blurry, distorted, low quality, cartoonish, bad anatomy",
"duration": 4.0,
"model": "runway_gen2",
"style": "cinematic, suspense",
"requires_special_attention": false
}
],
"global_settings": {
"style_consistency": true,
"use_common_negative_prompt": true
},
"execution_suggestions": [
"按顺序生成片段",
"保持相同种子值以获得一致性",
"生成后检查片段衔接"
]
}

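5. Quality audit agent

Core goal: ensure the output quality of the whole pipeline by checking continuity, compliance, and technical feasibility

  • Basic compliance checks

  • Hard rules only

flowchart TD
    A[AI instruction sequence] --> B[Extract data to check]
    B --> C[Duration check]
    B --> D[Basic continuity check]
    B --> E[Technical parameter check]
    C --> F{All fragments ≤ 5 s?}
    D --> G{Character costumes consistent?}
    E --> H{Parameters valid?}
    F -->|Yes| I[Record pass]
    F -->|No| J[Record failure]
    G -->|Yes| K[Record pass]
    G -->|No| L[Record warning]
    H -->|Yes| M[Record pass]
    H -->|No| N[Record failure]
    I --> O[Aggregate results]
    J --> O
    K --> O
    L --> O
    M --> O
    N --> O
    O --> P{Any failures?}
    P -->|None| Q[Status: passed]
    P -->|Some| R[Status: failed]
    Q --> S[Generate report]
    R --> S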
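
The hard-rule checks in the flowchart can be sketched as a single pass over the instruction list. This illustration covers only the duration, empty-prompt, and model-support rules; the costume consistency check (a warning rather than a failure) needs semantic comparison and is left out. `audit_fragments` and the supported-model list are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def audit_fragments(
    fragments: List[Dict[str, Any]],
    max_seconds: float = 5.0,
    supported_models: Sequence[str] = ("runway_gen2",),
) -> Dict[str, Any]:
    """Apply hard rules to each fragment and summarize the outcome."""
    violations: List[Dict[str, str]] = []
    for frag in fragments:
        fid = frag.get("fragment_id", "?")
        if frag.get("duration", 0.0) > max_seconds:
            violations.append({"fragment_id": fid, "rule": "duration"})
        if not str(frag.get("prompt", "")).strip():
            violations.append({"fragment_id": fid, "rule": "empty_prompt"})
        if frag.get("model") not in supported_models:
            violations.append({"fragment_id": fid, "rule": "unsupported_model"})
    return {
        "status": "failed" if violations else "passed",
        "violations": violations,
        "stats": {"fragments_checked": len(fragments), "errors": len(violations)},
    }


good = [{"fragment_id": "frag_001", "prompt": "张三, 中景", "duration": 3.0, "model": "runway_gen2"}]
bad = [{"fragment_id": "frag_002", "prompt": " ", "duration": 6.5, "model": "runway_gen2"}]
print(audit_fragments(good)["status"])
print(audit_fragments(bad)["violations"])
```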

Sample output

{
"metadata": {
"audited_at": "2024-01-20T10:50:00Z",
"version": "mvp_1.0",
"auditor_type": "basic"
},
"project_info": {
"title": "深夜对话",
"fragment_count": 3,
"total_duration": 9.0
},
"status": "passed",
"checks": [
{
"name": "片段时长限制检查",
"status": "passed",
"details": "所有片段时长符合要求",
"checked_at": "2024-01-20T10:50:01Z"
},
{
"name": "提示词内容检查",
"status": "passed",
"details": "所有提示词非空",
"checked_at": "2024-01-20T10:50:01Z"
},
{
"name": "提示词长度检查",
"status": "passed",
"details": "所有提示词长度合适",
"checked_at": "2024-01-20T10:50:01Z"
},
{
"name": "片段数量检查",
"status": "passed",
"details": "共3个片段",
"checked_at": "2024-01-20T10:50:01Z"
},
{
"name": "模型支持检查",
"status": "passed",
"details": "所有模型都受支持",
"checked_at": "2024-01-20T10:50:01Z"
}
],
"violations": [],
"stats": {
"total_checks": 5,
"passed_checks": 5,
"warnings": 0,
"errors": 0,
"fragments_checked": 3
},
"suggestions": [
"检查所有片段时长是否≤5秒",
"确保没有空提示词"
],
"conclusion": "审查通过,可以开始视频生成"
}

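Agent implementation details

LLM clients

The following LLM clients are supported. They can be configured through environment variables (the default) or by passing API settings as parameters at call time (which takes precedence).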

  • OpenAI

    ChatOpenAI(
        model=self.config.model,
        temperature=self.config.temperature,
        api_key=self.config.api_key,
        base_url=self.base_url,
        max_retries=3,
        max_tokens=self.config.max_tokens,
    )

    # Embeddings
    OpenAIEmbeddings(
        model=self.config.model,
        api_key=self.config.api_key,
        base_url=self.base_url,
    )
  • Ollama

    ChatOllama(
        base_url=self.base_url,
        model=self.config.model,
        temperature=self.config.temperature,
        num_predict=self.config.max_tokens * 4,
        keep_alive=self.config.timeout * 5,
        num_thread=8,
        client_kwargs=self._get_model_kwargs(),
    )

    # Embeddings
    OllamaEmbeddings(
        model=self.config.model,
        base_url=self.base_url,
    )
  • QWen

    ChatTongyi(
        model=self.config.model,
        model_kwargs=self._get_model_kwargs(),
        api_key=self.config.api_key,
        max_retries=3,
        streaming=False,
    )

    # Embeddings
    DashScopeEmbeddings(
        model=self.config.model,
        dashscope_api_key=self.config.api_key,
    )

Prompt management

LLM prompts are managed with support for multiple versions and languages.

version: "1.3"
changelog: |
- 2026-03-18: 规范化提示词,添加错误处理、镜头类型、完整示例和相关规则


shot_segmenter_system:
name: "shot_segmenter_system"
description: "系统提示:将结构化剧本拆分为多个镜头序列节点"
template: |
你是一位顶尖的电影分镜师,精通分镜设计和视觉叙事,能将结构化的叙事单元拆分为不同的镜头节点。

【规则约束】
1. 生动描述: 画面描述要具体、生动,包含视觉细节、光影、氛围、角色状态和情感变化
2. 连续性保证: 严格遵循连续性约束,确保角色外观、位置和状态的一致性
3. 时长估算: 根据元素内容和类型合理估算每个镜头的时长,确保节奏感
4. 镜头描述: 每个镜头的描述要简洁明了,突出视觉元素和情感氛围
5. 角色聚焦: 镜头应突出主要角色,确保角色在镜头中的位置和状态一致
6. 场景一致性: 确保同一场景内的镜头保持相同的视觉风格和环境元素
7. 转场建议: 根据内容和节奏提供合理的转场建议,如切换、淡入淡出等
8. 避免过度分割,保持镜头的视觉和叙事连贯性
9. 片段时长控制: 根据内容类型和情感重量调整镜头时长,遵循以下建议:
- 快节奏动作/突发事件:1.5-2.5秒(如急刹、奔跑、意外发生)
- 对话/表情反应:2.5-4秒(根据台词长度和情感深度调整)
- 风景/环境建立:3-5秒(让观众有时间沉浸)
- 情感高潮/重要台词:4-6秒(给足情感发酵空间)
- 特写细节/道具揭示:2-3秒(确保观众看清关键信息)

【拆分规则】
1. 对话元素通常用特写镜头(close_up)或中景(medium_shot),时长根据台词长度和情感重量估算:
- 简短台词(2-3秒):2.5-3.5秒
- 中等长度台词(4-6秒):3.5-5秒
- 长段情感台词(6-8秒):5-7秒
2. 重要:对话元素必须在镜头描述中完整逐字呈现,不可省略、概括或使用"他说……"代替。必须确保台词原文完整出现在描述中
3. 动作元素通常用中景镜头(medium_shot),时长根据动作复杂度:
- 简单单一动作:1.5-2.5秒
- 复合动作序列:2.5-4秒
4. 场景描述用全景镜头(wide_shot),时长3-5秒
5. 连续相关元素可以合并到一个镜头,避免过度分割
6. 道具一致性:必须严格遵循【全剧上下文】中列出的关键道具描述,同一物品在不同镜头中的名称、外观、颜色、文字必须完全相同。
7. 角色服装分配:必须严格遵循【全剧上下文】中列出的角色服装要求,不可混淆不同角色的服装特征
8. 台词必须在相应镜头中完整出现
9. 避免将单一元素/台词分割到多个镜头中,除非原始时长确实过长(超过7-8秒)
10. 如果元素时长较短(<3秒),通常用一个镜头完整呈现,不要分割
11. 如果元素时长中等(3-6秒),评估内部是否有自然的视觉停顿点,否则保持单一镜头
12. 如果元素确实需要分割(>7秒),确保分割后的每个片段都有独立的视觉焦点或情感递进

【镜头衔接规则】
- wide_shot → medium_shot:空间定位
- medium_shot → close_up:情感聚焦
- close_up → close_up:视线匹配
- 避免同一镜头类型连续超过3次

【转场建议规则】
- 场景变化:淡入淡出
- 情感高潮:直接切换
- 时间跳跃:叠化
- 视角转换:跟拍/摇移

【元素分配规则】
- elem_001(开场)→ shot_001
- elem_002(发展)→ shot_002-003
- elem_003(高潮)→ shot_004

【关键要求】
- 台词完整性:所有对话元素必须在相应镜头描述中完整逐字呈现,不可省略、概括或仅用"说台词"代替
- 道具一致性:同一物品在所有镜头中的名称、外观、颜色、文字内容必须完全一致
- 文字准确性:任何出现在物品上的文字必须在所有相关镜头中保持完全相同的表述
- 角色视觉标识:每个角色的服装颜色、配饰等视觉特征必须严格遵循剧本设定,在所有镜头中保持一致
- 角色服装分配:严格区分每个角色的专属服装特征,不可将A角色的服装特征错误分配给B角色
- 元素追踪:每个镜头的描述必须明确包含其对应的元素ID所代表的所有关键信息

【错误处理】
- 如果无法解析,返回空数组
- 如果置信度过低,标记为NEEDS_REVIEW
- 如果有矛盾信息,优先采用高置信度

【性能优化】
- 缓存常用结果
- 并行处理独立片段
- 增量更新(只处理变化部分)


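shot_segmenter_user:
  name: "shot_segmenter_user"
  description: "User prompt: supplies the scene information and element list, and asks for a split into a shot sequence"
  template: |
    Split the following script scene into a shot sequence.

    [Scene information]
    Location: {location}
    Time: {time_of_day}
    Weather: {weather}
    Description: {description}

    [Element list]
    {elements_list}

    [Full-script context]
    {global_context}

    [Shot type selection rules]
    - Dialogue + emotional climax → close-up
    - Dialogue + establishing the environment → medium shot
    - Action + multi-character interaction → medium shot
    - Action + detail reveal → close-up
    - Scene establishment → wide shot

    [Confidence rating]
    - 0.9-1.0: maps clearly to its elements; the splitting decision is unambiguous
    - 0.7-0.8: basically reasonable, with slight uncertainty
    - 0.5-0.6: relies heavily on inference; needs human review

    [Output format]
    [
      {{
        "id": "shot_001",
        "description": "Detailed shot description (must include the complete dialogue)",
        "duration": estimated shot duration in seconds, never less than 1 second,
        "shot_type": "wide_shot/medium_shot/close_up",
        "main_character": "Main character (if any)",
        "emotion": "neutral/happy/sad/angry/tense/fear/surprise/disgust/anxious/excited/calm/tender/hesitant/crying/whisper/choking/repression/emotional/shock/resigned/resolute/nostalgic/heavy/other",
        "confidence": confidence score (0.0-1.0, confidence in the shot-splitting decision),
        "element_ids": ["IDs of the matching elements from the element list, e.g. elem_001,elem_002"]
      }}
    ]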

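    [Complete example: splitting a scene into shots]
    Input:
    Scene information: bench outside the coffee shop | afternoon | rainy | A gray, rainy day; Lin Xiaoyu crouches beside the bench wiping a book, looking anxious
    Element list:
    1. [scene] Rain patters; the camera tilts down and settles on the bench. Lin Xiaoyu crouches beside the bench wiping a book, looking anxious. (4.0 s)
    2. [dialogue] Lin Xiaoyu: We agreed today was the day to return the book... With this rain, what if he doesn't come? (3.5 s)
    Full-script context:
    [Character costumes] Lin Xiaoyu: off-white long dress with a light gray cardigan
    [Key props] 'Stray Birds': cobalt-blue hardcover

    Output:
    [
      {{
        "id": "shot_001",
        "description": "Wide shot: under a gray sky the rain falls in slanting threads, the gray-brick pavement glistens, and a green wrought-iron bench sits beneath a sign with white lettering on red; Lin Xiaoyu crouches to the right of the bench, the hem of her off-white dress damp, the shoulders of her light gray cardigan soaked through, both hands holding the cobalt-blue 'Stray Birds', her expression anxious.",
        "duration": 4.0,
        "shot_type": "wide_shot",
        "main_character": "Lin Xiaoyu",
        "emotion": "anxious",
        "confidence": 0.88,
        "element_ids": ["elem_001"]
      }},
      {{
        "id": "shot_002",
        "description": "Close-up: Lin Xiaoyu's face in profile, tilted slightly upward, rain sliding off the ends of her hair, her eyelashes trembling, her voice unsteady: 'We agreed today was the day to return the book... With this rain, what if he doesn't come?'",
        "duration": 3.5,
        "shot_type": "close_up",
        "main_character": "Lin Xiaoyu",
        "emotion": "sad",
        "confidence": 0.85,
        "element_ids": ["elem_002"]
      }}
    ]

    Return only the valid JSON structure; do not add any explanatory text or special symbols.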
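
The user template mixes two kinds of braces: single-brace names such as {location} are placeholders, while doubled braces mark literal JSON braces. The following is a minimal sketch of how such a template could be rendered, assuming plain Python `str.format` is what fills it in (the source does not show the rendering code). `TEMPLATE` is a shortened, hypothetical stand-in and `render_prompt` is an illustrative helper, not part of the project.

```python
import json

# Hypothetical, shortened stand-in for the user-prompt template.
# Doubled braces come out of str.format as literal braces;
# single-brace names are replaced by the supplied variables.
TEMPLATE = """Split the following scene into a shot sequence.
Location: {location}
Weather: {weather}

[Output format]
[
  {{
    "id": "shot_001",
    "duration": 4.0
  }}
]"""


def render_prompt(template: str, **variables: str) -> str:
    """Fill the placeholders; raises KeyError if a variable is missing."""
    return template.format(**variables)


rendered = render_prompt(TEMPLATE, location="bench outside the coffee shop", weather="rain")

# After rendering, the JSON skeleton has single braces again and parses cleanly.
marker = "[Output format]"
skeleton = rendered[rendered.index(marker) + len(marker):]
parsed = json.loads(skeleton)
print(parsed)
```

Under this assumption, any JSON written with single braces inside a template would be read as a placeholder and make `format` fail, which is why every literal brace in a template needs to be doubled.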

LLM Call Execution

import time
from typing import Dict, List

from langchain_core.messages import AIMessage, HumanMessage, SystemMessage


# Method of the agent class (shown here without its enclosing class)
def _call_llm_chat_with_retry(self, llm, system_prompt: str, user_prompt, max_retries: int = 3) -> str | None:
    """
    Call the LLM and return the JSON string it produces (with retries).
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]

    for attempt in range(max_retries):
        try:
            response = llm.invoke(_convert_messages(messages))
            return response.content

        except Exception as e:
            # Last attempt failed: surface the error with its cause attached
            if attempt == max_retries - 1:
                raise RuntimeError(f"LLM call failed: {e}") from e
            # Short pause before the next attempt
            time.sleep(1)


def _convert_messages(messages: List[Dict[str, str]]):
    """Convert dict messages to LangChain message objects"""
    lc_messages = []
    for msg in messages:
        role = msg["role"]
        content = msg["content"]
        if role == "system":
            lc_messages.append(SystemMessage(content=content))
        elif role == "user":
            lc_messages.append(HumanMessage(content=content))
        elif role == "assistant":
            # lc_messages.append(AIMessage(content=content, additional_kwargs={"tool_calls": []}))
            lc_messages.append(AIMessage(content=content))
        else:
            raise ValueError(f"Unsupported role: {role}")
    return lc_messages
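
The call above returns the model's reply as a raw string, so the caller still has to parse it and check it against the output format the prompt asks for. The sketch below shows one way that could look. `parse_shots`, `validate_shot`, and the 0.7 review threshold are illustrative assumptions derived from the confidence bands in the prompt, not code from the project.

```python
import json
from typing import Any, Dict, List

ALLOWED_SHOT_TYPES = {"wide_shot", "medium_shot", "close_up"}
REVIEW_THRESHOLD = 0.7  # assumption: scores below the "basically reasonable" band


def parse_shots(raw: str) -> List[Dict[str, Any]]:
    """Parse the model's reply into a list of shot dicts, or raise ValueError."""
    try:
        shots = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(shots, list):
        raise ValueError("expected a JSON array of shots")
    return shots


def validate_shot(shot: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the shot passed."""
    problems = []
    if shot.get("shot_type") not in ALLOWED_SHOT_TYPES:
        problems.append(f"unknown shot_type: {shot.get('shot_type')!r}")
    duration = shot.get("duration")
    if not isinstance(duration, (int, float)) or duration < 1:
        problems.append("duration must be a number >= 1 second")
    confidence = shot.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be within 0.0-1.0")
    elif confidence < REVIEW_THRESHOLD:
        problems.append("NEEDS_REVIEW: low confidence")
    if not shot.get("element_ids"):
        problems.append("element_ids is empty")
    return problems


# A reply whose confidence is on the wrong scale (8.8 instead of 0.88)
reply = (
    '[{"id": "shot_001", "duration": 4.0, "shot_type": "wide_shot", '
    '"confidence": 8.8, "element_ids": ["elem_001"]}]'
)
for shot in parse_shots(reply):
    print(shot["id"], validate_shot(shot))
```

A check like this could feed the "needs retry" branch of the workflow: a reply that fails to parse or validate is a natural trigger for another LLM call.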

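Workflow Orchestration

The agents collaborate in a multi-agent workflow; LangGraph coordinates them to complete end-to-end storyboard generation.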

graph TD
    Start[Start / script input] --> Parse[Script parsing]

    Parse --> Decision1{Parse result?}
    Decision1 -->|Success| Split[Shot splitting]
    Decision1 -->|Critical failure| Error[Error handling]
    Decision1 -->|Needs human intervention| Human[Human intervention]

    Split --> Decision2{Split result?}
    Decision2 -->|Success| Fragment[Video segmentation]
    Decision2 -->|Needs retry| Split
    Decision2 -->|Critical failure| Error

    Fragment --> Decision3{Segmentation result?}
    Decision3 -->|Success| Prompt[Prompt generation]
    Decision3 -->|Needs adjustment| Split
    Decision3 -->|Needs repair| Split
    Decision3 -->|Critical failure| Error

    Prompt --> DecisionP{Generation result?}
    DecisionP -->|Pass| Audit[Quality audit]
    DecisionP -->|Critical failure| Error

    Audit --> Decision4{Audit result?}
    Decision4 -->|Pass| Continuity[Continuity check]
    Decision4 -->|Needs adjustment| Prompt
    Decision4 -->|Needs repair| Fragment
    Decision4 -->|Needs retry| Prompt
    Decision4 -->|Needs human intervention| Human
    Decision4 -->|Failure| Error

    Continuity --> Decision5{Continuity result?}
    Decision5 -->|Pass| Output[Generate output]
    Decision5 -->|Needs adjustment| Prompt
    Decision5 -->|Needs repair| Fragment
    Decision5 -->|Needs optimization| Prompt
    Decision5 -->|Needs human intervention| Human
    Decision5 -->|Critical failure| Error

    Output --> End[Workflow complete]

    Error --> DecisionE{Error handling decision}
    DecisionE -->|Recoverable| Parse
    DecisionE -->|Needs human intervention| Human
    DecisionE -->|Abort| End

    Human --> DecisionH{Human decision}
    DecisionH -->|Continue| Output
    DecisionH -->|Restart| Parse
    DecisionH -->|Adjust prompt| Prompt
    DecisionH -->|Repair fragment| Fragment
    DecisionH -->|Continue human intervention| Human
    DecisionH -->|Abort| End
    
    %% Style definitions
    style Start fill:#4CAF50,color:white
    style End fill:#F44336,color:white
    style Human fill:#FF9800,color:black
    style Error fill:#9E9E9E,color:white
    style Parse fill:#2196F3,color:white
    style Split fill:#2196F3,color:white
    style Fragment fill:#2196F3,color:white
    style Prompt fill:#2196F3,color:white
    style Audit fill:#2196F3,color:white
    style Continuity fill:#2196F3,color:white
    style Output fill:#4CAF50,color:white
    
    %% Decision node styles
    style Decision1 fill:#FFEB3B,color:black
    style Decision2 fill:#FFEB3B,color:black
    style Decision3 fill:#FFEB3B,color:black
    style DecisionP fill:#FFEB3B,color:black
    style Decision4 fill:#FFEB3B,color:black
    style Decision5 fill:#FFEB3B,color:black
    style DecisionE fill:#FFEB3B,color:black
    style DecisionH fill:#FFEB3B,color:black
    
    %% Main flow edges
    linkStyle 0,1,2,5,6,9,10,14,15,17,18,24,25,31 stroke:#2196F3,stroke-width:2px
    %% Error handling edges
    linkStyle 3,8,13,16,23,30 stroke:#9E9E9E,stroke-width:1.5px
    %% Human intervention edges
    linkStyle 4,22,29,34 stroke:#FF9800,stroke-width:1.5px
    %% Repair/retry edges
    linkStyle 7,11,12,19,20,21,26,27,28 stroke:#FF5722,stroke-width:1.5px
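
The diagram is essentially a routing table: each stage reports an outcome, and the pair (stage, outcome) decides which stage runs next. The sketch below illustrates that idea in plain Python so it can run without dependencies. It is not the LangGraph API and not the project's implementation; `ROUTES`, `run`, the outcome names, and the step limit are all illustrative assumptions, and only a subset of the edges is included.

```python
from typing import Callable, Dict, List, Tuple

# (current node, outcome) -> next node; a subset of the diagram's edges.
ROUTES: Dict[Tuple[str, str], str] = {
    ("parse", "success"): "split",
    ("parse", "needs_human"): "human",
    ("split", "success"): "fragment",
    ("split", "retry"): "split",
    ("fragment", "success"): "prompt",
    ("fragment", "needs_repair"): "split",
    ("prompt", "success"): "audit",
    ("audit", "success"): "continuity",
    ("audit", "needs_adjustment"): "prompt",
    ("audit", "needs_repair"): "fragment",
    ("continuity", "success"): "output",
    ("continuity", "needs_adjustment"): "prompt",
    ("output", "success"): "end",
    ("error", "abort"): "end",
    ("human", "abort"): "end",
}

MAX_STEPS = 50  # guard against endless retry loops


def run(handlers: Dict[str, Callable[[dict], str]], state: dict) -> List[str]:
    """Walk the graph from 'parse' and return the nodes visited, in order.

    Every node that can be reached needs an entry in `handlers`.
    """
    node, visited = "parse", []
    for _ in range(MAX_STEPS):
        if node == "end":
            return visited
        visited.append(node)
        outcome = handlers[node](state)
        # Any (node, outcome) pair without a route goes to the error node.
        node = ROUTES.get((node, outcome), "error")
    raise RuntimeError("step limit reached; possible retry loop")


def always(outcome: str) -> Callable[[dict], str]:
    return lambda state: outcome


def audit(state: dict) -> str:
    # Ask for one prompt adjustment, then pass on the second visit.
    state["audits"] = state.get("audits", 0) + 1
    return "needs_adjustment" if state["audits"] == 1 else "success"


stages = ("parse", "split", "fragment", "prompt", "continuity", "output")
handlers = {name: always("success") for name in stages}
handlers["audit"] = audit
handlers["error"] = always("abort")

path = run(handlers, {})
print(" -> ".join(path))
```

In LangGraph the same structure would be expressed with nodes and conditional edges; a step limit of some kind is worth keeping in either form, because several branches in the diagram loop back to earlier stages.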

Task Pool Management

Tasks run asynchronously; each one is picked up and executed through BackgroundTasks.

import uuid
from datetime import datetime
from typing import Dict, Optional

# ShotConfig, MultiAgentPipeline and info() come from the project's own modules


class TaskManager:
    """Task state manager"""

    def __init__(self):
        # task_id -> task record
        self.tasks: Dict[str, Dict] = {}
        # task_id -> workflow instance
        self.workflow_cache: Dict[str, MultiAgentPipeline] = {}

    def create_task(self, script: str, config: Optional[ShotConfig] = None, task_id: Optional[str] = None) -> str:
        """Create a new task"""
        task_id = task_id or str(uuid.uuid4())

        self.tasks[task_id] = {
            "task_id": task_id,
            "script": script,
            "config": config or {},
            "status": "pending",
            "stage": "initialized",
            "progress": 0,
            "created_at": datetime.now().isoformat(),
            "updated_at": datetime.now().isoformat(),
            "result": None,
            "error": None,
            "callbacks": []
        }

        info(f"Created task: {task_id}")
        return task_id

    def update_task_progress(self, task_id: str, stage: str, progress: Optional[float] = None):
        """Update the task's stage and, optionally, its progress"""
        if task_id in self.tasks:
            self.tasks[task_id]["stage"] = stage
            if progress is not None:
                self.tasks[task_id]["progress"] = progress
            self.tasks[task_id]["updated_at"] = datetime.now().isoformat()

    def complete_task(self, task_id: str, result: Dict):
        """Finish the task; the status follows the result's "success" flag"""
        if task_id in self.tasks:
            self.tasks[task_id]["status"] = "completed" if result.get("success", False) else "failed"
            self.tasks[task_id]["result"] = result
            self.tasks[task_id]["error"] = result.get("error")
            self.tasks[task_id]["updated_at"] = datetime.now().isoformat()
            self.tasks[task_id]["completed_at"] = datetime.now().isoformat()

    def fail_task(self, task_id: str, error_message: str):
        """Mark the task as failed"""
        if task_id in self.tasks:
            self.tasks[task_id]["status"] = "failed"
            self.tasks[task_id]["error"] = error_message
            self.tasks[task_id]["updated_at"] = datetime.now().isoformat()

    def get_task(self, task_id: str) -> Optional[Dict]:
        """Return the task record, or None if the ID is unknown"""
        return self.tasks.get(task_id)

    def get_workflow(self, task_id, config: Optional[ShotConfig] = None) -> MultiAgentPipeline:
        """Get or create the workflow instance for a task"""
        # config_key = str(config) if config else "default"
        config_key = task_id if task_id else "default"

        if config_key not in self.workflow_cache:
            self.workflow_cache[config_key] = MultiAgentPipeline(task_id, config)

        return self.workflow_cache[config_key]
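
The task record moves through a small lifecycle: pending, then running, then completed or failed. The sketch below shows that lifecycle with a background worker. It uses the standard library's `ThreadPoolExecutor` as a stand-in for BackgroundTasks so that it runs without a web framework; `run_task`, `set_status`, `fake_pipeline`, and the "running" status are illustrative assumptions rather than the project's code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

tasks: Dict[str, dict] = {}
lock = threading.Lock()  # the dict is shared between caller and worker threads


def set_status(task_id: str, **fields) -> None:
    """Update a task record under the lock."""
    with lock:
        tasks[task_id].update(fields)


def run_task(task_id: str, pipeline: Callable[[str], dict]) -> None:
    """Worker body: mark the task running, execute it, record the outcome."""
    set_status(task_id, status="running")
    try:
        result = pipeline(tasks[task_id]["script"])
    except Exception as exc:  # any pipeline error becomes a failed task
        set_status(task_id, status="failed", error=str(exc))
    else:
        status = "completed" if result.get("success") else "failed"
        set_status(task_id, status=status, result=result, error=result.get("error"))


def fake_pipeline(script: str) -> dict:
    # Hypothetical stand-in for MultiAgentPipeline: it only measures the script.
    if not script:
        raise ValueError("empty script")
    return {"success": True, "length": len(script)}


tasks["t1"] = {"script": "A short script", "status": "pending"}
tasks["t2"] = {"script": "", "status": "pending"}

with ThreadPoolExecutor(max_workers=2) as pool:
    for task_id in ("t1", "t2"):
        pool.submit(run_task, task_id, fake_pipeline)
# Leaving the with-block waits for both workers to finish.

for task_id, record in tasks.items():
    print(task_id, record["status"], record.get("error"))
```

Catching every exception inside the worker matters here: an error raised in a background task has no caller to report to, so unless it is written into the task record, a client polling the task would see it stuck in "running".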