Kuaishou's cinematic video generation model, supporting text-to-video and image-to-video at 720p with multi-shot control, native audio with voice control, negative prompts, and an adjustable CFG scale.
Details
kling-v3
Prices shown are in USD.
Provider Performance
Fastest generation through fal at 89ms median latency with 100.0% success rate.
Aggregated from real API requests over the last 30 days.
Provider Rankings
| # | Provider | p50 Gen Time | p95 Gen Time | Success Rate | TTFB (p50) |
|---|---|---|---|---|---|
| 1 | fal | 89ms | 602ms | 100.0% | — |
Providers & Pricing (3)
Kling V3 is available from 3 providers, with per-video pricing starting at $0.084 through fal.ai.
- fal/kling-v3
- fal/kling-v3-i2v
- replicate/kling-v3
Kling V3 API: Async video generation
Integrate Kling V3 into your applications via Lumenfall's OpenAI-compatible API to generate professional 720p video from text and image prompts with precise CFG scale and negative prompt control.
https://api.lumenfall.ai/v1
kling-v3
Code Examples
Text to Video
/v1/videos/generations

```shell
# Step 1: Submit video generation request
VIDEO_ID=$(curl -s -X POST \
  https://api.lumenfall.ai/v1/videos \
  -H "Authorization: Bearer $LUMENFALL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kling-v3",
    "prompt": "A lighthouse on a cliff at sunset, cinematic drone shot",
    "size": "1024x1024"
  }' | jq -r '.id')
echo "Video ID: $VIDEO_ID"

# Step 2: Poll for completion
while true; do
  RESULT=$(curl -s \
    https://api.lumenfall.ai/v1/videos/$VIDEO_ID \
    -H "Authorization: Bearer $LUMENFALL_API_KEY")
  STATUS=$(echo "$RESULT" | jq -r '.status')
  echo "Status: $STATUS"
  if [ "$STATUS" = "completed" ]; then
    echo "$RESULT" | jq -r '.output.url'
    break
  elif [ "$STATUS" = "failed" ]; then
    echo "$RESULT" | jq -r '.error.message'
    break
  fi
  sleep 5
done
```
```javascript
const BASE_URL = 'https://api.lumenfall.ai/v1';
const API_KEY = 'YOUR_API_KEY';

// Step 1: Submit video generation request
const submitRes = await fetch(`${BASE_URL}/videos`, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'kling-v3',
    prompt: 'A lighthouse on a cliff at sunset, cinematic drone shot',
    size: '1024x1024'
  })
});
const { id: videoId } = await submitRes.json();
console.log('Video ID:', videoId);

// Step 2: Poll for completion
while (true) {
  const pollRes = await fetch(`${BASE_URL}/videos/${videoId}`, {
    headers: { 'Authorization': `Bearer ${API_KEY}` }
  });
  const result = await pollRes.json();
  if (result.status === 'completed') {
    console.log('Video URL:', result.output.url);
    break;
  } else if (result.status === 'failed') {
    console.error('Error:', result.error.message);
    break;
  }
  await new Promise(r => setTimeout(r, 5000));
}
```
```python
import requests
import time

BASE_URL = "https://api.lumenfall.ai/v1"
API_KEY = "YOUR_API_KEY"
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Step 1: Submit video generation request
response = requests.post(
    f"{BASE_URL}/videos",
    headers=HEADERS,
    json={
        "model": "kling-v3",
        "prompt": "A lighthouse on a cliff at sunset, cinematic drone shot",
        "size": "1024x1024"
    }
)
video_id = response.json()["id"]
print(f"Video ID: {video_id}")

# Step 2: Poll for completion
while True:
    result = requests.get(
        f"{BASE_URL}/videos/{video_id}",
        headers=HEADERS
    ).json()
    if result["status"] == "completed":
        print(f"Video URL: {result['output']['url']}")
        break
    elif result["status"] == "failed":
        print(f"Error: {result['error']['message']}")
        break
    time.sleep(5)
```
Image to Video
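The Image-to-Video flow mirrors the Text-to-Video calls above: submit to the same endpoint with an `input_reference` image and poll for completion. A minimal Python sketch of the request body follows; the image URL and prompt are illustrative placeholders, not values from this page.

```python
# Sketch of an Image-to-Video request body for kling-v3.
# The prompt and image URL below are illustrative placeholders.

def build_i2v_request(prompt, image_url, size="1024x1024"):
    """Assemble the JSON body for an image-to-video submission."""
    return {
        "model": "kling-v3",
        "prompt": prompt,
        "input_reference": [image_url],  # image(s) to animate into video
        "size": size,
    }

payload = build_i2v_request(
    "The ship slowly sails toward the horizon",
    "https://example.com/ship.jpg",
)
print(payload["model"])  # kling-v3

# Submitting and polling then follow the Text-to-Video example:
# requests.post(f"{BASE_URL}/videos", headers=HEADERS, json=payload)
```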
/v1/videos/generations

Parameter Reference
Core Parameters
| Parameter | Type | Description | Modes |
|---|---|---|---|
| prompt | string | Required. Text prompt for video generation | T2V, I2V |
| negative_prompt | string | Negative prompt to guide generation away from undesired content | T2V, I2V |
| duration | number | Video duration in seconds | T2V, I2V |
| mode (replicate) | string | "standard" generates 720p, "pro" generates 1080p. Options: "standard", "pro". Default: "pro" | T2V, I2V |
Size & Layout
| Parameter | Type | Description | Modes |
|---|---|---|---|
| size | string | Video dimensions as WxH pixels (e.g. "1920x1080") or aspect ratio (e.g. "16:9"). Supported: "1365x768", "768x1365", "1024x1024". WxH determines both shape and scale (aspect_ratio and resolution are ignored when size is provided); W:H format is equivalent to aspect_ratio. | T2V, I2V |
| aspect_ratio | string | Aspect ratio of the output video (e.g. "16:9", "1:1"). Options: "16:9", "9:16", "1:1". Controls shape independently of scale; use with resolution to control both. If size is also provided, size takes precedence. Any ratio is accepted and mapped to the nearest supported value. | T2V, I2V |
| resolution | string | Output resolution tier (e.g. "1K", "4K"). Options: "1K". Controls scale independently of shape; higher tiers produce larger videos and cost more. If size is also provided, size takes precedence for scale. Any tier is accepted and mapped to the nearest supported value. | T2V, I2V |
1K (3 sizes)

| Output | size | aspect_ratio + resolution |
|---|---|---|
| 1024 × 1024 | "1024x1024" | "1:1" + "1K" |
| 768 × 1365 | "768x1365" | "9:16" + "1K" |
| 1365 × 768 | "1365x768" | "16:9" + "1K" |
How these parameters work

- size: exact pixel dimensions (e.g. "1920x1080")
- aspect_ratio: shape only, default scale (e.g. "16:9")
- resolution: scale tier, preserves shape (e.g. "1K")

Priority when combined: size is most specific and always wins; aspect_ratio and resolution control shape and scale independently.

How matching works: requested values map to the nearest supported option. For example, requesting 7:1 on a model that supports 4:1 and 8:1 yields 8:1. Resolution accepts K tiers (0.5K, 1K, 2K, 4K) or megapixel tiers (0.25, 1); if the exact tier isn't available, you get the nearest one.
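The nearest-match rule can be sketched as follows. This is an illustration of the mapping behavior described above, not Lumenfall's actual implementation:

```python
# Illustrative sketch of nearest-value matching for aspect ratios.
# Not the gateway's real code; it just mirrors the rule described above.

def parse_ratio(s):
    """Convert '16:9' to a float width/height ratio."""
    w, h = s.split(":")
    return float(w) / float(h)

def nearest_ratio(requested, supported):
    """Map a requested aspect ratio to the closest supported one."""
    target = parse_ratio(requested)
    return min(supported, key=lambda r: abs(parse_ratio(r) - target))

print(nearest_ratio("7:1", ["4:1", "8:1"]))  # 8:1 (7 is closer to 8 than to 4)
print(nearest_ratio("16:10", ["16:9", "9:16", "1:1"]))  # 16:9
```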
Character Elements
| Parameter | Type | Description | Modes |
|---|---|---|---|
| elements (fal) | array | Elements (characters/objects) to include in the video. Each element can be either an image set (frontal + reference images) or a video. Reference in the prompt as @Element1, @Element2, etc. | T2V, I2V |
Multi-Shot Control
| Parameter | Type | Description | Modes |
|---|---|---|---|
| multi_prompt | array | List of prompts for multi-shot video generation. If provided, overrides the single prompt and divides the video into multiple shots with specified prompts and durations. | T2V, I2V |
| shot_type (fal) | string | The type of multi-shot video generation. Options: "customize", "intelligent". Default: "customize" | T2V, I2V |
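As a sketch, a multi-shot request body might look like the following. The per-shot keys ("prompt", "duration") inside multi_prompt are assumptions for illustration and are not confirmed by this page:

```python
# Hypothetical multi-shot request body for kling-v3.
# Per-shot keys ("prompt", "duration") are assumed for illustration.

payload = {
    "model": "kling-v3",
    "shot_type": "customize",  # caller-defined shot boundaries
    "multi_prompt": [
        {"prompt": "A knight rides through a misty forest", "duration": 4},
        {"prompt": "Close-up: the knight lowers his visor", "duration": 3},
    ],
}

# multi_prompt overrides the single `prompt`, so the total duration is
# the sum of the per-shot durations.
total = sum(shot["duration"] for shot in payload["multi_prompt"])
print(total)  # 7
```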
Audio
| Parameter | Type | Description | Modes |
|---|---|---|---|
| generate_audio | boolean | Whether to generate audio alongside the video | T2V, I2V |
Output & Format
| Parameter | Type | Description | Modes |
|---|---|---|---|
| n | integer | Number of videos to generate. Default: 1. The gateway generates multiple videos in parallel even if the provider only supports 1. | T2V, I2V |
Additional Parameters
| Parameter | Type | Description | Modes |
|---|---|---|---|
| input_reference | array | Input image(s) to animate into video | T2V, I2V |
| cfg_scale | number | Classifier-free guidance scale | T2V, I2V |
| end_image | string | End frame image URL for video interpolation | T2V, I2V |
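A request combining these steering controls might look like the sketch below; the prompt text, URLs, and parameter values are illustrative choices, not recommended defaults:

```python
# Illustrative request body combining steering parameters for kling-v3.
# All values here are example choices, not recommended defaults.

payload = {
    "model": "kling-v3",
    "prompt": "A vintage car drives down a rain-slicked neon street",
    "negative_prompt": "blurry, low quality, text overlays",
    "cfg_scale": 0.7,  # higher values push output closer to the prompt
    "end_image": "https://example.com/last-frame.jpg",  # target final frame
    "size": "1365x768",
}
print(payload["cfg_scale"])
```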
Parameter Normalization
How we handle parameters across different providers
Not every provider speaks the same language. When you send a parameter, we handle it in one of four ways depending on what the model supports:
| Behavior | What happens | Example |
|---|---|---|
| passthrough | Sent as-is to the provider | style, quality |
| renamed | Same value, mapped to the field name the provider expects | prompt |
| converted | Transformed to the provider's native format | size |
| emulated | Works even if the provider has no concept of it | n, response_format |
Parameters we don't recognize pass straight through to the upstream API, so provider-specific options still work.
Kling V3 FAQ
How much does Kling V3 cost?
Kling V3 starts at $0.084 per video through Lumenfall. Pricing varies by provider. Lumenfall does not add any markup to provider pricing.
How do I use Kling V3 via API?
You can use Kling V3 through Lumenfall's OpenAI-compatible API. Send requests to the unified endpoint with model ID "kling-v3". Code examples are available in Python, JavaScript, and cURL.
Which providers offer Kling V3?
Kling V3 is available through fal.ai and Replicate on Lumenfall. Lumenfall automatically routes requests to the best available provider.
Overview
Kling V3 is a cinematic video generation model developed by Kuaishou, designed to produce high-fidelity video sequences from either text prompts or static images. It represents a significant iteration in the Kling family, introducing native audio generation and precise control over cinematic parameters like multi-shot coordination and voice-synchronized output. The model is distinctive for its ability to output video at 720p resolution while maintaining temporal consistency across complex motions.
Strengths
- Integrated Audio Synthesis: Unlike models that require post-production dubbing, Kling V3 generates native audio with direct voice control, ensuring sound effects and speech are synchronized with the visual action.
- Multi-Shot Control: The model excels at maintaining character and environmental consistency across multiple shots within a single generation, reducing the visual “drift” common in long-form AI video.
- Fine-Grained Steering: Developers can utilize negative prompts and adjustable Classifier-Free Guidance (CFG) scales to tightly constrain the output, allowing for better adherence to specific brand guidelines or aesthetic requirements.
- Dynamic Motion Handling: It demonstrates high proficiency in rendering complex human movements and fluid physics, making it suitable for realistic storytelling rather than just static “living portraits.”
Limitations
- Resolution Constraints: While the model produces high-quality cinematic content, it is currently capped at 720p native resolution, which may require upscaling for 4K professional broadcast workflows.
- Inference Latency: Due to the complexity of simultaneous video and audio synthesis, generation times may be higher compared to models that focus exclusively on visual frames.
- Niche Stylization: While excellent for realistic and cinematic styles, it may struggle with highly abstract or non-Euclidean artistic prompts where spatial logic is intentionally broken.
Technical Background
Kling V3 is built on a sophisticated diffusion transformer architecture optimized for spatio-temporal modeling. It utilizes a joint training approach where video and audio data are processed in the same latent space, allowing the model to learn the fundamental relationships between visual motion and acoustic signals. This version places a heavy emphasis on CFG scaling and negative prompt integration to improve prompt adherence over its predecessors.
Best For
Kling V3 is ideal for creators developing marketing assets, cinematic trailers, and social media content that requires “one-shot” generation of both visuals and sound. It is particularly effective for character-driven narratives where lip-syncing or specific voice parameters are necessary. You can experiment with Kling V3’s Text-to-Video and Image-to-Video modes through Lumenfall’s unified API and interactive playground to integrate high-end video synthesis into your existing applications.
Try Kling V3 in Playground
Generate videos with custom prompts, no API key needed.