This API documentation is provisional. WorldStereo code and model weights are not yet publicly released. The API described here represents the expected interface based on the research framework.

Overview

The inference API provides a simplified interface for generating videos with WorldStereo. It handles the complexity of memory management, diffusion scheduling, and camera control, allowing you to focus on creative applications.
This API is designed for production use cases and research experiments. For low-level control, use the WorldStereo model directly.

Function: generate_video

Main function for generating multi-view-consistent videos.
generate_video(
    model: WorldStereo,
    condition_image: Union[str, PIL.Image, torch.Tensor],
    camera_trajectory: CameraTrajectory,
    num_frames: int = 16,
    height: int = 512,
    width: int = 512,
    num_inference_steps: int = 50,
    guidance_scale: float = 7.5,
    point_cloud: Optional[torch.Tensor] = None,
    use_geometric_memory: bool = True,
    use_spatial_memory: bool = True,
    seed: Optional[int] = None,
    output_format: str = "tensor",
    save_path: Optional[str] = None,
    progress_callback: Optional[Callable] = None
) -> Union[torch.Tensor, np.ndarray, List[PIL.Image]]

Parameters

model
WorldStereo
required
Initialized WorldStereo model instance.
condition_image
Union[str, PIL.Image, torch.Tensor]
required
Conditioning image for video generation. Can be:
  • File path (str)
  • PIL Image object
  • Torch tensor with shape (C, H, W) or (1, C, H, W)
Supports both perspective and panoramic images.
camera_trajectory
CameraTrajectory
required
Camera trajectory defining viewpoints for each frame. See Camera Control for creation methods.
num_frames
int
default:"16"
Number of frames to generate. Must match the trajectory length.
height
int
default:"512"
Output video height in pixels.
width
int
default:"512"
Output video width in pixels.
num_inference_steps
int
default:"50"
Number of denoising steps. More steps typically improve quality but increase computation time.
  • Recommended: 25-50 for fast generation
  • 50-100 for high quality
guidance_scale
float
default:"7.5"
Classifier-free guidance scale. Higher values increase fidelity to the condition image but may reduce diversity.
  • Typical range: 5.0-15.0
  • Lower values (5.0-7.5): More creative
  • Higher values (10.0-15.0): More faithful to input
point_cloud
torch.Tensor
Optional initial point cloud for geometric conditioning, shape (N, 3) or (N, 6) with RGB.
use_geometric_memory
bool
default:"True"
Whether to use global geometric memory module.
use_spatial_memory
bool
default:"True"
Whether to use spatial-stereo memory module.
seed
int
Random seed for reproducibility. If None, uses a random seed.
output_format
str
default:"tensor"
Output format:
  • "tensor": PyTorch tensor (T, C, H, W)
  • "numpy": NumPy array (T, H, W, C)
  • "pil": List of PIL Images
  • "video": MP4 file (requires save_path)
save_path
str
Path to save the output video. Required if output_format="video".
progress_callback
Callable
Optional callback function called after each denoising step: callback(step: int, total_steps: int, latent: torch.Tensor) -> None
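As a quick illustration, the plain-Python callback below follows the (step, total_steps, latent) signature documented above and logs percent completion. The injectable log function is our own addition (not part of the API) so the callback can write to any sink:

```python
# Minimal progress callback matching the documented
# callback(step, total_steps, latent) signature. The latent
# argument is accepted but unused here; `log` defaults to print.

def make_progress_callback(log=print):
    def callback(step, total_steps, latent):
        pct = 100 * (step + 1) // total_steps
        log(f"step {step + 1}/{total_steps} ({pct}%)")
    return callback
```

Pass the result as progress_callback=make_progress_callback() to generate_video().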

Returns

video
Union[torch.Tensor, np.ndarray, List[PIL.Image]]
Generated video in the specified output format.
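If you generate with the default "tensor" format and later need the "numpy" layout, the conversion is a simple transpose. The helper below is a sketch; only the (T, C, H, W) and (T, H, W, C) layouts come from this page, and the assumption that frame values lie in [0, 1] is ours:

```python
import numpy as np

def to_numpy_layout(video):
    """Convert a (T, C, H, W) float array with values in [0, 1]
    into a (T, H, W, C) uint8 array, mirroring the "numpy"
    output format described above."""
    frames = np.transpose(video, (0, 2, 3, 1))  # channels last
    return (np.clip(frames, 0.0, 1.0) * 255.0).round().astype(np.uint8)

video = np.random.rand(16, 3, 512, 512).astype(np.float32)
frames = to_numpy_layout(video)
print(frames.shape)  # (16, 512, 512, 3)
```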

Function: generate_multi_view

Generate multiple videos from different camera trajectories with shared memory.
generate_multi_view(
    model: WorldStereo,
    condition_image: Union[str, PIL.Image, torch.Tensor],
    trajectories: List[CameraTrajectory],
    shared_memory: bool = True,
    **generation_kwargs
) -> List[Union[torch.Tensor, np.ndarray, List[PIL.Image]]]
model
WorldStereo
required
WorldStereo model instance.
condition_image
Union[str, PIL.Image, torch.Tensor]
required
Conditioning image.
trajectories
List[CameraTrajectory]
required
List of camera trajectories for different viewpoints.
shared_memory
bool
default:"True"
If True, memory modules are shared and updated across all trajectories for consistency.
**generation_kwargs
Any
Additional arguments passed to generate_video().
videos
List
List of generated videos, one per trajectory.
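A sketch of the intended call pattern, assuming the provisional interface above (the model cannot actually be run until weights are released; the two forward trajectories are placeholder choices):

```python
from worldstereo import WorldStereo, generate_multi_view, CameraTrajectory

model = WorldStereo.from_pretrained("worldstereo/worldstereo-v1").cuda()

# Two viewpoints of the same scene; shared memory keeps the
# resulting videos geometrically consistent with each other.
trajectories = [
    CameraTrajectory.forward(distance=1.0, num_frames=16),
    CameraTrajectory.forward(distance=3.0, num_frames=16),
]
videos = generate_multi_view(
    model=model,
    condition_image="scene.jpg",
    trajectories=trajectories,
    shared_memory=True,
    num_frames=16,
)
```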

Function: reconstruct_3d

Generate a video and reconstruct the 3D scene from it.
reconstruct_3d(
    model: WorldStereo,
    condition_image: Union[str, PIL.Image, torch.Tensor],
    camera_trajectory: CameraTrajectory,
    reconstruction_method: str = "point_cloud",
    num_frames: int = 16,
    **generation_kwargs
) -> Tuple[Union[torch.Tensor, np.ndarray], Dict[str, Any]]
model
WorldStereo
required
WorldStereo model instance.
condition_image
Union[str, PIL.Image, torch.Tensor]
required
Conditioning image.
camera_trajectory
CameraTrajectory
required
Camera trajectory for video generation.
reconstruction_method
str
default:"point_cloud"
3D reconstruction method:
  • "point_cloud": Dense point cloud
  • "mesh": Triangle mesh
  • "neural_field": Neural radiance field
num_frames
int
default:"16"
Number of frames to generate.
**generation_kwargs
Any
Additional arguments for generation.
video
Union[torch.Tensor, np.ndarray]
Generated video.
reconstruction
Dict[str, Any]
3D reconstruction with:
  • points: Point cloud coordinates (N, 3)
  • colors: RGB colors (N, 3) (if available)
  • normals: Surface normals (N, 3) (if available)
  • mesh: Triangle mesh (if reconstruction_method="mesh")
  • quality_metrics: Reconstruction quality metrics
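A usage sketch, again assuming the provisional interface above; the returned point cloud could then be saved or visualized with a library such as Open3D:

```python
from worldstereo import WorldStereo, reconstruct_3d, CameraTrajectory

model = WorldStereo.from_pretrained("worldstereo/worldstereo-v1").cuda()
trajectory = CameraTrajectory.forward(distance=2.0, num_frames=16)

video, reconstruction = reconstruct_3d(
    model=model,
    condition_image="scene.jpg",
    camera_trajectory=trajectory,
    reconstruction_method="point_cloud",
    num_frames=16,
)
points = reconstruction["points"]       # (N, 3) coordinates
colors = reconstruction.get("colors")   # (N, 3) RGB, if available
```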

Pipeline Class

For more control, use the WorldStereoPipeline class.

Class: WorldStereoPipeline

WorldStereoPipeline(
    model_path: str,
    device: str = "cuda",
    torch_dtype: torch.dtype = torch.float16,
    enable_xformers: bool = True,
    enable_cpu_offload: bool = False
)
model_path
str
required
Path to pretrained model weights or Hugging Face model ID.
device
str
default:"cuda"
Device to run inference on.
torch_dtype
torch.dtype
default:"torch.float16"
Data type for model weights (float32 or float16).
enable_xformers
bool
default:"True"
Use memory-efficient attention from xFormers.
enable_cpu_offload
bool
default:"False"
Offload model to CPU when not in use (saves GPU memory).

Methods

The pipeline exposes the same methods as the functional API: generate_video(), generate_multi_view(), and reconstruct_3d().
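Putting it together, a sketch assuming the provisional constructor and methods above. We also assume the pipeline binds its own model, so the model argument from the functional API is dropped; that detail is not yet specified:

```python
import torch
from worldstereo import WorldStereoPipeline, CameraTrajectory

pipeline = WorldStereoPipeline(
    model_path="worldstereo/worldstereo-v1",
    device="cuda",
    torch_dtype=torch.float16,
    enable_xformers=True,
)
trajectory = CameraTrajectory.forward(distance=2.0, num_frames=16)
video = pipeline.generate_video(
    condition_image="scene.jpg",
    camera_trajectory=trajectory,
    num_frames=16,
)
```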

Examples

import torch
from worldstereo import WorldStereo, generate_video, CameraTrajectory
from PIL import Image

# Load model
model = WorldStereo.from_pretrained("worldstereo/worldstereo-v1").cuda()

# Load conditioning image
image = Image.open("scene.jpg")

# Create camera trajectory
trajectory = CameraTrajectory.forward(
    distance=2.0,
    num_frames=16
)

# Generate video
video = generate_video(
    model=model,
    condition_image=image,
    camera_trajectory=trajectory,
    num_frames=16,
    guidance_scale=7.5,
    seed=42
)

print(video.shape)  # (16, 3, 512, 512)

Advanced Configuration

Memory Management

# Initialize with custom memory settings
model = WorldStereo(
    vdm_config={...},
    global_memory_config={
        "max_points": 50000,  # Limit point cloud size
        "update_frequency": "manual"  # Control updates manually
    },
    spatial_memory_config={
        "memory_size": 512,  # Smaller memory bank
        "num_correspondences": 32
    }
)

# Disable memory modules for faster generation
video = generate_video(
    model=model,
    condition_image=image,
    camera_trajectory=trajectory,
    use_geometric_memory=False,  # Disable global memory
    use_spatial_memory=False  # Disable spatial memory
)

Optimization

# Enable xFormers for memory-efficient attention
from worldstereo.optimization import enable_xformers_memory_efficient_attention
enable_xformers_memory_efficient_attention(model)

# Use mixed precision
model = model.half()  # Convert to float16

# Reduce inference steps for speed
video = generate_video(
    model=model,
    condition_image=image,
    camera_trajectory=trajectory,
    num_inference_steps=25,  # Faster, slight quality trade-off
    guidance_scale=7.5
)

# Enable CPU offload for large models
from worldstereo.optimization import enable_model_cpu_offload
enable_model_cpu_offload(model)

Batch Processing

# Generate videos for multiple scenes
images = [Image.open(f"scene_{i}.jpg") for i in range(4)]
trajectories = [CameraTrajectory.forward(distance=2.0) for _ in range(4)]

# Process one scene at a time (parallel batching only if GPU memory allows)
videos = []
for img, traj in zip(images, trajectories):
    video = generate_video(
        model=model,
        condition_image=img,
        camera_trajectory=traj,
        num_frames=16
    )
    videos.append(video)

Quality Tips

  • Use 50-100 inference steps
  • Set guidance_scale to 10.0-15.0
  • Enable both memory modules
  • Use higher resolution (768x768 or 1024x1024)
  • Provide initial point cloud if available
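The tips above can be collected into one set of keyword arguments for generate_video(). The specific values below are illustrative picks from the recommended ranges, not official defaults:

```python
# Illustrative high-quality settings drawn from the tips above.
hq_kwargs = dict(
    height=768,
    width=768,
    num_inference_steps=75,    # within the 50-100 high-quality range
    guidance_scale=12.0,       # within the 10.0-15.0 fidelity range
    use_geometric_memory=True,
    use_spatial_memory=True,
)
# video = generate_video(model=model, condition_image=image,
#                        camera_trajectory=trajectory, **hq_kwargs)
```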

WorldStereo Model

Low-level model API for advanced use cases

Memory Modules

Configure geometric memory components

Camera Control

Create and customize camera trajectories

3D Reconstruction

Detailed reconstruction pipeline
