An avatar is a reusable identity you drive with turns or lipsync. It is backed by one of two source kinds.

Source kinds

Portrait

A single image. The model animates the face from audio. Portraits are fast to register and suit most use cases.

Source video

Real footage. The model lipsyncs against the actual frames for natural head motion. The clip is preprocessed into a reusable .avtrv cache and ping-pong looped (played forward, then in reverse) to cover any audio length.
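To make ping-pong looping concrete, here is a minimal sketch of how an output frame number could map back to a source frame. It illustrates the general technique only; the function names, the frame rate parameter, and the choice not to repeat the first and last frames at each turnaround are assumptions, not the platform's documented behavior.

```python
import math


def ping_pong_index(i: int, n_frames: int) -> int:
    """Map output frame i to a source frame index, bouncing at both ends."""
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    if n_frames == 1:
        return 0
    # One full cycle goes 0..n-1 and back down to 1 (endpoints not repeated).
    period = 2 * (n_frames - 1)
    pos = i % period
    return pos if pos < n_frames else period - pos


def frames_for_audio(audio_seconds: float, fps: float, n_frames: int) -> list[int]:
    """Return the source frame indices needed to cover the audio duration."""
    total = math.ceil(audio_seconds * fps)
    return [ping_pong_index(i, n_frames) for i in range(total)]


# A 4-frame clip stretched over 8 output frames.
print(frames_for_audio(1.0, 8, 4))  # [0, 1, 2, 3, 2, 1, 0, 1]
```

Because the clip reverses direction instead of jumping back to frame 0, there is no visible cut at the loop point, which is why a short clip can cover audio of any length.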

Preparation

prepare warms the avatar so the first turn is fast: it downloads and normalizes the portrait (or loads the source-video cache) ahead of time.
  • Streaming turns: call prepare once per session before the first turn.
  • Lipsync: preparation happens implicitly, so you don’t need a separate call.
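The once-per-session rule for streaming turns can be sketched as a small wrapper. Everything here is hypothetical: `StreamingSession`, `FakeClient`, and the `prepare(avatar_id)` and `turn(avatar_id, text)` signatures stand in for whatever client you use and are not the real SDK surface.

```python
class FakeClient:
    """Stand-in for a real client (hypothetical); it only records calls."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def prepare(self, avatar_id: str) -> None:
        self.calls.append(("prepare", avatar_id))

    def turn(self, avatar_id: str, text: str) -> None:
        self.calls.append(("turn", avatar_id, text))


class StreamingSession:
    """Ensures prepare runs exactly once, before the first turn."""

    def __init__(self, client, avatar_id: str) -> None:
        self._client = client
        self._avatar_id = avatar_id
        self._prepared = False

    def open(self) -> None:
        # Call this as early as possible so the warm-up finishes
        # before the user's first turn arrives.
        if not self._prepared:
            self._client.prepare(self._avatar_id)
            self._prepared = True

    def turn(self, text: str) -> None:
        # Safety net: if open() was skipped, prepare still runs first.
        self.open()
        self._client.turn(self._avatar_id, text)


client = FakeClient()
session = StreamingSession(client, "avatar_123")
session.open()
session.turn("Hello")
session.turn("How are you?")
print(client.calls)
```

Calling `open()` explicitly at session start keeps the warm-up cost off the first turn. The guard inside `turn` only protects against a forgotten call and does not remove that latency.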

Voices

Each avatar can carry a default voice chosen from the Cartesia voice list, or the platform can match one automatically at creation time. To use a different voice for a single turn, pass voice_id. See Create avatars to register an avatar.
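The precedence described above (per-turn voice_id first, then the avatar's default) can be sketched as follows. The `Avatar` dataclass, its `default_voice_id` field, and the example IDs are illustrative assumptions; only the voice_id parameter name comes from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Avatar:
    avatar_id: str
    # Hypothetical field: holds the voice chosen or auto-matched at creation.
    default_voice_id: Optional[str] = None


def resolve_voice(avatar: Avatar, voice_id: Optional[str] = None) -> Optional[str]:
    """Return the per-turn override if given, else the avatar's default voice."""
    return voice_id if voice_id is not None else avatar.default_voice_id


avatar = Avatar(avatar_id="avatar_123", default_voice_id="voice_default")
print(resolve_voice(avatar))                             # voice_default
print(resolve_voice(avatar, voice_id="voice_override"))  # voice_override
```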