Create or transform audio: text-to-audio, inpainting (edit parts of a clip), or audio-to-audio for sound design and music.
Audio prompts work best when they define mood, pacing, structure, and finish. The more clearly you describe the role of the sound, the cleaner the result tends to be.
Best results start with genre, mood, structure, and arrangement.
Stable Audio on Pixio (e.g. Stable Audio 2.5) supports three modes: text-to-audio, inpainting (editing part of a clip), and audio-to-audio. Use it when you need prompt-driven music or sound design and also want the option to edit existing clips (inpainting) or transform them (audio-to-audio).

| Mode | Input | Best for |
|---|---|---|
| Text to Audio | Prompt (genre, mood, duration) | New music or sound design from scratch |
| Inpainting | Existing clip + mask + prompt | Edit or replace a segment |
| Audio to Audio | Existing clip + prompt | Transform style, mood, or content |
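The mode table above can be read as a small decision rule based on which inputs you already have. The sketch below is only an illustration of that rule: `pick_mode` and its parameters are hypothetical names, not part of any Pixio or Stable Audio API.

```python
def pick_mode(has_clip: bool, has_mask: bool) -> str:
    """Return the mode name from the table that matches the available inputs.

    Hypothetical helper for illustration; it only encodes the table above.
    A prompt is assumed to be present in every case.
    """
    if has_mask and not has_clip:
        # A mask marks a segment of an existing clip, so it needs a clip.
        raise ValueError("a mask requires an existing clip")
    if has_clip and has_mask:
        return "Inpainting"        # existing clip + mask + prompt
    if has_clip:
        return "Audio to Audio"    # existing clip + prompt
    return "Text to Audio"         # prompt only


print(pick_mode(has_clip=True, has_mask=False))
```

Starting from a prompt alone maps to Text to Audio; adding a clip, and then a mask, moves you to Audio to Audio and Inpainting respectively.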

| Option | Values | Notes |
|---|---|---|
| Duration | Depends on backend (e.g. a maximum of 90 s or more) | Check Pixio for limits |
| Prompt | Genre, mood, instruments, structure | Be specific for best results |
| Credits | Plan-based | Check model card in Pixio |

| Scenario | Best choice |
|---|---|
| Text-to-audio + inpainting + audio-to-audio | Stable Audio |
| Music only (no edit) | Pixio Music, Lyria 2, MiniMax Music, Songcraft |
| Speech / TTS | ElevenLabs TTS, MiniMax Speech |
| Sound effects only | Music Compose Sound Effects |
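If you route requests in a script, the scenario table above can be kept as a plain lookup. This is a minimal sketch: the scenario keys (`generate_and_edit`, `music_only`, and so on) and the `suggest_models` helper are assumptions made for illustration; only the model names come from the table.

```python
# Scenario keys are hypothetical labels; model names are copied from the table.
MODEL_CHOICES = {
    "generate_and_edit": ["Stable Audio"],  # text-to-audio + inpainting + audio-to-audio
    "music_only": ["Pixio Music", "Lyria 2", "MiniMax Music", "Songcraft"],
    "speech": ["ElevenLabs TTS", "MiniMax Speech"],
    "sound_effects": ["Music Compose Sound Effects"],
}


def suggest_models(scenario: str) -> list:
    """Return the models the table lists for a scenario key."""
    try:
        return MODEL_CHOICES[scenario]
    except KeyError:
        known = ", ".join(sorted(MODEL_CHOICES))
        raise ValueError(f"unknown scenario {scenario!r}; expected one of: {known}")


print(suggest_models("generate_and_edit"))
```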

Use production language, not just genre labels.
Tell the model how the energy should move over time.
For speech, define delivery style, tone, and pacing.
For music, define arrangement and emotional arc early.
A strong audio prompt describes role, pacing, tone, and finish so the output feels produced rather than generic.
Describe the genre, emotional arc, instrumentation, and structure instead of relying on broad tags alone.
Define how the piece should progress so the output feels intentional instead of flat or repetitive.
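One way to apply the guidance above is to fill in a small brief before writing the prompt, so that genre, mood, instrumentation, progression, and finish are all stated. The sketch below is an assumption-laden illustration: `AudioBrief`, its fields, and the output layout are hypothetical and not a format Stable Audio requires.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AudioBrief:
    """Hypothetical container for the elements a strong audio prompt names."""
    genre: str
    mood: str
    instruments: List[str] = field(default_factory=list)
    structure: List[str] = field(default_factory=list)  # sections in order, describing how energy moves
    finish: str = ""                                     # e.g. how polished or raw the mix should sound

    def to_prompt(self) -> str:
        """Join the filled-in fields into a single prompt string."""
        parts = [f"{self.genre}, {self.mood} mood"]
        if self.instruments:
            parts.append("instrumentation: " + ", ".join(self.instruments))
        if self.structure:
            parts.append("structure: " + " -> ".join(self.structure))
        if self.finish:
            parts.append("finish: " + self.finish)
        return "; ".join(parts)


brief = AudioBrief(
    genre="Lo-fi hip hop",
    mood="warm, nostalgic",
    instruments=["dusty drums", "Rhodes piano"],
    structure=["sparse intro", "full groove", "stripped outro"],
    finish="polished, radio-ready mix",
)
print(brief.to_prompt())
```

Listing the structure as ordered sections makes the intended progression explicit, which is the point of describing how the energy should move rather than relying on a genre tag alone.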
Split, edit, or reshape useful material rather than rebuilding the whole asset from nothing.
Stable Audio 2.5 is strongest when the brief is clear about function: what the sound should do, how it should move, and what it should feel like.
Use structure language early so the output lands closer to production-ready within the first few passes.
For voice work, specify delivery and character. For music, specify arrangement and emotional progression.
Decide whether the output is carrying narrative, mood, rhythm, or all three.
Describe the build, energy, and transitions so the result has movement instead of flattening out.
Once the direction is right, refine and separate instead of regenerating blindly.
Pair voice generation with cloning when continuity across campaigns or characters matters.
Use generated music or speech as the finishing layer once the visual cut is already working.