add-tts-model

Integrate a new text-to-speech model into vLLM-Omni, from a HuggingFace reference implementation through to production-ready serving with streaming and CUDA graph acceleration. Use when adding a new TTS model, wiring stage separation for speech synthesis, enabling online voice generation serving, debugging TTS integration behavior, or building audio output pipelines.


#tts #vllm #huggingface #cuda #speech-synthesis #model-integration

⌘source

author: @vllm-project
repo: vllm-project/vllm-omni
language: Python

✦overview.md

Key Features

·Integrates new TTS models into vLLM-Omni from HuggingFace references
·Enables production-ready serving with streaming support
·Uses CUDA graph acceleration to speed up serving
·Wires stage separation for speech synthesis pipelines
·Supports online voice generation serving
·Facilitates debugging of TTS integration behavior

Use Cases

→Adding a new text-to-speech model to vLLM-Omni
→Building audio output pipelines for speech synthesis
→Debugging TTS integration behavior in production systems
→Enabling online voice generation serving with streaming

Best for

✓DevOps teams deploying TTS models
✓Performance-critical speech synthesis applications

Not ideal for

!One-off model experiments without production deployment
!Non-CUDA environments

FAQs

.claude/skills/add-tts-model/SKILL.md

name

add-tts-model

description

TTS Model Integration Workflow

Overview

HF Reference -> Stage Separation -> Online Serving -> Async Chunk -> CUDA Graph
   (Phase 1)      (Phase 2)          (Phase 3)        (Phase 4)     (Phase 5)

Phase 1: HuggingFace Reference

Goal: Understand the reference implementation and verify it produces correct audio.
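One way to make "correct audio" checkable later is to reduce each reference waveform to a small summary that can be saved and compared against what the integrated model produces. The following is a minimal sketch using only the standard library; the helper names (`summarize_waveform`, `matches_reference`), the 24 kHz sample rate, and the 5% tolerance are assumptions for illustration, and the sine wave stands in for real model output.

```python
import hashlib
import json
import math
import struct


def summarize_waveform(samples, sample_rate):
    """Reduce a mono waveform (floats in [-1, 1]) to a small, comparable summary."""
    n = len(samples)
    rms = math.sqrt(sum(s * s for s in samples) / n) if n else 0.0
    peak = max((abs(s) for s in samples), default=0.0)
    # Hash the 16-bit PCM form so bit-exact matches are cheap to detect.
    pcm = struct.pack(
        f"<{n}h", *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples)
    )
    return {
        "sample_rate": sample_rate,
        "num_samples": n,
        "duration_s": n / sample_rate,
        "rms": rms,
        "peak": peak,
        "pcm16_sha256": hashlib.sha256(pcm).hexdigest(),
    }


def matches_reference(candidate, reference, rel_tol=0.05):
    """Loose check: same sample rate, with duration and RMS within rel_tol."""
    if candidate["sample_rate"] != reference["sample_rate"]:
        return False
    return all(
        math.isclose(candidate[key], reference[key], rel_tol=rel_tol)
        for key in ("duration_s", "rms")
    )


# Stand-in for reference model output: one second of a 220 Hz tone.
sample_rate = 24000  # assumed; use the rate the reference model reports
reference_audio = [
    0.5 * math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)
]
reference = summarize_waveform(reference_audio, sample_rate)

# The summary is plain JSON, so it can be stored next to the prompt that produced it.
saved = json.loads(json.dumps(reference))
print(matches_reference(saved, reference))
```

The hash only helps when decoding is deterministic; with sampling enabled, the loose duration and RMS comparison is the more realistic check, and listening to the output remains the final word.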

Steps

1. Run the reference model end-to-end using the official HuggingFace / GitHub code
2. Document the architecture:
   - What are the sub-models? (AR decoder, codec decoder, vocoder, etc.)
   - What is the token vocabulary? (semantic codes, RVQ codebooks, special tokens)
   - What is the output format? (sample rate, channels, codec type)
3. Capture reference outputs for comparison during integration
4. Identify the config structure: config.json fields, model_type, sub-model configs
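For the config-structure step, a short script that walks `config.json` and lists every nested sub-config can help. The sketch below is one possible approach; the helper names and every field in the example config (`example_tts`, `decoder_config`, `codec_config`, and so on) are hypothetical and not taken from any real model.

```python
import json


def find_sub_configs(config, prefix=""):
    """Yield (dotted_path, sub_config) for every nested dict in a config."""
    for key, value in config.items():
        if isinstance(value, dict):
            path = f"{prefix}{key}"
            yield path, value
            yield from find_sub_configs(value, prefix=f"{path}.")


def summarize_config(config):
    """Collect the top-level model_type, nested sub-configs, and scalar fields."""
    return {
        "model_type": config.get("model_type"),
        "sub_configs": {
            path: sub.get("model_type") for path, sub in find_sub_configs(config)
        },
        "scalar_fields": sorted(
            key for key, value in config.items()
            if not isinstance(value, (dict, list))
        ),
    }


# Hypothetical config.json contents; the field names are illustrative only.
example = json.loads("""
{
  "model_type": "example_tts",
  "sample_rate": 24000,
  "decoder_config": {"model_type": "example_ar_decoder", "hidden_size": 1024},
  "codec_config": {"model_type": "example_codec", "num_codebooks": 8}
}
""")

summary = summarize_config(example)
print(json.dumps(summary, indent=2))
```

In practice the dictionary would come from `json.load` on the reference checkpoint's `config.json` rather than from an inline string.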

Key Questions

How many codebooks? What are the codebook sizes?
What special tokens exist? (<|voice|>, <|audio_start|>, <|im_end|>, etc.)
What is the token-to-ID mapping for codec codes?
What is the hop length / frame rate of the codec?
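The answers to these questions feed directly into some simple arithmetic. The sketch below assumes a flattened layout in which the codebooks sit back to back after a fixed offset in the vocabulary; that is only one common convention, so the real mapping must be read from the reference code. All numbers (24 kHz, hop length 480, 8 codebooks of 1024 codes, offset 32000) and function names are illustrative assumptions.

```python
def frame_rate_hz(sample_rate, hop_length):
    """Codec frames per second: one frame covers hop_length audio samples."""
    return sample_rate / hop_length


def codec_token_id(codebook_index, code, codebook_size, codec_offset):
    """Map (codebook, code) to a vocabulary ID under the assumed flattened layout."""
    if not 0 <= code < codebook_size:
        raise ValueError(f"code {code} is outside codebook of size {codebook_size}")
    return codec_offset + codebook_index * codebook_size + code


def decode_codec_token_id(token_id, codebook_size, codec_offset):
    """Inverse of codec_token_id: return (codebook_index, code)."""
    return divmod(token_id - codec_offset, codebook_size)


# Illustrative values only; take the real ones from the reference model.
sample_rate = 24000
hop_length = 480
num_codebooks = 8
codebook_size = 1024
codec_offset = 32000  # assumed first vocabulary ID reserved for codec codes

frames_per_second = frame_rate_hz(sample_rate, hop_length)
codes_per_second = frames_per_second * num_codebooks
token_id = codec_token_id(2, 5, codebook_size, codec_offset)

print(frames_per_second, codes_per_second, token_id)
print(decode_codec_token_id(token_id, codebook_size, codec_offset))
```

Round-tripping a few IDs through both functions against the reference tokenizer is a quick way to confirm whether the flattened-layout assumption holds for a given model.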

...

$install


npx skills add vllm-project/vllm-omni --skill add-tts-model

Safety assessment

★

Clarity score

How clear and easy to understand the SKILL.md instructions are, rated from 1 to 5.

3/5

good

Mostly clear, but there are still a few confusing or poorly structured parts.

◎

Actionability score

How directly an agent can act on the SKILL.md instructions, rated from 1 to 5.

3/5

medium

Partially actionable with several concrete steps, but still missing important details.

~community cookbook

April 18, 2026


