How we gave an AI coding agent a low-latency, fully offline voice on a 2016 MacBook Pro (Intel CPU, Arch Linux) using Python 3.14, kokoro-onnx, and mpv.

![Giving Your AI Agent a Local Voice (CPU-Only)](/assets/images/infog/giving-your-ai-agent-a-local-voice-cpu-only.webp)

## The Blockers We Hit

* **Cloud APIs are too slow:** Cloud TTS adds network round trips, rate limits, and cost. It ruins the flow of a real-time pair-programming session with an agent.
* **Python 3.14 build failures:** Installing the standard `kokoro` package tries to compile `spacy` and `blis` from source. On older CPUs or Arch setups, GCC fails on unsupported CPU vector extensions (like `-mavx512pf`).

**The Solution:** Avoid the standard PyTorch/pip compilation paths. Use **ONNX Runtime (`kokoro-onnx`)**. It uses pre-compiled wheels, loads instantly, and runs inference on a single CPU core in milliseconds.

## The Architecture

```mermaid
graph TD
    Agent[AI Agent generates Text] -->|Runs background CLI| Script[Python Wrapper Script]
    Script -->|Loads ONNX Runtime| Kokoro[kokoro-onnx model]
    Kokoro -->|Generates raw waveform| File[Save to agent_response.wav]
    File -->|Command-line play| MPV[mpv --no-video]
    MPV -->|Output| Speakers[Laptop Speakers]
```

## The Agent Integration

Since the AI agent has direct shell execution tools (`run_command`), the integration is dead simple. Whenever the agent formulates a response, it concurrently spins up a background task:

```bash
python speak_message.py "Your response text goes here"
```

This starts playing the audio through the system speakers while the text response streams into the UI. No cloud latency, no subscription keys, and fully offline.

If you're building out these types of agentic workflows, check out my post [A Better Codex CLI Wrapper](/a-better-codex-cli-wrapper-with-logs-defaults/) for more on managing agent interactions.

---

## Final Thoughts: The Soul in the Machine

Giving an AI agent a voice isn't just a gimmick. It's about [Vibe Coding vs. Agentic Engineering](/vibe-coding-vs-agentic-engineering/): making a workspace that feels alive and intentional. In a world where we spend hours staring at a terminal, having a partner that "sounds" human, with all the little phonetic quirks and rhythms that match your own, makes the work feel less solitary. Shared vibes in local pair-programming.
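
For reference, here is a minimal sketch of what the `speak_message.py` wrapper from the Agent Integration section might look like. It is not the exact script behind this post. The model and voice file names, the `af_sarah` voice, the use of `soundfile` to write the WAV, and the `--really-quiet` flag are all assumptions; only `agent_response.wav` and `mpv --no-video` come from the architecture diagram.

```python
"""speak_message.py: hypothetical sketch of the wrapper, not the original script."""
import subprocess
import sys
from pathlib import Path

# Assumed file names: the post does not say which model/voice files it uses.
MODEL_PATH = Path("kokoro-v1.0.onnx")
VOICES_PATH = Path("voices-v1.0.bin")
OUTPUT_WAV = Path("agent_response.wav")  # file name taken from the diagram


def synthesize(text: str, wav_path: Path = OUTPUT_WAV, voice: str = "af_sarah") -> Path:
    """Render text to a WAV file with kokoro-onnx."""
    # Imported lazily so this module can be loaded without the TTS dependencies.
    import soundfile as sf  # assumption: soundfile writes the WAV
    from kokoro_onnx import Kokoro

    kokoro = Kokoro(str(MODEL_PATH), str(VOICES_PATH))
    samples, sample_rate = kokoro.create(text, voice=voice, speed=1.0, lang="en-us")
    sf.write(str(wav_path), samples, sample_rate)
    return wav_path


def player_command(wav_path: Path) -> list[str]:
    """Build the mpv invocation; --really-quiet keeps agent logs clean."""
    return ["mpv", "--no-video", "--really-quiet", str(wav_path)]


def speak(text: str) -> int:
    """Synthesize the text, play it, and return mpv's exit code."""
    wav_path = synthesize(text)
    return subprocess.run(player_command(wav_path), check=False).returncode


# Only run when text was actually passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(speak(" ".join(sys.argv[1:])))
```

Keeping `player_command` separate from `speak` makes it easy to swap mpv for another player without touching the synthesis step.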