How We Built Codelikha's Voice Engine

April 20, 2025

by Priya Nair, Co-founder & CTO

When we started building Codelikha, we had a clear thesis: voice was the right primary interface for developer tools, and the AI models were finally good enough to make it work. What we did not have was a clear picture of how to build it.

This post is about the engineering decisions we made, the ones that surprised us, and what we learned building a voice-first product for developers.

Why voice latency is everything

The first thing we learned — the hard way — is that latency is not just a performance metric in a voice application. It is the entire user experience.

When you type a command and the response takes 800ms, you barely notice. You have already moved on to the next keystroke. But when you speak a sentence and the response takes 800ms, you are standing there waiting. The interface breaks. The conversation dies.

We ran early user tests with a 600ms average latency. Developer after developer said the same thing: it feels unnatural. They were not complaining about the quality of the output — the code generation was solid. They were complaining about the silence after they spoke.

We spent three weeks optimizing for latency before anything else, because we realized that if the voice interface does not feel like conversation, nothing else matters. Our target was sub-500ms perceived response time. We hit it.

The architecture

At a high level, Codelikha works as follows:

1. Developer speaks; speech is captured via the browser's Web Speech API.
2. Transcript is sent to our backend, which injects project context.
3. Context-enriched prompt is sent to the language model.
4. LLM response is streamed back.
5. Response is simultaneously rendered as text and converted to speech.
6. Audio streams back to the browser in real time.

The key architectural decision was to stream both text and audio in parallel rather than waiting for the full LLM response before starting speech synthesis. This cut our perceived latency by about 40%.
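As a rough illustration of the parallel streaming idea, here is a minimal Python sketch. It is not Codelikha's actual implementation: the function name `fan_out`, the event labels, and the sentence-boundary rule (sentence-ending punctuation followed by whitespace) are all assumptions made for the example. Each token is forwarded to the text view as soon as it arrives, and each completed sentence is handed to speech synthesis without waiting for the rest of the response.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumed boundary rule: ".", "!" or "?" followed by whitespace ends a sentence.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def fan_out(tokens: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield ("text", token) for every streamed token, and ("speech", sentence)
    as soon as a complete sentence has accumulated."""
    buffer = ""
    for token in tokens:
        # The text view gets every token immediately.
        yield ("text", token)
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # All parts except the last are complete sentences; the last is still open.
        for sentence in parts[:-1]:
            if sentence:
                yield ("speech", sentence)
        buffer = parts[-1]
    # Flush whatever remains once the model stops streaming.
    if buffer.strip():
        yield ("speech", buffer.strip())


if __name__ == "__main__":
    stream = ["This ", "adds ", "a cache. ", "It ", "expires ", "hourly."]
    for kind, payload in fan_out(stream):
        print(kind, repr(payload))
```

In this sketch the first sentence is ready for synthesis while the model is still generating the second, which is where the perceived-latency saving comes from. A real boundary detector would need to handle abbreviations and code snippets more carefully.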

We also built a project context engine that maintains a compressed representation of the developer's codebase — file structure, function signatures, recent changes, and coding conventions. This context is injected into every prompt, which is what allows Codelikha to give answers specific to your project rather than generic boilerplate.
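To make the context-injection step concrete, here is a small Python sketch. The `ProjectContext` class, the `build_prompt` function, the section titles, and the `max_items` cap are hypothetical names chosen for illustration; the post does not describe the real data model or how compression is done. The sketch simply shows the four kinds of context listed above being prepended to the spoken request.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProjectContext:
    """Hypothetical compact summary of a codebase."""
    file_tree: List[str] = field(default_factory=list)
    signatures: List[str] = field(default_factory=list)
    recent_changes: List[str] = field(default_factory=list)
    conventions: List[str] = field(default_factory=list)


def build_prompt(ctx: ProjectContext, transcript: str, max_items: int = 20) -> str:
    """Prepend project context to the transcript, skipping empty sections.

    max_items is a crude stand-in for compression: it caps each section's size.
    """
    sections = [
        ("Files", ctx.file_tree),
        ("Signatures", ctx.signatures),
        ("Recent changes", ctx.recent_changes),
        ("Conventions", ctx.conventions),
    ]
    lines: List[str] = []
    for title, items in sections:
        if not items:
            continue
        lines.append(f"## {title}")
        lines.extend(f"- {item}" for item in items[:max_items])
    lines.append("## Request")
    lines.append(transcript.strip())
    return "\n".join(lines)


if __name__ == "__main__":
    ctx = ProjectContext(
        file_tree=["src/app.py"],
        signatures=["def load(path: str) -> dict"],
    )
    print(build_prompt(ctx, "add caching to load"))
```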

Choosing a voice AI platform

We evaluated several voice synthesis platforms. Our criteria were strict:

Naturalness. When Codelikha narrates code back to a developer — explaining what a generated function does, flagging a potential bug, asking a clarifying question — it needs to sound like a knowledgeable colleague, not a text-to-speech demo. This was non-negotiable.

Streaming latency. Sub-400ms streaming response is required for a conversational interface. Any platform that could not reliably hit that threshold was disqualified.
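One way to turn "reliably" into a pass/fail check is to compare a high percentile of measured latencies against the threshold. The Python sketch below assumes the 95th percentile (nearest-rank method) is the bar; that choice, the function names, and the sample numbers are illustrative assumptions, not figures from our evaluation.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of latency samples."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_threshold(
    samples_ms: Sequence[float],
    threshold_ms: float = 400.0,
    pct: float = 95.0,
) -> bool:
    """True if the chosen percentile of the samples is under the threshold."""
    return percentile(samples_ms, pct) < threshold_ms


if __name__ == "__main__":
    # Made-up measurements: a good median with one slow outlier.
    samples = [310, 330, 350, 360, 370, 380, 385, 390, 395, 620]
    print(percentile(samples, 50), meets_threshold(samples))
```

Checking a tail percentile rather than the average matters for voice: a platform with a fast median but occasional slow responses still produces the awkward silences users notice.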

Developer experience. Clean APIs, good documentation, and reliable uptime. We are building a product that developers depend on — we need infrastructure that matches that standard.

We found a platform that met all three criteria and built our voice layer on top of it. The result is a voice output that feels genuinely conversational rather than synthetic.

What surprised us

Developers talk differently than they type. When people type a prompt, they compress it. When they speak, they expand it naturally: adding context, explaining constraints, referencing things mentioned earlier. That expanded, natural form is easier for the AI to work with, because it carries more context, clearer intent, and fewer ambiguities. In our experience, the spoken form of a request produces better output than the typed form.

Code narration is as valuable as code generation. We built code narration as a secondary feature. It turned out to be one of the most-used features in beta — developers were using it to understand existing code in their codebase, not just code Codelikha had generated. We now treat it as a primary feature alongside generation.

Silence is hard to design. Users do not know when to stop speaking. We went through many iterations of end-of-speech detection before finding an approach that felt natural. Too aggressive and it cuts people off mid-sentence. Too conservative and there is an awkward wait after they finish.
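The trade-off can be seen in even the simplest detector: a fixed silence timeout. The Python sketch below is an illustration of that baseline, not the approach we settled on. It assumes audio arrives as fixed-length frames already labeled voiced or unvoiced by some voice-activity detector; the function name and all default values are hypothetical.

```python
from typing import Iterable, Optional


def end_of_speech_index(
    voiced_frames: Iterable[bool],
    frame_ms: int = 20,
    silence_ms: int = 700,
    min_speech_ms: int = 200,
) -> Optional[int]:
    """Return the index of the frame where the utterance is judged finished,
    or None if the speaker has not finished yet."""
    speech_ms = 0
    trailing_silence_ms = 0
    for i, voiced in enumerate(voiced_frames):
        if voiced:
            speech_ms += frame_ms
            trailing_silence_ms = 0  # any speech resets the silence timer
        elif speech_ms >= min_speech_ms:
            # Only count silence once enough speech has been heard.
            trailing_silence_ms += frame_ms
            if trailing_silence_ms >= silence_ms:
                return i
    return None


if __name__ == "__main__":
    # 300 ms of speech, a 200 ms pause, 100 ms more speech, then silence.
    frames = [True] * 15 + [False] * 10 + [True] * 5 + [False] * 40
    print(end_of_speech_index(frames, silence_ms=700))
    print(end_of_speech_index(frames, silence_ms=150))
```

With the longer timeout the detector waits through the mid-sentence pause and fires only after the final silence; with the shorter one it fires during the pause and cuts the speaker off. Tuning that single number is the "too aggressive versus too conservative" problem described above.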

What is next

We are working on three things in parallel: deeper IDE integration (VS Code first, then JetBrains), a project memory system that persists context across sessions, and multilingual support for developers who prefer to speak in Hindi, Tamil, or Bengali rather than English.

The multilingual piece is particularly exciting for the Indian developer market. We are not just translating the words — we are tuning tone, cadence, and warmth to feel appropriate in each regional context.

Codelikha is still early, but the foundation is solid. The voice engine works, the latency is there, and developers who try it consistently say the same thing: it is faster than they expected, and stranger than they expected — in a good way.

Building voice-first is hard. But it is the right interface for the next decade of developer tooling.
