
How Voice Assistants Work, From Your Words to an Answer

Ask a question out loud and a voice assistant answers in seconds. Here is what actually happens in between, in plain words, and what it means for privacy.

Priya Nadar, AI & Internet Writer

March 18, 2026 · 7 min read

A smart speaker on a kitchen counter with its light ring glowing — Photograph via Unsplash

A voice assistant runs a fixed pipeline every time you talk to it: a local wake-word detector unlocks the microphone, automatic speech recognition (ASR) turns sound into text, natural language understanding (NLU) extracts your intent, a fulfillment step does the work, and text-to-speech (TTS) reads the answer back. Most of that chain finishes in 300 milliseconds to about a second, and knowing which parts run on your device versus in a data center is the key to understanding both the speed and the privacy trade-offs.

What "always listening" actually means

The device does process sound continuously, but only through a tiny wake-word detector that recognizes one pattern, such as "Alexa," "Hey Siri," or "Hey Google." This detector is a small neural network running on a low-power chip, and it never uploads anything.

The rolling buffer

Smart speakers keep a short rolling audio buffer in local memory, typically one to two seconds, that is constantly overwritten. The wake-word model checks this buffer against its target phrase and discards everything else. On an iPhone, the first-pass detector runs on the Always-On Processor, a low-power coprocessor separate from the main CPU, precisely so the phone can listen without draining the battery or waking the expensive silicon. Only when the local model is confident it heard the wake word does the device open the microphone stream and start capturing your actual request.
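The rolling buffer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 16 kHz sample rate, the 0.8 threshold, and the `score_wake_word` function are hypothetical stand-ins for the on-device model, not any vendor's actual code.

```python
from collections import deque

SAMPLE_RATE = 16_000   # assumed samples per second
BUFFER_SECONDS = 2     # the article cites one to two seconds


class RollingBuffer:
    """Fixed-size audio buffer: new samples push the oldest ones out."""

    def __init__(self, seconds=BUFFER_SECONDS, rate=SAMPLE_RATE):
        self.samples = deque(maxlen=seconds * rate)

    def push(self, chunk):
        # A deque with maxlen drops the oldest samples once it is full.
        self.samples.extend(chunk)

    def snapshot(self):
        return list(self.samples)


def listen(chunks, score_wake_word, threshold=0.8):
    """Return the index of the chunk at which the wake word fired, or None.

    score_wake_word is a hypothetical stand-in for the on-device model:
    it takes a list of samples and returns a confidence between 0 and 1.
    """
    buffer = RollingBuffer()
    for i, chunk in enumerate(chunks):
        buffer.push(chunk)
        if score_wake_word(buffer.snapshot()) >= threshold:
            return i  # only now would the device start capturing the request
    return None
```

Nothing in this loop stores or sends audio. Older samples simply fall out of the buffer as new ones arrive, which is the behavior the paragraph above describes.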

Why false wakes happen

Because the detector is deliberately tuned to be sensitive, it sometimes fires on similar-sounding speech: a "Seriously?" mistaken for "Siri," or a TV character named Alexa. That is the trade-off: set the threshold too strict and the assistant ignores you across the room; set it too loose and it perks up at the wrong moment. This is also why a second verification pass often runs before your words are acted on, and why reviewing your history occasionally is worthwhile.
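To make the threshold trade-off concrete, here is a small sketch that counts both kinds of error at a given threshold. The confidence scores are made up for illustration, and `count_errors` is a hypothetical helper, not part of any assistant's software.

```python
def count_errors(scored_clips, threshold):
    """Count mistakes the detector would make at a given threshold.

    scored_clips: list of (score, is_real_wake_word) pairs.
    Returns (false_wakes, missed_wakes).
    """
    false_wakes = sum(1 for score, real in scored_clips
                      if score >= threshold and not real)
    missed_wakes = sum(1 for score, real in scored_clips
                       if score < threshold and real)
    return false_wakes, missed_wakes


# Made-up scores, for illustration only.
clips = [
    (0.95, True),   # wake word spoken close to the device
    (0.62, True),   # wake word spoken from across the room
    (0.70, False),  # similar-sounding word from the TV
    (0.40, False),  # unrelated chatter
]

loose = count_errors(clips, 0.5)   # (1, 0): one false wake, nothing missed
strict = count_errors(clips, 0.8)  # (0, 1): no false wakes, one request missed
```

With these numbers, no single threshold avoids both errors, which is one reason a second, stricter check after the first pass is useful.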

The five-stage pipeline

Once the assistant is awake, a distinct sequence runs, and each stage is a different technology solving a different problem.

1. Speech recognition (ASR)

ASR converts the captured audio into text. Modern systems use end-to-end neural models that map sound directly to characters (Google, for example, uses a Recurrent Neural Network Transducer), replacing the older, clumsier statistical models from the 2010s. Before this stage, the speaker's microphone array cleans the signal: beamforming focuses on the direction your voice came from, and acoustic echo cancellation subtracts the music the speaker itself is playing so it can still hear "stop." The original Amazon Echo used a seven-microphone array specifically to do this in a noisy kitchen.
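The simplest textbook form of beamforming is delay-and-sum: shift each microphone's signal so that sound from one direction lines up, then average. The sketch below assumes the per-microphone delays are already known and are whole numbers of samples. Real devices estimate them and use more sophisticated filters, so treat this as an illustration of the idea rather than how any particular speaker does it.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its delay and average the result.

    channels: list of equal-length sample lists, one per microphone
    delays:   arrival delay of the target sound at each microphone,
              in whole samples (assumed known here)
    """
    length = len(channels[0])
    output = []
    for n in range(length):
        total = 0.0
        for samples, delay in zip(channels, delays):
            idx = n + delay
            if 0 <= idx < length:  # samples shifted past the edge count as zero
                total += samples[idx]
        output.append(total / len(channels))
    return output


# The same pulse reaches microphone 1 one sample later than microphone 0.
mic0 = [0, 1, 0, 0, 0]
mic1 = [0, 0, 1, 0, 0]

aligned = delay_and_sum([mic0, mic1], delays=[0, 1])    # pulse adds up to 1.0
unaligned = delay_and_sum([mic0, mic1], delays=[0, 0])  # pulse smears to 0.5, 0.5
```

Sound arriving from the chosen direction reinforces itself, while sound from other directions stays misaligned and is averaged down.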

2. Endpointing

The system has to decide when you have finished talking. Endpointing detects the pause that marks the end of your request, which is harder than it sounds, since a short pause mid-sentence ("set a timer for... ten minutes") should not cut you off. Aggressive endpointing is the usual culprit when an assistant answers before you are done.
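A minimal sketch of pause-based endpointing, assuming the audio has already been reduced to one loudness value per short frame. The threshold and frame counts are illustrative values, and production systems typically rely on trained models rather than a fixed silence counter.

```python
def find_endpoint(frame_energies, silence_threshold=0.1, max_silent_frames=5):
    """Return the index of the frame where the request is judged finished,
    or None if the speaker never paused long enough.

    frame_energies: one loudness value per short audio frame (assumed input)
    """
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            heard_speech = True
            silent_run = 0          # any speech resets the pause counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return i
    return None


# "set a timer for" ... short pause ... "ten minutes" ... long silence
energies = [0.5, 0.6, 0.4, 0, 0, 0, 0.5, 0.5, 0, 0, 0, 0, 0, 0]

patient = find_endpoint(energies, max_silent_frames=5)     # 12: waits for the real end
aggressive = find_endpoint(energies, max_silent_frames=2)  # 4: cuts in mid-sentence
```

The only difference between the two calls is how long a pause is tolerated, which is exactly the setting that determines whether the assistant interrupts you.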

3. Natural language understanding (NLU)

Recognizing the words is not the same as understanding them. NLU does two jobs at once: intent classification (you want to create a timer) and slot filling, also called entity extraction (duration = 10 minutes, starting now). "Set a timer for ten minutes," "ten-minute timer," and "wake me in ten" all map to the same intent with the same slot, which is why phrasing flexibly usually works.
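The input and output of this stage can be shown with a toy example. Real NLU uses trained models, not hand-written patterns; the regular expressions, the `create_timer` intent name, and the `duration_minutes` slot name below are all assumptions chosen to mirror the timer example.

```python
import re

# Tiny illustrative vocabulary; a real system handles far more.
NUMBER_WORDS = {"one": 1, "two": 2, "five": 5, "ten": 10, "twenty": 20}

TIMER_PATTERNS = [
    re.compile(r"set a timer for (?P<n>\w+) minutes?"),
    re.compile(r"(?P<n>\w+)[- ]minute timer"),
    re.compile(r"wake me in (?P<n>\w+)"),
]


def parse_number(token):
    """Turn '10' or 'ten' into 10; return None for anything unrecognized."""
    if token.isdigit():
        return int(token)
    return NUMBER_WORDS.get(token)


def understand(utterance):
    """Map a transcribed sentence to an intent plus its filled slots."""
    text = utterance.lower().strip()
    for pattern in TIMER_PATTERNS:
        match = pattern.search(text)
        if match:
            minutes = parse_number(match.group("n"))
            if minutes is not None:
                return {"intent": "create_timer",
                        "slots": {"duration_minutes": minutes}}
    return {"intent": "unknown", "slots": {}}
```

All three phrasings from the paragraph above produce the same result, `{"intent": "create_timer", "slots": {"duration_minutes": 10}}`, while an unrelated sentence falls through to `unknown`.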

4. Fulfillment

Now the assistant acts. A timer is handled locally. A weather request calls a forecast API. "Play the new Kendrick album" sends a command to a linked music service. Third-party capabilities, such as Alexa Skills or Google Actions, are just registered handlers that the assistant routes matching intents to.
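The "registered handlers" idea is essentially a lookup table from intent names to functions. Here is a hedged sketch: the `handles` decorator, the intent names, and the canned replies are hypothetical, and this is not the Alexa Skills or Google Actions API.

```python
HANDLERS = {}


def handles(intent):
    """Decorator that registers a function as the handler for an intent."""
    def register(func):
        HANDLERS[intent] = func
        return func
    return register


@handles("create_timer")
def create_timer(slots):
    return f"Timer set for {slots['duration_minutes']} minutes."


@handles("get_weather")
def get_weather(slots):
    # A real handler would call a forecast API here; this stub fakes a reply.
    return f"Here is the forecast for {slots['city']}."


def fulfill(intent, slots):
    """Route an intent to its registered handler, if there is one."""
    handler = HANDLERS.get(intent)
    if handler is None:
        return "Sorry, I can't help with that yet."
    return handler(slots)
```

Adding a new capability means registering one more handler. The routing code in `fulfill` does not change.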

5. Text-to-speech (TTS)

Finally the answer, written as text, is spoken aloud. Neural TTS, such as the WaveNet models DeepMind built for Google, generates audio waveforms directly, which is why current assistants sound far less robotic than the concatenated-clip voices of a decade ago.

Where your words actually go

Here is the split most people get wrong. The wake word is detected on-device, but for years the request itself, everything after the wake word, was uploaded to servers to be transcribed and understood, because open-ended speech recognition needs more compute than a cheap speaker has.

That is shifting. Since iOS 15, an iPhone with an A12 Bionic chip or newer runs ASR on-device using the Neural Engine for many requests, so "set an alarm" or "call Mom" never leaves the phone. Google shipped an on-device Assistant with the Pixel 4 and expanded it on Tensor chips. Amazon added the AZ1 Neural Edge chip to newer Echo devices for some on-device processing, but Alexa still leans heavily on the cloud.

The practical consequences are concrete. Local tasks (timers, alarms, device control) often work with the internet down; anything that looks something up (weather, general questions, streaming) stalls when the network drops. So when your speaker goes silent, the fault is usually the connection, not the gadget.
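That offline behavior follows from a simple routing rule, sketched below. The set of local intents is an assumption for illustration; which tasks actually run on-device varies by assistant and hardware.

```python
# Assumed split for illustration; the real list depends on device and platform.
LOCAL_INTENTS = {"create_timer", "set_alarm", "device_control"}


def route(intent, online):
    """Decide where a request can be handled given the connection state."""
    if intent in LOCAL_INTENTS:
        return "on-device"      # works whether or not the network is up
    if online:
        return "cloud"
    return "unavailable"        # needs a lookup, but there is no connection


timer_offline = route("create_timer", online=False)   # "on-device"
weather_offline = route("get_weather", online=False)  # "unavailable"
weather_online = route("get_weather", online=True)    # "cloud"
```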

Locking down what gets stored

Because requests are frequently stored to improve the service, each platform gives you real controls. They are buried, but they work.

Amazon Alexa

Open the Alexa app, go to More > Settings > Alexa Privacy > Review Voice History to listen to and delete recordings. Under Manage Your Alexa Data, set recordings to auto-delete after three months, or choose "Don't save recordings" entirely. You can also enable deletion by voice and then say "Alexa, delete what I just said" or "Alexa, delete everything I said today."

Apple Siri

Siri uses a random device identifier, not your Apple ID, and does not store audio by default. To review the setting, go to Settings > Privacy & Security > Analytics & Improvements > Improve Siri & Dictation and turn it off to stop human grading. To wipe transcripts, use Settings > Siri & Search > Siri & Dictation History > Delete Siri & Dictation History.

Google Assistant

Recordings live under Web & App Activity. Visit myactivity.google.com, or say "Hey Google, delete what I said this week." You can set activity to auto-delete after 3, 18, or 36 months, and turn off "Include voice and audio recordings" so future requests are transcribed but not kept as audio.

The mute button is hardware

On an Amazon Echo, the mute button electrically disconnects the microphones, which is why the ring turns solid red and no software can override it. That is genuinely different from a software mute, and it is the right tool for a private conversation.

Common mistakes people make

Assuming the device streams every conversation. It processes sound locally for one pattern and uploads only after the wake word.
Leaving "improve the service" toggles on without knowing it can mean humans review anonymized samples.
Believing on-device processing means nothing is ever stored. Fulfillment and history logging can still involve the cloud even when transcription is local.
Ignoring the recording history entirely, which is the only reliable way to catch false wakes and see what was actually captured.

FAQ

Does my smart speaker record everything I say?

No. A local wake-word detector scans a short rolling buffer and discards it continuously. Only after it hears the wake word does the device capture and, on many systems, upload what you say next.

Why does the assistant sometimes answer when no one used the wake word?

That is a false activation. The detector is tuned to be sensitive so it catches you from across the room, and similar-sounding speech occasionally trips it. You can find and delete these in the voice history for Alexa, Siri, or Google Assistant.

Can I use a voice assistant without an internet connection?

Partly. On-device tasks like timers, alarms, and local device control often work offline, especially on newer phones and speakers with dedicated neural chips. Anything requiring a lookup (weather, web answers, or streaming) needs a connection.

Is on-device processing more private than the cloud?

Generally yes for the transcription step, since the audio never leaves the device. But fulfilling a request and logging history can still touch company servers, so on-device ASR reduces exposure rather than eliminating it. Check each assistant's data settings to control what is retained.


