Operations, Creative

Voice, TTS, and multimodal input

Use voice notes, read-aloud summaries, and richer-than-text input where your stack supports it—without losing guardrails.

What you build

Hands-busy and eyes-busy workflows:

Voice-to-task: dictate briefs, corrections, or approvals on the go.
TTS readback of summaries, diffs, or alerts when reading text is awkward.
Accessibility: same workflows for people who prefer speech.

Community stories often highlight not having to type, but the product goal is parity of control, not novelty.

Why CoWork OS is a strong fit

Channels (where supported) already meet users on mobile-first surfaces.
Structured outputs still matter—voice is input; artifacts remain text or files for audit.
Approvals can require explicit spoken or typed confirmation for sensitive actions—configure to policy.
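As a rough illustration of "configure to policy", the sketch below maps action names to the kind of confirmation they require. The action names, the `Confirm` enum, and the `confirmation_ok` helper are all hypothetical and are not part of CoWork OS; they only show one way such a rule could be expressed.

```python
from enum import Enum


class Confirm(Enum):
    NONE = "none"      # no confirmation needed
    SPOKEN = "spoken"  # spoken or typed confirmation accepted
    TYPED = "typed"    # typed confirmation only


# Hypothetical action names and policy; replace with your own rules.
POLICY = {
    "read_summary": Confirm.NONE,
    "create_task": Confirm.SPOKEN,
    "approve_payment": Confirm.TYPED,
}


def confirmation_ok(action: str, channel: str, confirmed: bool) -> bool:
    """Return True if the action may proceed given how it was confirmed.

    channel is "voice" or "text". Unknown actions default to the
    strictest rule (typed confirmation).
    """
    required = POLICY.get(action, Confirm.TYPED)
    if required is Confirm.NONE:
        return True
    if not confirmed:
        return False
    if required is Confirm.TYPED:
        return channel == "text"
    return True  # SPOKEN: either channel is acceptable
```

Defaulting unknown actions to typed confirmation keeps a newly added action from being approvable by voice by accident.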

How to use

Define what may be spoken vs what must be typed (financial, legal).
Normalize STT output: trim filler, confirm numbers and names.
Log redacted transcripts if retention policy allows—never raw secrets.
Test TTS for clarity on technical terms (repos, hashes, IDs).
Fall back to text when STT confidence is low.
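The normalization and fallback items above could look something like the following minimal sketch. It assumes the STT provider returns a transcript string and a confidence score between 0 and 1; the filler list, the `CONFIDENCE_THRESHOLD` value, and the function names are illustrative assumptions, not a prescribed implementation.

```python
import re

# Assumed filler words; extend for your locale and users.
FILLERS = re.compile(r"\b(?:um+|uh+|er+|you know)\b[,.]?\s*", re.IGNORECASE)
CONFIDENCE_THRESHOLD = 0.85  # assumed value; tune against your own data


def normalize(transcript: str) -> str:
    """Strip filler words and collapse repeated whitespace."""
    cleaned = FILLERS.sub("", transcript)
    return re.sub(r"\s+", " ", cleaned).strip()


def handle_utterance(transcript: str, confidence: float) -> dict:
    """Decide whether to continue by voice or fall back to typed input."""
    if confidence < CONFIDENCE_THRESHOLD:
        return {"action": "fallback_to_text", "text": None}
    text = normalize(transcript)
    # Surface numbers so the caller can read them back for confirmation.
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    return {"action": "confirm", "text": text, "confirm_numbers": numbers}
```

The result always asks for confirmation rather than acting directly, which matches the guidance to confirm numbers and names before anything is executed.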

Prerequisites

Provider or local stack for STT/TTS compatible with your privacy bar.
Locale and accent coverage expectations—document limitations.
Consent if voice is stored or reviewed for quality.

Steps

Ship read-only TTS summaries first.
Add voice capture for short commands with confirmation.
Measure error rate on domain vocabulary; maintain a custom lexicon if needed.
Review a sample of transcripts weekly.
Expand only where accuracy holds.
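One way to measure error rate on domain vocabulary is word error rate (WER): the word-level edit distance between a reference transcript and the STT output, divided by the reference length. The sketch below is a plain-Python version under that definition; `missed_terms` is a hypothetical helper that lists lexicon terms the recognizer dropped, and it assumes the lexicon holds lowercase single-word terms.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must not be empty")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def missed_terms(reference: str, hypothesis: str, lexicon: set) -> list:
    """Lexicon terms present in the reference but absent from the STT output."""
    ref = set(reference.lower().split())
    hyp = set(hypothesis.lower().split())
    return sorted(t for t in lexicon if t in ref and t not in hyp)
```

Terms that show up repeatedly in `missed_terms` across the weekly transcript sample are natural candidates for the custom lexicon.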

Suggested prompts

“Repeat back entities you understood: names, amounts, dates.”
“Refuse to act if confidence on this transcript is below threshold—ask one clarifying question.”
“Give a 30-second TTS-friendly summary.”
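If a model is prompted for a 30-second summary, it can help to check the result before sending it to TTS. The sketch below estimates speaking time from word count and keeps only whole sentences that fit. The default of 150 words per minute is an assumed speaking rate, and `trim_to_spoken_budget` is a hypothetical helper; measure your actual TTS voice before relying on the number.

```python
import re


def trim_to_spoken_budget(summary: str, seconds: float = 30.0,
                          words_per_minute: int = 150) -> str:
    """Keep whole sentences until the estimated speaking time is used up."""
    budget = int(seconds * words_per_minute / 60)  # word budget
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())
    kept, used = [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if used + n > budget:
            break  # stop at the first sentence that would overrun
        kept.append(sentence)
        used += n
    return " ".join(kept)
```

Cutting at sentence boundaries avoids a readback that stops mid-thought, at the cost of sometimes using less than the full budget.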

Launch readiness

Mis-hearing path tested for top ten commands.
Mute and stop behaviors are obvious in the UI or channel.
Data handling matches privacy policy and regional law.

Common pitfalls

Over-trusting STT for passwords or one-time codes.
Leaking private speech to the wrong channel.
TTS reading secrets aloud in shared spaces.
Accessibility theater—voice without real control parity.
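To reduce the risk of secrets being spoken or logged, text can pass through a redaction step before it reaches TTS or a transcript log. The sketch below only catches secrets that are explicitly labelled in the text (for example "password is ..."); the label list and the `redact` name are illustrative assumptions, and real deployments would need patterns for their own credential formats.

```python
import re

# Illustrative pattern: a label, a separator ("is", ":" or "="), then the value.
LABELLED_SECRET = re.compile(
    r"\b(password|passcode|token|otp)(\s+is\s+|\s*[:=]\s*)\S+",
    re.IGNORECASE,
)


def redact(text: str, placeholder: str = "[redacted]") -> str:
    """Replace the value of labelled secrets, keeping the label readable."""
    return LABELLED_SECRET.sub(
        lambda m: m.group(1) + m.group(2) + placeholder, text
    )
```

Running the same function on both the TTS input and the stored transcript keeps the two paths consistent, so a secret that is silenced in audio does not survive in the log.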
