Ambient voice documentation is the most consistently loved AI feature we've shipped in healthcare. Clinicians, in our experience, ask for two things: keep me out of the EHR, and don't make me think harder. Voice-to-note delivers on both when it's done right.
It's also the easiest place we've seen teams accidentally ship a HIPAA violation. The audio stream is PHI. The transcript is PHI. The intermediate cache is PHI. Every hop is a place where the architecture can leak.
After building voice pipelines for four different clinical settings, here are the architectural patterns we've converged on — and the specific places teams most often get it wrong.
Why voice pipelines are different
Most ML pipelines you can prototype locally and worry about PHI later. Voice pipelines don't give you that option. The first second of audio captured in a clinical encounter is PHI. There's no "de-identification later"; there's only "never let it leave the BAA-covered environment in the first place."
Three things make this harder than a typical inference pipeline:
- Audio is large. Transcripts are tiny. Most of your storage and bandwidth cost is in audio.
- Transcription accuracy matters at single-word level. "50 mg" vs "15 mg" is the difference between a correct prescription and a clinical error.
- Latency matters. Clinicians won't tolerate a 30-second delay to see what was just said. Either it streams or they stop using it.
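To put rough numbers on the audio-versus-transcript gap, here is a back-of-the-envelope sketch. The capture format (16 kHz, 16-bit, mono, uncompressed PCM), the speech rate, and the bytes-per-word figure are all assumptions chosen for illustration, not measurements from our deployments.

```python
# Assumed capture format: 16 kHz, 16-bit, mono, uncompressed PCM.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
CHANNELS = 1

# Assumed speech rate and average encoded word length (including the space).
WORDS_PER_MINUTE = 150
BYTES_PER_WORD = 6


def audio_bytes(minutes: float) -> int:
    """Size of raw PCM audio for a recording of the given length."""
    return int(minutes * 60 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def transcript_bytes(minutes: float) -> int:
    """Rough size of the plain-text transcript for the same recording."""
    return int(minutes * WORDS_PER_MINUTE * BYTES_PER_WORD)


visit_minutes = 15
audio = audio_bytes(visit_minutes)            # 28_800_000 bytes, about 28.8 MB
transcript = transcript_bytes(visit_minutes)  # 13_500 bytes, about 13.5 KB
print(f"audio: {audio / 1e6:.1f} MB, transcript: {transcript / 1e3:.1f} KB")
print(f"audio is roughly {audio // transcript}x larger")
```

Under these assumptions a 15-minute visit produces audio about 2,000 times larger than its transcript. Compressed codecs narrow the gap, but audio still dominates storage and bandwidth.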
The three architecture patterns
There are three common architectures. Each one trades latency, cost, and PHI surface area differently.
Pattern 1: All-cloud, streaming. Audio is streamed from the clinician's device to a transcription service (Amazon Transcribe Medical, Azure Speech, Google Speech-to-Text; all BAA-eligible). The transcript streams back. Post-processing (structured note generation, EHR formatting) runs in the cloud. Pros: lowest engineering cost, highest accuracy. Cons: audio crosses the network in real time, so it is bandwidth-sensitive in clinical settings with weak Wi-Fi.
Pattern 2: Edge transcription, cloud post-processing. Audio is transcribed on-device (phone, tablet, or dedicated hardware) using an on-device speech model. The transcript — not the audio — is sent to the cloud for structuring. Pros: audio never leaves the room; lower bandwidth; works offline. Cons: on-device transcription is harder to keep current with the state of the art; per-device deployment burden.
Pattern 3: Hybrid. Audio is captured on-device, batched (or streamed under controlled conditions), and processed in a customer-controlled cloud (VPC or dedicated tenant). Useful when the customer's compliance team doesn't trust shared transcription services. Pros: maximum customer control. Cons: highest engineering cost, because you become the transcription service.
We've shipped all three. Pattern 1 is the right default for ~80% of use cases. Pattern 2 makes sense when bandwidth is unreliable or for SaMD-class (Software as a Medical Device) deployments. Pattern 3 is rare, and is usually a contractual requirement rather than a technical one.
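The rule of thumb above can be written down as a small decision function. This is only a sketch: the `Deployment` fields are hypothetical inputs, and checking the contractual requirement first is our assumption about precedence. Real selection involves more judgment than three booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Pattern(Enum):
    ALL_CLOUD_STREAMING = 1
    EDGE_TRANSCRIPTION = 2
    HYBRID_CUSTOMER_CLOUD = 3


@dataclass(frozen=True)
class Deployment:
    # Hypothetical inputs gathered during discovery with the customer.
    unreliable_bandwidth: bool = False
    samd_class: bool = False
    contract_requires_customer_cloud: bool = False


def choose_pattern(d: Deployment) -> Pattern:
    """Pick a starting architecture from the heuristics in the text."""
    # Assumption: a contractual requirement overrides technical preferences.
    if d.contract_requires_customer_cloud:
        return Pattern.HYBRID_CUSTOMER_CLOUD
    if d.unreliable_bandwidth or d.samd_class:
        return Pattern.EDGE_TRANSCRIPTION
    # Default for most use cases.
    return Pattern.ALL_CLOUD_STREAMING


print(choose_pattern(Deployment()).name)
print(choose_pattern(Deployment(unreliable_bandwidth=True)).name)
```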
Where PHI actually leaks
Five places we've seen teams accidentally expose PHI:
- Application logs. Default logging captures request bodies. If the transcript is in the request, it's now in your log aggregator. Most log aggregators are not BAA-covered.
- Error tracking. Sentry, Bugsnag, Rollbar — if a transcription fails and the audio buffer ends up in the exception, your error tracker is now holding PHI. Sign a BAA or redact at source.
- Browser/client storage. Caching transcripts to localStorage or IndexedDB for offline support is a HIPAA risk if the device isn't enterprise-managed.
- Analytics. Product analytics that captures input text — common in form analytics tools — will hoover up transcript content. Block at the field level.
- Backups. Even if your primary storage is BAA-covered, are the backups encrypted with customer-managed keys? Are they replicated to a region the BAA covers?
The pattern: PHI doesn't usually leak from the obvious places. It leaks from the supporting infrastructure that was designed for non-PHI workloads.
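For the application-log case, one way to redact at source is a filter attached to the log handler, so PHI-bearing fields are blanked before anything is shipped to an aggregator. The sketch below uses Python's standard `logging` module. The field names in `PHI_FIELDS`, the logger name, and the in-memory `ListHandler` (a stand-in for a real shipping handler) are assumptions for illustration.

```python
import logging

REDACTED = "[REDACTED-PHI]"
# Hypothetical names of structured log fields that may carry PHI.
PHI_FIELDS = ("transcript", "audio", "request_body")


class PhiRedactionFilter(logging.Filter):
    """Blank out PHI-bearing `extra` fields before the handler emits them."""

    def filter(self, record: logging.LogRecord) -> bool:
        for name in PHI_FIELDS:
            if hasattr(record, name):
                setattr(record, name, REDACTED)
        return True  # keep the record, minus the PHI


class ListHandler(logging.Handler):
    """In-memory handler standing in for a log shipper."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


handler = ListHandler()
handler.addFilter(PhiRedactionFilter())  # runs before emit() on this handler

logger = logging.getLogger("voice_pipeline")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info(
    "transcription complete",
    extra={"transcript": "patient takes 50 mg", "request_id": "r-123"},
)
record = handler.records[0]
print(record.transcript, record.request_id)  # [REDACTED-PHI] r-123
```

This only catches PHI passed as structured fields. Transcript text interpolated directly into the message string is not redacted, so the convention of never formatting PHI into log messages still has to hold.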
Key design decisions
Decisions we make explicitly on every voice deployment:
Retention. How long do you keep the audio? The transcript? The intermediate caches? Audio retention should usually be minimal (delete after the transcript is durable). Transcript retention follows EHR retention policies. Intermediate cache retention should be zero: keep caches in memory and never write them to disk.
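A retention policy that lives in code might look like the sketch below: audio is deleted as soon as the transcript is confirmed durable, and a periodic sweep acts as a backstop for anything that slipped through. `AudioStore` is a hypothetical in-memory stand-in for whatever BAA-covered storage you use, and the one-hour limit is an arbitrary example value.

```python
import time
from dataclasses import dataclass, field


@dataclass
class AudioStore:
    """Hypothetical in-memory stand-in for the BAA-covered audio store."""
    blobs: dict = field(default_factory=dict)  # visit_id -> (audio, stored_at)

    def put(self, visit_id, audio, now=None):
        self.blobs[visit_id] = (audio, time.time() if now is None else now)

    def delete(self, visit_id):
        self.blobs.pop(visit_id, None)


def finalize_visit(store, visit_id, transcript_is_durable):
    """Delete audio as soon as the transcript is confirmed durable."""
    if transcript_is_durable:
        store.delete(visit_id)


def purge_expired(store, max_age_seconds, now=None):
    """Backstop sweep: remove any audio older than the policy allows."""
    now = time.time() if now is None else now
    expired = [
        visit_id
        for visit_id, (_, stored_at) in store.blobs.items()
        if now - stored_at > max_age_seconds
    ]
    for visit_id in expired:
        store.delete(visit_id)
    return expired


store = AudioStore()
store.put("visit-a", b"\x00" * 4, now=0)
store.put("visit-b", b"\x00" * 4, now=0)

finalize_visit(store, "visit-a", transcript_is_durable=True)  # normal path
stale = purge_expired(store, max_age_seconds=3600, now=7200)  # backstop
print(stale, store.blobs)  # ['visit-b'] {}
```

The sweep matters because the normal path depends on a success signal. If transcription fails or the process crashes, nothing calls `finalize_visit`, and the audio would otherwise sit in storage indefinitely.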
Speaker diarization. Do you need to separate clinician voice from patient voice? In some settings, you do (for accurate note attribution). In others, you don't. Diarization adds latency and cost — only enable it when you actually need it.
Wake-word vs. continuous. Wake-word activated ("Hey [tool name]…") captures less audio overall but misses spontaneous content. Continuous captures everything but raises the PHI exposure surface dramatically. We lean toward visit-bounded (clinician explicitly starts and stops recording) as a middle ground.
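Visit-bounded capture can be enforced at the lowest level of the client: frames that arrive outside an explicit start/stop window are dropped rather than buffered. The class below is a minimal sketch with hypothetical names. A real client would stream frames onward instead of joining them in memory.

```python
class VisitBoundedRecorder:
    """Accept audio frames only between an explicit start() and stop()."""

    def __init__(self):
        self._active = False
        self._frames = []

    def start(self):
        """Clinician explicitly begins recording for a visit."""
        self._active = True
        self._frames = []

    def on_frame(self, frame: bytes) -> bool:
        """Called for every microphone frame; returns True if it was kept."""
        if not self._active:
            return False  # outside a visit: drop immediately, never buffer
        self._frames.append(frame)
        return True

    def stop(self) -> bytes:
        """Clinician explicitly ends recording; hands back the visit audio."""
        self._active = False
        audio = b"".join(self._frames)
        self._frames = []  # nothing lingers in memory after the visit
        return audio


rec = VisitBoundedRecorder()
kept_before = rec.on_frame(b"aa")  # False: recording has not started
rec.start()
rec.on_frame(b"bb")
rec.on_frame(b"cc")
visit_audio = rec.stop()
kept_after = rec.on_frame(b"dd")   # False: recording has stopped
print(kept_before, visit_audio, kept_after)  # False b'bbcc' False
```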
Structured note format. Are you outputting freeform text? SOAP-formatted notes? Discrete structured fields back into the EHR? The output format determines half the pipeline complexity downstream. Decide before you start building.
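As an illustration of how the output format shapes downstream code, here is a sketch of a SOAP note as a typed structure that can be rendered either as sectioned text or as discrete fields. The class and method names are hypothetical, the sample content is synthetic, and a real EHR integration would map these fields onto whatever schema the EHR expects.

```python
from dataclasses import asdict, dataclass


@dataclass
class SoapNote:
    subjective: str
    objective: str
    assessment: str
    plan: str

    def to_text(self) -> str:
        """Render as a sectioned freeform note."""
        return "\n\n".join(
            f"{name.upper()}:\n{body}" for name, body in asdict(self).items()
        )

    def to_fields(self) -> dict:
        """Render as discrete fields for structured write-back."""
        return asdict(self)


# Synthetic example content, not real patient data.
note = SoapNote(
    subjective="Reports intermittent headache for three days.",
    objective="BP 120/80. Afebrile.",
    assessment="Tension-type headache.",
    plan="Hydration, follow up in two weeks.",
)
print(note.to_text())
print(list(note.to_fields()))
```

Committing to a structure like this early means the note-generation step has a fixed target, and switching from freeform text to discrete fields later is a rendering change rather than a pipeline rewrite.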
BAA-eligibility, in practice
Every cloud service in the pipeline needs a BAA. This isn't optional. Common services that are BAA-eligible (with the right configuration):
- Amazon Transcribe Medical, Amazon Transcribe (general): HIPAA-eligible under the AWS BAA
- Azure Speech Service — eligible under Azure HIPAA
- Google Speech-to-Text — eligible under Google Cloud HIPAA
- OpenAI (via Azure OpenAI Service): eligible. The direct OpenAI API is NOT covered by default; it requires a BAA signed with OpenAI itself, so don't send clinical PHI to it without one.
- Anthropic Claude (via Amazon Bedrock or direct enterprise contract): eligible under specific terms
Things teams use that aren't typically BAA-eligible: most third-party analytics, most error tracking SaaS (without specific configuration), most generic logging SaaS. Verify before you ship.
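One way to make "verify before you ship" mechanical is to keep an inventory of every service in the pipeline and fail the release if any PHI-receiving service lacks a signed BAA. The sketch below is hypothetical throughout: the service names, the `receives_phi` flag (false only where redaction at source is actually enforced), and the inventory itself are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    receives_phi: bool  # False only if redaction at source is enforced
    baa_signed: bool


def services_missing_baa(services):
    """Names of services that receive PHI without a signed BAA."""
    return [s.name for s in services if s.receives_phi and not s.baa_signed]


# Hypothetical inventory for an example pipeline.
PIPELINE = [
    Service("transcription", receives_phi=True, baa_signed=True),
    Service("log-aggregator", receives_phi=True, baa_signed=False),
    Service("error-tracker", receives_phi=True, baa_signed=False),
    Service("status-page", receives_phi=False, baa_signed=False),
]

missing = services_missing_baa(PIPELINE)
print(missing)  # ['log-aggregator', 'error-tracker']
```

Running a check like this in CI turns the inventory into a gate: adding a new vendor to the pipeline forces someone to answer the BAA question before the change can merge.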
The deployment checklist
Before we ship a voice pipeline to clinicians, every item on this list has to be true:
- Every service in the pipeline has a signed BAA.
- Audio retention policy is documented and enforced by code, not by convention.
- Logs are configured to redact transcript content (or the log aggregator is covered by a signed BAA).
- Error tracking is configured to redact request/response bodies (or to report errors without sending stack traces and captured variables at all).
- Client-side caches are encrypted at rest, ideally with customer-controlled keys.
- Analytics is field-blocked on transcript content.
- Backups are inside the same compliance boundary as the primary data.
- End-to-end PHI flow diagram is signed off by the customer's privacy officer.
- Incident response plan covers audio retention failure modes.
Teams that ship clinical voice tools fast learn the hard way which of their supporting tools weren't BAA-covered. Teams that ship slow learn it during a security review.
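The error-tracking item on the checklist is usually implemented as a pre-send hook that scrubs each event before it leaves the process. Sentry's Python SDK, for example, accepts a `before_send(event, hint)` callback. The sketch below is vendor-neutral and assumption-laden: the key names in `SENSITIVE_KEYS` and the event shape are hypothetical, and the real event structure depends on your tracker.

```python
REDACTED = "[REDACTED-PHI]"
# Hypothetical key names; list every field that can carry audio or transcript.
SENSITIVE_KEYS = {"transcript", "audio", "audio_buffer", "body", "data"}


def scrub(value):
    """Recursively replace values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            key: REDACTED if key in SENSITIVE_KEYS else scrub(inner)
            for key, inner in value.items()
        }
    if isinstance(value, list):
        return [scrub(item) for item in value]
    if isinstance(value, (bytes, bytearray)):
        return REDACTED  # raw buffers are never forwarded
    return value


def before_send(event, hint=None):
    """Callback to run on each error event just before it is transmitted."""
    return scrub(event)


# Hypothetical event, loosely shaped like what an error tracker captures.
event = {
    "message": "transcription failed",
    "request": {"url": "/v1/transcribe", "data": "patient takes 50 mg"},
    "frames": [
        {"function": "transcribe",
         "vars": {"audio_buffer": b"\x00\x01", "retries": 2}},
    ],
}
clean = before_send(event)
print(clean["request"]["data"], clean["frames"][0]["vars"])
```

A deny-list like this is only as good as its key names, and it does not inspect free-text strings such as the exception message. Treat it as one layer, alongside keeping PHI out of exception messages in the first place.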
Closing
Voice documentation is one of the most rewarding things you can ship in clinical practice. Clinicians will love it. The downside risk is asymmetric — when it works, it's invisible. When it leaks, it's a breach notification.
Architect for PHI containment from the first prototype. Don't decide "we'll figure out the compliance side later." The pipeline you prototype is the pipeline you ship — and rebuilding the pipeline after the fact to add BAA-covered logging is more expensive than getting it right the first time.
