Technology · 2026-04-05 · 8 min read

Simple Call, Superpower Capabilities: How Audio Intelligence Works

You send an audio file. You get a verdict. But what happens in between?

The Vocos detection pipeline is designed to be invisible to developers: a single API call that returns a confidence score in under 200ms. But under the hood, it's running one of the most sophisticated audio analysis systems ever built for production use.

The architecture

Our detection engine combines deep forensic analysis with advanced reasoning in a unified pipeline:

Deep Forensic Detection

At the core of Vocos is a proprietary forensic model trained on more than 5 million audio samples spanning over 100 different TTS and voice synthesis techniques. This isn't a simple binary classifier - it's a forensic engine that has learned the subtle fingerprints left behind by every major (and minor) synthesis method in existence.

Forensics 0.3B analyzes audio across multiple resolution scales simultaneously, from micro-level spectral artifacts to macro-level temporal patterns. This multi-scale approach catches everything from the telltale phase discontinuities of concatenative synthesis to the unnaturally smooth formant transitions of neural vocoders.
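To make "multiple resolution scales" concrete, here is a minimal NumPy sketch that computes magnitude spectrograms at several window sizes. The window sizes and hop length are illustrative assumptions, not Vocos's actual parameters:

```python
import numpy as np

def multi_scale_spectrograms(signal, window_sizes=(256, 1024, 4096)):
    """Compute magnitude spectrograms at several FFT window sizes.

    Short windows resolve micro-level transients (e.g. phase glitches);
    long windows resolve macro-level temporal and formant structure.
    """
    scales = {}
    for n in window_sizes:
        hop = n // 4  # illustrative 75% overlap
        frames = [signal[i:i + n] * np.hanning(n)
                  for i in range(0, len(signal) - n + 1, hop)]
        spec = np.abs(np.fft.rfft(np.stack(frames), axis=1))
        scales[n] = spec  # shape: (num_frames, n // 2 + 1)
    return scales

# Example: 1 second of noise at 16 kHz
audio = np.random.randn(16000)
specs = multi_scale_spectrograms(audio)
for n, spec in specs.items():
    print(n, spec.shape)
```

A detector built on such a stack can look for artifacts that only show up at one scale, which is the intuition behind the multi-scale claim above.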

What makes our forensic detection uniquely powerful:

  • **Spectro-temporal graph analysis**: The model doesn't just look at the audio frame-by-frame. It builds a rich graph representation where spectral bins and temporal positions form interconnected nodes, capturing dependencies across both dimensions simultaneously.
  • **Artifact pattern matching**: Having been trained across 100+ synthesis techniques, the model recognizes the characteristic "signatures" each generation method leaves behind: artifacts invisible to human ears but statistically unmistakable.
  • **Cross-domain feature learning**: Our model learns representations from raw audio that capture the fundamental structure of human speech (phonemes, prosody, breath patterns), making it extremely difficult for any synthesis method to fool.
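To give the graph idea some shape, here is a toy sketch of one way a spectrogram could be turned into a spectro-temporal graph. The neighborhood scheme is an illustrative assumption, not the actual Vocos model's construction:

```python
import numpy as np

def spectrogram_graph(spec):
    """Treat each (frame, bin) cell as a node; connect each node to its
    temporal neighbor (same bin, next frame) and its spectral neighbor
    (same frame, next bin). Returns an edge list of node-index pairs."""
    t, f = spec.shape
    node = lambda i, j: i * f + j  # flatten (frame, bin) to a node id
    edges = []
    for i in range(t):
        for j in range(f):
            if i + 1 < t:  # temporal edge
                edges.append((node(i, j), node(i + 1, j)))
            if j + 1 < f:  # spectral edge
                edges.append((node(i, j), node(i, j + 1)))
    return edges

# A 4-frame, 3-bin spectrogram yields 3*3 temporal + 4*2 spectral edges
edges = spectrogram_graph(np.zeros((4, 3)))
print(len(edges))  # 17
```

A graph neural network operating over such a structure can propagate evidence along both axes at once, which is what "capturing dependencies across both dimensions simultaneously" suggests.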

Forensic Reasoning & Intelligence

Beyond raw detection, Vocos includes a forensic reasoning layer that analyzes mel-spectrograms and detection artifacts with state-of-the-art intelligence. This reasoning engine can:

  • **Interpret spectrograms visually**, identifying anomalous regions, phase inconsistencies, and spectral patterns that indicate manipulation
  • **Generate forensic explanations** in natural language that non-experts can understand, translating complex detection artifacts into clear, actionable insights
  • **Calibrate confidence scores** by reasoning about the full context of the audio, considering noise conditions, codec artifacts, and recording quality to produce meaningful probability estimates
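Confidence calibration can take many forms; one standard, generic technique is temperature scaling, sketched below. The function and temperature values are illustrative only, not Vocos internals:

```python
import math

def calibrate(logit, temperature=2.0):
    """Map a raw detector logit to a calibrated probability.

    A temperature T > 1 softens overconfident scores, which is one way
    to account for degraded inputs (noise, aggressive codecs)."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))

raw_logit = 4.0                   # a very confident raw score
uncal = calibrate(raw_logit, 1.0)  # plain sigmoid, ~0.98
soft = calibrate(raw_logit, 2.0)   # softened, ~0.88
print(uncal, soft)
```

The point is that the number a caller receives should behave like a probability, so a 0.9 on a noisy phone recording means the same thing as a 0.9 on studio audio.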

This combination of deep forensic detection and intelligent reasoning is what makes Vocos truly state-of-the-art. The forensic model finds the evidence; the reasoning layer understands and explains it.

Why one API call is enough

We obsessed over developer experience. The entire detection pipeline is engineered to feel instant: sub-200ms from request to verdict, so you can run it inline on every audio stream without your users ever noticing.

  • **Zero configuration**: Send audio in any common format. We handle normalization, segmentation, and analysis automatically.
  • **Built for real-time**: The pipeline is optimized end-to-end so detection never becomes a bottleneck in your voice flow.
  • **Long audio, no problem**: Files of any length are analyzed intelligently, producing a single aggregated verdict you can act on immediately.
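One plausible way long files can yield "a single aggregated verdict" is to score fixed-length segments and combine them. The chunking and aggregation rules below are illustrative assumptions, not Vocos internals:

```python
def aggregate_verdict(chunk_scores, threshold=0.5):
    """Combine per-segment synthetic-speech probabilities into one verdict.

    A max-based aggregate flags files where *any* segment looks synthetic
    (catching partially spoofed audio); the mean gives overall confidence.
    """
    peak = max(chunk_scores)
    mean = sum(chunk_scores) / len(chunk_scores)
    return {
        "is_synthetic": peak >= threshold,
        "confidence": peak,
        "mean_score": round(mean, 3),
    }

# A long file where only one segment is cloned speech still gets flagged
print(aggregate_verdict([0.02, 0.04, 0.91, 0.05]))
```

A mean-only aggregate would dilute a short spliced-in clone across many clean segments, which is why a peak-sensitive rule is a common design choice for this problem.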

The result: developers get a simple POST endpoint that accepts audio files and returns structured JSON. No ML expertise required. No GPU provisioning. No model management.

The accuracy question

On standard benchmarks (ASVspoof 2021, In-the-Wild), our model achieves an equal error rate below 1%. But benchmarks don't tell the whole story.
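For readers unfamiliar with the metric: equal error rate (EER) is the point where the false-accept and false-reject rates cross as the decision threshold sweeps, and lower is better. A small self-contained sketch on toy, invented scores:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Sweep thresholds over the scores and return the error rate at the
    point where false accepts and false rejects are (nearly) equal.
    scores: higher means "more likely synthetic"; labels: 1 = synthetic."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, int)
    best_gap, eer = float("inf"), 1.0
    for thr in np.unique(scores):
        far = np.mean(scores[labels == 0] >= thr)  # real flagged as fake
        frr = np.mean(scores[labels == 1] < thr)   # fake passed as real
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Perfectly separable toy scores give an EER of 0.0
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(equal_error_rate(scores, labels))  # 0.0
```

An EER below 1% therefore means both error modes can be held under 1% at the same operating point.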

Real-world audio comes with background noise, compression artifacts, codec distortions, and recording conditions that benchmarks don't capture. Our model is trained on a massively diverse dataset:

  • **5,000,000+ audio samples** across real and synthetic speech
  • **100+ synthesis techniques** including every major TTS engine, voice cloning tool, and vocoder
  • Various audio codecs and bitrates
  • Real-world noise conditions
  • Cross-lingual samples in 50+ languages

We continuously red-team our model against the latest synthesis systems and ship updates monthly.

Getting started

Integration is three lines of code:

import vocos
client = vocos.Client(api_key="your-key")
result = client.detect("audio.wav")

That's the superpower: production-grade audio intelligence, accessible to any developer, in a single API call.

Ready to secure your voice agent?

Try the playground - no credit card required.
