FFmpeg brings AI transcription to the command line
FFmpeg, the ubiquitous open-source media toolkit, has added a new audio filter, whisper (implemented in libavfilter as af_whisper), that embeds automatic speech recognition (ASR) directly into FFmpeg workflows. Built on the lightweight whisper.cpp runtime, which runs OpenAI's Whisper models on-device, the integration brings a powerful AI transcription model into media processing pipelines, moving FFmpeg beyond traditional encoding and filtering into AI-enabled content handling.
Key features of the af_whisper filter
Model selection and language options
af_whisper supports the range of whisper.cpp models (ggml-format files, from tiny through large), letting users pick the balance between speed and accuracy. You can also pin the transcription language, or leave it on auto-detection, to improve fidelity for multilingual content.
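As a minimal sketch (on the command line the filter is invoked as whisper; the input name and model path are placeholders, so check ffmpeg -h filter=whisper against your build), transcribing a video's audio track to plain text might look like this:

    # Transcribe the audio track of a video to a plain-text file.
    # The model is a ggml-format file downloaded via whisper.cpp;
    # -vn drops the video, and "-f null -" discards the filtered
    # audio, since only the transcript side effect is wanted.
    ffmpeg -i interview.mp4 -vn \
      -af "whisper=model=models/ggml-base.en.bin:language=en:destination=transcript.txt" \
      -f null -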
Flexible output formats
The filter can emit plain text, SRT subtitles, or structured JSON metadata. That makes it easy to generate subtitle files for videos and podcasts, feed automatic captions to streaming platforms, or pipe transcription metadata into downstream automation.
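Switching outputs is a matter of changing the format and destination options. A sketch using the same hypothetical model path, first writing SRT subtitles and then structured JSON:

    # Generate SRT subtitles for a podcast episode.
    ffmpeg -i episode.mp3 \
      -af "whisper=model=models/ggml-base.en.bin:format=srt:destination=episode.srt" \
      -f null -

    # Same input, but JSON metadata for downstream automation.
    ffmpeg -i episode.mp3 \
      -af "whisper=model=models/ggml-base.en.bin:format=json:destination=episode.json" \
      -f null -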
Live streaming, VAD, queueing, and GPU acceleration
af_whisper handles both pre-recorded audio and live streams. Voice Activity Detection (VAD) is available to skip non-speech audio and improve accuracy on sparse speech segments. A configurable audio queue trades transcription latency against accuracy (shorter queues yield faster but choppier output), and GPU acceleration can dramatically speed up processing on compatible hardware.
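A sketch combining these knobs follows; the queue, use_gpu, and vad_model option names reflect recent builds, and the Silero VAD model file name is an assumption based on whisper.cpp's model naming, so verify both against your installation:

    # A longer queue favors accuracy over latency; the VAD model
    # skips non-speech audio; use_gpu offloads inference to a
    # supported GPU.
    ffmpeg -i lecture.wav \
      -af "whisper=model=models/ggml-small.bin:queue=10:use_gpu=1:vad_model=models/ggml-silero-v5.1.2.bin:format=srt:destination=lecture.srt" \
      -f null -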
How af_whisper compares to external ASR services
Unlike cloud transcription services, the whisper.cpp-powered filter runs entirely on the local machine, offering lower latency, better privacy, and simpler automation. It replaces multi-step external workflows (exporting audio, uploading to a cloud API, retrieving transcripts) by consolidating everything into a single FFmpeg command line while still delivering high-quality ASR and standard subtitle formats such as SRT.
Advantages for creators and developers
This new filter saves time and reduces complexity for content creators, archivists, journalists, and developers. Benefits include on-device transcription, integrated subtitle generation, output metadata for indexing and search, and a single-tool pipeline that supports automation and batch processing.
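As a rough sketch of single-tool batch processing (directory layout and model path are placeholders), a shell loop is enough to caption a whole folder of videos:

    # Hypothetical batch job: write an SRT file next to every MP4.
    for f in ./videos/*.mp4; do
      ffmpeg -i "$f" -vn \
        -af "whisper=model=models/ggml-base.en.bin:format=srt:destination=${f%.mp4}.srt" \
        -f null -
    done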
Practical use cases
Use cases include creating SRT captions for videos and podcasts, live captioning for streams and broadcasts, searchable transcripts for archives, and automated metadata generation for content management systems. The combination of VAD, GPU support, and flexible outputs makes af_whisper suitable for both real-time applications and large-scale batch jobs.
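For live captioning, the same filter can sit on a stream input. This sketch assumes a hypothetical RTMP source and uses a short queue to keep latency down, at some cost in accuracy:

    # Live captions from a stream; a shorter queue lowers latency.
    ffmpeg -i rtmp://example.com/live/show \
      -af "whisper=model=models/ggml-base.en.bin:queue=3:format=srt:destination=captions.srt" \
      -f null -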
Market relevance and future directions
Embedding whisper.cpp into FFmpeg sets a precedent for adding more AI and machine learning models to the platform. This move reinforces FFmpeg's position as an industry-standard media tool and signals wider adoption of AI across media tooling. As on-device AI and hybrid workflows grow, expect FFmpeg to continue evolving with additional AI-driven filters and optimizations.
Getting started
To try af_whisper, update to FFmpeg 8.0 or a newer build that includes the filter, then explore the options for model, language, output format, VAD, queueing, and GPU acceleration. For many users, this single-filter approach replaces cumbersome multi-tool transcription pipelines while improving speed, privacy, and automation.
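To confirm the filter is available and see which options your build actually exposes:

    # List the filter and print its documented options.
    ffmpeg -filters | grep whisper
    ffmpeg -h filter=whisper

When compiling FFmpeg from source, my understanding is that the filter must be enabled at configure time with --enable-whisper (it requires an installed whisper.cpp that pkg-config can find); verify the flag against ./configure --help:

    ./configure --enable-whisper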
