KDnuggets · April 28, 2026, 23:00 · approx. 6 min read

An Introduction to Whisper for Local Audio Transcription

#Speech-to-Text #Faster-Whisper #OpenAI #Edge AI #Privacy-Preserving AI

TL;DR

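This article explains how to build a local audio transcription system that protects privacy and cuts costs, walking through concrete implementation steps with OpenAI's Whisper and its faster variant, Faster-Whisper.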

AI in-depth analysis · July 5, 2026, 11:03 · depth 40%

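Key points

Benefits and challenges of local speech recognition

Compared with cloud services, local processing protects privacy and avoids usage fees, but the original Whisper model is slow on a CPU and consumes a lot of memory.

Technical advantages of Faster-Whisper

Built on CTranslate2, Faster-Whisper is recommended because it runs up to 4× faster than the original model, uses less memory, and integrates well with Python.

Cross-platform environment setup

The article shows how to create a virtual environment and install Faster-Whisper with Python 3.8 or higher on Windows, macOS, or Linux.

Audio pre-processing requirements

The tutorial feeds Whisper 16 kHz mono WAV audio, so files such as MP3 or M4A are converted with FFmpeg and pydub.

Optional GPU acceleration

With an NVIDIA GPU, installing cuBLAS and cuDNN speeds up transcription; without them, transcription runs on the CPU.

Balancing accuracy and inference speed

beam_size=5 offers a balance of accuracy and speed, and compute_type="int8" speeds up inference by using 8-bit integer arithmetic.

Choosing between CPU and GPU

A CPU is recommended for beginners and for files under 10 minutes; for long files or batch processing, an NVIDIA GPU is 3–5× faster.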


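Impact analysis

This article is a practical guide for developers who want to move away from cloud-dependent AI models and run them in production on-premises or on edge devices. With privacy regulations tightening, it offers technical know-how that supports wider adoption of speech processing that runs entirely locally.

Editor's comment

For developers who want to stop relying on cloud speech recognition services and keep their data secure, a local setup built on Faster-Whisper is a realistic and valuable solution.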


Suggested next steps: try different Whisper model sizes for better accuracy; add speaker diarisation (identifying who spoke when) using libraries like pyannote.audio; build a simple web interface with Gradio or Streamlit.


# Introduction

Transcribing audio into text is a common need for developers, whether you're building a voice-to-text app, analysing meeting recordings, or adding captions to videos. Doing it locally (on your own machine) protects privacy and avoids recurring cloud costs.

In this article, you will learn how to set up a fast, local transcription system using Whisper and its optimised version called Faster-Whisper. We will cover audio preprocessing like converting MP3 to WAV, write a Python script, and discuss running on both CPUs and GPUs.

# What Is Whisper? And Why Use a Local Variant?

OpenAI's Whisper is an automatic speech recognition (ASR) model. It's trained on a large amount of multilingual audio and performs well even with background noise or different accents.

However, the original Whisper can be slow on a CPU and uses significant memory. That's where optimised variants come in to help.

whisper.cpp is written in C++ with no heavy dependencies. It is very fast on CPU, but requires compilation and is less Python-friendly.

Faster-Whisper is a reimplementation using CTranslate2. It runs up to 4× faster than original Whisper, uses less RAM, and works seamlessly with Python. We will be using Faster-Whisper in this tutorial.

Both variants run 100% locally; no data leaves your computer.

# Setting Up Your Environment (Cross-Platform)

This setup works on Windows, macOS, and Linux with Python 3.8 or higher. Create and activate a virtual environment (optional but recommended):

code

python -m venv whisper_env

Activate the virtual environment on macOS and Linux:

code

source whisper_env/bin/activate

On Windows:

code

whisper_env\Scripts\activate

Install Faster-Whisper:

code

pip install faster-whisper

// Installing Audio Pre-processing Tools

Whisper models operate on 16 kHz mono audio. To convert common formats (MP3, M4A, OGG, etc.) to WAV up front, we need FFmpeg and the Python library pydub.

Install FFmpeg:

Windows: download from FFmpeg.org and add it to PATH, or run winget install ffmpeg

macOS: brew install ffmpeg

Linux (Ubuntu/Debian): sudo apt install ffmpeg

Then install pydub:

code

pip install pydub
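Before moving on, it can help to confirm that the pieces installed above are actually visible to Python. The following is a minimal sketch rather than part of the original tutorial: the helper names `installed_version` and `environment_report` are hypothetical, and it relies only on the standard library (`importlib.metadata` and `shutil.which`).

```python
import shutil
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a pip package, or None if missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("faster-whisper", "pydub")):
    """Collect package versions and the FFmpeg location (None means not found)."""
    report = {name: installed_version(name) for name in packages}
    # shutil.which searches PATH in the same way the shell does
    report["ffmpeg"] = shutil.which("ffmpeg")
    return report


if __name__ == "__main__":
    for name, value in environment_report().items():
        print(f"{name}: {value if value else 'NOT FOUND'}")
```

Any entry reported as NOT FOUND points back to the corresponding install step above.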

// Optional GPU Support

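If you have an NVIDIA GPU and want faster transcription, install cuBLAS and cuDNN following the Faster-Whisper GPU guide. Without them, keep device="cpu" (the default in the scripts below); requesting device="cuda" without a working CUDA setup typically raises an error rather than silently falling back to the CPU.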

# Audio Pre-processing: Converting Non-WAV Files

Most audio files you encounter are not raw WAV. They use compression (MP3) or container formats (M4A). You must convert them to 16 kHz, mono, PCM WAV before feeding them to Whisper.

Below is a Python function that uses pydub (which calls FFmpeg in the background) to perform this conversion.

code

from pydub import AudioSegment
import os

def convert_to_wav(input_path, output_path=None):
    """
    Convert any audio file (MP3, M4A, OGG, etc.) to WAV (16 kHz, mono).
    If output_path is None, replaces extension with .wav in the same folder.
    """
    if output_path is None:
        base, _ = os.path.splitext(input_path)
        output_path = base + ".wav"

    # Load audio (pydub uses ffmpeg)
    audio = AudioSegment.from_file(input_path)

    # Convert to mono and set sample rate to 16000 Hz
    audio = audio.set_channels(1).set_frame_rate(16000)

    # Export as WAV
    audio.export(output_path, format="wav")
    return output_path

Usage example:

code

wav_file = convert_to_wav("meeting.mp3")
print(f"Converted to: {wav_file}")
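To confirm that a converted file really is 16 kHz mono, the WAV header can be inspected with Python's built-in `wave` module. This is a small sketch under the assumption that the file is plain PCM WAV (which is what the export above produces); `describe_wav` and `is_whisper_ready` are illustrative names, not pydub or Faster-Whisper functions.

```python
import wave


def describe_wav(path):
    """Read the sample rate, channel count, and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / float(rate)
    return {"sample_rate": rate, "channels": channels, "duration_s": duration}


def is_whisper_ready(path):
    """True if the file is 16 kHz mono, the format this tutorial converts to."""
    info = describe_wav(path)
    return info["sample_rate"] == 16000 and info["channels"] == 1


if __name__ == "__main__":
    import os
    import tempfile

    # Write one second of 16 kHz mono silence so the demo runs without an input file
    demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)  # 16-bit samples
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 16000)

    print(describe_wav(demo_path))
    print(is_whisper_ready(demo_path))
```

In practice you would call `is_whisper_ready(wav_file)` on the path returned by `convert_to_wav`.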

# Basic Transcription Script with Faster-Whisper

Now let's write a complete Python script that loads a Whisper model, transcribes a WAV file, and prints the result.

code

from faster_whisper import WhisperModel

def transcribe_audio(wav_path, model_size="base", device="cpu"):
    """
    Transcribe a WAV file (16 kHz mono) using Faster-Whisper.
    model_size: "tiny", "base", "small", "medium", "large-v2", "large-v3"
    device: "cpu" or "cuda" (if GPU is available)
    """
    # Initialize model (downloads automatically on first use)
    model = WhisperModel(model_size, device=device, compute_type="int8")

    # Run transcription; language="en" forces English, omit it to auto-detect
    segments, info = model.transcribe(wav_path, beam_size=5, language="en")

    print(f"Language: {info.language} (probability: {info.language_probability:.2f})")
    print("\nTranscription:")

    # segments is a lazy generator that can be consumed only once,
    # so collect the text while printing instead of iterating a second time
    texts = []
    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
        texts.append(segment.text.strip())

    # Return full text if needed
    return " ".join(texts)

# Example usage
if __name__ == "__main__":
    text = transcribe_audio("my_recording.wav", model_size="small", device="cpu")

What's happening in the code above?

WhisperModel downloads the chosen model (e.g. small) to ~/.cache/huggingface/hub on first run.

beam_size=5 balances accuracy and speed. Higher values (e.g. 10) are slower and may improve accuracy.

compute_type="int8" uses 8-bit integer math for faster inference. For GPU, you can try "float16".
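The start and end times on each segment are what make captioning possible. As an illustration, the sketch below turns segments into SRT subtitle text. It is an assumption-laden example, not part of the tutorial: `Segment` is a stand-in namedtuple that mimics only the `start`, `end`, and `text` attributes used above, and `srt_timestamp` and `segments_to_srt` are hypothetical helper names.

```python
from collections import namedtuple

# Stand-in for the segment objects that model.transcribe() yields;
# only the three attributes used in the tutorial are modelled here.
Segment = namedtuple("Segment", ["start", "end", "text"])


def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm, the timestamp layout used by SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Turn an iterable of segments into the text of an SRT subtitle file."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [Segment(0.0, 2.5, " Hello there."), Segment(2.5, 5.04, " Welcome back.")]
    print(segments_to_srt(demo))
```

Because the function only reads three attributes, the real segments returned by `model.transcribe` can be passed to it directly, as long as they have not already been consumed by another loop.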

| Device | Speed | Setup Complexity | Recommended For |
|---|---|---|---|
| CPU | Slower (but fine for files under 10 minutes) | None (just install) | Beginners, laptops, small projects |
| GPU (CUDA) | 3–5× faster | Requires NVIDIA drivers, cuBLAS, cuDNN | Long files, batch transcription |

To use a GPU, set device="cuda" in the code. Faster-Whisper can use the GPU once the CUDA libraries are installed correctly.
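One way to avoid hard-coding the device is to ask CTranslate2 (the engine underneath Faster-Whisper) how many CUDA devices it can see and choose settings from that. This is a hedged sketch: the two function names are hypothetical, and pairing "cuda" with "float16" and "cpu" with "int8" simply follows the suggestions in this article rather than any official recommendation.

```python
def choose_device_and_compute_type(cuda_device_count):
    """Map the number of visible CUDA devices to WhisperModel settings."""
    if cuda_device_count > 0:
        return "cuda", "float16"
    return "cpu", "int8"


def detect_cuda_device_count():
    """Ask CTranslate2 how many CUDA devices it can see; 0 if it cannot tell."""
    try:
        import ctranslate2  # installed as a dependency of faster-whisper
        return ctranslate2.get_cuda_device_count()
    except Exception:
        # Missing package or a broken CUDA setup: treat as "no GPU"
        return 0


if __name__ == "__main__":
    device, compute_type = choose_device_and_compute_type(detect_cuda_device_count())
    print(f"device={device}, compute_type={compute_type}")
    # These values can then be passed on, for example:
    # WhisperModel("small", device=device, compute_type=compute_type)
```

Keeping the decision in a pure function makes it easy to override, for instance to force the CPU on a machine whose GPU is busy.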

Tip: Even on CPU, Faster-Whisper is much faster than the original Whisper. For a 10-minute MP3, the base model on a modern CPU takes roughly 2 minutes.
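The figure in the tip can be expressed as a real-time factor (processing time divided by audio duration): 2 minutes for 10 minutes of audio is 120 s / 600 s = 0.2, or about 5× faster than real time. The sketch below uses that number to estimate other durations. It assumes processing time scales linearly with audio length, which is only a rough approximation, and the function names are illustrative.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


def estimate_processing_seconds(audio_seconds, rtf):
    """Rough estimate that assumes processing time grows linearly with audio length."""
    return audio_seconds * rtf


if __name__ == "__main__":
    # Figures from the tip above: a 10-minute file processed in roughly 2 minutes
    rtf = real_time_factor(processing_seconds=120, audio_seconds=600)
    print(f"Real-time factor: {rtf:.2f}")
    print(f"Speed relative to real time: {1 / rtf:.1f}x")

    # Hypothetical 45-minute recording on the same machine and model
    estimate = estimate_processing_seconds(45 * 60, rtf)
    print(f"Estimated time for 45 minutes of audio: {estimate / 60:.0f} minutes")
```

Measuring the factor once on your own hardware with a short clip gives a more trustworthy estimate than the article's ballpark figure.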

# Converting MP3 to Transcript: A Complete Example

Here's a full script that converts any audio file to WAV, then transcribes it.

code

import os
from pydub import AudioSegment
from faster_whisper import WhisperModel

def convert_to_wav(input_path):
    """Convert any audio to 16kHz mono WAV."""
    audio = AudioSegment.from_file(input_path)
    audio = audio.set_channels(1).set_frame_rate(16000)
    wav_path = os.path.splitext(input_path)[0] + ".wav"
    audio.export(wav_path, format="wav")
    return wav_path

def transcribe_file(audio_path, model_size="base", device="cpu"):
    # Step 1: Convert if not already WAV
    if not audio_path.lower().endswith(".wav"):
        print(f"Converting {audio_path} to WAV...")
        audio_path = convert_to_wav(audio_path)

    # Step 2: Transcribe
    print(f"Loading model '{model_size}' on {device.upper()}...")
    model = WhisperModel(model_size, device=device, compute_type="int8")
    segments, info = model.transcribe(audio_path, beam_size=5)

    print(f"\nLanguage: {info.language} (prob: {info.language_probability:.2f})")
    print("\nTranscript:")
    for seg in segments:
        print(seg.text, end=" ", flush=True)
    print()  # final newline

if __name__ == "__main__":
    # Example: transcribe an MP3 file
    transcribe_file("interview.mp3", model_size="small", device="cpu")

Save this as transcribe.py and run:

code

python transcribe.py

The script will download the model once, convert the file, and output the transcript.
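For batch transcription, the main saving comes from loading the model once and reusing it for every file. The sketch below shows one possible shape for that. It is not the author's code: the helper names, the extension list, and the choice to write a `.txt` file next to each recording are all assumptions. It also passes each file straight to `model.transcribe`, which current Faster-Whisper releases can decode themselves; if that does not hold in your setup, convert with `convert_to_wav` first.

```python
from pathlib import Path

# Extensions treated as audio in this sketch (an assumption, adjust as needed)
AUDIO_EXTENSIONS = {".mp3", ".m4a", ".ogg", ".wav", ".flac"}


def find_audio_files(folder):
    """Return the audio files directly inside folder, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


def transcript_path_for(audio_path):
    """Place the transcript next to the audio file, with a .txt extension."""
    return Path(audio_path).with_suffix(".txt")


def transcribe_folder(folder, model_size="base", device="cpu"):
    """Transcribe every audio file in folder, loading the model only once."""
    # Imported lazily so the two helpers above work without faster-whisper installed
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device=device, compute_type="int8")
    for audio_path in find_audio_files(folder):
        segments, _info = model.transcribe(str(audio_path), beam_size=5)
        # Consume the lazy segment generator exactly once
        text = " ".join(seg.text.strip() for seg in segments)
        out_path = transcript_path_for(audio_path)
        out_path.write_text(text, encoding="utf-8")
        print(f"{audio_path.name} -> {out_path.name}")
```

Calling `transcribe_folder("recordings")`, where `recordings` is a hypothetical folder name, would write one transcript per audio file. Note that two files sharing a stem (for example `talk.mp3` and `talk.wav`) would overwrite each other's transcript in this simple layout.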

# Conclusion

You now have a local, fast, and privacy-friendly audio transcription system. Some key takeaways:

Faster-Whisper gives you near-real-time transcription on a CPU and excellent speed on a GPU.

Always pre-process audio to 16 kHz mono WAV using pydub and FFmpeg.

The model_size parameter trades accuracy for speed — start with "base" or "small".

Running locally means no API keys, no data sharing, and no monthly fees.

Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.


