Midway

🧠 Real-Time Voice Activity Detection on Raspberry Pi with Silero VAD

In this guide, we implement real-time voice activity detection (VAD) on a Raspberry Pi using the open-source Silero VAD model. This allows us to listen continuously and trigger further actions only when human speech is detected, even in noisy environments.


πŸ”₯ Why Real-Time VAD Matters

When building smart voice systems, it’s wasteful (and error-prone) to analyze audio all the time. We want to:

  • βœ… Detect when someone is actually speaking
  • ❌ Ignore background noise, silence, fans, dogs barking, etc.
  • βœ… Trigger speaker recognition or transcription only when voice is present

πŸ› οΈ What We’ll Use

ComponentPurpose
arecordEfficient audio capture on RPi
torchaudioWAV file loading
silero-vadTiny PyTorch-based voice detector
Raspberry PiEdge device to run everything locally

βš™οΈ Setup

Install Dependencies

pip install torchaudio soundfile silero-vad
sudo apt install sox alsa-utils

βœ… Tip: Make sure your microphone shows up in arecord -l. Use the correct device name (e.g. plughw:1,0).


πŸ“„ The Python Script: realtime_vad.py

This script:

  1. Continuously records 2-second chunks of audio
  2. Analyzes each chunk for speech
  3. Prints whether speech is present
  4. Loops indefinitely
# realtime_vad.py

import torch
from silero_vad import VoiceActivityDetector
import torchaudio
import subprocess
import os
import time

SAMPLE_RATE = 16000
DURATION = 2  # seconds per chunk
TEMP_FILE = "temp_vad.wav"
DEVICE = "plughw:1,0"  # adjust based on `arecord -l`

# Load Silero VAD once
print("πŸ” Loading Silero VAD model...")
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad', trust_repo=True)
(get_speech_timestamps, _, read_audio, _, _) = utils

def record_wav(filename=TEMP_FILE, duration=DURATION, device=DEVICE):
    subprocess.run([
        "arecord",
        "-D", device,
        "-f", "S16_LE",
        "-r", str(SAMPLE_RATE),
        "-c", "1",
        "-t", "wav",
        "-d", str(duration),
        filename
    ], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

def is_speech_present(filename=TEMP_FILE):
    wav = read_audio(filename, sampling_rate=SAMPLE_RATE)
    speech = get_speech_timestamps(wav, model, sampling_rate=SAMPLE_RATE)
    if speech:
        total_speech = sum([s["end"] - s["start"] for s in speech]) / SAMPLE_RATE
        print(f"πŸ—£οΈ Speech detected: {total_speech:.2f} seconds")
        return True
    else:
        print("πŸ”‡ Silence or noise only.")
        return False

def main():
    print("πŸŽ™οΈ Real-Time Voice Activity Detection Started (Ctrl+C to stop)...")
    try:
        while True:
            record_wav()
            is_speech_present()
            if os.path.exists(TEMP_FILE):
                os.remove(TEMP_FILE)
            time.sleep(0.5)
    except KeyboardInterrupt:
        print("\nπŸ›‘ Exiting.")

if __name__ == "__main__":
    main()

βœ… Example Output

πŸŽ™οΈ Real-Time Voice Activity Detection Started (Ctrl+C to stop)...
πŸ”‡ Silence or noise only.
πŸ—£οΈ Speech detected: 1.58 seconds
πŸ”‡ Silence or noise only.
πŸ—£οΈ Speech detected: 2.00 seconds

🧠 How It Works

  • Silero VAD is a neural network trained to distinguish voice vs noise, even in difficult conditions.
  • The model runs fast on a Raspberry Pi (no GPU needed).
  • We record using arecord and process with torchaudio.

πŸ§ͺ Use Cases

  • βœ… Smart voice-triggered assistants
  • βœ… Voice-controlled door locks
  • βœ… Real-time transcription only when speech is detected
  • βœ… Edge-powered speaker recognition

πŸ” Next Steps

You can integrate this directly with speaker recognition like this:

if is_speech_present():
    run_speaker_recognition()
else:
    print("⏭️ Skipping – no voice detected.")

βœ… Conclusion

Silero VAD enables Raspberry Pi projects to detect human voice in real-time, reliably, and with very little resource usage. This is a foundational tool for anyone building smart voice interfaces in real-world environments.


Midway Labs – Building mindful voice tech for the real world.