Combine visual features (face detection, lip movement analysis) with audio features to improve speaker diarization accuracy in video files. Use OpenCV for face detection and lip movement tracking, then fuse visual cues with audio-based speaker embeddings. Essential when processing video files with multiple visible speakers or when audio-only diarization needs visual validation.
When working with video files, you can significantly improve speaker diarization by combining audio features with visual features like face detection and lip movement analysis.
```python
import cv2
import numpy as np

# Initialize the Haar cascade face detector bundled with OpenCV
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
)

# Process video frames (video_path: path to the input video file)
cap = cv2.VideoCapture(video_path)
fps = cap.get(cv2.CAP_PROP_FPS)
faces_by_time = {}
frame_count = 0
frame_skip = max(1, int(fps / 2))  # Sample roughly two frames per second

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    if frame_count % frame_skip == 0:
        timestamp = frame_count / fps
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4)
        faces_by_time[timestamp] = len(faces)
    frame_count += 1

cap.release()
```
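Once `faces_by_time` is populated, the visual cues can be fused with the output of an audio-only diarizer. The sketch below is illustrative: it assumes the diarizer yields `(start, end, speaker)` tuples, and the fusion rule shown (flagging segments where speech is detected but no face is visible) is one simple validation strategy, not the only option.

```python
def fuse_visual_audio(segments, faces_by_time):
    """Cross-check audio diarization segments against per-timestamp face counts.

    segments: list of (start_s, end_s, speaker_label) tuples from an
        audio-only diarizer (format assumed for this sketch).
    faces_by_time: {timestamp_s: face_count}, as built in the loop above.

    Returns each segment annotated with the mean number of visible faces,
    flagging segments where no face was detected as candidates for review.
    """
    fused = []
    for start, end, speaker in segments:
        # Collect face counts for sampled frames falling inside the segment
        counts = [n for t, n in faces_by_time.items() if start <= t < end]
        mean_faces = sum(counts) / len(counts) if counts else 0.0
        fused.append({
            "start": start,
            "end": end,
            "speaker": speaker,
            "mean_faces": mean_faces,
            # Audio claims someone is speaking, but no face is visible
            "needs_review": mean_faces == 0.0,
        })
    return fused
```

Segments flagged `needs_review` can then be re-examined, for example by lowering the face-detection threshold or falling back to audio-only confidence scores.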
npx skills add benchflow-ai/skillsbench --skill multimodal-fusion