GitHub Repository

Free & open source AI video platform — Clip Generator, AI Shorts (UGC with AI actors) & YouTube Studio. Self-hosted, no watermarks.

2,342 stars · JavaScript

Open-source AI SaaS to turn long videos into viral shorts

by mutonbini · Mar 8, 2026 · 1 point · 0 comments

AI Analysis

●●● Banger · Wizardry · Solve My Problem

Vertical reframing with MediaPipe + YOLOv8 face tracking beats naive cropping. Opus Clip already exists, but this is free and open source.

Strengths

• The dual-mode cropping strategy (TRACK for a single subject with stabilization, GENERAL for groups) shows thoughtful scene analysis, not just naive centering.
• The "Heavy Tripod" stabilization engine specifically addresses the jitter problem in face-tracked reframing, which is the hard part, done right.
• Full stack included: transcription, moment detection (Gemini), cropping, dubbing, S3 backup, and direct social posting. A single command goes from URL to TikTok-ready clip.

Weaknesses

• Opus Clip (paid, closed source) and shorter-form tools (Captions.ai) already solve the core use case; being open source doesn't guarantee adoption.
• The accuracy of viral moment detection depends entirely on Gemini 2.0 Flash's understanding of platform trends, which is unproven at scale.

Post Description

Hey HN,

I built OpenShorts, an open-source tool that takes a long YouTube video (or local file) and automatically generates vertical short clips ready for TikTok, Instagram Reels, and YouTube Shorts.

How it works:

1. Transcribes the video using faster-whisper (CPU-optimized, word-level timestamps)
2. Sends the transcript to Gemini 2.0 Flash, which identifies the 3–15 most "viral-worthy" moments (15–60s each)
3. FFmpeg extracts the clips precisely
4. AI-powered vertical reframing with two modes:
   - TRACK mode: MediaPipe face detection + YOLOv8 fallback with stabilization ("Heavy Tripod" engine) for single-subject scenes
   - GENERAL mode: Blurred background layout for groups/landscapes
5. Optional: AI subtitles, hook text overlays, voice dubbing (ElevenLabs, 30+ languages), and direct social posting
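To make steps 2 and 3 concrete, here is a minimal sketch of how candidate moments might be filtered to the 15–60s window and turned into FFmpeg arguments. It is not taken from the OpenShorts code: the `Moment` class, its `score` field, and both helper functions are hypothetical names chosen for illustration, and re-encoding with libx264/AAC is an assumption about how a precise cut could be achieved.

```python
from dataclasses import dataclass


@dataclass
class Moment:
    """A candidate clip. Field names are illustrative, not from the project."""
    start: float  # seconds into the source video
    end: float    # seconds into the source video
    score: float  # hypothetical ranking score returned by the LLM

    @property
    def duration(self) -> float:
        return self.end - self.start


def select_moments(candidates, min_len=15.0, max_len=60.0, max_clips=15):
    """Keep moments lasting 15-60 s, best score first, capped at max_clips."""
    valid = [m for m in candidates if min_len <= m.duration <= max_len]
    valid.sort(key=lambda m: m.score, reverse=True)
    return valid[:max_clips]


def ffmpeg_cut_args(src, moment, out):
    """Build an ffmpeg argument list that cuts and re-encodes one clip.

    Re-encoding (instead of stream copy) lets the cut start on the exact
    timestamp rather than snapping to the nearest keyframe.
    """
    return [
        "ffmpeg", "-y",                   # overwrite the output if it exists
        "-ss", f"{moment.start:.3f}",     # seek to the clip start (input option)
        "-i", src,
        "-t", f"{moment.duration:.3f}",   # keep this many seconds
        "-c:v", "libx264",
        "-c:a", "aac",
        out,
    ]
```

The argument list can be passed to `subprocess.run(args, check=True)`. Building it as a list rather than a shell string avoids quoting problems with file names that contain spaces.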

The reframing engine was the hardest part. Naive face tracking produces jittery, unwatchable output. I built a SmoothedCameraman class with safe-zone logic and a SpeakerTracker that prevents rapid switching between detected faces. The system pre-scans every scene to decide TRACK vs. GENERAL before processing.
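The two ideas in that paragraph can be illustrated with a short sketch. The class names `SmoothedCameraman` and `SpeakerTracker` come from the post, but everything inside them here is an assumption: the safe zone is modelled as a dead band around the crop centre, the camera motion uses simple exponential easing, and speaker switching uses a consecutive-frame threshold. The actual project may work differently.

```python
class SmoothedCameraman:
    """Illustrative virtual camera: ignores small face movements (safe zone)
    and eases toward the face when it drifts too far. Parameters are assumed."""

    def __init__(self, frame_w, crop_w, safe_zone=0.25, smoothing=0.1):
        self.frame_w = frame_w
        self.crop_w = crop_w
        self.safe_px = safe_zone * crop_w  # dead band half-width, in pixels
        self.smoothing = smoothing         # fraction of the offset closed per frame
        self.center = frame_w / 2.0

    def update(self, face_x):
        """Return the crop's left edge for this frame, given the face centre x."""
        offset = face_x - self.center
        if abs(offset) > self.safe_px:
            # Face left the safe zone: ease toward it instead of jumping.
            self.center += self.smoothing * offset
        half = self.crop_w / 2.0
        # Keep the crop window inside the frame.
        self.center = min(max(self.center, half), self.frame_w - half)
        return int(round(self.center - half))


class SpeakerTracker:
    """Illustrative hysteresis: switch to a new face only after it has been
    the detected speaker for min_frames consecutive frames."""

    def __init__(self, min_frames=15):
        self.min_frames = min_frames
        self.current = None
        self._candidate = None
        self._streak = 0

    def update(self, detected_id):
        if self.current is None:
            self.current = detected_id
        if detected_id == self.current:
            self._candidate, self._streak = None, 0
            return self.current
        if detected_id == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = detected_id, 1
        if self._streak >= self.min_frames:
            self.current, self._candidate, self._streak = detected_id, None, 0
        return self.current
```

In this sketch, a larger `safe_zone` or a smaller `smoothing` value gives a steadier, more tripod-like shot at the cost of reacting more slowly when the subject really moves.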

Stack: Python/FastAPI backend, React/Vite dashboard, Docker Compose for one-command setup. All API keys (Gemini, ElevenLabs) stay client-side, encrypted in localStorage — never stored on the server.

Try it:

git clone ... && docker compose up --build

Then open localhost:5173, paste a Gemini API key and a YouTube URL.

MIT licensed. Feedback and PRs welcome.