2500 vision benchmarks / evals for Vision Language Models

Author: zakariaelhjouji

by zakariaelhjouji·Apr 9, 2026·1 point·0 comments


AI Analysis

●● Solid · Niche Gem · Big Brain

Daily arXiv scraping with Claude classification beats manual curation.

Strengths

•Automated daily updates via GitHub Actions eliminate manual maintenance burden.
•Structured JSON schema enables programmatic filtering by category and modality.
•22 categories cover niches others miss, such as visual reasoning, counting, medical, and safety.
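
To make the programmatic-filtering point concrete, here is a minimal sketch of filtering benchmark records by category and modality. The field names (`name`, `category`, `modality`) and the values assigned to each record are placeholders for illustration; the project's actual JSON schema is not shown on this page and may differ.

```python
import json

# Hypothetical records. The field names and the category/modality values
# are assumptions for illustration, not the project's actual schema.
SAMPLE_JSON = """
[
  {"name": "MMMU", "category": "general", "modality": "image"},
  {"name": "VisuLogic", "category": "visual_reasoning", "modality": "image"},
  {"name": "UNICBench", "category": "counting", "modality": "image"}
]
"""


def filter_benchmarks(records, category=None, modality=None):
    """Return the records that match every filter that is not None."""
    return [
        r for r in records
        if (category is None or r.get("category") == category)
        and (modality is None or r.get("modality") == modality)
    ]


benchmarks = json.loads(SAMPLE_JSON)
counting = filter_benchmarks(benchmarks, category="counting")
print([b["name"] for b in counting])
```

Leaving a filter as `None` skips it, so the same helper can select by category alone, by modality alone, or by both.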

Weaknesses

•Still fundamentally a curated list competing with Papers With Code and HuggingFace.
•No built-in evaluation runner — just metadata, not actual benchmark execution.

Post Description

I love reading benchmark / eval papers. It's one of the best ways to stay up-to-date with progress in Vision Language Models and to understand where they fall short.

Vision tasks vary quite a lot from one to another. For example:

- Vision tasks that require high-level semantic understanding of the image. Models do quite well on them. Popular general benchmarks like MMMU are good for that.
- Visual reasoning tasks, where VLMs are given a visual puzzle (think IQ-style test). VLMs perform quite poorly on them, barely above a random guess. Benchmarks such as VisuLogic are designed for this.
- Visual counting tasks. Models only get it right about 20% of the time, but they're getting better. Evals such as UNICBench test 21+ VLMs across counting tasks with varying levels of difficulty.
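
As a rough illustration of what "barely above a random guess" means, the sketch below compares a raw accuracy with the chance baseline of a multiple-choice test. The four-option format and the 0.28 accuracy are made-up numbers for illustration, not figures from any benchmark named above.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy when guessing uniformly among num_options answers."""
    if num_options < 2:
        raise ValueError("need at least two answer options")
    return 1.0 / num_options


def normalized_score(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so that 0.0 means chance level and 1.0 means perfect."""
    chance = chance_accuracy(num_options)
    return (accuracy - chance) / (1.0 - chance)


# Made-up example: a model scoring 28% on four-option puzzles.
raw_accuracy = 0.28
print(f"chance baseline: {chance_accuracy(4):.2f}")
print(f"normalized score: {normalized_score(raw_accuracy, 4):.2f}")
```

Under these assumed numbers, a raw score that looks like 28% sits only a few points above the 25% chance baseline, which is why the number of answer options matters when reading reported accuracies.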

I compiled a list of 2.5k+ vision benchmarks, each with data links and a high-level summary. The list auto-updates every day with new benchmarks.