GitHub Repository

2500+ VLM benchmarks, auto-updated daily from arXiv

23 stars · Python

2500 vision benchmarks / evals for Vision Language Models

by zakariaelhjouji·Apr 9, 2026·1 point·0 comments

AI Analysis

●● Solid · Niche Gem · Big Brain

Daily arXiv scraping with Claude classification beats manual curation.

Strengths
  • Automated daily updates via GitHub Actions eliminate manual maintenance burden.
  • Structured JSON schema enables programmatic filtering by category and modality.
  • 22 categories cover visual reasoning, counting, medical, safety niches others miss.
Weaknesses
  • Still fundamentally a curated list competing with Papers With Code and HuggingFace.
  • No built-in evaluation runner — just metadata, not actual benchmark execution.
Category
Target Audience

ML researchers and VLM developers tracking evaluation methods

Similar To

Papers With Code · HuggingFace Datasets

Post Description

I love reading benchmark / eval papers. It's one of the best ways to stay up to date with progress in Vision Language Models and to understand where they fall short.

Vision tasks vary quite a lot from one to another. For example:

- Vision tasks that require high-level semantic understanding of the image. Models do quite well on them. Popular general benchmarks like MMMU are good for that.
- Visual reasoning tasks where VLMs are given a visual puzzle (think IQ-style test). VLMs perform quite poorly on them, barely above a random guess. Benchmarks such as VisuLogic are designed for this.
- Visual counting tasks. Models only get it right about 20% of the time, but they’re getting better. Evals such as UNICBench test 21+ VLMs across counting tasks with varying levels of difficulty.

I compiled a list of 2.5k+ vision benchmarks, each with data links and a high-level summary. The list auto-updates every day with new benchmarks.
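Since the analysis above notes that the metadata is structured JSON that can be filtered by category and modality, here is a minimal sketch of what that filtering might look like. The field names (`name`, `category`, `modality`), the sample records, and the `filter_benchmarks` helper are all assumptions made for illustration. They are not taken from the repository, whose actual schema may differ.

```python
import json

# Hypothetical records: names, field names, and values are placeholders,
# not the repository's real schema or real benchmark entries.
SAMPLE_JSON = """
[
  {"name": "example-counting-bench", "category": "counting", "modality": "image"},
  {"name": "example-puzzle-bench", "category": "visual reasoning", "modality": "image"},
  {"name": "example-video-count-bench", "category": "counting", "modality": "video"}
]
"""


def filter_benchmarks(records, category=None, modality=None):
    """Return records matching category and/or modality (None means no filter)."""
    return [
        r for r in records
        if (category is None or r.get("category") == category)
        and (modality is None or r.get("modality") == modality)
    ]


benchmarks = json.loads(SAMPLE_JSON)

# Keep only image-based counting benchmarks.
image_counting = filter_benchmarks(benchmarks, category="counting", modality="image")
print([r["name"] for r in image_counting])
```

With the real data, the same helper would only need the actual field names swapped in and the JSON loaded from the repository's file instead of the inline sample.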

Similar Projects