List of LLM benchmark tests


LLM testing websites

Alignment & Truthfulness

  • MASK — Measures how frequently an LLM lies when incentivized to do so. Responses are categorized as True, Evasive, or Lie, and models are ranked by 1 − p(Lie).
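
A minimal sketch of the scoring rule described above: rank models by 1 − p(Lie) over their categorized responses. The function and label names below are illustrative, not MASK's actual evaluation code.

```python
from collections import Counter

def honesty_score(labels: list[str]) -> float:
    """MASK-style score: 1 - p(Lie) over categorized responses.

    Each label is one of "True", "Evasive", or "Lie". Evasive answers
    are not counted as lies, so dodging a question does not lower the
    score the way lying does.
    """
    counts = Counter(labels)
    return 1.0 - counts["Lie"] / len(labels)

# Example: 7 truthful answers, 2 evasions, 1 lie -> score 0.9
print(honesty_score(["True"] * 7 + ["Evasive"] * 2 + ["Lie"]))  # 0.9
```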

Knowledge & Reasoning

  • Humanity’s Last Exam (HLE) — A PhD-level benchmark from the creators of MASK, testing deep reasoning and factual accuracy.

  • VirologyTest.ai — Focused on virology expertise. Includes human expert percentiles (e.g., “this LLM performs better than X% of experts”), which help track when models surpass top human specialists.

  • GPQA — Graduate-level, “Google-proof” multiple-choice questions in biology, physics, and chemistry, written and validated by PhD-level domain experts.


General Capability

  • LiveBench — Evaluates a wide range of LLM capabilities; supports sorting models by weighted or task-specific scores.
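
A minimal sketch of what “weighted or task-specific” sorting means in practice. The task names, scores, and weights below are hypothetical placeholders, not LiveBench's actual categories or weighting scheme.

```python
def weighted_score(task_scores: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Weighted average of per-task scores."""
    total = sum(weights[t] for t in task_scores)
    return sum(s * weights[t] for t, s in task_scores.items()) / total

# Hypothetical per-task scores for two models (not real LiveBench data).
models = {
    "model-a": {"coding": 62.0, "math": 71.5, "reasoning": 58.0},
    "model-b": {"coding": 70.0, "math": 60.0, "reasoning": 64.0},
}
weights = {"coding": 1.0, "math": 1.0, "reasoning": 2.0}  # hypothetical

# Sort by the overall weighted score, or by a single task's score.
by_overall = sorted(models, key=lambda m: weighted_score(models[m], weights),
                    reverse=True)
by_coding = sorted(models, key=lambda m: models[m]["coding"], reverse=True)
print(by_overall)  # ['model-b', 'model-a']
print(by_coding)   # ['model-b', 'model-a']
```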

Coding & Software Engineering

  • LiveCodeBench Pro — Measures performance on competitive coding problems. “Hard” problems correspond to those solved by < 0.1% of top competitive coders.

  • Aider Leaderboards — Practical coding benchmark from the aider AI pair-programming tool; emphasizes real-world code-editing workflows across multiple programming languages.


Cybersecurity

  • CyBench — Evaluates LLM performance on professional-level capture-the-flag (CTF) cybersecurity tasks, such as vulnerability discovery and exploitation.

Cognitive Reasoning & Pattern Recognition

  • ARC Prize — Measures abstract reasoning and novel puzzle-solving on ARC-AGI grid tasks, which are designed to be easy for humans but hard for models.

Visual Intelligence

  • Visual Tool Bench

  • GeoBench — Tests geolocation inference (akin to GeoGuessr’s “No Move” mode): the model must identify a location from a single static image.

  • VideoMMMU — Evaluates video understanding and how watching a video improves problem-solving performance.
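
A small sketch of the before/after comparison this entry describes: quiz the model, let it watch the video, then quiz it again. The normalized-gain formula below is an assumption about one reasonable way to report that improvement, not VideoMMMU's verified metric.

```python
def knowledge_gain(acc_before: float, acc_after: float) -> float:
    """Fraction of the remaining headroom closed after watching the video.

    Both accuracies are in [0, 1]. Normalizing by (1 - acc_before) keeps
    a jump from 0.90 to 0.95 comparable to one from 0.40 to 0.70.
    """
    if acc_before >= 1.0:
        return 0.0  # already perfect; nothing left to gain
    return (acc_after - acc_before) / (1.0 - acc_before)

# Example: 40% correct without the video, 55% correct after watching.
print(round(knowledge_gain(0.40, 0.55), 2))  # 0.25 -> closed 25% of the gap
```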


Forecasting

  • ForecastBench — Tests prediction and forecasting ability on questions about real future events, so results can be compared against prediction-market performance.
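
Forecasts like these are conventionally scored with the Brier score: the mean squared error between predicted probabilities and realized 0/1 outcomes, lower being better. Treat the exact rule as an assumption here rather than ForecastBench's confirmed scoring spec.

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Example: three yes/no questions with probabilistic predictions.
forecasts = [0.9, 0.2, 0.6]  # predicted probability of "yes"
outcomes = [1, 0, 0]         # what actually happened
print(round(brier_score(forecasts, outcomes), 4))  # 0.1367
```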

Games & Simulation

  • BalrogAI — Measures an LLM’s ability to play video games and learn from interaction.

  • VendingBench — A simulated business management benchmark: models must handle vending machine logistics (inventory, pricing, ordering, etc.).


Hallucination & RAG Reliability


Physics & Common Sense

  • VPCT — Tests the ability to solve simple physics puzzles that are trivial for humans but often difficult for LLMs.

Linguistic Robustness

  • Simple-Bench — Measures linguistic adversarial robustness, i.e., how well a model handles trick questions and deliberately confusing phrasing.