List of LLM benchmark tests
LLM testing websites
- Scale SEAL LLM leaderboards
- LLM Leaderboard - Comparison of over 100 AI models from OpenAI, Google, DeepSeek & others
Alignment & Truthfulness
- MASK — Measures how frequently an LLM lies when incentivized to do so. Responses are categorized as True, Evasive, or Lie, and models are ranked by 1 − p(Lie).
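The ranking metric can be computed directly from labeled responses. A minimal sketch in Python, assuming per-response labels are already available (the label list below is illustrative, not MASK's actual data format):

```python
from collections import Counter

def honesty_score(labels: list[str]) -> float:
    """Return 1 - p(Lie) for a list of per-response labels.

    Labels are assumed to be "True", "Evasive", or "Lie".
    Evasive answers do not count as lies, so a model can raise
    its score by dodging instead of lying outright.
    """
    counts = Counter(labels)
    p_lie = counts["Lie"] / len(labels)
    return 1.0 - p_lie

# Example: 7 truthful, 2 evasive, 1 lie -> 1 - 0.1 = 0.9
labels = ["True"] * 7 + ["Evasive"] * 2 + ["Lie"]
print(f"honesty score = {honesty_score(labels):.2f}")
```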
Knowledge & Reasoning
- Humanity’s Last Exam (HLE) — A PhD-level benchmark from the creators of MASK, testing deep reasoning and factual accuracy.
- VirologyTest.ai — Focused on virology expertise. Includes human expert percentiles (e.g., “this LLM performs better than X% of experts”), which help track when models surpass top human specialists.
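The “better than X% of experts” figure is a plain percentile rank: place the model’s score inside the distribution of human expert scores. A minimal sketch with made-up expert scores:

```python
def percentile_rank(model_score: float, expert_scores: list[float]) -> float:
    """Percentage of experts whose score is below the model's."""
    beaten = sum(1 for s in expert_scores if s < model_score)
    return 100.0 * beaten / len(expert_scores)

# Hypothetical expert accuracies on the same question set
experts = [0.41, 0.55, 0.62, 0.68, 0.74, 0.81, 0.88, 0.93]
print(f"beats {percentile_rank(0.80, experts):.1f}% of experts")  # 5 of 8 -> 62.5%
```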
General Capability
- LiveBench — Evaluates a wide range of LLM capabilities; supports sorting models by weighted or task-specific scores.
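Sorting by a weighted score is just ranking on a weighted mean of per-task results. A sketch with invented task names and weights (LiveBench’s real categories and weighting differ):

```python
def weighted_score(task_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-task scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(task_scores[t] * w for t, w in weights.items()) / total_weight

models = {
    "model-a": {"coding": 70.0, "math": 55.0, "reasoning": 62.0},
    "model-b": {"coding": 58.0, "math": 68.0, "reasoning": 66.0},
}
weights = {"coding": 2.0, "math": 1.0, "reasoning": 1.0}

# Rank models by weighted overall score, best first
ranked = sorted(models, key=lambda m: weighted_score(models[m], weights), reverse=True)
print(ranked)  # ['model-a', 'model-b']
```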
Coding & Software Engineering
- LiveCodeBench Pro — Measures performance on competitive coding problems. “Hard” problems correspond to those solved by < 0.1% of top competitive coders.
- Aider Leaderboards — Practical coding benchmark emphasizing real-world development workflows.
Cybersecurity
- CyBench — Evaluates LLM performance on cybersecurity tasks such as threat analysis and vulnerability detection.
Cognitive Reasoning & Pattern Recognition
- ARC Prize — Measures abstract reasoning and puzzle-solving ability.
Visual Intelligence
- GeoBench — Tests geolocation inference skills, similar to GeoGuessr’s “No Move” mode: the model must identify a location from a single static image.
- VideoMMMU — Evaluates video understanding and how watching a video improves problem-solving performance.
Forecasting
- ForecastBench — Tests prediction and forecasting abilities, similar to prediction market performance.
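Probabilistic forecasts of this kind are commonly scored with the Brier score: the mean squared error between the stated probability and the realized 0/1 outcome. A sketch assuming binary yes/no questions (the leaderboard’s exact scoring details may differ):

```python
def brier_score(forecasts: list[tuple[float, int]]) -> float:
    """Mean squared error between predicted probability and 0/1 outcome.

    0.0 is perfect; an always-50% forecaster scores 0.25.
    """
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

# (predicted probability, actual outcome) pairs - illustrative numbers
preds = [(0.9, 1), (0.2, 0), (0.7, 0), (0.5, 1)]
print(f"Brier score: {brier_score(preds):.3f}")
# (0.01 + 0.04 + 0.49 + 0.25) / 4 = 0.1975
```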
Games & Simulation
- BalrogAI — Measures an LLM’s ability to play video games and learn from interaction.
- VendingBench — A simulated business management benchmark: models must handle vending machine logistics (inventory, pricing, ordering, etc.).
Hallucination & RAG Reliability
- Vectara Hallucination Leaderboard — Evaluates hallucination rates during text summarization (see the rate sketch after this list).
- Lech Mazur’s RAG Hallucination Benchmark — Focused on hallucination under Retrieval-Augmented Generation (RAG) setups; intentionally challenging.
Physics & Common Sense
- VPCT — Tests the ability to solve simple physics puzzles that are trivial for humans but often difficult for LLMs.
Linguistic Robustness
- Simple-Bench — Measures linguistic adversarial robustness, i.e. how well a model handles trick or confusing questions.