List of LLM benchmark tests
LLM testing websites
- Scale SEAL LLM leaderboards
- LLM Leaderboard - Comparison of over 100 AI models from OpenAI, Google, DeepSeek & others
Alignment & Truthfulness
- MASK — Measures how frequently an LLM lies when incentivized to do so. Responses are categorized as True, Evasive, or Lie, and models are ranked by 1 − p(Lie).
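The ranking metric can be computed directly from labeled responses. A minimal sketch in Python, assuming per-response labels are already available (the label list below is illustrative, not MASK's actual data format):

```python
from collections import Counter

def honesty_score(labels: list[str]) -> float:
    """Return 1 - p(Lie) for a list of per-response labels.

    Labels are assumed to be "True", "Evasive", or "Lie".
    Evasive answers do not count as lies, so a model can raise
    its score by dodging instead of lying outright.
    """
    counts = Counter(labels)
    p_lie = counts["Lie"] / len(labels)
    return 1.0 - p_lie

# Example: 7 truthful, 2 evasive, 1 lie -> 1 - 0.1 = 0.9
labels = ["True"] * 7 + ["Evasive"] * 2 + ["Lie"]
print(f"honesty score = {honesty_score(labels):.2f}")
```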
Knowledge & Reasoning
- Humanity’s Last Exam (HLE) — A PhD-level benchmark from the creators of MASK, testing deep reasoning and factual accuracy.
- VirologyTest.ai — Focused on virology expertise. Includes human expert percentiles (e.g., “this LLM performs better than X% of experts”), which help track when models surpass top human specialists.
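The “better than X% of experts” figure is a plain percentile rank: place the model’s score inside the distribution of human expert scores. A minimal sketch with made-up expert scores:

```python
def percentile_rank(model_score: float, expert_scores: list[float]) -> float:
    """Percentage of experts whose score is below the model's."""
    beaten = sum(1 for s in expert_scores if s < model_score)
    return 100.0 * beaten / len(expert_scores)

# Hypothetical expert accuracies on the same question set
experts = [0.41, 0.55, 0.62, 0.68, 0.74, 0.81, 0.88, 0.93]
print(f"beats {percentile_rank(0.80, experts):.1f}% of experts")  # 5 of 8 -> 62.5%
```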
General Capability
- LiveBench — Evaluates a wide range of LLM capabilities; supports sorting models by weighted or task-specific scores.
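Sorting by a weighted score is just ranking on a weighted mean of per-task results. A sketch with invented task names and weights (LiveBench’s real categories and weighting differ):

```python
def weighted_score(task_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-task scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(task_scores[t] * w for t, w in weights.items()) / total_weight

models = {
    "model-a": {"coding": 70.0, "math": 55.0, "reasoning": 62.0},
    "model-b": {"coding": 58.0, "math": 68.0, "reasoning": 66.0},
}
weights = {"coding": 2.0, "math": 1.0, "reasoning": 1.0}

# Rank models by weighted overall score, best first
ranked = sorted(models, key=lambda m: weighted_score(models[m], weights), reverse=True)
print(ranked)  # ['model-a', 'model-b']
```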
Coding & Software Engineering
- LiveCodeBench Pro — Measures performance on competitive coding problems. “Hard” problems correspond to those solved by < 0.1% of top competitive coders.
- Aider Leaderboards — Practical coding benchmark emphasizing real-world development workflows.
Cybersecurity
- CyBench — Evaluates LLM performance on cybersecurity tasks such as threat analysis and vulnerability detection.
Cognitive Reasoning & Pattern Recognition
- ARC Prize — Measures abstract reasoning and puzzle-solving ability.
Visual Intelligence
- GeoBench — Tests geolocation inference skills, similar to GeoGuessr’s “No Move” mode: the model must identify a location from a single static image.
- VideoMMMU — Evaluates video understanding and how watching a video improves problem-solving performance.
Forecasting
- ForecastBench — Tests prediction and forecasting abilities, similar to prediction market performance.
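Probabilistic forecasts of this kind are commonly scored with the Brier score: the mean squared error between the stated probability and the realized 0/1 outcome. A sketch assuming binary yes/no questions (the leaderboard’s exact scoring details may differ):

```python
def brier_score(forecasts: list[tuple[float, int]]) -> float:
    """Mean squared error between predicted probability and 0/1 outcome.

    0.0 is perfect; an always-50% forecaster scores 0.25.
    """
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

# (predicted probability, actual outcome) pairs - illustrative numbers
preds = [(0.9, 1), (0.2, 0), (0.7, 0), (0.5, 1)]
print(f"Brier score: {brier_score(preds):.3f}")
# (0.01 + 0.04 + 0.49 + 0.25) / 4 = 0.1975
```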
Games & Simulation
- BalrogAI — Measures an LLM’s ability to play video games and learn from interaction.
- VendingBench — A simulated business management benchmark: models must handle vending machine logistics (inventory, pricing, ordering, etc.).
Hallucination & RAG Reliability
- Vectara Hallucination Leaderboard — Evaluates hallucination rates during text summarization (see the rate sketch after this list).
- Lech Mazur’s RAG Hallucination Benchmark — Focused on hallucination under Retrieval-Augmented Generation (RAG) setups; intentionally challenging.
Physics & Common Sense
- VPCT — Tests the ability to solve simple physics puzzles that are trivial for humans but often difficult for LLMs.
Linguistic Robustness
- Simple-Bench — Measures linguistic adversarial robustness, i.e. how well a model handles trick or confusing questions.