LLM World Models

10-07

TL;DR: How much geographic world knowledge do LLMs have? To find out, I ask them what's Texas and what's not.

Methodology

We use the EPSG:3083 projection (Texas Centric Albers Equal Area), an equal-area projection centered on Texas. We split the map into an N×N grid (64 or 128, depending on budget) and convert each grid point back to a latitude X and longitude Y.
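The grid-to-coordinate step can be sketched with a spherical approximation of the Albers equal-area conic underlying EPSG:3083 (standard parallels 27.5° and 35°, origin 18°N 100°W, false easting/northing 1,500,000 m / 6,000,000 m). In practice you would use a library like pyproj, since the real CRS is defined on the GRS80 ellipsoid; this self-contained version is an approximation:

```python
import math

# EPSG:3083 parameters (Texas Centric Albers Equal Area).
# Sketch only: spherical approximation; the real CRS uses the GRS80 ellipsoid.
R = 6371000.0                                          # mean Earth radius, metres
PHI1, PHI2 = math.radians(27.5), math.radians(35.0)    # standard parallels
PHI0, LAM0 = math.radians(18.0), math.radians(-100.0)  # projection origin
FE, FN = 1_500_000.0, 6_000_000.0                      # false easting / northing

N_CONE = (math.sin(PHI1) + math.sin(PHI2)) / 2
C = math.cos(PHI1) ** 2 + 2 * N_CONE * math.sin(PHI1)
RHO0 = R * math.sqrt(C - 2 * N_CONE * math.sin(PHI0)) / N_CONE

def forward(lat_deg, lon_deg):
    """Lat/lon in degrees -> projected (x, y) in metres."""
    phi, lam = math.radians(lat_deg), math.radians(lon_deg)
    rho = R * math.sqrt(C - 2 * N_CONE * math.sin(phi)) / N_CONE
    theta = N_CONE * (lam - LAM0)
    return FE + rho * math.sin(theta), FN + RHO0 - rho * math.cos(theta)

def inverse(x, y):
    """Projected (x, y) in metres -> lat/lon in degrees."""
    xp, yp = x - FE, RHO0 - (y - FN)
    rho = math.hypot(xp, yp)
    theta = math.atan2(xp, yp)
    phi = math.asin((C - (rho * N_CONE / R) ** 2) / (2 * N_CONE))
    return math.degrees(phi), math.degrees(LAM0 + theta / N_CONE)

def grid_points(n, x_min, x_max, y_min, y_max):
    """Yield (lat, lon) for an n x n grid over a projected bounding box."""
    for i in range(n):
        for j in range(n):
            x = x_min + (x_max - x_min) * i / (n - 1)
            y = y_min + (y_max - y_min) * j / (n - 1)
            yield inverse(x, y)
```

Working in the projected plane and inverting back to lat/lon keeps the grid cells equal-area, so every cell counts the same toward accuracy.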

The prompt:

> Output exactly one of: Texas, Not Texas. If the coordinate is inside the land borders of the U.S. state of Texas, answer 'Texas'. Otherwise answer 'Not Texas'. Do not say anything else. Latitude: X°, Longitude: Y°.
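Substituting X and Y for each grid point is a one-liner; the exact number formatting (four decimal places, signed longitude) is my assumption, not something the prompt above pins down:

```python
# Template for the classification prompt; {lat}/{lon} stand in for X and Y.
PROMPT = (
    "Output exactly one of: Texas, Not Texas. If the coordinate is inside "
    "the land borders of the U.S. state of Texas, answer 'Texas'. Otherwise "
    "answer 'Not Texas'. Do not say anything else. "
    "Latitude: {lat:.4f}\N{DEGREE SIGN}, Longitude: {lon:.4f}\N{DEGREE SIGN}."
)

def build_prompt(lat, lon):
    return PROMPT.format(lat=lat, lon=lon)
```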

For models that expose logprobs, we can compute p(Texas) directly. For models that don't, we estimate it by sampling the model multiple times and taking the fraction of "Texas" answers.
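Both paths target the same quantity: with logprobs, p(Texas) is just the exponential of the log-probability assigned to the "Texas" completion; without them, the sample frequency is an unbiased estimate. A sketch, where `ask_model` is a hypothetical stand-in for the real provider API call:

```python
import math
import random

def p_texas_from_logprob(logprob_texas):
    """Log-probability of the 'Texas' completion -> probability."""
    return math.exp(logprob_texas)

def p_texas_by_sampling(ask_model, prompt, n=32):
    """Estimate p(Texas) as the fraction of 'Texas' answers over n samples."""
    hits = sum(ask_model(prompt).strip() == "Texas" for _ in range(n))
    return hits / n

# Stand-in model for demonstration: answers 'Texas' 70% of the time.
rng = random.Random(0)
def fake_model(prompt):
    return "Texas" if rng.random() < 0.7 else "Not Texas"
```

The sampling estimate has binomial standard error sqrt(p(1-p)/n), so n=32 gives roughly ±0.09 near p=0.5; enough to draw a map, but noisy per cell.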

gemini-2.5-flash-lite

accuracy: 74.8%
precision: 55.8%

Overly optimistic!
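Accuracy and precision are computed with "Texas" as the positive class; precision sitting well below accuracy means many false "Texas" calls, i.e. the predicted state spills past the real border. A minimal scorer:

```python
def score(preds, truth):
    """preds, truth: parallel lists of booleans (True = 'Texas').

    Returns (accuracy, precision) with 'Texas' as the positive class.
    """
    tp = sum(p and t for p, t in zip(preds, truth))
    fp = sum(p and not t for p, t in zip(preds, truth))
    correct = sum(p == t for p, t in zip(preds, truth))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return correct / len(truth), precision
```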

Gemini model's prediction of Texas borders

deepseek-v3.2

accuracy: 74.7%
precision: 56.3%

Good performance - notice the adherence to the two straight sides on the left.

DeepSeek model's prediction of Texas borders

gpt-5-nano

accuracy: 79.6%
precision: 61.3%

Almost perfect adherence on the top half - bottom half is shaky.

GPT model's prediction of Texas borders

Conclusion

No model successfully traces the bottom half of Texas (yet). The models appear to know the state's land borders but struggle with the irregular geometry of the coastline. Geometric simplicity and frequent textual reference may be what drive stronger spatial representations. One way to truly gauge the quality of an LLM's world model, then, is a benchmark on coastline accuracy: these natural borders may be harder to encode than political ones.