Why Wordle?
Everyone loves to say “this model can reason.” Wordle is a fun way to test how well.
Wordle is a five-letter deduction game. You guess a word, you get feedback: letters that are correct and in the right place (green), letters that are in the word but in a different position (yellow), and letters that aren’t in the word at all (gray).
Humans naturally carry those constraints forward with each guess. That’s the whole game.
Large Language Models — including GPT-4o — don’t “carry forward” in the same way. They generate text token-by-token based on probability. So I wanted to see: Can GPT-4o track constraints across turns if we force it to slow down and think?
The Method: ReAct-Style Reasoning Loop
I asked GPT-4o to play Wordle using a ReAct-style scaffold. ReAct is short for Reasoning + Acting, and the loop looks like this:
Thought: Here’s my reasoning so far.
Action: I will guess THIS word.
Observation: Here’s the feedback I got (greens / yellows / grays).
…repeat.
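In code, the scaffold is just a loop around a model call. Here's a minimal sketch; `ask_model` is a hypothetical helper (stubbed below, where a real run would call the chat API and parse out the "Action:" line), and the feedback scorer is simplified, ignoring duplicate-letter edge cases:

```python
def ask_model(transcript):
    # Stub standing in for the Thought + Action step. A real implementation
    # would send `transcript` to GPT-4o and parse the guessed word out of
    # the "Action:" line. Here it just returns a fixed guess.
    return "CRANE"

def play_round(secret, guess):
    """Observation step: 'G' = green, 'Y' = yellow, '-' = gray.
    Simplified scoring that ignores duplicate-letter subtleties."""
    feedback = []
    for i, ch in enumerate(guess):
        if secret[i] == ch:
            feedback.append("G")
        elif ch in secret:
            feedback.append("Y")
        else:
            feedback.append("-")
    return "".join(feedback)

transcript = []
for turn in range(6):                      # standard Wordle turn limit
    guess = ask_model(transcript)          # Thought + Action
    feedback = play_round("PLUCK", guess)  # Observation
    transcript.append((guess, feedback))
    if feedback == "GGGGG":
        break
```

The loop itself is trivial; everything interesting (and everything that can go wrong) happens inside the model call.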
The point wasn’t “can you guess the word?” It was: do your guesses line up with the rules you yourself just stated?
I ran two variants:
- Basic ReAct scaffold — model thinks out loud, then guesses.
- Enhanced constraint check — model has to list known constraints before guessing, and explain why its next guess fits those constraints.
Experiment 1: Basic ReAct
In one of the clean runs, GPT-4o solved the puzzle (“PLUCK”) without tripping. It updated its guesses logically, honored feedback, and moved efficiently.
But that wasn’t the common case.
In about half the trials, GPT-4o produced reasoning that sounded careful — and then immediately violated it.
First guess: “CRANE”
Feedback: A and E are yellow (right letters, wrong spots). C, R, N are gray (not in the word).
AI reasoning for second guess:
“The letters ‘A’ and ‘E’ are in the word but in different positions. I’ll try another guess using those letters in different spots, and I’ll avoid C, R, and N.”
AI second guess: “LEAVE.”
Problem: “LEAVE” put A back in position 3 and E back in position 5, the exact placements the feedback had just ruled out.
So even when GPT-4o was narrating its logic, it wouldn’t always follow it. The narration gave the illusion of precision without the execution.
Experiment 2: Enhanced Constraint Evaluation
After seeing that, I tightened the prompt.
Before each new guess, GPT-4o now had to:
- List everything we know (letters confirmed present / absent / locked in place).
- Describe how its next guess followed those rules.
- Only then produce the guess.
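The three steps above can be forced by a prompt template. The wording below is illustrative, not the exact prompt from these runs:

```python
# Sketch of the "constraints first" prompt wrapper. The wording is my own
# illustration of the procedure, not the verbatim prompt used in the tests.
CONSTRAINT_PROMPT = """Before guessing, do all three steps in order:
1. KNOWN CONSTRAINTS: list every confirmed green (letter + position),
   yellow (letter + excluded position), and gray letter so far.
2. JUSTIFICATION: explain how your candidate guess satisfies every
   constraint above.
3. GUESS: output exactly one five-letter word.

Feedback so far:
{history}
"""

def build_prompt(history):
    """history is a list of (guess, feedback) pairs, e.g. ('CRANE', '--Y-Y')."""
    lines = [f"{g}: {f}" for g, f in history]
    return CONSTRAINT_PROMPT.format(history="\n".join(lines))
```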
I also made sure the feedback I gave it was crisp and machine-readable — no ambiguity about what yellow vs green meant.
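For a sense of what "machine-readable" can look like: a standard Wordle scorer that emits one character per letter ('G' green, 'Y' yellow, '-' gray) and handles duplicate letters the way the game does. The encoding here is my own; it isn't necessarily the exact format used in the runs:

```python
from collections import Counter

def score(secret, guess):
    """Return 'G'/'Y'/'-' per letter. Greens are marked first; yellows then
    consume the remaining letter counts, which is how Wordle handles
    duplicate letters."""
    result = ["-"] * 5
    remaining = Counter()
    for i, (s, g) in enumerate(zip(secret, guess)):
        if s == g:
            result[i] = "G"
        else:
            remaining[s] += 1   # unmatched secret letters, available for yellows
    for i, g in enumerate(guess):
        if result[i] == "-" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)
```

With a fixed format like this, there is no room to argue about what yellow versus green means on a given turn.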
Result: It still broke its own rules. Just more politely.
AI reasoning:
“E must not be in the fourth position, so my next guess will avoid placing E in that slot.”
AI guess: “DUVET.”
Issue: E is in the fourth position.
It wasn’t “confused.” It didn’t forget the rule. It stated the rule and then generated a guess that violated it in the same breath.
That’s important. Because it tells us the problem isn’t just “the model needs better instructions.”
What This Actually Means About LLMs
GPT-4o (and other LLMs) are not tracking world state the way a human does. They’re doing high-quality next token prediction. That’s not an insult. It’s just the mechanic.
When the model says, “E can’t be in position 3,” that sentence is not a binding contract. It’s just the statistically likely next sentence after the setup you gave it.
When it immediately guesses a word that does put E in position 3, that’s also just statistically likely text — not “a mistake it is aware of.”
This is why AI can sound logically airtight and still ship you something unusable.
That gap shows up in Support work all the time:
- It will summarize a ticket thread, then confidently mis-assign blame because it “remembers” who caused the issue… except that never actually happened in the thread.
- It will generate troubleshooting steps that violate the safety constraints it just recited to you.
- It will produce an answer that matches tone and policy but is technically impossible on your product version.
So Where Does ReAct Help?
ReAct — this loop of Thought → Action → Observation — still matters. It slows the model down. It gives you checkpoints where you can step in and say “nope.”
In some runs, especially with simpler boards, that was enough to keep GPT-4o honest. It played like a focused analyst instead of a guess generator. That’s promising.
But we shouldn’t pretend ReAct alone solves the deeper issue. It doesn’t magically give the model a working memory of constraints. You still need either:
- external constraint checking (code that rejects bad guesses), or
- human oversight (someone who can read the reasoning and say “that doesn’t follow”).
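The first option can be very small. A sketch of an external checker, using the 'G'/'Y'/'-' feedback convention (the encoding and the `violations` helper are my own illustration, not code from the experiments):

```python
def violations(history, candidate):
    """Return human-readable rule violations for `candidate`, given a list of
    (guess, feedback) pairs where feedback uses 'G' green, 'Y' yellow,
    '-' gray. An empty list means the guess is consistent so far."""
    problems = []
    for guess, fb in history:
        for i, (ch, mark) in enumerate(zip(guess, fb)):
            if mark == "G" and candidate[i] != ch:
                problems.append(f"position {i + 1} must be '{ch}'")
            elif mark == "Y":
                if candidate[i] == ch:
                    problems.append(f"'{ch}' cannot stay in position {i + 1}")
                if ch not in candidate:
                    problems.append(f"'{ch}' must appear somewhere")
            elif mark == "-" and ch in candidate:
                problems.append(f"'{ch}' was ruled out")
    return problems
```

Run against the earlier example, where “CRANE” scored A and E yellow and C, R, N gray (feedback `--Y-Y`), the guess “LEAVE” trips two violations (A stuck in position 3, E stuck in position 5), while a word like “MEDAL” passes clean. A gate this small would have caught the slip before it reached the board.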
Where This Leaves Us
The point of this test was never “ha ha, the AI can’t play Wordle.”
The point was: don’t confuse confident narration with reliable reasoning.
This shows up in real work as “the AI drafted something that looked perfect, so we shipped it.” That’s where teams get burned — legal, product, Support, everyone.
The right move is not “don’t use AI.” The right move is: treat AI like a fast junior analyst whose work always gets reviewed before it leaves the building.
That’s how you get the boost without the fallout.