Teaching AI to Play Poker (Sort Of): When LLMs Meet Game Strategy
I’ve been fascinated by a project that’s been making the rounds lately: BalatroBench, which essentially lets large language models play Balatro, that brilliant poker-inspired roguelike that took the gaming world by storm last year. The concept is simple but elegant — feed the LLM the game state as text, let it decide what to do, and watch it either triumph or faceplant spectacularly.
For those unfamiliar, Balatro is a poker-based roguelike where you build synergies between cards, jokers, and special effects to reach increasingly absurd score targets. It’s the kind of game that requires both strategic planning and tactical decision-making, which makes it a genuinely interesting test for AI reasoning capabilities.
What strikes me about this project is that it’s one of those “finally, a real-world eval” moments. We’ve been drowning in standardised benchmarks that test LLMs on everything from multiple-choice questions to coding challenges, but watching an AI attempt to navigate the strategic complexity of an actual game? That’s something different. The developer built a mod that exposes Balatro’s game state through an HTTP API, and created a framework where you can plug in any OpenAI-compatible model and watch it play. Better yet, you can define different strategies using Jinja2 templates, which means the same model can perform wildly differently depending on how you prompt it.
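To make that concrete, here's roughly what one turn of that loop looks like in this kind of setup: pull the game state over HTTP, render it through a Jinja2 strategy template, and hand the prompt to any OpenAI-compatible client. Fair warning, the endpoint, JSON fields, and action format below are placeholders of my own, not the project's actual API, so read it as a sketch of the pattern rather than BalatroBench itself.

```python
import requests
from jinja2 import Template
from openai import OpenAI

# Hypothetical endpoint and JSON fields -- the real mod's API will differ in detail.
GAME_STATE_URL = "http://localhost:8080/state"

STRATEGY = Template(
    "You are playing Balatro. Round {{ state.round }}, "
    "target score {{ state.target }}.\n"
    "Hand: {{ state.hand | join(', ') }}\n"
    "Jokers: {{ state.jokers | join(', ') }}\n"
    "Reply with exactly one action: PLAY <cards> or DISCARD <cards>."
)

def next_action(client: OpenAI, model: str) -> str:
    """Fetch the current game state, render the strategy prompt, ask the model."""
    state = requests.get(GAME_STATE_URL, timeout=10).json()
    prompt = STRATEGY.render(state=state)
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip()
```

The appealing bit is that the strategy lives entirely in the template, so swapping "chase flushes at all costs" for "hoard economy jokers" is a prompt change that never touches the harness.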
There’s even a Twitch stream where you can watch these models struggle in real time. I’ll admit, there’s something oddly entertaining about watching Claude Opus 4.6 attempt to navigate a Blue Deck run at 2am. It’s like watching a very intelligent person who’s never played poker before trying to figure out why a flush beats a straight.
The whole thing reminds me of those early chess-playing AIs, but with a modern twist. Someone in the discussion mentioned wanting to see this for Dwarf Fortress, which would be absolutely hilarious — imagine trying to explain fortress management through text prompts alone. The carnage would be spectacular.
What’s particularly interesting from a DevOps perspective is the modularity of the setup. You’re not locked into any particular model or provider. Running Ollama locally? Great. Want to burn through your API credits watching GPT-4 play cards? Go ahead. The framework doesn’t care, which is exactly how these tools should be built.
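In practice, that provider-agnosticism usually comes down to nothing more exotic than pointing an OpenAI-style client at a different base URL. I don't know exactly how BalatroBench wires this up internally, but the standard pattern looks something like this, with Ollama's OpenAI-compatible endpoint standing in for the local case:

```python
from openai import OpenAI

# Local model through Ollama's OpenAI-compatible endpoint (Ollama ignores the
# api_key, but the client library requires one to be set).
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Hosted model through the provider's own endpoint (reads OPENAI_API_KEY).
hosted = OpenAI()

def ask(client: OpenAI, model: str, prompt: str) -> str:
    """Same call either way; only the client and the model name change."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```

The harness shouldn't care whether `ask()` is talking to a quantised local model or a frontier API, and by all accounts this one doesn't.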
But here’s where it gets a bit thorny: one commenter raised a valid concern about data contamination. Balatro only launched in February 2024, which means models trained on more recent data might have an unfair advantage. They could have absorbed wiki entries, strategy guides, or YouTube transcripts that explain optimal play patterns. It’s not quite the level playing field you’d want for a proper benchmark, though someone else noted that even obscure game guides get hoovered up by training datasets these days. The Chinese models, apparently, are particularly aggressive about this.
This brings up a broader question about AI evaluation that’s been nagging at me. How do we test these systems on truly novel problems when they’ve potentially seen vast swaths of human knowledge during training? It’s like trying to give someone a test when they might have already memorised the answer key — you’re never quite sure what you’re measuring.
Still, there’s something delightfully practical about this approach. We spend so much time worrying about AI safety, alignment, and existential risks — all valid concerns, don’t get me wrong — that sometimes we forget to just… play with these things. See what they can do. Watch them fail in interesting ways. The Twitch stream showing Opus 4.6 struggling through a run isn’t just entertainment; it’s genuinely informative about how these models reason through complex, multi-step decisions.
The cost is another factor worth considering. Someone calculated that a single run with Opus 4.6 could set you back a thousand dollars in API calls. That’s… well, that’s one expensive game of cards. It highlights the ongoing tension in AI development between capability and accessibility. Sure, the most powerful models can do incredible things, but if it costs the equivalent of a decent gaming PC to watch them play a single game, what’s the practical application?
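That thousand-dollar figure is someone else's calculation and I haven't verified it, but the back-of-envelope arithmetic is easy enough to sketch. Every number below is a guess for illustration; the point is how quickly long contexts and verbose reasoning multiply up over a run's worth of decisions.

```python
def run_cost(decisions: int, in_tok: int, out_tok: int,
             price_in: float, price_out: float) -> float:
    """Estimate the API cost of one full run; prices are USD per million tokens."""
    per_decision = (in_tok * price_in + out_tok * price_out) / 1_000_000
    return decisions * per_decision

# Illustrative guesses only: a long run with accumulated history in the context.
print(run_cost(decisions=400, in_tok=20_000, out_tok=8_000,
               price_in=15.0, price_out=75.0))
# -> 360.0, so reaching the quoted $1,000 implies even longer contexts,
#    more decisions, or heavier reasoning output than assumed here.
```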
The community response has been encouraging, with suggestions to run various self-evolution frameworks over the system to see which models can improve their strategies fastest. That’s when things get really interesting — not just testing static model performance, but measuring their ability to learn and adapt within a constrained environment.
For those of us in IT who remember when getting a computer to play Pong was considered cutting-edge, watching LLMs navigate complex strategy games feels simultaneously mundane and miraculous. We’ve come so far that it’s easy to forget how remarkable this technology actually is. At the same time, watching an AI consistently make rookie mistakes reminds us that we’re still far from the artificial general intelligence that keeps tech commentators up at night.
If you own Balatro and run local models, the project is open source and ready to experiment with. Even if you don’t participate, the leaderboard at BalatroBench makes for fascinating reading — a real-time snapshot of how different AI architectures handle strategic reasoning under pressure.
Sometimes the best way to understand what AI can and can’t do isn’t through academic papers or corporate demos, but by watching it try to win at cards.