nanoDanya
The starting question for nanoDanya was simple: can I take a chess game, turn each move into a token, train a tiny GPT, and sample the next likely move? I had not built a language model before, and chess felt like a contained way to learn the whole stack. It is still hard, but it is smaller than language in the ways I cared about: the board is fixed, the rules are fixed, and the whole game is just a sequence of moves.
After that, the project became a way to understand each part of the system slowly: the data, the tokenization, the model, the training loop, the inference wrapper, and the benchmarks. I wanted one small system where every design choice was visible.
Vocabulary
One of the first choices was tokenization. For normal language, nanochat uses BPE. That makes sense because language keeps producing new words, names, spellings, and variations. If you tried to treat every full word as its own token, the vocabulary would get out of hand very quickly.
Chess is much more closed-vocabulary than natural language. A string like Nf3 already means one complete move. So the natural question was whether the model should predict Nf3 directly as one token, or build it out of smaller pieces.
The tradeoff was straightforward. Smaller tokens keep the vocabulary compact, but they also turn one chess move into several prediction steps. Whole-move tokens make the vocabulary larger, but now one next-token prediction lines up with one chess decision. I liked the simplicity of that.
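To make the tradeoff concrete, here is a minimal sketch that counts prediction steps for a short opening under both schemes. The character-level split (with a separator token between moves) is only a hypothetical stand-in for "smaller pieces"; it is not the project's tokenizer, and a BPE-style scheme would land somewhere in between.

```python
# Hypothetical comparison: one token per SAN move vs. one token per character.
moves = ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "O-O"]

def whole_move_tokens(moves):
    # Each SAN string is a single token, so one prediction equals one move.
    return list(moves)

def char_tokens(moves):
    # Each character is a token, plus a separator so moves stay distinguishable.
    out = []
    for move in moves:
        out.extend(move)
        out.append(" ")
    return out

whole = whole_move_tokens(moves)
chars = char_tokens(moves)
print(len(whole), "prediction steps with whole-move tokens")
print(len(chars), "prediction steps with character tokens")
```

On this toy line the character scheme needs more than three times as many predictions for the same seven moves, which is the cost side of the smaller vocabulary.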
To sanity-check this, I looked at the dataset. I wanted to know how many distinct SAN moves actually show up, and whether the distribution is broad or repetitive. It turned out to be much more repetitive than I expected: there are 4,517 distinct move tokens, the top 100 already cover 47.6% of all move tokens, and the top 500 reach 91.5%.
Another way to say it is that only 110 moves (2.4% of the move vocabulary) cover half of all move tokens, and 830 moves (18.4%) are enough to reach 99% coverage.
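Numbers like these come from a plain frequency count. The sketch below shows one way to compute them, assuming each game is already a list of SAN strings; the function name and the return layout are made up for illustration, and the toy games are not from the real dataset.

```python
from collections import Counter

def coverage_stats(games, top_ks=(100, 500), targets=(0.5, 0.99)):
    """games: iterable of games, each a list of SAN move strings."""
    counts = Counter(move for game in games for move in game)
    total = sum(counts.values())

    # Cumulative share of all move tokens, most frequent move first.
    cumulative = []
    running = 0
    for _, count in counts.most_common():
        running += count
        cumulative.append(running / total)

    # Share covered by the k most frequent moves (clamped to the vocabulary size).
    top_k_coverage = {k: cumulative[min(k, len(cumulative)) - 1] for k in top_ks}

    # Smallest number of distinct moves needed to reach each coverage target.
    moves_needed = {
        t: next(i + 1 for i, cov in enumerate(cumulative) if cov >= t)
        for t in targets
    }
    return {
        "distinct": len(counts),
        "top_k_coverage": top_k_coverage,
        "moves_needed": moves_needed,
    }

toy_games = [["e4", "e5", "Nf3"], ["e4", "c5", "Nf3"], ["d4", "d5", "c4"]]
stats = coverage_stats(toy_games, top_ks=(2,), targets=(0.5,))
print(stats)
```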
So the vocabulary gets larger, but a few thousand move tokens is still manageable. In chess, it still feels reasonable to let one token stand for one move.
It also makes the experiment easier to read. The model is predicting chess moves directly, rather than assembling move notation a piece at a time.
Model
I had not built transformers before, and I mainly wanted to get off the ground and start training something. So I started from nanochat and kept its GPT backbone mostly the same. I did not spend much time tuning the architecture. In practice, I just kept the same rough model size nanochat started from: 12 layers, 6 attention heads, and an embedding dimension of 768.
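As a rough picture of that size, the sketch below writes the three numbers down as a config and derives two quantities from them. The class and method names are hypothetical, and the parameter estimate assumes a standard GPT block (four d-by-d attention matrices plus an MLP with 4x expansion); nanochat's exact layer layout may differ, so treat the count as an order-of-magnitude figure that also leaves out the embedding tables.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Values from the text above; the field names are illustrative.
    n_layer: int = 12
    n_head: int = 6
    n_embd: int = 768

    @property
    def head_dim(self):
        # The embedding dimension is split evenly across attention heads.
        assert self.n_embd % self.n_head == 0
        return self.n_embd // self.n_head

    def approx_block_params(self):
        # Assumption: 4*d^2 for attention (q, k, v, output projection)
        # plus 8*d^2 for a 4x-expansion MLP, per layer. Embeddings excluded.
        return self.n_layer * 12 * self.n_embd ** 2

cfg = GPTConfig()
print("head dimension:", cfg.head_dim)
print("approx. transformer-block parameters:", cfg.approx_block_params())
```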
Context length was the one place where chess changed the setup. In a language model, the next token is predicted from all the tokens before it, so a large context window only helps if much earlier parts of the sequence are still useful. nanochat also uses a small context window, but there that tradeoff can show up in the quality of the outputs. In chess, once one token stands for one move, the useful context is basically just the game so far. A game is short, so there was not much reason to pay for a huge window. I checked the dataset to make sure 512 tokens were already enough to cover basically any full game I had, and they were. That kept the setup small, and it kept training and inference cheap too. Once that was set, the more interesting question was how to train it.
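The dataset check can be as simple as counting tokens per game. This is a minimal sketch, assuming one token per half-move plus two special tokens per game (a start and an end marker); the function name and the `n_special` default are assumptions for illustration.

```python
def context_coverage(games, context_len=512, n_special=2):
    """games: list of games, each a list of move tokens (one per half-move)."""
    # Token count per game: every half-move plus the special tokens.
    lengths = [len(game) + n_special for game in games]
    fitting = sum(1 for n in lengths if n <= context_len)
    return {
        "max_tokens": max(lengths),
        "fraction_fitting": fitting / len(lengths),
    }

# Toy data: one short game and one absurdly long one.
toy_games = [["e4"] * 10, ["e4"] * 600]
report = context_coverage(toy_games)
print(report)
```

Under these assumptions a 512-token window holds 510 half-moves, or 255 full moves, which is far longer than a typical game.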
Training
I wanted to do this in a slightly zero-to-hero way at first. So the first training loops were just in a notebook on my own computer, on only a few thousand games, where I could watch what each step was doing. I was not trying to build a strong chess model yet. I just wanted to know if next-move prediction alone would do anything at all.
Once that seemed worth pushing a little further, I moved the same basic setup onto Modal so I could use a GPU and just get more training steps in. The first sign was the loss. I was kind of thrilled when it started dropping. Then the games started making more sense too. The model could get through openings, reach recognizable middlegames, and sometimes even win.
I kept sanity-checking it in direct ways too: letting it play full games, seeing whether the openings held together, and later trying it against weak Stockfish. Once that first run worked, I mostly just wanted to make it bigger and see what happened. So I kept the setup the same and increased the obvious things. The first serious version of that was roughly 500k games from players rated above 2000 Elo. The models got bigger too. By then I could also make the checkpoints play each other, and the difference showed up pretty quickly. In a 100-game match, the 8-layer model beat the smaller 4-layer one 53-1. The 12-layer one beat the same 4-layer model 69-3. What I noticed more than the score was the way those games ended: the bigger models were much better at actually finishing games, while the smaller one kept drifting into long draws.
At some point I got curious about where the model was actually messing up. So I looked at the validation loss, but broken out by move number instead of averaging everything together. For each real game, I stepped through it move by move, checked the loss on the actual next move, and then grouped those losses by where they happened in the game. The opening was mostly okay. The middlegame was where the loss started getting worse.
That was not very surprising. Openings are repetitive. The model sees the same structures and move sequences over and over again there. By the middlegame, the game has branched out, the positions are sharper, and there are just many more ways to be wrong.
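The per-move breakdown described above can be sketched without any model code. The version below assumes you have already collected, for each position in each validation game, the probability the model gave to the move that was actually played; the bucket size and the function name are illustrative choices.

```python
import math
from collections import defaultdict

def loss_by_move_number(games_probs, bucket_size=10):
    """games_probs: list of games; each game is a list of probabilities the
    model assigned to the actual next move, in playing order."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for probs in games_probs:
        for ply, p in enumerate(probs):
            bucket = ply // bucket_size
            sums[bucket] += -math.log(p)  # cross-entropy on the real move
            counts[bucket] += 1
    # Mean loss per bucket, ordered from the opening onward.
    return {b: sums[b] / counts[b] for b in sorted(sums)}

# Toy example: confident early, less confident later.
toy = [[0.5, 0.5, 0.25, 0.25]]
per_bucket = loss_by_move_number(toy, bucket_size=2)
print(per_bucket)
```

Averaging inside buckets instead of over the whole game is what makes the opening-versus-middlegame gap visible.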
Around the same time, I had also started playing with Lichess puzzles as a separate dataset. That made the next hypothesis tempting: if the model was struggling in the middlegame, maybe it needed sharper tactical games. At first I expected puzzles to be the important ingredient.
To check that, I trained the same kind of model twice, changing only the games it saw. One model got roughly 500k actual games from players rated above 2000. Another got 500k full games that happened to contain Lichess puzzles. If puzzle-linked games were automatically better, this should have shown up.
It did not. The normal 500k-game model actually beat the puzzle-linked one. That was useful. It ruled out the cleanest version of the puzzle story.
What did work was adding more games. When I took the same plain next-move setup from 500k actual games to about 3M actual games, it became much stronger. When I gave the puzzle-linked version roughly 5M games, that improved too. That was a little annoying, because it made the result less clean, but also more interesting.
The best model is still the large puzzle-linked one. But I no longer think the right sentence is "puzzles made it good." A more honest sentence is: the plain model keeps getting better when I give it many more good games. Whether puzzle-linked games are better than normal games at the same size is still open.
This matters because the project is trying to make the language model internalize chess, not hide mistakes in the wrapper. Runtime legality checks can keep games running, but they are not the prize. The model that moved the project forward was not the one with the cleverest inference trick or training trick. It was the plain model trained on enough good move sequences that it needed those tricks less often.
I am still playing with two other training ideas. One is giving extra weight to the actual puzzle solution moves inside full games. The other is asking the model to learn a rough position evaluation alongside the next move. I want to try both a bit more before writing much about them.
Inference
After training, the basic loop is almost embarrassingly small. Start with <bos>, feed the tokens so far through the model, look at the logits at the last position, and choose one next token. If that token is e4, I push e4 onto a chess board and repeat.
This is the part I liked in practice: one token, one move.
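The "choose one next token" step is the only part with any math in it. Here is a minimal sketch of that step on a plain list of logits; the temperature parameter and the four-entry vocabulary are assumptions for illustration, not settings taken from the project.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Pick one token id from the logits at the last position."""
    if temperature == 0:
        # Greedy decoding: take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    # Softmax weights, shifted by the max for numerical stability.
    weights = [math.exp(x - top) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

id_to_move = ["<bos>", "<eos>", "e4", "d4"]  # toy vocabulary
logits = [-5.0, -5.0, 2.0, 1.5]              # made-up model output
greedy = sample_next_token(logits, temperature=0)
sampled = sample_next_token(logits, rng=random.Random(0))
print(id_to_move[greedy], id_to_move[sampled])
```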
For actually playing games, I usually add a legal-move filter before sampling. That keeps the game alive, but it is also a hack in exactly the place where I would rather not have one. It lets the wrapper remove moves the model should have known not to say. So it answers a narrower question: what does the model choose among legal moves? It does not answer the more interesting question of whether the model knew the move was legal in the first place.
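A legal-move filter of this kind usually amounts to masking logits before the choice is made. The sketch below assumes the set of legal token ids is supplied from outside (in practice a board library such as python-chess could provide it); the function names are hypothetical.

```python
import math

def mask_illegal(logits, legal_ids):
    # Illegal tokens get -inf so they can never be chosen.
    return [x if i in legal_ids else -math.inf for i, x in enumerate(logits)]

def pick_legal(logits, legal_ids):
    if not legal_ids:
        raise ValueError("no legal moves: the game is already over")
    masked = mask_illegal(logits, legal_ids)
    return max(range(len(masked)), key=masked.__getitem__)

logits = [2.0, 3.5, 1.0, 0.5]  # made-up scores for four move tokens
legal = {0, 2}                 # pretend only tokens 0 and 2 are legal here

raw_choice = max(range(len(logits)), key=logits.__getitem__)
filtered_choice = pick_legal(logits, legal)
print("raw:", raw_choice, "filtered:", filtered_choice)
```

In the toy case the model's raw favorite is illegal and the filter quietly swaps in the best legal alternative, which is exactly the information the wrapper hides.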
The other small piece is stopping. The dataset has <eos>, so the model can learn to end games too. In practice I still keep boring guardrails like a maximum move count, because otherwise one bad sample can turn an experiment into an infinite game. That is useful engineering, but it also goes slightly against the spirit of the project. I want the model to learn when a game is over, not rely on the harness to rescue it.
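Both stopping paths fit in a few lines. This is a sketch under stated assumptions: `next_token_fn` stands in for the model plus sampling, and the cap of 300 is an arbitrary illustrative value, not the project's setting. Returning the reason for stopping keeps the two cases separate, so it stays visible how often the harness had to step in.

```python
BOS, EOS = "<bos>", "<eos>"

def play_out(next_token_fn, max_moves=300):
    """Generate one game; report whether the model or the harness ended it."""
    tokens = [BOS]
    while len(tokens) - 1 < max_moves:
        token = next_token_fn(tokens)
        if token == EOS:
            return tokens[1:], "model_stopped"
        tokens.append(token)
    return tokens[1:], "hit_move_cap"

def scripted(tokens):
    # Stand-in model that plays two moves and then ends the game.
    script = ["e4", "e5", EOS]
    return script[len(tokens) - 1]

def endless(tokens):
    # Stand-in model that never emits the end token.
    return "Nf3"

finished = play_out(scripted)
capped = play_out(endless, max_moves=5)
print(finished)
print(capped)
```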
I put this behind a Lichess bot mostly because it forced the loop to be real. Instead of sampling games in a notebook, the model has to answer one position at a time against whoever challenges it. The model-facing part is still the same: the moves so far go in, one next move comes out. You can challenge nanoDanya on Lichess; under the hood it is still the same loop, waiting for a position and answering with one move.
Benchmarking
Making games cheap enough to run
The least scientific benchmark was that I played against the model myself. That was useful for a while. It made the failures easy to feel. If the model hung a queen, repeated nonsense, or drifted into a dead ending, I noticed immediately. At one point the Lichess bot even beat me as Black, which was not real evidence of strength, but was at least a good deployment sanity check: the same next-token loop made it through a real game and I resigned.
That was a fun way to poke at it, but it was a terrible way to measure anything. I could play one game, maybe a few, and notice obvious problems. It was not enough to say one checkpoint was actually better than another.
Once I stopped being the benchmark, the next problem was speed. To compare two checkpoints, I needed a small program to keep the boards, ask each model for moves, apply the moves, and record the results. My first version did that one game at a time. That meant one tiny model call for one position, then a board update, then another tiny model call. A 200-game comparison took around six minutes locally.
The fix was to make each model call do more work. Instead of asking for the next move in one game, I kept many unfinished games open and asked for the next move in all of them together. That is batch inference: the model sees many positions in one forward pass, and the GPU is less idle. Modal helped because I could run the same checkpoint-vs-checkpoint comparison on a few remote GPU workers. For a 200-game run, each worker plays a different range of games from the same matchup, and the script merges those game records into one result. Two things changed at once here: I batched the model calls, and I stopped running everything on my laptop. The practical result was still the thing I cared about: the same 200-game comparison now finishes in under a minute.
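The shape of that change can be sketched with a stand-in model. The code below is a simplification under loud assumptions: `batch_next_fn` is a hypothetical function that takes many token sequences and returns one next token per sequence, there is a single model instead of two alternating checkpoints, and there is no board, padding, or GPU. It only shows the two ideas from the paragraph: one call per step for all open games, and splitting a run into per-worker game ranges.

```python
BOS, EOS = "<bos>", "<eos>"

def play_batched(batch_next_fn, n_games, max_moves=300):
    games = [[BOS] for _ in range(n_games)]
    active = list(range(n_games))  # indices of unfinished games
    model_calls = 0
    while active:
        batch = [games[i] for i in active]
        next_tokens = batch_next_fn(batch)  # one call covers every open game
        model_calls += 1
        still_open = []
        for i, token in zip(active, next_tokens):
            if token == EOS:
                continue  # this game is done; drop it from the batch
            games[i].append(token)
            if len(games[i]) - 1 < max_moves:
                still_open.append(i)
        active = still_open
    return games, model_calls

def split_game_ranges(n_games, n_workers):
    # Contiguous, near-equal ranges of game indices, one per worker.
    base, extra = divmod(n_games, n_workers)
    ranges, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

def stub_batch_model(batch):
    # Stand-in model: every game lasts three moves, then ends.
    return ["e4" if len(seq) < 4 else EOS for seq in batch]

games, model_calls = play_batched(stub_batch_model, n_games=8)
ranges = split_game_ranges(200, 3)
print(model_calls, "batched calls for", len(games), "games")
print([len(r) for r in ranges])
```

With the stub, eight games need four model calls in total, where a one-game-at-a-time loop would need four calls per game.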
What I compared
The benchmark story is mostly about two plain next-move families. The first is actual-game data: the original 500k-game baseline, and a later 3M-game version trained the same way on more games. The second is puzzle-linked data: a 500k-game control, and a much larger full run with 5M games.
The important part is that all four are still the simple version of the project: predict the next move token from the moves so far. No tactical upweighting, no eval head, no new architecture. That keeps the comparison focused: what changed was mostly the source of the games and the amount of data.
Raw legality
The first benchmark I care about is also the simplest one: before any legal-move filter, is the model's favorite move actually legal? That felt closer to the point of the project. I do not just want a wrapper that can steer a model through chess. I want the model to learn enough chess that its first instinct is usually a real move.
Inside the actual-game family, scale helped. The 3M-game model got much cleaner than the original 500k-game baseline. Its first-choice move was illegal in 2.27% of positions, compared with 5.54% for the 500k baseline.
The large puzzle-linked model was still cleaner. Its first-choice move was illegal in only 1.05% of positions. That is the best raw legality result I have in this group. But it also saw more games than the 3M actual-game model, so it is not a clean source-only comparison.
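The raw legality number is a top-1 check with the filter switched off. A minimal sketch, assuming each evaluated position comes as a pair of the model's logits and the set of legal token ids for that position (the function name and data layout are hypothetical):

```python
def raw_illegal_rate(positions):
    """positions: iterable of (logits, legal_ids) pairs, one per position."""
    illegal = 0
    total = 0
    for logits, legal_ids in positions:
        # The model's unfiltered favorite move.
        top = max(range(len(logits)), key=logits.__getitem__)
        if top not in legal_ids:
            illegal += 1
        total += 1
    return illegal / total

toy_positions = [
    ([0.1, 2.0, 0.3], {1, 2}),  # favorite is token 1: legal
    ([3.0, 0.2, 0.1], {1}),     # favorite is token 0: illegal
    ([0.0, 0.5, 1.5], {2}),     # favorite is token 2: legal
    ([1.0, 0.9, 0.8], {0, 1}),  # favorite is token 0: legal
]
rate = raw_illegal_rate(toy_positions)
print(f"{rate:.2%} of first-choice moves were illegal")
```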
Head-to-head games
The easiest benchmark to understand is still to let two checkpoints play each other. I used that a lot, because it catches things that loss curves do not. A model can have a nice validation loss and still make strange choices when the game gets long. Full games make that visible. But these games still use the legal-move filter from the inference loop, so I read them next to the raw legality numbers, not instead of them.
In those games, the same pattern showed up. The 3M actual-game model was much stronger than the 500k actual-game baseline, scoring 78.5% against it. The large puzzle-linked model in turn beat the 3M actual-game model, scoring 77.5% against it. That comparison is useful, but it is not size-matched: the puzzle-linked model saw 5M games, while the actual-game model saw 3M.
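For reading score percentages like these, the usual chess convention is one point per win and half a point per draw. The sketch below assumes that convention; the text does not say how the percentages were computed. The worked example reuses the 53-1 result from the 100-game match mentioned earlier and assumes the remaining 46 games were all draws.

```python
def match_score(wins, draws, losses):
    """Score fraction under the assumed convention: win = 1, draw = 0.5."""
    games = wins + draws + losses
    return (wins + 0.5 * draws) / games

# Assumption: 53 wins, 1 loss, and every other game in the 100 was a draw.
score = match_score(wins=53, draws=46, losses=1)
print(f"{score:.1%}")
```

Under that convention a lopsided win-loss record can still produce a moderate score when many games are drawn, which is why the percentage and the win-loss record are worth reading together.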