nanoDanya
I built nanoDanya to teach myself how language models are trained. It is a small language model trained on chess games, and very much a software engineer's version of the project.
Vocabulary
One of the first choices was tokenisation. For natural language, nanochat uses BPE (byte-pair encoding). That makes sense because language keeps producing new words, names, spellings, and variations. If you tried to treat every full word as its own token, the vocabulary would get out of hand very quickly.
Chess seemed more contained. A string like Nf3 already means one complete move. So the natural question was whether the model should predict Nf3 directly as one token, or build it out of smaller pieces.
That seemed like the real tradeoff. Smaller tokens keep the vocabulary compact, but they also turn one chess move into several prediction steps. Whole-move tokens make the vocabulary larger, but now one next-token prediction lines up with one chess decision. I liked the simplicity of that.
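To make the whole-move option concrete, here is a minimal sketch of what such a tokenizer could look like. This is an illustration, not nanoDanya's actual code; the class name, special tokens, and move list are all assumptions.

```python
# Sketch of a whole-move tokenizer: each SAN move string is one token.
# The special tokens and the tiny move set here are illustrative.

class MoveTokenizer:
    def __init__(self, moves):
        # reserve a few special tokens ahead of the move vocabulary
        self.specials = ["<pad>", "<bos>", "<eos>"]
        self.id_of = {tok: i for i, tok in enumerate(self.specials + sorted(moves))}
        self.tok_of = {i: t for t, i in self.id_of.items()}

    def encode(self, game):
        # one SAN move -> one token id, wrapped in begin/end markers
        return ([self.id_of["<bos>"]]
                + [self.id_of[m] for m in game.split()]
                + [self.id_of["<eos>"]])

    def decode(self, ids):
        # drop special tokens and join the moves back into SAN text
        return " ".join(self.tok_of[i] for i in ids
                        if self.tok_of[i] not in self.specials)

tok = MoveTokenizer({"e4", "e5", "Nf3", "Nc6"})
ids = tok.encode("e4 e5 Nf3")
assert tok.decode(ids) == "e4 e5 Nf3"
```

With BPE, `Nf3` might become two or three sub-tokens; here it is always exactly one prediction step.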
To sanity-check this, I looked at the dataset. I wanted to know how many distinct SAN moves actually show up, and whether the distribution is broad or repetitive. It turned out to be much more repetitive than I expected: there are 4,517 distinct move tokens, the top 100 already cover 47.6% of all move tokens, and the top 500 reach 91.5%.
Another way to say it is that only 110 moves (2.4% of the move vocabulary) cover half of all move tokens, and 830 moves (18.4%) are enough to reach 99% coverage.
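The coverage numbers above come from a very simple kind of calculation: count every move token in the dataset, then ask what fraction the top-k most frequent moves account for. A sketch of that check, with a toy distribution standing in for the real dataset:

```python
from collections import Counter

def coverage(move_counts, k):
    """Fraction of all move tokens covered by the k most frequent moves."""
    total = sum(move_counts.values())
    top = sum(c for _, c in move_counts.most_common(k))
    return top / total

# toy example with a skewed distribution, echoing the shape of the real data
counts = Counter({"e4": 50, "e5": 40, "Nf3": 30, "d4": 20,
                  "h3": 5, "a3": 3, "Qh5": 2})
print(f"top 2 cover {coverage(counts, 2):.1%} of tokens")  # top 2 cover 60.0% of tokens
```

Run over the real training set, this is the computation behind "top 100 cover 47.6%, top 500 reach 91.5%".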
So whole-move tokens do make the vocabulary bigger than a sub-word one, but not by that much. In chess, it still feels reasonable to let one token stand for one move.
It also makes the experiment easier to read. The model is predicting chess moves directly, rather than assembling move notation a piece at a time.
This is the part I liked in practice: one token, one move.
Model
I had not built transformers before, and I mainly wanted to get off the ground and start training something. So I started from nanochat and kept its GPT backbone mostly the same: 12 layers, 6 attention heads, and an embedding dimension of 768. That was not the result of much tuning. It was just a simple place to start that already worked.
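Those hyperparameters fit in a small config object. The sketch below uses nanoGPT-style field names; the actual names in nanochat may differ, and the vocab size shown is just the move count from above rounded up with room for special tokens.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    n_layer: int = 12       # transformer blocks
    n_head: int = 6         # attention heads per block
    n_embd: int = 768       # embedding dimension
    block_size: int = 512   # context length in tokens
    vocab_size: int = 4608  # ~4,517 move tokens plus specials (illustrative rounding)

cfg = GPTConfig()
# head dimension must divide evenly: 768 / 6 = 128 per head
assert cfg.n_embd % cfg.n_head == 0
```

Nothing here was tuned; the point is that the backbone was inherited as-is so the experiments could start.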
The part I thought about a bit more was context length. I was used to hearing about language models with huge context windows, but chess did not seem like that kind of problem. Games are not that long, and most of what matters seems to live in a pretty short move history.
The median game in the training set is 74 moves, and even the longest is 301 moves, so 512 tokens already leaves plenty of room. That felt like a better fit for the problem, and it also made training and inference cheaper. Once that was set, the more interesting question was how to train it.
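The context-length decision reduces to a one-line check: does the window fit every game, with a little room for special tokens? A toy version of that sanity check, using the lengths quoted above (the helper and overhead value are illustrative assumptions):

```python
def fits(game_len, context=512, overhead=2):
    """Does a game of game_len move tokens fit in the context window,
    leaving room for a couple of special tokens?"""
    return game_len + overhead <= context

# median-ish game, a longer one, and the longest game in the training set
lengths = [74, 120, 301]
assert all(fits(n) for n in lengths)
```

Since even the 301-move outlier fits with half the window to spare, 512 is comfortably safe while keeping attention cheap.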