How many ways can you fill a 9x9 grid, obeying all the rules of the sudoku puzzle? The answer is too big to just calculate directly on a computer, so an exact answer takes careful analysis. But if an absolutely exact answer isn’t required, we can get a good statistical approximation using a Monte Carlo algorithm. As a bonus, the algorithm doesn’t need any application-specific analysis and works on many other problems, too. It’s a handy “stupid things that work” approach to solving problems.

## Sudoku

If you’re already familiar with how the puzzle works, you can skip straight to the next section.

A sudoku puzzle is a 9x9 grid, with some squares already containing digits from 1 to 9. The problem is to fill in the remaining squares, while obeying the following constraints:

- Each square contains one digit from 1 to 9
- Each row contains all the digits from 1 to 9 exactly once each
- Each column contains all the digits from 1 to 9 exactly once each
- Each of the nine 3x3 boxes of squares contains all the digits from 1 to 9 exactly once each
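These constraints are easy to check mechanically. Here's a quick Python sketch (the function name is my own) that validates a completed grid:

```python
def is_valid_grid(grid):
    """Check a completed 9x9 grid (list of 9 lists of 9 ints) against the sudoku rules."""
    digits = set(range(1, 10))
    # Each row must contain 1-9 exactly once.
    for row in grid:
        if set(row) != digits:
            return False
    # Each column must contain 1-9 exactly once.
    for c in range(9):
        if {grid[r][c] for r in range(9)} != digits:
            return False
    # Each 3x3 box must contain 1-9 exactly once.
    for br in range(0, 9, 3):
        for bc in range(0, 9, 3):
            box = {grid[br + dr][bc + dc] for dr in range(3) for dc in range(3)}
            if box != digits:
                return False
    return True
```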

There’s normally only one possible solution. Here’s an example puzzle:

Here’s its unique solution:

## A Simple Random Counting Algorithm that Doesn’t Work

To get a feel for the problem of statistically counting the number of possible complete sudoku grids, let’s look at a stupid thing that doesn’t work. The easiest and most general way to do statistical counting is to sample and extrapolate. We generate “sample” grids by randomly filling in all squares with digits from 1 to 9, and then check how many of the resulting grids are valid. If, say, 1% of the sample grids are valid, then we can estimate that 1% of all possible 9x9 grids are valid. The more sample grids we generate, the more accurate the estimate should get.

This algorithm is easy to implement, and totally mathematically correct, but it’s completely impractical because the probability of generating a valid grid is ridiculously low — much, much lower than 1%. It turns out we’re better off trying to guess cryptographic keys for a living.

As a formula, the general approach looks like this:

$$N_{\star} = \frac{N_{s\star}}{N_s} N_P$$

Where $N_s$ is the total number of grids we generate in our sample, $N_{s\star}$ is the number that turned out to be valid, and $N_P$ is the total number of grids we *could* have randomly generated (the “population”). In this case, $N_P$ is really easy to calculate because we just have 81 squares that can independently take on 9 possible values, giving $N_P = 9^{81}$. The problem is the hopeless inefficiency of $\frac{N_{s\star}}{N_s}$. We can make the sampling process more efficient by taking the sudoku constraints into account, instead of just filling squares in blindly, but then calculating $N_P$ takes more complicated analysis. This isn’t “stupid things that work” any more.
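For the record, here's a Python sketch of the blind sampler (function names are my own), scaled down to the 4x4 sudoku variant. Even there, only 288 of the $4^{16}$ (about 4.3 billion) possible grids are valid, so almost every run of a modest sample returns an estimate of 0:

```python
import random

def naive_estimate(samples, size=4):
    """Estimate the number of valid size x size grids by blind sampling.

    Almost always returns 0 in practice -- that's the point.
    """
    n = size                      # assume n is a perfect square (4 or 9)
    box = int(n ** 0.5)
    digits = set(range(1, n + 1))

    def is_valid(grid):
        for row in grid:
            if set(row) != digits:
                return False
        for c in range(n):
            if {grid[r][c] for r in range(n)} != digits:
                return False
        for br in range(0, n, box):
            for bc in range(0, n, box):
                if {grid[br + i][bc + j] for i in range(box) for j in range(box)} != digits:
                    return False
        return True

    hits = 0
    for _ in range(samples):
        grid = [[random.randint(1, n) for _ in range(n)] for _ in range(n)]
        if is_valid(grid):
            hits += 1
    population = n ** (n * n)     # N_P: 4^16 here, 9^81 for real sudoku
    return hits / samples * population
```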

## A Clever Stupid Thing that Actually Works

Fortunately, there are other tricks in the Monte Carlo toolbox. I’m going to use a technique that Donald Knuth explained in the 1970s. He originally presented it as an easy way to estimate if a complex problem is small enough to just solve with a brute-force backtracking solution, but it’s really a general technique for estimating anything that can be represented as a sum on a tree data structure. It’s actually pretty clever, but I’m putting it in the “stupid things that work” basket because it can be used without doing any clever analysis of the application.

To keep the diagrams small, I’ll explain the idea using the 4x4 version of sudoku.

Imagine we have a blank 4x4 grid, and we want to make a randomised attempt at generating a valid solution grid. This time we’ll take the rules of sudoku into account when choosing what digit to put where, to increase the chances of getting a valid grid. Also, for now we’ll fill in the squares left to right, top to bottom, from the top-left corner. Filling in the first square is easy — we can pick any of the four possible digits at random.

For the second square, we’ll only consider the three digits that still leave us with a potentially solvable grid.

That leaves two possibilities for the third square.

And just one for the last square in the row.

Starting the second row, the first square has two possibilities.

We keep going like this until we either successfully complete the grid, or get stuck on a blank square with no options left:

We can draw a tree diagram that branches out with each set of choices. Attempting to randomly fill a grid corresponds to taking a random walk, starting from the root.

The random walk stops when it reaches a leaf, which corresponds to either a complete, valid solution grid, or to a dead-end failure. Let’s give each node a score. A leaf that represents a valid solution grid gets a score of 1, and a dead end gets a score of 0. We might as well give the internal nodes a score of 0, too. Then the total score over all nodes in the tree (including ones we don’t see in a single random walk) is equal to the total number of leaves that represent solutions, which is equal to the total number of solutions. Knuth’s technique lets us estimate this total score.

How does it work? One random walk tells us the score of one leaf and all nodes leading up to it, but the problem is
that we don’t know anything about the other nodes. The Knuth approach is to just be naïve and assume the tree is
totally symmetric — i.e., that all the other possible walks, and the leaves they lead to, are the same as the one we
saw. For example, the third square we tried to fill in had two possibilities, so we assume that *all* nodes at
this level in the tree have two branches. This assumption is obviously wrong in general, so the estimate we calculate
will probably be wrong, too. But it’s easy to prove that it’s not biased. Sometimes it will over-estimate, sometimes it
will under-estimate, but if we keep taking an average over multiple trials, the result will converge on the correct
answer.

In this specific application, the calculation becomes extremely simple. We do a random walk. If we end up at a leaf with score 0 (i.e., we fail to generate a valid grid), we assume that all leaves have score 0, and our total estimate is 0. If we arrive at a leaf with score 1, then we need to estimate the number of leaves. Every node that branches $N$ ways increases the number of possible walks by a factor of $N$, so the estimated number of leaves (assuming that nodes keep branching in the same way) is just the product of the number of choices we had at each step of the walk from root to leaf.
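Here's a minimal Python sketch of one such walk on the 4x4 grid (names are my own). Each run returns one unbiased sample of the total number of solutions:

```python
import random

def knuth_estimate(n=4):
    """One random walk down the tree of partial n x n sudoku grids.

    Returns the product of branching factors if the walk reaches a
    complete valid grid, or 0 if it hits a dead end.
    """
    box = int(n ** 0.5)
    grid = [[0] * n for _ in range(n)]
    estimate = 1
    for cell in range(n * n):            # fill left-to-right, top-to-bottom
        r, c = divmod(cell, n)
        used = set(grid[r])                              # digits in this row
        used |= {grid[i][c] for i in range(n)}           # ...this column
        br, bc = r - r % box, c - c % box
        used |= {grid[br + i][bc + j] for i in range(box) for j in range(box)}  # ...this box
        choices = [d for d in range(1, n + 1) if d not in used]
        if not choices:
            return 0                     # dead end: leaf with score 0
        estimate *= len(choices)         # assume all siblings branch the same way
        grid[r][c] = random.choice(choices)
    return estimate
```

Averaging many runs of this converges on 288, the true number of 4x4 grids.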

This algorithm is a huge improvement on the first one, and it’s good enough for the 4x4 case. Here’s a set of ten samples:

| Sample estimate |
| --- |
| 384 |
| 0 |
| 0 |
| 384 |
| 384 |
| 384 |
| 384 |
| 384 |
| 0 |
| 384 |
The average is 268.8, which is already within 10% of the correct answer of 288 (which is actually easy to calculate by just generating all the grids). Remember that Knuth’s original application was estimating whether brute-forcing a problem would take minutes, days or years. He said he often did this test by hand before writing code.
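That exhaustive count is only a few lines of backtracking. Here's a Python sketch (names are my own):

```python
def count_grids(n=4):
    """Count all valid n x n sudoku grids by brute-force backtracking."""
    box = int(n ** 0.5)
    grid = [[0] * n for _ in range(n)]

    def allowed(r, c):
        # Digits not yet used in this square's row, column or box.
        used = set(grid[r]) | {grid[i][c] for i in range(n)}
        br, bc = r - r % box, c - c % box
        used |= {grid[br + i][bc + j] for i in range(box) for j in range(box)}
        return [d for d in range(1, n + 1) if d not in used]

    def fill(cell):
        if cell == n * n:
            return 1                     # every square filled: one valid grid
        r, c = divmod(cell, n)
        total = 0
        for d in allowed(r, c):
            grid[r][c] = d
            total += fill(cell + 1)
            grid[r][c] = 0               # backtrack
        return total

    return fill(0)
```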

## Improvements and Success

Unfortunately, it’s not quite good enough for 9x9 grids. The random walk often hits a dead end before completing the
first three rows, so most of our estimates are 0. We still need to make the sampling a bit more efficient. With a tiny
bit of analysis it’s not hard to come up with an algorithm that can randomly fill in the first bunch of rows, and this
would probably make the sampling much more efficient. But the game is about *not* doing application-specific analysis, so in the name of stupid things that work, I just filled in the squares in random order instead of sequentially. This spreads the constraints more evenly across the grid, so a walk tends to get much further before hitting a dead end.

The chance of successfully filling in all 81 squares is still vanishingly low, but we can solve the problem with a hybrid approach. It’s not feasible to generate all the solutions of a blank 9x9 grid, but once we’ve filled in about twenty-something squares, we can feasibly use a simple backtracking search to count all possible solutions from there on. In terms of the tree representation, this is like taking big subtrees of the tree and replacing them with leaves that have a score equal to the total score of the subtree.
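Here's a rough Python sketch of the hybrid, scaled down to the 4x4 grid so it can be run quickly (the names and the prefill depth are my own; for 9x9 the prefill would be about twenty-something squares):

```python
import random

def hybrid_estimate(n=4, prefill=4):
    """Knuth-style random walk for `prefill` squares (in random order),
    then an exact backtracking count of all completions."""
    box = int(n ** 0.5)
    grid = [[0] * n for _ in range(n)]

    def choices(r, c):
        used = set(grid[r]) | {grid[i][c] for i in range(n)}
        br, bc = r - r % box, c - c % box
        used |= {grid[br + i][bc + j] for i in range(box) for j in range(box)}
        return [d for d in range(1, n + 1) if d not in used]

    cells = [divmod(i, n) for i in range(n * n)]
    random.shuffle(cells)                 # fill squares in random order

    factor = 1
    for r, c in cells[:prefill]:
        opts = choices(r, c)
        if not opts:
            return 0                      # dead end during the random walk
        factor *= len(opts)               # same branching assumption as before
        grid[r][c] = random.choice(opts)

    def count(i):                         # exact count of completions (the "subtree score")
        if i == len(cells):
            return 1
        r, c = cells[i]
        total = 0
        for d in choices(r, c):
            grid[r][c] = d
            total += count(i + 1)
            grid[r][c] = 0
        return total

    return factor * count(prefill)
```

The estimate is still the product of branch counts along the walk, just multiplied by the exact score of the subtree we stopped at instead of a 0-or-1 leaf score.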

This finally gets us all the way. Here’s a set of ten samples for 9x9 grids:

| Sample estimate |
| --- |
| 2064300156282470400000 |
| 40123114037610283008000 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 337169025526136832000 |
| 0 |
| 535793238195264000000 |

The average is 4,306,037,645,761,415,424,000, compared to the analytically calculated answer of 6,670,903,752,021,072,936,960. Empirically, it doesn’t take many samples to get into the right order of magnitude.

To complete the “stupid things that work” approach, I ran the program for about a day and generated a million samples. The resulting estimate was 6,675,149,464,993,393,541,120. How can we know how accurate that is (without referring to the answer)? Using the standard deviation of the estimates, we can calculate a 95% confidence interval of ±2%. If you needed to, you could do some analysis and come up with an algorithm that’s much more accurate with much less computation, but I think that’s a pretty good result for the amount of human effort required.