
Case Study · Algorithms

Generating Puzzles Worth Playing

McVroom, N. (2026). Generating Puzzles Worth Playing. NUR Technical Reports, NUR-2026-003. nickmcvroom.com/work/sculpted-erosion-aftertouch
Tags: TypeScript · Vitest · Procedural Generation · Algorithms

Abstract

AFTERTOUCH uses sculpted erosion, Hamiltonian paths, and quality gates to produce procedural puzzles that are not just solvable but interesting. Most generated topologies are discarded. That is the point.

Problem

Procedural puzzles that are solvable but boring

Approach

Sculpt, path, quality gates, emit

Outcome

Consistent quality across 500,000 generated puzzles

Procedural puzzle generation has a well-known failure mode: the puzzles are technically solvable but not actually fun. You can guarantee that a solution exists. You can’t guarantee that finding it will feel like a satisfying challenge rather than a mechanical slog through an unstructured space.

AFTERTOUCH is a Hamiltonian path puzzle game. The player routes a signal through every node on a grid exactly once, from an input pin to an output pin. The visual language is dark PCB (contact pads, routed traces, dead silicon), but the underlying problem is pure graph theory. The generation challenge is producing topologies that reliably yield puzzles with the right difficulty profile for each tier, from tutorial grids a child could solve to expert grids that demand genuine spatial reasoning.

The solution is a three-stage pipeline: sculpt the topology, find a path, then reject anything that doesn’t meet quantifiable quality standards. Most generated topologies are discarded. That’s the point.

Sculpting by Erosion

The grid starts fully connected; every cell is traversable. Generation works by removing material, converting traversable cells into walls. The metaphor is sculptural: you start with a block and carve until a shape emerges.

Three placement strategies control where walls appear, each producing a different character of topology:

Scatter picks a random traversable cell with no existing wall neighbours. This produces isolated, distributed holes: an even texture with predictable sight lines. Tutorial levels lean heavily on scatter (weight 0.765) because the resulting topologies are uniform and readable.

Cluster picks a cell adjacent to an existing wall. This grows existing holes into larger irregular voids, creating the kind of organic shapes that force the player to route around obstacles. Medium and hard tiers increase the cluster weight to produce more complex internal geography.

Edge nibble picks a cell on the grid boundary and erodes inward. This creates irregular outlines: grids that aren’t rectangular, with peninsulas and inlets that constrain the path in unexpected ways. Medium tiers weight edge nibble at 0.510, producing dramatically shaped boundaries.

Each wall placement rolls against the combined weights to select a strategy. The weights are defined per difficulty tier, so the character of the topology, not just its size, changes as difficulty increases.

// Per-tier placement weights shape the topology character
// Tutorial: mostly scatter; uniform, predictable
// Medium:   heavy edge nibble; dramatic boundaries
// Hard:     high cluster; complex internal voids
// Expert:   scatter-dominant, no edge nibble; dense, uniform challenge
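The weighted draw itself is a standard roulette-wheel selection. A minimal sketch, assuming hypothetical names (`PlacementWeights`, `pickStrategy`) rather than the engine's actual API; only the tutorial scatter weight comes from the text, the other values are illustrative:

```typescript
// Hypothetical shape for per-tier placement weights. Only the scatter
// weight (0.765) is from the tutorial tier described above; the other
// two values are illustrative placeholders.
type Strategy = "scatter" | "cluster" | "edgeNibble";
type PlacementWeights = Record<Strategy, number>;

const tutorialWeights: PlacementWeights = {
  scatter: 0.765,
  cluster: 0.135,    // illustrative
  edgeNibble: 0.1,   // illustrative
};

// Roll once against the combined weights to select a strategy.
function pickStrategy(weights: PlacementWeights, rand: () => number): Strategy {
  const entries = Object.entries(weights) as [Strategy, number][];
  const total = entries.reduce((sum, [, w]) => sum + w, 0);
  let roll = rand() * total;
  for (const [strategy, w] of entries) {
    roll -= w;
    if (roll <= 0) return strategy;
  }
  return entries[entries.length - 1][0]; // guard against floating-point drift
}
```

Because the roll is normalised by the combined total, weights don't need to sum to 1, which keeps per-tier configs easy to tweak independently.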

Every placement is validated. After each wall is placed, a breadth-first search confirms the remaining traversable cells form a single connected component. If a wall would disconnect the grid, it’s rejected and another candidate is tried. This invariant holds after every single placement, not just at the end; the topology is always valid, at every step.
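The connectivity invariant can be verified with a plain breadth-first search. A self-contained sketch (names and cell-key encoding are hypothetical, not the engine's actual code):

```typescript
// BFS connectivity check: do the remaining traversable cells form a
// single connected component? Cells are keyed "r,c"; walls is the set
// of removed (non-traversable) cells.
function isConnected(rows: number, cols: number, walls: Set<string>): boolean {
  const key = (r: number, c: number) => `${r},${c}`;
  const traversable: string[] = [];
  for (let r = 0; r < rows; r++)
    for (let c = 0; c < cols; c++)
      if (!walls.has(key(r, c))) traversable.push(key(r, c));
  if (traversable.length === 0) return false;

  // Flood-fill from any traversable cell.
  const seen = new Set<string>([traversable[0]]);
  const queue = [traversable[0]];
  while (queue.length > 0) {
    const [r, c] = queue.shift()!.split(",").map(Number);
    for (const [nr, nc] of [[r - 1, c], [r + 1, c], [r, c - 1], [r, c + 1]]) {
      const k = key(nr, nc);
      if (nr >= 0 && nr < rows && nc >= 0 && nc < cols &&
          !walls.has(k) && !seen.has(k)) {
        seen.add(k);
        queue.push(k);
      }
    }
  }
  // Connected iff the flood fill reached every traversable cell.
  return seen.size === traversable.length;
}
```

A candidate wall is tried by adding it to the set, running this check, and reverting if the check fails.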

Inputs are a seed (mulberry32 PRNG), grid dimensions, a target wall count derived from per-tier percentage ranges, and the placement weights. Everything is deterministic. The same seed always produces the same topology.
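mulberry32 is a well-known public-domain 32-bit PRNG; its canonical implementation is tiny, which is why it suits a client-side generator:

```typescript
// mulberry32: small, fast, seedable 32-bit PRNG. The same seed always
// yields the same sequence, which is what makes every stage of the
// pipeline reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}
```

Every downstream random decision, wall placement, tie-breaking, mechanic draws, pulls from one seeded stream, so replaying a seed replays the entire generation.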

Fig. 1 — Three placement strategies produce different topology characters: Scatter (isolated, distributed holes), Cluster (growing irregular voids), Edge Nibble (boundary erosion inward). Indigo = traversable, grey = wall.

Finding the Path

Once a topology exists, a Hamiltonian path must be found on it. The algorithm is Warnsdorff’s heuristic: at each step, move to the unvisited neighbour with the fewest remaining unvisited neighbours of its own. Break ties randomly using the seeded PRNG. This greedy approach doesn’t guarantee a solution (it’s a heuristic, not an exhaustive solver), but it’s fast, and speed matters when most topologies will be discarded anyway.
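A compact sketch of Warnsdorff's rule on a walled grid. Helper names (`warnsdorffPath`, the `Cell` type, the `"r,c"` keying) are hypothetical, not the engine's actual code:

```typescript
type Cell = { r: number; c: number };

// Warnsdorff's heuristic: repeatedly move to the unvisited neighbour
// with the fewest unvisited neighbours of its own; ties broken by rand.
// Returns a Hamiltonian path, or null if the greedy walk gets stuck.
function warnsdorffPath(
  rows: number, cols: number, walls: Set<string>,
  start: Cell, rand: () => number,
): Cell[] | null {
  const key = (r: number, c: number) => `${r},${c}`;
  const neighbours = (r: number, c: number): Cell[] =>
    [[r - 1, c], [r + 1, c], [r, c - 1], [r, c + 1]]
      .filter(([nr, nc]) => nr >= 0 && nr < rows && nc >= 0 && nc < cols &&
        !walls.has(key(nr, nc)))
      .map(([nr, nc]) => ({ r: nr, c: nc }));

  const total = rows * cols - walls.size;
  const visited = new Set<string>([key(start.r, start.c)]);
  const path: Cell[] = [start];
  let cur = start;

  while (path.length < total) {
    const options = neighbours(cur.r, cur.c).filter(n => !visited.has(key(n.r, n.c)));
    if (options.length === 0) return null; // stuck: caller retries elsewhere
    // Rank options by their own remaining degree; keep all ties.
    let best: Cell[] = [];
    let bestDegree = Infinity;
    for (const opt of options) {
      const degree = neighbours(opt.r, opt.c)
        .filter(n => !visited.has(key(n.r, n.c))).length;
      if (degree < bestDegree) { bestDegree = degree; best = [opt]; }
      else if (degree === bestDegree) best.push(opt);
    }
    cur = best[Math.floor(rand() * best.length)];
    visited.add(key(cur.r, cur.c));
    path.push(cur);
  }
  return path;
}
```

The intuition: cells with few exits must be consumed early, or they become dead ends; always taking the tightest neighbour defers the open area for later.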

Up to twenty attempts are made with different source positions per topology. If no Hamiltonian path is found, the topology is discarded and a fresh one is sculpted. The approach is: generate many topologies cheaply, find paths on the ones that admit them, filter by quality. Brute force in service of taste.

A separate solver exists for solution enumeration and hint computation during gameplay, but it’s not used during generation. Generation relies entirely on the Warnsdorff heuristic for speed.

Quality Gates

This is where most generation pipelines stop. A solvable puzzle exists: ship it. AFTERTOUCH doesn’t, because solvability is a necessary condition, not a sufficient one. A puzzle can be solvable and boring. A puzzle can be solvable and trivial. A puzzle can be solvable and have only one interesting decision in a grid of twenty nodes.

Two metrics quantify puzzle quality at generation time:

Fork ratio is the proportion of path nodes where the player has two or more choices. Below the per-tier threshold, the puzzle is too linear: the player is walking a corridor, not solving a puzzle. There’s only one plausible move at each step, which means the solution is obvious without requiring any reasoning.

Mean fork depth measures how punishing the wrong choices are. At each fork point, every wrong option is followed greedily to see how far the player can go before hitting a dead end. Deeper traps mean the player invests more moves before realising they’ve gone wrong, and that investment is what makes the puzzle feel consequential. Shallow traps are too forgiving; the player backtracks one step and continues. Deep traps force genuine deliberation.
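Both metrics can be computed in one pass along a known solution path. A sketch under assumed names (`forkMetrics`, a caller-supplied `neighbours` function); the engine's real greedy-follow rule may differ in how it chooses among onward cells:

```typescript
type Cell = { r: number; c: number };
const key = (c: Cell) => `${c.r},${c.c}`;

// Walk the solution path. At step i, "options" are the unvisited
// neighbours of the current node. Fork ratio: fraction of steps with
// 2+ options. Mean fork depth: at each fork, follow every wrong option
// greedily until it dead-ends and record how far it got.
function forkMetrics(
  path: Cell[], neighbours: (c: Cell) => Cell[],
): { forkRatio: number; meanForkDepth: number } {
  const visited = new Set<string>();
  let forks = 0;
  const trapDepths: number[] = [];

  for (let i = 0; i < path.length - 1; i++) {
    visited.add(key(path[i]));
    const options = neighbours(path[i]).filter(n => !visited.has(key(n)));
    if (options.length >= 2) {
      forks++;
      const next = key(path[i + 1]);
      for (const wrong of options.filter(o => key(o) !== next)) {
        // Greedy walk down the wrong branch: always take the first
        // unseen neighbour until none remain.
        const seen = new Set(visited);
        let cur = wrong, depth = 1;
        seen.add(key(cur));
        for (;;) {
          const onward = neighbours(cur).filter(n => !seen.has(key(n)));
          if (onward.length === 0) break; // dead end reached
          cur = onward[0];
          seen.add(key(cur));
          depth++;
        }
        trapDepths.push(depth);
      }
    }
  }
  const steps = path.length - 1;
  return {
    forkRatio: steps > 0 ? forks / steps : 0,
    meanForkDepth: trapDepths.length > 0
      ? trapDepths.reduce((a, b) => a + b, 0) / trapDepths.length
      : 0,
  };
}
```

On a pure corridor this returns a fork ratio of zero, which is exactly the "walking a corridor" case the gate rejects.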

A separate parameter, solution limit, bounds how many solutions the hint solver explores during gameplay. It’s defined per tier (tutorial: 100, expert: 5) and controls the computational budget for hint computation, not generation. The generator doesn’t count solutions because exhaustive enumeration is expensive and the fork metrics are a sufficient quality proxy.

If any gate fails, the entire topology is discarded and the pipeline loops; up to fifty attempts with fresh topologies before giving up on the current parameters. The rejection rate is high and intentional. Generating a topology is cheap. Shipping a bad puzzle is not.

Mechanic selection is part of the generation pipeline, not a post-processing step. Per-tier mechanicWeights define the probability of each puzzle type: standard, fog, warps, or combined fog-and-warps. The generator draws a mechanic type first, then attempts to produce a puzzle of that type. If it can’t produce the drawn mechanic within fifty attempts (some mechanics are harder to generate for certain topologies), it falls back to a standard puzzle. The generator prefers variety but prioritises delivering a puzzle over delivering the specific mechanic it drew.

Fig. 2 — The pipeline: Sculpt (scatter / cluster / nibble) → Path (Warnsdorff heuristic) → Quality Gates (fork ratio + fork depth) → pass: Emit; reject: discard topology, retry with new seed. Most generated topologies are discarded. The rejection loop is the quality mechanism.

The SOC Aesthetic

The puzzle mechanic is pure graph theory, but the player doesn’t see nodes and edges. They see a dark circuit board.

Traversable cells are contact pads on indigo silicon. The signal trail is a bright cyan trace with a glow filter. Walls are dead silicon: dark, inert, unroutable. The source node is an input pin, pulsing blue. The destination is an output pin, pulsing green, glowing intensely when every node has been visited and the circuit is ready to complete. Warp pairs, a mechanic that connects distant nodes, render as purple vias, linked by dashed arcs.

The aesthetic isn’t decorative. It gives the abstract mechanic a narrative: you’re completing a circuit, routing a signal through a chip. “Route a signal from input to output, visiting every test point exactly once” is more compelling than “find a Hamiltonian path on this graph.” The mechanic is identical. The experience isn’t.

Grid sizes range from 3×3 at tutorial to 8×8 at expert, always portrait-oriented. The SVG viewBox is fixed at 400 logical units on the longest dimension, so grids scale consistently regardless of size. The dark palette (navy background, cyan traces, amber dead-end warnings) reads clearly at any scale and any screen brightness.

Two Products, One Engine

AFTERTOUCH ships in two forms from the same @aftertouch/shared package.

AFTERTOUCH: SIGNAL is the instant version: a free, single-puzzle-per-session game. Pick a difficulty tier, get one randomly generated puzzle, play it, see your result card, share it. No persistence, no accounts, no progression. It’s designed for web embeds and social sharing, similar in format to daily puzzle games but on-demand rather than daily.

AFTERTOUCH (the full app, planned) adds level sequences with numbered puzzles from deterministic seeds, cross-session persistence, star rating progression, and eventually platform adapters for mobile and social platforms. The engine is identical. The instant client is a thin React shell around the shared package.

Calibrating by Simulation

The quality gate thresholds were initially hand-tuned: play a generated puzzle, decide if it felt right, adjust the numbers. This worked for the first playable build. It didn’t scale across five difficulty tiers and ten grid size options with a dozen interacting parameters per tier.

The autotuner (tools/autotuner/) is an offline configuration search tool that replaces intuition with data. It replicates the full puzzle generation loop (sculpt topology, find Hamiltonian path, measure quality metrics) and uses hill climbing to find the parameter set that produces the best puzzles for each tier.

For each candidate configuration, the pipeline generates 200 puzzles and captures three metrics: fork ratio, mean fork depth, and deep fork ratio (proportion of forks with depth ≥ 3). Each metric is scored against a target band defined per tier; tutorial wants a fork ratio of 0.30–0.55, expert wants 0.55–0.70. The score combines band membership (what fraction of puzzles land inside the target) with a variance penalty (how tightly clustered they are within it).

The composite score uses a weighted geometric mean across the three metrics. The geometric mean is the critical design choice: a zero on any single metric zeroes the entire score. You cannot compensate for broken fork depth with excellent fork ratio. This prevents the optimiser from finding degenerate configurations that excel on one axis but produce unplayable puzzles. Mean fork depth carries double weight because trap depth is the primary difficulty lever.
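The weighted geometric mean is exp(Σ w·ln s / Σ w), which makes the zeroing behaviour automatic. A minimal sketch (the metric and weight names are placeholders; the source only specifies that mean fork depth carries double weight):

```typescript
// Weighted geometric mean of per-metric scores. A zero on any metric
// zeroes the composite, so no axis can be traded away for another.
function compositeScore(
  scores: Record<string, number>, weights: Record<string, number>,
): number {
  let logSum = 0, weightSum = 0;
  for (const metric of Object.keys(weights)) {
    const s = scores[metric];
    if (s <= 0) return 0; // one broken metric sinks the whole config
    logSum += weights[metric] * Math.log(s);
    weightSum += weights[metric];
  }
  return Math.exp(logSum / weightSum);
}
```

Contrast with an arithmetic mean, where a 0.9 on fork ratio could mask a 0.0 on fork depth; here it cannot.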

Hill climbing with adaptive perturbation explores the space. Five mutated candidates per iteration, Gaussian noise via Box-Muller transform applied to all tuneable parameters: wall percentages, fork thresholds, placement weights, source strategy, mechanic weights. After ten iterations without improvement, the perturbation magnitude widens by 1.5× to escape local optima. A default run is 100 iterations × 5 candidates × 200 puzzles = 100,000 puzzle generations per tier, 500,000 total across all five tiers.
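The Gaussian noise step can be sketched directly; Box-Muller is the standard two-uniforms-to-one-normal transform, while the `mutate` wrapper and the flat-array parameter layout are illustrative assumptions:

```typescript
// Box-Muller transform: two uniform samples in [0, 1) -> one sample
// from the standard normal distribution.
function gaussian(rand: () => number): number {
  const u = Math.max(rand(), Number.EPSILON); // avoid log(0)
  const v = rand();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Mutate one candidate: add Gaussian noise scaled by the current
// perturbation magnitude (sigma) to every tuneable parameter.
// Parameter layout as a flat number array is illustrative.
function mutate(params: number[], sigma: number, rand: () => number): number[] {
  return params.map(p => p + sigma * gaussian(rand));
}
```

Widening sigma by 1.5× after ten stale iterations then just means `sigma *= 1.5` before the next round of `mutate` calls.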

The output is a TypeScript object literal matching DifficultyConfig, designed to paste directly into the engine’s configuration file. A distribution summary table shows P25–P75 ranges with means, so you can see at a glance whether the calibrated config produces the spread you expect.

The autotuner shifted several values in non-obvious ways. Placement weights changed the most: hand-tuned medium tier used roughly equal scatter, cluster, and edge nibble. The autotuner pushed edge nibble to 0.510 (the dominant strategy) with scatter at 0.257 and cluster at 0.233, producing more dramatic boundary shapes that increase fork depth. Wall percentage ranges tightened from wide guesses to narrow bands; expert settled on 0.055–0.064 instead of the original 0.05–0.10. Fork depth thresholds went from all-zero placeholders to real values (tutorial: 1.771 minimum, expert: 2.446 minimum), values that couldn’t be hand-tuned at all without statistical data on what the generator actually produces.

The lesson generalises: any system with more than a handful of interacting parameters needs automated calibration, not as a retrofit, but from the moment the parameter space is defined. The hand-tuned values weren’t wrong; they were in the right ballpark. But “right ballpark” isn’t the same as “produces consistent quality across 500,000 generated puzzles.”

What I’d Do Differently

Warnsdorff’s heuristic has known weaknesses on irregular topologies. It greedily follows minimum-degree neighbours, which can lead it into corners on heavily eroded grids where the neighbour distribution is uneven. The pipeline compensates with volume: twenty retries with different source positions per topology, fifty fresh topologies per puzzle request. This works because each Warnsdorff attempt is O(n), a thousand fast attempts is still cheap.

A hybrid approach could add a depth-limited backtracking fallback: not a full exhaustive search (which is O(n!) worst case and unacceptable for client-side generation on a 60-node grid), but a bounded DFS with a node budget of, say, 10,000. The solver infrastructure already exists: solveFrom has connectivity pruning that cuts branches early when the remaining graph would become disconnected. The question is whether the trade-off is worth it: backtracking is more expensive per attempt than Warnsdorff, but it would reduce the rejection rate, meaning fewer total attempts to produce a valid puzzle.
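A budgeted fallback could look roughly like this. It is a sketch, not the existing solveFrom: connectivity pruning is omitted for brevity, and all names here are hypothetical:

```typescript
type Cell = { r: number; c: number };

// Depth-limited backtracking: exhaustive DFS for a Hamiltonian path,
// abandoned once a global node budget is spent. Unlike Warnsdorff,
// it can back out of bad branches instead of failing outright.
function boundedDfs(
  total: number, neighbours: (c: Cell) => Cell[],
  start: Cell, budget: number,
): Cell[] | null {
  const key = (c: Cell) => `${c.r},${c.c}`;
  const visited = new Set<string>([key(start)]);
  const path: Cell[] = [start];
  let expansions = 0;
  let exhausted = false;

  const dfs = (cur: Cell): boolean => {
    if (path.length === total) return true;
    if (++expansions > budget) { exhausted = true; return false; }
    for (const next of neighbours(cur)) {
      if (exhausted) return false; // budget spent: unwind everything
      if (visited.has(key(next))) continue;
      visited.add(key(next));
      path.push(next);
      if (dfs(next)) return true;
      visited.delete(key(next)); // backtrack
      path.pop();
    }
    return false;
  };
  return dfs(start) ? path : null;
}
```

The budget caps worst-case cost at a constant, so a pathological topology fails fast rather than freezing the client.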

Whether that’s a net win depends on data I should be collecting but currently am not: the actual rejection rate per tier. If Warnsdorff succeeds within two or three retries on most topologies, adding backtracking is complexity for marginal gain. If expert-tier rejection rates are high (likely, given the tight autotuned configs and heavy erosion), it’s worth it. The next step is instrumentation, not implementation.

Outcome

The generation pipeline produces puzzles across five difficulty tiers, from 3×3 tutorial grids to 8×8 expert grids, with consistent quality characteristics per tier. Fork ratio and fork depth are measurable proxies for puzzle quality: not perfect proxies, but good enough to filter out the boring, the trivial, and the unfair.

The core insight is that procedural generation isn’t a solvability problem. It’s a curation problem. Generating valid puzzles is easy. Generating puzzles that are worth playing requires defining what “worth playing” means in quantifiable terms, then throwing away everything that doesn’t meet the standard.

References

Warnsdorff, H. C. von (1823). Des Rösselsprunges einfachste und allgemeinste Lösung.

McVroom, N. (2026). Platform Abstraction in CRACKPOINT. NUR Technical Reports, NUR-2026-001. /work/platform-abstraction-crackpoint