This page illustrates the RDKit random SMILES generation algorithm (implemented in Canon.cpp). Usually you want a SMILES string to be canonical, so that everyone writes the same string for the same molecule. But in machine learning, a single fixed spelling can backfire: a model may memorize text patterns instead of chemistry. Here we actually want chaos. By generating, say, ten different valid strings for a single molecule, we push models to learn the underlying chemical structure rather than one particular way of writing it.
At its core, a molecule is just a graph: atoms are nodes, bonds are edges. To write a SMILES string, the computer has to flatten that web into a single line of text by walking from atom to atom. This visualizer breaks that walk down into two steps so you can see how the computer handles the tricky parts.
Before writing anything, we have to solve the ring problem. If a computer blindly followed bonds around a ring like benzene, it would loop forever. To fix this, the algorithm runs a quick scout mission using a depth-first search. It sprints through the molecule, marking each atom it visits. As soon as it reaches an already-marked atom that is not the one it just came from, it knows it has found a loop. It flags that bond as a ring closure, which later shows up as a matching pair of digits in the SMILES string (the two 1s in c1ccccc1 for benzene).
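The scout pass can be sketched in a few lines of Python. This is a simplified illustration, not the code in Canon.cpp: the function name `find_ring_closures` and the adjacency-dictionary representation of the molecule are assumptions made for the example, and it assumes a connected molecule with at most one bond between any two atoms.

```python
def find_ring_closures(adjacency, start):
    """Return the bonds that close rings, found by a depth-first search.

    adjacency maps each atom index to a list of neighboring atom indices
    (a hypothetical representation chosen for this sketch).
    """
    visited = set()    # the "markers" left behind on atoms
    closures = set()   # each ring-closure bond, stored as a frozenset of two atoms

    def dfs(atom, parent):
        visited.add(atom)
        for neighbor in adjacency[atom]:
            if neighbor == parent:
                continue  # do not walk straight back along the bond we came from
            if neighbor in visited:
                # We reached an atom that already carries a marker: a loop.
                closures.add(frozenset((atom, neighbor)))
            else:
                dfs(neighbor, atom)

    dfs(start, None)
    return closures


# A six-membered ring such as benzene: atom i is bonded to i-1 and i+1 (mod 6).
benzene = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(find_ring_closures(benzene, 0))
```

For the six-membered ring, the search finds exactly one ring-closure bond; removing it leaves a plain chain that can be written left to right.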
With the rings marked, the algorithm walks the graph again to generate the text. Since this is a random generator, it rolls a die at every intersection to decide where to go next.
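Both passes together might look like the following minimal sketch. It is an assumption-laden toy, not RDKit's implementation: the function name `random_smiles` and its inputs are hypothetical, it handles only single bonds between atoms written with plain symbols, and it assumes a connected molecule with fewer than ten rings. The randomness comes from the starting atom and from shuffling the order of branches at each atom.

```python
import random


def random_smiles(symbols, bonds, rng):
    """Write one randomized SMILES-like string for a single-bonded graph."""
    n = len(symbols)
    adjacency = {i: [] for i in range(n)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    start = rng.randrange(n)  # random starting atom

    # Pass 1 (scout): split bonds into tree bonds and ring closures.
    visited, seen = set(), set()
    tree = {i: [] for i in range(n)}      # children reached by tree bonds
    closures = {i: [] for i in range(n)}  # ring-closure bonds touching each atom

    def scout(atom, parent):
        visited.add(atom)
        for nbr in adjacency[atom]:
            if nbr == parent:
                continue
            if nbr in visited:
                bond = frozenset((atom, nbr))
                if bond not in seen:  # record each closure only once
                    seen.add(bond)
                    closures[atom].append(bond)
                    closures[nbr].append(bond)
            else:
                tree[atom].append(nbr)
                scout(nbr, atom)

    scout(start, None)

    # Pass 2 (writer): walk the tree, rolling the die at every branch point.
    digits, out = {}, []

    def write(atom):
        out.append(symbols[atom])
        for bond in closures[atom]:
            if bond not in digits:
                digits[bond] = len(digits) + 1  # first end opens a new digit
            out.append(str(digits[bond]))       # both ends write the same digit
        children = tree[atom][:]
        rng.shuffle(children)
        for child in children[:-1]:             # all but the last go in parentheses
            out.append("(")
            write(child)
            out.append(")")
        if children:
            write(children[-1])

    write(start)
    return "".join(out)


rng = random.Random(0)
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
print({random_smiles(*ethanol, rng) for _ in range(200)})
```

For ethanol, the possible outputs of this sketch are CCO, OCC, C(C)O and C(O)C, all of which describe the same molecule. In RDKit itself, randomized output is typically requested through `Chem.MolToSmiles(mol, doRandom=True)` rather than by hand-rolling the walk.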