
I read an interesting article about Wordle on ABC News (Australia); it’s worth checking out. The author stated:
I use “share” as my first guess because it includes the two most common vowels, “s” which is the third most common letter and most common final letter in English words, while “h” and “r” are common individually and even more common in consonant clusters, so their presence or absence instantly knocks out a range of possibilities. But I admire people who play with less strategy. Lots of people guess on a whim
So I decided to write a little Python code to test this hypothesis. The first step was to download a 100,000-word English vocabulary, read it into a Python list, and keep only the 5-letter words, with apostrophes, hyphens, etc. stripped out.
Here’s sample output from solving a puzzle where the hidden word is “cents”:
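The vocabulary filtering step might look roughly like the sketch below. The function name `five_letter_words` is made up for illustration, and it assumes that entries containing apostrophes, hyphens, or other non-letters are dropped entirely rather than cleaned up; the original code may have handled them differently.

```python
import re

def five_letter_words(lines):
    """Return the sorted, de-duplicated 5-letter words found in `lines`.

    Assumption: any entry containing apostrophes, hyphens, digits, etc.
    is discarded rather than repaired.
    """
    words = set()
    for line in lines:
        word = line.strip().lower()
        # Exactly five ASCII letters, nothing else.
        if re.fullmatch(r"[a-z]{5}", word):
            words.add(word)
    return sorted(words)

# Tiny stand-in for the downloaded vocabulary file.
sample = ["share", "Cents", "don't", "x-ray", "poets", "to", "times\n"]
vocab = five_letter_words(sample)
print(vocab)
```

With a real word list, the same function could be fed an open file object, since iterating over a file yields one line at a time.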
Search for: cents
Guess=share, Match=_____, Avoid=har, DiffSpot=['s', '', '', '', 'e']
Guess=times, Match=____s, Avoid=harim, DiffSpot=['st', '', '', 'e', 'e']
Guess=poets, Match=___ts, Avoid=harimpo, DiffSpot=['st', '', 'e', 'e', 'e']
Guess=cents, Match=cents, Avoid=harimpo, DiffSpot=['st', '', 'e', 'e', 'e']
Some notes on the search algorithm:
- The “Match” shows which letters are in the correct spot (the green square in Wordle)
- “Avoid” are the letters we know can’t be in the word
- “DiffSpot” corresponds to the yellow boxes: the letters we know are in the word, but not in this position, i.e. they need to be used, but in a different spot
- It’s a little brute-force: no clever search algorithms, and it doesn’t avoid string concatenation and other GC abuses
- I didn’t spend any time on statistical significance tests, etc.
- The code takes about 10-12ms per puzzle on my MacBook Pro (M1 Pro chip)
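To make the three pieces of state concrete, here is a minimal sketch of how “Match”, “Avoid” and “DiffSpot” could be updated after each guess and then used to filter candidate words. The function names `update` and `is_candidate` are hypothetical, and duplicate letters are handled naively (a letter counts as “in the word” if it appears anywhere in the hidden word), which is simpler than real Wordle scoring.

```python
def update(guess, hidden, match, avoid, diff_spot):
    """Fold the feedback for one guess into the running search state."""
    match = list(match)
    diff_spot = list(diff_spot)
    for i, ch in enumerate(guess):
        if hidden[i] == ch:
            match[i] = ch              # green: right letter, right spot
        elif ch in hidden:
            if ch not in diff_spot[i]:
                diff_spot[i] += ch     # yellow: in the word, but not here
        elif ch not in avoid:
            avoid += ch                # grey: not in the word at all
    return "".join(match), avoid, diff_spot

def is_candidate(word, match, avoid, diff_spot):
    """True if `word` is consistent with everything learned so far."""
    for i, ch in enumerate(word):
        if match[i] != "_" and match[i] != ch:
            return False
        if ch in avoid or ch in diff_spot[i]:
            return False
    # Every yellow letter must still appear somewhere in the word.
    return set("".join(diff_spot)).issubset(word)

# Replay the guesses from the sample run above.
state = ("_____", "", [""] * 5)
for guess in ["share", "times", "poets", "cents"]:
    state = update(guess, "cents", *state)
    print(f"Guess={guess}, Match={state[0]}, Avoid={state[1]}, DiffSpot={state[2]}")
```

Replaying the four guesses this way ends in the same state as the last line of the sample output. A solver could then pick its next guess from the words for which `is_candidate` returns true.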
I tried a couple of strategies for the initial word, each tested by playing 1000 games that start with a random word from the vocabulary:
| Strategy | Average number of tries |
|---|---|
| Random word | 4.4 |
| Reasonable guess (see below) | 4.3 |
| “share” | 4.3 |
| “xylem” | 4.5 |
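A test harness for this kind of comparison could be sketched as follows. This is not the original code: it swaps in a simpler solver that keeps only the words that would have produced the same feedback and then guesses one of them at random. The names `feedback`, `play` and `average_tries` are hypothetical, the hidden word is assumed to be drawn at random from the vocabulary, and there is no six-guess limit.

```python
import random

def feedback(guess, hidden):
    """Simplified feedback: 'g' right spot, 'y' elsewhere in word, '_' absent."""
    return "".join(
        "g" if g == h else ("y" if g in hidden else "_")
        for g, h in zip(guess, hidden)
    )

def play(hidden, first_guess, words, rng):
    """Return how many guesses it takes to find `hidden` (must be in `words`)."""
    candidates = list(words)
    guess, tries = first_guess, 1
    while guess != hidden:
        seen = feedback(guess, hidden)
        # Keep only the words that would have produced the same feedback.
        candidates = [w for w in candidates
                      if w != guess and feedback(guess, w) == seen]
        guess = rng.choice(candidates)
        tries += 1
    return tries

def average_tries(words, first_guess, games=1000, seed=0):
    """Average number of guesses over `games` randomly chosen hidden words."""
    rng = random.Random(seed)
    total = sum(play(rng.choice(words), first_guess, words, rng)
                for _ in range(games))
    return total / games

# Toy vocabulary; a real run would use the full 5-letter word list.
toy_words = ["cents", "share", "times", "poets", "xylem"]
print(average_tries(toy_words, "share", games=50))
```

Fixing the seed makes runs repeatable, which helps when comparing starting words whose averages differ by only a tenth of a guess.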
The reasonable guess was chosen by sampling 100 random words from the list, scoring each word by the number of letters it has from the list “eariot”, and then picking the best one. If you’ve solved simple ciphers by hand, you might recognize this as the start of the letter frequency list. For example, “share” would score 3, whereas “xylem” would score 1.
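The scoring idea can be sketched in a few lines. The names `score` and `reasonable_guess` are hypothetical, and this version assumes each distinct letter counts once, which is consistent with the two examples given but may differ from the original for words with repeated letters.

```python
import random

COMMON_LETTERS = set("eariot")

def score(word):
    """Number of distinct letters in `word` that appear in "eariot"."""
    return len(set(word) & COMMON_LETTERS)

def reasonable_guess(words, sample_size=100, rng=random):
    """Sample up to `sample_size` words and return the highest-scoring one."""
    sample = rng.sample(words, min(sample_size, len(words)))
    return max(sample, key=score)

print(score("share"), score("xylem"))
print(reasonable_guess(["xylem", "share", "cents"]))
```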
Some observations:
- Using “share” or even a reasonable guess is slightly better than random (4.3 vs. 4.4 tries on average)
- A bad guess with less frequent letters is worse than random, but not hugely so (4.5 vs. 4.4)
Well, if I get the time, I might let it run all night and find a list of the best words. There might also be a way to use n-gram frequencies or some information-theoretic approach to guess the best starting word, or to improve the intermediate guesses. Could be fun 🙂