Optimal First Guess in Wordle Using Weighted Letter Distribution

The latest sensation is wordle! As a huge fan of these types of games, I wondered to myself: given the distribution of letters across five possible locations, can we weight words in such a way that we can pick an "optimal" first choice to give us the largest chance at identifying letters?

The algorithm looks a little something like this:

Step 1: Identity the data set

The wordle website has 12,972 word choice options.
Linux american english and words have 6,006 after removing apostrophes.
There are 8,996 five-letter words in the Office Scrabble Player's Dictionary, Volume 6.

We'll go with the wordle set for fidelity.

Step 2: Transform and bucket the letter distribution

Step 3: Weight words based on letter distribution

This naive algorithm looks at all of the words in the list and counts the occurrence of each letter in each of the five possible positions. It will then calculate each word's weight by summing the value of the letter in that slot.

I put together a script in python3 that accomplishes this:

#!/usr/bin/env python3

data = []
distro = {}
with open('sorted_words.txt', 'r') as f:
  for line in f:
    data.append(line.strip())
for word in data:
  i = 0
  for letter in word:
    if not i in distro:
      distro[i] = {}
    if not letter in distro[i]:
      distro[i][letter] = 1
    else:
      distro[i][letter] += 1
    i += 1
for key in distro:
  # python 3 preserves insert order, sort of letter key
  distro[key] = dict(sorted(distro[key].items(), key = lambda item: item[0]))

word_weights = {}
for word in data:
  weight = 0
  i = 0
  for letter in word:
    weight += distro[i][letter]
  word_weights[word] = weight

word_weights = dict(sorted(word_weights.items(), key = lambda item: item[1]))
for word,weight in word_weights.items():
  print (word + ": " + str(weight))

Top 10 Suggestions:

stoss: 5772
asses: 5735
sasse: 5735
sessa: 5735
scabs: 5698
sists: 5675
sassy: 5613
casts: 5604
scats: 5604
basts: 5591

Interesting! This algorithm was easy to throw together, but these results aren't great. You can tell that the weight is heavily skewed to words beginning and ending with 'S'. Guessing double letters is high risk, but high reward in wordle.

Let's see if we can make this code a little smarter. Let's drop all words that have any duplicate letters.

Unique letters:

pacts: 4898
camps: 4776
scamp: 4776
carbs: 4761
crabs: 4761
scrab: 4761
carps: 4711
craps: 4711
scarp: 4711
scrap: 4711

Still very heavily 'S' based. Can we do better?

Let's give ourselves some more parameters. Let's require at least two vowels and no duplicates.

capes: 4386
paces: 4386
scape: 4386
space: 4386
pebas: 4373
capos: 4345
pacos: 4345
scopa: 4345
caste: 4342
cates: 4342

We've finally moved away from the end 'S' dominating the results. Notice the final 'E' is also very popular.

One more time, now with at least three vowels:

paseo: 3726
psoae: 3726
cause: 3716
sauce: 3716
abuse: 3703
beaus: 3703
saice: 3692
stoae: 3682
toeas: 3682
abies: 3679

It got tedious modifying the script for these variables, so now it will take parameters:

#!/usr/bin/env python3

import sys
import getopt

def main(argv):
  try:
    opts, args = getopt.getopt(argv, "hi:c:v:y", ["help"])
  except getopt.GetoptError:
    print ("./build_dist.py -i <inputfile> -c <minConsonants> -v <minVowels> -y (flag only, should y be treated as vowel)")
    sys.exit(2)

  inputfile = None
  minConsonants = 0
  minVowels = 0
  y_is_vowel = False
  for opt, arg in opts:
    if opt in  ('-h', "--help"):
      print ("./build_dist.py -i <inputfile> -c <minConsonants> -v <minVowels> -y (flag only, should y be treated as vowel)")
      sys.exit()
    elif opt in ("-i"):
      inputfile = arg
    elif opt in ("-c"):
      minConsonants = int(arg)
    elif opt in ("-v"):
      minVowels = int(arg)
    elif opt in ('-y'):
      y_is_vowel = True

  if inputfile == None:
    print ("inputfile required")
    print ("./build_dist.py -i <inputfile> -c <minConsonants> -v <minVowels> -y (flag only, should y be treated as vowel)")
    sys.exit(2)
  
  total_req = (minConsonants + minVowels)
  if total_req > 5:
    print ("Requsted min consonants [{}] and min vowels [{}] exceed five character word length.".format(minConsonants, minVowels))
    sys.exit(2)

  print ("Building distribution from {} with {} min consonants and {} min vowels. 'Y is a vowel'={}...".format(inputfile, minConsonants, minVowels,y_is_vowel))

  data = []
  distro = {}
  with open(inputfile, 'r') as f:
    for line in f:
      data.append(line.strip())
  for word in data:
    i = 0
    for letter in word:
      if not i in distro:
        distro[i] = {}
      if not letter in distro[i]:
        distro[i][letter] = 1
      else:
        distro[i][letter] += 1
      i += 1
  for key in distro:
    # python 3 preserves insert order, sort of letter key
    distro[key] = dict(sorted(distro[key].items(), key = lambda item: item[0]))

  vowels = ['a', 'e', 'i', 'o', 'u'] # not doing y for now
  # unique letters and at least two vowels
  word_weights = {}
  for word in data:
    disallow = False
    n_con = 0
    n_vowels = 0
    duplicates = {}
    for char in word:
      if char in vowels or (char == 'y' and y_is_vowel):
        n_vowels += 1
      else:
        n_con += 1
      if char in duplicates:
        disallow = True
      else:
        duplicates[char] = 1
    if disallow == True or n_vowels < minVowels or n_con < minConsonants:
      continue
    weight = 0
    i = 0
    for letter in word:
      weight += distro[i][letter]
    word_weights[word] = weight

  word_weights = dict(sorted(word_weights.items(), key = lambda item: item[1], reverse = True))
  for word,weight in word_weights.items():
    print (word + ": " + str(weight))

if __name__ == "__main__":
  main(sys.argv[1:])

Trying out combinations of consonants and vowels becomes much easier:

./build_dist.py -i sorted_words.txt -c 1 -v 4

Building distribution from sorted_words.txt with 1 min consonants and 4 min vowels...
adieu: 2079
miaou: 2046
audio: 2038
aurei: 2022
uraei: 2022
auloi: 1930
ouija: 1555
ourie: 1547
louie: 1496

There is also a flag to count y as a vowel. This allows for some interesting exploring of the dataset:

./build_dist.py -i sorted_words.txt -c 1 -v 4 -y

Building distribution from sorted_words.txt with 1 min consonants and 4 min vowels. 'Y is a vowel'=True...
youse: 2500
coyau: 2291
bayou: 2278
boyau: 2278
adieu: 2079
miaou: 2046
audio: 2038
aurei: 2022
uraei: 2022
aiery: 2014
ayrie: 2014
auloi: 1930
pioye: 1770
noyau: 1694
ouija: 1555
ourie: 1547
louie: 1496
ulyie: 1415
yowie: 1324

The executive summary from this data sampling is that there is not a specific subset of optimal words to begin with for wordle. It depends on your style of play: lots of vowels, Wheel of Fortune RSTLNE, so on and so forth.

Code with dataset can be found here.