Python meets Wordle

I read an interesting article on ABC News (Australia) about Wordle; it’s worth checking out. The author stated:

I use “share” as my first guess because it includes the two most common vowels, “s” which is the third most common letter and most common final letter in English words, while “h” and “r” are common individually and even more common in consonant clusters, so their presence or absence instantly knocks out a range of possibilities. But I admire people who play with less strategy. Lots of people guess on a whim

So, I decided to write a little Python code and test this hypothesis. The first step was to download a 100,000-word English vocabulary, read it into a Python list, and keep only the 5-letter words (after stripping out apostrophes, hyphens, and so on).
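
Here’s a minimal sketch of that loading step. The file name and the exact cleaning rules are my assumptions rather than the original code; it just expects a plain-text word list with one word per line:

import re

def load_five_letter_words(path="words.txt"):
    """Read a word list and keep only clean 5-letter words."""
    words = set()
    with open(path) as f:
        for line in f:
            # Drop apostrophes, hyphens and anything else non-alphabetic
            word = re.sub(r"[^a-z]", "", line.strip().lower())
            if len(word) == 5:
                words.add(word)
    return sorted(words)

vocab = load_five_letter_words()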

Here’s a sample run; the hidden word is “cents”:

Search for: cents
Guess=share, Match=_____, Avoid=har, DiffSpot=['s', '', '', '', 'e']
Guess=times, Match=____s, Avoid=harim, DiffSpot=['st', '', '', 'e', 'e']
Guess=poets, Match=___ts, Avoid=harimpo, DiffSpot=['st', '', 'e', 'e', 'e']
Guess=cents, Match=cents, Avoid=harimpo, DiffSpot=['st', '', 'e', 'e', 'e']

Some notes on the search algorithm:

  • The “Match” shows which letters are in the correct spot (the green square in Wordle)
  • “Avoid” are the letters we know can’t be in the word
  • “DiffSpot” corresponds to the yellow boxes: letters we know are in the word but not at that position, i.e. they need to be used, but in a different spot
  • It’s a little brute-force: no clever search algorithms, and it doesn’t avoid string concatenation and other GC abuses (a sketch of the candidate filter is shown after these notes)
  • I didn’t spend any time on statistical significance tests, etc.
  • The code takes about 10-12 ms per puzzle on my MacBook Pro (M1 Pro chip)
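
Here’s that brute-force candidate filter as a minimal sketch, reusing the vocab list from the loading sketch above. The function name and the exact representation of Match / Avoid / DiffSpot are inferred from the trace, not lifted from my original code:

def consistent(word, match, avoid, diff_spot):
    """Does a candidate word agree with everything learned so far?"""
    # Green squares: letters pinned to a known position
    for i, ch in enumerate(match):
        if ch != "_" and word[i] != ch:
            return False
    # Letters known to be absent from the word
    if any(ch in word for ch in avoid):
        return False
    # Yellow squares: letters that must appear, but not at this position
    for i, letters in enumerate(diff_spot):
        for ch in letters:
            if ch not in word or word[i] == ch:
                return False
    return True

# State after the first guess in the trace above
candidates = [w for w in vocab
              if consistent(w, "_____", "har", ["s", "", "", "", "e"])]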

I tried a couple of strategies for the initial word, each tested by playing 1000 games with the hidden word drawn at random from the vocabulary:

Strategy                        Average number of tries
Random word                     4.4
Reasonable guess (see below)    4.3
“share”                         4.3
“xylem”                         4.5

The reasonable guess was made by taking 100 random words from the list, scoring each word by the number of letters it has from the list “eariot”, then picking the best one. If you’ve solved simple ciphers by hand, you might recognize this as the start of the letter frequency list. For example, “share” would score 3 points, whereas “xylem” would score 1.
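
In code, the scoring looks roughly like this. I count distinct letters, which matches the scores above; whether repeats should count twice isn’t something the description pins down, so treat this as a sketch:

import random

def eariot_score(word):
    # Number of distinct letters the word shares with "eariot"
    return len(set(word) & set("eariot"))

def reasonable_guess(vocab, samples=100):
    # Best-scoring word out of a random sample of the vocabulary
    return max(random.sample(vocab, samples), key=eariot_score)

eariot_score("share")   # 3  (a, r, e)
eariot_score("xylem")   # 1  (e)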

Some observations:

  • Using “share” or even a reasonable guess is better than random
  • A bad guess with less frequent letters is worse than random, but not hugely so

Well, if I get the time, I might let it run all night and find a list of best words. There might also be a way to use n-gram frequencies or some information theoretic approach to guess the best starting word, or to improve the intermediate guesses. Could be fun 🙂


Merry Christmas !!!

Wishing everybody a wonderful celebration. A friend sent me this Christmas math joke, just had to share…


Kendall’s Tau and the CrossFit Games

Correlation of Results

Correlation coefficients are a measure of agreement, and Kendall’s Tau measures agreement in the ordering of items. An example of this is seeing how closely different athletes’ rankings agree from one event to the next.

Suppose Alice was a gymnast and loves everything gymnastics-related. Brenda spent time as a weightlifter and loves throwing heavy things around. Cindy was a runner and swimmer, and excels at anything aerobic. Now, if they perform workouts and get these results:

Workout                     Winner    Second    Third
1. Pushups & Deadlifts      Alice     Brenda    Cindy
2. Burpees over the Box     Cindy     Alice     Brenda
3. Rower                    Alice     Cindy     Brenda

Comparing Workout 1 & Workout 2, we have 3 pairings of our athletes (A-B, A-C, B-C). Only one of these agrees across both workouts (Alice beat Brenda in both) and two don’t, so our correlation is defined as:

(concordant_pairs – discordant_pairs) / (num_pairs)

In this case, it’s (1-2)/3 = -0.33. Similarly, comparing Workouts 2 & 3, two pairings are concordant (A>B and C>B), so these correlate (2-1)/3 = 0.33. Workouts 1 & 3 also have 2 concordant pairs and correlate (2-1)/3 = 0.33. We might then conclude that Workouts 1 & 2 most differentiate the athlete rankings.
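
Here’s a small Python sketch that reproduces those three numbers from the toy table (it ignores ties, since there are none here):

from itertools import combinations

def kendall_tau(rank_a, rank_b):
    # rank_a and rank_b map each athlete to her finishing place
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

w1 = {"Alice": 1, "Brenda": 2, "Cindy": 3}
w2 = {"Cindy": 1, "Alice": 2, "Brenda": 3}
w3 = {"Alice": 1, "Cindy": 2, "Brenda": 3}
kendall_tau(w1, w2)   # -0.33
kendall_tau(w2, w3)   #  0.33
kendall_tau(w1, w3)   #  0.33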

Much like regular correlation coefficients, if the athletes finish in exactly the same order, the score would be 1.0. And, if they finished in exactly reverse order, the correlation would be -1.0. BTW, you could also use this to look at sports with subjective judging and see which judges are voting in blocks against the consensus, etc.

2021 CrossFit Games

For this analysis, I took the 20 highest-ranked athletes and the order in which they finished in each of the 15 events in the finals. I chose 20 because that many athletes made the final cut and went through all 15 workouts. I typed the data in manually, did the calculations with Python and then built a heat map with Tableau:

In this chart, Event 0 represents the overall ranking. The intensity of the blue shows a positive correlation, and orange shows a negative correlation. The first thing I noticed was that the men’s events correlated a lot less than the women’s events: the women were more likely to stay in the same rank order across all events.
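
If you want to reproduce the pairwise correlations behind the heat map, a sketch like this would do it. The file names and layout here are hypothetical; the real data was typed in by hand:

import numpy as np
from scipy.stats import kendalltau

# One row per athlete, one column per "event"; column 0 is the overall ranking
ranks = np.loadtxt("mens_ranks.csv", delimiter=",")

n_events = ranks.shape[1]
tau = np.zeros((n_events, n_events))
for i in range(n_events):
    for j in range(n_events):
        tau[i, j], _ = kendalltau(ranks[:, i], ranks[:, j])

np.savetxt("mens_tau.csv", tau, delimiter=",")   # this feeds the Tableau heat map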

A few more observations, just looking at individual event pairs:

  • For the men, the worst predictor of overall ranking was Event 10 (30 toes-to-bar, 1.5 mile run, 30 toes-to-bar, 1.5 mile run, 30 toes-to-bar). It had a slight negative correlation (-0.147)
  • The strongest positive correlation was Event 4 (wall walks and thrusters) and Event 14 (deadlifts and hand-stand pushups)
  • The strongest negative correlation (-0.432) was Event 10 with Event 8 (the handstand obstacle course). The women tended to stay in the same order across those two events (tau = 0.211)
  • For the women, the strongest correlation of finishing in the same order was Events 6 & 7. These were basically the same thing, 5 rounds of running and a clean, so this is not surprising. The least correlated Events for the women were 9 & 11.

So, what could we do with this information? I’m still thinking about that. One might do a larger analysis and look at this over several years to see what types of events correlate least to final position – then schedule these types of events earlier in the finals. This would create a lot more chaos on the leaderboard 😎

Note: I am not affiliated with CrossFit or sponsored by them, this was just a hobby project. The event descriptions and results are at:
https://games.crossfit.com/workouts/games/


Polar Charts and Best Glide Speed

One of the things pilots learn about their aircraft is the Best Glide Speed (Vbg). Gliders come with a page in the manual that shows the sink rate at various speeds, and glider pilots use a couple of rules of thumb to adjust their glide speed for headwinds and tailwinds. One day, for fun, I took out a Sport Cruiser (CZAW) and had a friend take pictures of the display while I tried different glide speeds – then I plotted them in Excel to see if I got the same basic shape:

Notice how quickly the aircraft descends when it drops below 50 knots? The clean stall speed is 39 knots, and once you hold it in the “falling leaf” maneuver, it starts dropping very quickly. Also, if you err 10 knots too fast, at least you’re getting closer to somewhere to land; if you’re 10 knots too slow then your choices are limited, as you’re dropping but not going as far. BTW, I think it would be interesting to do similar experiments and see what happens during the “impossible turn”.

Note: These experiments were all done at least 3000 feet AGL, I have aerobatic experience in several different types, a glider rating, and the Sport Cruiser has a BRS parachute. Please be careful up there!


Some time back I watched an excellent video of Geoff Hinton presenting deep learning at Google, http://www.youtube.com/watch?v=AyzOUbkUf3M, and I decided that I really needed to learn more about this technology.

Well, the best way to learn is to do, so I decided to start coding a Restricted Boltzmann Machine (RBM) in R.  I found this really good article and code online, http://bayesianthink.blogspot.com/2013/05/the-restricted-boltzmann-machine-rbm.html, and decided to see if I could make an S3 object out of it and make it run a little faster…

# Setup the training data 
image_rows <- 5 
image_cols <- 9 
up <- c(0,0,0,0,1,0,0,0,0,
        0,0,0,1,1,1,0,0,0,
        0,0,1,1,1,1,1,0,0, 
        0,1,1,1,1,1,1,1,0, 
        1,1,1,1,1,1,1,1,1) 
down <- c(1,1,1,1,1,1,1,1,1, 
          0,1,1,1,1,1,1,1,0, 
          0,0,1,1,1,1,1,0,0, 
          0,0,0,1,1,1,0,0,0, 
          0,0,0,0,1,0,0,0,0) 
# Reverse each list, so that the image is drawn correctly 
up <- rev(up) 
down <- rev(down)
training_set <- rbind(up, down)

This gives us two arrows, one pointing up and the other pointing down:

[Images: the “up” and “down” arrow training patterns]

Next, we can run the RBM with the following code.  It’s based on the reference above, but I have added bias units and recoded it to make better use of R’s vector functions.  This gave about a 10x speedup on my Mac Mini.  I haven’t yet added mini-batches, convergence testing, or anything that would make it more industrial strength, but that would only complicate it for now.

rbm.prob <- function(a) {
  1/(1 + exp(-a))
}

fit.rbm <- function(data, maxiter=5000, alpha=0.01, hidden=3) {
  # Add 1 to the sizes for the bias units
  hstates <- hidden + 1
  vstates <- ncol(data) + 1
  init.std = 0.5
  w <- matrix(data=rnorm(hstates * vstates, 0, init.std), nrow=vstates)

  for (iter in seq_len(maxiter)) {
    # Get a training sample
    pos <- sample(1:nrow(data),1)
    visible.state <- c(data[pos,], 1)  # Add in bias unit

    # Forward pass, to get hidden unit activations
    hidden.prob <- as.vector( rbm.prob( visible.state %*% w ) )
    hidden.state <- ifelse(hidden.prob > runif(hstates,0,1), 1, 0)
    hidden.state[length(hidden.state)] <- 1  ## Bias
    iter0 <- visible.state %o% hidden.state

    # Reconstruction pass, to get visible unit activations from hidden
    visible.prob <- as.vector( rbm.prob( w %*% hidden.state ) )
    visible.state <- ifelse(visible.prob > runif(vstates,0,1), 1, 0)
    visible.state[length(visible.state)] <- 1  ## Bias

    # Now go forwards again
    hidden.prob <- rbm.prob( visible.state %*% w )
    hidden.prob[length(hidden.prob)] <- 1  ## Bias
    hidden.state <- ifelse(hidden.prob > runif(hstates,0,1), 1, 0)

    # Update weights
    iter1 <- visible.state %o% as.vector( hidden.prob )
    w <- w + alpha * (iter0 - iter1)
  }

  rc = list(weights=w, visible=vstates, hidden=hstates)
  class(rc) <- "rbm"
  rc
}

And now we can take a look at the output graphically.  If we show the visible layer an upwards arrow and activate the hidden layer, this shows some of the images that it “reconstructs” – so now we can see what it dreams about:

[Images: five samples of what the RBM “dreams” after being shown the up arrow]

Here’s the code that draws the little picture, in case you’d like to try it:

dream.rbm <- function(x, sample) {
  # Add bias unit
  visible.state <- c(sample, 1)
  hidden.prob <- as.vector( rbm.prob( visible.state %*% x$weights ) )
  hidden.state <- ifelse(hidden.prob > runif(x$hidden,0,1), 1, 0)
  dream.prob <- as.vector( rbm.prob( x$weights %*% hidden.state ) )
  dream.state <- ifelse(dream.prob > runif(x$visible,0,1), 1, 0)
  # remove bias unit
  dream.state[-length(dream.state)]
}

r <- fit.rbm(training_set)  # First, fit the RBM to the two arrow patterns
d <- dream.rbm(r, up)
image( t( matrix(d, nrow=image_rows, byrow=TRUE) ), xaxt="n", yaxt="n" )

Oh, and one more fun thing to play with. We can set the hidden state to whatever we want and see what type of images are recalled in the visible layer:

try_hidden.rbm <- function(x, h) {
  hidden.state <- c(h,1)
  visible.prob <- as.vector(rbm.prob( x$weights %*% hidden.state ))
  visible.state <- ifelse(visible.prob > runif(x$visible,0,1), 1, 0)
  visible.state[-length(visible.state)]
}

d2 <- try_hidden.rbm(r, c(0,1,0))
image( t( matrix(d2, nrow=image_rows, byrow=TRUE) ), xaxt="n", yaxt="n" )

And using (0,1,0) results in images that look like this:

[Images: three reconstructed visible-layer patterns]

This is one of the building blocks of deep learning.  The next two steps are:

  • Stack 2 or more unsupervised stages together, to find even deeper structure in the data.  Hinton’s video shows some of the things that are possible and should really whet your appetite to dive into deep learning.
  • Build a supervised model that uses the hidden layer units as inputs.  For example, logistic regression is so much easier using the 3 hidden units, instead of the 45 visible units.

Enjoy!


The new SAT

Last night I was briefly watching the news, and they interviewed Susan Hansen as a “test prep expert” talking about the new SAT.  She made a comment that the 1/4 point penalty for a wrong answer doesn’t reward guessing.

Actually, it’s a little more complicated than this and the penalty for wrong answers is quite interesting.  If you can cross out a couple of obviously wrong answers, then you should just guess between the remaining choices.  Let’s take a look at the College Board SAT Question Of The Day ** for today:

Troy was ——- when he wasn’t elected class president: his spirits were so low that there was nothing we could say or do to cheer him up.

  • (A) unctuous
  • (B) disconsolate
  • (C) ebullient
  • (D) inscrutable
  • (E) tenacious

Now, suppose you don’t know the correct answer, but you do know what “tenacious” and “inscrutable” mean.  Cross these two off the list and pick one of the remaining three answers at random.  One third of the time you’ll be correct and get one point, two thirds of the time you’ll be wrong and lose a quarter point.

Expected score = (1/3 * 1) – (2/3 * 1/4) = 1/6

The expectation is positive, so you’re better off guessing among the three remaining options than leaving the question blank.  If you can narrow it down to two possible answers, then:

Expected score = (1/2 * 1) – (1/2 * 1/4) = 3/8

So you get 3/8 of a point, on average, by guessing.  The system rewards you for what you do know, even if you don’t have the exact answer.
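
The same arithmetic is easy to tabulate — here’s a quick sketch of the expected score as a function of how many choices remain after you cross out the ones you know are wrong:

def expected_score(remaining, penalty=0.25):
    # Guess uniformly among the remaining choices
    p_right = 1.0 / remaining
    return p_right - (1.0 - p_right) * penalty

for n in range(5, 0, -1):
    print(n, expected_score(n))
# 5 -> 0.0, 4 -> 0.0625, 3 -> 0.1667, 2 -> 0.375, 1 -> 1.0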

I tested out of a third of my undergraduate degree, via the CLEP exams.  In every test that had multiple choice answers, I used this strategy.  It works!

** http://sat.collegeboard.org/practice/answered-question-of-the-day


What R taught me about Java

I was recently using glmnet to do some classification work.  It’s a great package and worked really well for me, but I kept getting some really frustrating errors:

Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
  NA/NaN/Inf in foreign function call (arg 5)

This also happened with elnet, mrelnet…

A quick search of the Internet suggested that this error often happens when you pass in bad data. Well, since I had written the Java code that created the input data, I was pretty sure the data was correctly formatted. How hard can it be to create a CSV file and then read it into a data frame in R?

Well, it turns out that I hadn’t flushed or closed the file, and the garbage collector doesn’t do that for you.  So the file I wrote was incomplete, and I hadn’t looked at the last few lines.

Lesson learned – always close your files 🙂


Three is less than a hundred, right?

I’ve recently started consulting as a Data Scientist and wanted to share something interesting about Hadoop.  One of the people on my team was having trouble getting her code to run, as it kept running out of space.

The code was expected to write about 3 GB of output and her quota was 30 GB, yet it ran out of space.  The systems guys upped her quota to 100 GB, and it still failed.  As I helped her dig into the code, the cause became obvious…

The Pig script chose to use 135 reducers, and each reducer writes an output file to HDFS.  The Oracle Big Data Appliance (BDA) had been set up with a 256 MB block size, so this is the maximum amount of space the Name Node considers the job could use:

(135 reducers) x (256 MB) x (3 replication factor) = 103 GB

So, the solution that kept the job within the quota was to reduce the number of reducers, which didn’t hurt the parallelism of the job.  Writing with a smaller block size would also have worked, but wasn’t needed for this job.
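
A quick back-of-the-envelope helper makes the effect of the reducer count obvious. The alternative reducer count below is purely illustrative, not the value we actually used:

def hdfs_space_estimate_gb(reducers, block_mb=256, replication=3):
    # Worst case charged against the quota: one full block per output file,
    # times the replication factor
    return reducers * block_mb * replication / 1000.0

hdfs_space_estimate_gb(135)   # ~103 GB -- blows a 100 GB quota
hdfs_space_estimate_gb(30)    # ~23 GB  -- fits comfortably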

As I’m working now with Hadoop, I’ll be posting some more on it.


Futoshiki – the code

Oh, I thought I might publish the code.  I haven’t done much in Python – but the interface to Gurobi is really nice and makes great use of list comprehensions.  I would rather use Python for mathematical programming than AMPL or OPL, as Python is a much more complete language and I can get help and extensions on the web…

#
# Futoshiki puzzle, using Gurobi
#
from gurobipy import *
# Data
dim = [1,2,3,4,5]
greater = [
  (1,1, 1,2),
  (2,3, 2,4),
  (3,2, 2,2),
  (3,3, 2,3),
  (3,4, 3,3),
  (4,3, 5,3),
  (4,5, 5,5),
  (5,2, 4,2),
  (5,3, 5,2),
  (5,3, 5,4)
  ]

try:
  # Create a new model
  m = Model("plan1")

  # Create variables
  x = {}
  for row in dim:
    for col in dim:
      x[row,col] = m.addVar(lb=1, ub=5,vtype=GRB.INTEGER, name='x_%s_%s' % (row, col))

  # Integrate new variables into the model
  m.update()

  # Relationship constraints
  for row1,col1, row2,col2 in greater:
    m.addConstr(x[row1,col1] >= 1 + x[row2,col2], 'gt_%s_%s_%s_%s' % (row1,col1,row2,col2))

  # Row and column sums must each equal 1+2+3+4+5 = 15
  for i in dim:
    m.addConstr(quicksum(x[i,col] for col in dim) == 15, 'row_%s' % (i))
    m.addConstr(quicksum(x[row,i] for row in dim) == 15, 'col_%s' % (i))
  m.addConstr(quicksum(x[i,j] for i,j in x) == 15*5, 'total')

  # Solve it
  m.optimize()

  # Display the solution
  for v in m.getVars():
    print(v.varName, v.x)

  print('Obj:', m.objVal)

except GurobiError:
  print('Error reported')
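
If the model solves, a slightly friendlier way to show the answer is to print the grid row by row. This is just a sketch that assumes the listing above has run, so m, x and dim are in scope:

# Print the solved 5x5 grid
for row in dim:
  print(" ".join(str(int(round(x[row, col].x))) for col in dim))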

Futoshiki

A few weeks ago I was introduced to Futoshiki, a Japanese puzzle.  It was in the Gulf News, which I was reading in the U.A.E. while there on a business trip.

Once I got back home, I was wondering if I could write a program to do this by Integer Programming (IP).  The technique is similar to solving Sudoku puzzles via IP.

I wrote a description of the formulation and uploaded it as a PDF file; I hope you find it interesting.  To solve the puzzle I used the Gurobi solver, coding in Python.  The problem size, with a 5×5 puzzle, is small enough to solve with the trial edition of Gurobi.  In future posts I’ll find new ways to abuse math programming systems for games and other diversions 🙂

https://mathstream.com/wp-content/uploads/2012/08/futoshiki.pdf
