Saturday, January 31, 2015

Predicting Citrus varieties from Reflectance Data


This is yet another exercise from the Caltech/JPL Summer School on Big Data Analytics. This one is for the lectures by David Thompson on Dimensionality Reduction. The problem is to differentiate citrus plants from non-citrus plants (and then between different varieties of citrus plants) from the light spectrum they reflect as observed by an airborne instrument. The exercise requires the student to do some basic data analysis, then perform Principal Component Analysis (PCA) to find the best features to differentiate between Citrus and Non-Citrus plants. In the next part of the exercise, the student is asked to use Linear Discriminant Analysis (LDA) to predict one of six varieties of Citrus plants.

The dataset consists of 4 comma-separated (CSV) text files. The first two are for the first (PCA) exercise, and the second two are for the second (LDA) exercise. Each pair consists of a data file containing the class (citrus or non-citrus in the first case, the citrus species in the second), the plant number, and a set of reflectance values, each corresponding to a different wavelength. There are 426 wavelengths in the first case and 388 in the second. Each plant has multiple records, possibly corresponding to observations at different times. The actual wavelength values are supplied in the second file of each pair, one wavelength per line. The first dataset has 5,944 records and the second dataset has 2,109 records.

In this post I describe my attempt at solving the exercise. I used Python with Numpy and Scikit-Learn, and Matplotlib for the visualizations.

We first attempt to visualize the data and see if we can observe a difference in the reflectance properties between citrus and non-citrus plants. This is done with the code below.

from sklearn.cross_validation import KFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import random

DATA_DIR = "../../data/citrus_pca"

def plot_by_class(X, y, w):
    """Plot reflectance across wavelengths for citrus vs non-citrus"""
    viz_df = pd.DataFrame()
    viz_df["wavelength"] = w
    colnames = {1: "citrus", 2: "noncit"}
    for yval in range(1,3):
        Xc = X[np.where(y == yval)[0], :]
        xmin = np.min(Xc, axis=0)
        xavg = np.mean(Xc, axis=0)
        xmax = np.max(Xc, axis=0)
        viz_df[colnames[yval] + "_min"] = xmin.T
        viz_df[colnames[yval] + "_avg"] = xavg.T
        viz_df[colnames[yval] + "_max"] = xmax.T
    plt.subplot(211)
    viz_df.plot(x="wavelength", y=["citrus_min", "citrus_avg", "citrus_max"],
                title="Citrus Plants", style=["r--", "b", "g--"])
    plt.xticks([])
    plt.xlabel("")
    plt.ylabel("reflectance")
    plt.subplot(212)
    viz_df.plot(x="wavelength", y=["noncit_min", "noncit_avg", "noncit_max"],
                title="Non-Citrus Plants", style=["r--", "b", "g--"])
    plt.ylabel("reflectance")
    plt.show()
    
cid_df = pd.read_csv(os.path.join(DATA_DIR, "Citrus_Identification_Data.txt"))
y_class = np.asarray(cid_df["class"])
y_plant = np.asarray(cid_df["plant"])
X = np.matrix(cid_df[cid_df.columns[3:]])  # capture the bNNN cols (representing wavelengths)

wavelengths = np.loadtxt(os.path.join(DATA_DIR, "Citrus_Identification_Wavelengths.txt"))

# How many classes?
print "#-classes:", len(np.unique(y_class))
print "#-plants:", len(np.unique(y_plant))

# Plot reflectance vs wavelength for citrus vs non-citrus plants
plot_by_class(X, y_class, wavelengths)

The code above tells us that the data describes two classes (Citrus, Non-Citrus) and 160 unique plants. The plot below shows the variation in reflectance for each class of plant across wavelengths. The dashed green and red lines represent the maximum and minimum reflectance over all plants in that class, and the blue line represents the mean. As you can see, the graphs look quite similar, although there are small differences in the reflectance.


Our next attempt is to see if there are differences across individual plants. The following code calculates the mean spectrum for each of 10 randomly chosen plants. Once more, the differences appear to be in the amount of reflectance of each plant; the shape of the spectrum looks fairly uniform across plants.

def plot_random_plants(X, y, w, n):
    """Plot reflectance vs wavelengths for n random plants"""
    viz_df = pd.DataFrame()
    viz_df["wavelength"] = w
    # pick n distinct plant ids at random from those actually present in y
    plant_ids_lst = random.sample(list(np.unique(y)), n)
    # for each plant plot mean value of wavelengths
    pcols = []
    for i in range(n):
        Xp = X[np.where(y == plant_ids_lst[i])[0], :]
        xavg = np.mean(Xp, axis=0)
        pcols.append("p" + str(i))
        viz_df["p" + str(i)] = xavg.T
    viz_df.plot(x="wavelength", y=pcols, 
                title="Mean for %d random plants" % (n))
    plt.ylabel("reflectance")        
    plt.show()

## Plot Reflectance vs Wavelengths for 10 random plants on same chart
plot_random_plants(X, y_plant, wavelengths, 10)


We then attempt to calculate the correlation between features. As this blog post points out, high correlation between features may be an indication that there are one or more underlying factors driving their similarity, with some less important factors driving their differences. When features are highly correlated, it may be possible to describe them adequately with a small number of principal components.
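
To make that intuition concrete, here is a minimal sketch (on synthetic data, not the citrus dataset) showing that two highly correlated features are captured almost entirely by a single principal component:

import numpy as np

# Hypothetical example: two noisy copies of the same underlying signal
np.random.seed(42)
t = np.random.randn(1000)
X = np.column_stack([t + 0.05 * np.random.randn(1000),
                     t + 0.05 * np.random.randn(1000)])

print "correlation between the two features:", np.corrcoef(X, rowvar=0)[0, 1]

# SVD of the centered data: the first singular value dominates, i.e. one
# principal component explains nearly all of the variance
Xc = X - X.mean(axis=0)
U, s, V = np.linalg.svd(Xc, full_matrices=False)
print "fraction of variance in PC1:", (s[0] ** 2) / np.sum(s ** 2)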

The code below computes correlations across all pairs of features, and plots the correlation matrix as a heatmap. I scaled the data first (to a mean of 0 and standard deviation of 1) because PCA (the next step) needs the input to be scaled; this does not affect the correlation coefficients. As you can see from the heatmap, there is some correlation around the lower end of the spectrum and once again some (weaker) correlation at the higher end.

def plot_correlations_as_heatmap(X):
    """Plots heatmap from blue (low) to pink (high)"""
    corr = np.corrcoef(X, rowvar=0)  # correlate columns (features), not rows
    plt.imshow(corr, interpolation="nearest", cmap=plt.cm.cool)
    plt.xticks(())
    plt.yticks(())
    plt.title("Correlation Matrix")
    plt.colorbar()
    plt.show()
    
# plot heat map of correlations between features
scaler = StandardScaler()
Xscaled = scaler.fit_transform(X)
plot_correlations_as_heatmap(Xscaled)


We then perform PCA on the scaled dataset and see how the eigenvalues decay across the number of components. We restrict our plot to the first 50 principal components (to keep the chart readable), which together explain 85.6% of the total variance. The first few components account for the bulk of the variance, so the behavior of the system can be approximated by considering only the first few components. Since the correlations are not very strong, the decay is not that dramatic.

def plot_fraction_of_variance_explained(s, n):
    """Plot the first n values of eigenvalue s as fraction of total"""
    fve = s / np.cumsum(s)[-1]
    plt.plot(np.arange(len(fve))[0:n], fve[0:n])
    plt.title("Eigenvalue Decay")
    plt.xlabel("Principal Eigenvalues")
    plt.ylabel("Fraction of Variance Explained")
    plt.show()
    return fve

# Perform a PCA on the (scaled) matrix
U, s, V = np.linalg.svd(Xscaled)

# Plot eigenvalue decay (as fraction of variance explained)
fve = plot_fraction_of_variance_explained(s, 50)
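
As a cross-check, Scikit-Learn's PCA exposes per-component fractions directly through its explained_variance_ratio_ attribute. A small sketch follows; note that this ratio is based on squared singular values, so it corresponds to s**2 / sum(s**2) rather than the s / sum(s) fractions plotted above.

from sklearn.decomposition import PCA

# Sketch: equivalent computation with Scikit-Learn's PCA
pca = PCA(n_components=50)
pca.fit(Xscaled)
print "variance explained by first 10 components:", \
    np.sum(pca.explained_variance_ratio_[0:10])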


We then project our data onto the first 10 components (accounting for 69.4% of the total variance) and draw pairwise scatter plots of the first few dimensions, for both the original data and the projected data. As expected, the original data exhibits higher correlation than the projected data. However, even in the projected data, there is no clear separation between citrus (blue) and non-citrus (red).

def project_data(U, s, k):
    """Project the data onto the first k principal components (U * S, truncated)"""
    Ured = U[:, 0:k]
    Sred = np.diag(s[0:k])
    return np.dot(Ured, Sred)

def recover_data(Xred, V, k):
    Vred = V[0:k, :]
    return np.dot(Xred, Vred)
    
def plot_scatter_plots(X, y, title):
    pairs = [(0, 1), (1, 2), (0, 2)]
    Xparts = [X[np.where(y == 1)[0], :], X[np.where(y == 2)[0], :]]
    for i in range(len(pairs)):
        plt.subplot(1, 3, i + 1)
        for j in range(len(Xparts)):
            ecol = 'b' if j == 0 else 'r'
            plt.scatter(Xparts[j][:, pairs[i][0]], Xparts[j][:, pairs[i][1]], 
                        s=10, edgecolors=ecol, facecolors='none')
        if i == 1:
            plt.title(title)
        plt.xlabel("X" + str(pairs[i][0]))
        plt.ylabel("X" + str(pairs[i][1]))
        plt.xticks(())
        plt.yticks(())
    plt.show()

# Project dataset onto the first k components
k = 10
print "Variance explained with %d principal components = %0.3f" % (k, np.sum(fve[0:k]))

# Project Xscaled to first k dimensions
Xprojected = project_data(U, s, k)
plot_by_class(Xprojected, y_class, np.arange(k))

# Scatter plots between dimension pairs 0/1, 1/2, 0/2
plot_scatter_plots(Xscaled, y_class, "Original Data")
plot_scatter_plots(Xprojected, y_class, "Projected Data")




The final step in this part of the exercise was to build a classifier to predict between citrus and non-citrus plants. Because the reflectances of citrus plants appeared to be spread out more than those of non-citrus plants (also borne out by the original chart of citrus vs non-citrus reflectances), I decided to use a Support Vector Machine with a Radial Basis Function (RBF) kernel. The results were a bit counter-intuitive: with 3-fold cross-validation over the original data the accuracy was 85.1%, but after PCA only 80.1%. I also tried a RandomForest with 200 trees (a sketch of that variant follows the SVM code below), and obtained an accuracy of 84.3% with the original data and 85.7% after PCA (data projected onto the first 10 components).

def evaluate_classification(X, y):
    """Evaluate classifier using 3-fold cross validation"""
    kfold = KFold(X.shape[0], 3)
    scores = []
    for train, test in kfold:
        Xtrain, Xtest, ytrain, ytest = X[train], X[test], y[train], y[test]
        clf = SVC()  # rbf kernel default
        clf.fit(Xtrain, ytrain)
        ypred = clf.predict(Xtest)
        scores.append(accuracy_score(ypred, ytest))
    return 1.0 * sum(scores) / len(scores)
    
# Train classifier on original and PCA-projected data with 3-fold cross validation
score_orig = evaluate_classification(Xscaled, y_class)
score_pca = evaluate_classification(Xprojected, y_class)
print "Classification accuracy: original: %.3f, after PCA: %.3f" % (score_orig, score_pca)

The next part of the exercise was to build an LDA classifier to differentiate between different species (varieties) of citrus plants, again based on similar reflectance data. The first step is to do PCA to project the data onto its first two principal components. As you can see, there is some separability between the different varieties.

from sklearn.cross_validation import train_test_split, KFold
from sklearn.lda import LDA
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd

DATA_DIR = "../../data/citrus_pca"

cvdf = pd.read_csv(os.path.join(DATA_DIR, "Citrus_Varieties.txt"))
y_species = np.asarray(cvdf["class"])
y_plant = np.asarray(cvdf["plant"])
X = np.matrix(cvdf[cvdf.columns[3:]])

wavelengths = np.loadtxt(os.path.join(DATA_DIR, "Citrus_Varieties_Wavelengths.txt"))

# Reduce to 2 dimensions and visualize
scaler = StandardScaler()
Xscaled = scaler.fit_transform(X)

U, s, V = np.linalg.svd(Xscaled)
Ured = U[:, 0:2]
Sred = np.diag(s[0:2])
Xproj = np.dot(Ured, Sred)

species_ids = np.unique(y_species)
print "species_ids:", species_ids
colors = ['b', 'g', 'r', 'c', 'm', 'y']
i = 0
for species_id in species_ids:
    Xpart = Xproj[np.where(y_species == species_id)[0], :]
    plt.scatter(Xpart[:, 0], Xpart[:, 1], color=colors[i])
    i = i + 1
plt.title("Citrus Species (first 2 Principal Components)")
plt.xlabel("X0")
plt.ylabel("X1")
plt.show()


The next step is to train a multiclass LDA classifier and test its accuracy on a holdout test dataset. We split our dataset 75/25 train/test, train a Scikit-Learn LDA classifier on the training data, and obtain a pretty impressive 97.3% accuracy on the held-out test set. LDA provides the "mean" vector for each of its classes. In order to check which species are most similar, we build a correlation matrix over these class means and represent it as a heatmap. As can easily be seen from the heatmap, the most similar species (in terms of reflectance spectrum) are the Mandarin, Pummelo and the Hybrid.

# Perform multiclass LDA 
Xtrain, Xtest, ytrain, ytest = train_test_split(Xscaled, y_species, 
                                                test_size=0.25, 
                                                random_state=42)
clf = LDA(len(species_ids))
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
print "LDA Accuracy Score: %.3f" % (accuracy_score(ypred, ytest))

# What varieties are most spectrally similar?
corr = np.corrcoef(clf.means_)
plt.imshow(corr, interpolation="nearest", cmap=plt.cm.cool)
plt.xticks(np.arange(len(species_ids)), species_ids, rotation=45)
plt.yticks(np.arange(len(species_ids)), species_ids)
plt.colorbar()
plt.show()


Finally, in an effort to test if we can predict species (and then plant) effectively from the reflectance data, we build small harnesses to do 10-fold cross validation with the LDA classifier. The cross-validation accuracy score for the species predictor is a slightly less impressive 78.6% and for the plant predictor an abysmal 3.2%.

# Find LDA classifier accuracy using cross validation
kfold = KFold(Xscaled.shape[0], 10)
scores = []
for train, test in kfold:
    Xtrain, Xtest, ytrain, ytest = Xscaled[train], Xscaled[test], \
        y_species[train], y_species[test]
    clf = LDA(len(species_ids))
    clf.fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    scores.append(accuracy_score(ypred, ytest))
print "LDA accuracy predicting species: %.3f" % (1.0 * sum(scores) / len(scores))

# Find LDA classifier accuracy for predicting specific plants
kfold = KFold(Xscaled.shape[0], 10)
scores = []
for train, test in kfold:
    Xtrain, Xtest, ytrain, ytest = Xscaled[train], Xscaled[test], \
        y_plant[train], y_plant[test]
    clf = LDA(len(species_ids))
    clf.fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    scores.append(accuracy_score(ypred, ytest))
print "LDA accuracy predicting plants: %.3f" % (1.0 * sum(scores) / len(scores))

This is all I have for today. As always (at least for most of my recent posts), this code is available on GitHub. I had a lot of fun doing this exercise even though the results are probably not that great. I did not know about the LDA classifier before this exercise, so it was good to learn about it. I also learned quite a few Matplotlib tricks. I hope you enjoyed it too.

Monday, January 19, 2015

Using Random Forests to find Interesting Features on Comets


This is another exercise from the Caltech/JPL Summer School on Big Data Analytics, specifically the one following the lectures on Decision Trees and Random Forests by Thomas Fuchs of JPL. The object of the exercise is to predict interesting features on comets and asteroids as space probes fly by them, so that onboard cameras can take detailed pictures of those spots autonomously. Manual control from the ground is not an option because time lags are significant at these distances. The exercise asks the student to build and train a Random Forest classifier and investigate its performance.

The dataset provided is of surface features from the comet Hartley. There are 2,089 structured observations provided as an R dataset (.rdata). The structure of each observation is as follows:

struct data {
  float[11][11] gray;      // grayscale values normalized 0..1
  float[11][11] median;    // grayscale for median filtered image (0..1)
  int Y;                   // 0 = boring; 1 = interesting
  int[4] rect;             // bounding box of the observation
  char* frameName;
  int frameId;
}

Since I prefer to work in Python, I decided to build my classifier using the Scikit-Learn toolkit. I initially tried using rpy2 to read this data from within Python, but it complained that the .rdata file was corrupt. So I switched around and used a combination of my rudimentary R skills and the starter code to write the data out from R into flat files, using the code below:

# Source: rdata_reader.R
setwd("/.../mlia-examples/data/hartley")
dat <- dget("patchesHartley2.rdat")

y <- sapply(dat, function(x) x$Y)
write.table(y, file="y.csv", row.names=FALSE, col.names=FALSE)

Xgray <- matrix(nrow=length(dat), ncol=121)
Xmedian <- matrix(nrow=length(dat), ncol=121)
for (i in 1:length(dat)) {
  Xgray[i, ] = as.vector(dat[[i]]$gray)
  Xmedian[i, ] = as.vector(dat[[i]]$median)
}
write.table(Xgray, file="Xgray.csv", row.names=FALSE, col.names=FALSE)
write.table(Xmedian, file="Xmedian.csv", row.names=FALSE, col.names=FALSE)

This creates 3 files: y.csv with 2,089 target values, one per line, and Xgray.csv and Xmedian.csv, each also with 2,089 lines and 121 (11x11) grayscale values per line.
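
A quick sanity check (a small sketch, run from the directory containing the exported files) is to load them back and verify the shapes:

import numpy as np

# Sketch: load the exported flat files and verify the expected shapes
y = np.loadtxt("y.csv")
Xgray = np.loadtxt("Xgray.csv", delimiter=" ")
Xmedian = np.loadtxt("Xmedian.csv", delimiter=" ")
print y.shape, Xgray.shape, Xmedian.shape  # expect (2089,), (2089, 121), (2089, 121)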

Our Python code (shown below) builds various models with different combinations of Xgray and Xmedian, and measures their OOB (Out of Bag) error estimates. As mentioned in this article, in the case of Random Forests, cross validation is not required to get an estimate of the test set error; the OOB error is estimated internally and serves as a good estimate.

# Source: hartley_classify.py
from operator import itemgetter
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
import numpy as np
import matplotlib.pyplot as plt
import os

DATA_DIR = "../../data/hartley"

def top_n_features(fimps, n):
    return map(lambda x: x[0], sorted(enumerate(fimps), key=itemgetter(1), 
                                       reverse=True)[0:n])
    
plt.figure()

y = np.loadtxt(os.path.join(DATA_DIR, "y.csv"))
for model_id in range(5):    
    Xgray = np.loadtxt(os.path.join(DATA_DIR, "Xgray.csv"), delimiter=" ")
    Xmedian = np.loadtxt(os.path.join(DATA_DIR, "Xmedian.csv"), delimiter=" ")
    if model_id == 0:
        # gray values
        X = Xgray
    elif model_id == 1:
        # gray values minus median values
        X = Xgray - Xmedian
    elif model_id == 2:        
        # gray values concatenated with median values
        X = np.concatenate((Xgray, Xmedian), axis=1)
    elif model_id == 3:
        # combination 1 and 2
        X = np.concatenate((Xgray, Xmedian, Xgray - Xmedian), axis=1)
    elif model_id == 4:
        # mean shift values and add them in
        Xgray_mu = np.mean(Xgray)
        Xgray_ms = Xgray - Xgray_mu
        Xmedian_mu = np.mean(Xmedian)
        Xmedian_ms = Xmedian - Xmedian_mu
        X = np.concatenate((Xgray, Xmedian, Xgray_ms, Xmedian_ms, 
                            Xgray - Xmedian, Xgray_ms - Xmedian_ms), axis=1)
    clf = RandomForestClassifier(n_estimators=200, max_features="auto", 
                                 oob_score=True, n_jobs=-1)
    # Split data train/test = 75/25
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, 
                                                    random_state=42)
    clf.fit(Xtrain, ytrain)
    print "OOB Score:", clf.oob_score_
    print "Top 5 Features:", top_n_features(clf.feature_importances_, 5)
    # compute ROC curve data
    ypred = clf.predict_proba(Xtest)
    fpr, tpr, threshold = roc_curve(ytest, ypred[:, 1])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label="Model#%d (AUC=%.2f)" % (model_id + 1, roc_auc))

# baseline, axes, labels, etc
plt.plot([0, 1], [0, 1], "k--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.show()

The first model uses Xgray values only as input variables, the second uses the difference between Xgray and Xmedian (ie the highlights), the third uses a concatenation of Xgray and Xmedian, the fourth a concatenation of Xgray, Xmedian and the difference, and the fifth uses Xgray, Xmedian, mean shifted Xgray, mean shifted Xmedian and the difference between mean shifted Xgray and mean shifted Xmedian. OOB Estimates on the full dataset were in the 95-97% range, and 90-93% on a 25% held out test set. Here are the OOB estimates for each model.

Model 1: 0.924648786718
Model 2: 0.907407407407
Model 3: 0.92975734355
Model 4: 0.925287356322
Model 5: 0.932950191571

From the results, Model #5 performed the best. Random Forests can also provide an ordering of their input features by importance. This information can be used to retrain the classifier with only the top N features. I haven't tried this, but it may be worth trying out to see if results improve further; a sketch of the idea follows the feature list below. The top 5 features for each model were as follows.

Model 1 Top 5 Features: [60, 71, 61, 49, 50]
Model 2 Top 5 Features: [60, 57, 49, 46, 71]
Model 3 Top 5 Features: [181, 170, 60, 192, 61]
Model 4 Top 5 Features: [181, 192, 170, 60, 182]
Model 5 Top 5 Features: [181, 423, 412, 192, 60]
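
The retraining idea mentioned above could be sketched roughly as follows. The cutoff of 20 features is an arbitrary choice, and clf, X and y here are the ones left over from the last model fitted in the loop above.

# Sketch: retrain on only the top N most important features
# (N=20 is arbitrary; clf, X, y are from the last model fitted above)
top_features = top_n_features(clf.feature_importances_, 20)
Xtop = X[:, top_features]
Xtrain, Xtest, ytrain, ytest = train_test_split(Xtop, y, test_size=0.25,
                                                random_state=42)
clf_top = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1)
clf_top.fit(Xtrain, ytrain)
print "OOB Score (top 20 features):", clf_top.oob_score_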

Finally, we plot the ROC (Receiver Operating Characteristic, basically a plot of True Positive Rate against False Positive Rate) curves for each of the 5 classifiers. The dotted line represents the baseline. Using the AUC (Area Under the Curve), Model 1 looks slightly better than Model 5. However, all the models look almost equally good, probably because it is a toy example.


That's all I have for today. In the past, I hadn't looked at Random Forests much because they are not that commonly used for text classification (although I suspect the reason for that may just be a historical preference for other methods).

Sunday, January 11, 2015

Coding A Better Strategy for Hangman


Happy New Year! In this post, I will describe a fun little project that came about while going through the lectures in The Caltech-JPL Summer School on Big Data Analytics course on Coursera (really a dump of the lecture notes and videos from the actual week-long sessions rather than a formal course). There I came across an interesting assignment: implementing the ideas outlined in this lifehacker article - A Better Strategy for Hangman. Hangman is a simple word-guessing game played by two players, one of whom (the Proposer) thinks of a word and the other (the Solver) tries to guess it. The guessing is done one letter at a time - the strategy described in the article uses letter frequencies to inform these guesses. The article goes into much more detail, and you should read it to get a full understanding of the strategy.

In real life, the game is played between two people, each of whom has their own (probably different sized) vocabulary. In our code, we use the words from the /usr/share/dict/words file (present in Ubuntu, and likely available on other Unix based systems as well). So both Proposer and Solver use the same list of words as their "vocabulary". The Proposer selects a random word from the list and the Solver uses letter frequencies to make letter guesses until it guesses the word or the allowed number of guesses (11) runs out. Here is the code for the game.

# Source: src/hangman/game.py
# -*- coding: utf-8 -*-
import operator
from random import random

MAX_GUESSES_PER_GAME = 11
MAX_WORDS_IN_SUBSET = 25

########################### Preprocessing ############################

non_ascii_mappings = list([
    ('á', 'a'), ('à', 'a'), ('â', 'a'), ('ä', 'a'), ('å', 'a'), ('ã', 'a'),
    ('ç', 'c'), ('ć', 'c'),
    ('é', 'e'), ('è', 'e'), ('ê', 'e'), ('ë', 'e'),
    ('í', 'i'),
    ('ñ', 'n'),
    ('ó', 'o'), ('ò', 'o'), ('ô', 'o'), ('ö', 'o'),
    ('ú', 'u'), ('ü', 'u'),
    ('’', '\''), ('”', '"')
])

def ascii_fold(s):
    for x in non_ascii_mappings:
        s = s.replace(x[0], x[1])
    return s

def preprocess(dictfile):    
    fwords = open(dictfile, 'rb')
    wset = set()
    for line in fwords:
        word = line.strip().lower()
        word = ascii_fold(word)
        if word.endswith("'s"):
            word = word[:-2]
        word = word.replace(" ", "").replace("'", "").replace("\"", "")
        wset.add(word)
    fwords.close()
    return list(wset)

############################# Proposer Side #############################
    
def select_secret_word(words):
    widx = int(random() * len(words))
    return words[widx]
    
def find_all_match_positions(secret_word, guess_char):
    positions = []
    curr_pos = 0
    while curr_pos < len(secret_word):
        curr_pos = secret_word.find(guess_char, curr_pos)
        if curr_pos < 0:
            break
        positions.append(curr_pos)
        curr_pos += 1
    return positions    
    
def update_guessed_word(guessed_word, matched_positions, guessed_char):
    for pos in matched_positions:
        guessed_word[pos] = guessed_char
    
def is_solved(guessed_word):
    chars_remaining = len(filter(lambda x: x == '_', guessed_word))
    return chars_remaining == 0
    

############################# Solver Side ###############################

letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", 
           "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]

def most_frequent_char(words, previously_guessed):
    cmap = {x:0 for x in letters}
    for word in words:
        cset = set()
        for c in word:
            if c not in previously_guessed:
                cset.add(c)
        for c in cset:
            cmap[c] += 1
    return sorted(map(lambda x: (x, cmap[x]), cmap.keys()), 
                 key=operator.itemgetter(1), reverse=True)[0][0]
    
def best_guess(words, word_len, bad_guesses, good_guesses, guessed_word):
    temp_words = filter(lambda x: len(x) == word_len, words)
    if len(bad_guesses) > 0:
        for bad_guess in bad_guesses:
            temp_words = filter(lambda x: x.find(bad_guess) == -1, temp_words)
    if len(good_guesses) > 0:
        for good_guess in good_guesses:
            temp_words = filter(lambda x: x.find(good_guess) > -1, temp_words)
    previously_guessed = set(filter(lambda x: x != '_', guessed_word))
    return temp_words, most_frequent_char(temp_words, previously_guessed)
    
def init_guess(wordlen):
    initial_guess = []
    for i in range(wordlen):
        initial_guess.append("_")
    return initial_guess

def match_words_against_template(words, guessed_word):
    if len(words) > MAX_WORDS_IN_SUBSET:
        return words
    matched_words = []
    for word in words:
        word_chars = [c for c in word]
        merged = zip(guessed_word, word_chars)
        diff = len(filter(lambda x: x[0] != x[1], 
                          filter(lambda x: x[0] != '_', merged)))
        if diff == 0:
            matched_words.append(word)
        if len(matched_words) > 1:
            break
    return matched_words

def replace_guessed_word(guessed_word, matched_word):
    matched_chars = [c for c in matched_word]
    for i in range(len(matched_chars)):
        guessed_word[i] = matched_chars[i]
    
################################# Game ###################################
    
def single_round(words, debug=False):
    solver_wins = False
    secret_word = select_secret_word(words)
    if debug:
        print "secret word:", secret_word
    word_len = len(secret_word)
    bad_guesses = set()
    good_guesses = set()
    guessed_word = init_guess(word_len)
    for num_guesses in range(MAX_GUESSES_PER_GAME):
        filtered_words, guess_char = best_guess(words, word_len, bad_guesses, 
                                                good_guesses, guessed_word)
        if debug:
            print "guessed char:", guess_char
        matched_positions = find_all_match_positions(secret_word, guess_char)
        if len(matched_positions) == 0:
            bad_guesses.add(guess_char)
        else:
            good_guesses.add(guess_char)
        update_guessed_word(guessed_word, matched_positions, guess_char)
        matched_words = match_words_against_template(filtered_words, guessed_word)
        if len(matched_words) == 1:
            replace_guessed_word(guessed_word, matched_words[0])
        if debug:
            print "#", num_guesses, "guess:", " ".join(guessed_word)
        if is_solved(guessed_word):
            solver_wins = True
            break
    return len(secret_word), solver_wins, num_guesses
    
def multiple_rounds(words, num_games, report_file):
    fdata = open(report_file, 'wb')
    for i in range(num_games):
        word_len, solver_wins, num_guesses = single_round(words, False)
        fdata.write("%d,%d,%d\n" % (word_len, 1 if solver_wins else 0, num_guesses))
    fdata.close()

################################# Main ###################################

words = preprocess("/usr/share/dict/words")

#single_round(words, True)

multiple_rounds(words, 10000, "hangman.csv")

The /usr/share/dict/words file has 99,171 English words, and contains several letters with diacritics for words adopted from other languages. We initially normalize these words so they are all in lowercase, and replace letters with diacritics with their corresponding ASCII letters from the range 'a' through 'z'. We also replace words ending with 's with their base forms, and remove duplicates. At the end of this, we are left with 72,167 words.

At the beginning of the game, the Solver knows the number of letters in the word, so in order to maximize the chance of getting a letter, it makes sense to guess the most frequent letter for words of that length in the dictionary. If we get the letter, we are obviously one step closer to a win. However, even if we don't get a letter, it still gives us some information which we can use for subsequent guesses - ie, our subset of word candidates does not contain this letter.
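
To illustrate, here is a small sketch (reusing the words list and the most_frequent_char() helper from the code above) that prints the best opening guess for a few word lengths:

# Sketch: best opening guess for a few word lengths, using the helpers above
for word_len in range(3, 9):
    candidates = filter(lambda w: len(w) == word_len, words)
    print word_len, most_frequent_char(candidates, set())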

This approach of consistently choosing the most frequent letter (minus any letters known not to exist in the word) works up to a point. Often, it is possible to use the word template generated from previous guesses to match against the word corpus. If the template matches a single word, we can just guess that word and win. I have implemented this strategy in the match_words_against_template() function in the code above.

An optimization for cases where the template matches multiple words would be to select a letter that resolves the ambiguity (such as a guess of "s" for the pattern "_ e t t l e" where the word could be either "settle" or "kettle") rather than the most frequent character.
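
That optimization is not implemented above, but a sketch of it (a hypothetical helper written to work alongside the game code) might pick the unguessed letter whose presence splits the remaining candidates most evenly, so the candidate set shrinks fastest when only a few candidates remain:

def disambiguating_guess(candidate_words, previously_guessed):
    """Hypothetical helper: pick the unguessed letter whose presence splits
    the remaining candidate words most evenly, so that whichever way the
    Proposer answers, the candidate set shrinks as much as possible."""
    best_char, best_balance = None, None
    for c in letters:
        if c in previously_guessed:
            continue
        with_c = len(filter(lambda w: c in w, candidate_words))
        if with_c == 0 or with_c == len(candidate_words):
            continue  # this letter tells us nothing new
        balance = abs(2 * with_c - len(candidate_words))
        if best_balance is None or balance < best_balance:
            best_char, best_balance = c, balance
    return best_char

# e.g. for candidates ["settle", "kettle"] with e, t, l already guessed,
# this returns "k"; guessing it resolves the ambiguity either way
print disambiguating_guess(["settle", "kettle"], set("etl"))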

Here are traces from a few games. You can run a single game by calling the single_round() function with debug set to True. The word is selected randomly from the word list, but it is quite simple to change this to take a specific word instead if you want to do that - a sketch of such a variant follows the traces below.

secret word: undergo
guessed char: e
# 0 guess: _ _ _ e _ _ _
guessed char: s
# 1 guess: _ _ _ e _ _ _
guessed char: r
# 2 guess: _ _ _ e r _ _
guessed char: a
# 3 guess: _ _ _ e r _ _
guessed char: i
# 4 guess: _ _ _ e r _ _
guessed char: o
# 5 guess: _ _ _ e r _ o
guessed char: d
# 6 guess: _ _ d e r _ o
guessed char: l
# 7 guess: _ _ d e r _ o
guessed char: n
# 8 guess: _ n d e r _ o
guessed char: u
# 9 guess: u n d e r _ o
guessed char: b
# 10 guess: u n d e r g o
    
secret word: steakhouses
guessed char: e
# 0 guess: _ _ e _ _ _ _ _ _ e _
guessed char: i
# 1 guess: _ _ e _ _ _ _ _ _ e _
guessed char: r
# 2 guess: _ _ e _ _ _ _ _ _ e _
guessed char: s
# 3 guess: s _ e _ _ _ _ _ s e s
guessed char: a
# 4 guess: s _ e a _ _ _ _ s e s
guessed char: l
# 5 guess: s _ e a _ _ _ _ s e s
guessed char: t
# 6 guess: s t e a _ _ _ _ s e s
guessed char: n
# 7 guess: s t e a _ _ _ _ s e s
guessed char: o
# 8 guess: s t e a k h o u s e s
    
secret word: coffers
guessed char: e
# 0 guess: _ _ _ _ e _ _
guessed char: s
# 1 guess: _ _ _ _ e _ s
guessed char: r
# 2 guess: _ _ _ _ e r s
guessed char: a
# 3 guess: _ _ _ _ e r s
guessed char: i
# 4 guess: _ _ _ _ e r s
guessed char: o
# 5 guess: _ o _ _ e r s
guessed char: t
# 6 guess: _ o _ _ e r s
guessed char: c
# 7 guess: c o _ _ e r s
guessed char: u
# 8 guess: c o _ _ e r s
guessed char: k
# 9 guess: c o _ _ e r s
guessed char: p
# 10 guess: c o _ _ e r s
    
secret word: soho
guessed char: a
# 0 guess: _ _ _ _
guessed char: e
# 1 guess: _ _ _ _
guessed char: o
# 2 guess: _ o _ o
guessed char: s
# 3 guess: s o _ o
guessed char: t
# 4 guess: s o _ o
guessed char: l
# 5 guess: s o _ o
guessed char: b
# 6 guess: s o _ o
guessed char: n
# 7 guess: s o _ o
guessed char: p
# 8 guess: s o _ o
guessed char: d
# 9 guess: s o _ o
guessed char: w
# 10 guess: s o _ o
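
As noted above, taking a specific word is a small change; here is a sketch of a hypothetical single_round_with_word() variant that reuses the helpers from the game code:

def single_round_with_word(words, secret_word, debug=True):
    """Hypothetical variant of single_round() that takes the secret word
    as a parameter instead of picking one at random."""
    word_len = len(secret_word)
    bad_guesses, good_guesses = set(), set()
    guessed_word = init_guess(word_len)
    for num_guesses in range(MAX_GUESSES_PER_GAME):
        filtered_words, guess_char = best_guess(words, word_len, bad_guesses,
                                                good_guesses, guessed_word)
        matched_positions = find_all_match_positions(secret_word, guess_char)
        if len(matched_positions) == 0:
            bad_guesses.add(guess_char)
        else:
            good_guesses.add(guess_char)
        update_guessed_word(guessed_word, matched_positions, guess_char)
        matched_words = match_words_against_template(filtered_words, guessed_word)
        if len(matched_words) == 1:
            replace_guessed_word(guessed_word, matched_words[0])
        if debug:
            print "#", num_guesses, "guess:", " ".join(guessed_word)
        if is_solved(guessed_word):
            return True
    return False

single_round_with_word(words, "kettle")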
    

Now that the game had been reduced to closed-world (ie vocabulary drawn from the same list) interactions between two entities which had computers for minds, I wanted to see how successful the Solver was against the Proposer. For this, I ran a small experiment and let the code run for 10,000 games and recorded their results. The data is generated by calling the multiple_rounds() function with the number of runs desired and the output filename.

Here is some code that reads the resulting file and generates the necessary charts for visualization.

# Source: src/hangman/gamestats.py
# -*- coding: utf-8 -*-
from __future__ import division
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("hangman.csv", header=None, 
                 names=["WORD_LEN", "SOLVER_WINS", "NUM_GUESSES"])

# wins vs losses
nloss = df[df["SOLVER_WINS"] == 0].count()["SOLVER_WINS"]
nwins = df[df["SOLVER_WINS"] == 1].count()["SOLVER_WINS"]
print "probability of winning=", nwins / (nwins + nloss)
print "probability of losing=", nloss / (nwins + nloss)

## probability of winning and losing for different word lengths
df2 = df.drop("NUM_GUESSES", 1)
df2_wins = df2[df2["SOLVER_WINS"] == 1].groupby("WORD_LEN").count().reset_index()
df2_losses = df2[df2["SOLVER_WINS"] == 0].groupby("WORD_LEN").count().reset_index()
df2_losses.rename(columns={"SOLVER_WINS": "SOLVER_LOSES"}, inplace=True)
df2_merged = df2_wins.merge(df2_losses, how="inner", on="WORD_LEN")
df2_merged.plot(kind="bar", stacked=True, x="WORD_LEN", 
                title="Win/Loss Counts by Word Length") 
plt.show()
df2_merged["NUM_GAMES"] = df2_merged["SOLVER_WINS"] + df2_merged["SOLVER_LOSES"]
df2_merged["SOLVER_WINS"] = df2_merged["SOLVER_WINS"] / df2_merged["NUM_GAMES"]
df2_merged["SOLVER_LOSES"] = df2_merged["SOLVER_LOSES"] / df2_merged["NUM_GAMES"]
df2_merged.drop("NUM_GAMES", axis=1, inplace=True)
df2_merged.plot(kind="bar", stacked=True, x="WORD_LEN", 
                title="Win/Loss Probabilities by Word Length") 
plt.show()

# how number of guesses to win varies with word length (winning games only)
df3 = df[df["SOLVER_WINS"] == 1]
df3.drop("SOLVER_WINS", 1)
df3_grouped = df3.drop("SOLVER_WINS", 1).groupby("WORD_LEN").mean().reset_index()
df3_grouped.plot(kind="bar", x="WORD_LEN", 
                 title="Avg Guesses for Different Word Lengths")
plt.show()

The empirical probability of winning the game, based on this data, is about 0.66. As expected, the distribution of word lengths thought up by the Proposer follows a roughly normal distribution. However, shorter words have a lower probability of getting solved than longer words. This is more easily seen in the chart showing win/loss probabilities. So treating win probability as a proxy for difficulty, winning at Hangman is actually harder for shorter words than for longer words.




I also plotted the average number of guesses the solver made to guess different sized words thought up by the Proposer. This time the distribution is more uniform, although it falls off to the right for long words.


This is all I have for today. I haven't been posting these past few weeks because I was out of town on vacation. Hopefully I will get back to a more regular schedule in the New Year.