Saturday, February 17, 2018

Generating labels with LU Learning -- a case study using Expectation Maximization and Keras


We often hear that data is the new oil. Like oil, data needs to go through a process of refining before it becomes useful. Classification is by far still the most widely used (and useful) end-product of data. But classification involves training models, which often involves manual labeling work by domain experts to generate the training data. Given that, it is often quite expensive to go from data to information via the classification route.

Classifier models based on Deep Learning are generally preferable to more traditional Machine Learning models because of the ability to build end-to-end models and skip (or spend less time on) the feature engineering step. You compensate for this by giving the Deep Learning model more data and more compute, so it can figure out the model space by looking at more examples and doing more processing.

All this points to a need for more and more labeled data. In the past, I have done this by starting with a (relatively) small labeled dataset, generating a model, running predictions on a larger unlabeled dataset, having a human expert look at some of the predictions and correct them, and retraining the model with the larger sample. The model is trained iteratively with progressively larger samples until it is "good enough".

I recently read about some different strategies in the book Web Data Mining by Prof Bing Liu. These strategies are part of a broader class of techniques called Partially Supervised Learning, or more specifically, LU (Labeled-Unlabeled) Learning. In this post, I will describe my implementation of one of the LU Learning strategies that uses the Expectation Maximization (EM) algorithm to wrap a Keras based classifier.

The EM algorithm consists of two steps - the Expectation (E) step, which fills in the missing data based on the current estimate of the parameters, and the Maximization (M) step, which re-estimates the parameters to maximize the likelihood. The two steps are run iteratively until the parameters stabilize. Concretely, in our case, given a small labeled set L and a larger unlabeled set U, we generate predictions for U using our model in the E-step, then use these predictions to train a model in the M-step. We continue doing this iteratively until the cross-entropy between the labels generated in the E-step and the predictions generated by the model in the M-step does not change appreciably between iterations. This is expressed more succinctly by the following algorithm.


    learn an initial classifier f from labeled set L
    repeat
        // E-step
        use current classifier f to compute labels p for every document in U
        // M-step
        learn a new classifier f' from documents L ∪ U
        use classifier f' to compute probabilistic labels q for each document in U
        compute cross-entropy between p and q
        f ← f'
    until cross-entropy below threshold


The data I am using comes from an internal hackathon, and consists of short passages of 1-3 sentences each. The task is to classify these sentences into one of four classes. For the purposes of this post, we will assume that the text is partitioned into a labeled set L of 10,000 records and an unlabeled set U of 45,000 records. Of the 10,000 labeled records, we set aside 1,000 for validating our original and final models. So we end up with three lists of texts, texts_l for the 9,000 labeled records for training our original model, texts_v of 1,000 labeled records for validation, and texts_u containing 45,000 unlabeled records. The corresponding labels for the training and validation sets are contained in the lists labels_l and labels_v.
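
The split itself can be produced in the usual way; below is a minimal sketch, assuming hypothetical lists texts_labeled and labels_labeled that hold the 10,000 labeled records (texts_u is assumed to already hold the 45,000 unlabeled texts).

from sklearn.model_selection import train_test_split

# Hypothetical starting point: texts_labeled / labels_labeled hold the
# 10,000 labeled records; texts_u already holds the 45,000 unlabeled ones.
texts_l, texts_v, labels_l, labels_v = train_test_split(
    texts_labeled, labels_labeled, test_size=1000, random_state=42)

print(len(texts_l), len(texts_v), len(texts_u))  # 9000 1000 45000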

Our Keras model wants the data as a sequence of integers, each integer representing a word in the vocabulary. Instead of building up a vocabulary, we use the hashing trick that maps the original vocabulary into a smaller space, so multiple words from the original vocabulary can map to the same position in the smaller space. We then pad each of these integer sequences with zeros (the PAD value) so they are all the same length. In addition, we convert the integer labels to categorical labels. We end up with fixed-size integer sequences of length 117, and categorical labels with 4 columns in each row.

from keras.callbacks import ModelCheckpoint
from keras.layers.core import Dense, SpatialDropout1D
from keras.layers.convolutional import Conv1D
from keras.layers.embeddings import Embedding
from keras.layers.pooling import GlobalMaxPooling1D
from keras.models import Sequential, load_model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import hashing_trick
from keras.utils import np_utils
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
import math
import matplotlib.pyplot as plt
import numpy as np
import os

DATA_DIR = '../../data/common'

VOCAB_SIZE = 5000
NUM_CLASSES = 4

# convert texts to int sequence
def convert_to_intseq(text, vocab_size=VOCAB_SIZE, hash_fn="md5", lower=True):
    return hashing_trick(text, n=vocab_size, hash_function=hash_fn, lower=lower)

xs_l = [convert_to_intseq(text) for text in texts_l]
xs_v = [convert_to_intseq(text) for text in texts_v]
xs_u = [convert_to_intseq(text) for text in texts_u]

# pad to equal length input
maxlen = max([max([len(x) for x in xs_l]),
              max([len(x) for x in xs_v]),
              max([len(x) for x in xs_u])])

Xl = pad_sequences(xs_l, maxlen=maxlen)
Xv = pad_sequences(xs_v, maxlen=maxlen)
Xu = pad_sequences(xs_u, maxlen=maxlen)

# labels are 1-based, make them 0-based for to_categorical
Yl = np_utils.to_categorical(np.array(labels_l)-1, num_classes=NUM_CLASSES)
Yv = np_utils.to_categorical(np.array(labels_v)-1, num_classes=NUM_CLASSES)

print(Xl.shape, Yl.shape, Xv.shape, Yv.shape, Xu.shape)

Next we declare some functions to build and train a Keras model. Since we will train the model multiple times with different labels inside the EM algorithm, it makes sense to factor these steps out into their own functions.

EMBED_SIZE = 100
NUM_FILTERS = 256
NUM_WORDS = 3
NUM_CLASSES = 4

BATCH_SIZE = 64
NUM_EPOCHS = 5

# filename template for checkpointed models, one per EM iteration
MODEL_TEMPLATE = os.path.join(DATA_DIR, "keras-lu-{:d}.h5")

def build_model(maxlen=0, vocab_size=VOCAB_SIZE, embed_size=EMBED_SIZE, 
                num_filters=NUM_FILTERS, kernel_size=NUM_WORDS, 
                num_classes=NUM_CLASSES, print_model=False):
    model = Sequential()
    model.add(Embedding(vocab_size, embed_size, input_length=maxlen))
    model.add(SpatialDropout1D(0.2))
    model.add(Conv1D(filters=num_filters, kernel_size=kernel_size, 
                     activation="relu"))
    model.add(GlobalMaxPooling1D())
    model.add(Dense(num_classes, activation="softmax"))
    # compile
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    if print_model:
        model.summary()
    return model


def train_model(model, X, Y, batch_size=BATCH_SIZE, num_epochs=NUM_EPOCHS, 
                verbosity=0, model_name_template=MODEL_TEMPLATE,
                iter_num=0):
    best_model_fn = model_name_template.format(iter_num)
    checkpoint = ModelCheckpoint(filepath=best_model_fn, save_best_only=True)
    history = model.fit(X, Y, batch_size=batch_size, epochs=num_epochs,
                        verbose=verbosity, validation_split=0.1,
                        callbacks=[checkpoint])
    return model, history


def evaluation_report(model_path, X, Y):
    model = load_model(model_path)
    Y_ = model.predict(X)
    y = np.argmax(Y, axis=1)
    y_ = np.argmax(Y_, axis=1)
    acc = accuracy_score(y, y_)
    cm = confusion_matrix(y, y_)
    cr = classification_report(y, y_)
    print("\naccuracy: {:.3f}".format(acc))
    print("\nconfusion matrix")
    print(cm)
    print("\nclassification report")
    print(cr)

We then build our initial model, training it against the labeled set of 9,000 records, and evaluating the trained model against the labeled validation set of 1,000 records.

MODEL_TEMPLATE = os.path.join(DATA_DIR, "keras-lu-{:d}.h5")

model = build_model(maxlen=maxlen, vocab_size=VOCAB_SIZE, 
                    embed_size=EMBED_SIZE, num_filters=NUM_FILTERS, 
                    kernel_size=NUM_WORDS, print_model=True)
model, _ = train_model(model, Xl, Yl, batch_size=BATCH_SIZE, 
                       num_epochs=NUM_EPOCHS, verbosity=1,
                       model_name_template=MODEL_TEMPLATE)
evaluation_report(MODEL_TEMPLATE.format(0), Xv, Yv)

The train_model function declares a checkpointing callback that monitors the validation loss at each epoch of training and saves the best model for that iteration. In this case the iteration number is 0, since this is the initial model. The model gives us an accuracy of 0.995 against the validation set.

The next step is to set up a loop of alternating E and M steps. At each iteration, the E-step generates predictions for the unlabeled (U) dataset, and the M-step uses the combination of the labeled (L) and unlabeled (U) datasets to train a new model. The goodness of the resulting model is measured by the cross-entropy between the labels from the E-step and the corresponding predictions of the new model on the U dataset. The expectation is that this loss will gradually converge, and at some point there are no further gains to be had from additional training.

BEST_MODEL_EM = os.path.join(DATA_DIR, "keras-em-best.h5")

def e_step(model, Xu, Yu=None):
    if Yu is None:
        # predict labels for unlabeled set U with current model
        return np.argmax(model.predict(Xu), axis=1)
    else:
        # reuse prediction we got from M-step
        return np.argmax(Yu, axis=1)


def m_step(Xl, Yl, Xu, yu, iter_num, **kwargs):
    # train a model on the combined set L+U
    model = build_model(maxlen=kwargs["maxlen"], 
                        vocab_size=kwargs["vocab_size"],
                        embed_size=kwargs["embed_size"], 
                        num_filters=kwargs["num_filters"],
                        kernel_size=kwargs["kernel_size"],
                        print_model=False)
    X = np.concatenate([Xl, Xu], axis=0)
    Y = np.concatenate([Yl, np_utils.to_categorical(yu, num_classes=NUM_CLASSES)], axis=0)
    model, _ = train_model(model, X, Y, 
                          batch_size=kwargs["batch_size"],
                          num_epochs=kwargs["num_epochs"],
                          verbosity=1,
                          model_name_template=kwargs["model_name_template"],
                          iter_num=iter_num+1)
    # load new model
    model = load_model(kwargs["model_name_template"].format(iter_num+1))
    return model


# expectation maximization loop
epsilon = 1e-3
model = load_model(MODEL_TEMPLATE.format(0))
q = None
prev_loss = None
losses = []
for i in range(10):
    # E-step (prediction on U)
    p = e_step(model, Xu, q)
    # M-step (train on L+U, prediction on U, compute log-loss)
    model = m_step(Xl, Yl, Xu, p, i, maxlen=maxlen, 
                   vocab_size=VOCAB_SIZE,
                   embed_size=EMBED_SIZE,
                   num_filters=NUM_FILTERS,
                   kernel_size=NUM_WORDS,
                   batch_size=BATCH_SIZE,
                   num_epochs=NUM_EPOCHS,
                   model_name_template=MODEL_TEMPLATE)
    q = model.predict(Xu)
    loss = log_loss(p, q, labels=list(range(NUM_CLASSES)))
    losses.append(loss)
    print("\n**** Iteration {:d}, log-loss: {:.7f} ****\n\n".format(i+1, loss))
    if prev_loss is None:
        model.save(BEST_MODEL_EM)
    else:
        if loss < prev_loss:
            model.save(BEST_MODEL_EM)
        if math.fabs(prev_loss - loss) < epsilon:
            break
    prev_loss = loss

We have a fixed-size loop, but we exit early once there is not much improvement (less than 0.001 in our case) in the cross-entropy loss between the labels and the predictions. In this particular run, it took just three iterations to reach this threshold. The chart below shows the change in cross-entropy loss across iterations in our run.
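
The chart can be produced directly from the losses list collected in the loop above; a minimal plotting sketch using the matplotlib and numpy imports from the first code block:

# plot cross-entropy loss (p vs q) against EM iteration number
plt.plot(np.arange(1, len(losses) + 1), losses, marker="o")
plt.xlabel("EM iteration")
plt.ylabel("cross-entropy loss")
plt.title("Convergence of the EM loop")
plt.show()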


Finally, we evaluate the model trained on the combined L+U dataset against our held-out validation set. We end up with an accuracy of 0.994, which seems to indicate that the classifier did not suffer too much from the additional unlabeled data. More importantly, it tells us that the automatically generated labels seem to be of fairly high quality. Thus, this approach could be a good way of generating labels for a large unlabeled (U) dataset by leveraging the labels from a smaller labeled (L) dataset.
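
For reference, this final evaluation can be done with the same evaluation_report helper defined earlier, pointed at the best model saved during the EM loop:

# evaluate the best EM-trained model against the held-out validation set
evaluation_report(BEST_MODEL_EM, Xv, Yv)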


Saturday, February 03, 2018

Using Snorkel Probabilistic Labels for Classification


Last week, I wrote about using the Snorkel Generative model to convert noisy labels to an array of marginal probabilities for the label being in each class. This week, I will describe the second part of the experiment, where I use these probabilistic labels to train a Discriminative model such as a Classifier. As a reminder, the standard pipeline for a Snorkel use-case looks like the diagram shown below.


Noisy labels can be generated in a variety of ways, such as weak supervision through the use of labeling functions, distant supervision through reference ontologies, unsupervised models, or predictions from weaker models. These labels may overlap or conflict with other labels. So assuming N labeling functions, we would start with N noisy labels per input record. Assuming that we want to build a k-class classifier, the labels would need to be 1 of k classes, the cardinality of the Generative model would be k, and the output of the generative model would be an array of size k for each input record. These k numbers represent the probability of the record being in the corresponding class, and add up to 1.
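
As a purely illustrative sanity check (the numbers below are made up and not from any real dataset), the marginals produced by the Generative model for N records and k classes form an N x k matrix whose rows each sum to 1:

import numpy as np

# made-up marginals for N=3 records and k=3 classes, for illustration only
marginals = np.array([[0.90, 0.05, 0.05],
                      [0.20, 0.70, 0.10],
                      [0.25, 0.25, 0.50]])
print(marginals.shape)                          # (3, 3)
print(np.allclose(marginals.sum(axis=1), 1.0))  # True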

The next step is to train a noise-aware discriminative model, using the original data and these probabilistic labels. A noise-aware discriminative model uses a noise-aware loss function, which is just the expected loss with respect to the distribution over labels estimated by the generative model. This turns out to be just the cross-entropy between the (probabilistic) label and the prediction (see this blog post on Data Programming with Tensorflow by the Snorkel team for the derivation for a binary classification model, but I think it can be easily extended to a k-class classification scenario as well).

As an example, consider a 3-class probabilistic label ytrue_prob and the corresponding categorical label ytrue_cat. We can compute the cross-entropy between either of these label vectors and a prediction vector ypred in exactly the same way, using the Keras loss function categorical_crossentropy. So we can use the probabilistic labels generated by the Snorkel Generative model in exactly the same way as we would use real labels (converted to categorical form).

from keras.losses import categorical_crossentropy
import keras.backend as K
import numpy as np

with K.get_session() as sess:
    
    ytrue_prob = K.constant(np.array([0.9, 0.03, 0.07]))
    ytrue_cat = K.constant(np.array([1., 0., 0.]))

    ypred = K.constant(np.array([0.7, 0.15, 0.15])) 
    
    loss_cat = categorical_crossentropy(ytrue_cat, ypred)
    loss_prob = categorical_crossentropy(ytrue_prob, ypred)
    
    loss_cat_val, loss_prob_val = sess.run([loss_cat, loss_prob])
    print(loss_cat_val, loss_prob_val)

The output values are obviously different, but both are floating-point scalars, and our objective is simply to minimize this quantity during training. I decided to test this intuition by training a network on the probabilistic labels and seeing how much performance I gained or lost relative to training on the corresponding categorical labels. The data I used is the same data I used to train the Snorkel Generative model, from the Snorkel crowdsourcing example.

from keras import regularizers
from keras.callbacks import ModelCheckpoint
from keras.layers import Input
from keras.layers.core import Dense, Dropout, Flatten
from keras.layers.embeddings import Embedding
from keras.models import Model, load_model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import one_hot
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import collections
import nltk
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd

# constants
DATA_DIR = "data"

CLEAN_LABELS_FILE = os.path.join(DATA_DIR, "train-clean-labels.csv")
NOISY_LABELS_FILE = os.path.join(DATA_DIR, "train-noisy-labels.csv")
LABEL_LOOKUP_FILE = os.path.join(DATA_DIR, "label-lookup.csv")

BEST_MODEL_P = os.path.join(DATA_DIR, "disc-model-p-best.h5")
FINAL_MODEL_P = os.path.join(DATA_DIR, "disc-model-p-final.h5")

BEST_MODEL_C = os.path.join(DATA_DIR, "disc-model-c-best.h5")
FINAL_MODEL_C = os.path.join(DATA_DIR, "disc-model-c-final.h5")

# extract data
noisy_df = pd.read_csv(NOISY_LABELS_FILE)
clean_df = pd.read_csv(CLEAN_LABELS_FILE)
data_df = noisy_df.join(clean_df.set_index("tweet_id"), 
                        how="inner", on="tweet_id", rsuffix="_r")
data_df = data_df.loc[:, ["tweet_id", "tweet_body", 
                          "cls_1", "cls_2", "cls_3", "cls_4", "cls_5",
                          "sentiment"]]

data, prob_labels, true_labels = [], [], []
max_num_words = 0
word_counts = collections.Counter()
for row in data_df.values:
    # read tweet, normalize, tokenize and collect word counts
    words = [word.lower() for word in nltk.word_tokenize(row[1])
                          if not word.startswith("@")]
    if max_num_words < len(words):
        max_num_words = len(words)
    for word in words:
        word_counts[word] += 1
    data.append(" ".join(words))
    prob_labels.append(row[2:7])
    true_labels.append(row[7])

# constants derived from data after exploratory analysis
num_recs = len(data)
max_len = 30
vocab_size = 1300
num_classes = len(prob_labels[0])

# convert data to matrices
X = np.zeros((num_recs, max_len))
Yp = np.zeros((num_recs, num_classes))
Yc = np.zeros((num_recs, num_classes))

for i, (tweet, prob_label, true_label) in enumerate(zip(data, prob_labels, true_labels)):
    X[i] = np.array(pad_sequences([one_hot(tweet, vocab_size, split=" ")], 
                                  maxlen=max_len))
    Yp[i] = np.array(prob_label)
    Yc[i] = to_categorical(true_label-1, num_classes=num_classes)
    
Xtrain, Xtest, Yptrain, Yptest, Yctrain, Yctest = train_test_split(X, Yp, Yc,
    train_size=0.7, test_size=0.3, random_state=42)

This gives us two sets of data, a training set with 700 records and a test set of 300 records. Each tweet is represented by an integer sequence of length 30 - during exploratory analysis, we found that the maximum length of a tweet was 39 words, and that the number of unique words in the vocabulary was 3,781, of which 1,295 occurred more than once. So we decided to cut our vocabulary size to 1300. Since we are using the Keras one_hot function, this uses the hashing trick and projects our vocabulary of 3,781 words onto 1300 positions. We also pad shorter sentences to 30 words, so we need an additional PAD character (0).
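
Because one_hot hashes each word into a fixed number of buckets rather than looking it up in a dictionary, two different words can collide into the same integer. A quick way to see the encoding for a sample tweet (using the vocab_size of 1300 chosen here):

# each word is hashed to an integer bucket between 1 and vocab_size - 1;
# collisions between different words are possible and expected
print(one_hot("the economy is improving", 1300))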

The Yp* arrays contain the probabilistic labels; since we are looking at 5 classes, each row has 5 columns. The Yc* arrays project the categorical label onto a one-hot space, so each row of these matrices also has 5 columns.

Our objective is to train two identical networks, one using the probabilistic labels and one with the categorical labels and evaluate them. Below we define some functions that we will reuse across the two networks.

def build_model():
    seq_input = Input(shape=(max_len,), dtype="int32")
    x = Embedding(vocab_size + 1, 100, input_length=max_len)(seq_input)
    x = Flatten()(x)
    x = Dense(64, activation="relu")(x)
    preds = Dense(num_classes, activation="softmax")(x)
    model = Model(inputs=[seq_input], outputs=[preds])
    return model

def compile_model(model):
    model.compile(loss="categorical_crossentropy", optimizer="adam", 
                metrics=["acc"])
    return model

def fit_model(model, best_model_file, Xtrain, Ytrain):
    checkpoint = ModelCheckpoint(filepath=best_model_file, save_best_only=True)
    history = model.fit(Xtrain, Ytrain, validation_split=0.1, 
                        epochs=10, batch_size=64,
                        callbacks=[checkpoint])
    return history

def eval_report(title, Ytest, Ytest_):
    ytest = np.argmax(Ytest, axis=1)
    ytest_ = np.argmax(Ytest_, axis=1)
    acc = accuracy_score(ytest, ytest_)
    cm = confusion_matrix(ytest, ytest_)
    print("\n*** {:s}".format(title.upper()))
    print("accuracy: {:.3f}".format(acc))
    print("confusion matrix")
    print(cm)

We use a very simple neural network to build a word embedding from our integer sequence - each word ends up getting represented by a vector of size 100. This tensor is then flattened and sent through two Dense layers. The network structure is shown below.
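
The same structure can be inspected with model.summary(); assuming max_len=30, vocab_size=1300 and num_classes=5 as set above, the layer shapes and parameter counts should come out roughly as in the comments below.

model = build_model()
model.summary()
# Expected output shapes (batch dimension None) and approximate parameter counts:
#   InputLayer           (None, 30)
#   Embedding            (None, 30, 100)    1301 * 100 = 130,100 params
#   Flatten              (None, 3000)
#   Dense (relu, 64)     (None, 64)         3000 * 64 + 64 = 192,064 params
#   Dense (softmax, 5)   (None, 5)          64 * 5 + 5 = 325 params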


We then train the first network using the probabilistic labels and the second one using the categorical labels, and evaluate them.

model_p = build_model()
model_p = compile_model(model_p)
fit_model(model_p, BEST_MODEL_P, Xtrain, Yptrain)
best_model_p = load_model(BEST_MODEL_P)
Yptest_ = best_model_p.predict(Xtest)
eval_report("probabilistic", Yptest, Yptest_)

model_c = build_model()
model_c = compile_model(model_c)
fit_model(model_c, BEST_MODEL_C, Xtrain, Yctrain)
best_model_c = load_model(BEST_MODEL_C)
Yctest_ = best_model_c.predict(Xtest)
eval_report("categorical", Yctest, Yctest_)

In our results, the network trained on categorical labels did slightly better (accuracy: 0.927) than the one trained on probabilistic labels (accuracy: 0.923). This kind of makes sense, since the categorical labels are created (presumably at great expense) by humans, while the probabilistic labels are generated by less expert crowdsourced workers in this case, and by cheaper automatic methods in general. However, at least for this dataset, the difference in performance is very small.

I thought this was particularly encouraging for use cases around deep learning, which typically need large amounts of training data, and which use categorical labels. Generating noisy labels and cleaning them up using the Snorkel Generative model seems to be a good approach to getting large amounts of usable labeled data for classification.