Sunday, December 18, 2016

Random vs Grid Search: Which is Better?


Off and on, I get this urge to work with hyperparameter optimization. I first realized that there were better approaches for doing this than plain grid search about a year ago, when I read about Bayesian Hyperparameter search. Around that time I described an experiment using Bayesian Hyperparameter search to choose optimal parameters for a scikit-learn Random Forest classifier. A few months later, I wrote about another experiment where I used it against a Spark Logistic Regression based classifier.

In both these approaches, I used an acceptance function to compare each solution against the previous one. While this does lead to optimal or near-optimal solutions most of the time, one advantage of grid search is that its individual experiments are independent of each other, so given sufficient compute capacity they can all be run in parallel. Using an acceptance function forces the search to be sequential, since each evaluation depends on the result of the previous one.
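For context, the acceptance functions I used in those experiments were of the simulated-annealing variety. The sketch below is only an illustration, not the actual code from those earlier posts; evaluate, sample and jitter are hypothetical callables standing in for scoring a model with a given parameter setting, drawing a random setting, and perturbing the current setting:

import numpy as np

def accept(new_score, prev_score, temperature):
    # always accept improvements; accept worse scores with a probability
    # that decays as the temperature drops
    if new_score >= prev_score:
        return True
    return np.random.random() < np.exp((new_score - prev_score) / temperature)

def annealed_search(evaluate, sample, jitter, num_iters=25, start_temp=1.0):
    # sequential search: each candidate is judged against the previously
    # accepted one, which is why the iterations cannot run in parallel
    current = sample()                      # initial random hyperparameters
    current_score = evaluate(current)
    best, best_score = current, current_score
    for i in range(1, num_iters):
        temperature = max(start_temp * (num_iters - i) / num_iters, 1e-6)
        candidate = jitter(current)         # perturb the current point
        candidate_score = evaluate(candidate)
        if accept(candidate_score, current_score, temperature):
            current, current_score = candidate, candidate_score
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score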

Around the same time, a colleague sent me a link to the paper Random Search for Hyper-Parameter Optimization by Bergstra and Bengio, in which the authors show empirically and theoretically that random search is more efficient for hyperparameter optimization than grid search. Having seen firsthand the gains one can get from random search with acceptance functions, I was on board with it outperforming grid search as long as some form of acceptance function was used, but I couldn't quite wrap my mind around pure random search doing the same thing.

In any case, since that time I have used a sparse grid search followed by a more focused manual search whenever I needed to find the best parameters for a model, so I didn't think too much about random search. Quite recently, however, while driving through some particularly scenic countryside (winter landscapes in the East Bay of Northern California, where I live, are extremely beautiful, especially when it has just stopped raining), I had an idea for how I might test this for myself, and perhaps convince myself (or otherwise) of the effectiveness of random search. This post describes the idea, the implementation, and the results I got from it.

The experimental setup I came up with was as follows. My hyperparameter space would be a two-dimensional space of size (1024, 1024), represented by a matrix of the same shape. Our budget for finding the best hyperparameter setting is 25 experiments. This allows us to do a grid search over combinations of 5 values of each hyperparameter, or alternatively 25 random searches in the space. In addition, I consider a "batched" version of random search, where I execute 5 batches of 5 random searches each, refining the search after each batch. I generate many such random hyperparameter spaces and count how many times the two varieties of random search outperform grid search.

Generating the space


The idea is to generate random 2D hyperparameter spaces, i.e., some kind of terrain. I used the Terrain Generation Tutorial: Hill Algorithm from robot frog: 3d. It's a simple and elegant algorithm, and very well explained. My Python implementation looks like this:

from __future__ import division, print_function
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import operator
%matplotlib inline

SIZE = 1024
NUM_ITERATIONS = 200

terrain = np.zeros((SIZE, SIZE), dtype="float")
mod_iter = NUM_ITERATIONS // 10
for iter in range(NUM_ITERATIONS):
    if iter % mod_iter == 0:
        print("iteration: {:d}".format(iter))
    r = int(np.random.uniform(0, 0.2 * SIZE))
    xc, yc = np.random.randint(0, SIZE - 1, 2)
    z = 0
    xmin, xmax = int(max(xc - r, 0)), int(min(xc + r, SIZE))
    ymin, ymax = int(max(yc - r, 0)), int(min(yc + r, SIZE))
    for x in range(xmin, xmax):
        for y in range(ymin, ymax):
            z = (r ** 2) - ((x - xc) ** 2 + (y - yc) ** 2)
            if z > 0: 
                terrain[x, y] += z
print("total iterations: {:d}".format(iter))
# normalize to unit height
zmin = np.min(terrain)
terrain -= zmin
zmax = np.max(terrain)
terrain /= zmax
# smooth the terrain by squaring the z values
terrain = np.power(terrain, 2)
# multiply by 255 so terrain can be visualized as grayscale image
terrain = terrain * 255
terrain = terrain.astype("uint8")
# convert mountains to valleys since stochastic gradient DESCENT
terrain = 255 - terrain

One possible terrain generated from the above code is shown below. I have also generated a contour map for the terrain.

plt.title("terrain")
image = plt.imshow(terrain, cmap="gray")
plt.ylim(max(plt.ylim()), min(plt.ylim()))
plt.colorbar(image, shrink=0.8)

plt.title("contours")
contour = plt.contour(terrain, linewidths=2)
plt.ylim(max(plt.ylim()), min(plt.ylim()))
plt.colorbar(contour, shrink=0.8, extend='both')






Grid Search


As you can see, the terrain is just a matrix of shape (1024, 1024). Since our budget is 25 experiments, we will do a 5x5 grid search on this terrain. This means choosing 5 equidistant points along each of the x and y axes and reading the value of the terrain matrix at each of the 25 (x, y) locations. The code to do this is shown below. We also show the points on the contour map; the best (lowest) value among these points is circled in blue.

# do a grid search (5x5)
results = []
for x in np.linspace(0, SIZE-1, 5):
    for y in np.linspace(0, SIZE-1, 5):
        xi, yi = int(x), int(y)
        results.append([xi, yi, terrain[xi, yi]])

best_xyz = sorted(results, key=operator.itemgetter(2))[0]
grid_best = best_xyz[2]
print(best_xyz)

xvals = [r[0] for r in results]
yvals = [r[1] for r in results]

plt.title("grid search")
contour = plt.contour(terrain, linewidths=2)
plt.ylim(max(plt.ylim()), min(plt.ylim()))
plt.scatter(xvals, yvals, color="b", marker="o")
plt.scatter([best_xyz[0]], [best_xyz[1]], color='b', s=200, facecolors='none', edgecolors='b')
plt.ylim(max(plt.ylim()), min(plt.ylim()))
plt.colorbar(contour)


The best result from the grid search is the point (767, 767) in the hyperparameter space, with a value of 109.

Random Search


Next up is pure random search. Since we have a budget of 25 experiments, we simply generate 25 random (x, y) points and look up the value of the terrain matrix at each one.

# do a random search
results = []
for i in range(25):
    x = np.random.randint(0, SIZE, 1)[0]
    y = np.random.randint(0, SIZE, 1)[0]
    results.append((x, y, terrain[x, y]))

best_xyz = sorted(results, key=operator.itemgetter(2))[0]
rand_best = best_xyz[2]
print(best_xyz)

xvals = [r[0] for r in results]
yvals = [r[1] for r in results]

plt.title("random search")
contour = plt.contour(terrain, linewidths=2)
plt.ylim(max(plt.ylim()), min(plt.ylim()))
plt.scatter(xvals, yvals, color="b", marker="o")
plt.scatter([best_xyz[0]], [best_xyz[1]], color='b', s=200, facecolors='none', edgecolors='b')
plt.colorbar(contour)


In this case the random search did a little better, finding the point (663, 618) whose value is 103.

Batched Random Search


In this approach I split my budget of 25 experiments into 5 batches of 5 experiments each. Within each batch, the (x, y) points are chosen at random. However, at the end of each batch, the winners (the top 2 in my case) are singled out for special treatment. Instead of sampling anywhere in the space, we draw a window around each winner and sample only within that window. The window shrinks with each batch according to a cooling schedule. The intuition is that I am looking for points in the neighborhood of the best points found so far, in the hope that the search will yield even better points (exploitation). At the same time, I reserve the remaining experiments to explore the space (exploration), in the hope of finding another promising region. The code for this is shown below:

def cooling_schedule(n, k):
    """ reduces jitter window size at each run 
        n - number of batches
        k - index (0 based) of current batch
        returns multiplier (0..1) of SIZE
    """
    return (n - k) / n

results, winners = [], []
for bid in range(5):
    print("Batch #: {:d}".format(bid), end="")
    
    # compute window size
    window_size = int(0.25 * cooling_schedule(5, bid) * SIZE)

    # jitter the winners and add their values to results
    # at any point we will only keep the top 2 global winners
    for x, y, _, _ in winners:
        if x < SIZE // 2:
            # left of center
            xleft = max(x - window_size // 2, 0)
            xright = xleft + window_size
        else:
            # right of center
            xright = min(x + window_size // 2, SIZE)
            xleft = xright - window_size
        if y < SIZE // 2:
            # bottom half
            ybot = max(y - window_size // 2, 0)
            ytop = ybot + window_size
        else:
            # top half
            ytop = min(y + window_size // 2, SIZE)
            ybot = ytop - window_size
        xnew = np.random.randint(xleft, xright, 1)[0]
        ynew = np.random.randint(ybot, ytop, 1)[0]
        znew = terrain[xnew, ynew]
        results.append((xnew, ynew, znew, bid))
        
    # add remaining random points
    for i in range(5 - len(winners)):
        x = np.random.randint(0, SIZE, 1)[0]
        y = np.random.randint(0, SIZE, 1)[0]
        z = terrain[x, y]
        results.append((x, y, z, bid))

    # find the top 2 global winners
    winners = sorted(results, key=operator.itemgetter(2))[0:2]
    print(" best: ", winners[0])
    
best_xyz = sorted(results, key=operator.itemgetter(2))[0]
print(best_xyz)

xvals = [r[0] for r in results]
yvals = [r[1] for r in results]

plt.title("batched random search")
contour = plt.contour(terrain, linewidths=2)
plt.ylim(max(plt.ylim()), min(plt.ylim()))
plt.scatter(xvals, yvals, color="b", marker="o")
plt.scatter([best_xyz[0]], [best_xyz[1]], color='b', s=200, facecolors='none', edgecolors='b')
plt.colorbar(contour)


This run ends up finding the global minimum of this space, at (707, 682) with a value of 20.

Obviously, not all runs are this lucky; it is quite likely that the point found by any of the above approaches is just a local minimum. There is also no reason to assume that grid search cannot outperform random search: a random terrain might have its global minimum right under one of the evenly laid out grid points, while the random search misses it.

To test this, I ran the above code in batch mode over 1,000 random terrains. For each terrain, I ran 25 experiments with each of the 3 methods and recorded the lowest (best) point found by each. I then counted how many times each version of random search outperformed grid search over the 1,000 terrains. I did the whole thing twice, just to make sure I didn't get the results by chance.

TERRAIN_SIZE = 1024
NUM_HILLS = 200

NUM_TRIALS = 1000
NUM_SEARCHES_PER_DIM = 5
NUM_WINNERS = 2

nbr_random_wins = 0
nbr_brand_wins = 0
for i in range(NUM_TRIALS):
    terrain = build_terrain(TERRAIN_SIZE, NUM_HILLS)
    grid_result = best_grid_search(TERRAIN_SIZE, NUM_SEARCHES_PER_DIM)
    random_result = best_random_search(TERRAIN_SIZE,
                                       NUM_SEARCHES_PER_DIM**2)
    batch_result = best_batch_random_search(TERRAIN_SIZE,
                                            NUM_SEARCHES_PER_DIM,
                                            NUM_SEARCHES_PER_DIM,
                                            NUM_WINNERS)
    print(grid_result, random_result, batch_result)
    if random_result < grid_result:
        nbr_random_wins += 1
    if batch_result < grid_result:
        nbr_brand_wins += 1

print("# times random search wins: {:d}".format(nbr_random_wins))
print("# times batch random search wins: {:d}".format(nbr_brand_wins))

Here are my results:

Run-1
------
#-times random search wins: 619
#-times batch random search wins: 638

Run-2
------
#-times random search wins: 640
#-times batch random search wins: 621

So random search does appear to be slightly better than grid search, winning in over 60% of the trials in both runs. The batched version is not clearly better than pure random search, since pure random search won more often in the second run.
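As a rough sanity check (not part of the original experiment), one can ask how surprising 619 wins out of 1,000 would be if random and grid search were actually evenly matched; a sketch using SciPy's binomial test:

# Rough sanity check: under a 50/50 null hypothesis, how surprising
# would 619 wins out of 1000 trials be?
from scipy import stats

p_value = stats.binom_test(619, 1000, 0.5)  # stats.binomtest in newer SciPy
print("two-sided p-value: {:.3g}".format(p_value))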

So anyway, that's what I had for today. This was mainly an exercise in convincing myself of something that didn't seem intuitively obvious to me. In the process, I learned an interesting algorithm for generating terrain, so it was quite a lot of fun as well. If you would like to run the code for yourself, the visualizations are in this IPython notebook, and the code for the batch run is in this Python file.


Saturday, December 10, 2016

Document Similarity using various Text Vectorizing Strategies


Back when I was learning about text mining, I wrote this post titled IR Math with Java: TF, IDF and LSI. A recent comment/question on that post sparked off a train of thought that ended up being one driver for this post. The other driver was that I wanted to compare a few text vectorization strategies for something else I was working on. In contrast to the previous post, where I explored ideas described in the Text Mining Application Programming book using toy datasets, in this post I use a larger dataset. The dataset is also (sort of) labeled, so I use the labels to compare the approaches quantitatively.

The dataset I chose for this exercise is the Reuters-21578 corpus from the UCI Machine Learning Repository. The corpus is a collection of 21,578 news stories that appeared on the Reuters newswire service in 1987. Each document is manually categorized into zero or more category tags, and there are 481 unique tags across the documents. The number of tags per document varies from 0 (for about 1,862 documents) to 35. The distribution is heavily right-skewed, with a mean of 2.194 tags per document and a median of 2. The top histograms below show the distribution of tags per document across the corpus, and the bottom chart shows the distribution of the top 20 tags by frequency.


In addition to the category tags, each document has a title and a block of text. For our analysis, we will consider the title to be part of the text and treat each document as simply a collection of terms. The longest document has 53 sentences, but the average document contains about 6.67 sentences.
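For reference, these corpus statistics can be recomputed with a few lines of bookkeeping. The sketch below reuses the stream_reuters_documents() generator from the parsing script shown further down, and REUTERS_DIR is an assumed path to the unpacked corpus:

# Sketch of the bookkeeping behind the corpus statistics above; relies on
# stream_reuters_documents() from the parsing script below, and assumes
# REUTERS_DIR points at the unpacked Reuters-21578 corpus.
import nltk
import numpy as np

tag_counts, sent_counts, all_tags = [], [], set()
for doc in stream_reuters_documents(REUTERS_DIR):
    topics = doc["topics"]
    tag_counts.append(len(topics))
    all_tags.update(topics)
    title_body = ". ".join([doc["title"], doc["body"]])
    sent_counts.append(len(nltk.sent_tokenize(title_body)))

tag_counts = np.array(tag_counts)
sent_counts = np.array(sent_counts)
print("unique tags:", len(all_tags))
print("docs with no tags:", int(np.sum(tag_counts == 0)))
print("tags/doc: mean {:.3f}, median {:.1f}, max {:d}".format(
    tag_counts.mean(), np.median(tag_counts), int(tag_counts.max())))
print("sentences/doc: mean {:.2f}, max {:d}".format(
    sent_counts.mean(), int(sent_counts.max())))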

Since the category tags are manually assigned, we can think of them as ground truth labels. Then the overlap of the category tags for a pair of documents can be considered the true value of the similarity between them. We can now try various vectorization techniques using the title and body of the documents, and compute the similarity between pairs of document vectors. The correlation between the distribution of similarities computed between document vectors and that computed between category tags would then indicate the overall quality of the vectorization technique.

The vectorization techniques I have compared in this post are raw word counts (aka Term Frequency or TF), Term Frequency Inverse Document Frequency (TF-IDF), Latent Semantic Analysis (LSA), Global Vectors for Word Representation (GloVe) and Word2Vec embeddings. The general approach is as follows. We compute (once) the category tag vectors based on raw counts. Then for each vectorization strategy, we generate the document vectors using that strategy. We then compute the category tag similarities and corresponding text similarity between all pairs of documents, and compute the Pearson Correlation coefficient between these two distributions.

Before we do that, though, we need to parse the Reuters-21578 corpus into a format our downstream components can consume easily. The Scikit-Learn examples contain a parser for the Reuters-21578 corpus, which I adapted (almost verbatim) in the code below; it parses the dataset and writes the text and tags out to two separate files.

# Source: src/parse-input.py
from __future__ import division, print_function
from sklearn.externals.six.moves import html_parser
from glob import glob
import collections
import nltk
import os
import re

class ReutersParser(html_parser.HTMLParser):
    """ Utility class to parse a SGML file and yield documents one at 
        a time. 
    """
    def __init__(self, encoding='latin-1'):
        html_parser.HTMLParser.__init__(self)
        self._reset()
        self.encoding = encoding

    def handle_starttag(self, tag, attrs):
        method = 'start_' + tag
        getattr(self, method, lambda x: None)(attrs)

    def handle_endtag(self, tag):
        method = 'end_' + tag
        getattr(self, method, lambda: None)()

    def _reset(self):
        self.in_title = 0
        self.in_body = 0
        self.in_topics = 0
        self.in_topic_d = 0
        self.title = ""
        self.body = ""
        self.topics = []
        self.topic_d = ""

    def parse(self, fd):
        self.docs = []
        for chunk in fd:
            self.feed(chunk.decode(self.encoding))
            for doc in self.docs:
                yield doc
            self.docs = []
        self.close()

    def handle_data(self, data):
        if self.in_body:
            self.body += data
        elif self.in_title:
            self.title += data
        elif self.in_topic_d:
            self.topic_d += data

    def start_reuters(self, attributes):
        pass

    def end_reuters(self):
        self.body = re.sub(r'\s+', r' ', self.body)
        self.docs.append({'title': self.title,
                          'body': self.body,
                          'topics': self.topics})
        self._reset()

    def start_title(self, attributes):
        self.in_title = 1

    def end_title(self):
        self.in_title = 0

    def start_body(self, attributes):
        self.in_body = 1

    def end_body(self):
        self.in_body = 0

    def start_topics(self, attributes):
        self.in_topics = 1

    def end_topics(self):
        self.in_topics = 0

    def start_d(self, attributes):
        self.in_topic_d = 1

    def end_d(self):
        self.in_topic_d = 0
        self.topics.append(self.topic_d)
        self.topic_d = ""

def stream_reuters_documents(reuters_dir):
    """ Iterate over documents of the Reuters dataset.

    The Reuters archive will automatically be downloaded and uncompressed if
    the `data_path` directory does not exist.

    Documents are represented as dictionaries with 'body' (str),
    'title' (str), 'topics' (list(str)) keys.

    """
    parser = ReutersParser()
    for filename in glob(os.path.join(reuters_dir, "*.sgm")):
        for doc in parser.parse(open(filename, 'rb')):
            yield doc

def maybe_build_vocab(reuters_dir, vocab_file):
    vocab = collections.defaultdict(int)
    if os.path.exists(vocab_file):
        fvoc = open(vocab_file, "rb")
        for line in fvoc:
            word, idx = line.strip().split("\t")
            vocab[word] = int(idx)
        fvoc.close()
    else:
        counter = collections.Counter()
        num_docs_read = 0
        for doc in stream_reuters_documents(reuters_dir):
            if num_docs_read % 100 == 0:
                print("building vocab from {:d} docs"
                    .format(num_docs_read))
            topics = doc["topics"]
            if len(topics) == 0:
                continue
            title = doc["title"]
            body = doc["body"]
            title_body = ". ".join([title, body]).lower()
            for sent in nltk.sent_tokenize(title_body):
                for word in nltk.word_tokenize(sent):
                    counter[word] += 1
            for i, c in enumerate(counter.most_common(VOCAB_SIZE)):
                vocab[c[0]] = i + 1
            num_docs_read += 1
        print("vocab built from {:d} docs, complete"
            .format(num_docs_read))
        fvoc = open(vocab_file, "wb")
        for k in vocab.keys():
            fvoc.write("{:s}\t{:d}\n".format(k, vocab[k]))
        fvoc.close()
    return vocab

##################### main ######################

DATA_DIR = "../data"
REUTERS_DIR = os.path.join(DATA_DIR, "reuters-21578")
VOCAB_FILE = os.path.join(DATA_DIR, "vocab.txt")
VOCAB_SIZE = 5000

vocab = maybe_build_vocab(REUTERS_DIR, VOCAB_FILE)

ftext = open(os.path.join(DATA_DIR, "text.tsv"), "wb")
ftags = open(os.path.join(DATA_DIR, "tags.tsv"), "wb")
num_read = 0
for doc in stream_reuters_documents(REUTERS_DIR):
    # skip docs without specified topic
    topics = doc["topics"]
    if len(topics) == 0:
        continue
    title = doc["title"]
    body = doc["body"]
    num_read += 1
    # concatenate title and body and convert to list of word indexes
    title_body = ". ".join([title, body]).lower()
    title_body = re.sub("\n", "", title_body)
    title_body = title_body.encode("utf8").decode("ascii", "ignore")
    ftext.write("{:d}\t{:s}\n".format(num_read, title_body))
    ftags.write("{:d}\t{:s}\n".format(num_read, ",".join(topics)))

ftext.close()
ftags.close()

The next step is to build the vectors for the category tags. A document can have one or more tags (documents without tags were dropped during parsing), but tags are never repeated within a document. So we use a CountVectorizer to build a sparse vector whose size is the number of unique tags; the vector is all zeros except at the positions corresponding to the document's tags.

# Source: src/tag-sims.py
from __future__ import division, print_function
from sklearn.feature_extraction.text import CountVectorizer
import os
import re

import dsutils

DATA_DIR = "../data"
VECTORS_FILE = os.path.join(DATA_DIR, "tag-vecs.mtx")

tags = []
ftags = open(os.path.join(DATA_DIR, "tags.tsv"), "rb")
for line in ftags:
    docid, taglist = line.strip().split("\t")
    taglist = re.sub(",", " ", taglist)
    tags.append(taglist)
ftags.close()

cvec = CountVectorizer()
X = cvec.fit_transform(tags)

dsutils.save_vectors(X, VECTORS_FILE, is_sparse=True)

On the document text side, the baseline vectorizer using raw counts is very similar. The only differences are that we filter out English stop words and limit our vocabulary to the top 5,000 of the approximately 45,000 terms in the corpus.

# Source: src/wordcount-sims.py
from __future__ import division, print_function
from sklearn.feature_extraction.text import CountVectorizer
import os

import dsutils

DATA_DIR = "../data"
MAX_FEATURES = 50
VECTORS_FILE = os.path.join(DATA_DIR, 
    "wordcount-{:d}-vecs.mtx".format(MAX_FEATURES))

texts = []
ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb")
for line in ftext:
    docid, text = line.strip().split("\t")
    texts.append(text)
ftext.close()

cvec = CountVectorizer(max_features=MAX_FEATURES,
                       stop_words="english", 
                       binary=True)
X = cvec.fit_transform(texts)

dsutils.save_vectors(X, VECTORS_FILE, is_sparse=True)

Having generated these files, we can now compute the similarity between all pairs of tag vectors and between all pairs of text vectors. The similarity metric used is cosine similarity, chosen because it can be computed efficiently using matrix operations. We then extract the upper triangle of each similarity matrix, so that each pair is counted only once; the diagonal is also excluded, so we don't consider the similarity of a vector with itself. The upper triangles are flattened and the Pearson correlation coefficient is computed between them.

# Source: src/calc-pearson.py
from __future__ import division, print_function
from scipy import stats
import os
import time

import dsutils

DATA_DIR = "../data"

VECTORIZER = "wordcount"
#VECTORIZER = "tfidf"
#VECTORIZER = "lsa"
#VECTORIZER = "glove"
#VECTORIZER = "w2v"

X_IS_SPARSE = True
Y_IS_SPARSE = True
#Y_IS_SPARSE = False

NUM_FEATURES = 5000

XFILE = os.path.join(DATA_DIR, "tag-vecs.mtx")
YFILE = os.path.join(DATA_DIR, "{:s}-{:d}-vecs.{:s}"
    .format(VECTORIZER, NUM_FEATURES, 
            "mtx" if Y_IS_SPARSE else "csv"))

X = dsutils.load_vectors(XFILE, is_sparse=X_IS_SPARSE)
Y = dsutils.load_vectors(YFILE, is_sparse=Y_IS_SPARSE)

XD = dsutils.compute_cosine_sims(X, is_sparse=X_IS_SPARSE)
YD = dsutils.compute_cosine_sims(Y, is_sparse=Y_IS_SPARSE)

XDT = dsutils.get_upper_triangle(XD, is_sparse=X_IS_SPARSE)
YDT = dsutils.get_upper_triangle(YD, is_sparse=Y_IS_SPARSE)

corr, _ = stats.pearsonr(XDT, YDT)
print("Pearson correlation: {:.3f}".format(corr))

Another thing to note is that CountVectorizer returns a SciPy sparse matrix, so the tag vectors and the raw count based text vectors are both sparse. We continue to use sparse matrix operations all the way until we extract the upper triangles from the similarity matrices, i.e., until the get_upper_triangle calls above. The inputs to the stats.pearsonr call are both dense.

However, for some of the later vectorization approaches, starting with LSA, the vectors are necessarily dense, so we use Numpy's dense matrix operations instead. That is why we specify the is_sparse parameter in all our utility functions. Also, sparse matrices are stored in Matrix Market format and dense matrices in Numpy text (CSV) format, so the is_sparse setting also determines the file name suffix. Code for the dsutils module is shown below:

# Source: dsutils.py
from __future__ import division, print_function
from scipy import sparse, io, stats
import matplotlib.pyplot as plt
import numpy as np
import numpy.linalg as LA

def compute_cosine_sims(X, is_sparse=True):
    if is_sparse:
        Xnormed = X / sparse.linalg.norm(X, "fro")
        Xtnormed = X.T / sparse.linalg.norm(X.T, "fro")
        S = Xnormed * Xtnormed
    else:
        Xnormed = X / LA.norm(X, ord="fro")
        Xtnormed = X.T / LA.norm(X.T, ord="fro")
        S = np.dot(Xnormed, Xtnormed)
    return S

def save_vectors(X, filename, is_sparse=True):
    if is_sparse:
        io.mmwrite(filename, X)
    else:
        np.savetxt(filename, X, delimiter=",", fmt="%.5e")

def load_vectors(filename, is_sparse=True):
    if is_sparse:
        return io.mmread(filename)
    else:
        return np.loadtxt(filename, delimiter=",")

def get_upper_triangle(X, k=1, is_sparse=True):
    if is_sparse:
        return sparse.triu(X, k=k).toarray().flatten()
    else:
        return np.triu(X, k=k).flatten()

For word count based vectors with a vocabulary of the top 5,000 words, the correlation of the cosine similarity distribution with that of the tag vectors was 0.135. Filtering out English stopwords increased it to 0.276. Binarizing the count vector (so each word in a document is counted only once) increased it further to 0.414. Varying the vocabulary size did not change these numbers significantly.

Generating TF-IDF vectors is simply a matter of using a different vectorizer, the TfidfVectorizer, also available in Scikit-Learn. Like the CountVectorizer, it generates sparse vectors.

# Source: src/tfidf-sims.py
from __future__ import division, print_function
from sklearn.feature_extraction.text import TfidfVectorizer
import os

import dsutils

DATA_DIR = "../data"
MAX_FEATURES = 300
VECTORS_FILE = os.path.join(DATA_DIR, 
    "tfidf-{:d}-vecs.mtx".format(MAX_FEATURES))

texts = []
ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb")
for line in ftext:
    docid, text = line.strip().split("\t")
    texts.append(text)
ftext.close()

tvec = TfidfVectorizer(max_features=MAX_FEATURES,
                       min_df=0.1, sublinear_tf=True,
                       stop_words="english",
                       binary=True)
X = tvec.fit_transform(texts)

dsutils.save_vectors(X, VECTORS_FILE, is_sparse=True)

With a vocabulary of the 5,000 most important terms, the correlation was 0.453. Adding stopword filtering raised it to 0.464, and binarizing the vector gave us our best correlation of 0.466.

Next up is Latent Semantic Analysis (LSA), which rotates the coordinate space so that the first few dimensions capture the maximum variance, and then reduces the features to those first few dimensions. As you can see from the code below, we use a TfidfVectorizer to generate vectors against the full vocabulary, then use TruncatedSVD to rotate the coordinate space and restrict the number of dimensions. The resulting vectors are dense.

# Source: src/lsa-sims.py
from __future__ import division, print_function
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import os

import dsutils

DATA_DIR = "../data"
MAX_FEATURES = 50
VECTORS_FILE = os.path.join(DATA_DIR, 
    "lsa-{:d}-vecs.csv".format(MAX_FEATURES))

texts = []
ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb")
for line in ftext:
    docid, text = line.strip().split("\t")
    texts.append(text)
ftext.close()

tvec = TfidfVectorizer(sublinear_tf=True,
                       stop_words="english",
                       binary=True)
Xraw = tvec.fit_transform(texts)

lsa = TruncatedSVD(n_components=MAX_FEATURES, random_state=42)
X = lsa.fit_transform(Xraw)

dsutils.save_vectors(X, VECTORS_FILE, is_sparse=False)

Unlike textbook examples where the first few dimensions account for 90+ percent of the variance, I needed to go to the top 1000 dimensions to get 44 percent of the variance.

In [32]: np.sum(lsa.explained_variance_ratio_[0:10])
Out[32]: 0.0465150637457711
In [33]: np.sum(lsa.explained_variance_ratio_[0:300])
Out[33]: 0.24895843614681598
In [34]: np.sum(lsa.explained_variance_ratio_[0:500])
Out[34]: 0.31942420156803719
In [35]: np.sum(lsa.explained_variance_ratio_[0:500])
Out[35]: 0.32257375258317
In [36]: np.sum(lsa.explained_variance_ratio_[0:1000])
Out[36]: 0.44443753062911762

Paradoxically, using a dimension of 1,000 for the text vectors gave me a correlation of 0.424, while reducing the dimension progressively to 500, 300, 200, 100 and 50 gave me correlations of 0.431, 0.437, 0.442, 0.450 and 0.457 respectively. In other words, decreasing the number of dimensions resulted in higher correlation between similarities achieved using category tags and LSA vectors.
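For completeness, the dimension sweep above just amounts to refitting TruncatedSVD at several sizes. The sketch below reuses Xraw from lsa-sims.py, and corr_with_tags() is a hypothetical helper that repeats the cosine similarity and Pearson correlation steps from calc-pearson.py:

# Sketch of the LSA dimension sweep (reusing Xraw from lsa-sims.py above).
# corr_with_tags() is a hypothetical helper wrapping the cosine similarity
# and Pearson correlation steps from calc-pearson.py.
from sklearn.decomposition import TruncatedSVD

for n_dims in [1000, 500, 300, 200, 100, 50]:
    lsa = TruncatedSVD(n_components=n_dims, random_state=42)
    X = lsa.fit_transform(Xraw)
    print("{:4d} dims: correlation = {:.3f}".format(n_dims, corr_with_tags(X)))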

The next vectorizing approach I tried uses GloVe embeddings. GloVe applies matrix factorization to a matrix of word co-occurrence statistics from a corpus to generate word representations that capture their semantics. The GloVe project has made these embeddings available via their website (see link). We will be using the glove.6B set, which was built from the Wikipedia 2014 and Gigaword 5 corpora, containing 6 billion tokens and a vocabulary of 400,000 words. The zip file contains 4 flat files with 50, 100, 200 and 300 dimensional representations of these 400,000 vocabulary words.

In the code below, I use a CountVectorizer with a given vocabulary size to generate the count vector for each text, then for each word in a document, look up the corresponding GloVe embedding and add it into the document vector, multiplied by the count of that word. I then normalize the resulting document vector by the number of words. The resulting dense vectors are then written out to a file.

# Source: src/glove-sims.py
from __future__ import division, print_function
from sklearn.feature_extraction.text import CountVectorizer
import collections
import numpy as np
import os

import dsutils

DATA_DIR = "../data"
EMBEDDING_SIZE = 200
VOCAB_SIZE = 5000
GLOVE_VECS = os.path.join(DATA_DIR, 
    "glove.6B.{:d}d.txt".format(EMBEDDING_SIZE))
VECTORS_FILE = os.path.join(DATA_DIR, 
    "glove-{:d}-vecs.csv".format(EMBEDDING_SIZE))

texts = []
ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb")
for line in ftext:
    docid, text = line.strip().split("\t")
    texts.append(text)
ftext.close()

# read glove vectors
glove = collections.defaultdict(lambda: np.zeros((EMBEDDING_SIZE,)))
fglove = open(GLOVE_VECS, "rb")
for line in fglove:
    cols = line.strip().split()
    word = cols[0]
    embedding = np.array(cols[1:], dtype="float32")
    glove[word] = embedding
fglove.close()

# use CountVectorizer to compute vocabulary
cvec = CountVectorizer(max_features=VOCAB_SIZE,
                       stop_words="english",
                       binary=True)
C = cvec.fit_transform(texts)

word2idx = cvec.vocabulary_
idx2word = {v:k for k, v in word2idx.items()}

# compute document vectors as the count-weighted average of the GloVe
# embeddings of the words in each document (with binary=True the counts
# are 0/1, so this is the average over the distinct in-vocabulary words)
X = np.zeros((C.shape[0], EMBEDDING_SIZE))
for i in range(C.shape[0]):
    row = C[i, :].toarray()
    wids = np.where(row > 0)[1]
    counts = row[:, wids][0]
    num_words = np.sum(counts)
    if num_words == 0:
        continue
    embeddings = np.zeros((wids.shape[0], EMBEDDING_SIZE))
    for j in range(wids.shape[0]):
        wid = wids[j]
        embeddings[j, :] = counts[j] * glove[idx2word[wid]]
    X[i, :] = np.sum(embeddings, axis=0) / num_words

dsutils.save_vectors(X, VECTORS_FILE, is_sparse=False)

I tried various combinations of GloVe embedding dimension and vocabulary size. The best correlation numbers were 0.457 and 0.458, with a GloVe dimension of 200 and a vocabulary size of 5,000 with stopword filtering, for non-binarized and binarized count vectors respectively. Larger GloVe dimensions and larger vocabulary sizes tended to perform better, up to the 200-dimensional embeddings.

My final vectorizing approach was Word2Vec. Word2Vec achieves a semantic representation similar to GloVe's, but it does so by training a model to predict a word given its neighbors. A binary word2vec model, trained on a Google News corpus of about 100 billion words, is available here, and gensim provides an API to read this binary model in Python.

# Source: src/w2v_sims.py
from __future__ import division, print_function
from gensim.models.word2vec import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import os

import dsutils

DATA_DIR = "../data"
MAX_FEATURES = 300
VOCAB_SIZE = 5000
WORD2VEC_MODEL = os.path.join(DATA_DIR, "GoogleNews-vectors-negative300.bin.gz")
VECTORS_FILE = os.path.join(DATA_DIR, 
    "w2v-{:d}-vecs.csv".format(MAX_FEATURES))

texts = []
ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb")
for line in ftext:
    docid, text = line.strip().split("\t")
    texts.append(text)
ftext.close()

# read word2vec vectors
word2vec = Word2Vec.load_word2vec_format(WORD2VEC_MODEL, binary=True)

# use CountVectorizer to compute vocabulary
cvec = CountVectorizer(max_features=VOCAB_SIZE,
                       stop_words="english",
                       binary=True)
C = cvec.fit_transform(texts)

word2idx = cvec.vocabulary_
idx2word = {v:k for k, v in word2idx.items()}

# compute document vectors as the count-weighted average of the word2vec
# embeddings of the words in each document (with binary=True the counts
# are 0/1, so this is the average over the distinct in-vocabulary words)
X = np.zeros((C.shape[0], 300))
for i in range(C.shape[0]):
    row = C[i, :].toarray()
    wids = np.where(row > 0)[1]
    counts = row[:, wids][0]
    num_words = np.sum(counts)
    if num_words == 0:
        continue
    embeddings = np.zeros((wids.shape[0], MAX_FEATURES))
    for j in range(wids.shape[0]):
        wid = wids[j]
        try:
            emb = word2vec[idx2word[wid]]
            embeddings[j, :] = counts[j] * emb
        except KeyError:
            continue
    X[i, :] = np.sum(embeddings, axis=0) / num_words

dsutils.save_vectors(X, VECTORS_FILE, is_sparse=False)

Since the word2vec model provides vectors of a single dimensionality (300), I only tried a few variations of vocabulary size (with stopword filtering). The correlation rises from 0.429 to 0.534 as the vocabulary size increases from 50 to 5,000. Binarizing the text vector results in a drop to 0.522.

The chart below summarizes the spread of correlation numbers against the category tag similarity matrix for the document similarity matrices produced by each of the vectorizers. The top of the blue area represents the best result I got out of that vectorizer with some combination of hyperparameters, and the bottom represents the worst. Obviously, my tests were not that extensive, and it's quite possible that these vectorizers would yield better results with some other combination of hyperparameters. But it does give an indication of the relative merits of the different vectorizers, which is what I was after. Based on this, it looks like TF-IDF is still the best approach for traditional vectorization, and word2vec is the best approach for deep learning based vectorization (although I have seen cases where GloVe is clearly better).


So anyway, that's all I had for today. If you enjoyed this post and would like to work with the code, it can be found in my GitHub project reuters-docsim. If you have ideas for other vectorization approaches for this corpus, do drop me a note, or better still, send me a pull request with the vectorizer code.