Saturday, December 10, 2016

Document Similarity using various Text Vectorizing Strategies


Back when I was learning about text mining, I wrote this post titled IR Math with Java: TF, IDF and LSI. A recent comment/question on that post sparked off a train of thought which ended up being a driver for this post. In addition, I also wanted to compare a few text vectorization strategies for something else I was doing, so that was another driver. In contrast to the previous post, where I was exploring ideas described in the Text Mining Application Programming book using toy datasets, in this post I use a larger dataset. In addition, the dataset is (sort of) labeled, so I use that to compare the approaches quantitatively.

The dataset I chose for this exercise is the Reuters-21578 corpus from the UCI Machine Learning Repository. The corpus is a collection of 21,578 news stories that appeared on the Reuters newswire service in 1987. Each document is manually categorized into zero or more category tags. There are 481 unique tags across the documents. The number of tags per document varies from 0 (for about 1,862 documents) to 35. The distribution is heavily right-skewed, with the mean number of tags per document being 2.194 and the median being 2. The top histograms below show the distribution of tags per document across the corpus. The bottom chart shows the distribution of the top 20 (by frequency) tags.
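
For reference, here is a minimal sketch (not one of the scripts from this post) of how these tag statistics can be computed. It assumes the stream_reuters_documents generator from src/parse-input.py shown further down has been made importable, for example by renaming that file to a valid module name such as parse_input.py.

# Sketch: compute tag statistics over the corpus. Assumes the
# stream_reuters_documents generator from parse-input.py (renamed, e.g.
# to parse_input.py, so it can be imported).
from __future__ import division, print_function
import collections
import numpy as np

from parse_input import stream_reuters_documents

REUTERS_DIR = "../data/reuters-21578"

tags_per_doc = []
tag_freqs = collections.Counter()
for doc in stream_reuters_documents(REUTERS_DIR):
    topics = doc["topics"]
    tags_per_doc.append(len(topics))
    tag_freqs.update(topics)

num_tags = np.array(tags_per_doc)
print("unique tags:", len(tag_freqs))
print("documents with no tags:", np.sum(num_tags == 0))
print("mean / median tags per document: {:.3f} / {:.1f}".format(
    np.mean(num_tags), np.median(num_tags)))
print("top 20 tags:", tag_freqs.most_common(20))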


In addition to the category tags, each document has a title and a block of text. For our analysis, we will consider the title to be part of the text and treat each document as simply a collection of terms. The longest document has 53 sentences, and the average document contains about 6.67 sentences.

Since the category tags are manually assigned, we can think of them as ground truth labels. Then the overlap of the category tags for a pair of documents can be considered the true value of the similarity between them. We can now try various vectorization techniques using the title and body of the documents, and compute the similarity between pairs of document vectors. The correlation between the distribution of similarities computed between document vectors and that computed between category tags would then indicate the overall quality of the vectorization technique.
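
As a toy illustration (the tags and numbers here are made up for the example, not taken from the corpus analysis): if one document carries the tags grain and wheat, and another carries grain and corn, their binary tag vectors each have two active positions and share one of them, so the cosine similarity between them, which is the "true" similarity we correlate against, works out to 0.5.

# Toy example: cosine similarity between binary category tag vectors
# measures tag overlap. Tag columns are [corn, grain, wheat].
import numpy as np

doc_a = np.array([0, 1, 1])   # tagged grain, wheat
doc_b = np.array([1, 1, 0])   # tagged corn, grain

cosine = np.dot(doc_a, doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(cosine)   # 1 / (sqrt(2) * sqrt(2)) = 0.5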

The vectorization techniques I have compared in this post are raw word counts (aka Term Frequency or TF), Term Frequency Inverse Document Frequency (TF-IDF), Latent Semantic Analysis (LSA), Global Vectors for Word Representation (GloVe) and Word2Vec embeddings. The general approach is as follows. We compute (once) the category tag vectors based on raw counts. Then for each vectorization strategy, we generate the document vectors using that strategy. We then compute the category tag similarities and corresponding text similarity between all pairs of documents, and compute the Pearson Correlation coefficient between these two distributions.

Before we do that, though, we need to parse the Reuters-21578 corpus into a format our downstream components can consume easily. The Scikit-Learn examples contain a parser for the Reuters-21578 corpus, which I adapted (almost verbatim) in the code below; it parses the dataset and writes the text and tags out into two separate files.

# Source: src/parse-input.py
from __future__ import division, print_function
from sklearn.externals.six.moves import html_parser
from glob import glob
import collections
import nltk
import os
import re

class ReutersParser(html_parser.HTMLParser):
    """ Utility class to parse a SGML file and yield documents one at 
        a time. 
    """
    def __init__(self, encoding='latin-1'):
        html_parser.HTMLParser.__init__(self)
        self._reset()
        self.encoding = encoding

    def handle_starttag(self, tag, attrs):
        method = 'start_' + tag
        getattr(self, method, lambda x: None)(attrs)

    def handle_endtag(self, tag):
        method = 'end_' + tag
        getattr(self, method, lambda: None)()

    def _reset(self):
        self.in_title = 0
        self.in_body = 0
        self.in_topics = 0
        self.in_topic_d = 0
        self.title = ""
        self.body = ""
        self.topics = []
        self.topic_d = ""

    def parse(self, fd):
        self.docs = []
        for chunk in fd:
            self.feed(chunk.decode(self.encoding))
            for doc in self.docs:
                yield doc
            self.docs = []
        self.close()

    def handle_data(self, data):
        if self.in_body:
            self.body += data
        elif self.in_title:
            self.title += data
        elif self.in_topic_d:
            self.topic_d += data

    def start_reuters(self, attributes):
        pass

    def end_reuters(self):
        self.body = re.sub(r'\s+', r' ', self.body)
        self.docs.append({'title': self.title,
                          'body': self.body,
                          'topics': self.topics})
        self._reset()

    def start_title(self, attributes):
        self.in_title = 1

    def end_title(self):
        self.in_title = 0

    def start_body(self, attributes):
        self.in_body = 1

    def end_body(self):
        self.in_body = 0

    def start_topics(self, attributes):
        self.in_topics = 1

    def end_topics(self):
        self.in_topics = 0

    def start_d(self, attributes):
        self.in_topic_d = 1

    def end_d(self):
        self.in_topic_d = 0
        self.topics.append(self.topic_d)
        self.topic_d = ""

def stream_reuters_documents(reuters_dir):
    """ Iterate over documents of the Reuters dataset.

    The Reuters archive will automatically be downloaded and uncompressed if
    the `data_path` directory does not exist.

    Documents are represented as dictionaries with 'body' (str),
    'title' (str), 'topics' (list(str)) keys.

    """
    parser = ReutersParser()
    for filename in glob(os.path.join(reuters_dir, "*.sgm")):
        for doc in parser.parse(open(filename, 'rb')):
            yield doc

def maybe_build_vocab(reuters_dir, vocab_file):
    vocab = collections.defaultdict(int)
    if os.path.exists(vocab_file):
        fvoc = open(vocab_file, "rb")
        for line in fvoc:
            word, idx = line.strip().split("\t")
            vocab[word] = int(idx)
        fvoc.close()
    else:
        counter = collections.Counter()
        num_docs_read = 0
        for doc in stream_reuters_documents(reuters_dir):
            if num_docs_read % 100 == 0:
                print("building vocab from {:d} docs"
                    .format(num_docs_read))
            topics = doc["topics"]
            if len(topics) == 0:
                continue
            title = doc["title"]
            body = doc["body"]
            title_body = ". ".join([title, body]).lower()
            for sent in nltk.sent_tokenize(title_body):
                for word in nltk.word_tokenize(sent):
                    counter[word] += 1
            num_docs_read += 1
        # assign word indexes once, after counting words across all documents
        for i, c in enumerate(counter.most_common(VOCAB_SIZE)):
            vocab[c[0]] = i + 1
        print("vocab built from {:d} docs, complete"
            .format(num_docs_read))
        fvoc = open(vocab_file, "wb")
        for k in vocab.keys():
            fvoc.write("{:s}\t{:d}\n".format(k, vocab[k]))
        fvoc.close()
    return vocab

##################### main ######################

DATA_DIR = "../data"
REUTERS_DIR = os.path.join(DATA_DIR, "reuters-21578")
VOCAB_FILE = os.path.join(DATA_DIR, "vocab.txt")
VOCAB_SIZE = 5000

vocab = maybe_build_vocab(REUTERS_DIR, VOCAB_FILE)

ftext = open(os.path.join(DATA_DIR, "text.tsv"), "wb")
ftags = open(os.path.join(DATA_DIR, "tags.tsv"), "wb")
num_read = 0
for doc in stream_reuters_documents(REUTERS_DIR):
    # skip docs without specified topic
    topics = doc["topics"]
    if len(topics) == 0:
        continue
    title = doc["title"]
    body = doc["body"]
    num_read += 1
    # concatenate title and body and convert to list of word indexes
    title_body = ". ".join([title, body]).lower()
    title_body = re.sub("\n", "", title_body)
    title_body = title_body.encode("utf8").decode("ascii", "ignore")
    ftext.write("{:d}\t{:s}\n".format(num_read, title_body))
    ftags.write("{:d}\t{:s}\n".format(num_read, ",".join(topics)))

ftext.close()
ftags.close()

The next step is to build the vectors for the category tags. A document can have zero or more tags, but tags are never repeated within a document. So we use a CountVectorizer to build a sparse vector of the same size as the number of unique tags. The vector is mostly zero except for the positions represented by its tags.

# Source: src/tag-sims.py
from __future__ import division, print_function
from sklearn.feature_extraction.text import CountVectorizer
import os
import re

import dsutils

DATA_DIR = "../data"
VECTORS_FILE = os.path.join(DATA_DIR, "tag-vecs.mtx")

tags = []
ftags = open(os.path.join(DATA_DIR, "tags.tsv"), "rb")
for line in ftags:
    docid, taglist = line.strip().split("\t")
    taglist = re.sub(",", " ", taglist)
    tags.append(taglist)
ftags.close()

cvec = CountVectorizer()
X = cvec.fit_transform(tags)

dsutils.save_vectors(X, VECTORS_FILE, is_sparse=True)

On the document text side, the baseline vectorizer using raw counts is very similar. The only difference is that we filter out English stop words and limit our vocabulary to the top 5,000 of the approximately 45,000 unique terms in the corpus.

# Source: src/wordcount-sims.py
from __future__ import division, print_function
from sklearn.feature_extraction.text import CountVectorizer
import os

import dsutils

DATA_DIR = "../data"
MAX_FEATURES = 50
VECTORS_FILE = os.path.join(DATA_DIR, 
    "wordcount-{:d}-vecs.mtx".format(MAX_FEATURES))

texts = []
ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb")
for line in ftext:
    docid, text = line.strip().split("\t")
    texts.append(text)
ftext.close()

cvec = CountVectorizer(max_features=MAX_FEATURES,
                       stop_words="english", 
                       binary=True)
X = cvec.fit_transform(texts)

dsutils.save_vectors(X, VECTORS_FILE, is_sparse=True)

Having generated these files, we can now compute the similarity between all pairs of tag vectors and all pairs of text vectors. The similarity metric used is Cosine Similarity, chosen because it can be efficiently computed using matrix operations. We then extract the upper triangle of each similarity matrix so that each pair is counted only once. The diagonal is also excluded so that we don't count the similarity of a vector with itself. The upper triangular matrices are flattened and the Pearson correlation coefficient between them is calculated.

# Source: src/calc-pearson.py
from __future__ import division, print_function
from scipy import stats
import os
import time

import dsutils

DATA_DIR = "../data"

VECTORIZER = "wordcount"
#VECTORIZER = "tfidf"
#VECTORIZER = "lsa"
#VECTORIZER = "glove"
#VECTORIZER = "w2v"

X_IS_SPARSE = True
Y_IS_SPARSE = True
#Y_IS_SPARSE = False

NUM_FEATURES = 5000

XFILE = os.path.join(DATA_DIR, "tag-vecs.mtx")
YFILE = os.path.join(DATA_DIR, "{:s}-{:d}-vecs.{:s}"
    .format(VECTORIZER, NUM_FEATURES, 
            "mtx" if Y_IS_SPARSE else "csv"))

X = dsutils.load_vectors(XFILE, is_sparse=X_IS_SPARSE)
Y = dsutils.load_vectors(YFILE, is_sparse=Y_IS_SPARSE)

XD = dsutils.compute_cosine_sims(X, is_sparse=X_IS_SPARSE)
YD = dsutils.compute_cosine_sims(Y, is_sparse=Y_IS_SPARSE)

XDT = dsutils.get_upper_triangle(XD, is_sparse=X_IS_SPARSE)
YDT = dsutils.get_upper_triangle(YD, is_sparse=Y_IS_SPARSE)

corr, _ = stats.pearsonr(XDT, YDT)
print("Pearson correlation: {:.3f}".format(corr))

Another thing to note is that the CountVectorizer returns a Scipy sparse matrix, so the tag vectors and the raw count based text vectors are both sparse. We continue to use sparse matrix operations all the way until we extract the upper triangle from the similarity matrices, i.e., the get_upper_triangle calls above. The input vectors to the stats.pearsonr call are both dense.

However, for some of the later vectorization approaches, starting with LSA, the vectors are necessarily dense, so we use Numpy's operations for dense matrices instead. That is the reason we specify the is_sparse parameter in all our dsutils calls. Also, sparse matrices are stored in Matrix Market format and dense matrices in Numpy text (CSV) format, so the setting of the is_sparse parameter also determines the file name suffix. Code for the dsutils module is shown below:

# Source: dsutils.py
from __future__ import division, print_function
from scipy import sparse, io, stats
import matplotlib.pyplot as plt
import numpy as np
import numpy.linalg as LA

def compute_cosine_sims(X, is_sparse=True):
    if is_sparse:
        Xnormed = X / sparse.linalg.norm(X, "fro")
        Xtnormed = X.T / sparse.linalg.norm(X.T, "fro")
        S = Xnormed * Xtnormed
    else:
        Xnormed = X / LA.norm(X, ord="fro")
        Xtnormed = X.T / LA.norm(X.T, ord="fro")
        S = np.dot(Xnormed, Xtnormed)
    return S

def save_vectors(X, filename, is_sparse=True):
    if is_sparse:
        io.mmwrite(filename, X)
    else:
        np.savetxt(filename, X, delimiter=",", fmt="%.5e")

def load_vectors(filename, is_sparse=True):
    if is_sparse:
        return io.mmread(filename)
    else:
        return np.loadtxt(filename, delimiter=",")

def get_upper_triangle(X, k=1, is_sparse=True):
    if is_sparse:
        return sparse.triu(X, k=k).toarray().flatten()
    else:
        return np.triu(X, k=k).flatten()

For word count based vectors, using a vocabulary of the top 5,000 words, the correlation of the cosine similarity distribution with that of the tag vectors was 0.135. Filtering out English stopwords increased it to 0.276. Binarizing the count vectors (so that each word in a document is counted only once) increased it further to 0.414. Varying the vocabulary size did not change these numbers significantly.
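
For the record, the three variants above correspond roughly to the following CountVectorizer settings (a sketch reconstructed from the description, not a separate script from the post); each set of vectors would then be saved with dsutils.save_vectors and scored with calc-pearson.py as shown above.

# Sketch: the three wordcount variants described above. These settings are
# assumptions reconstructed from the text, not a script from the repository.
from sklearn.feature_extraction.text import CountVectorizer

MAX_FEATURES = 5000

# raw counts, no stopword filtering (correlation 0.135)
cvec_raw = CountVectorizer(max_features=MAX_FEATURES)

# raw counts with English stopwords filtered out (correlation 0.276)
cvec_stop = CountVectorizer(max_features=MAX_FEATURES, stop_words="english")

# binarized counts with stopwords filtered out (correlation 0.414)
cvec_bin = CountVectorizer(max_features=MAX_FEATURES, stop_words="english",
                           binary=True)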

Generating TF-IDF vectors is simply a matter of using a different vectorizer, the TfidfVectorizer, also available in Scikit-Learn. Like the CountVectorizer, it generates sparse vectors.

# Source: src/tfidf-sims.py
from __future__ import division, print_function
from sklearn.feature_extraction.text import TfidfVectorizer
import os

import dsutils

DATA_DIR = "../data"
MAX_FEATURES = 300
VECTORS_FILE = os.path.join(DATA_DIR, 
    "tfidf-{:d}-vecs.mtx".format(MAX_FEATURES))

texts = []
ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb")
for line in ftext:
    docid, text = line.strip().split("\t")
    texts.append(text)
ftext.close()

tvec = TfidfVectorizer(max_features=MAX_FEATURES,
                       min_df=0.1, sublinear_tf=True,
                       stop_words="english",
                       binary=True)
X = tvec.fit_transform(texts)

dsutils.save_vectors(X, VECTORS_FILE, is_sparse=True)

With a vocabulary of the 5,000 most important terms, the correlation was 0.453. Adding stopword filtering made it rise to 0.464, and binarizing the vectors gave us our best correlation of 0.466.

Next up is Latent Semantic Analysis (LSA), which rotates the co-ordinate space so that the first few dimensions capture the maximum variance, and then keeps only those dimensions as features. As you can see from the code below, we use a TfidfVectorizer to generate vectors against the full vocabulary, then use TruncatedSVD to rotate the co-ordinate space and restrict the number of dimensions. The resulting vectors are dense.

# Source: src/lsa-sims.py
from __future__ import division, print_function
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import os

import dsutils

DATA_DIR = "../data"
MAX_FEATURES = 50
VECTORS_FILE = os.path.join(DATA_DIR, 
    "lsa-{:d}-vecs.csv".format(MAX_FEATURES))

texts = []
ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb")
for line in ftext:
    docid, text = line.strip().split("\t")
    texts.append(text)
ftext.close()

tvec = TfidfVectorizer(sublinear_tf=True,
                       stop_words="english",
                       binary=True)
Xraw = tvec.fit_transform(texts)

lsa = TruncatedSVD(n_components=MAX_FEATURES, random_state=42)
X = lsa.fit_transform(Xraw)

dsutils.save_vectors(X, VECTORS_FILE, is_sparse=False)

Unlike textbook examples where the first few dimensions account for 90+ percent of the variance, I needed to go to the top 1000 dimensions to get 44 percent of the variance.

In [32]: np.sum(lsa.explained_variance_ratio_[0:10])
Out[32]: 0.0465150637457711
In [33]: np.sum(lsa.explained_variance_ratio_[0:300])
Out[33]: 0.24895843614681598
In [34]: np.sum(lsa.explained_variance_ratio_[0:500])
Out[34]: 0.31942420156803719
In [35]: np.sum(lsa.explained_variance_ratio_[0:500])
Out[35]: 0.32257375258317
In [36]: np.sum(lsa.explained_variance_ratio_[0:1000])
Out[36]: 0.44443753062911762

Paradoxically, using a dimension of 1,000 for the text vectors gave me a correlation of 0.424, while reducing the dimension progressively to 500, 300, 200, 100 and 50 gave me correlations of 0.431, 0.437, 0.442, 0.450 and 0.457 respectively. In other words, decreasing the number of dimensions resulted in higher correlation between similarities achieved using category tags and LSA vectors.
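
To give an idea of how such a sweep can be run end to end, here is a sketch (not one of the scripts from the post, which ran lsa-sims.py and calc-pearson.py separately for each dimension) that reuses the dsutils helpers and the file layout from above.

# Sketch: sweep the number of LSA dimensions and compute the correlation
# against the category tag similarities for each. Reuses the dsutils helpers
# and the text.tsv / tag-vecs.mtx files produced by the scripts above.
from __future__ import division, print_function
from scipy import stats
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import os

import dsutils

DATA_DIR = "../data"

texts = []
ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb")
for line in ftext:
    docid, text = line.strip().split("\t")
    texts.append(text)
ftext.close()

# full-vocabulary TF-IDF matrix, same settings as lsa-sims.py
tvec = TfidfVectorizer(sublinear_tf=True, stop_words="english", binary=True)
Xraw = tvec.fit_transform(texts)

# flattened upper triangle of the tag similarity matrix, computed once
XTAG = dsutils.load_vectors(os.path.join(DATA_DIR, "tag-vecs.mtx"),
                            is_sparse=True)
XDT = dsutils.get_upper_triangle(
    dsutils.compute_cosine_sims(XTAG, is_sparse=True), is_sparse=True)

for n_dims in [50, 100, 200, 300, 500, 1000]:
    lsa = TruncatedSVD(n_components=n_dims, random_state=42)
    X = lsa.fit_transform(Xraw)
    YDT = dsutils.get_upper_triangle(
        dsutils.compute_cosine_sims(X, is_sparse=False), is_sparse=False)
    corr, _ = stats.pearsonr(XDT, YDT)
    print("{:4d} dims: correlation {:.3f}".format(n_dims, corr))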

The next vectorizing approach I tried uses GloVe embeddings. GloVe applies matrix factorization to a matrix of word co-occurrence statistics from a corpus to generate word representations that capture their semantics. The GloVe project has made these embeddings available via their website (see link). We will be using the glove.6B set, which was created from the Wikipedia 2014 and Gigaword 5 corpora, containing 6 billion tokens and a vocabulary of 400,000 words. The zip file contains 4 flat files, with 50, 100, 200 and 300 dimensional representations of these 400,000 vocabulary words.

In the code below, I use a CountVectorizer with a given vocabulary size to generate the count vector from the text, then for each word in a document, look up the corresponding GloVe embedding and add it into the document vector, multiplied by the word's count. I then normalize the resulting document vector by the number of words. The resulting dense vector is then written out to file.

# Source: src/glove-sims.py
from __future__ import division, print_function
from sklearn.feature_extraction.text import CountVectorizer
import collections
import numpy as np
import os

import dsutils

DATA_DIR = "../data"
EMBEDDING_SIZE = 200
VOCAB_SIZE = 5000
GLOVE_VECS = os.path.join(DATA_DIR, 
    "glove.6B.{:d}d.txt".format(EMBEDDING_SIZE))
VECTORS_FILE = os.path.join(DATA_DIR, 
    "glove-{:d}-vecs.csv".format(EMBEDDING_SIZE))

texts = []
ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb")
for line in ftext:
    docid, text = line.strip().split("\t")
    texts.append(text)
ftext.close()

# read glove vectors
glove = collections.defaultdict(lambda: np.zeros((EMBEDDING_SIZE,)))
fglove = open(GLOVE_VECS, "rb")
for line in fglove:
    cols = line.strip().split()
    word = cols[0]
    embedding = np.array(cols[1:], dtype="float32")
    glove[word] = embedding
fglove.close()

# use CountVectorizer to compute vocabulary
cvec = CountVectorizer(max_features=VOCAB_SIZE,
                       stop_words="english",
                       binary=True)
C = cvec.fit_transform(texts)

word2idx = cvec.vocabulary_
idx2word = {v:k for k, v in word2idx.items()}

# compute document vectors. Each document vector is the count-weighted sum
# of the GloVe embeddings of its words, normalized by the total count. Note
# that with binary=True above, counts are capped at 1, so each distinct
# vocabulary word contributes its embedding at most once.
X = np.zeros((C.shape[0], EMBEDDING_SIZE))
for i in range(C.shape[0]):
    row = C[i, :].toarray()
    wids = np.where(row > 0)[1]
    counts = row[:, wids][0]
    num_words = np.sum(counts)
    if num_words == 0:
        continue
    embeddings = np.zeros((wids.shape[0], EMBEDDING_SIZE))
    for j in range(wids.shape[0]):
        wid = wids[j]
        embeddings[j, :] = counts[j] * glove[idx2word[wid]]
    X[i, :] = np.sum(embeddings, axis=0) / num_words

dsutils.save_vectors(X, VECTORS_FILE, is_sparse=False)

I tried various combinations of GloVe embedding dimension and vocabulary size. The best correlation numbers were 0.457 and 0.458 with a GloVe dimension of 200 and a vocabulary size of 5,000 with stopword filtering, for non-binarized and binarized count vectors respectively. Larger GloVe dimensions and larger vocabulary sizes tended to perform better, up to the 200-dimensional embeddings.

My final vectorizing approach was Word2Vec. Word2Vec achieves a semantic representation similar to GloVe's, but it does so by training a model to predict a word given its neighbors. A binary word2vec model, trained on the Google News corpus of about 100 billion words, is available here, and gensim provides an API to read this binary model in Python.

# Source: src/w2v_sims.py
from __future__ import division, print_function
from gensim.models.word2vec import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import os

import dsutils

DATA_DIR = "../data"
MAX_FEATURES = 300
VOCAB_SIZE = 5000
WORD2VEC_MODEL = os.path.join(DATA_DIR, "GoogleNews-vectors-negative300.bin.gz")
VECTORS_FILE = os.path.join(DATA_DIR, 
    "w2v-{:d}-vecs.csv".format(MAX_FEATURES))

texts = []
ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb")
for line in ftext:
    docid, text = line.strip().split("\t")
    texts.append(text)
ftext.close()

# read word2vec vectors
word2vec = Word2Vec.load_word2vec_format(WORD2VEC_MODEL, binary=True)

# use CountVectorizer to compute vocabulary
cvec = CountVectorizer(max_features=VOCAB_SIZE,
                       stop_words="english",
                       binary=True)
C = cvec.fit_transform(texts)

word2idx = cvec.vocabulary_
idx2word = {v:k for k, v in word2idx.items()}

# compute document vectors. Each document vector is the count-weighted sum
# of the word2vec embeddings of its words, normalized by the total count.
# Note that with binary=True above, counts are capped at 1, so each distinct
# vocabulary word contributes its embedding at most once.
X = np.zeros((C.shape[0], MAX_FEATURES))
for i in range(C.shape[0]):
    row = C[i, :].toarray()
    wids = np.where(row > 0)[1]
    counts = row[:, wids][0]
    num_words = np.sum(counts)
    if num_words == 0:
        continue
    embeddings = np.zeros((wids.shape[0], MAX_FEATURES))
    for j in range(wids.shape[0]):
        wid = wids[j]
        try:
            emb = word2vec[idx2word[wid]]
            embeddings[j, :] = counts[j] * emb
        except KeyError:
            continue
    X[i, :] = np.sum(embeddings, axis=0) / num_words

dsutils.save_vectors(X, VECTORS_FILE, is_sparse=False)

Since the word2vec model provides vectors of a single dimensionality (300), I tried a few variations of vocabulary size (with stopword filtering). The correlation rises from 0.429 to 0.534 as the vocabulary size increases from 50 to 5,000. Binarizing the text vectors results in a drop to 0.522.

The chart below summarizes, for each vectorizer, the spread of correlations between the document similarity matrix it produces and the category tag similarity matrix. The top of the blue area represents the best result I got out of that vectorizer with some combination of hyperparameters, and the bottom represents the worst result. Obviously, my tests were not that extensive, and it's very likely that these vectorizers might yield better results with some other combination of hyperparameters. But it does give an indication of the relative merits of the different vectorizers, which is what I was after. Based on this, it looks like TF-IDF is still the best approach for traditional vectorization and word2vec is the best approach for deep learning based vectorization (although I have seen cases where GloVe is clearly better).


So anyway, that's all I had for today. If you enjoyed this post and would like to work with the code, it can be found in my Github project reuters-docsim. If you have ideas for other vectorization approaches for this corpus, do drop me a note, or better still, a pull request with the vectorizer code.


11 comments (moderated to prevent spam):

Venkat Nagaswamy said...

Nice one. Some other vectorizing options to try: doc2vec, thought vectors, autoencoders (of various kinds). If possible, I'll try it over the break next week.

Sujit Pal said...

Thanks. I had thought about doing this with an autoencoder. Another thing I want to try is topic modeling. If you end up implementing the 3 things you mentioned, please send me a link and I will link to your post.

Blue Monk said...

I am currently also investigating ways to represent documents (tweets in my case) for the purpose of tweet clustering. Interesting that you chose not to apply pooling functions (e.g. averaging or summing) for extracting your word embedding vectors.

Sujit Pal said...

Actually, for the last two approaches (using GloVe and word2vec), I am using a form of average pooling. Words in the document are replaced with their embeddings, and the embedding vectors summed then divided by the number of words. Although I am not using it in the supervised learning sense yet.

Blue Monk said...

Ah ok, didn't see that :)

Thanks !

Unknown said...

Can we perform semantic analysis of two texts in Java?

Sujit Pal said...

The "semantic" part is driven by using word embeddings, where similar words tend to clump together, and hopefully results in a sentence or document representation that reflects that. The only Java API to word2vec I know of is from deeplearning4j, looks like they also have a way to load data from the pretrained Google word2vec. GLoVe is supplied as flat files, so you can use plain Java to read it, no special API needed.

Arieswonder said...

Hi Sujit,
What do you suggest to use for sentence similarity that can use both semantic and structural similarity of two sentences?

Sujit Pal said...

Do you mean a metric that combines both similarities? If you have access to both numbers, it may make sense to train a regression model to learn the coefficients of how they could be combined to come up with a mixed metric.

Robin S said...

Hi,

In your GloVe approach, you made a mistake by calling CountVectorizer with the parameter binary set to True! By doing so, each document's token counts are all set to 1! Hence your final output does not take into consideration the number of tokens per document.

Best,
Robin S

Sujit Pal said...

First off, sorry about the delay in responding, my comment notification was broken. To answer your question, that was kind of what I was going for, kind of... I found that binarizing my counts gave me better results. By binarizing, I mean treating a token as appeared or not-appeared in a document, so if it doesn't occur, count is 0, and if it appears, regardless of the number of times it appears, it is a 1.