Sunday, June 18, 2017

Trip Report: Graph Day SF 2017


Yesterday I attended the Graph Day SF 2017 conference. Lately, my interest in graphs has been around Knowledge Graphs. Last year, I worked on a project that used an existing knowledge graph and co-occurrences between entity pairs and relations across a large body of text to predict new relations from the text. Although we modeled the co-occurrences as a matrix instead of a graph, I was hoping to learn techniques that I could apply to graphs. One other area of recent interest is learning how to handle large graphs.

So anyway, that was why I went. In this post, I describe the talks I attended. The conference was only one day long, with 4 parallel tracks of very deep, awesome talks, so there were at least 2 talks I would have liked to attend but couldn't because I had to make a choice.

Keynote - Marko Rodriguez, Datastax.

I have always thought of graph people as being somewhat more intellectual than mere programmers, starting with the classes I took at college. The keynote kind of confirms this characterization. The object of the talk was to refute the common assertion by graph people that everything is a graph. The speaker does this by showing that a graph can be thought of structurally, as a collection of vertices and edges, and also as a process, as a collection of functions and streams. Repeatedly differentiating a graph oscillates between the two representations, leading to the conclusion that a graph is infinitely differentiable. Here is the paper on which the talk is based, and here are the slides.

Time for a new Relation: going from RDBMS to graph - Patrick McFadin, Datastax

This talk was decidedly less highbrow compared to the keynote, focusing on why one might want to move from the relational to the graph paradigm. The speaker has lots of experience in RDBMS and Tabular NoSQL databases (Cassandra), and is currently making the shift to graph databases. One key insight is that he classifies the different types of database technology along a continuum - Key Value stores, Tabular NoSQL databases, Document NoSQL databases, RDBMS, graph databases. He also differentiates between the strengths of an RDBMS and those of a graph database as follows - the RDBMS makes it easy to describe relations, but the graph database makes it easy to find relations. He looks at Property Graphs as possible drop-in replacements for RDBMS tables, and he pointed out a free learning resource, DS330: DataStax Enterprise Graph, which seems likely to be product specific, although the introductory video suggests that there is some product agnostic content around data modeling.

Comparing Giraph and GraphX - Jenny Zhao, Drawbridge

Drawbridge's business is to disambiguate your devices from other people's, using their activity logs. In this particular presentation, they describe how they switched from using map-reduce for their feature selection process to using Apache Giraph and saved about 8 hours of processing time. Instead of writing out the pair data, then doing a pairwise compare followed by a pairwise join, they ingest the paired data as a graph and compute distances on the edges to find the best pairs for their downstream process. They also tried Spark GraphX, but found that it doesn't scale as well to large data volumes. Code using GraphX and Giraph is also shown to highlight an important difference between the two.

Graphs in Genomics - Jason Chin, Pacific Biosciences

Interesting presentation about the use of graphs in the field of genomics. The human genome is currently not readable in its entirety, so it is cut into many pieces of random length and resequenced. One possibility is to represent it as 23 bipartite graphs, one for each of our 23 chromosomes. The presentation then focuses on how researchers use graph theory to fill in gaps between the pieces of the genome. Here is a link to an older presentation by the same presenter which covers much of the same material as this talk; I will update with the current presentation when it becomes available.

Knowledge Graph in Watson Discovery - Anshu Jain and Nidhi Rajshree, IBM

The talk focuses on lessons learned while the presenters were building the knowledge graph for IBM Watson. I thought this was a good mix of practical ideas and theory. A few things I found particularly noteworthy: one was including surprise as a parameter - the user can specify a value that indicates their willingness to see serendipitous results. Another was keeping the Knowledge Graph lighter and using it to fine-tune queries at runtime (local context) rather than baking it in at creation time (global context) - you are thus using the Knowledge Graph itself as a context vector. Yet another idea was using Mutual Information as a similarity metric for the element of surprise (useful in intelligence and legal work), since it treats noise equally for both documents. Here is a link to one of the presenters' older slides; I will update the link with the latest one once it becomes available.

A Cognitive Knowledge Base as an Enterprise Database - Haikal Pribadi, GRAKN.AI

The presenter showcases his product GRAKN.AI (sounds like Kraken), which is a distributed knowledge base with a reasoning query language. It was awarded product of the year for 2017 by the University of Cambridge Computer Lab. It has a unified syntax that allows you to define and populate a graph and then query it. The query language feels a bit like Prolog, but is much more readable. It is open source and free to use. I was quite impressed with this product and hope to try it soon. One other thing I noted in his presentation was the use of the DeepDive project for knowledge acquisition, which is a nice confirmation since I am looking at its sister project Snorkel for a similar use case.

Graph Based Taxonomy Generation - Rob McDaniel, LiveStories

The presenter describes building taxonomies from queries. The resulting taxonomies are focused on a small area of knowledge, and can be useful for building custom taxonomies for applications focused on a specific domain. Examples mentioned in the presentation, produced using his approach, were “health care costs” and “poisoning deaths”. The idea is to take a group of (manually created) seed queries about a given subject, hit some given search engine using an API, and collect the top N documents for each query. You then do topic modeling on these documents and generate a document-topic co-occurrence graph (using only topics that have p(topic|document) above a certain threshold). You then partition the graph into subgraphs using an iterative partitioning strategy of coarsening, bisecting and un-coarsening. The graph partitioning algorithm covered in the presentation was Heavy Edge Matching, but other partitioning algorithms could be used as well. Once the partitions are stable, the node with the highest degree of connectedness in each partition becomes the root level element in the taxonomy. This node is then removed from the subgraph, and the subgraph is partitioned recursively into its own subgraphs, until the number of topics in a partition falls below some threshold. The presentation slides and code are available.

Project Konigsburg: A Graph AI - Gunnar Kleemann and Denis Vrdoljak, Berkeley Data Science Group

The presenters describe a similarity metric based on counting triangles and wedges (subgraph motifs) that seems to work better with connected elements in a graph than more traditional metrics. They use this approach to rank features for feature selection. They have used this metric to build a ranking of academics from a citation network extracted from PubMed. They have also used it in several applications that focus on recruiting from the applicant side (resume building, finding the job that best suits your profile, etc.).

Knowledge Graph Platform: Going beyond the database - Michael Grove, Stardog

This was a slightly high level talk by the CTO of Stardog. He outlined what people generally think about when they say Enterprise Knowledge Graph Platforms and the common fallacies in these definitions.

Two presentations I missed because, with 4 tracks running in parallel, I had to choose between awesome talks going on at the same time:

  • DGraph: A native, distributed graph database - Manish Jain, Dgraph Labs.
  • Start Flying with Apache and Tinkerpop - Jason Plurad, IBM

Overall, I thought the conference had really good talks, the venue was excellent, and the event was very well organized. There was no breakfast or snacks, but there was coffee and tea, and the lunch was delicious. One thing I noticed was the absence of video recording, so unfortunately there are not going to be any videos of these talks. There were quite a few booths, mostly graph database vendors. I learned quite a few things here, although I might have learned more if the conference had been spread over 2 days with 2 parallel tracks instead of 4.


Saturday, May 20, 2017

Evaluating a Simple but Tough to Beat Embedding via Text Classification



Recently, a colleague and a reader of this blog independently sent me a link to the Simple but Tough-to-Beat Baseline for Sentence Embeddings (PDF) paper by Sanjeev Arora, Yingyu Liang, and Tengyu Ma. My reader also mentioned that the paper was selected for a mini-review in Lecture 2 of the Natural Language Processing and Deep Learning (CS 224N) course taught at Stanford University by Prof Chris Manning and Richard Socher. For those of you who have taken Stanford's earlier Deep Learning and NLP (CS 224d) course taught by Socher, or the very first Coursera course on Natural Language Processing by Profs Dan Jurafsky and Chris Manning, you will find elements from both in here. There are also some things I think are new or that I might have missed earlier.

The paper introduces an unsupervised scheme for generating sentence embeddings that has been shown to consistently outperform a simple Bag of Words (BoW) approach in a number of evaluation scenarios. The evaluation scenarios considered are both intrinsic (correlating computed similarities of sentence embeddings with human estimates of similarity) as well as extrinsic (using the embeddings for a downstream classification task). I thought the idea was very exciting, since all the techniques I have used to convert word embeddings to sentence embeddings have given results consistent with the complexity used to produce them. At the very low end is the BoW approach, which adds up the embedding vectors for the individual words and averages them over the sentence length. At the other end of the scale is to generate sentence vectors from a sequence of word vectors by training an LSTM and then using it, or by looking up sentence vectors using a trained skip-thoughts encoder.

The Smooth Inverse Frequency (SIF) embedding approach suggested by the paper is only slightly more complicated than the BoW approach, and promises consistently better results than BoW. So for those of us who have used BoW as a baseline, this suggests that we should now use SIF embeddings instead. Instead of just averaging the component word vectors, as in this equation for BoW:
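In LaTeX notation, the BoW sentence vector is just the average of the word vectors over the sentence:

    v_s = \frac{1}{|s|} \sum_{w \in s} v_w

where s is the sentence and v_w is the embedding of word w.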



We generate the sentence vector vs by weighting each component word vector vw by α / (α + p(w)), a smoothed inverse of its probability of occurrence p(w). Here α is a smoothing constant, whose default value as suggested in the paper is 0.001. We then sum these smoothed word vectors and divide by the number of words.
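In LaTeX notation, the (pre-PCA) SIF sentence vector from the paper is:

    v_s = \frac{1}{|s|} \sum_{w \in s} \frac{\alpha}{\alpha + p(w)} v_w

where p(w) is the probability of word w in the corpus and α is the smoothing constant.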



Since we do this for all the sentences in our corpus, we now have a matrix where the number of rows is the number of sentences and the number of columns is the embedding size (typically 300). Removing the projection onto the first principal component of this matrix gives us our sentence embeddings. There is also an implementation of this embedding scheme in the YingyuLiang/SIF GitHub repository.

For my experiment, I decided to compare BoW and SIF vectors by how effective they are when used for text classification. My task is to classify images as compound (i.e., composed of multiple sub-images) versus non-compound (single image, no sub-images) using only the captions. The data comes from the ImageCLEF 2016 (Medical) competition, where Compound Figure Detection is the very first task in the task pipeline. The provided dataset has 21,000 training captions, each about 92 words long on average, split roughly equally between the two classes. The dataset also contains 3,456 test captions (labels provided for validation purposes).

The label and captions are provided as two separate files, for both training and test datasets. Here is an example of what the labels file looks like:

11373_2007_9226_Fig1_HTML,COMP
12178_2007_9002_Fig3_HTML,COMP
12178_2007_9003_Fig1_HTML,COMP
12178_2007_9003_Fig3_HTML,COMP
rr188-1,NOCOMP
rr199-1,NOCOMP
rr36-5,NOCOMP
scrt3-1,NOCOMP

and the captions files look like this:

12178_2007_9003_Fig1_HTML       An 64-year-old female with symptoms of bilateral lower limb neurogenic claudication with symptomatic improvement with a caudal epidural steroid injection. An interlaminar approach could have been considered appropriate, as well. ( a ) Sagittal view of a T2-weighted MRI of the lumbar spine. Note the grade I spondylolisthesis of L4 on L5 with severe central canal stenosis. ( b ) and ( c ) Axial views of a T2-weighted MRI through L4–5. Note the diffuse disc bulge in ( b ) and the marked ligamentum flavum hypertophy in ( c ), both contributing to the severe central stenosis. ( d ) The L5-S1 level showing no evidence of stenosis
12178_2007_9003_Fig3_HTML       Fluoroscopic images of an L3-4 interlaminar approach. ( a ) AP view, pre-contrast, ( b ) Lateral view, pre-contrast, and ( c ) Lateral view, post-contrast
12178_2007_9003_Fig5_HTML       Fluoroscopic images of a right L5-S1 transforaminal approach targeting the right L5 nerve root. ( a ) AP view, pre-contrast and ( b ) AP view, post-contrast

I built BoW and SIF vectors for the entire dataset, using GloVe word vectors. I then used these vectors as inputs to stock Scikit-Learn Naive Bayes and Support Vector Machine classifiers, and measured the test accuracy for various vocabulary sizes. For the word probabilities, I used both native probabilities (i.e., computed from the combined caption dataset) and outside probabilities (computed from Wikipedia, and available in the YingyuLiang/SIF GitHub repository). I then built vocabularies out of the most common N words, computed BoW sentence embeddings, SIF sentence embeddings with native word frequencies, and SIF sentence embeddings with external probabilities (SIF+EP), and recorded the accuracy reported for the two-class classification task by the Naive Bayes and Support Vector Machine (SVM) classifiers. Below I provide a breakdown of the steps with code.

The first step is to parse the files and generate a list of training and test captions with their labels.

def parse_caption_and_label(caption_file, label_file, sep=" "):
    # build a mapping from image filename to its numeric label
    filename2label = {}
    flabel = open(label_file, "rb")
    for line in flabel:
        filename, label = line.strip().split(sep)
        filename2label[filename] = LABEL2ID[label]
    flabel.close()
    # read the captions and pair each one up with its label
    fcaption = open(caption_file, "rb")
    captions, labels = [], []
    for line in fcaption:
        filename, caption = line.strip().split("\t")
        captions.append(caption)
        labels.append(filename2label[filename])
    fcaption.close()
    return captions, labels

TRAIN_CAPTIONS = "/path/to/training-captions.tsv"
TRAIN_LABELS = "/path/to/training-labels.csv"
TEST_CAPTIONS = "/path/to/test-captions.tsv"
TEST_LABELS = "/path/to/test-labels.csv"
LABEL2ID = {"COMP": 0, "NOCOMP": 1}

captions_train, labels_train = parse_caption_and_label(
    TRAIN_CAPTIONS, TRAIN_LABELS, ",")
captions_test, labels_test = parse_caption_and_label(
    TEST_CAPTIONS, TEST_LABELS, " ")

Next I build the word count matrix using the captions. For this we use the Scikit-Learn CountVectorizer to do the heavy lifting. We remove stopwords from the counting using the stop_words parameter. At this point Xc is a matrix of word counts of shape (number of training records + number of test records, VOCAB_SIZE). VOCAB_SIZE is a hyperparameter which we will vary during our experiments.

from sklearn.feature_extraction.text import CountVectorizer

VOCAB_SIZE = 10000
counter = CountVectorizer(strip_accents="unicode", 
                          stop_words="english",
                          max_features=VOCAB_SIZE)
caption_texts = captions_train + captions_test
Xc = counter.fit_transform(caption_texts).todense().astype("float")

At this point, we can capture the sentence length vector S (the denominator in the formulas for vs above) as the sum across the columns of this matrix.

import numpy as np

sent_lens = np.sum(Xc, axis=1).astype("float")
sent_lens[sent_lens == 0] = 1e-14  # prevent divide by zero

Next we read the pretrained word vectors from the provided GloVe embedding file. We use the version built with Wikipedia 2014 + Gigaword 5 (6B tokens, 400K words and dimensionality 300). The following snippet extracts the vectors for the words in our vocabulary and collects them into the embedding matrix E, one row per vocabulary word.

GLOVE_EMBEDDINGS = "/path/to/glove.6B.300d.txt"

E = np.zeros((VOCAB_SIZE, 300))
fglove = open(GLOVE_EMBEDDINGS, "rb")
for line in fglove:
    cols = line.strip().split(" ")
    word = cols[0]
    try:
        # look up the word's row in our vocabulary; skip words outside it
        i = counter.vocabulary_[word]
        E[i] = np.array([float(x) for x in cols[1:]])
    except KeyError:
        pass
fglove.close()

We are now ready to build our BoW vectors. Replacing the term counts with the appropriate word vectors is just a matrix multiplication, and averaging over the sentence length is an element-wise divide by the S vector. Finally, we split our BoW sentence embeddings into training and test splits.

Xb = np.divide(np.dot(Xc, E), sent_lens)

Xtrain, Xtest = Xb[0:len(captions_train)], Xb[-len(captions_test):]
ytrain, ytest = np.array(labels_train), np.array(labels_test)

The regularity of the Scikit-Learn API means that we can build some functions that can be used to cross-validate our classifier during training and evaluate it with the test data.

from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

def cross_val(Xtrain, ytrain, clf):
    best_clf = None
    best_score = 0.0
    num_folds = 0
    cv_scores = []
    kfold = KFold(n_splits=10)
    for train, val in kfold.split(Xtrain):
        Xctrain, Xctest, yctrain, yctest = Xtrain[train], Xtrain[val], ytrain[train], ytrain[val]
        clf.fit(Xctrain, yctrain)
        score = clf.score(Xctest, yctest)
        if score > best_score:
            best_score = score
            best_clf = clf
        print("fold {:d}, score: {:.3f}".format(num_folds, score))
        cv_scores.append(score)
        num_folds += 1
    return best_clf, cv_scores

def test_eval(Xtest, ytest, clf):
    print("===")
    print("Test set results")
    ytest_ = clf.predict(Xtest)
    accuracy = accuracy_score(ytest, ytest_)
    print("Accuracy: {:.3f}".format(accuracy))

We now invoke these functions to instantiate Naive Bayes and SVM classifiers, train them with 10-fold cross validation on the training split, and evaluate them with the test data to produce, among other things, a test accuracy. The following code shows the call for doing this with a Naive Bayes classifier. The code for doing this with an SVM classifier is similar.

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
best_clf, scores_nb = cross_val(Xtrain, ytrain, clf)
test_eval(Xtest, ytest, best_clf)
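
For reference, here is a sketch of the SVM version, assuming a stock LinearSVC (the notebook may use a different SVM variant or kernel):

from sklearn.svm import LinearSVC

clf = LinearSVC()   # linear SVM; a kernel SVC would also work, just more slowly
best_clf, scores_svm = cross_val(Xtrain, ytrain, clf)
test_eval(Xtest, ytest, best_clf)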

The SIF sentence embeddings also start with the count matrix generated by the CountVectorizer. In addition, we need to compute the word probabilities. If we want to use the word probabilities from the dataset, we can do so by computing the row sum of the count matrix as follows:

# compute word probabilities from corpus
freqs = np.sum(Xc, axis=0).astype("float")
probs = freqs / np.sum(freqs)

We could also get these word probabilities from some external source such as a file. So given the probs vector, we can create a vector representing the coefficient for each word. Something like this:

ALPHA = 1e-3
coeff = ALPHA / (ALPHA + probs)

We can then compute the raw sentence embedding matrix in a manner similar to the BoW matrix.

Xw = np.multiply(Xc, coeff)
Xs = np.divide(np.dot(Xw, E), sent_lens)

In order to remove the first principal component, we first compute it using the TruncatedSVD class from Scikit-Learn, and then subtract its projection from the raw SIF embedding Xs.

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=1, n_iter=20, random_state=0)
svd.fit(Xs)
pc = svd.components_
Xr = Xs - Xs.dot(pc.T).dot(pc)

As with the BoW sentence embeddings, we split it back to a training and test set, and train the two classifiers and evaluate them.
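
A minimal sketch of that last step, mirroring the BoW case and reusing the helpers defined above:

# split the PC-removed SIF matrix the same way as the BoW matrix
Xtrain_s, Xtest_s = Xr[0:len(captions_train)], Xr[-len(captions_test):]

clf = GaussianNB()
best_clf, scores_sif_nb = cross_val(Xtrain_s, ytrain, clf)
test_eval(Xtest_s, ytest, best_clf)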

The full code used for this post is available in this GitHub gist as a Jupyter notebook. The results of running the three sentence embeddings - BoW, SIF and SIF with External Word Probabilities (SIF+EP) - through the two stock Scikit-Learn classifiers for different vocabulary sizes are shown below.



As you can see, I get conflicting results for the two classifiers. For the Naive Bayes classifier, SIF sentence embeddings with native word probabilities narrowly beat out the BoW embeddings, whereas in the case of the SVM, the SIF embeddings with external word probabilities are slightly better than the BoW results for some vocabulary sizes. Also, accuracies from the other SIF embedding trail those from BoW in both cases. Finally, the differences are really minor - if you look at the y-axis on the charts, you will see that the difference is in the third decimal place. So at least based on my experiment, there does not seem to be a significant benefit to using SIF embeddings over BoW.

My use case does differ from the ideal case in that my captions can be long (longer than a typical sentence) and/or multi-sentence. Also, for the embeddings I used the GloVe vectors computed against the 6B corpus, while the YingyuLiang/SIF implementation used vectors generated from the 840B corpus. I don't believe these should make too much difference, but I may be wrong. I have tried to follow the paper's recommendations as closely as possible when replicating this experiment, but it is possible I have made a mistake somewhere - in case you spot it, please let me know. The code is included, both in this post and in the GitHub gist, if you want to verify that it works as I described. As a user of word and sentence embeddings, my primary use case is to use them to encode text input to classifiers. If you have gotten results that indicate SIF sentence embeddings are significantly better than BoW sentence embeddings for this or a similar use case, please let me know.



Saturday, May 13, 2017

Trying out various Deep Learning frameworks


The Deep Learning toolkit I am most familiar with is Keras, having used it to build some models around text classification, question answering and image similarity/classification in the past, as well as the examples for our book Deep Learning with Keras that I co-authored with Antonio Gulli. Before that, I have worked with Caffe to evaluate its pre-trained image classification models and to use one of them as a feature extractor for one of my own image classification pipelines. I have also worked with Tensorflow, learning it first from the awesome Udacity course taught by Vincent Vanhoucke of Google, and then using it to replicate the Grammar as a Foreign Language paper using our own data.

Lately, I have been curious about some of the other DL frameworks that are available, and whether it might make sense to explore them as well. So I decided to build a fully connected (FCN) and a convolutional (CNN) model to classify handwritten digits from the MNIST dataset, for each of Keras, Tensorflow, PyTorch, MXNet and Theano. Unlike the MNIST examples that are available for some of these frameworks, I read the data from CSV files and try to follow a similar coding style (the one I use for Keras) across all the different frameworks, so they are easy to compare. Both networks are also quite simple and training them is quick, so it is easy to run. All examples are provided as Jupyter notebooks, so you can just read them like you would one of my more code-heavy blog posts. The code is on my sujitpal/polydlot repository on GitHub.

My inspiration for the work was this chart posted on Twitter in May 2016 by Francois Chollet, creator of Keras. The first 3 charts show the top DL frameworks on GitHub ranked by number of forks, number of contributors and number of open issues. The fourth one weights these three features and produces an overall ranking that shows Keras at #3. I don't know the reasoning for the weights chosen in the fourth chart, although the rankings do line up with my own experience, and I would intuitively place similar importance on these three features as well. However, more importantly, even though it's somewhat dated, the chart gives an idea of the DL frameworks people are looking at, as well as a rough indication of their popularity.



In this post, I explain why I chose the DL frameworks that I did and share what I learned about each of these frameworks from the exercise. For those of you who know a subset of these frameworks, hopefully this will give you a glimpse of what it is like in the other framework. To those who are just starting out, I hope this comparison gives you some idea of where to start.

I chose Keras because I am comfortable with it. The very first DL framework I learned was Tensorflow. Soon after, I came across Keras when trying to read some Lasagne code (another library to build networks in Theano). While it didn't help with the Lasagne work, I got very excited about Keras, and set about building Keras implementations of the Tensorflow models I had built so far, and really got to appreciate how its object-oriented API made it easy to build useful models with very few lines of code. So anyway, I did the Keras examples mainly to figure out a base configuration and how many epochs to train each network to get reasonable results.

For those of you who are reading this to decide whether to learn Keras - learning Keras has one other advantage. In addition to the two backends (Theano and Tensorflow) it already supports, the Microsoft Cognitive Toolkit (CNTK) project and the MXNet project (supported by Amazon) are also considering Keras APIs. So once these APIs are in place, knowing Keras automatically gives you the skills to work with these frameworks as well.

My next candidate was Tensorflow. While not as fluent with Tensorflow as with Keras, I have written code using it in the past. I haven't kept up with the high level libraries that are tightly integrated with Tensorflow such as skflow and tensorflow-slim, since they looked like they were still evolving when I saw them.

Tensorflow (like Theano) programs require you to define your sequence of operations (i.e., the computation graph), "compile" it, and then run it with your variables. During the definition, the operands in the computation graph are represented using container objects called Tensors. At run-time, you pass in actual values to these container objects from your application. This is done mainly for performance: the framework can optimize the graph when it knows the sequence of operations up-front, and it is easier to distribute computations across different machines in a distributed environment. The process is called "Define and Run". Tensorflow is also a fairly low level library; its abstraction is at the operation level, compared to Keras, which is at the layer level. Consequently, Tensorflow code tends to be more verbose than comparable Keras code, and it often helps to modularize Tensorflow code for readability.
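
As an illustration, here is a minimal sketch (not taken from the polydlot notebooks) of the "Define and Run" pattern in Tensorflow 1.x: the graph is declared up front with placeholders, and concrete values are only supplied when the session runs it.

import tensorflow as tf

# define the computation graph up front
x = tf.placeholder(tf.float32, shape=[None, 2], name="x")
W = tf.Variable(tf.random_normal([2, 1]), name="W")
y = tf.matmul(x, W)

# ...then run it, feeding actual values into the placeholder
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0]]}))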

Keras, like the good high-level library that it is, tries to hide the separation implied by the "Define and Run" approach. However, there are times when it becomes necessary to extend Keras to do things it wasn't designed to do. Keras offers a backend API where it exposes operations on the backend, with which you can write extensions such as a new loss function or a new layer and still remain within Keras. More complex extensions, such as adding an attention mechanism, can require setups where Keras and Theano or Tensorflow code must co-exist in the same code base, and figuring out how to make them interoperate can be a challenge. For this reason, I was quite excited to learn from Francois's talk on Integrating Keras and Tensorflow at the Tensorflow Dev Summit 2017 that Keras will become the official API for Tensorflow starting with version 1.2. This will allow cleaner interoperability between Tensorflow and the Keras API, while at the same time allowing you to write code that is less verbose than pure Tensorflow and more flexible than pure Keras.
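
For example, a custom loss written purely against the backend API stays within Keras and works on either backend. A minimal sketch (the particular loss here is made up for illustration):

from keras import backend as K

def custom_mse(y_true, y_pred):
    # ordinary backend ops, so this runs unchanged on Theano or Tensorflow
    return K.mean(K.square(y_pred - y_true), axis=-1)

# model.compile(optimizer="adam", loss=custom_mse)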

For completeness, I also looked at Theano. Theano seems to be even lower level than Tensorflow, and lacks many of the convenience functions that Tensorflow provides. However, its computation graph definition is simpler and more intuitive (at least to me) compared to Tensorflow - you define variables and functions, which you then populate with values from your application and run. I didn't do too much here as I don't expect to do much work with Theano at this time.
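
A tiny sketch of the Theano workflow I am describing (again, not from the notebooks): define symbolic variables and an expression, compile a function, then call it with real values.

import theano
import theano.tensor as T

x = T.dvector("x")           # symbolic variable
y = (x ** 2).sum()           # symbolic expression
f = theano.function([x], y)  # "compile" the graph into a callable
print(f([1.0, 2.0, 3.0]))    # run with concrete values -> 14.0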

One other framework I looked at was MXNet. Recently I attended a webinar organized by Amazon Web Services where they demonstrated the distributed training capabilities of MXNet on an AWS cluster, which I thought was quite cool, and which prompted me to look at MXNet further. Unlike Keras, MXNet is built on a C/C++ shared library and exposes a Python API. It also exposes APIs in various other languages, including Scala and R. In that respect, it is similar to Caffe. The Python API is similar to Keras, at least in its level of abstraction, although there are some undocumented features that are set up by convention. I think this may be a good fit for shops that prefer Scala over Python, although Python seems to be quite ubiquitous in the DL space.

Finally I looked at PyTorch, initially on the advice of a friend who works for Salesforce Research. PyTorch is the Python version of Torch, a DL framework written in Lua and used at Facebook Research. PyTorch seems to have been adopted as the standard at Salesforce Research. The abstraction and code look similar to Keras, but there is one important difference.

Unlike "Define and Run" frameworks such as Theano and Tensorflow (and by extension Keras), PyTorch (and Torch) is "Define by Run". So there is no additional code required to define the network and then run it. Because of that, the code is also more readable, and resembles Keras as well. The graph is built as you define it. This allows you to do certain things that cannot be done with "Define and Run" frameworks, especially with certain use cases in NLP. Like MXNet and Caffe, PyTorch is backed by a C/C++ shared library, and the Python and Lua front ends both use the same shared library. So in the long run, PyTorch seems to be worth learning as well.

Overall, I think the two advantages that this work has given me are an appreciation of how different DL frameworks work, and the ability to decide the next steps in my learning. Another benefit has been polyglotism, after which the project is named. Just like knowing the language of a country enables you to appreciate its culture better, knowing another DL framework allows you to understand the examples provided by each of these frameworks, some of which are quite interesting. It also allows you to read code written by others using these frameworks.

Well, that's all I have for today, hope you enjoyed it. I have tried to share what I learned from this brief exercise in comparing how to build fully connected and convolutional networks to classify MNIST digits. I found that reading the data from CSV files is more representative of real world situations and forces you to think about the input, something you wouldn't normally do if the data came from some built-in function. Also, while almost every DL framework comes with their own MNIST examples, their coding styles are very different and it is hard to compare implementations across frameworks. So I feel that the work I did might be helpful to you as well.


Saturday, May 06, 2017

Deep Learning with Keras published!


Just wanted to let you all know that Deep Learning with Keras, a book I co-authored with Antonio Gulli, was published by PackT on April 26, 2017. For those of you who follow me on social media such as LinkedIn and Twitter, and for family and friends on Facebook, this is old news, but to others I apologize for the delay. Although if you're still reading my blog after all these years, I guess you accept (and forgive, thank you) that delays and apologies are somewhat par for the course here.



The book is targeted at the Data Scientist / Engineer starting out with Neural Networks. It contains a mix of theory and examples, but the focus is on the code, since we believe that the best way to learn something in this field is through looking at examples. All examples are in Keras, our favorite Deep Learning toolkit. By the time you are finished with the book, you should be comfortable building your own networks in Keras.

This book is also available on Amazon. If you end up reading it, do leave us a review and tell us what you liked and how we could have done better.

Yesterday, Antonio posted an image showing our book at #5 on Amazon. We thought initially that it was ranked by sales and were very thrilled that people liked our book so much, until someone pointed out that the ranking is most likely by query relevance. Oh, well! Good feeling while it lasted though.



Today, I thought it might be interesting to share the story behind the book, and thank the people who made it possible. For those of you looking for technical content, fair warning - this post has none.

While I read a lot of books, I have never considered writing one. Like many other people in software engineering, I have switched fields multiple times, and books have been the way to gain (almost) instant expertise to help make the transition. But the authors I read were all quite accomplished, almost experts in their fields. I was neither, just a programmer who caught (and took advantage of) a few lucky breaks in his career, so end of story.

When Antonio asked me if I was interested in co-authoring a book on Deep Learning using Keras with him, I was undecided for a while. I felt that if I accepted, I would be implicitly claiming expertise on subjects where I had none. On the flip side, I had been working with Deep Learning models with Caffe, Tensorflow and Keras for a while, so while I was definitely not an expert, I did have knowledge that could benefit people who were not as far along in their journey as I was. That last bit convinced me that I did have some value to add to a book, so I accepted.

Once I overcame my initial hesitation about being an author, I began to see it as a new experience, one that I enjoyed thoroughly during the process of writing the chapters. Antonio wrote the first half of the book (Chapters 1-4) and I wrote the second half (Chapters 5-8), but we reviewed each other's work before it went out for review by others. Since Antonio works for Google, he had Googlers internally review his chapters as part of their official process, and I was fortunate to have some of them review my work as well and provide valuable feedback. In addition, our technical reviewer from PackT, Nick McClure, also provided valuable suggestions. The book has benefited a great deal from the thoroughness of these reviews.

The speed at which our industry moves means that people in it have to adapt quickly as well, and I am no exception. Often, when I pick up a new technology, I spend just enough time on the theory so I can build something that works. If I don't fully understand something that isn't central to what I am building, I just "accept" it and move on. Unfortunately, this doesn't work when you are writing a book - while I have tried to limit the theory to be just enough to explain the model that I build in code, the explanation needed to be accurate and complete. For that I had to revisit some basic concepts in order to clarify them for myself, things I had neglected to do while learning about it the first time. So in a sense, writing this book actually forced me to fill gaps in my own knowledge, so I am really grateful I did it.

From an engineering standpoint, I thought PackT's publication pipeline was quite cool. I had imagined that we would provide the manuscripts electronically over email and they would go back and forth, using the built-in comment mechanism supported by Microsoft Word or similar. At least that had been my experience with PackT as a reviewer in the past. Instead, they now have a Content Development Platform (CDP), a CMS (similar to Joomla or Drupal) customized to the publishing task. Authors enter their chapters into an online editor that supports code blocks, quotations, images, info boxes, etc., as well as version control. Reviewers make comments using the same interface, and the EBook and print copies are generated automatically off the updated content.

Our own process was somewhat hybrid. Since we started writing before we learned about the CDP, we started off using Google Docs, which turned out to be a good choice since it could be shared easily with Google reviewers. We ended up writing all our chapters on Google Docs and then copying them over to the CDP after the Google reviews, at which point all comments and changes happened only on the CDP.

The editors from PackT were awesome to work with as well - many thanks to Divya Poojari (Acquisition editor), Cheryl Dsa (Content editor) and Dinesh Thakur (Publishing editor) for all their help guiding us through the various steps of the publishing process.

One thing that hit us towards the end, I think about a week before our originally scheduled release date, was the Keras2 upgrade. Because it was so late in the process, we debated a bit about launching as-is and providing an upgrade guide to help readers upgrade the provided code to Keras2, but in the end we decided that the right thing to do was to upgrade our code before release. This did push back the schedule a bit, but the upgrade process went relatively smoothly, thanks in large part to the very informative deprecation warnings that Keras provides.

Looking back, I am really grateful to Antonio for having confidence in my skills and offering me the opportunity to co-author the book with him. Writing the book was an extremely valuable experience for me. Quick shout-out also to two of my colleagues here at Elsevier, Ron Daniel and Bradley P Allen, both of whom have been working on Deep Learning for longer than I have, and whose experiences led me to investigate the subject further in the first place. Also, the last four months were pretty hectic, trying to balance work, the book and home, and I am grateful to my family for their patience.

Antonio and I have put in a lot of thought and effort into this book. For the explanations, we have tried to strike a balance, trying to present just enough detail to be complete yet not inundating you with math. For the code, we have tried to keep it simple enough to understand but not so simple that it ends up implementing something trivial. But all things considered, the true litmus test for the book is whether you the reader find it useful. We look forward to hearing back from you.

Monday, April 24, 2017

Predicting Image Similarity using Siamese Networks


In my previous post, I mentioned that I want to use Siamese Networks to predict image similarity from the INRIA Holidays Dataset. The Keras project on GitHub has an example Siamese network that can recognize MNIST handwritten digits that represent the same number as similar and different numbers as different. This got me all excited and eager to try it out on the Holidays dataset, which contains 1491 photos from 500 different vacations.

My Siamese network is somewhat loosely based on the architecture in the Keras example. The main idea behind a Siamese network is that it takes two inputs which need to be compared to each other; each input is reduced to a denser and hopefully more "semantic" vector representation, and the two vectors are then compared using some standard vector arithmetic. Each input undergoes a dimensionality reduction transformation implemented as a neural network. Since we want the two images to be transformed in the same way, we train the two networks using shared weights. The output of the dimensionality reduction is a pair of vectors, which are compared in some way to yield a metric that can be used to predict similarity between the inputs.

The Siamese network I built is shown in the diagram below. It differs from the Keras example in two major ways. First, the Keras example uses Fully Connected Networks (FCNs) as the dimensionality reduction component, whereas I use a Convolutional Neural Network (CNN). Second, the example computes the Euclidean distance between the two output vectors and attempts to minimize the contrastive loss between them to produce a number in the [0,1] range that is thresholded to return a binary similar/dissimilar prediction. In my case, I feed the element-wise dot product of the two output vectors into an FCN, use cross-entropy as my loss function, and predict a 0/1 to indicate similar/dissimilar.
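
To make the wiring concrete, here is a minimal Keras sketch of the shared-encoder pattern with an element-wise product merge. The encoder below is a stand-in fully connected stack, not the CNN I actually used, and the input and layer sizes are arbitrary.

from keras.layers import Dense, Input, Multiply
from keras.models import Model, Sequential

# shared encoder: the same weights transform both inputs
encoder = Sequential([
    Dense(128, activation="relu", input_shape=(4096,)),
    Dense(64, activation="relu"),
])

left_input = Input(shape=(4096,))
right_input = Input(shape=(4096,))
left_vec = encoder(left_input)
right_vec = encoder(right_input)

# element-wise product of the two encodings, then a small FCN head
merged = Multiply()([left_vec, right_vec])
hidden = Dense(32, activation="relu")(merged)
pred = Dense(1, activation="sigmoid")(hidden)

model = Model(inputs=[left_input, right_input], outputs=pred)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])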


For the CNN, I tried various different configurations. Unfortunately, I started running out of memory on the g2.2xlarge instance when I started trying large CNNs, and ended up migrating to a p2.xlarge. Even then, I had to either cut down the size of the input image or the network complexity, and eventually settled on a LeNet configuration for my CNN, which seemed a bit underpowered for the data. For the current configuration, shown in 02-holidays-siamese-network notebook, the network pretty much refused to learn anything. In other tries, the best test set accuracy I was able to get was about 60%, but all of them involved compromising on the input size or the complexity of the CNN, so I gave up and started looking at other approaches.

I have had success with transfer learning in the past, where you take a large network pre-trained on some external corpus such as ImageNet, chop off the classification head, and expose the vector from the layer prior to the head layer(s). The pre-trained network then acts as the vectorizer or dimensionality reducer. I used the following pre-trained networks, available as Keras applications, to generate vectors. The code to do this can be found in the 03-pretrained-nets-vectorizers notebook.

  • VGG-16
  • VGG-19
  • ResNet
  • InceptionV3
  • Xception


The diagram above shows the general setup of this approach. The first step is to just run the predict method on the pre-trained models to generate the vectors for each image. These vectors then need to be combined and fed to another classifier component. Some strategies I tried were element-wise dot product, absolute difference and squared (Euclidean) distance. In the case of the dot product, corresponding elements of the two vectors that are both high end up becoming higher, and elements that differ end up getting smaller. In the case of absolute and squared differences, elements that are different tend to become larger, with the squared difference highlighting large differences more strongly than small ones.
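
A sketch of the three merge strategies on a pair of image vectors, as element-wise numpy operations:

import numpy as np

def merge_dot(v1, v2):
    return v1 * v2            # element-wise product: agreement is amplified

def merge_l1(v1, v2):
    return np.abs(v1 - v2)    # absolute difference: disagreement is amplified

def merge_l2(v1, v2):
    return (v1 - v2) ** 2     # squared difference: large gaps stand out more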

The classifier component (shown as FCN in my previous diagram) can be any kind of classifier, including non neural network based ones. As a baseline, I tried several common classifiers from the Scikit-Learn and XGBoost packages. You can see the code in the 04-pretrained-vec-dot-classifier, 05-pretrained-vec-l1-classifier, and 06-pretrained-vec-l2-classifier notebooks. The resulting accuracies for each (vectorizer, merge strategy, classifier) combination on the held out test set are summarized below.








Generally speaking, XGBoost seems to do the best across all merge strategies and vectorization schemes. Among these, Inception and ResNet vectors seem to be the best overall. We also now have a pretty high baseline for accuracy, about 96.5% for Inception vectors merged using dot product and classified with XGBoost. The code for this can be found in the 07-pretrained-vec-nn-classifier notebook. The figure below shows the accuracies for different merge strategies for ResNet and Inception.


The next step was to see if I could get even better performance by replacing the classifier head with a neural network. I ended up using a simple 3 layer FCN that gave a 95.7% accuracy with Inception vectors and using dot product for a merge strategy. Not quite as good as the XGBoost classifier, but quite close.

Finally, I decided to merge the two approaches. For the vectorization, I chose a pre-trained Inception network with its classification head removed. Input to this network would be images, and I would use the Keras ImageDataGenerator to augment my dataset, using the mechanism I described in my previous post. I decided to keep all the pre-trained weights fixed. For the classification head, I decided to start with the FCN I trained in the previous step and fine tune its weights during training. The code for that is in the 08-holidays-siamese-finetune notebook.


Unfortunately, this did not give me the stellar results I was hoping for, my best result was about 88% accuracy in similarity prediction. In retrospect, it may make sense to experiment with a simpler pre-trained model such as VGG and fine tune some of the later layer weights instead of keeping them all frozen. There is also a possibility that my final network is not getting the benefits of a fine tuned model from the previous steps. One symptom is that the accuracy after the first epoch is only around 0.6 - I would have expected it to be higher with a well trained model. In another project where a similar thing happened, a colleague discovered that I was doing extra normalization with ImageDataGenerator that I hadn't been doing with the vectorization step - this doesn't seem to be the case here though.

Overall, I got the best results from the transfer learning approach, with Inception vectors, the dot product merge strategy and an XGBoost classifier. A nice thing about transfer learning is that it is relatively cheap in terms of resources compared to the fine tuning or even the from-scratch training approach. While XGBoost does take some time to train, you can do the whole thing on your laptop. This is also true if you replace the XGBoost classifier with an FCN. You can also do inline image augmentation (i.e., without augmenting and saving) using the Keras ImageDataGenerator if you use the random_transform call.


Saturday, February 18, 2017

Using the Keras ImageDataGenerator with a Siamese Network


I have been looking at training a Siamese network to predict if two images are similar or different. Siamese networks are a type of Neural network that contain a pair of identical sub-networks that share the same parameters and weights. During training, the parameters are updated identically across both subnetworks. Siamese networks were first proposed in 1993 by Bromley, et al in their paper Signature Verification using a Siamese Time Delay Neural Network. Keras provides an example of a Siamese network as part of the distribution.

My dataset is the INRIA Holidays Dataset, a set of 1491 photos from 500 different vacations. The photos have a naming convention from which the groups can be derived. Each photo is numbered with six digits - the first 4 refer to the vacation and the last two are a unique sequence number within the vacation. For example, a photo named 100301.jpg is from vacation 1003 and is the first photo in that group.

The input to my network consists of image pairs and the output is either 1 (similar) or 0 (different). Similar image pairs are from the same vacation group. For example, the code snippet below displays three photos - the first two are from the same group and the last one is different.

from __future__ import division, print_function
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import np_utils
from scipy.misc import imresize
import itertools
import matplotlib.pyplot as plt
import numpy as np
import random
import os

DATA_DIR = "../data"
IMAGE_DIR = os.path.join(DATA_DIR, "holiday-photos")

ref_image = plt.imread(os.path.join(IMAGE_DIR, "100301.jpg"))
sim_image = plt.imread(os.path.join(IMAGE_DIR, "100302.jpg"))
dif_image = plt.imread(os.path.join(IMAGE_DIR, "127202.jpg"))

def draw_image(subplot, image, title):
    plt.subplot(subplot)
    plt.imshow(image)
    plt.title(title)
    plt.xticks([])
    plt.yticks([])
    
draw_image(131, ref_image, "reference")
draw_image(132, sim_image, "similar")
draw_image(133, dif_image, "different")
plt.tight_layout()
plt.show()


The following code snippet loops through the image directory and uses the file naming convention to create all pairs of similar images and a corresponding set of different image pairs. Similar image pairs are generated by considering all combinations of image pairs within a group. Dissimilar image pairs are generated by pairing the left hand image of the similar pair with a random image from some other group. This gives us 2072 similar image pairs and 2072 different image pairs, i.e., a total of 4144 image pairs for our training data.
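
A simplified sketch of that logic (the function name and the exact sampling of the dissimilar partner here are illustrative):

def create_image_triples(image_dir):
    # group photos by vacation: the first 4 digits of the filename
    group2images = {}
    for image_name in os.listdir(image_dir):
        group2images.setdefault(image_name[0:4], []).append(image_name)
    groups = list(group2images.keys())
    triples = []
    for group, images in group2images.items():
        for left, right in itertools.combinations(sorted(images), 2):
            # similar pair: two photos from the same vacation
            triples.append((left, right, 1))
            # dissimilar pair: same left photo, random photo from another vacation
            other = random.choice([g for g in groups if g != group])
            triples.append((left, random.choice(group2images[other]), 0))
    return triples

image_triples = create_image_triples(IMAGE_DIR)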

Fearing that this might not be nearly enough to train my network adequately, I decided to use the Keras ImageDataGenerator to augment the dataset. Before Keras, when I was working with Caffe, I would manually augment my input with a fixed number of standard transformations, such as rotation, flipping, zooming and affine transforms (these are all just matrix transforms). The Keras ImageDataGenerator is much more sophisticated - you instantiate it with the range of transformations you will allow on your dataset, and it returns a generator that yields transformed versions of your input images from a directory.

I have used the ImageDataGenerator previously to augment my dataset to train a simple classification CNN, where the input was an image and the output was a label. This is the default case the component is built to handle, so it's actually very simple to use it this way. My problem this time was a little different - my input is a pair of image names from a triple, and I wanted the identical transformation to be applied to both images. (This is not strictly necessary in my case, but it can't hurt, and in any case I wanted to learn how to do this for another upcoming project.)

It seems to be something that others have been looking for as well, and there is some discussion in Keras Issue 3059. In addition, the ImageDataGenerator documentation covers some cases where this can be done, using a pair of ImageDataGenerator instances that are instantiated with the same parameters. However, all these seem to require that you either enumerate the LHS and RHS images in the pair as 4-dimensional tensors (using flow()) or store them in two parallel directories with identical names (using flow_from_directory()). The first seems a bit wasteful, and the second seems incredibly complicated for my use case.

So I went digging into the code and found a private (in the sense of undocumented) method called random_transform(). It applies a random sequence of the transformations you have specified in the ImageDataGenerator constructor to your input image. In this post, I will describe an image generator that I built for my Siamese network using the random_transform() method.

We start with a basic generator that returns a batch of image triples per invocation. The generator loops indefinitely, reshuffling the triples at the start of each epoch, and the next() method is called to get the next batch of triples.

def image_triple_generator(image_triples, batch_size):
    while True:
        # loop once per epoch
        num_recs = len(image_triples)
        indices = np.random.permutation(np.arange(num_recs))
        num_batches = num_recs // batch_size
        for bid in range(num_batches):
            # loop once per batch
            batch_indices = indices[bid * batch_size : (bid + 1) * batch_size]
            yield [image_triples[i] for i in batch_indices]
            
triples_batch_gen = image_triple_generator(image_triples, 4)
triples_batch_gen.next()

This gives us a batch of 4 triples as shown:

[('149601.jpg', '149604.jpg', 1),
 ('144700.jpg', '106201.jpg', 0),
 ('103304.jpg', '111701.jpg', 0),
 ('133200.jpg', '128100.jpg', 0)]

Calling next() returns the next 4 triples, and so on for each subsequent batch.

triples_batch_gen.next()

[('135104.jpg', '122601.jpg', 0),
 ('137700.jpg', '137701.jpg', 1),
 ('136005.jpg', '105501.jpg', 0),
 ('132500.jpg', '132511.jpg', 1)]

Next, we apply ImageDataGenerator.random_transform() to a single image to see if it does indeed do what I think it does. My fear was that there might need to be some upstream initialization before I could call the random_transform() method. As you can see from the output, random_transform() augments the original image into variants that are quite close to it and could legitimately have been real photos.

datagen_args = dict(rotation_range=10,
                    width_shift_range=0.2,
                    height_shift_range=0.2,
                    shear_range=0.2,
                    zoom_range=0.2,
                    horizontal_flip=True)
datagen = ImageDataGenerator(**datagen_args)

sid = 150
np.random.seed(42)
image = plt.imread(os.path.join(IMAGE_DIR, "115201.jpg"))
sid += 1
draw_image(sid, image, "orig")
for j in range(4):
    augmented = datagen.random_transform(image)
    sid += 1
    draw_image(sid, augmented, "aug#{:d}".format(j + 1))

plt.tight_layout()
plt.show()


Next I wanted to see if I could take two images and apply the same transformation to both. I now take a pair of ImageDataGenerators configured the same way. The individual transformations applied to the image in the random_transform() method are all driven by numpy random number generators, so one way to make them do the same thing is to initialize the random number seed to the same value for each ImageDataGenerator at the start of each batch. As you can see from the photos below, this strategy seems to be working.

image_pair = ["108103.jpg", "112003.jpg"]

datagens = [ImageDataGenerator(**datagen_args),
            ImageDataGenerator(**datagen_args)]

sid = 240
for i, image in enumerate(image_pair):
    image = plt.imread(os.path.join(IMAGE_DIR, image_pair[i]))
    sid += 1
    draw_image(sid, image, "orig")
    # make sure the two image data generators generate same transformations
    np.random.seed(42)
    for j in range(3):
        augmented = datagens[i].random_transform(image)
        sid += 1
        draw_image(sid, augmented, "aug#{:d}".format(j + 1))

plt.tight_layout()
plt.show()


Finally, we are ready to build our final generator that can be plugged in to the Siamese network. I haven't built the network yet, so there might be some changes once I try to integrate it, but here is the first cut. The caching is there because I noticed that it takes a while to generate the batches, and it will hopefully speed things up.

RESIZE_WIDTH = 300
RESIZE_HEIGHT = 300

def cached_imread(image_path, image_cache):
    if image_path not in image_cache:
        image = plt.imread(image_path)
        image = imresize(image, (RESIZE_WIDTH, RESIZE_HEIGHT))
        image_cache[image_path] = image
    return image_cache[image_path]

def preprocess_images(image_names, seed, datagen, image_cache):
    np.random.seed(seed)
    X = np.zeros((len(image_names), RESIZE_WIDTH, RESIZE_HEIGHT, 3))
    for i, image_name in enumerate(image_names):
        image = cached_imread(os.path.join(IMAGE_DIR, image_name), image_cache)
        X[i] = datagen.random_transform(image)
    return X

def image_triple_generator(image_triples, batch_size):
    datagen_args = dict(rotation_range=10,
                        width_shift_range=0.2,
                        height_shift_range=0.2,
                        shear_range=0.2,
                        zoom_range=0.2,
                        horizontal_flip=True)
    datagen_left = ImageDataGenerator(**datagen_args)
    datagen_right = ImageDataGenerator(**datagen_args)
    image_cache = {}
    
    while True:
        # loop once per epoch
        num_recs = len(image_triples)
        indices = np.random.permutation(np.arange(num_recs))
        num_batches = num_recs // batch_size
        for bid in range(num_batches):
            # loop once per batch
            batch_indices = indices[bid * batch_size : (bid + 1) * batch_size]
            batch = [image_triples[i] for i in batch_indices]
            # make sure image data generators generate same transformations
            seed = np.random.randint(low=0, high=1000, size=1)[0]
            Xleft = preprocess_images([b[0] for b in batch], seed, 
                                      datagen_left, image_cache)
            Xright = preprocess_images([b[1] for b in batch], seed,
                                       datagen_right, image_cache)
            Y = np_utils.to_categorical(np.array([b[2] for b in batch]))
            yield Xleft, Xright, Y

Here is a little snippet to call my data generator and verify that it returns the right shaped data.

triples_batch_gen = image_triple_generator(image_triples, 32)
Xleft, Xright, Y = triples_batch_gen.next()
print(Xleft.shape, Xright.shape, Y.shape)

which returns the expected shapes.

(32, 300, 300, 3) (32, 300, 300, 3) (32, 2)

So anyway, this is all I have so far. Once I have my Siamese network coded up and running, I will talk about it in a subsequent post. I haven't heard of anyone using ImageDataGenerator.random_transform() directly before, so I thought it might be interesting to describe my experience. Currently the enhancements seem to be aimed at continuing to support the flow() and flow_from_directory() methods. I am not sure if more specialized requirements will come up in the future, but I think using the random_transform() method instead might be a good choice for many situations. Of course, it is quite likely that I may be missing something, so in case you know of problems with this approach, please let me know.