Wednesday, July 12, 2017

Trip Report: PyData Seattle 2017


Last week I attended (and presented at) PyData Seattle 2017. Over time, Python has morphed from a scripting language, to a library for scientific computing, and lately pretty much a standard language for most aspects of Machine Learning (ML) and Artificial Intelligence (AI), including Deep Learning (DL). PyData conferences cater mostly to the last demographic. Even though it's not really a place where you go to learn about the state of the art, it's still a great place to catch up with what others in industry are doing with Python and ML/AI/DL. PyData conferences are usually 2-3 day affairs, and they happen multiple times a year, at different places all over the world, organized by local groups of Python/ML enthusiasts.

The conference was 3 days long, one day of tutorials followed by 2 days of presentations. It was held at the Microsoft Campus - along with the conference center, Microsoft also sponsored the food. I stayed at the Hyatt Regency Bellevue, their "preferred" hotel - initially I thought that it meant they would have a shuttle service to and from the conference, but it was because of lower negotiated room rates for conference attendees. But thanks to ridesharing services such as Lyft, I had no problems getting around.

So anyway, here is my trip report... there were 4 simultaneous tracks so there are things I missed because I wanted to attend something even better. Fortunately, there are videos of the talks and the organizers are collecting the slides from the speakers, and all of this will be made available in 2-3 weeks. I will update the links to slides and video when that happens.

Day 1 (July 5, 2017)


pomegranate: fast and flexible probabilisting modeling in python - Maxwell W Libbrecht

I first came across the pomegranate library at PyData Amsterdam last year, where it was mentioned as a package containing several probabilistic graphical models (PGMs), and specifically to build Bayesian Networks. I read the docs in advance, and it turns out that it also contains Hidden Markov Models, General Mixture Models and Factor Graphs. The talk itself was mostly a walkthrough of its capabilites using this notebook. Like many other ML packages in the Python (and other) ecosystem, pomegranate's API is modeled after that of scikit-learn. The examples were very cool, and left me itching for a problem that I might be able to solve using pomegranate :-).

Vocabulary Analysis of Job Descriptions - Alex Thomas

Alex Thomas of indeed.com led us through analyzing the vocabulary of job descriptions. The objective is to extract attributes from the text of job descriptions, which might be used as structured features for these descriptions for downstream tasks. He started with basic ideas like TF-IDF, finding multi-word candidates, using stopwords and extending them, stemming and lemmatizing. Evaluation is done by manually segmenting the job description dataset into different levels of experience required, and building word clouds of the analyzed vocabulary to see if they line up with expectations. All in all, a very useful refresher for things we tend to take for granted with readily available text processing toolkits. What I liked most about this tutorial is that I came away with a subset of tools and ideas that I could use to analyze a vocabulary end-to-end. The github repository for the tutorial is available in case you want to follow along.

Day 2 (July 6, 2017)


Morning Keynote - Katrina Reihl

Katrina Riehl of HomeAway.com gave the keynote. As an experienced data scientist, she recounted her time in defense and other industries before she arrived at HomeAway, and the ML challenges she works on here. One thing she touched upon are problems with deployment of ML solutions - their main problem is that Python programs are generally not as performant as Java or C/C++ based solutions. Initially they would build and train a Python model, then hand convert to Java or C/C++. Later they looked at PMML - the idea was to train a Python model, then specify its weights and structure and use PMML to instantiate the identical model in Java for production. But this didn't work because of limited availability of PMML aware models and because models in different toolkits have minor differences which break the interop. So finally they settled on microservices - build models in Python, wrap them in microservices, load balance multiple microservices, and consume them from Java based production code. They use protocol buffers for high performance RPC.

Using Scattertext and the Python NLP Ecosystem for Text Visualization - Jason Kessler

Jason Kessler talks about his Scattertext package, which is designed to visualize how different words are used in different classes in a dataset. The visualization is for words or phrases mentioned by Democrats or Republicans during the 2012 elections. He uses a measure called scaled F-score, which achieves a very nice separation of words. He shows other ways you can drill down deeper into the word associations using the Scattertext API. Overall quite an interesting way to visualize word associations. Jason will also present ScatterText at ACL 2017, here is a link to his paper.

Automatic Citation generation with Natural Language Processing - Claire Kelley and Sarah Kelley

The presenters described two methods for finding similar patents using the US Patent database. The first approach vectorizes the documents using TF-IDF and uses cosine similarity as the similarity metric to find the 20 most similar patents for each patent. The results are compared with the patents already cited, and on average, 10 of the 20 suggested similar patents are already cited in the original patent. The second approach tries to build a recommender by factorizing a patent/citation co-occurrence matrix using the Alternating Least Squares (ALS) method for Collaborative Filtering to generate a ranked list of patent recommendations for each patent. The number of latent factors used was 60. Because recommendations are not necessarily similar documents, an objective evaluation using cited patents is not possible, but the recommendations were found to be quite good when spot checked for a small set of patents. Both approaches work on subsets of the patent dataset that is optimized for the category under investigation. Most of the data ingestion and pre-processing was done using Google BigQuery and the ML work was done using Spark ML.

Scan Statistics with Spark Streaming: Distribution based Real-time Anomaly Detection - Michal Monselise

The presenter describes a method that she used to find temperature anomalies in a stream of temperature data. The idea is to look at a fixed size window and fit a distribution to it. An anomaly is detected when a window is encountered whose distribution does not have the same parameters as the fitted distribution of the previous window.

Afternoon Keynote - Jake Vanderplas

Jake Vanderplas is well known in the open source Python community, and his keynote covered the evolution of Python from a scripting language (replacement for bash), a platform for scientific computing (replacement for MATLAB) and now a platform for data science (replacement for R). He also covered the tools that someone who wants to do data science in Python should look at. Many of these are familiar - numpy, pandas, scikit-learn, numba, cython, etc, and some that were not, for example dask. He also briefly touched upon Python as the de-facto language for most deep learning toolkits. I thought this talk was both interesting and useful. Even though I was familiar with many of the packages he listed, I came away learning about a couple I didn't and that I think might be good for me to check out.

In-database Machine Learning with Python in SQL Server - Sumit Kumar

Sumit Kumar of Microsoft showed off new functionality in Microsoft SQL Server that allows you to embed a trained Python machine learning model inside a stored procedure. Unlike the traditional model of pulling data out of the database and then running your trained model on it and writing back the predictions, this approach allows you to run the model on the same server as the database, minimizing network traffic. Microsoft also has tooling in its IDE that loads/reloads the model code automatically during development. The Python code is run in its own virtual machine separate from the database, so problems with the model will not crash the server.

Applying the four step "Embed, Encode, Attend, Predict" framework for text classification and similarity - Sujit Pal

This was my presentation. I spoke about the 4-step recipe for Natural Language Processing (NLP) proposed by Matthew Honnibal, creator of the SpaCy NLP toolkit, and described three applications around document classification, document similarity and sentence similarity, where I used this recipe. I also covered Attention in some depth. You can find the code and the slides for the talk at these links. I used the Keras deep learning toolkit for my models, so of the four steps, only the Attend step does not correspond directly to a Keras provided layer. I plan to write in more detail about my Attention implementations in a subsequent blog post.

I was pleasantly surprised at the many insightful questions I got from the audience during and after the talk. I also made a few friends and had detailed conversations around transfer learning, among other things. I also got a very nice demo of a system which automatically learns taxonomies from text which I thought was very interesting.

Chatbots - Past, Present and Future - Dr Rutu Mulkar-Mehta

This was a fairly high level talk but very interesting to me, since I know almost nothing about chatbots and because it may be one of the most obvious places to use NLP in. Typically chatbot designers tend to outsource the NLP analysis and concentrate on the domain expertise, so a number of chatbot platforms have come up that cater to this need, with varying degrees of sophistication. Some examples are Chatterbot, API.AI, motion.ai, etc. She talked about the need to extract features from the incoming text in order to feed machine learning classifiers at each stage in the chatbot pipeline for it to decide how to respond. In all, a nice introduction to chatbots, seen as a pipeline of NLP and ML components.

PyData "Pub" Quiz - James Powell

To end the day, we had a 6-part quiz on Python, conducted by the inimitable James Powell. Many of the questions had to do with Python 3 features, Monty Python, and esoteric aspects of the Python language, so not surprisingly, I did not do too well. About the only thing I could answer were the two features of Python 3 that I always import from __future__ - the print_function and Python 3 style division, and some calls in matplotlib and scikit-learn. But I did learn a lot of things that I didn't know before, always a good thing.

There was a social event after this with food and drink. I hung around for a while and had some interesting conversations, then decided to call it a day.

Day 3 (July 7, 2017)


Morning Keynote - Joseph Sirosh

Joseph Sirosh of Microsoft talked about all the cool things that Microsoft is doing with Python and Machine Learning. He brought in various Microsoft employees to do brief demos of their work.

Medical Image Processing using Microsoft Deep Learning Framework (CNTK) - Naoto Usuyama and Jessica Lundin

Jessica started the presentation off by talking about a newly created Health division inside Microsoft that works as a startup within the parent company. This startup is doing many cool things in the Health and Medical spaces. After that Naoto talked about how he used CNTK to train models for the Diabetes Retinopathy and Lung Cancer challenges from Kaggle. His notebooks for both challenges are available on his pydata-medical-image repository on Github. I had been curious about CNTK but had never seen it in action, so this was interesting to me. Also I found the approach to preprocessing of the Lung Cancer dataset (DICOM) images interesting.

Learn to be a painter using Neural Style Painting - Pramit Choudhary

I find the whole idea of using neural networks to produce images that look like LSD induced hallucinations quite fascinating. However, while building models that generate these images, there have been certain aspects which I have kind of glossed over. One of the things the presenter briefly touched upon were the transformations on the content and style images before they are merged - this was one of those things I had glossed over earlier, so I figured I will ask him, but there was no time, so I decided to look it up myself, and this video of Leon Gatys's presentation at CVPR 2016 provides the clearest description I have seen so far. The presenter went on to explain how he used Spark to optimize the hyperparameters for the style transfer. He also gave me a DataScience.com jacket (his employer) for answering a very simple question correctly.

Scaling Scikit-Learn - Stephen Hoover

The presenter described the data science toolbox his company Civis Analytics markets and the reasoning behind various features. The toolbox is based on the AWS components Redshift, S3 and EC2. They have made some custom GridSearch components that leverages these different components. He also briefly described how one can build their own joblib based custom implementation of parallel_backend. The solution appears to be optimized for parallel batch operation against trained models during prediction. I thought it might be somewhat narrow in its scope, and based on a question from the audience at the end of the talk, think that it may be worth taking a look at Dask to see if some scaling issues may be solved more generally using it.

There were some more talks after lunch which I had to miss because I needed to catch my flight home, since I had already committed (before the talk proposal was accepted) to my family that I would be somewhere else starting Friday evening and through the weekend. So I will watch the videos for the ones I missed once the organizers release them. I will also update the links on this post once that happens.