Saturday, February 23, 2013

Checking for Word Equivalence using NLTK and Wordnet


The small script I describe here came about as a result of a suggestion from a colleague. Some time ago I had built a Lucene analyzer that converted from British spellings to American spellings (for example, "colour" to "color"), based on a set of prefix, suffix and infix regular expressions. Unfortunately, the same regex that converts colour correctly also converts "four" to "for". Since our search is backed by a taxonomy, we can treat the synonyms defined in it as a controlled vocabulary, so my colleague suggested running the transformer against all the words in the (possibly multi-word) synonyms, then for the words matching a regex, checking against a dictionary that the original and transformed words mean the same.

When I built the Lucene analyzer, I was not as handy with NLTK as I am now, so this time around, I almost immediately thought about NLTK's Wordnet interface. The idea is to pass the two words to Wordnet. Each word can potentially result in one or more synsets (depending on its part of speech (POS)). We can conclude that they are the same word if one of the pairs of synsets have a path_similarity of 1 (the path_similarity varies from 0 to 1, 1 being identical). Here is the code:

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
from __future__ import division
from nltk.corpus import wordnet as wn
import sys

def similarity(w1, w2, sim=wn.path_similarity):
  synsets1 = wn.synsets(w1)
  synsets2 = wn.synsets(w2)
  sim_scores = []
  for synset1 in synsets1:
    for synset2 in synsets2:
      sim_scores.append(sim(synset1, synset2))
  if len(sim_scores) == 0:
    return 0
  else:
    return max(sim_scores)

def main():
  f = open(sys.argv[1], 'rb')
  for line in f:
    (word1, word2) = line.strip().split("\t")
    if similarity(word1, word2) != 1.0:
      print word1
  f.close()

if __name__ == "__main__":
  main()

The similarity is calculated in the similarity() method as the maximum similarity between any two synset pairs. The main() method just reads a file of word pairs and writes out words that don't convert to an equivalent word. For example, the following input file:

1
2
3
4
favour     favor
favourite  favorite
four       for
colour     color

results in "four" being printed to the console, indicating that "four" should be treated as an exclusion for the analyzer.

I was also using the analyzer to normalize some Greek and Latin plurals to their corresponding singulars. These words are quite common in (English) medical text and normal stemming does not handle it correctly, so I used a set of (suffix) rules to do the conversion in the same analyzer. As with the British to American words, there are exceptions here also, which can be handled using the same code. Turns out that the plural and singular words map to the same node in Wordnet, so the task is even simpler. In any case, an input file like this:

1
2
3
4
humeri    humerus
femora    femur
insomnia  insomnium
media   medium

results in "insomnia" being printed on the console, once more indicating that "insomnia" should be treated as an exclusion.

And thats all I have for today. Just goes to show how simple some problems can become when you have the right tools.


Friday, February 15, 2013

My Solution to the Kaggle Event Recommendation Challenge


The Event Recommendation Engine Challenge on Kaggle asks for a model that can match events to users given user and event metadata and some demographic information. I've been a Kaggle member for a while, but this was the very first time I actually submitted a solution. I came in 87th on the leaderboard when I submitted, with a MAP (Mean Average Precision) of 0.56252, compared to the baseline solution with MAP 0.51382 at position 125-126, and the top solution with MAP 0.72809.

So if you are looking for a better model, you should probably read this post by dolaameng. For my part, I did not expect to do too well, and I was pretty stoked that I actually reached the finish line - there were times in the last 20 days when I had doubts about that. In any case, I describe my solution here, hopefully some of you will help me by pointing out where I could do better and some others will find it helpful as a source of ideas.

The tools I used for the solution are described in my previous post - Python, Numpy, Scipy, Scikit-Learn, Pandas and Cython. There were quite a few false starts, and at one point my code was running so slow that I would have missed the public leaderboard deadline. However, alls well that ends well, and here I am. So on to the code...

Data


The challenge page has links to the data, but in case you just want to read about it, I briefly describe the data files that were supplied.

1
2
3
4
5
6
7
8
train.csv (user, event, invited, timestamp, interested, not_interested)
test.csv (user, event, invited, timestamp)
users.csv (user_id, locale, birthyear, gender, joinedAt, location, 
           hometown, timezone)
events.csv (event_id, user_id, start_time, city, state, zip, country,
           lat, lng, c_1, c_2, ..., c_100, c_other)
user_friends.csv (user, friends)
event_attendees.csv (event, yes, maybe, invited, no)

Some columns referenced above need some additional explanation. For example, the user/user_id and event/event_id both refer to numeric ids, the c_1, ..., c_100 are the frequencies of the top 100 words in the event titles, and c_other is the frequency of everything else. In user_friends.csv and event_attendees.csv, the friends, yes, maybe and no fields are space delimited collection of user_ids.

Given this data, the output that needs to be submitted looks like this:

1
2
3
User,Events
1776192,"[2877501688, 3025444328, 4078218285, 1024025121, 2972428928]"
...

where the Events column is a list of events in descending order of attractiveness to the user.

Method Overview


As you can see, there is a wide variety of data available, suitable for building both user-based and item-based recommenders. In addition, there are a numberof other recommendation factors that can come from the user's social graph. I decided to build up the following recommenders:

  • User based recommender - uses the preferences of other users who have a preference for the same event and their similarity to this user to compute a user_reco score.
  • Item based recommender - there are actually two of these. Both use the preference the user has expressed for other events and the similarity between that event and this one. There are two scores generated for the different measures of event similarity, one based on event metadata and one based on event content. These two recommenders return the evt_p_score and evt_c_score respectively.
  • User popularity - measured by the number of friends a user has. The idea here is that people with more friends are more outgoing, and hence are more likely to attend events. This generates the user_pop score.
  • Friend Influence - the idea here is that if your friends are going to an event, you are too. The score measures the number of your friends that are going to this event, and generates the frnd_infl score.
  • Event Popularity - the more popular an event is, measured by the people that are going to it, the more likely the user will go to it. This produces the event_pop score.

Each of the recommendation scores described above become new features to my dataset. Because the computation of the scores would benefit from having some data as matrices, I first generate the matrices and a couple of dictionaries and serialize them to disk for the next stage. Two of these data structures are just pickled {user_id: index} and {event_id: index} maps and the others are sparse matrices serialized into MatrixMarket files. The user similarity matrix and the first event similarity matrices use correlation as the similarity measure and the second event similarity matrix (built off the c_* values) use cosine similarity as the similarity measure.

1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
# Source: BaseData.py
from __future__ import division

import itertools
import cPickle
import datetime
import hashlib
import locale
import numpy as np
import pycountry
import scipy.io as sio
import scipy.sparse as ss
import scipy.spatial.distance as ssd

from collections import defaultdict
from sklearn.preprocessing import normalize

class DataCleaner:
  """
  Common utilities for converting strings to equivalent numbers
  or number buckets.
  """
  def __init__(self):
    # load locales
    self.localeIdMap = defaultdict(int)
    for i, l in enumerate(locale.locale_alias.keys()):
      self.localeIdMap[l] = i + 1
    # load countries
    self.countryIdMap = defaultdict(int)
    ctryIdx = defaultdict(int)
    for i, c in enumerate(pycountry.countries):
      self.countryIdMap[c.name.lower()] = i + 1
      if c.name.lower() == "usa":
        ctryIdx["US"] = i
      if c.name.lower() == "canada":
        ctryIdx["CA"] = i
    for cc in ctryIdx.keys():
      for s in pycountry.subdivisions.get(country_code=cc):
        self.countryIdMap[s.name.lower()] = ctryIdx[cc] + 1
    # load gender id map
    self.genderIdMap = defaultdict(int, {"male":1, "female":2})

  def getLocaleId(self, locstr):
    return self.localeIdMap[locstr.lower()]

  def getGenderId(self, genderStr):
    return self.genderIdMap[genderStr]

  def getJoinedYearMonth(self, dateString):
    dttm = datetime.datetime.strptime(dateString, "%Y-%m-%dT%H:%M:%S.%fZ")
    return "".join([str(dttm.year), str(dttm.month)])

  def getCountryId(self, location):
    if (isinstance(location, str)
        and len(location.strip()) > 0
        and location.rfind("  ") > -1):
      return self.countryIdMap[location[location.rindex("  ") + 2:].lower()]
    else:
      return 0

  def getBirthYearInt(self, birthYear):
    try:
      return 0 if birthYear == "None" else int(birthYear)
    except:
      return 0

  def getTimezoneInt(self, timezone):
    try:
      return int(timezone)
    except:
      return 0

  def getFeatureHash(self, value):
    if len(value.strip()) == 0:
      return -1
    else:
      return int(hashlib.sha224(value).hexdigest()[0:4], 16)

  def getFloatValue(self, value):
    if len(value.strip()) == 0:
      return 0.0
    else:
      return float(value)


class ProgramEntities:
  """
  Creates reference sets for the entity instances we care about
  for this exercise. The train and test files contain a small
  subset of the data provided in the auxillary files.
  """
  def __init__(self):
    # count how many unique uesers and events are in the training file
    uniqueUsers = set()
    uniqueEvents = set()
    eventsForUser = defaultdict(set)
    usersForEvent = defaultdict(set)
    for filename in ["../Data/train.csv", "../Data/test.csv"]:
      f = open(filename, 'rb')
      f.readline().strip().split(",")
      for line in f:
        cols = line.strip().split(",")
        uniqueUsers.add(cols[0])
        uniqueEvents.add(cols[1])
        eventsForUser[cols[0]].add(cols[1])
        usersForEvent[cols[1]].add(cols[0])
      f.close()
    self.userEventScores = ss.dok_matrix((len(uniqueUsers), len(uniqueEvents)))
    self.userIndex = dict()
    self.eventIndex = dict()
    for i, u in enumerate(uniqueUsers):
      self.userIndex[u] = i
    for i, e in enumerate(uniqueEvents):
      self.eventIndex[e] = i
    ftrain = open("../Data/train.csv", 'rb')
    ftrain.readline()
    for line in ftrain:
      cols = line.strip().split(",")
      i = self.userIndex[cols[0]]
      j = self.eventIndex[cols[1]]
      self.userEventScores[i, j] = int(cols[4]) - int(cols[5])
    ftrain.close()
    sio.mmwrite("../Models/PE_userEventScores", self.userEventScores)
    # find all unique user pairs and event pairs that we should
    # look at. These should be users who are linked via an event
    # or events that are linked via a user in either the training
    # or test sets. This is to avoid useless calculations
    self.uniqueUserPairs = set()
    self.uniqueEventPairs = set()
    for event in uniqueEvents:
      users = usersForEvent[event]
      if len(users) > 2:
        self.uniqueUserPairs.update(itertools.combinations(users, 2))
    for user in uniqueUsers:
      events = eventsForUser[user]
      if len(events) > 2:
        self.uniqueEventPairs.update(itertools.combinations(events, 2))
    cPickle.dump(self.userIndex, open("../Models/PE_userIndex.pkl", 'wb'))
    cPickle.dump(self.eventIndex, open("../Models/PE_eventIndex.pkl", 'wb'))
      

class Users:
  """
  Build the user/user similarity matrix for program users
  """
  def __init__(self, programEntities, sim=ssd.correlation):
    cleaner = DataCleaner()
    nusers = len(programEntities.userIndex.keys())
    fin = open("../Data/users.csv", 'rb')
    colnames = fin.readline().strip().split(",")
    self.userMatrix = ss.dok_matrix((nusers, len(colnames) - 2))
    for line in fin:
      cols = line.strip().split(",")
      # consider the user only if he exists in train.csv
      if programEntities.userIndex.has_key(cols[0]):
        i = programEntities.userIndex[cols[0]]
        self.userMatrix[i, 0] = cleaner.getLocaleId(cols[1])
        self.userMatrix[i, 1] = cleaner.getBirthYearInt(cols[2])
        self.userMatrix[i, 2] = cleaner.getGenderId(cols[3])
        self.userMatrix[i, 3] = cleaner.getJoinedYearMonth(cols[4])
        self.userMatrix[i, 4] = cleaner.getCountryId(cols[5])
        self.userMatrix[i, 5] = cleaner.getTimezoneInt(cols[7])
    fin.close()
    # normalize the user matrix
    self.userMatrix = normalize(self.userMatrix, norm="l1", axis=0, copy=False)
    sio.mmwrite("../Models/US_userMatrix", self.userMatrix)
    # calculate the user similarity matrix and save it for later
    self.userSimMatrix = ss.dok_matrix((nusers, nusers))
    for i in range(0, nusers):
      self.userSimMatrix[i, i] = 1.0
    for u1, u2 in programEntities.uniqueUserPairs:
      i = programEntities.userIndex[u1]
      j = programEntities.userIndex[u2]
      if not self.userSimMatrix.has_key((i, j)):
        usim = sim(self.userMatrix.getrow(i).todense(),
          self.userMatrix.getrow(j).todense())
        self.userSimMatrix[i, j] = usim
        self.userSimMatrix[j, i] = usim
    sio.mmwrite("../Models/US_userSimMatrix", self.userSimMatrix)


class UserFriends:
  """
  Returns the friends of the specified user. The idea is
  that (a) people with more friends are more likely to attend
  events and (b) if your friend is going, its more likely for
  you to go as well
  """
  def __init__(self, programEntities):
    nusers = len(programEntities.userIndex.keys())
    self.numFriends = np.zeros((nusers))
    self.userFriends = ss.dok_matrix((nusers, nusers))
    fin = open("../Data/user_friends.csv", 'rb')
    fin.readline()                # skip header
    ln = 0
    for line in fin:
#      if ln % 100 == 0:
#        print "Loading line: ", ln
      cols = line.strip().split(",")
      user = cols[0]
      if programEntities.userIndex.has_key(user):
        friends = cols[1].split(" ")
        i = programEntities.userIndex[user]
        self.numFriends[i] = len(friends)
        for friend in friends:
          if programEntities.userIndex.has_key(friend):
            j = programEntities.userIndex[friend]
            # the objective of this score is to infer the degree to
            # and direction in which this friend will influence the
            # user's decision, so we sum the user/event score for
            # this user across all training events.
            eventsForUser = programEntities.userEventScores.getrow(j).todense()
            score = eventsForUser.sum() / np.shape(eventsForUser)[1]
            self.userFriends[i, j] += score
            self.userFriends[j, i] += score
      ln += 1
    fin.close()
    # normalize the arrays
    sumNumFriends = self.numFriends.sum(axis=0)
    self.numFriends = self.numFriends / sumNumFriends
    sio.mmwrite("../Models/UF_numFriends", np.matrix(self.numFriends))
    self.userFriends = normalize(self.userFriends, norm="l1", axis=0, copy=False)
    sio.mmwrite("../Models/UF_userFriends", self.userFriends)


class Events:
  """
  Builds the event-event similarity matrix and event content-content
  similarity matrix for program events.
  """
  def __init__(self, programEntities, psim=ssd.correlation, csim=ssd.cosine):
    cleaner = DataCleaner()
    fin = open("../Data/events.csv", 'rb')
    fin.readline() # skip header
    nevents = len(programEntities.eventIndex.keys())
    self.eventPropMatrix = ss.dok_matrix((nevents, 7))
    self.eventContMatrix = ss.dok_matrix((nevents, 100))
    ln = 0
    for line in fin.readlines():
#      if ln > 10:
#        break
      cols = line.strip().split(",")
      eventId = cols[0]
      if programEntities.eventIndex.has_key(eventId):
        i = programEntities.eventIndex[eventId]
        self.eventPropMatrix[i, 0] = cleaner.getJoinedYearMonth(cols[2]) # start_time
        self.eventPropMatrix[i, 1] = cleaner.getFeatureHash(cols[3]) # city
        self.eventPropMatrix[i, 2] = cleaner.getFeatureHash(cols[4]) # state
        self.eventPropMatrix[i, 3] = cleaner.getFeatureHash(cols[5]) # zip
        self.eventPropMatrix[i, 4] = cleaner.getFeatureHash(cols[6]) # country
        self.eventPropMatrix[i, 5] = cleaner.getFloatValue(cols[7]) # lat
        self.eventPropMatrix[i, 6] = cleaner.getFloatValue(cols[8]) # lon
        for j in range(9, 109):
          self.eventContMatrix[i, j-9] = cols[j]
        ln += 1
    fin.close()
    self.eventPropMatrix = normalize(self.eventPropMatrix,
        norm="l1", axis=0, copy=False)
    sio.mmwrite("../Models/EV_eventPropMatrix", self.eventPropMatrix)
    self.eventContMatrix = normalize(self.eventContMatrix,
        norm="l1", axis=0, copy=False)
    sio.mmwrite("../Models/EV_eventContMatrix", self.eventContMatrix)
    # calculate similarity between event pairs based on the two matrices    
    self.eventPropSim = ss.dok_matrix((nevents, nevents))
    self.eventContSim = ss.dok_matrix((nevents, nevents))
    for e1, e2 in programEntities.uniqueEventPairs:
      i = programEntities.eventIndex[e1]
      j = programEntities.eventIndex[e2]
      if not self.eventPropSim.has_key((i,j)):
        epsim = psim(self.eventPropMatrix.getrow(i).todense(),
          self.eventPropMatrix.getrow(j).todense())
        self.eventPropSim[i, j] = epsim
        self.eventPropSim[j, i] = epsim
      if not self.eventContSim.has_key((i,j)):
        ecsim = csim(self.eventContMatrix.getrow(i).todense(),
          self.eventContMatrix.getrow(j).todense())
        self.eventContSim[i, j] = epsim
        self.eventContSim[j, i] = epsim
    sio.mmwrite("../Models/EV_eventPropSim", self.eventPropSim)
    sio.mmwrite("../Models/EV_eventContSim", self.eventContSim)


class EventAttendees():
  """
  Measures event popularity by the number of people attended vs not.
  """
  def __init__(self, programEvents):
    nevents = len(programEvents.eventIndex.keys())
    self.eventPopularity = ss.dok_matrix((nevents, 1))
    f = open("../Data/event_attendees.csv", 'rb')
    f.readline() # skip header
    for line in f:
      cols = line.strip().split(",")
      eventId = cols[0]
      if programEvents.eventIndex.has_key(eventId):
        i = programEvents.eventIndex[eventId]
        self.eventPopularity[i, 0] = \
          len(cols[1].split(" ")) - len(cols[4].split(" "))
    f.close()
    self.eventPopularity = normalize(self.eventPopularity, norm="l1",
      axis=0, copy=False)
    sio.mmwrite("../Models/EA_eventPopularity", self.eventPopularity)


def main():
  """
  Generate all the matrices and data structures required for further
  calculations.
  """
  print "calculating program entities..."
  pe = ProgramEntities()
  print "calculating user metrics..."
  Users(pe)
  print "calculating user friend metrics..."
  UserFriends(pe)
  print "calculating event metrics..."
  Events(pe)
  print "calculating event popularity metrics..."
  EventAttendees(pe)

if __name__ == "__main__":
  main()

The next step deserializes these matrices and data structures, and uses them to compute the features described. The code below has functions that take one or more columns from each row and produce one or more values representing the new features.

1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
# Source: RegressionData.py
from __future__ import division

import cPickle
import numpy as np
import scipy.io as sio

class DataRewriter:
  def __init__(self):
    self.userIndex = cPickle.load(open("../Models/PE_userIndex.pkl", 'rb'))
    self.eventIndex = cPickle.load(open("../Models/PE_eventIndex.pkl", 'rb'))
    self.userEventScores = sio.mmread("../Models/PE_userEventScores").todense()
    self.userSimMatrix = sio.mmread("../Models/US_userSimMatrix").todense()
    self.eventPropSim = sio.mmread("../Models/EV_eventPropSim").todense()
    self.eventContSim = sio.mmread("../Models/EV_eventContSim").todense()
    self.numFriends = sio.mmread("../Models/UF_numFriends")
    self.userFriends = sio.mmread("../Models/UF_userFriends").todense()
    self.eventPopularity = sio.mmread("../Models/EA_eventPopularity").todense()
    
  def userReco(self, userId, eventId):
    """
    for item i
      for every other user v that has a preference for i
        compute similarity s between u and v
        incorporate v's preference for i weighted by s into running aversge
    return top items ranked by weighted average
    """
    i = self.userIndex[userId]
    j = self.eventIndex[eventId]
    vs = self.userEventScores[:, j]
    sims = self.userSimMatrix[i, :]
    prod = sims * vs
    try:
      return prod[0, 0] - self.userEventScores[i, j]
    except IndexError:
      return 0

  def eventReco(self, userId, eventId):
    """
    for item i 
      for every item j tht u has a preference for
        compute similarity s between i and j
        add u's preference for j weighted by s to a running average
    return top items, ranked by weighted average
    """
    i = self.userIndex[userId]
    j = self.eventIndex[eventId]
    js = self.userEventScores[i, :]
    psim = self.eventPropSim[:, j]
    csim = self.eventContSim[:, j]
    pprod = js * psim
    cprod = js * csim
    pscore = 0
    cscore = 0
    try:
      pscore = pprod[0, 0] - self.userEventScores[i, j]
    except IndexError:
      pass
    try:
      cscore = cprod[0, 0] - self.userEventScores[i, j]
    except IndexError:
      pass
    return pscore, cscore

  def userPop(self, userId):
    """
    Measures user popularity by number of friends a user has. People
    with more friends tend to be outgoing and are more likely to go
    to events
    """
    if self.userIndex.has_key(userId):
      i = self.userIndex[userId]
      try:
        return self.numFriends[0, i]
      except IndexError:
        return 0
    else:
      return 0

  def friendInfluence(self, userId):
    """
    Measures friends influence by the friends who are known (from the
    training set) to go or not go to an event. The average of scores across
    all friends of the user is the influence score.
    """
    nusers = np.shape(self.userFriends)[1]
    i = self.userIndex[userId]
    return (self.userFriends[i, :].sum(axis=0) / nusers)[0,0]

  def eventPop(self, eventId):
    """
    Measures event popularity by the number attending and not attending.
    """
    i = self.eventIndex[eventId]
    return self.eventPopularity[i, 0]

  def rewriteData(self, start=1, train=True, header=True):
    """
    Create new features based on various recommender scores. This
    is so we can figure out what weights to use for each recommender's
    scores.
    """
    fn = "train.csv" if train else "test.csv"
    fin = open("../Data/" + fn, 'rb')
    fout = open("../NewData/" + fn, 'wb')
    # write output header
    if header:
      ocolnames = ["invited", "user_reco", "evt_p_reco",
        "evt_c_reco", "user_pop", "frnd_infl", "evt_pop"]
      if train:
        ocolnames.append("interested")
        ocolnames.append("not_interested")
      fout.write(",".join(ocolnames) + "\n")
    ln = 0
    for line in fin:
      ln += 1
      if ln < start:
        continue
      cols = line.strip().split(",")
      userId = cols[0]
      eventId = cols[1]
      invited = cols[2]
      print "%s:%d (userId, eventId)=(%s, %s)" % (fn, ln, userId, eventId)
      user_reco = self.userReco(userId, eventId)
      evt_p_reco, evt_c_reco = self.eventReco(userId, eventId)
      user_pop = self.userPop(userId)
      frnd_infl = self.friendInfluence(userId)
      evt_pop = self.eventPop(eventId)
      ocols = [invited, user_reco, evt_p_reco,
        evt_c_reco, user_pop, frnd_infl, evt_pop]
      if train:
        ocols.append(cols[4]) # interested
        ocols.append(cols[5]) # not_interested
      fout.write(",".join(map(lambda x: str(x), ocols)) + "\n")
    fin.close()
    fout.close()

  def rewriteTrainingSet(self):
    self.rewriteData(True)

  def rewriteTestSet(self):
    self.rewriteData(False)

# When running with cython, the actual class will be converted to a .so
# file, and the following code (along with the commented out import below)
# will need to be put into another .py and this should be run.

#import CRegressionData as rd

def main():
  dr = DataRewriter()
  print "rewriting training data..."
  dr.rewriteData(train=True, start=2, header=False)
  print "rewriting test data..."
  dr.rewriteData(train=False, start=2, header=True)
  
    
if __name__ == "__main__":
  main()

These features replace the original train.csv and test.csv files, so now they look like this:

1
2
3
4
train.csv(invited, user_reco, evt_p_reco, evt_c_reco, user_pop, 
          frnd_infl, evt_pop, interested, not_interested)
test.csv(invited, user_reco, evt_p_reco, evt_c_reco, user_pop, 
          frnd_infl, evt_pop)

Incidentally, this was the part that was dog-slow. After suffering through two days during which it wrote out the new train.csv, I finally figured out the cause - the first is that I was using sparse matrices for my reference matrices and having to build up the column and row for the matrix multiplication was taking too much time, so the first step was to replace it with (regular) dense NumPy matrices during deserialization. I also ended up converting it to a shared object using Cython. As a speed comparison, the resulting executable ran through the 10K record test set in about a minute, compared to two days for processing the 15K record training set. I describe the Cython conversion in a separate section below.

Now I train a Stochastic Gradient Descent classifier from Scikits-Learn with my modified training set and build a one-vs-all classifier model to predict the value of the "interested" outcome.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
# Source: RecoWeights.py
from __future__ import division

import math

import numpy as np
import pandas as pd

from sklearn.cross_validation import KFold
from sklearn.linear_model import SGDClassifier

def train():
  """
  Trains a classifier on the entire (modified) training dataset.
  Since our objective is to predict only interested users, we
  only consider the outcome 1=interested and 0=not.
  """
  trainDf = pd.read_csv("../NewData/train.csv")
  X = np.matrix(pd.DataFrame(trainDf, index=None,
    columns=["invited", "user_reco", "evt_p_reco", "evt_c_reco",
    "user_pop", "frnd_infl", "evt_pop"]))
  y = np.array(trainDf.interested)
  clf = SGDClassifier(loss="log", penalty="l2")
  clf.fit(X, y)
  return clf

def validate():
  """
  Runs a 10-fold cross validation on the classifier, reporting
  accuracy.
  """
  trainDf = pd.read_csv("../NewData/train.csv")
  X = np.matrix(pd.DataFrame(trainDf, index=None,
    columns=["invited", "user_reco", "evt_p_reco", "evt_c_reco",
    "user_pop", "frnd_infl", "evt_pop"]))
  y = np.array(trainDf.interested)
  nrows = len(trainDf)
  kfold = KFold(nrows, 10)
  avgAccuracy = 0
  run = 0
  for train, test in kfold:
    Xtrain, Xtest, ytrain, ytest = X[train], X[test], y[train], y[test]
    clf = SGDClassifier(loss="log", penalty="l2")
    clf.fit(Xtrain, ytrain)
    accuracy = 0
    ntest = len(ytest)
    for i in range(0, ntest):
      yt = clf.predict(Xtest[i, :])
      if yt == ytest[i]:
        accuracy += 1
    accuracy = accuracy / ntest
    print "accuracy (run %d): %f" % (run, accuracy)
    avgAccuracy += accuracy
    run += 1
  print "Average accuracy", (avgAccuracy / run)

def test(clf):
  """
  Reads the X values from the dataframe provided, then uses the
  trained classifier to write an array of outcomes.
  """
  origTestDf = pd.read_csv("../Data/test.csv")
  users = origTestDf.user
  events = origTestDf.event
  testDf = pd.read_csv("../NewData/test.csv")
  fout = open("../NewData/result.csv", 'wb')
  fout.write(",".join(["user", "event", "outcome", "dist"]) + "\n")
  nrows = len(testDf)
  Xp = np.matrix(testDf)
  yp = np.zeros((nrows, 2))
  for i in range(0, nrows):
    xp = Xp[i, :]
    yp[i, 0] = clf.predict(xp)
    yp[i, 1] = clf.decision_function(xp)
    fout.write(",".join(map(lambda x: str(x), 
      [users[i], events[i], yp[i, 0], yp[i, 1]])) + "\n")
  fout.close()

def main():
#  validate()
  clf = train()
  test(clf)

if __name__ == "__main__":
  main()

Running a 10-fold cross validation yields an accuracy number of 0.676043972845. Running the classifier against the modified test set yields another temporary file. The user and event columns come from the original test.csv and the outcome and dist are the predicted outcome from the classifier and the distance of the actual point from the predicted hyperplane. So for (user, event) pairs with a predicted outcome of 1, higher values of dist imply a better match.

1
2
3
user,event,outcome,dist
1776192,2877501688,0.0,-1.10395723304
...

The final step is to convert this file to the required output format. This is done using the following code:

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
# Source: ResultFormat.py
from __future__ import division

import pandas as pd

def byDist(x, y):
  return int(y[1] - x[1])

def main():
  # output file
  fout = open("../NewData/final_result.csv", 'wb')
  fout.write(",".join(["User", "Events"]) + "\n")
  resultDf = pd.read_csv("../NewData/result.csv")
  # group remaining user/events
  grouped = resultDf.groupby("user")
  for name, group in grouped:
    user = str(name)
    tuples = zip(list(group.event), list(group.dist), list(group.outcome))
#    tuples = filter(lambda x: x[2]==1, tuples)
    tuples = sorted(tuples, cmp=byDist)
    events = "\"" + str(map(lambda x: x[0], tuples)) + "\""
    fout.write(",".join([user, events]) + "\n")
  fout.close()

if __name__ == "__main__":
  main()

My first attempt was to take both interested and non-interested users and assign them the list of interesting events. This is probably what is expected (because when I put the filter in to remove the interested==0 rows, my score dropped).

All the code in this post is available on my GitHub repository.